(Received: 21-Feb.-2024, Revised: 5-Apr.-2024, Accepted: 24-Apr.-2024)
Text-to-video generation is a challenging task that requires transforming textual descriptions into realistic and coherent videos. The field has made substantial progress in recent years, driven by the development of diffusion models and generative adversarial networks (GANs). This study examines state-of-the-art text-to-video generation models and the main stages of the generation pipeline, including text encoding, video generation and temporal coherence. We also highlight the challenges involved in text-to-video generation and recent advances toward overcoming them. Finally, the datasets and evaluation metrics most frequently used in this field are analyzed and reviewed.
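The pipeline stages surveyed above can be illustrated with a minimal, purely schematic sketch. The functions below are toy stand-ins (not any real model or library): a hash-based text encoder, a generator that produces a handful of key frames from the embedding, and a linear frame-interpolation step standing in for temporal-coherence modeling.

```python
# Schematic sketch of the three stages a text-to-video pipeline chains
# together: text encoding, frame generation, temporal smoothing.
# All functions are illustrative stand-ins, not a real model.

def encode_text(prompt: str, dim: int = 8) -> list[float]:
    """Toy text encoder: hashes tokens into a fixed-size unit-norm embedding."""
    emb = [0.0] * dim
    for tok in prompt.lower().split():
        emb[hash(tok) % dim] += 1.0
    norm = sum(x * x for x in emb) ** 0.5 or 1.0
    return [x / norm for x in emb]

def generate_keyframes(embedding: list[float], n_frames: int = 4) -> list[list[float]]:
    """Toy generator: each 'frame' is the embedding nudged over time."""
    return [[v + 0.1 * t for v in embedding] for t in range(n_frames)]

def interpolate(frames: list[list[float]], factor: int = 2) -> list[list[float]]:
    """Temporal coherence via linear interpolation between key frames."""
    out = []
    for a, b in zip(frames, frames[1:]):
        out.append(a)
        for k in range(1, factor):
            w = k / factor
            out.append([(1 - w) * x + w * y for x, y in zip(a, b)])
    out.append(frames[-1])
    return out

video = interpolate(generate_keyframes(encode_text("a dog running on grass")))
print(len(video))  # 4 key frames -> 7 frames after 2x interpolation
```

Real systems replace each stage with a learned model (e.g. a transformer text encoder, a diffusion-based generator and a learned interpolation network), but the data flow is the same.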

[1] A. Singh, "A Survey of AI Text-to-Image and AI Text-to-Video Generators," arXiv preprint, arXiv: 2311.06329, Nov. 2023.

[2] Z. Xing et al., "A Survey on Video Diffusion Models," arXiv preprint, arXiv: 2310.10647, Oct. 2023.

[3] I. J. Goodfellow et al., "Generative Adversarial Nets," Advances in Neural Information Processing Systems, arXiv: 1406.2661, pp. 2672–2680, 2014.

[4] T. Unterthiner et al., "Towards Accurate Generative Models of Video: A New Metric & Challenges," arXiv preprint, arXiv: 1812.01717, 2018.

[5] L. Khachatryan et al., "Text2Video-Zero: Text-to-image Diffusion Models are Zero-shot Video Generators," arXiv preprint, arXiv: 2303.13439, Mar. 2023.

[6] J. Mullan et al., "Hotshot-XL," [Online], Available: , Oct. 2023.

[7] J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan and S. Ganguli, "Deep Unsupervised Learning Using Nonequilibrium Thermodynamics," arXiv preprint, arXiv: 1503.03585, 2015.

[8] Y. Song and S. Ermon, "Generative Modeling by Estimating Gradients of the Data Distribution," arXiv preprint, arXiv: 1907.05600, Jul. 2019.

[9] U. Singer et al., "Make-a-video: Text-to-video Generation without Text-video Data," arXiv preprint, arXiv: 2209.14792, 2022.

[10] J. Ho et al., "Imagen Video: High Definition Video Generation with Diffusion Models," arXiv preprint, arXiv: 2210.02303, 2022.

[11] C. Saharia et al., "Photorealistic Text-to-image Diffusion Models with Deep Language Understanding," Proc. of the 36th Conf. on Neural Information Processing Systems (NeurIPS 2022), arXiv: 2205.11487, 2022.

[12] D. J. Zhang et al., "Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-video Generation," arXiv preprint, arXiv: 2309.15818, Sep. 2023.

[13] D. Zhou et al., "MagicVideo: Efficient Video Generation with Latent Diffusion Models," arXiv preprint, arXiv: 2211.11018, Nov. 2022.

[14] J. An et al., "Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-video Generation," arXiv preprint, arXiv: 2304.08477, Apr. 2023.

[15] J. Wang et al., "ModelScope Text-to-video Technical Report," arXiv preprint, arXiv: 2308.06571, Aug. 2023.

[16] Z. Luo et al., "VideoFusion: Decomposed Diffusion Models for High-quality Video Generation," Proc. of the 2023 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 10209-10218, 2023.

[17] S. Hong, J. Seo, S. Hong, H. Shin and S. Kim, "Large Language Models Are Frame-level Directors for Zero-shot Text-to-video Generation," arXiv preprint, arXiv: 2305.14330, May 2023.

[18] H. Huang, Y. Feng, C. Shi, L. Xu, J. Yu and S. Yang, "Free-Bloom: Zero-shot Text-to-video Generator with LLM Director and LDM Animator," arXiv preprint, arXiv: 2309.14494, Sep. 2023.

[19] A. Sauer, T. Karras, S. Laine, A. Geiger and T. Aila, "StyleGAN-T: Unlocking the Power of GANs for Fast Large-scale Text-to-image Synthesis," arXiv preprint, arXiv: 2301.09515, 2023.

[20] J. Hessel, A. Holtzman, M. Forbes, R. Le Bras and Y. Choi, "CLIPScore: A Reference-free Evaluation Metric for Image Captioning," arXiv preprint, arXiv: 2104.08718, 2022.

[21] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu and M. Chen, "Hierarchical Text-conditional Image Generation with CLIP Latents," arXiv preprint, arXiv: 2204.06125, 2022.

[22] Y. Tewel, R. Gal, G. Chechik and Y. Atzmon, "Key-locked Rank One Editing for Text-to-image Personalization," arXiv preprint, arXiv: 2305.01644, 2023.

[23] C. Vondrick, H. Pirsiavash and A. Torralba, "Generating Videos with Scene Dynamics," arXiv preprint, arXiv: 1609.02612, 2016.

[24] S. Tulyakov, M.-Y. Liu, X. Yang and J. Kautz, "MoCoGAN: Decomposing Motion and Content for Video Generation," Proc. of the 2018 IEEE/CVF Conf. on Computer Vision and Pattern Recognition, DOI: 10.1109/CVPR.2018.00165, Salt Lake City, USA, 2018.

[25] X. Sun, H. Xu and K. Saenko, "TwoStreamVAN: Improving Motion Modeling in Video Generation," arXiv preprint, arXiv: 1812.01037, 2020.

[26] K. Kim et al., "Continuous-time Video Generation via Learning Motion Dynamics with Neural ODE," arXiv preprint, arXiv: 2112.10960, 2021.

[27] H. Fei, S. Wu, W. Ji, H. Zhang and T.-S. Chua, "Dysen-VDM: Empowering Dynamics-aware Text-to-video Diffusion with Large Language Models," arXiv preprint, arXiv: 2308.13812, 2023.

[28] J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi and D. J. Fleet, "Video Diffusion Models," arXiv preprint, arXiv: 2204.03458, 2022.

[29] A. Vaswani et al., "Attention Is All You Need," arXiv preprint, arXiv: 1706.03762, 2017.

[30] S. Niklaus, L. Mai and F. Liu, "Video Frame Interpolation via Adaptive Convolution," arXiv preprint, arXiv: 1703.07514, 2017.

[31] X. Cheng and Z. Chen, "Video Frame Interpolation via Deformable Separable Convolution," Proc. of the AAAI Conf. on Artificial Intelligence, vol. 34, pp. 10607–10614, 2020.

[32] X. Cheng and Z. Chen, "Multiple Video Frame Interpolation via Enhanced Deformable Separable Convolution," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 44, no. 10, pp. 7029–7045, 2021.

[33] D. Mahajan, F.-C. Huang, W. Matusik, R. Ramamoorthi and P. Belhumeur, "Moving Gradients: A Path-based Method for Plausible Image Interpolation," ACM Trans. on Graphics, vol. 28, no. 3, Article no. 42, pp. 1–11, 2009.

[34] B. Yan and Y. Chen, "Low Complexity Image Interpolation Method Based on Path Selection," Journal of Visual Communication and Image Representation, vol. 24, pp. 661–668, 2013.

[35] Y. Fan, N. Yoda, T. Igarashi and H. Ma, "Path-based Image Sequence Interpolation Guided by Feature Points," Proc. of the 2016 IEEE Int. Conf. on Image Processing (ICIP), DOI: 10.1109/ICIP.2016.7532421, Phoenix, USA, 2016.

[36] T. Jayashankar, P. Moulin, T. Blu and C. Gilliam, "Lap-based Video Frame Interpolation," Proc. of the 2019 IEEE International Conference on Image Processing (ICIP), DOI: 10.1109/ICIP.2019.8803484, Taipei, Taiwan, 2019.

[37] J. van Amersfoort et al., "Frame Interpolation with Multi-scale Deep Loss Functions and Generative Adversarial Networks," arXiv preprint, arXiv: 1711.06045, 2019.

[38] S. Wen et al., "Generating Realistic Videos from Keyframes with Concatenated GANs," IEEE Trans. on Circuits and Systems for Video Tech., vol. 29, pp. 2337–2348, 2019.

[39] J. Xiao and X. Bi, "Multi-scale Attention Generative Adversarial Networks for Video Frame Interpolation," IEEE Access, vol. 8, pp. 94842–94851, 2020.

[40] P. Didyk, P. Sitthi-Amorn, W. Freeman, F. Durand and W. Matusik, "Joint View Expansion and Filtering for Automultiscopic 3D Displays," ACM Trans. on Graphics, vol. 32, no. 6, Article no. 221, pp. 1–8, 2013.

[41] S. Meyer, O. Wang, H. Zimmer, M. Grosse and A. Sorkine-Hornung, "Phase-based Frame Interpolation for Video," Proc. of the 2015 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), DOI: 10.1109/CVPR.2015.7298747, Boston, USA, 2015.

[42] S. Meyer, A. Djelouah, B. McWilliams, A. Sorkine-Hornung, M. Gross and C. Schroers, "Phasenet for Video Frame Interpolation," arXiv preprint, arXiv: 1804.00884, 2018.

[43] H. E. Ahn, J. Jeong, J. W. Kim, S. Kwon and J. Yoo, "A Fast 4K Video Frame Interpolation Using a Multi-scale Optical Flow Reconstruction Network," Symmetry, vol. 11, no. 10, Article no. 1251, 2019.

[44] S. Y. Kim, J. Oh and M. Kim, "FISR: Deep Joint Frame Interpolation and Super-resolution with a Multi-scale Temporal Loss," Proc. of the 34th AAAI Conf. on Artificial Intelligence (AAAI-20), pp. 11278–11286, 2020.

[45] W. Bao, W.-S. Lai, X. Zhang, Z. Gao and M.-H. Yang, "MEMC-Net: Motion Estimation and Motion Compensation Driven Neural Network for Video Interpolation and Enhancement," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 43, no. 3, 2018.

[46] W. Bao, W.-S. Lai, C. Ma, X. Zhang, Z. Gao and M.-H. Yang, "Depth-aware Video Frame Interpolation," arXiv preprint, arXiv: 1904.00830, 2019.

[47] M. Choi, J. Choi, S. Baik, T. H. Kim and K. M. Lee, "Scene-adaptive Video Frame Interpolation via Meta-learning," arXiv preprint, arXiv: 2004.00779, pp. 9444–9453, 2020.

[48] M. Choi, H. Kim, B. Han, N. Xu and K. M. Lee, "Channel Attention Is All You Need for Video Frame Interpolation," Proc. of the 34th AAAI Conf. on Artificial Intelligence (AAAI-20), pp. 10663–10671, 2020.

[49] K. Soomro, A. R. Zamir and M. Shah, "UCF101: A Dataset of 101 Human Actions Classes from Videos in the Wild," Proc. of the 1st Int. Workshop on Action Recognition with Large Number of Classes, arXiv: 1212.0402, 2012.

[50] J. Xu, T. Mei, T. Yao and Y. Rui, "MSR-VTT: A Large Video Description Dataset for Bridging Video and Language," Proc. of the 2016 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), DOI: 10.1109/CVPR.2016.571, Las Vegas, USA, 2016.

[51] M. Bain, A. Nagrani, G. Varol and A. Zisserman, "Frozen in Time: A Joint Video and Image Encoder for End-to-end Retrieval," arXiv preprint, arXiv: 2104.00650, 2021.

[52] H. Xue et al., "Advancing High-resolution Video-language Representation with Large-scale Video Transcriptions," Proc. of the 2022 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), DOI: 10.1109/CVPR52688.2022.00498, pp. 5026-5035, 2022.

[53] J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier and A. Zisserman, "A Short Note about Kinetics-600," arXiv preprint, arXiv: 1808.01340, 2018.

[54] F. Ebert, C. Finn, A. X. Lee and S. Levine, "Self-supervised Visual Planning with Temporal Skip Connections," Proc. of the 1st Conf. on Robot Learning (CoRL 2017), Mountain View, USA, pp. 1-13, 2017.

[55] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler and S. Hochreiter, "GANs Trained by a Two Time-scale Update Rule Converge to a Local Nash Equilibrium," Proc. of the 31st Conf. on Neural Information Processing Systems (NIPS 2017), pp. 1-12, Long Beach, USA, 2017.

[56] GitHub, "google-research," [Online], Available:

[57] I. Skorokhodov, S. Tulyakov and M. Elhoseiny, "StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2," Proc. of the IEEE CVPR 2022, pp. 3626-3636, 2021.