About

Hi, I am currently a graduate student in the CVNext Lab at Zhejiang University, advised by Gaoang Wang. I obtained my B.S. degree in Software Engineering from Dalian University of Technology. Previously, I was a research intern at Microsoft Research Asia. My research focuses primarily on large multimodal models (LMMs) for video understanding and on generative models.

Education

M.S. Sep. 2023 - Mar. 2026 (expected) Zhejiang University, Hangzhou, China. Master's in Artificial Intelligence
B.S. Sep. 2019 - Jun. 2023 Dalian University of Technology (DLUT), Dalian, China. B.S. in Software Engineering

Experiences

Nov. 2023 - May 2024, Media Computing Group, Microsoft Research Lab - Asia, Beijing, China
Research Intern, worked on text-to-image generation

News

We submitted several baselines (AuroraCap, MovieChat, MovieChat-Onevision) and benchmarks (VDC, MovieChat-1K) to lmms-eval (LLaVA team), which enables quick evaluation with just a single line of code.
We released a new version of MovieChat, which uses LLaVA-OneVision as the base model instead of the original VideoLLaMA. The new version is available on GitHub.
Our paper Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis is released, and the code is available on the website.
Our paper AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark is released, and the code is available on the website.
We finished the CVPR 2024 Long-form Video Understanding Challenge @ LOVEU and presented a summary of the competition results in person.
I finished my research internship at Microsoft Research Asia (MSRA), Beijing.
We are hosting the CVPR 2024 Long-form Video Understanding Challenge @ LOVEU.
Our paper MovieChat+: Question-aware Sparse Memory for Long Video Question Answering is released, and the code is available on the website.
Our paper MovieChat: From Dense Token to Sparse Memory for Long Video Understanding has been accepted to the Conference on Computer Vision and Pattern Recognition (CVPR) 2024.

Selected Publications and Manuscripts

* Equal contribution. † Project lead. ‡ Corresponding author.

Also see Google Scholar.

AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

Wenhao Chai*†, Enxin Song*, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Saining Xie, Christopher D. Manning

Preprint, 2024

[Paper] [Website] [Model] [Benchmark] [Code]

AuroraCap is a multimodal LLM designed for image and video detailed captioning. We also release VDC, the first benchmark for detailed video captioning.

Fantasy: Transformer Meets Transformer in Text-to-Image Generation

Enxin Song*, Wenhao Chai*, Xun Guo, Gaoang Wang, Jenq-Neng Hwang, Yan Lu

Preprint, 2024

[Paper]

Fantasy is an efficient text-to-image generation model that combines decoder-only large language models (LLMs) with transformer-based masked image modeling (MIM).

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Enxin Song*, Wenhao Chai*, Guanhong Wang*, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, Gaoang Wang

CVPR, 2024

[Paper] [Website] [Benchmark] [Code]

MovieChat achieves state-of-the-art performance in extra-long video (more than 10K frames) understanding by introducing a memory mechanism.

Professional Service

Conference and Journal Refereeing: ICLR 2025, TMM 2024, PRCV 2023
Workshop Organization: Workshop on Long-form Video Understanding at CVPR 2024

Teaching Assistant

ZJUI Senior Design - Spring 2024
Teaching Assistant (TA), with Prof. Gaoang Wang

Selected Honors & Awards

National Scholarship, 2024 (Zhejiang University)
National Scholarship, 2021 (Dalian University of Technology)
