"Foundation models in the visual domain with deeper understanding capabilities (potentially bringing a 'GPT-3 moment' for vision) are expected to emerge within the next 1-2 years."
Interview | Tang Xiaoyin, Executive Editor of CSDN & The New Programmer
Guest | Duan Nan, Step Ahead Tech Fellow
Editor | Zhang Hongyue
Produced by | AI Technology Base Camp (ID: rgznai100)
In this wave of AI-driven visual content innovation, Duan Nan, Tech Fellow at Step Ahead and former Senior Researcher at Microsoft Research Asia, stands at the forefront of exploration. His team open-sourced two important video generation models in February and March this year: the 30B parameter text-to-video model Step-Video-T2V, and the 30B parameter image-to-video model Step-Video-TI2V trained on it, which have garnered widespread attention in the field of AI video generation.
Duan Nan clearly points out that although current video generation technologies (such as Diffusion models) can produce impressive visual segments, we may be touching the 'ceiling' of their capabilities. A truly revolutionary breakthrough in video and even multimodal foundation models with deep understanding capabilities is still in its gestation period.
Duan Nan, Tech Fellow at Step Ahead, leads a research team building language- and video-centric multimodal foundation models. Previously, he was a Principal Senior Researcher and Research Manager of the Natural Language Computing Team at Microsoft Research Asia (2012-2024). Dr. Duan is an adjunct doctoral supervisor at the University of Science and Technology of China and Xi'an Jiaotong University, and an adjunct professor at Tianjin University. His research focuses on natural language processing, code intelligence, multimodal foundation models, agents, and more.
At the 2025 Global Machine Learning Technology Conference (ML-Summit) held on April 18-19, Duan Nan delivered a keynote speech on "Progress, Challenges, and Future of Video Generation Foundation Models" and accepted an in-depth live interview with CSDN afterwards.
Duan Nan predicted that foundation models in the visual domain with deeper understanding capabilities (potentially bringing a "GPT-3 moment" for vision) are expected to emerge within the next 1-2 years.
What is behind this prediction? In this wide-ranging conversation, Duan Nan shared several core insights on the future of video generation and multimodal AI:
Uniqueness of Video Scaling Law: Unlike language models, current Diffusion video models (even at 30B parameters) show no significant Scaling Law gains in generalization capability, although their memorization capability is strong. Medium-sized models (e.g., 15B) may strike a better balance between efficiency and performance.
Beyond "Generation" to "Understanding": Current mainstream video generation resembles "text-to-visual translation," which has limits. A real breakthrough requires models with deep visual understanding capabilities, not just pixel generation. This calls for a shift in the learning paradigm, from "mapping learning" to "causal prediction learning" similar to language models.
AR and Diffusion Fusion: The future model architecture trend may be a fusion of Autoregressive and Diffusion models, aiming to combine their advantages to better serve the understanding and generation of video and multimodal content.
Data is Still the Cornerstone and Bottleneck: High-quality, large-scale, diverse natural data (rather than over-reliance on synthetic data for basic training) is crucial for building powerful foundation models. The complexity and cost of data processing and annotation are huge challenges.
The "Few-Shot Learning" Moment for Vision: The key capability of the next generation of visual foundation models will be strong Few-Shot Learning ability, enabling them to quickly adapt to and solve new visual tasks, similar to the transformation GPT-3 brought to NLP.
Usability and Influence are Equally Important: Technological innovation is important, but the ease of use of the model and whether it can be practically used by a wide range of developers and creators are key metrics of its influence, and also goals research needs to consider.
The Future of AI and Embodied Intelligence: Advances in video understanding capabilities will provide core perception capabilities for AI applications that need to interact with the physical world, such as embodied intelligence and robotics.
This interview will take you deep into cutting-edge thinking, technical bottlenecks, and future blueprints in the field of video generation and even multimodal AI. Whether you are an AI researcher, developer, or observer curious about future technology, you will gain profound insights from it.
Below is the full interview with Mr. Duan Nan (the text has been lightly edited for readability):
CSDN: We are joined by the long-awaited Mr. Duan Nan, who is now serving as Tech Fellow at Step Ahead. Mr. Duan, please say hello to everyone and introduce yourself.
Duan Nan: Hello everyone, my name is Duan Nan. I am currently working at Step Ahead, mainly responsible for video generation related projects. Before this, I worked at Microsoft Research Asia for more than ten years, focusing on natural language processing research. Today, I am very honored to communicate with you in this live format, which is a first for me.
CSDN: Is this your first time participating in a live broadcast?
Duan Nan: Yes, it really is the first time.
CSDN: Then we are very honored that Mr. Duan's first live broadcast is dedicated to the CSDN live stream.
Duan Nan: It's my honor.
CSDN: I noticed that your title at Step Ahead is "Tech Fellow", which is relatively uncommon at startups and more often seen at foreign companies. Could you explain the thinking behind this title?
Duan Nan: You don't need to pay too much attention to the form of the title. I am essentially still a researcher, continuing to delve into areas I am interested in, just on a different work platform.
CSDN: Mr. Duan gave a presentation on "Progress, Challenges, and Future of Video Generation Foundation Models" at the Global Machine Learning Technology Conference, which is also his latest work prepared with extra effort. Could you briefly introduce the core content of the speech, especially the key points you hope everyone will pay attention to?
Duan Nan: Today's report is an interim summary of the projects I've been working on at Step Ahead over the past year. When I was at Microsoft Research Asia, my research interests gradually shifted from natural language processing, multilingualism, and code intelligence to multimodal AI. At Step Ahead, I combined my previous exploration in video generation with the company's needs and built the effort from scratch.
The report mainly introduced the two models we open-sourced in February and March: the 30B parameter text-to-video model Step-Video-T2V, and the 30B parameter image-to-video model Step-Video-TI2V trained based on it. This report is relatively conventional, mainly reviewing all aspects of the current SOTA (State-of-the-Art) models in this direction, including model architecture design, data processing flow, training efficiency optimization, etc.
Through developing models from 4B to 30B, I realized that the current paradigm of AIGC-oriented video generation models may have limitations. The end of the report also briefly mentioned some thoughts and plans for the future.
CSDN: You mentioned that the report was conventional and didn't overly highlight technical innovations in research. Could you share some of the technological innovations in the field of AI in the past five years that you consider milestones?
Duan Nan: By my standards, significant innovations in the AI field over the past five years include:
BERT Model: It greatly enhanced the representation capabilities of natural language. After that, the NLP field formed a tripartite situation: encoder (like BERT), encoder-decoder (like T5), and pure decoder (like GPT).
GPT-3 Model: The few-shot learning capability demonstrated when data and parameter scale reached a certain level was a milestone, basically establishing the direction of model architecture.
InstructGPT/ChatGPT: Through instruction alignment and reinforcement learning (RLHF), models were able to follow instructions extremely well. This is another significant milestone, basically laying the foundation for the NLP paradigm.
DeepSeek Series Models: Domestically, DeepSeek has produced a series of excellent models (such as Math, Code, V series, and R1). They not only have excellent performance but also are practical for widespread use, which is remarkable.
Sora Model: In the field of multimodal generation, the appearance of Sora truly made video generation a focus.
GPT-4o/Gemini 2.5: These types of models truly pushed the unified understanding of images and text to a new height, which is very crucial.
CSDN: You believe that the current work is still some distance from the effect brought by Sora and others, but building a solid foundation is a prerequisite for moving in that direction. Could you share some of the pitfalls you encountered and lessons learned in infrastructure building (Infra) to provide some reference for other teams?
Duan Nan: Besides the efforts of our team members, this project also received strong support from the company's data and system teams. I'll share some experiences from three aspects: model, data, and system:
Model Level
Full Attention: In the early stages, we tried a structure that separated spatial and temporal aspects and then stacked them. Later, we found that the Full Attention mechanism allows for sufficient interaction of information within the model, greatly improving motion range. This is now a consensus.
Architecture Selection (DIT + Cross Attention vs MMDIT): We chose DIT plus Cross Attention, and similar architectures are used by Meta's Movie Gen and Alibaba's Wanxiang (Wan). Some closed-source models or large companies may prefer MMDIT (integrating text and visual information earlier). Theoretically, MMDIT might be better for instruction control, but we chose the former partly with an eye to the model's compatibility with a future evolution towards visual foundation models. Neither is strictly the optimal choice; each has pros and cons.
Model Size (30B): We chose 30B to explore the relationship between model size and quality. The conclusion is that, in the 4B to 30B range, the Scaling Law of Diffusion models does not bring as significant an improvement in generalization ability as it does for language models, although their memorization ability is very strong. To balance efficiency and performance, around 15B might be a good choice. If the goal is to explore AGI or the upper limit of models, and resources are sufficient, further tuning or trying larger models is possible.
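The contrast drawn above between stacked spatial/temporal attention and Full Attention can be illustrated by counting which token pairs are allowed to interact directly within a single layer. The following is only a minimal sketch under stated assumptions: the grid size, the `(frame, position)` token layout, and all function names are hypothetical and are not taken from the Step-Video code.

```python
from itertools import product

def token_grid(num_frames, tokens_per_frame):
    """Enumerate video tokens as (frame, position) pairs (hypothetical layout)."""
    return list(product(range(num_frames), range(tokens_per_frame)))

def spatial_pairs(tokens):
    """Pairs that can attend to each other in a spatial-only layer (same frame)."""
    return {(a, b) for a in tokens for b in tokens if a[0] == b[0]}

def temporal_pairs(tokens):
    """Pairs that can attend to each other in a temporal-only layer (same position)."""
    return {(a, b) for a in tokens for b in tokens if a[1] == b[1]}

def full_pairs(tokens):
    """Pairs that can attend to each other under full attention (every pair)."""
    return {(a, b) for a in tokens for b in tokens}

# Toy sizes chosen only for illustration.
tokens = token_grid(num_frames=4, tokens_per_frame=6)
factorized = spatial_pairs(tokens) | temporal_pairs(tokens)
full = full_pairs(tokens)

print("direct pairs, spatial + temporal layers:", len(factorized))
print("direct pairs, full attention:", len(full))
```

In this toy setting the factorized layers cover only a subset of the pairs that full attention covers directly; a token can still reach a token at a different position in a different frame, but only indirectly through two stacked layers. The price of full attention is that the number of pairs grows with the square of the total token count.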
Data Level
Data processing is crucial. This includes video segmentation, watermark/subtitle processing, content description, aesthetic score, motion score, clarity score, camera shake, camera language annotation, etc., all requiring huge effort and hands-on work.
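As a rough illustration of how per-clip annotations like these might be used, the sketch below filters clips by score thresholds. Everything in it is an assumption made for illustration: the field names, the 0 to 1 score ranges, the threshold values, and the choice to drop (rather than crop or repair) clips with watermarks or subtitles are hypothetical and do not describe the team's actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class ClipMeta:
    """Hypothetical per-clip annotations; scores assumed to lie in [0, 1]."""
    clip_id: str
    aesthetic_score: float
    motion_score: float
    clarity_score: float
    camera_shake: float
    has_watermark: bool
    has_subtitles: bool

# Illustrative thresholds only; real values would be tuned on the data.
MIN_AESTHETIC = 0.5
MIN_MOTION = 0.2
MIN_CLARITY = 0.6
MAX_CAMERA_SHAKE = 0.3

def keep_clip(clip):
    """Return True if the clip passes every quality check."""
    if clip.has_watermark or clip.has_subtitles:
        return False  # simplification: drop instead of cropping or inpainting
    return (
        clip.aesthetic_score >= MIN_AESTHETIC
        and clip.motion_score >= MIN_MOTION
        and clip.clarity_score >= MIN_CLARITY
        and clip.camera_shake <= MAX_CAMERA_SHAKE
    )

def filter_clips(clips):
    """Keep only the clips that pass keep_clip."""
    return [clip for clip in clips if keep_clip(clip)]

clips = [
    ClipMeta("a", 0.8, 0.5, 0.9, 0.1, False, False),
    ClipMeta("b", 0.9, 0.6, 0.9, 0.1, True, False),    # watermark
    ClipMeta("c", 0.7, 0.05, 0.8, 0.1, False, False),  # almost static
    ClipMeta("d", 0.6, 0.4, 0.7, 0.6, False, False),   # too shaky
]
kept_ids = [clip.clip_id for clip in filter_clips(clips)]
print(kept_ids)
```

Producing the scores themselves (aesthetic, motion, clarity, camera shake) is the expensive part that the answer above refers to; the filter is the easy last step.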
System Level
Strong support from the system team is crucial. I would also like to thank Step Ahead's system team here; they are very strong, and I learned a lot from them. Their support for the project was vital.
CSDN: In the practice of multimodal models, if you had to choose the most difficult and crucial step, without which the entire model project cannot proceed, what would it be?
Duan Nan: That depends on the preconditions. If resources are sufficient, data is the most difficult. If resources are relatively limited, then both data and systems become very difficult. As for the model algorithm itself, unless you are specifically aiming at the next generation or at novelty, the model architecture for most topics in the mainstream AI field is relatively clear. On top of these architectures, there are many details in training, tuning, and inference. For projects with relatively high certainty, systems and data may matter more than the algorithm itself.
CSDN: You mentioned that you initially questioned the effect of the 30B parameter model, but after practice, you felt that medium-sized parameters might be sufficient. Will you continue to explore larger parameter models in the future?
Duan Nan: Yes, but there's a precondition. I said medium-sized models are OK because at Step Ahead, we need to consider application-level challenges, which is the balance between efficiency and quality.
But from another perspective, I believe there is an upper limit to this generation of Diffusion models. To move forward, video models need to follow physical laws more strongly and not just do generation. Successful models in the NLP field gained stronger understanding capabilities through generation; generation is just a way to display results. The video domain should also be like this, enabling visual models to have stronger visual understanding capabilities through a similar paradigm. This capability in NLP might require parameters of tens of billions or more to exhibit in-context learning.
Current video generation models are trained on "text description -> visual video," which is similar to machine translation a decade ago. Successful NLP models learn causal and contextual relationships in information by predicting the next token.
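The difference between the two training setups can be sketched by looking at what a single training example looks like in each case. This is a minimal illustration, not the team's method: the function names are hypothetical, and "frame" stands in for whatever visual unit a future model would actually predict, which the interview does not specify.

```python
def mapping_examples(caption, frames):
    """Mapping learning: one pair, the whole video conditioned on its caption."""
    return [(caption, tuple(frames))]

def causal_examples(sequence):
    """Causal prediction: every prefix is used to predict the element that follows."""
    return [
        (tuple(sequence[:i]), sequence[i])
        for i in range(1, len(sequence))
    ]

frames = ["f0", "f1", "f2", "f3"]

mapped = mapping_examples("a cat jumps onto a table", frames)
causal = causal_examples(frames)

print(len(mapped), "mapping example")
print(len(causal), "causal examples")
for context, target in causal:
    print(context, "->", target)
```

In the mapping setup the model is only ever asked to go from a description to a finished clip, while in the causal setup the same clip supplies several examples, each of which requires predicting what comes next from what has already happened.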
Therefore, regarding model size, I explore larger models and chose the DIT+Cross Attention structure because I believe video has the opportunity to become, like large language models, a model that unifies understanding and generation in the visual domain and can integrate seamlessly with language. This is the direction our team is currently exploring.
CSDN: You just mentioned the challenges video generation will face in the next one to two years and your thoughts on the next generation of models. What exploration progress in these directions from industry and academia do you think is worth paying attention to? Or, what solutions have you observed? And what about the Scaling Law issue you mentioned?
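For readers unfamiliar with the two conditioning styles named in this interview, the sketch below contrasts them in a deliberately stripped-down form. It is an assumption-laden toy: it uses plain Python lists, omits learned projections, residual connections, normalization, and MLP layers, and the function names are hypothetical rather than taken from any released model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention on lists of equal-length vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs

def cross_attention_block(video_tokens, text_tokens):
    """DIT + Cross Attention style: video self-attends, then reads the text.
    The text tokens are used only as keys/values and are returned unchanged."""
    video = attend(video_tokens, video_tokens, video_tokens)
    video = attend(video, text_tokens, text_tokens)
    return video, text_tokens

def joint_attention_block(video_tokens, text_tokens):
    """MMDIT style: one attention over the concatenated sequence,
    so both video and text tokens are updated."""
    joint = video_tokens + text_tokens
    mixed = attend(joint, joint, joint)
    return mixed[: len(video_tokens)], mixed[len(video_tokens):]

video = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 1.0]]

cross_video, cross_text = cross_attention_block(video, text)
joint_video, joint_text = joint_attention_block(video, text)
print("text after cross-attention block:", cross_text)
print("text after joint-attention block:", joint_text)
```

The point of the toy is only the data flow: in the first block the text representation stays fixed while the video stream reads from it, whereas in the second block text and video tokens mix from the start, which matches the description of MMDIT as integrating text and visual information earlier.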
Duan Nan: In terms of unified multimodal understanding and generation models, one major direction currently is the fusion of Autoregressive and Diffusion. We tried converting visual signals to discrete tokens at Microsoft earlier, but found that it significantly degraded generation quality. Therefore, using continuous representations for visual understanding and generation is a relatively correct direction.
Currently, Diffusion is still SOTA in pure visual generation, but successful NLP models are mostly Autoregressive. The direction I personally favor is: the fusion of Autoregressive and Diffusion.
Integrating video into this framework brings new challenges. Generating one image frame doesn't accumulate errors much; but with videos lasting hundreds or even thousands of frames, pure AR methods will have serious error accumulation.
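A back-of-the-envelope model shows why length matters so much here. The sketch assumes, purely for illustration, that every autoregressive step independently retains a fixed fraction of the previous step's fidelity; real error behavior is more complicated, and the per-step error value is made up.

```python
def fidelity_after(num_steps, per_step_error):
    """Toy model: each autoregressive step keeps (1 - per_step_error)
    of the fidelity of the step it is conditioned on."""
    return (1.0 - per_step_error) ** num_steps

PER_STEP_ERROR = 0.001  # hypothetical value, chosen only for illustration

for num_frames in (1, 100, 1000):
    remaining = fidelity_after(num_frames, PER_STEP_ERROR)
    print(f"{num_frames:>5} frames -> remaining fidelity {remaining:.3f}")
```

Under this assumption an error that is negligible for a single image compounds into a large loss over hundreds or thousands of frames, which is the intuition behind the concern about pure AR methods for long videos.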
AR models predict token by token, which is extremely inefficient, especially for video. Sparse mechanisms in NLP (MoE, MRA, etc.) may be applied to visual generation and understanding models in the future.
Ensuring consistency, motion laws, and training/inference efficiency for long videos are all huge challenges.
CSDN: When I use video generation tools, I often feel that the generation speed is slow and waiting time is long. Although it's much faster than manual video production, how to further improve speed and quality while extending the generation duration should be a core problem for you, right?
Duan Nan: Yes. It is just like the development of translation technology, which went from being mastered by a few to being available to everyone. Video generation is undergoing a similar process, lowering the threshold for content creation. How to enable creators to obtain high-quality results faster and at lower cost is the direction we need to work towards. I believe what happened in the field of language models will also happen in the visual field, and the next generation of large models will be able to better support high-quality content creation in the future.
The core is inference speed and quality assurance. Currently, some good generation examples are more like the model having seen similar distributions of content in the training data, forming a "subconscious" reaction.
CSDN: You mentioned the two open-sourced Step-Video models earlier. Could you introduce their effects? And what kind of feedback have you received from the community, academia, or industry after open-sourcing them?
Duan Nan: Our two models have their own characteristics:
Text-to-Video Model Step-Video-T2V (30B): It enhanced video motion, mainly through data and training strategies. It performs well in sports movements and adherence to physical laws. When released in late January/early February this year, compared to mainstream models at home and abroad, it should be SOTA among open-source models and has distinctive features in some dimensions.
Image-to-Video Model Step-Video-TI2V (30B): Since it was trained on a large amount of anime data in the early stages, the quality in this style is very good. We have also compared it with products like Wondershare.
CSDN: How large is your team currently? Does it include all parts: model, data, and system?
Duan Nan: Including interns, it's about a dozen people. There were fewer people when working on this project. The data and system parts are supported by colleagues from other teams.
CSDN: What is the main feedback from the community?
Duan Nan: The biggest feedback is that the model is too large (30B) for average AIGC creators to handle.
This indeed gave me an insight: a comprehensive and usable model has a larger download volume in the application community than a model that pursues the upper limit. Models should not only pursue the upper limit but also consider usability, making them accessible to developers and creators. This is something I didn't consider much before, as I was more concerned with the model's upper limit and ultimate capabilities, which relates to whether a next-generation model is needed.
CSDN: So in the future, will you explore the upper limit upwards and also consider usability downwards? Will you work on both large and small models?
Duan Nan: Yes, large models need corresponding small models. This is a compromise between the upper limit and applications. Moreover, the achievements of large models are crucial for improving the quality of small models, which will also happen in the video field.
However, from my personal perspective, I will focus more on the next generation of model architecture for video understanding and generation, and multimodal understanding and generation. I may explore the architecture on small models first, verify it, and then consider scaling up.
CSDN: You summarized six major challenges in your speech. How do these differ from the challenges in video understanding you just mentioned?
Duan Nan: If focusing on AIGC, pursuing efficiency, controllability, editability, and high-quality data is particularly important. This is about building better models based on the current foundation, requiring continuous refinement of data and model modules (VAE, Encoder, DIT, post-training SFT/RLHF/DPO, etc.).
But from the overall AI perspective, visual foundation models need stronger understanding capabilities, which requires a change in the learning paradigm. I don't think the Diffusion learning method is likely to learn general understanding capabilities; it needs to do autoregressive prediction learning like NLP.
Once this paradigm shift occurs, issues like efficiency and alignment might be set aside for a while. I believe foundation models must be driven by real data, not by fake or synthetic data. Therefore, we need to focus more on data selection for foundation models (naturally accumulated massive data) and on the learning paradigm (borrowing from language models but adapting to vision). Visual representation, generation methods (not necessarily predicting tokens), how to evaluate visual understanding capabilities, etc., are all huge challenges. The visual field may be at a stage comparable to NLP after BERT and before GPT-3, and will then go through a process similar to GPT-3 to ChatGPT.
CSDN: If synthetic data cannot be used to train foundation models, won't this cause significant problems in practice? How do you handle that?
Duan Nan: It's indeed a big problem. We can learn from the path from NLP to multimodal: first build a large language model on NLP, then connect visual information, and fine-tune the unimodal model into a multimodal one using a small amount of image-text alignment data.
Although we lack a large amount of natural image-text alignment data, there is a lot of pure text, pure image, and pure video data. I believe we can first build a foundation model, like a language model, within a single modality (such as vision), strengthen its own capabilities, and then perform cross-modal fine-tuning. At that point, the required amount of alignment data will be much smaller. This is a complementary path, different from end-to-end native multimodal training.
CSDN: If we analogize the development of NLP from BERT to GPT, what stage do you think video generation is currently at? When is it expected to reach a moment similar to ChatGPT?
Duan Nan: It's far from there. I feel that foundation models in the visual domain will emerge within the next one to two years. First, similar models targeting video content will come out; second, combined with multimodal AI, they will provide crucial visual understanding capabilities for existing understanding tasks, as well as for current hot topics like embodied intelligence, agents, and robotics. If this step is done well, it will be an important cornerstone for the next stage of applications and research.
CSDN: So you think the development of video generation foundation models will be combined with directions like embodied intelligence in the future?
Duan Nan: From the perspective of AGI, the goal is to create an "intelligent agent" that far surpasses humans in certain dimensions but generally possesses human functions. Humans receive information sequentially, similar to video. Therefore, the development of visual understanding is mainly to provide more powerful temporal visual understanding capabilities for future intelligent agents (embodied intelligence, robots, etc.).
From the perspective of AIGC, in the future, everyone may be able to put themselves into movies and create with people they want to collaborate with.
Currently, AIGC has several trends:
Video generation length is increasing, enhancing narrative;
Editing capability is continuously improving, enhancing controllability;
Reference-based image/video generation is developing rapidly, allowing everyone to be the protagonist in the future.
CSDN: Are the six major challenges you shared arranged in a certain order (e.g., by difficulty)?
Duan Nan: They are arranged from a pragmatic to a medium-to-long-term perspective. The pragmatic aspect is the data level; further is the application level, considering efficiency, instruction following, and multi-turn editing interaction; going further, in my opinion, it's not just AIGC, but the development of AI itself, such as world models.
CSDN: So world models are related to the final (or crucial) node of AIGC that everyone hopes to achieve. Regarding these six challenges, does your team have corresponding optimization or improvement plans in the technical roadmap?
Duan Nan: Yes, there are plans. On one hand, we will accumulate more solid experience in basic modules (data annotation, video representation, model architecture), continuously iterate and optimize, and keep improving like a product. On the other hand, we will invest a small amount of resources in future exploration. We cannot just be followers; we need to try to do some innovative things, even if the probability is low.
CSDN: In your final summary on the Future, you mentioned changes in model paradigm, learning paradigm, and model capability. Does this relate to the real innovation you hope to achieve? Could you share your basic ideas?
Duan Nan:
Change in Model Architecture Paradigm: Moving from pure Diffusion models towards the fusion of Autoregressive and Diffusion.
Change in Learning Paradigm: Shifting from learning a text-to-video mapping to predictive learning of causal relationships, as language models do.
Change in Capabilities: From the AIGC perspective, it's generation capability, but its generalization is not as good as language models. The strongest capability of a foundation model should be few-shot learning, which is the ability to quickly solve a new type of task with a small number of new task samples. Analogous to vision, in the future, you might show the model a few examples of special effects (like an object exploding), and it can directly output a similar effect without extra training.
CSDN: These changes you envision sound very long-term.
Duan Nan: Many things are developing rapidly. Before November 2022, I thought I could work on NLP for a lifetime, but then the situation changed rapidly. So these seemingly long-term things, perhaps simplified or intermediate stages, might appear quickly.
CSDN: How quickly is this "quickly"? What important things do you estimate will happen within one to two years?
Duan Nan: I personally feel it's one to two years. Important things include: will a moment similar to GPT-3 appear in the visual domain? Can multimodal models truly unify text, images, and videos? If these can be achieved, it will be remarkable, and everyone will really have to think about what to do next.
CSDN: After you "disappeared" for a year, you've reappeared. Could you share the top three most profound lessons you learned during this year? What were the changes in your cognition, and what remained unchanged?
Duan Nan:
Skill Stack Expansion: In the past, I might have focused too much on algorithms and so-called innovation itself, neglecting the importance of data and systems in large projects. This year, I gained experience in this area.
Usability: Projects should not only pursue academic limits but also consider usability, especially in different environments. Influential research, in this era, must be usable by people.
Cognitive Change: I have a deeper understanding of the relationship between technological innovation and widespread application.
Unchanged: My pursuit of technology itself has never changed. In the broad direction, I believe some things will eventually happen, and the goal of moving in that direction has not changed.
CSDN: In the rapidly changing era of large models, technological breakthroughs are unpredictable. Amidst this uncertainty, what do you think is certain?
Duan Nan: As someone who has been in research for many years, I believe some macroscopic trends are certain. Although adjustments will be made depending on the platform and stage, the goal of moving towards the broad direction will not change.
CSDN: In the multimodal field, what do you think will ultimately be achieved?
Duan Nan: The unification of understanding and generation for language and vision. In the future, it will be easier for people to use devices to perceive content other than text (images, environment), and they will also be better able to create content for social, work, or hobby purposes. There will be more opportunities for everyone to be a self-media creator. I attended an annual conference before and saw content creators building very complex pipelines, which made me believe that creative people will integrate and use technology; it's very impressive.
CSDN: At the beginning of the year, everyone thought the text field was relatively mature, and multimodal results were not yet obvious. Do you think this result will appear in 2025 or 2026? Could you be more specific?
Duan Nan: I feel that in the next year, at least the understanding and generation of images and text, like GPT-4o, will be done very well and can solve many practical problems, such as small businesses creating advertisements with images and text.
Going further:
Application Level: New AI applications are currently uncertain; there might be developments in the future.
Model Level: Multimodal models will develop towards the physical world, perceiving vision better, such as action understanding. There will be more and more solid results in this area.
CSDN: Someone in the live stream is asking what AI assistants Mr. Duan uses? What are your AI usage habits?
Duan Nan: I use several, including Step Ahead's own "Step Ahead AI" assistant, DeepSeek, and others. Because I worked at Microsoft, I also kept the habit of using ChatGPT.
CSDN: What was your work status like over the past year? How much overtime did you work?
Duan Nan: I think it's called overtime when it's passive, and not overtime when it's active. People on our team are self-driven and don't need to be specifically asked.
CSDN: This means everyone is voluntarily invested, feeling like they've encountered many pitfalls while also feeling like it's something they want to do.
Duan Nan: Yes, that's right.
CSDN: Thank you very much for sharing, Mr. Duan. I hope you can come out and communicate with everyone more often in the future.
Duan Nan: Okay, thank you everyone.
The 2025 Global Machine Learning Technology Conference Shanghai Station has successfully concluded. This conference revolved around the forefront development trends and practical applications of AI, focusing on 12 major topics including the evolution of large language model technology, AI agents, embodied intelligence, DeepSeek technology analysis, and industry practices. Over 60 heavyweight guests from top global tech companies and academic institutions gathered together to comprehensively present the technical trends and application frontiers in the AI field.