Interpretation of Seed1.5-VL Technical Report

ByteDance recently released a powerful closed-source multimodal large language model, Seed1.5-VL. Its technical report is remarkably frank and worth reading. In this article, I will guide you through a step-by-step close reading of the report, following the order of the original.

Overview

Technical report: https://arxiv.org/abs/2505.07062

Seed1.5-VL consists of a visual encoder with 532M parameters and an MoE LLM with 20B active parameters. It achieved SOTA results on 38 out of 60 benchmarks for multimodal vision-language models, and it demonstrates extremely strong performance in GUI tasks, video understanding, and visual reasoning. Currently, Seed1.5-VL is a commercial model with paid API access, but it is not open source.

Model Architecture


First, Seed1.5-VL's model architecture follows the typical VLM design: a native dynamic-resolution Seed-ViT as the image encoder (similar to Qwen2-VL, using 2D RoPE positional encoding), followed by an MLP adapter, and finally an autoregressive LLM. (Regarding the input resolution of visual encoders, please refer to this account's high-resolution MLLM series: Towards High-Resolution VLM (11): VILA-HD)
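To make the wiring concrete, here is a minimal sketch (in PyTorch) of the ViT, MLP adapter, and LLM data flow described above. All module names and dimensions are illustrative placeholders, not the report's actual implementation.

```python
# Minimal sketch of the "ViT -> MLP adapter -> autoregressive LLM" wiring.
# Module names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class MLPAdapter(nn.Module):
    """Projects vision tokens into the LLM embedding space."""
    def __init__(self, vit_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(vision_tokens)

def vlm_forward(vision_encoder, adapter, llm, text_embedding, pixel_values, input_ids):
    # 1. Encode the image at its native resolution into a variable-length token sequence.
    vision_tokens = vision_encoder(pixel_values)            # (1, n_vis, vit_dim)
    # 2. Map vision tokens into the LLM's embedding space.
    vision_embeds = adapter(vision_tokens)                   # (1, n_vis, llm_dim)
    # 3. Embed the text prompt and prepend the visual tokens.
    text_embeds = text_embedding(input_ids)                  # (1, n_txt, llm_dim)
    inputs_embeds = torch.cat([vision_embeds, text_embeds], dim=1)
    # 4. Run the autoregressive LLM over the combined sequence.
    return llm(inputs_embeds)
```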


Fixed-resolution input causes numerous problems in practice, especially for detail-sensitive tasks such as OCR, where performance can degrade significantly. To address the challenge of image input resolution, the report develops the native-resolution visual encoder Seed-ViT.


The Seed-ViT pre-training process is divided into three stages: (1) Masked Image Modeling (MIM) with 2D RoPE, (2) Native-resolution Contrastive Learning, and (3) Omni-modal Pre-training.

In the first stage, the training objective is to strengthen the encoder's perception of visual geometry and structure through MIM. EVA-02-CLIP-E serves as the teacher model, while the student model is randomly initialized according to the architecture defined in Table 1. During training, 75% of image patches and their corresponding 2D RoPE positional encodings are randomly masked, and the CLIP features produced by the teacher are used as the reconstruction target; the objective is a cosine-similarity loss between student and teacher outputs. The authors found that the mismatch in positional embeddings between student and teacher (the teacher uses learnable positional embeddings while the student uses 2D RoPE) does not hurt performance. On the contrary, 2D RoPE gives the student strong native dynamic-resolution capabilities. Scaling up this MIM stage significantly improved the resulting VLM's chart/document understanding and OCR abilities.
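A minimal sketch of this objective is below: mask 75% of the patch tokens, run the student on the masked input, and regress the frozen teacher's CLIP features with a cosine-similarity loss. Tensor shapes and the masking scheme are simplified assumptions for illustration.

```python
# Sketch of MIM-as-feature-distillation: cosine loss between student predictions
# and frozen teacher CLIP features on the masked positions.
import torch
import torch.nn.functional as F

def mim_distill_loss(student, teacher, patches, mask_ratio: float = 0.75):
    """patches: (batch, num_patches, dim) patchified image tokens."""
    b, n, _ = patches.shape
    num_masked = int(n * mask_ratio)

    # Randomly choose which patches to mask for each image in the batch.
    idx = torch.rand(b, n, device=patches.device).argsort(dim=1)[:, :num_masked]
    mask = torch.zeros(b, n, device=patches.device).scatter_(1, idx, 1.0).bool()

    # Teacher sees the full image; its CLIP features are the reconstruction target.
    with torch.no_grad():
        target = teacher(patches)                 # (b, n, feat_dim)

    # Student reconstructs features for all positions from the masked input.
    pred = student(patches, mask)                 # (b, n, feat_dim)

    # Cosine-similarity loss on the masked positions only (a common MIM choice;
    # the report does not spell out this detail).
    cos = F.cosine_similarity(pred[mask], target[mask], dim=-1)
    return (1.0 - cos).mean()
```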

In the contrastive learning stage, the visual encoder is initialized from the MIM-trained student, while the text encoder is initialized from EVA-02-CLIP-E's text encoder. For each image-text pair, attention pooling aggregates the patch features extracted by the visual encoder into a single 1280-dimensional image embedding. Image and text embeddings are then aligned by jointly optimizing the SigLIP loss and the SuperClass loss.
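A minimal sketch of the attention pooling plus the SigLIP pairwise sigmoid loss follows; the SuperClass term is omitted, and the learnable-query pooling design and the fixed temperature/bias are assumptions for illustration.

```python
# Sketch of attention pooling into a single image embedding and SigLIP-style alignment.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPool(nn.Module):
    """Pools a variable-length set of patch features into one embedding via a learnable query."""
    def __init__(self, dim: int, out_dim: int = 1280, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, out_dim)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:   # (b, n, dim)
        q = self.query.expand(patch_feats.size(0), -1, -1)
        pooled, _ = self.attn(q, patch_feats, patch_feats)           # (b, 1, dim)
        return self.proj(pooled.squeeze(1))                          # (b, out_dim)

def siglip_loss(img_emb, txt_emb, t: float = 10.0, b: float = -10.0):
    """Pairwise sigmoid loss: matched pairs are positives, all other pairs negatives.
    Temperature t and bias b are fixed here for simplicity (learnable in SigLIP)."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() * t + b                                   # (b, b)
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1  # +1 diag, -1 off-diag
    return -F.logsigmoid(labels * logits).mean()
```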

The final omni-modal pre-training stage adopts the MiCo framework, building aligned tuples from video data that include video frames, audio, visual captions, and audio captions. ViT encodes video frames and audio, while a separate text encoder processes captions. By aligning these embeddings, ViT learns a unified omni-modal representation. Although this stage consumes only 4.8% of the total training data tokens during ViT pre-training, it significantly improves ViT's performance on image and video understanding tasks.
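As a rough illustration of the alignment idea (not MiCo's actual objective), the sketch below pulls the embeddings of each tuple element toward one another with a symmetric InfoNCE loss; the specific pairings and temperature are assumptions.

```python
# Sketch of omni-modal tuple alignment: frames/audio encoded by the ViT,
# captions by a text encoder, all pulled together contrastively.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of embeddings (matched by row index)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def omni_alignment_loss(frame_emb, audio_emb, vis_cap_emb, aud_cap_emb):
    # Align each non-text modality with its own caption, plus frames with audio,
    # so the ViT learns a single representation shared across modalities.
    return (info_nce(frame_emb, vis_cap_emb)
            + info_nce(audio_emb, aud_cap_emb)
            + info_nce(frame_emb, audio_emb)) / 3.0
```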

For video inputs, Seed1.5-VL introduces dynamic frame-resolution sampling to efficiently process videos of varying lengths and information densities, under a maximum budget of 81,920 tokens per video: the budget can be spent on higher resolution for fewer frames, or on lower resolution to accommodate more frames from longer videos.
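The sketch below illustrates this frames-versus-resolution trade-off under a fixed token budget. The candidate resolutions, patch size, and greedy selection rule are illustrative assumptions, not the report's actual sampling policy.

```python
# Sketch: pick the highest resolution whose per-frame token cost still fits the
# desired number of frames inside the 81,920-token budget.
def plan_video_sampling(duration_s: float,
                        target_fps: float = 1.0,
                        token_budget: int = 81920,
                        patch_size: int = 14,
                        candidate_sides: tuple = (1024, 768, 640, 448, 336, 224)) -> dict:
    desired_frames = max(1, int(duration_s * target_fps))
    for side in candidate_sides:                          # try high resolution first
        tokens_per_frame = (side // patch_size) ** 2
        if token_budget // tokens_per_frame >= desired_frames:
            return {"frames": desired_frames, "side": side,
                    "tokens": desired_frames * tokens_per_frame}
    # Very long video: keep the lowest resolution and cap the frame count instead.
    side = candidate_sides[-1]
    tokens_per_frame = (side // patch_size) ** 2
    frames = max(1, token_budget // tokens_per_frame)
    return {"frames": frames, "side": side, "tokens": frames * tokens_per_frame}

# A short clip keeps high resolution; a 2-minute clip at 1 fps drops resolution to fit 120 frames.
print(plan_video_sampling(8))      # {'frames': 8, 'side': 1024, ...}
print(plan_video_sampling(120))    # {'frames': 120, 'side': 336, ...}
```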

Pre-training Data Engineering

As we all know, apart from infrastructure, the core of large-model work lies in "data engineering." Although it is often dismissed as mere "data cleaning" and looked down upon by scholars more at home with formula derivations and circuit diagrams, it is undeniable that data engineering directly determines both the ceiling and the floor of a model's capabilities. Let's first look at how Seed1.5-VL handled data engineering during the pre-training phase.

Seed1.5-VL's pre-training corpus used 3 trillion (3T) tokens. It's important to note that top large language models typically use 10-30T tokens for pre-training; for downstream multimodal pre-training, 3T tokens is astonishingly high.

General image-text pairs, used to inject visual knowledge, are rebalanced to counteract the long-tail distribution of visual concepts, ensuring that rare concepts receive sufficient training iterations. This rebalancing strategy is crucial in pre-training.

To verify this, the researchers conducted sandbox experiments on the BioTrove dataset with three sampling configurations:

Random-46M: Randomly selected 46 million samples from the training set.

Max1k-46M: Selected 46 million samples, with a maximum of 1000 samples per species, ensuring the inclusion of rare species.

Max100-15M: Selected 15 million samples, with a maximum of 100 samples per species, increasing the relative exposure of rare species.


Experimental results showed that the Random-46M configuration performed poorly in rare species recognition. In contrast, limiting the maximum number of samples for common species (Max1k-46M) significantly improved the performance for rare species. Further limiting the representation of common species (Max100-15M) enhanced the memorization of rare species but adversely affected the recognition of common species. Therefore, effectively acquiring visual knowledge requires maintaining diverse examples of common visual concepts while ensuring sufficient training iterations for rare visual concepts.
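A minimal sketch of the per-class capping behind the Max1k/Max100 configurations is shown below; the data layout (a list of (item, species) pairs) is an assumed format for illustration.

```python
# Sketch: keep at most `cap` samples per class so frequent species stop crowding out rare ones.
import random
from collections import defaultdict

def cap_per_class(samples, cap: int, seed: int = 0):
    """samples: iterable of (item, class_label); returns a rebalanced subset."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in samples:
        by_class[label].append(item)
    kept = []
    for label, items in by_class.items():
        rng.shuffle(items)
        kept.extend((item, label) for item in items[:cap])   # keep at most `cap` per class
    rng.shuffle(kept)
    return kept

# e.g. cap_per_class(biotrove_samples, cap=1000)  -> "Max1k"-style subset
#      cap_per_class(biotrove_samples, cap=100)   -> "Max100"-style subset
```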

OCR data. OCR has become a fiercely contested capability for multimodal large models, as it greatly expands the application scenarios of MLLMs. A large amount of annotated and synthetic OCR data was used in training Seed1.5-VL.


The authors constructed an OCR training dataset containing over 1 billion samples, covering documents, scene text, tables, charts, and flowcharts, as shown in the figure above.

Grounding and Counting Task Data. Three main data types were utilized: bounding box annotations, point annotations, and counting data.

3D Spatial Understanding Data. To enable the model to understand 3D space from a single image, data was constructed for three tasks: relative depth ordering, absolute depth estimation, and 3D localization.

Video Data. Includes general video understanding data, temporal localization and retrieval data, and video stream data (interleaved Q&A, real-time comments, etc.).

STEM Data (Science, Technology, Engineering, Mathematics). The authors collected 3.2 million high-quality educational localization samples covering 300 categories including mathematics, physics, chemistry, and biology, and synthesized 10 million structured tables in different formats, 4.5 million chemical structure diagrams, and 1.5 million coordinate-system diagrams (including function graphs and positional graphs). A K12 description subset adds 100,000 manually annotated descriptions of educational images, 1 million visual question answering (VQA) pairs, 1 million machine-generated descriptions, and hundreds of thousands of geometry descriptions. Beyond this, over 100 million K12-level practice problems were processed, supplemented by tens of millions of Chinese adult-education problems and millions of image-related problems. A hybrid collection strategy (manual annotation, automated synthesis, and strict quality control) ensured multimodal coverage (text, visual, charts) across core STEM fields such as mathematics, physics, and chemistry.

GUI Data. GUI manipulation is one of the most common application scenarios for MLLMs. To support powerful GUI perception, grounding, and reasoning, the authors created a large-scale dataset spanning web, application, and desktop environments. Each screenshot is paired with structured metadata for its elements (type, bounding box, text, and depth), collected through automated parsing and human-assisted exploration.
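To make the annotation format concrete, here is one possible layout for the per-screenshot metadata, based only on the fields named above (type, bounding box, text, depth). The actual schema is not published, so this dataclass design and the example values are assumptions.

```python
# Hypothetical schema for GUI screenshot metadata; field names follow the report's description.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class GUIElement:
    elem_type: str                      # e.g. "button", "input", "link"
    bbox: Tuple[int, int, int, int]     # (x1, y1, x2, y2) in screenshot pixel coordinates
    text: str                           # visible text or accessibility label
    depth: int                          # nesting depth in the UI hierarchy

@dataclass
class GUIScreenshot:
    image_path: str
    platform: str                       # "web", "app", or "desktop"
    elements: List[GUIElement] = field(default_factory=list)

# Example record for a login-page screenshot (hypothetical values):
sample = GUIScreenshot(
    image_path="screens/login_0001.png",
    platform="web",
    elements=[
        GUIElement("input", (120, 200, 520, 240), "Email", depth=3),
        GUIElement("button", (120, 260, 240, 300), "Sign in", depth=3),
    ],
)
```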

Pre-training Recipe

The model comprises three main modules: a visual encoder, an MLP adapter, and a language model. Before the visual-language model (VLM) pre-training phase, the visual encoder is trained independently, as described above. The language model is initialized from an internally pre-trained model with approximately 20 billion active parameters; it uses a decoder-only MoE architecture and was trained on a massive corpus of trillions of high-quality pure-text tokens. The VLM pre-training is divided into three distinct stages:

Stage 0: Align the visual encoder with the language model by training only the MLP adapter while freezing both the visual encoder and the language model (see the parameter-freezing sketch after this list). Skipping this stage leads to slightly higher loss and slightly worse performance.

Stage 1: All model parameters are trainable. This stage focuses on knowledge accumulation and on building up the model's visual grounding and OCR capabilities by training on a 3-trillion-token multimodal corpus composed primarily of captions, interleaved image-text, visual grounding, and OCR data. Empirically, adding a small amount of pure-text tokens (e.g., 5%) helps maintain the model's language capabilities, and adding a small amount of instruction-following data yields more reliable evaluation results, thereby decoupling pre-training development from post-training.

Stage 2: The authors create a more balanced data mix across tasks and add data from new domains (such as video understanding, programming, and 3D spatial understanding). The sequence length is also increased from 32,768 to 131,072 to better accommodate long dependencies in videos and complex reasoning problems. As in Stage 1, all model parameters are trainable.
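The parameter-freezing logic of Stage 0 versus the later stages can be sketched as follows; the module layout and optimizer choice are illustrative assumptions, not the report's training code.

```python
# Sketch of the stage transitions: Stage 0 trains the adapter only, later stages unfreeze everything.
import torch

def set_stage(model, stage: int):
    """model is assumed to expose .vision_encoder, .adapter, and .llm submodules."""
    if stage == 0:
        # Align vision and language: train the adapter only, freeze encoder and LLM.
        for p in model.vision_encoder.parameters():
            p.requires_grad = False
        for p in model.llm.parameters():
            p.requires_grad = False
        for p in model.adapter.parameters():
            p.requires_grad = True
    else:
        # Stages 1 and 2: all parameters are trainable.
        for p in model.parameters():
            p.requires_grad = True

def build_optimizer(model, lr: float = 1e-4):
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```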

Post-training

The post-training stage endows Seed1.5-VL with strong instruction-following and reasoning capabilities through a combination of Supervised Fine-tuning (SFT) and Reinforcement Learning (RL). The process begins with an SFT model trained on cold-start data. A critical component is the data pipeline, which continuously collects difficult and diverse prompts, improves the SFT data through rejection sampling, and feeds it into RL. Post-training then proceeds iteratively: the SFT model is gradually enhanced by incorporating the RL model's refined outputs on diverse prompts, and this loop continues until the prompt pool is exhausted and performance metrics converge. The result is Seed1.5-VL, capable of generating both quick, concise replies and in-depth answers with Long Chain-of-Thought (LongCoT).


The Supervised Fine-tuning (SFT) stage is crucial for equipping Seed1.5-VL with foundational instruction-following and reasoning capabilities before reinforcement learning. The SFT dataset consists of two main parts, each targeting different capabilities. The first part is general instruction data, training Seed1.5-VL to handle diverse and complex instructions, with a focus on generating concise and accurate responses. The second part is Long Chain-of-Thought (LongCoT) data, focusing on generating detailed, step-by-step reasoning processes. This data is generated through prompt engineering and rejection sampling.

To further enhance performance, the authors also incorporated an additional 30,000 high-quality samples sourced from the research community, filtered down from a carefully curated open-source collection of approximately 1.5 million entries. First, a proprietary image-text embedding model was used to cluster the image-text pairs into task-specific categories, keeping the sampled subset highly diverse across tasks. Next, a well-trained SFT model, aligned with human preferences, generated multiple candidate responses (rollouts) for this subset. These responses were filtered by an LLM-as-judge, with the original ground truth as a reference, to assess correctness. Finally, a reward model was used to select, from the retained responses, those best aligned with human preferences, yielding the final rejection-sampling fine-tuning data. In this way, the open-source portion of the SFT dataset was compressed from 1.5 million entries to about 30,000 high-quality samples; the remaining open-source data had already been used during the pre-training phase.
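This filtering pipeline can be sketched roughly as below, assuming the prompt pool has already been diversity-sampled via the embedding-based clustering step. Every function and attribute name here (sample_response, judge_correct, reward_score, etc.) is a hypothetical placeholder, not a real API.

```python
# Sketch of rejection-sampling data curation: rollouts -> LLM-as-judge -> reward-model selection.
def rejection_sample_sft_data(pool, sft_model, judge, reward_model, n_rollouts=8):
    curated = []
    for example in pool:                                 # each has .prompt, .image, .answer
        # 1. Generate several candidate responses ("rollouts") from the SFT model.
        candidates = [sft_model.sample_response(example.prompt, example.image)
                      for _ in range(n_rollouts)]
        # 2. LLM-as-judge: keep candidates judged correct against the original ground truth.
        correct = [c for c in candidates
                   if judge.judge_correct(example.prompt, c, example.answer)]
        if not correct:
            continue                                     # prompts with no correct rollout are dropped
        # 3. Reward model: keep the correct candidate best aligned with human preferences.
        best = max(correct, key=lambda c: reward_model.reward_score(example.prompt, c))
        curated.append({"prompt": example.prompt, "image": example.image, "response": best})
    return curated
```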

For the RLHF stage, human-annotated preference data was collected to train the reward model. A 5-level rating system was used to compare candidate model responses, and preference strength was used to refine synthetic data.

The online reinforcement learning implementation adopts a PPO variant, with reward signals derived from the probability the reward model assigns to generating the answer tokens. During PPO training, the reward model is given either the ground-truth answer or the top-N answers from the SFT model as a reference.
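As one way to picture "reward from the probability of generating answer tokens," the sketch below scores a response by the probability a generative reward model assigns to an affirmative verdict token, assuming a HuggingFace-style causal LM and tokenizer. The judge prompt template and the single "yes" token are assumptions, not the report's exact recipe.

```python
# Sketch: read a scalar reward off a generative reward model's next-token distribution.
import torch
import torch.nn.functional as F

@torch.no_grad()
def verdict_reward(reward_model, tokenizer, prompt, response, reference, yes_token="yes"):
    """Reward = probability the reward model generates the affirmative verdict token."""
    judge_text = (f"Question: {prompt}\nReference answer: {reference}\n"
                  f"Candidate answer: {response}\nIs the candidate correct? Answer yes or no: ")
    input_ids = tokenizer(judge_text, return_tensors="pt").input_ids
    logits = reward_model(input_ids).logits[:, -1, :]          # next-token distribution
    probs = F.softmax(logits, dim=-1)
    yes_id = tokenizer.convert_tokens_to_ids(yes_token)
    return probs[0, yes_id].item()                              # scalar reward in [0, 1]
```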

Evaluation

Seed-ViT is a small yet high-performing visual encoder.


Seed1.5-VL ultimately achieved SOTA on numerous VQA benchmarks.

