Xinzhiyuan Report
Editor: Ding Hui
[Xinzhiyuan Guide] Xiaohongshu, known for "grass-planting" (recommending products and experiences), is stepping up its in-house technology push, open-sourcing three models within two months! The newly open-sourced multimodal large model dots.vlm1 is built on a fully self-developed visual encoder; in our tests it read color-blindness plates, solved Sudoku, tackled college entrance exam math problems, and wrote Li Bai-style poetry from a single prompt. Its visual understanding and reasoning capabilities approach those of the closed-source Gemini 2.5 Pro.
The AI world lately can only be described as a clash of titans; the competition is ferocious.
OpenAI finally released open-source models, Claude upgraded from Opus 4 to 4.1, and Google's release of Genie 3, which generates game worlds, sparked heated discussion in the community.
On the domestic model front, just a few days ago, the top 10 open-source models on HuggingFace were all from China.
Domestic models dominated the top 10, and gpt-oss surged to first place after being open-sourced.
But look closely at these top-ranked open-source models and you'll notice a pattern: most are text-only models lacking multimodal capabilities.
OpenAI's first open-source models are text-only as well.
As for open-source models that are both genuinely multimodal and easy to use, there really aren't many.
While the text models were busy battling it out, Xiaohongshu's Humane Intelligence Lab (hi lab) quietly open-sourced the visual language model dots.vlm1 yesterday, bringing an unexpected surprise to the VLM field.
Why should we pay attention to a visual language model open-sourced by a little-known team?
One reason is that hi lab's dots.ocr document parsing model, open-sourced last week, climbed to seventh place on Huggingface's hot list. Its base model is a "small model" with 1.7 billion parameters, yet it still achieved industry-leading SOTA performance, successfully attracting our attention.
This team is clearly serious about their work!
A closer look at the team's structure and vision reveals that "hi lab" was formed by the merger and upgrade of Xiaohongshu's internal large model technology and application product teams. In hi lab's official introduction, it particularly emphasized "focusing R&D on diverse forms of intelligence." They hope to continuously expand the possibilities of human-computer interaction by integrating various forms of intelligence such as interpersonal intelligence, spatial intelligence, musical intelligence, and humanistic care. Their belief in and commitment to multimodality are evident.
And dots.vlm1 is the first multimodal large model developed and open-sourced by Xiaohongshu's hi lab.
This model is built upon hi lab's fully self-developed 1.2 billion-parameter NaViT visual encoder and the DeepSeek V3 large language model. It performs remarkably well in visual understanding and reasoning tasks, approaching SOTA levels, while still maintaining competitiveness in pure text tasks.
On major visual evaluation benchmarks such as MMMU/MathVision/OCR Reasoning, dots.vlm1's overall performance is close to current leading models like Gemini 2.5 Pro and Seed-VL1.5 Thinking, demonstrating strong image-text understanding and reasoning capabilities.
On typical text reasoning tasks (e.g., AIME, GPQA, LiveCodeBench), dots.vlm1 performs roughly on par with DeepSeek-R1-0528, showing a degree of generality in math and code, though gaps remain on more diverse reasoning tasks such as GPQA.
Overall, dots.vlm1's visual multimodal capabilities are approaching SOTA level.
Github Repo: https://github.com/rednote-hilab/dots.vlm1
Huggingface Model: https://huggingface.co/rednote-hilab/dots.vlm1.inst
Demo: https://huggingface.co/spaces/rednote-hilab/dots-vlm1-demo
In hands-on testing, we found that dots.vlm1 exceeded expectations across the board, whether in spatial relationship understanding, complex chart reasoning, OCR, college entrance exam problems, STEM questions, or poetry writing.
Impressive Real-World Performance
First, let's look at spatial understanding, such as this diagram showing common object spatial relationships.
To keep the model from shortcutting via semantics instead of genuinely looking, two relationships were randomly masked, and dots.vlm1 was asked to supply the spatial relationships between the objects.
The model successfully identified and accurately provided the relationships "between" and "above".
Facing complex charts, dots.vlm1 also demonstrated powerful analytical capabilities.
For example, it was asked to extract models with scores between 50-59 and model names containing the letter "P" from the chart below.
dots.vlm1 carried out multiple logical judgments in parallel during its thinking process. This kind of multi-chain reasoning shows that dots.vlm1 can not only "see" but also "think."
Similarly, even with Sudoku problems, dots.vlm1 perfectly completed the solution.
The model's first step is to format the problem for easier subsequent calculations.
Then it begins step-by-step calculation and checking. Notably, dots.vlm1 converted the Sudoku grid in the image into a structured, vector-style description, which is indeed a clever approach.
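Converting the picture into a structured grid is precisely what makes rule-checking mechanical. The toy Python snippet below is only an illustration of that idea, not how dots.vlm1 reasons internally: once the puzzle is a 9×9 array rather than pixels, Sudoku's constraints become trivial to verify.

```python
# Toy illustration: with the puzzle as a 9x9 list of lists (0 = empty cell),
# checking Sudoku's row, column, and 3x3-box constraints is mechanical.
def is_valid_placement(grid, row, col, value):
    if any(grid[row][c] == value for c in range(9)):   # row constraint
        return False
    if any(grid[r][col] == value for r in range(9)):   # column constraint
        return False
    br, bc = 3 * (row // 3), 3 * (col // 3)            # top-left of the 3x3 box
    return all(grid[br + r][bc + c] != value
               for r in range(3) for c in range(3))
```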
During the long thinking process we also spotted an "aha moment" similar to DeepSeek's, with dots.vlm1 even exclaiming a human-like "Yes!" at one point.
However, a careful review of the thinking process showed that in the initial vectorization step, the '6' at position (3,8) was misread as sitting at (3,9). The model nonetheless "strictly followed Sudoku rules" and ultimately forced the '6' at position (6,9) to become an '8'.
This reasoning process is quite powerful! It means the model is truly thinking and reasoning.
Solving this Sudoku took a very long time, and crucially, despite such a long chain of thought, the model never broke off midway.
dots.vlm1's image recognition capability is also very strong, whether for common, obscure, or even images difficult for humans to identify.
For example, the classic red-green color blindness number test.
dots.vlm1 answered all of them correctly in one go, with accurate recognition of both colors and shapes.
Another common problem for VLMs is "counting," where the model needs to identify the types and quantities of objects in an image.
These problems are simple for humans, but not so easy for VLMs.
In such "object search" tasks, VLM performance rapidly declines as the number of targets in the scene increases.
It can be seen that when the number of objects exceeds 6, the accuracy of VLM drops sharply.
dots.vlm1 successfully completed the quantity identification for the top-left, bottom-left, and top-right sections; for the bottom-right, which is difficult even for humans to count accurately, dots.vlm1 still tried to approximate during its thinking process.
Next, let's look at its reasoning ability.
For example, if you are traveling to the Palace Museum with a group of 8 people (7 adults and one 12-year-old child), and you plan to visit the Central Axis, the Three Great Halls, and the Treasure Gallery, which service should you purchase to save the most money?
dots.vlm1 quickly integrated all the information and worked out the optimal plan, down to the details: it recognized that the Central Axis route already covers the Three Great Halls and therefore chose the Central Axis + Treasure Gallery option.
This "math calculation" seems a bit simple, so let's try a more complex one: this year's college entrance examination math problem.
dots.vlm1 first accurately identified the "blurred" content, and the solution process it provided was excellent.
The results were well-formatted, and the answer to each question was highlighted with a box.
Besides visual and reasoning abilities, we also tried asking many "obscure" questions.
For example, "What type of animal is a 'chevrotain'? What is its main habitat? Provide two references."
This model could accurately answer questions about this obscure animal.
dots.vlm1 also excels at identifying cultural relics.
Even with an image merging three objects, dots.vlm1 could accurately segment the image and provide correct results.
It also provided the excavation site and detailed introduction of the relics.
We can even use dots.vlm1 on civil service aptitude test questions.
After complex reasoning, the model provided the correct answer: "The direction is horizontal and the symbols at both ends are the same."
Finally, let's test the model's native strength: pure text capability.
"Mimic Li Bai's poetry style and write a poem describing what the 'Poem Cloud' mentioned by Liu Cixin in 'The Poem Cloud' truly is."
It's truly impressive; dots.vlm1's thinking process immediately grasped the essence of the problem: "simultaneously handling sci-fi concepts and Tang poetry charm."
What do you think of this poem?
Alas! The Poem Cloud is vast as the starry sea, hard to capture a speck of mortal tenderness!
It's truly beautifully written!
《Ode to the Poem Cloud》
Divine craft carved chaos, waving forth the Poem Cloud!
Billions of stars turn to dense ink, the Milky Way pours forth as a jade stream.
Exhausting endless forms of Chinese characters, arranging and combining to enact spring and autumn.
Ascending to the azure sky, descending to the yellow springs, gathering myriad ancient thoughts and sorrows.
Li Bai, Du Fu, Su Shi, Xin Qiji, all encompassed, every pearl-like word revealing starry eyes.
This cloud is neither mist nor smoke, but civilization's ultimate quest.
Quantum entanglement contains true meaning, the sea of probability allows for rise and fall.
Swinging the brush, ink disperses like clouds and smoke, though divine work is done, its essence is hard to retain.
Alas! The Poem Cloud is vast as the starry sea, hard to capture a speck of mortal tenderness!
dots.vlm1 Technical Architecture
dots.vlm1 consists of three core components: a fully self-developed 1.2 billion-parameter NaViT visual encoder, a lightweight MLP adapter, and the DeepSeek V3 MoE large language model.
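To make that composition concrete, here is a minimal PyTorch-style sketch of the forward path: the visual encoder turns pixels into features, a lightweight MLP adapter projects them into the language model's embedding space, and the decoder consumes the combined sequence. The class names, dimensions, and interfaces are illustrative assumptions, not the actual dots.vlm1 implementation.

```python
import torch
import torch.nn as nn

class MLPAdapter(nn.Module):
    """Lightweight bridge projecting visual features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int, hidden_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(visual_feats)

class ToyVLM(nn.Module):
    """Sketch of the three-part layout: vision encoder -> adapter -> MoE LLM."""
    def __init__(self, vision_encoder: nn.Module, adapter: MLPAdapter, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # stand-in for the 1.2B NaViT encoder
        self.adapter = adapter                # lightweight MLP adapter
        self.llm = llm                        # stand-in for the DeepSeek V3 MoE decoder

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        visual_tokens = self.adapter(self.vision_encoder(pixel_values))
        # Prepend visual tokens to the text embeddings before decoding.
        return self.llm(torch.cat([visual_tokens, text_embeds], dim=1))
```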
This architecture is trained through a three-stage process (a brief code sketch follows the list):
First Stage: Visual Encoder Pre-training: The NaViT encoder is trained from scratch, aiming to maximize perception of diverse visual data. Broadly speaking, whether the visual encoder is self-developed is a dividing line for VLM performance, and dots.vlm1 reconfirms the point.
Second Stage: VLM Pre-training: The visual encoder is jointly trained with the DeepSeek V3 LLM, using large-scale, diverse multimodal datasets.
Third Stage: VLM Post-training: Model generalization ability is enhanced through supervised fine-tuning (SFT), trained using only task-diverse data.
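The illustrative config below compresses those three stages into one schedule; the stage names follow the article, while the data mixes and which components are trainable at each stage are assumptions made for the sketch, not published settings.

```python
# Illustrative staged-training schedule; which weights are trained at each
# stage and the exact data mixes are assumptions, not published details.
TRAINING_STAGES = [
    {"name": "1_visual_encoder_pretraining",
     "trainable": ["vision_encoder"],                    # NaViT trained from scratch
     "data": ["image_text_pairs", "pure_images"]},
    {"name": "2_vlm_pretraining",
     "trainable": ["vision_encoder", "adapter", "llm"],  # joint training with DeepSeek V3
     "data": ["cross_modal_translation", "cross_modal_fusion"]},
    {"name": "3_vlm_post_training_sft",
     "trainable": ["adapter", "llm"],                    # supervised fine-tuning
     "data": ["task_diverse_sft"]},
]

for stage in TRAINING_STAGES:
    print(f"{stage['name']}: train {stage['trainable']} on {stage['data']}")
```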
NaViT Visual Encoder: Native Advantage from "Starting from Scratch"
Rather than fine-tuning an existing, mature visual encoder, hi lab trained dots.vlm1's encoder entirely from scratch, giving it native support for dynamic resolution.
This makes it a visual encoder purpose-built for visual language models, able to handle high-resolution input natively.
The encoder comprises 42 Transformer layers and 1.2 billion parameters, reserving ample representational capacity for high-resolution input.
hi lab designed a two-stage training strategy for the NaViT encoder.
· First Stage: Pre-training
Training starts from completely random initialization, avoiding the "resolution anchor" constraints inherited from older architectures and allowing native support for dynamic resolution.
Pre-training begins on 224×224 images so the model learns basic visual and semantic perception.
This step uses a dual supervision strategy, sketched in code after the list:
Next Token Prediction (NTP): Training the model's perceptual ability through a large number of image-text pairs;
Next Patch Generation (NPG): Utilizing pure image data to predict image patches through a diffusion model, enhancing spatial and semantic perception capabilities.
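As a rough picture of what "dual supervision" means here, the sketch below sums a caption next-token-prediction loss with a simplified diffusion-style patch-reconstruction loss. The loss weighting, the noising scheme, and the model methods (`caption_logits`, `denoise_patches`) are assumptions for illustration; the actual NPG formulation is not described in this level of detail.

```python
import torch
import torch.nn.functional as F

def dual_supervision_loss(model, batch, npg_weight: float = 1.0):
    """Illustrative combination of the NTP and NPG objectives described above.
    `model` is assumed to expose a captioning head and a patch-denoising head."""
    # Next Token Prediction: cross-entropy over caption tokens given the image.
    logits = model.caption_logits(batch["pixel_values"], batch["caption_ids"])
    ntp_loss = F.cross_entropy(logits.flatten(0, 1), batch["caption_labels"].flatten())

    # Next Patch Generation: a simplified diffusion-style objective on pure
    # image data: predict the noise added to patches of shape (B, N, D).
    patches = batch["patches"]
    noise = torch.randn_like(patches)
    t = torch.rand(patches.shape[0], device=patches.device).view(-1, 1, 1)
    predicted_noise = model.denoise_patches(patches + t * noise, t)
    npg_loss = F.mse_loss(predicted_noise, noise)

    return ntp_loss + npg_weight * npg_loss
```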
· Second Stage: Resolution Enhancement Pre-training
Image resolution is then increased gradually: training first runs on megapixel-level input over a large number of tokens, then moves up to ten-megapixel-level input.
To further enhance generalization capabilities, richer data sources were introduced, including OCR scene images, grounding data, and video frames.
VLM Pre-training Data Layout
To enhance dots.vlm1's multimodal capabilities, the lab divided the pre-training data into two main categories:
· First Category: Cross-modal Inter-translation Data
This category of data is used to train the model to describe, summarize, or reconstruct image content with text. Simply put, it's about "translating" Image ⇄ Text mutually.
General images + alt text or dense captions;
Complex charts, tables, formulas, graphics (real or synthetic) + structured annotations or text;
OCR scenarios: multi-language, scene understanding, pure text, document parsing, etc.;
Video frames + temporal sequence descriptions;
Grounding supervised data: such as bounding boxes and keypoints.
Alt text, for instance, is the short ALT description that accompanies an image on a web page.
Alt Text helps the model quickly grasp "general descriptions," while Dense Caption enables the model to "see details and describe them specifically."
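To make the distinction concrete, here are two hypothetical annotation records for the same photo; the field names and wording are invented for illustration.

```python
# Hypothetical annotation records for one image; field names are illustrative.
alt_text_record = {
    "image": "street_cafe.jpg",
    "alt_text": "People sitting outside a cafe",  # short, general description
}

dense_caption_record = {
    "image": "street_cafe.jpg",
    "dense_caption": (
        "Three people sit at a round metal table outside a cafe with a green "
        "awning; two coffee cups and a laptop are on the table, and a bicycle "
        "leans against the window on the left."
    ),  # detailed description of objects, attributes, and layout
}
```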
Grounding supervision comes in too many forms to enumerate exhaustively, covering all manner of image/video and text pairings.
For example, the Flickr30k Entities dataset.
dots.vlm1 aims to build a full-spectrum data distribution, covering all visual information that can be understood by humans and converted into discrete token sequences.
· Second Category: Cross-modal Fusion Data
The second type of data is used to train the model to perform Next Token Prediction (NTP) in image-text mixed contexts, preventing the model from over-relying on a single modality.
Specialized cleaning pipelines were designed for different types of fusion data, with the following two types being particularly effective:
Web Data
Web image-text data is rich in diversity, but the quality of visual and text alignment is often poor.
Instead of using traditional CLIP score filtering, an internally developed VLM model is employed for rewriting and cleaning, removing low-quality images and weakly related texts.
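The idea, as described, is to replace a scalar CLIP-similarity threshold with a VLM that judges and rewrites each pair. Below is a minimal sketch of such a pass; `vlm_judge_and_rewrite` and its return format are hypothetical stand-ins for whatever internal model hi lab actually uses.

```python
# Minimal sketch of a VLM-based cleaning pass over web image-text pairs.
# `vlm_judge_and_rewrite` is a hypothetical callable assumed to return a dict
# with an image-quality verdict, a text-relevance verdict, and a rewritten caption.
def clean_web_pairs(pairs, vlm_judge_and_rewrite):
    cleaned = []
    for image, raw_text in pairs:
        verdict = vlm_judge_and_rewrite(image=image, text=raw_text)
        if verdict["image_quality"] == "low":
            continue  # drop low-quality images
        if verdict["relevance"] == "weak":
            continue  # drop weakly related text
        cleaned.append((image, verdict["rewritten_text"]))  # keep the cleaned caption
    return cleaned
```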
PDF Data
PDF content is generally of high quality.
To fully utilize this data, Xiaohongshu Hi Lab developed a dedicated parsing model, dots.ocr, which converts PDF documents into interweaved image-text representations.
dots.ocr was previously open-sourced on HuggingFace and reached SOTA level in this field.
In parallel, entire PDF pages are rendered as images and portions of the text regions are randomly occluded; the model is guided to combine layout and context to predict the hidden content, strengthening its ability to understand visually formatted documents.
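A rough sketch of that occlusion step: render the page, randomly cover a fraction of its text boxes (for example, boxes produced by a layout parser such as dots.ocr), and treat the hidden text as the prediction target. The box format, mask ratio, and drawing details are assumptions for illustration.

```python
import random
from PIL import Image, ImageDraw

def occlude_text_regions(page_image: Image.Image, text_boxes, mask_ratio: float = 0.3):
    """Randomly cover a fraction of text boxes on a rendered PDF page.

    `text_boxes` is assumed to be a list of ((x0, y0, x1, y1), text) tuples.
    Returns the occluded page image plus the hidden texts, which become the
    targets the model must predict from layout and surrounding context.
    """
    occluded = page_image.copy()
    draw = ImageDraw.Draw(occluded)
    targets = []
    for box, text in text_boxes:
        if random.random() < mask_ratio:
            draw.rectangle(box, fill="white")  # paint over the text region
            targets.append({"box": box, "text": text})
    return occluded, targets
```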
So the question arises: as a content-sharing platform, why is Xiaohongshu entering the already competitive field of AI large models with its own self-developed multimodal large model?
Multimodality: The Inevitable Path to AGI
From the "Ghibli fever" sparked by OpenAI's GPT-4o "native all-around multimodal model" in April, it's clear that pure text models are inferior to multimodal large models.
Ghibli-style images and Sora community images.
The importance of multimodal AI lies in its ability to simulate human comprehensive perception of the world using multiple senses, leading to a more complete and nuanced understanding.
By combining the strengths of information from different modalities, AI systems can make more holistic judgments about complex scenarios.
Tesla bot selling popcorn.
Visual Language Models (VLM), which integrate visual, text, and other capabilities, are becoming the main battleground for enterprise upgrades.
Whether it's autonomous driving or embodied AI, VLMs are needed as the robots' eyes, and even brains, to help them understand and integrate into human society.
Use cases for VLM models.
Meanwhile, Fei-Fei Li's "world models," Google's newly released Genie 3, and other 3D world generation technologies, together with embodied AI, are pushing multimodality to a higher level.
Google's recently released Genie 3.
It's not just about understanding and generating content, but also simulating the real physical world and self-evolution to foster more natural human-computer interaction forms.
Beyond generating images and videos, Google's NotebookLM can generate conversational podcasts from text, specializing in the audio domain.
Among them, text-to-image models and visual language models are two closely related branches in multimodal AI, but with different goals.
The former focuses on generating images, while the latter focuses on understanding images and outputting text.
Text-to-image and video generation models like Midjourney and Sora remain an industry hot spot, widely used in creative work, content production, and advertising.
VLMs are playing an increasingly important role in understanding and reasoning, especially with strong demand in fields like embodied AI and intelligent driving.
However, the boundary between the two is increasingly blurring, with generation models and VLMs both converging toward unified multimodal LLMs (MLLMs).
Models like the upcoming GPT-5 and Google's Gemini 2.5 Pro are positioned as such "all-around" models.
Although their focuses differ, both text-to-image models and VLMs fundamentally require models to learn the connection between vision and language.
That Xiaohongshu chose to launch a VLM first, rather than a text-to-image model, is telling: text-to-image is mostly about "creative assistance," while a VLM is about "making AI understand people better."
From Xiaohongshu's past moves in AI, it's clear that this community, built around UGC (user-generated content), has been cautious about AIGC, still weighing how AI-assisted creation affects content authenticity and human touch.
However, when it comes to "making AI understand people better," Xiaohongshu seems to have greater motivation to invest in R&D.
After all, Xiaohongshu currently has over 350 million monthly active users, with users generating massive amounts of image-text content daily. Large models can play a significant role in better understanding this content and providing more accurate personalized recommendations.
At the same time, how AI will participate in community interactions in the future is a question worth long-term exploration.
Xiaohongshu's commitment to self-developed technology is also stronger than before.
In addition to building its own cloud last year, a recent piece of unofficial news many overlooked is that Xiaohongshu will switch its online office software in mid-August, fully migrating from WeChat Work to its self-developed "redcity."
At the time, some in the industry argued that building one's own IM tool is an inevitable step for a unicorn on the way to becoming a top-tier company, and a sign of a clear strategic shift.
Therefore, Xiaohongshu's entry into self-developed large models is very reasonable, even inevitable.
Xiaohongshu's Pursuit of Diverse Intelligence
Whether it's dots.llm1, open-sourced two months ago, dots.ocr, open-sourced last week, or the newly released dots.vlm1, it's clear that Xiaohongshu's Humane Intelligence Lab has decided to develop its own large models.
The dots model family is also continuously growing.
Another point worth noting is that dots.vlm1 is based on DeepSeek V3, not their own dots.llm1.
One can infer that these projects likely ran in parallel inside Xiaohongshu, with the VLM taking somewhat longer because its training is more complex.
Either way, it shows that Xiaohongshu intended to build its own multimodal large models from the outset; in the future, dots multimodal models may well be trained on top of dots text models.
Perhaps Xiaohongshu will use this VLM as an "understanding foundation," first maximizing its ability to "understand users and content," and then progressively developing subsequent creative capabilities like image-to-image and video generation.
Perhaps these model capabilities will be better integrated with Xiaohongshu's application products in the future, proving the "model-application integration" prophecy.
At the beginning of this year, Xiaohongshu's hi lab began recruiting an "AI Humanistic Trainer" team to help AI with post-training.
The "AI Humanistic Trainer" team members come from diverse backgrounds, including philosophy, literature, political science, anthropology, history, and film arts. These "liberal arts majors" in some ways reflect Xiaohongshu's deep understanding of multimodality.
Looking forward to hi lab's next open-source work!