Alibaba Open-Sources Qwen-Image: Free GPT-4o-Style Ghibli Image Generation, Best-in-Class for Chinese

Early this morning, Alibaba DAMO Academy open-sourced its latest text-to-image model, Qwen-Image.

Qwen-Image is a 20-billion-parameter MMDiT (multimodal diffusion Transformer) model that can generate images in dozens of styles, including realistic, anime, cyberpunk, sci-fi, minimalist, retro, surreal, and ink wash. It supports common operations such as style transfer, element addition/deletion/modification, detail enhancement, text editing, and character pose adjustment.

Qwen-Image can also generate the Ghibli-style images that went viral with OpenAI's GPT-4o. In hands-on tests by "AIGC Open Community", the gap between the two is minimal, and on highly complex Chinese prompts and embedded text rendering, Qwen-Image actually performs better.

According to test data released by Alibaba, Qwen-Image demonstrates excellent image generation and editing capabilities on GenEval, DPG, and OneIG-Bench, as well as on GEdit, ImgEdit, and GSO. It significantly outperforms FLUX.1 [Dev], the dark horse among open-source text-to-image models, making it the best text-to-image model for Chinese.

Comparison of Qwen-Image with other models

Free online experience: https://chat.qwen.ai/c/guest

Open-source repositories:

https://huggingface.co/Qwen/Qwen-Image

https://modelscope.cn/models/Qwen/Qwen-Image

https://github.com/QwenLM/Qwen-Image
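
For local use, the weights can be loaded with the standard Hugging Face diffusers pipeline. The sketch below is a minimal example assuming the generic DiffusionPipeline interface; the exact supported arguments (steps, guidance, resolutions) may differ, so check the README in the repositories above.

```python
# Minimal local-inference sketch for Qwen-Image via Hugging Face diffusers.
# Assumes the generic DiffusionPipeline interface; verify the exact arguments
# against the official README before relying on them.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image",
    torch_dtype=torch.bfloat16,  # 20B parameters: expect high VRAM requirements
).to("cuda")

prompt = (
    'A little girl running in the wind and rain, smiling, '
    'with "Qwen-Image" written on it. Ghibli style.'
)
image = pipe(prompt=prompt, num_inference_steps=50).images[0]
image.save("qwen_image_demo.png")
```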

Currently, Alibaba offers Qwen-Image for free, and it can even be used in guest mode without registration. Open the address above and select the "Image Generation" option to get started.

Qwen-Image web interface showing image generation option

Before generating images, you can choose an aspect ratio such as 1:1, 3:4, or 16:9 to suit different devices (phones, tablets) and various media platforms, which is very handy for covers and illustrations.

Qwen-Image interface showing aspect ratio options

First, let's try a simple prompt: A little girl running in the wind and rain, smiling, with "Qwen-Image" written on it. Ghibli style.

Ghibli-style image generated by Qwen-Image from a simple prompt

Another Ghibli-style image generated by Qwen-Image

Let's try something more complex: The streets of ancient Chang'an city, with antique buildings like taverns, teahouses, and shops on both sides. Pedestrians on the street are dressed in various traditional costumes, some riding horses, some walking, and hawkers are selling goods, all filled with a rich historical atmosphere. A prominent tavern signboard reads "Alibaba DAMO Academy".

Image generated by Qwen-Image of ancient Chang'an city

Another image of ancient Chang'an city generated by Qwen-Image

Now a text-rendering test: A Chinese woman wearing a "QWEN" logo T-shirt is smiling at the camera, holding a black marker. On the glass board behind her is handwritten: "I. Qwen-Image's Technical Roadmap: Exploring the limits of foundational visual generative models, pioneering a future where understanding and generation are integrated. II. Qwen-Image's Model Features: 1. Complex text rendering. Supports Chinese and English rendering, automatic layout; 2. Precise image editing. Supports text editing, object addition/removal, style transformation. III. Qwen-Image's Future Vision: Empowering professional content creation, aiding the development of generative AI."

Image showing a woman next to a board with Qwen-Image features

Another image showing a woman next to a board with Qwen-Image features

Let's try an English prompt: An ancient battlefield, with dark clouds in the sky, thunder rumbling and lightning flashing. Soldiers in armor are fighting bravely on the battlefield. In the distance, huge monsters are roaring, as if it is a contest between humans and mythical creatures, filled with a tense and exciting atmosphere.


Image generated by Qwen-Image depicting an ancient battlefield

Another image of an ancient battlefield generated by Qwen-Image

One more prompt: An endless desert stretches silently beneath the night sky, the Milky Way clearly visible overhead, stars scattered densely like silver sand. In the foreground, a rolling dune bears delicate ripples left by the wind, creating a serene, majestic, and mysterious atmosphere.

Image generated by Qwen-Image of a night desert with the Milky Way

Another image of a night desert with the Milky Way generated by Qwen-Image

Let's experience Qwen-Image's powerful image editing capabilities by transforming the desert image we just generated into Ghibli style.

Just upload the image to the chat box and enter: "Help me convert this image into a daytime Ghibli style."

Screenshot of Qwen-Image interface showing image upload and prompt

Edited image: Desert transformed to Ghibli style

Next, let's convert the Ghibli-style little girl generated earlier into a realistic-style girl.

Screenshot of Qwen-Image interface showing image upload and prompt

Edited image: Ghibli-style girl transformed to realistic

Commenting on Alibaba's newly open-sourced Qwen-Image, netizens said it is excellent and on par with GPT-4o.

User comment about Qwen-Image comparing it to GPT-4o

"The images look great, definitely have to try it."

User comment expressing excitement for Qwen-Image

"The Qwen team is making great strides with all their models! Well done, the Qwen3 series is a significant upgrade for local open-source models. And now, even image generation is included."

User comment praising Qwen team's progress

"This is truly an amazing model. I never thought Qwen could release a 20-billion-parameter multimodal diffusion text-to-image generative model, but here it is!"

User comment expressing surprise and admiration for Qwen-Image

"It outperforms all other models in various benchmark tests and is released under the Apache license, which is highly commendable. Congratulations to the Qwen team."

User comment congratulating Qwen team on performance and license

The Qwen-Image model consists of three main components: a multimodal large language model (MLLM), a variational autoencoder (VAE), and a multimodal diffusion Transformer (MMDiT).

Among them, the multimodal large language model acts as the condition encoder, responsible for extracting key features from the text input; Qwen-Image implements this module with Qwen2.5-VL. Qwen2.5-VL not only excels at aligning the language and visual spaces, letting language and image information correspond in the same dimensions, but also retains excellent language modeling ability, with virtually no performance loss compared to pure language models.

Qwen-Image supports multimodal input, processing text and images simultaneously, which unlocks a wider range of advanced functionality such as image editing. When a user inputs a text description, Qwen2.5-VL extracts key features from it and converts them into high-dimensional vector representations that provide precise semantic guidance for subsequent image generation.
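
Conceptually, the conditioning path works as in the sketch below: the prompt goes through a frozen Qwen2.5-VL backbone, and its per-token hidden states become the semantic guidance fed to the diffusion backbone. This is an illustrative reconstruction, not Qwen-Image's actual code; the 7B checkpoint and the use of the last hidden layer are assumptions.

```python
# Illustrative sketch of a multimodal LLM as condition encoder.
# Not Qwen-Image's actual code: the checkpoint and the choice of the last
# hidden layer as conditioning features are assumptions.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
encoder = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16
).eval()

prompt = "An endless desert under the Milky Way, Ghibli style"
inputs = processor(text=[prompt], return_tensors="pt")

with torch.no_grad():
    out = encoder(**inputs, output_hidden_states=True)

# Per-token hidden states act as the high-dimensional semantic guidance
# that steers the MMDiT denoising process.
cond_tokens = out.hidden_states[-1]  # shape: (1, seq_len, hidden_dim)
```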

Diagram illustrating the Qwen-Image model architecture

The variational autoencoder (VAE) handles image tokenization: it compresses input images into compact latent representations and, during inference, decodes those latents back into images. Qwen-Image's VAE adopts a single-encoder, dual-decoder architecture, a design driven by the pursuit of a universal visual representation that is compatible with both images and videos while avoiding the performance compromises common to joint models.

The VAE is based on the Wan-2.1-VAE architecture, with the encoder frozen to preserve its foundational capabilities and only the image decoder fine-tuned to focus on image reconstruction. To improve reconstruction fidelity for small text and fine details, the decoder's training data includes a large number of text-rich images covering real documents and synthetic paragraphs in multiple languages.
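
In training terms, the single-encoder/dual-decoder setup reduces to freezing the shared encoder and updating only the image decoder. A schematic sketch, where vae.encoder and vae.image_decoder are placeholder module names rather than the real implementation:

```python
# Schematic sketch of the single-encoder, dual-decoder fine-tuning setup.
# `vae.encoder` and `vae.image_decoder` are placeholder names.
import torch

def setup_vae_finetuning(vae: torch.nn.Module) -> list:
    # Freeze the shared encoder to preserve the joint image/video latent
    # space inherited from Wan-2.1-VAE.
    for p in vae.encoder.parameters():
        p.requires_grad = False

    # Fine-tune only the image decoder so it specializes in high-fidelity
    # image reconstruction (small text, fine details).
    trainable = list(vae.image_decoder.parameters())
    for p in trainable:
        p.requires_grad = True
    return trainable
```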

As for the training strategy, grid artifacts are reduced by balancing a reconstruction loss against a perceptual loss, with their proportions dynamically adjusted. The team also found that as reconstruction quality improves, the benefit of an adversarial loss diminishes, so only the first two losses are retained, enhancing detail rendering while keeping training efficient.
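
Written out, that objective is a weighted sum of a reconstruction term and a perceptual term whose weights shift over training, with the adversarial term dropped. A sketch under those assumptions; the LPIPS metric and the linear schedule are illustrative choices, not confirmed details:

```python
# Sketch of the decoder fine-tuning loss: reconstruction + perceptual,
# dynamically re-weighted, with the adversarial term dropped.
# The LPIPS metric and the linear schedule are illustrative assumptions.
import torch
import torch.nn.functional as F
import lpips  # pip install lpips

perceptual = lpips.LPIPS(net="vgg")

def decoder_loss(recon, target, step, total_steps):
    # Shift weight from pixel fidelity toward perceptual quality as
    # training progresses (one plausible "dynamic adjustment").
    w_perc = 0.1 + 0.4 * (step / total_steps)
    rec = F.l1_loss(recon, target)            # pixel-level reconstruction
    perc = perceptual(recon, target).mean()   # suppresses grid artifacts
    return rec + w_perc * perc
```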

As the core architecture of Qwen-Image, MMDiT is responsible for modeling the complex joint distribution between noise and image latent representations under text guidance. Qwen-Image also introduces an innovative Multimodal Scalable RoPE (MSRoPE) positional embedding method, which effectively resolves the positional confusion between text and image tokens during joint encoding.

Diagram illustrating Multimodal Scalable RoPE (MSRoPE)

In traditional methods, text tokens are either concatenated directly after the image position embeddings or treated as 2D tokens of some fixed shape. Either choice can easily produce identical positional encodings for distinct tokens, hurting the model's ability to tell them apart.

MSRoPE instead treats the text input as a 2D tensor, applies the same positional ID along both dimensions, and conceptually concatenates the text along the image's diagonal. This retains the advantage of image resolution scaling, remains functionally equivalent to 1D-RoPE on the text side, removes the need to search for an optimal text position encoding, and significantly improves image-text alignment accuracy.
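
A toy reconstruction makes the scheme concrete: image tokens keep their 2D (row, column) coordinates, while each text token receives one shared ID on both axes, offset past the image grid so the text continues along the diagonal. The offset rule (the max of height and width) is our reading of the description, not a confirmed implementation detail.

```python
# Toy sketch of MSRoPE-style 2D position IDs. Image tokens keep (row, col);
# text token t gets (offset + t, offset + t), continuing along the image
# diagonal. The offset rule is an interpretation, not the confirmed code.
import torch

def msrope_position_ids(h: int, w: int, num_text_tokens: int) -> torch.Tensor:
    # Image latent grid: token at (i, j) -> position (i, j).
    rows = torch.arange(h).repeat_interleave(w)
    cols = torch.arange(w).repeat(h)
    image_pos = torch.stack([rows, cols], dim=-1)   # (h*w, 2)

    # Text tokens: the same ID on both axes, which is functionally
    # equivalent to 1D-RoPE on the text side.
    t = torch.arange(num_text_tokens) + max(h, w)
    text_pos = torch.stack([t, t], dim=-1)          # (num_text_tokens, 2)

    return torch.cat([image_pos, text_pos], dim=0)  # (h*w + num_text, 2)

# Example: a 4x4 latent grid followed by 3 text tokens.
print(msrope_position_ids(4, 4, 3))
```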

Main Tag: Generative AI

Sub Tags: Text-to-Image, Multimodal Models, Open-Source AI, Image Generation

