Author | Ling Min
Nothing is more fitting than "a constellation of shining stars" to describe the recent TTS (text-to-speech) field.
Since the beginning of the year, everyone from tech giants to startups and research institutions has been pouring effort into TTS models. In February, ByteDance's overseas lab launched the lightweight TTS model MegaTTS3-Global; in March, Mobvoi, together with top academic institutions including the Hong Kong University of Science and Technology, Shanghai Jiao Tong University, Nanyang Technological University, and Northwestern Polytechnical University, open-sourced the new-generation speech generation model Spark-TTS; in the same month, OpenAI launched a TTS model built on the GPT-4o-mini architecture.
Compared to other popular technologies in the AI field, TTS seems particularly low-key, but it is the "invisible foundation" for scenarios such as smart hardware and digital humans. With a wide range of application areas and broad commercial prospects, TTS has made significant progress in the past year and is quietly changing industry rules.
Recently, a major new arrival joined the TTS field: the Speech-02 voice model, which has surpassed OpenAI and ElevenLabs, topping the Arena leaderboard to become the world's number one.
Topping the Arena Leaderboard: What Makes the Speech-02 Model Unique?
Topping the Arena leaderboard is MiniMax's latest Speech-02 model.
On the Artificial Analysis Speech Arena leaderboard, the Speech-02 model achieved an Elo rating of 1161, surpassing a series of models from OpenAI and ElevenLabs. The Arena leaderboard's Elo rating is derived from users' subjective preferences when they listen to and compare voice samples from different models. In other words, in blind comparisons users clearly preferred Speech-02 over other industry-leading voice models.
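For readers unfamiliar with how arena-style ratings work, the sketch below shows a standard Elo update from a single pairwise vote. Artificial Analysis's exact methodology (K-factor, initialization, tie handling) is not described here, so the numbers are purely illustrative.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected probability that model A wins a pairwise vote against model B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one listener preference vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# A 1161-rated model beating a 1100-rated one gains only a few points,
# so a high rating reflects consistent wins in blind listening tests.
print(elo_update(1161, 1100, a_won=True))
```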
To understand why users prefer it, we can look at specific technical indicators. On the key metric of Word Error Rate (WER), Speech-02 and ElevenLabs are neck and neck, while on Similarity (SIM, used for voice cloning scenarios), Speech-02 wins by a crushing margin.
Word Error Rate is a standard metric for speech recognition accuracy: the recognized text is compared against a human-annotated reference, and the proportion of erroneous words relative to the total number of reference words is computed. The lower the WER, the higher the recognition accuracy. In TTS evaluation, the synthesized audio is first transcribed by a speech recognition system and that transcript is compared against the input text, so a lower WER means the generated speech is more intelligible and more faithful to the script.
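For illustration, here is a minimal word-level implementation of the metric (a plain edit-distance computation, not the evaluation script behind the leaderboard numbers):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed with a word-level Levenshtein (edit) distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution plus one insertion against a 3-word reference: WER ≈ 0.667
print(word_error_rate("the cat sat", "the cat sit on"))
```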
In terms of Word Error Rate, Speech-02 performs on par with ElevenLabs across languages such as English, Arabic, Spanish, and Turkish, but it is significantly better in Chinese, Cantonese, Japanese, and Korean. In the Chinese-language environment in particular, helped by its home-field advantage, Speech-02's WER is 2.252% for Chinese and 34.111% for Cantonese, versus ElevenLabs' 16.026% and 51.513% for the same two languages.
Similarity, in turn, is a key metric for voice cloning: it measures how closely the cloned voice matches the target voice. The closer the value is to 1, the higher the similarity and the better the clone reproduces the characteristics of the target voice.
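In practice, similarity scores like these are usually the cosine similarity between speaker embeddings extracted from the cloned audio and the target audio by a speaker-verification model. The article does not say which embedding extractor is used, so the toy example below only illustrates the final comparison step with made-up vectors.

```python
import numpy as np

def speaker_similarity(emb_clone: np.ndarray, emb_target: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings; values near 1 mean the
    cloned voice closely matches the target speaker's timbre."""
    a = emb_clone / np.linalg.norm(emb_clone)
    b = emb_target / np.linalg.norm(emb_target)
    return float(np.dot(a, b))

# Toy 4-dimensional embeddings purely for illustration; real extractors
# produce vectors with hundreds of dimensions.
print(speaker_similarity(np.array([0.9, 0.1, 0.3, 0.2]),
                         np.array([0.8, 0.2, 0.3, 0.1])))
```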
On similarity, Speech-02 beats ElevenLabs across the board, meaning the cloned voices it generates are closer to the target speakers' real voices in all 24 evaluated languages.
These technical advantages bring more intuitive results, reflected in the model's performance in practical applications. Overall, Speech-02 has three main characteristics:
Super Human-like: Low and stable error rate, with performance in emotion, timbre, accent, pauses, and rhythm indistinguishable from real humans;
Personalized: Supports voice cloning from reference audio as well as generating voices from text descriptions, making it the first model in the industry to achieve "any timbre, flexibly controlled";
Diversity: Supports 32 languages and can seamlessly switch between multiple languages within the same speech segment.
The author also put Speech-02 to the test, having several different voices narrate the same passage:
The sun lazily shone on the balcony, and wisps of hot steam rose from the teacup. I leaned back in the rattan chair and casually opened an old book; a faint scent of ink drifted from between the pages. Outside the window, a few sparrows hopped on the branches, occasionally chirping, as if discussing something important. The wind gently stirred the curtains, bringing a hint of osmanthus fragrance, reminding me of the osmanthus cake my grandma made when I was a child. Just sitting quietly like this, watching the clouds roll and unroll, listening to the wind whisper, is the best time.
With the same passage, the three voices produced entirely different feelings: the first, a female voice, was clear and articulate, almost recitation-like, gentle yet poised; the second (Cantonese) had more everyday warmth, like a neighbor's younger sister chatting softly; the third sounded like a grandmother telling a story at your ear, unhurried and slow.
In multi-language evaluation, Speech-02 demonstrated impressive capabilities, switching seamlessly between multiple languages:
This business trip to Tokyo was truly crazy! As soon as I left Narita Airport, I met a サラリーマン (salaryman) shouting into his phone 『やばい! deadlineに間に合わない!』 (Oh no! I won't make the deadline!) Then I helped him find a printer, and he actually said『感恩!』(Thank you!) in Chinese and even forced a box of クッキー (cookies) on me... This plot is too much like a マンガ (manga), isn't it? But those cookies were really 美味しい (delicious), and the packaging even said 『一期一会』(Ichigo ichie - once in a lifetime encounter).
Even during the internal testing phase of the Speech-02 series, many creators had the chance to experience it firsthand.
Zhang Jingyu, a professor in the Department of Directing at the School of Drama, Film and Television, Communication University of China, used Speech-02 to produce a three-person dialogue script for a radio play. The three characters had distinct personalities, their emotions were well captured, and the dialogue flowed at a natural rhythm. "Currently, Speech-02's generation quality is very good, especially for objective, information-driven works like news broadcasts and documentary narration. Even for more challenging dramatic works, it can deliver emotional, nuanced vocal expression, and combined with editing it already has the potential to produce radio plays, audio novels, and even voiceovers for dramatic film and television."
Chen Kun, founder of Xingxian Culture and a super creator of Spiral AI, said: "Compared with Runway's 'futures' (features announced but not yet shipped), I find MiniMax's voice more surprising. The AI dubbing has a touch of human warmth."
Beyond model performance, Speech-02 offers a significant cost advantage at a price of $50 per million characters of text. In comparison, ElevenLabs' cheapest Flash v2.5 costs $103 per million characters of text, more than double that of Speech-02.
Learnable Speaker Encoder Enables Zero-Shot, Zero-Cost Voice Cloning
In TTS models, balancing model performance and cost-effectiveness is not easy. The innovation of Speech-02 lies in its ability to learn all voices simultaneously through data diversity and architectural generalization, better balancing model performance and cost.
In terms of architecture, Speech-02 is primarily composed of three components: a tokenizer, an autoregressive Transformer, and a latent flow matching model. Unlike other speech synthesis models that use pre-trained speaker encoders, the speaker encoder in Speech-02 is jointly trained with the autoregressive Transformer. This joint optimization allows the speaker encoder to be specifically tailored for the speech synthesis task, improving the model's synthesis quality by providing richer and more relevant speaker-specific information.
Furthermore, because the speaker encoder is learnable, it can be trained on all languages in the training dataset. Compared to pre-trained speaker encoders that may not have been exposed to the same diversity of languages, this learnable speaker encoder ensures broader language coverage and potentially enhances the model's generalization ability.
This also means that Speech-02 has powerful zero-shot learning capabilities, able to synthesize speech that mimics the unique timbre and style of a target speaker from just an untranscribed audio clip. Topping the Arena leaderboard this time also indicates that the underlying architecture of the Speech-02 model represents a more advanced next-generation approach. Perhaps this is the new solution for TTS models pursuing excellent performance and cost-effectiveness.
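To make the idea of joint training concrete, here is a heavily simplified PyTorch sketch, not MiniMax's implementation: a small learnable speaker encoder conditions an autoregressive Transformer over speech tokens, and a single synthesis loss updates both modules together. Every module, size, and name is an illustrative assumption.

```python
import torch
import torch.nn as nn

class TinySpeakerEncoder(nn.Module):
    """Maps reference-audio features to one speaker embedding. Because it is
    trainable, gradients from the synthesis loss shape what it extracts."""
    def __init__(self, feat_dim: int = 80, emb_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU(),
                                  nn.Linear(emb_dim, emb_dim))

    def forward(self, ref_feats: torch.Tensor) -> torch.Tensor:
        # ref_feats: (batch, frames, feat_dim) -> mean-pool over time
        return self.proj(ref_feats).mean(dim=1)

class TinyARDecoder(nn.Module):
    """Autoregressive Transformer over discrete speech tokens, conditioned by
    prepending the speaker embedding to the token sequence."""
    def __init__(self, vocab: int = 1024, emb_dim: int = 256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, emb_dim)
        layer = nn.TransformerEncoderLayer(emb_dim, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(emb_dim, vocab)

    def forward(self, tokens: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        x = torch.cat([spk_emb.unsqueeze(1), self.tok_emb(tokens)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.head(self.body(x, mask=mask))[:, 1:]  # drop the speaker slot

# One joint optimisation step: the same loss updates BOTH modules.
spk_enc, ar = TinySpeakerEncoder(), TinyARDecoder()
opt = torch.optim.AdamW(list(spk_enc.parameters()) + list(ar.parameters()), lr=1e-4)
ref_audio = torch.randn(2, 120, 80)            # untranscribed reference features
tokens = torch.randint(0, 1024, (2, 50))       # target speech tokens
logits = ar(tokens[:, :-1], spk_enc(ref_audio))
loss = nn.functional.cross_entropy(logits.reshape(-1, 1024), tokens[:, 1:].reshape(-1))
loss.backward()
opt.step()
```

Because the speaker encoder here is optimized by the synthesis loss itself rather than a separate speaker-verification objective, it is free to encode whatever speaker information most helps the decoder reproduce the voice, which is the point of the joint training described above.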
Innovative Flow-VAE Architecture Provides a New Solution for TTS Models
Before Speech-02, many TTS methods had clear limitations, especially in core scenarios like zero-shot voice cloning and high-fidelity synthesis, where it was hard to get both audio quality and speaker similarity right. Traditional TTS methods, for example, rely heavily on transcribed reference audio, which not only limits cross-language capability but also constrains the expressiveness of the synthesized speech. And because of limitations in the generation component, many models struggle to balance audio quality against speaker similarity. This is why so many TTS models sound distinctly "AI-like," while Speech-02 can reach human-voice similarity of up to 99%.
At the architectural level, Speech-02 proposes the Flow-VAE architecture, which builds on a VAE (Variational Autoencoder) but significantly outperforms one. Its distinctive feature is the introduction of a flow matching model, which can flexibly transform the latent space through a series of invertible mappings. The combination plays to both sides' strengths: it retains the VAE's ability to model the data while leveraging the flow model's ability to accurately fit complex distributions, allowing the model to better capture the complex structures and distributional characteristics of the data.
According to reports, the flow matching model adopts a Transformer architecture and uses KL divergence as a constraint when optimizing the encoder-decoder module, making the latent distribution more compact and easier to predict. By contrast, most traditional flow matching approaches take a "detour": they first predict a mel-spectrogram and then convert it into an audio waveform with a vocoder, and the mel-spectrogram easily becomes an information bottleneck that caps the final speech quality. Speech-02's Flow-VAE model instead directly models the distribution of continuous latent features extracted by the audio-trained encoder-decoder module, effectively "taking a shortcut" and avoiding the bottleneck.
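As a rough illustration of the training objective described above, the sketch below applies flow matching directly to VAE latents while a KL term keeps the latent distribution compact. Conditioning on text and speaker is omitted, and every dimension and loss weight is an assumption for illustration, not Speech-02's actual configuration.

```python
import torch
import torch.nn as nn

class TinyLatentEncoder(nn.Module):
    """VAE-style encoder: audio features -> latent Gaussian (mu, logvar)."""
    def __init__(self, feat_dim: int = 80, latent_dim: int = 64):
        super().__init__()
        self.net = nn.Linear(feat_dim, 2 * latent_dim)

    def forward(self, feats: torch.Tensor):
        mu, logvar = self.net(feats).chunk(2, dim=-1)
        return mu, logvar

class TinyVelocityNet(nn.Module):
    """Predicts the flow-matching velocity field v(x_t, t) in latent space."""
    def __init__(self, latent_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + 1, 256), nn.SiLU(),
                                 nn.Linear(256, latent_dim))

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x_t, t.expand_as(x_t[..., :1])], dim=-1))

encoder, velocity = TinyLatentEncoder(), TinyVelocityNet()
feats = torch.randn(8, 100, 80)                 # (batch, frames, feature_dim)

# 1) Encode audio into a continuous latent; the KL term keeps the latent
#    distribution compact and easy to predict.
mu, logvar = encoder(feats)
z1 = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).mean()

# 2) Flow matching directly on the latent (no mel-spectrogram detour):
#    interpolate noise -> latent and regress the straight-line velocity.
z0 = torch.randn_like(z1)
t = torch.rand(z1.size(0), 1, 1)
x_t = (1.0 - t) * z0 + t * z1
fm_loss = (velocity(x_t, t) - (z1 - z0)).pow(2).mean()

loss = fm_loss + 1e-2 * kl                      # loss weight is illustrative
loss.backward()
```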
In evaluations on several test sets, Flow-VAE comprehensively outperformed the plain VAE.
Take vocoder re-synthesis as an example: the waveform reconstruction abilities of Flow-VAE and VAE were compared by measuring the re-synthesized audio against the original audio along multiple dimensions. On every evaluation metric, Flow-VAE showed a significant advantage over the VAE model.
For TTS synthesis, following the Seed-TTS evaluation protocol for Word Error Rate (WER) and Similarity (SIM), the technical team generated test data under two inference settings, zero-shot and one-shot. The results show that, compared with the VAE model, Flow-VAE holds a significant advantage on both WER and SIM.
This also explains why Speech-02 was able to top the Arena leaderboard and pull ahead of leading overseas models on multiple technical indicators. Taking a longer view, Speech-02's significance goes far beyond sweeping the charts: its innovative architecture resolves the pain points of existing TTS methods and pushes out the technical frontier.
More "Human-like" AI Dubbing,
The Journey is the Sea of Stars
From MegaTTS3-Global to Spark-TTS and now Speech-02, TTS models are locked in a "clash of the titans," each showing its own strengths. This healthy competition not only drives rapid iteration of TTS technology but also enriches the ecosystem of AI application interaction. TTS models are now being applied across more and more fields, improving the user experience along multiple dimensions.
Take education as an example: TTS models can turn dense written textbooks into vivid audiobooks, and through voice cloning they can give users 24-hour practice companions in the form of celebrity AI assistants. The "Daniel Wu Teaches You Spoken English" course, which recently caused a craze in the market, uses voice cloning to build a 24-hour customizable AI language tutoring system, "AI A Zu". Powered by MiniMax's large voice model and multimodal interaction system, "AI A Zu" faithfully replicates Daniel Wu's voice and can not only correct users' pronunciation and grammar but also give realistic, emotionally expressive feedback in situational conversations.
In the smart hardware field, TTS models are breathing life into products with more human-like AI voices. Take toys: many dolls ship without any voice function, but with a TTS model, an AI pendant can make them "talk". Bubble Pal, rated the No. 1 AI toy by Xiaohongshu users, is a representative conversational pendant of this kind. By integrating MiniMax's voice model capabilities, Bubble Pal can replicate the voices of cartoon characters children love and faithfully reproduce their timbres, making the toys "come alive".
In the smart car field, TTS models paired with deep reasoning models can also deliver personalized experiences. Arcfox (Jihu) vehicles, for example, use DeepSeek to accurately understand user intent and MiniMax's voice model to respond to questions instantly, making the once-cold cockpit feel warmer and letting it converse with users directly by voice for a more personal experience.
It is worth mentioning that as early as 3 years ago, MiniMax began focusing on the TTS track, providing users with personalized, natural, and pleasant voice services. In November 2023, MiniMax launched its first generation large voice model, the abab-speech series, supporting functions such as multi-character audio generation and text character classification. By opening up its voice technology, MiniMax became one of the earliest companies in China to provide voice services using a large model architecture. Currently, MiniMax has successfully served over 50,000 enterprise users and individual developers globally, including well-known companies such as China Literature's Qidian Audiobook and Gaotu Techedu.
As TTS technology continues to advance, we have reason to believe that it will be applied in more scenarios, bringing more convenience to users. It may even rewrite the future AI application interaction paradigm.