Qwen Breakthrough: Using "Parallel Computing" Instead of "Stacking Parameters", New Method Cuts Memory Overhead by 22x and Latency by 6x

The MLNLP community is a well-known machine learning and natural language processing community in China and abroad, covering NLP master's and doctoral students, university faculty, and industry researchers.

The community's vision is to promote communication and progress between academia, industry, and enthusiasts of natural language processing and machine learning in China and abroad, especially for beginners.

Source | Deep Learning Natural Language Processing


Paper: Parallel Scaling Law for Language Models
Link: https://arxiv.org/pdf/2505.10475

The evolution of LLMs has long relied on "stacking parameters", but the larger the model grows, the more obvious the problems become:

Exploding training costs: Training a trillion-parameter model requires tens of millions of kilowatt-hours of electricity

Slow inference speed: Generating a sentence takes dozens of seconds

Cannot run on mobile phones: VRAM requirements often reach hundreds of GB, so ordinary devices cannot deploy these models


The recently popular "test-time scaling" can also improve performance, but it requires generating hundreds of intermediate reasoning steps, making inference even slower. Researchers can't help but wonder: is there a way to scale that is both effective and resource-efficient?

ParScale's breakthrough idea: Using "parallel computing" instead of "stacking parameters"

The core innovation of this paper is to let the same model "think in several parallel copies".

Traditional method: One model computes in a "single thread"

ParScale: Copy the input, prepend a different learnable "thinking prefix" to each copy, and run P computation streams simultaneously

Dynamic fusion: The model automatically scores the different streams' outputs and combines them into the final answer with learned weights (see the code sketch below)


An intuitive analogy: it's like having 10 experts solve the same problem at the same time and then weighting their answers according to how their problem-solving went, instead of relying on a single super-expert.
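To make the mechanism concrete, here is a minimal PyTorch-style sketch of the idea, assuming a HuggingFace-style backbone that accepts `inputs_embeds` and returns `last_hidden_state`; the names (`ParScaleWrapper`, `prefix`, `score_head`) are illustrative and are not the paper's released code:

```python
import torch
import torch.nn as nn

class ParScaleWrapper(nn.Module):
    """Illustrative sketch: run P prefixed copies of the input through one
    shared backbone, then fuse the P outputs with learned dynamic weights."""

    def __init__(self, backbone: nn.Module, hidden_size: int,
                 num_streams: int = 8, prefix_len: int = 48):
        super().__init__()
        self.backbone = backbone          # the shared, unchanged LLM
        self.P = num_streams
        # One learnable "thinking prefix" per stream (prefix-tuning style).
        self.prefix = nn.Parameter(torch.randn(num_streams, prefix_len, hidden_size) * 0.02)
        # Tiny head that scores each stream's output for dynamic weighting.
        self.score_head = nn.Linear(hidden_size, 1)

    def forward(self, inputs_embeds: torch.Tensor) -> torch.Tensor:
        # inputs_embeds: (batch, seq_len, hidden)
        B, T, H = inputs_embeds.shape
        # Replicate the input P times and prepend a different prefix to each copy.
        x = inputs_embeds.unsqueeze(1).expand(B, self.P, T, H)
        prefix = self.prefix.unsqueeze(0).expand(B, -1, -1, -1)
        x = torch.cat([prefix, x], dim=2)                   # (B, P, prefix+T, H)
        x = x.reshape(B * self.P, -1, H)                    # fold streams into the batch dim
        hidden = self.backbone(inputs_embeds=x).last_hidden_state
        hidden = hidden[:, -T:, :].reshape(B, self.P, T, H)  # drop prefix positions
        # Dynamic fusion: softmax over per-stream scores, then a weighted sum.
        weights = torch.softmax(self.score_head(hidden).mean(dim=2), dim=1)  # (B, P, 1)
        fused = (weights.unsqueeze(-1) * hidden).sum(dim=1)                  # (B, T, H)
        return fused
```

Because the P copies are folded into the batch dimension, the extra work is mostly parallel GPU throughput rather than extra parameters, which is exactly the trade the article describes.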

Core: Dynamic weighted fusion

The key formula is hidden in Proposition 1 of the paper: model loss has a logarithmic relationship with the number of parallel streams P; scaling to P streams behaves roughly like multiplying the parameter count by a factor on the order of log P.

(N is the number of parameters, P is the number of parallel streams)
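The original article shows the exact expression only as an image; as a hedged reconstruction in the usual Chinchilla-style notation (A, E, and α are fitted constants, and k is a diversity coefficient introduced here for illustration; the paper's exact parameterization may differ), the relationship reads roughly:

\[
\mathcal{L}(N, P) \;\approx\; \left(\frac{A}{N\,(1 + k\log P)}\right)^{\alpha} + E
\]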

This means:

The effect of parallel computing ≈ logarithmic growth of the number of parameters

Opening 8 parallel streams ≈ the effect of nearly tripling the parameter count

But the actual increased hardware cost is negligible


Experimental results: inference memory overhead cut by 22 times

The paper trained 67 models on 42B tokens of data, and the conclusions are striking:

Performance comparable to parameter scaling: 1.6B parameters + 8 parallel streams ≈ 4.4B parameter model

Inference costs plummet (compared with parameter scaling that achieves the same performance gain, at batch size 1):

Memory overhead reduced by up to 22 times

Latency overhead reduced by up to 6 times

Mathematical reasoning surged by 34%: The improvement was most obvious for complex tasks such as GSM8K

Figure: memory/latency comparison under different batch sizes; blue arrows mark traditional parameter scaling, gray marks ParScale

Even more remarkable, old models can also be retrofitted! With a small amount of fine-tuning data, existing models can be adapted to support parallel computation, which is practically "the art of rejuvenating old models".
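A hedged sketch of what such a retrofit could look like in practice: freeze the existing backbone and train only the newly introduced prefixes and fusion head on a small dataset. `ParScaleWrapper` is the illustrative class from the earlier sketch and `pretrained_model` is a placeholder for any existing LLM, not the paper's code:

```python
import torch

# Hypothetical parameter-efficient retrofit: only the ParScale additions train.
wrapper = ParScaleWrapper(backbone=pretrained_model,   # placeholder: an existing LLM
                          hidden_size=2048, num_streams=8)

for p in wrapper.backbone.parameters():
    p.requires_grad = False                 # original weights stay untouched

trainable = [p for p in wrapper.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
# ...then fine-tune briefly so the prefixes and fusion head learn to produce
# and combine the P "thinking" streams.
```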

Huge practical value: even mobile phones can run an "LLM"

The most disruptive application scenario for this technology is edge devices:

Mobile phones/cars only need to load a small model and open multiple parallel streams to approach the performance of a large model

Dynamically adjust the number of parallel streams: open 2 streams when chatting, open 8 streams when solving math problems (a toy illustration follows this list)

Crushing cost advantage: the article claims the overall cost is only about 1/6 of traditional methods
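As a toy illustration of the dynamic idea above (the task labels and thresholds are invented for the example, not taken from the paper):

```python
def choose_num_streams(task: str) -> int:
    """Pick how many parallel streams to open, based on task type (illustrative)."""
    if task in {"chitchat", "weather", "reminders"}:
        return 2     # light tasks: save battery and latency
    if task in {"math", "coding", "long_reasoning"}:
        return 8     # hard tasks: spend more parallel compute for accuracy
    return 4         # default middle ground
```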

In the future, our mobile assistant may be both a "life manager" and a "math teacher", and it won't lag at all!

Imagining the future: The model's "computing perpetual motion machine"

ParScale reveals a deeper law: model capability is determined not only by the number of parameters, but also by how computation is organized. This opens up a new world:

Dynamic scaling: Adjust the number of parallel streams in real time according to task difficulty

Hybrid architecture: MoE + ParScale combined

Cross-domain applications: image generation, protein structure prediction, and other fields could borrow the same idea

Figure: proportion of parameter vs. parallel-computing contributions to model capability

Perhaps the key to AI evolution in the future is no longer "building bigger models", but "using computing power smarter".

This paper is truly a masterpiece! Epoch-making! Well done, Qwen~


