The MLNLP community is a well-known machine learning and natural language processing community in China and abroad, covering NLP master's and doctoral students, university faculty, and industry researchers.
The community's vision is to promote communication and progress between academia, industry, and enthusiasts of natural language processing and machine learning at home and abroad, especially for beginners.
Source | Deep Learning Natural Language Processing
Paper: Parallel Scaling Law for Language Models
Link: https://arxiv.org/pdf/2505.10475
The evolution of LLMs has always relied on "stacking parameters", but the larger the model, the more obvious the problems:
Exploding training costs: Training a trillion-parameter model requires tens of millions of kilowatt-hours of electricity
Slow inference speed: Generating a sentence takes dozens of seconds
Cannot run on phones: VRAM requirements often reach hundreds of GB, far beyond what ordinary devices can deploy
The recently proposed "test-time scaling" can also improve performance, but it requires generating hundreds of intermediate reasoning steps, making inference even slower. Researchers can't help but wonder: is there a way to scale that is both effective and resource-efficient?
ParScale's breakthrough idea: Using "parallel computing" instead of "stacking parameters"
The core innovation of this paper is to let the same model "think along several paths in parallel".
Traditional method: One model computes in a "single thread"
ParScale: copy the input, prepend P different "thinking prefixes", and run P computation streams simultaneously
Dynamic fusion: the model automatically scores each stream's result and synthesizes the final answer as a weighted sum (a minimal sketch follows)
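To make these three steps concrete, here is a minimal PyTorch sketch of the idea, assuming a generic encoder backbone. The class name, the input-level prefix concatenation, and the linear scoring head are illustrative stand-ins; the paper implements the prefixes via prefix tuning inside attention, and this is not the authors' code.

```python
import torch
from torch import nn

class ParScaleSketch(nn.Module):
    """Minimal sketch of ParScale: P learnable prefixes -> P parallel passes
    through one shared backbone -> dynamic weighted fusion of the P outputs."""

    def __init__(self, backbone: nn.Module, d_model: int,
                 n_streams: int = 4, prefix_len: int = 16):
        super().__init__()
        self.backbone = backbone  # one shared (pretrained) model
        self.n_streams = n_streams
        # One learnable "thinking prefix" per stream
        self.prefixes = nn.Parameter(
            torch.randn(n_streams, prefix_len, d_model) * 0.02)
        # Scores each stream's output for dynamic weighting
        self.agg_head = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model), already embedded
        P, b = self.n_streams, x.size(0)
        x_rep = x.unsqueeze(0).expand(P, -1, -1, -1)            # (P, b, s, d)
        pfx = self.prefixes.unsqueeze(1).expand(-1, b, -1, -1)  # (P, b, l, d)
        streams = torch.cat([pfx, x_rep], dim=2).flatten(0, 1)  # (P*b, l+s, d)
        h = self.backbone(streams)[:, -1]     # one batched pass, last hidden
        h = h.reshape(P, b, -1)                                 # (P, b, d)
        # Dynamic fusion: softmax over per-stream scores, then weighted sum
        w = torch.softmax(self.agg_head(h).squeeze(-1), dim=0)  # (P, b)
        return (w.unsqueeze(-1) * h).sum(dim=0)                 # (b, d)

# Toy usage with a tiny Transformer encoder standing in for the LLM
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2)
model = ParScaleSketch(backbone, d_model=64, n_streams=4)
out = model(torch.randn(2, 10, 64))  # -> (2, 64)
```

The key property: all P streams share the same weights and run as one big batch, so cost grows with activations rather than with a P-times-larger model.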
An intuitive analogy: it's like having 10 experts solve the same problem simultaneously, then dynamically combining their answers based on the quality of their reasoning, instead of asking just one super-expert.
Core formula: the parallel scaling law
The key formula is hidden in Proposition 1 of the paper: model loss follows a Chinchilla-style law in which the parallel streams act as a logarithmic multiplier on the parameter count:

$$\mathcal{L} = \left(\frac{A}{N \cdot (k \log P + 1)}\right)^{\alpha} + E$$

(N is the number of parameters, P is the number of parallel streams; A, E, k, and α are fitted constants)
This means:
The effect of parallel computing ≈ logarithmic growth of the number of parameters
Opening 8 parallel streams ≈ roughly tripling the effective parameter count (see the quick calculation after this list)
But the actual increased hardware cost is negligible
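As a quick sanity check of the law, we can back the constant k out of the paper's reported equivalence (1.6B parameters + 8 streams ≈ a 4.4B model, see below) and watch the effective parameter count grow with P. This is a back-of-the-envelope sketch; the value of k here is derived from that one data point, not taken from the paper's fit.

```python
import math

# Parallel scaling law (sketch): L = (A / (N * (k*log P + 1)))**alpha + E
# => P parallel streams act like multiplying parameters N by (k*log P + 1).

def effective_params(n_params: float, p_streams: int, k: float) -> float:
    """Effective parameter count implied by the parallel scaling law."""
    return n_params * (k * math.log(p_streams) + 1)

# Back k out of the reported data point: 4.4e9 = 1.6e9 * (k*log(8) + 1)
k = (4.4e9 / 1.6e9 - 1) / math.log(8)  # ≈ 0.84

for p in (1, 2, 4, 8):
    print(f"P={p}: ~{effective_params(1.6e9, p, k) / 1e9:.2f}B effective params")
# P=1: ~1.60B, P=2: ~2.53B, P=4: ~3.47B, P=8: ~4.40B
```

Each doubling of P adds a fixed increment of effective parameters, which is exactly the "logarithmic growth" described above.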
Experimental results: Inference efficiency increased by 22 times
The paper trained 67 models on 42B tokens of data, and the conclusions are striking:
Performance comparable to parameter scaling: 1.6B parameters + 8 parallel streams ≈ 4.4B parameter model
Inference costs plummeted:
Memory usage: 22× lower than equivalent parameter scaling
Latency: 6× lower
Mathematical reasoning surged by 34%: The improvement was most obvious for complex tasks such as GSM8K
Figure: memory/latency comparison at different batch sizes (blue arrows: traditional parameter scaling; gray: ParScale)
Even more amazing, existing models can be retrofitted! With a small amount of fine-tuning data, an off-the-shelf model can learn to support parallel streams, which is simply "rejuvenation surgery" for old models (a sketch of the retrofit follows).
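A retrofit could plausibly look like the following, continuing the ParScaleSketch above: freeze the pretrained backbone and fine-tune only the newly added parameters (the P prefixes and the aggregation head) on a small amount of data. This mirrors the parameter-efficient recipe described here, not the paper's exact procedure.

```python
import torch

# Freeze the pretrained backbone; only ParScale's new parameters train.
model = ParScaleSketch(backbone, d_model=64, n_streams=4)
for p in model.backbone.parameters():
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable params")  # tiny vs. backbone

optimizer = torch.optim.AdamW(trainable, lr=1e-4)
# ...then run a short fine-tuning loop on a small dataset as usual.
```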
Huge practical value: even mobile phones can run an "LLM"
The most disruptive application scenario for this technology is edge devices:
Mobile phones/cars only need to load a small model and open multiple parallel streams to obtain the performance of a large model
Dynamically adjust the number of parallel streams: open 2 streams for casual chat, 8 streams for solving math problems (a toy policy sketch follows this list)
Crushing cost advantage: the analysis shows an overall cost of only about 1/6 that of traditional methods
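As a toy illustration of that dynamic adjustment (the task labels and stream counts below are made up for illustration, not taken from the paper):

```python
# Hypothetical inference-time policy: spend more parallel streams on
# harder tasks, fewer on casual ones.
def pick_n_streams(task: str) -> int:
    policy = {"chat": 2, "summarize": 2, "code": 4, "math": 8}
    return policy.get(task, 4)  # sensible middle default

print(pick_n_streams("math"))  # -> 8
```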
In the future, our mobile assistant may be both a "life manager" and a "math teacher", but it won't lag at all!
Imagining the future: The model's "computing perpetual motion machine"
ParScale reveals a deeper law: Model capability is not only determined by parameters, but also by the method of computation. This opens a new world:
Dynamic scaling: Adjust the number of parallel streams in real time according to task difficulty
Hybrid architecture: MoE + ParScale combined
Cross-domain applications: image generation, protein structure prediction, and other fields could all borrow the same idea
Figure: relative contributions of parameters and parallel computation to model capability
Perhaps the key to AI evolution in the future is no longer "building bigger models", but "using computing power smarter".
This paper is truly a masterpiece! Epoch-making! Well done, Qwen~
Technical Exchange Group Invitation Letter
Scan the QR code to add the assistant on WeChat
Please note: Name - School/Company - Research Direction
(e.g., Zhang San - Harbin Institute of Technology - Dialogue System)
to apply to join the Natural Language Processing, PyTorch, and other technical exchange groups
About Us
The MLNLP community is a grassroots academic community jointly established by machine learning and natural language processing scholars in China and abroad. It has since grown into a well-known machine learning and natural language processing community at home and abroad, aiming to promote progress among academia, industry, and enthusiasts of machine learning and natural language processing.
The community provides an open exchange platform for practitioners' further study, employment, and research. Everyone is welcome to follow and join us.