NVIDIA has released the Llama-Nemotron series of models, which can switch dynamically between reasoning mode and regular chat mode to suit different tasks.
Key Technologies of Llama-Nemotron Explained
Neural Architecture Search (NAS):
- Block-wise Local Distillation:
Starting from a Llama 3 Instruct model, the Puzzle framework trains each alternative sub-block independently and in parallel. Each alternative is trained to approximate the function of the parent block it replaces while improving computational efficiency: lower latency, lower memory usage, or higher throughput.
For the LN-Ultra model, for example, the starting point is Llama 3.1-405B-Instruct.
The alternatives trade accuracy for efficiency in different ways. Some blocks reduce computation and KV cache memory consumption by removing the attention mechanism, while others achieve different degrees of compression by shrinking the intermediate size of the Feed-Forward Network (FFN).
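The idea of training a cheaper block to mimic its parent can be sketched in a few lines. This is a toy illustration only, not the Puzzle implementation: the parent block, its weights, the random inputs, and the choice of a purely linear replacement are all assumptions made for the example.

```python
import random

random.seed(0)

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Hypothetical parent block: a tiny ReLU feed-forward block, 2 -> 4 -> 2.
W1 = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4], [0.7, 0.6]]
W2 = [[0.3, -0.5, 0.2, 0.4], [0.6, 0.1, -0.4, 0.2]]

def parent_block(x):
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)

# Inputs to the block (random here; real activations in practice) and the
# parent's outputs, which serve as the local distillation targets.
inputs = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(64)]
targets = [parent_block(x) for x in inputs]

def mse(A):
    """Mean squared error between the candidate y = A x and the parent."""
    total = 0.0
    for x, t in zip(inputs, targets):
        y = matvec(A, x)
        total += sum((yi - ti) ** 2 for yi, ti in zip(y, t))
    return total / len(inputs)

def distill_linear_block(steps=500, lr=0.1):
    """Fit a cheaper linear candidate to the parent with gradient descent."""
    A = [[0.0, 0.0], [0.0, 0.0]]
    n = len(inputs)
    for _ in range(steps):
        grad = [[0.0, 0.0], [0.0, 0.0]]
        for x, t in zip(inputs, targets):
            y = matvec(A, x)
            for i in range(2):
                for j in range(2):
                    grad[i][j] += 2.0 * (y[i] - t[i]) * x[j] / n
        for i in range(2):
            for j in range(2):
                A[i][j] -= lr * grad[i][j]
    return A

loss_before = mse([[0.0, 0.0], [0.0, 0.0]])
candidate = distill_linear_block()
loss_after = mse(candidate)
print(f"MSE before: {loss_before:.4f}, after: {loss_after:.4f}")
```

Because the candidate only ever sees the parent block's inputs and outputs, many such candidates can be trained independently and in parallel, which is the property the text describes.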
- Mixed Integer Programming (MIP):
After building the library of alternative blocks, the Puzzle framework uses a Mixed Integer Programming solver to select one block for each layer. The solver optimizes an objective function subject to the given constraints (such as hardware compatibility, maximum allowed latency, total memory budget, or desired inference throughput), and the selected blocks are assembled into the complete model.
For the LN-Super model, for example, the constraints include at least a 5x throughput improvement on a single NVIDIA H100 GPU and support for approximately 300K cached tokens at FP8 precision.
For the LN-Ultra model, the final model achieves at least a 1.5x latency reduction on 8 H100 GPUs and supports up to 3M cached tokens at FP8 precision.
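The selection step can be illustrated with a toy search. In this sketch, brute-force enumeration stands in for a real MIP solver, and the block names, quality scores, and cost figures are invented for the example; only the structure (one block per layer, maximize quality, respect latency and KV cache budgets) follows the description above.

```python
from itertools import product

# Hypothetical block library. For each layer, every candidate is
# (name, quality_score, latency_cost, kv_cache_cost). All numbers are made up.
LIBRARY = [
    [("parent", 1.00, 4, 2), ("ffn_half", 0.98, 3, 2), ("no_attn", 0.90, 2, 0)],
    [("parent", 1.00, 4, 2), ("ffn_half", 0.96, 3, 2), ("no_attn", 0.97, 2, 0)],
    [("parent", 1.00, 4, 2), ("ffn_half", 0.99, 3, 2), ("no_attn", 0.85, 2, 0)],
]

def select_blocks(library, max_latency, max_kv_cache):
    """Pick one block per layer, maximizing total quality under both budgets.

    Exhaustive enumeration is only feasible for toy sizes; a MIP solver
    handles the same formulation at realistic scale.
    """
    best_choice, best_quality = None, float("-inf")
    for choice in product(*library):
        latency = sum(block[2] for block in choice)
        kv_cache = sum(block[3] for block in choice)
        if latency > max_latency or kv_cache > max_kv_cache:
            continue  # violates a constraint
        quality = sum(block[1] for block in choice)
        if quality > best_quality:
            best_choice, best_quality = choice, quality
    return best_choice

chosen = select_blocks(LIBRARY, max_latency=9, max_kv_cache=4)
chosen_names = [block[0] for block in chosen]
print(chosen_names)
```

With these made-up numbers, the search keeps the parent block in the first layer, drops attention in the second layer (where doing so costs little quality), and uses the narrower FFN in the third.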
- FFN Fusion:
For the LN-Ultra model, an additional technique, FFN fusion, is applied. After Puzzle removes some attention layers, runs of consecutive FFN blocks often appear in the model.
For example, if the model contains two consecutive FFN blocks, FFN fusion replaces them with a single wider FFN layer that can be executed in parallel, reducing the number of sequential steps and improving computational utilization.
With FFN fusion, the LN-Ultra model's inference latency improves further, ultimately reaching a 1.71x latency reduction.
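The "wider FFN" can be made concrete with a small sketch. It assumes the two blocks are applied to the same input and their outputs summed, and it uses a plain ReLU FFN with invented weights rather than the gated FFN used in Llama models. Under those assumptions, stacking the input projections and concatenating the output projections gives one layer that computes the same result as the two blocks run side by side. How well this parallel form approximates the original sequential computation is an empirical question the sketch does not address.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def ffn(W_in, W_out, x):
    """Simplified FFN: W_out @ relu(W_in @ x)."""
    hidden = [max(0.0, h) for h in matvec(W_in, x)]
    return matvec(W_out, hidden)

# Two hypothetical FFN blocks with model dimension 2 and hidden sizes 3 and 2.
A_in = [[0.5, -1.0], [0.2, 0.3], [-0.7, 0.4]]
A_out = [[1.0, 0.5, -0.2], [0.3, -0.6, 0.8]]
B_in = [[-0.4, 0.9], [0.6, 0.1]]
B_out = [[0.7, -0.3], [-0.5, 0.2]]

def fuse(a_in, a_out, b_in, b_out):
    """Build one wider FFN (hidden size 3 + 2) from two narrower ones."""
    fused_in = a_in + b_in                                  # stack rows
    fused_out = [ra + rb for ra, rb in zip(a_out, b_out)]   # concatenate columns
    return fused_in, fused_out

x = [1.0, -2.0]
F_in, F_out = fuse(A_in, A_out, B_in, B_out)

# Running the two blocks on the same input and summing their outputs...
parallel = [a + b for a, b in zip(ffn(A_in, A_out, x), ffn(B_in, B_out, x))]
# ...matches a single pass through the fused, wider block.
fused = ffn(F_in, F_out, x)
print(parallel, fused)
```

The fused layer turns two sequential matrix pipelines into one larger one, which is where the reduction in sequential steps comes from.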
Knowledge Distillation and Continued Pre-training:
- Knowledge Distillation:
The LN-Super model is trained with knowledge distillation on the Distillation Mix dataset for 40B tokens.
During distillation, the output of the LN-Super model is compared with the output of the teacher model, and the LN-Super parameters are adjusted to better approximate the teacher's behavior.
The LN-Ultra model is first trained with knowledge distillation on the Distillation Mix dataset for 65B tokens, and then continues pre-training on the Nemotron-H Stage 4 pre-training dataset for 88B tokens.
During the knowledge distillation phase, the LN-Ultra model gradually improves by learning to match the teacher model's output; in the continued pre-training phase, it further broadens its knowledge, ultimately surpassing the reference model Llama 3.1-405B-Instruct on key benchmarks.
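Comparing student and teacher outputs is often done with a divergence between their predicted token distributions. The sketch below uses the KL divergence between softmax outputs, which is one common choice and an assumption here, since the text does not state the exact loss. The logits and the `temperature` parameter are likewise illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over next-token distributions (assumed loss)."""
    p = softmax(teacher_logits, temperature)   # teacher distribution
    q = softmax(student_logits, temperature)   # student distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

teacher = [2.0, 0.5, -1.0]                      # hypothetical teacher logits
loss_far = distillation_loss([0.0, 0.0, 0.0], teacher)    # untrained student
loss_near = distillation_loss([1.8, 0.6, -0.9], teacher)  # close to teacher
loss_same = distillation_loss(teacher, teacher)           # identical outputs
print(loss_far, loss_near, loss_same)
```

The loss shrinks as the student's distribution approaches the teacher's and is zero when they match, so minimizing it pushes the student toward the teacher's behavior.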
- Continued Pre-training:
After knowledge distillation, LN-Ultra continues pre-training on the Nemotron-H Stage 4 pre-training dataset to further improve performance.
In this phase, the model learns from a large amount of unlabeled text, broadening its knowledge and language coverage so that it performs better on downstream tasks, including reasoning.
Supervised Fine-tuning (SFT):
- Data Preparation:
Construct a mixed dataset containing both reasoning and non-reasoning data.
For example, in reasoning data, each prompt includes the