Open the GpuGeek computing power marketplace; the address is as follows:
https://gpugeek.com/login?type=register&source=wechat_DLNLP_01
The most appealing option is the RTX-A5000-24G graphics card, at only ¥0.88 per hour. A friend of mine exclaimed at how cheap that is.
I was curious why the RTX-A5000-24G is so cheap. I had mostly used 4090 and 3090 cards before, and with the same 24GB of graphics memory, isn't this price a steal?
I dug into it and found that the A5000 really does offer very high cost-effectiveness!
The A5000 comes with a powerful GPU core and a large number of streaming multiprocessors, giving it strong parallel computing capability: 27.8 TFLOPS of single-precision floating-point performance and 768 GB/s of memory bandwidth.
This card delivers roughly 80% of an NVIDIA RTX 3090's performance at only about 66% of the 3090's price. For algorithms dominated by matrix operations, such as linear regression and principal component analysis, it noticeably improves computing efficiency and helps algorithm engineers iterate on models faster. Combined with its image and network acceleration features, it covers even more application scenarios. Below is a comparison of the core specifications:
Since it's this cheap, we'll use it to reproduce the minimind project. This open-source project sets out to train MiniMind, a tiny language model of only 25.8M parameters, completely from scratch, in about 2 hours for roughly ¥3!
Project address: https://github.com/jingyaogong/minimind
PS: Because the parameter count is so small, this run is more about walking through the overall training process of a large model than about chasing performance, so don't set expectations too high for now. We will test the model's output later.
With that, our background introduction is complete. Let's roll up our sleeves and get started. We'll train a small-parameter "large model" from scratch!
Create GPU Instance
First, we need to create a GpuGeek account. The official website address is: https://gpugeek.com/login?type=register&source=wechat_DLNLP_01. Note that real-name authentication is required. New users may also receive some benefits, which I won't go into here.
Click on "Console" in the navigation bar, then click "Create Instance".
Next we land on the creation page, where you can choose your configuration. Select the RTX-A5000-24G card directly; once you have confirmed the price, create the instance.
After creation, you can see the available operations for the instance in the container-instance panel on the left.
Multiple login methods are supported, such as SSH and JupyterLab.
Click "More" to perform more operations on the container.
Also, I found a nice perk: expanding the data disk is very cheap. Below is the price for 100GB, less than ¥1 per day, which is crazy!
I expanded it by 100GB so there is plenty of room for the datasets and model checkpoints later. Now let's start training the model: with the instance created, the next step is to download the dataset.
Download MiniMind Training Dataset
The MiniMind project open-sources all of the datasets needed for large-model pre-training, fine-tuning, and reinforcement learning, so there is no need to preprocess large-scale corpora yourself or repeat that data-processing work.
Dataset address:
https://www.modelscope.cn/datasets/gongjy/minimind_dataset/summary
Below is a brief introduction to each dataset:
dpo.jsonl --RLHF phase dataset
lora_identity.jsonl --Self-cognition dataset (e.g., Who are you? I am minimind...), recommended for LoRA training (also usable for full-parameter SFT, don't be limited by the name)
lora_medical.jsonl --Medical Q&A dataset, recommended for LoRA training (also usable for full-parameter SFT, don't be limited by the name)
pretrain_hq.jsonl ✨ --Pre-training dataset, integrated from Jiangshu Technology data
r1_mix_1024.jsonl --DeepSeek-R1-1.5B distillation data, maximum character length per data entry is 1024 (thus set max_seq_len=1024 during training)
sft_1024.jsonl --Integrated from Qwen2.5 distillation data (a subset of sft_2048), maximum character length per data entry is 1024 (thus set max_seq_len=1024 during training)
sft_2048.jsonl --Integrated from Qwen2.5 distillation data, maximum character length per data entry is 2048 (thus set max_seq_len=2048 during training)
sft_512.jsonl --Integrated from Jiangshu Technology SFT data, maximum character length per data entry is 512 (thus set max_seq_len=512 during training)
sft_mini_512.jsonl ✨ --Minimal integration from Jiangshu Technology SFT data + Qwen2.5 distillation data (for quickly training Zero models), maximum character length per data entry is 512 (thus set max_seq_len=512 during training)
tokenizer_train.jsonl --Taken entirely from the Jiangshu large-model dataset. This portion is comparatively less important (training your own tokenizer is not recommended); if you do want to train your own tokenizer, you are free to choose any dataset.
Below is a diagram of which datasets are used at each stage of large-model training. The stages are really built from combinations of these files; it's worth pausing to think about why this particular order and combination were chosen.
We download the dataset with the modelscope command-line tool (installed as part of the ModelScope SDK).
Before downloading, please install ModelScope using the following command:
pip install modelscope
Download the complete dataset:
cd /gz-data # Data disk directory
modelscope download --dataset gongjy/minimind_dataset --local_dir minimind_dataset
GpuGeek's network speed is exceptionally fast. Below is a screenshot of the dataset download speed; with bandwidth like this, there is no more download anxiety.
After downloading, let's check that the file sizes look right.
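A quick way to do this from Python (a small convenience sketch; the path assumes the /gz-data/minimind_dataset directory used in the download command above):

from pathlib import Path

# List every downloaded .jsonl file with its size, so a truncated download stands out.
data_dir = Path("/gz-data/minimind_dataset")
for f in sorted(data_dir.glob("*.jsonl")):
    print(f"{f.name:<25} {f.stat().st_size / 1e9:.2f} GB")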
Next, we will download the minimind project code and then proceed with training. The download command is as follows:
git clone https://ghfast.top/https://github.com/jingyaogong/minimind
The https://ghfast.top/ prefix is a mirror used to speed up GitHub downloads, since cloning directly from GitHub can run into network issues within China.
Environment Setup
cd minimind
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
One pitfall here: remove the pinned version numbers from requirements.txt, otherwise the installation will likely fail!
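If you don't want to edit the file by hand, here is a small sketch that strips the pins in place. It is my own convenience hack, not an official minimind step; run it in the minimind directory before pip install and review requirements.txt afterwards:

import re

# Rewrite requirements.txt with version pins removed, e.g. "torch==2.2.2" -> "torch".
with open("requirements.txt", encoding="utf-8") as f:
    lines = f.readlines()
with open("requirements.txt", "w", encoding="utf-8") as f:
    for line in lines:
        f.write(re.sub(r"\s*[=<>!~]=.*", "", line).rstrip() + "\n")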
Next, pre-check that PyTorch can use CUDA:
import torch
print(torch.cuda.is_available())
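If you also want to confirm which GPU PyTorch sees (on this instance it should report the RTX A5000):

import torch

# Print the name of the first visible CUDA device.
print(torch.cuda.get_device_name(0))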
Move all of the downloaded dataset files into the dataset directory under the project code.
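For example, with the paths used earlier in this walkthrough (both paths are assumptions based on the /gz-data layout above; adjust them if your layout differs):

import shutil
from pathlib import Path

# Move the downloaded .jsonl files into minimind's dataset/ directory.
src = Path("/gz-data/minimind_dataset")
dst = Path("/gz-data/minimind/dataset")
dst.mkdir(parents=True, exist_ok=True)
for f in src.glob("*.jsonl"):
    shutil.move(str(f), str(dst / f.name))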
Large Model Pre-training - Learning Knowledge
The training code lives in the trainer directory. Since this small-parameter model can be trained on a single card, the code is relatively clear and not very complex: essentially one script per training stage. These sources are well suited for beginners who want to understand the full large-model training workflow.
cd trainer
python train_pretrain.py
Running pre-training produces pretrain_*.pth as the pre-trained weights (where * is the model's hidden dimension, 512 by default).
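To sanity-check the result, you can peek at the saved weights. This is a small sketch under two assumptions: the checkpoint is a plain state dict, and it lands at out/pretrain_512.pth in the project root while we run this from the trainer directory; adjust the path if yours differs:

import torch

# Load the pre-training checkpoint on CPU and count its parameters.
state = torch.load("../out/pretrain_512.pth", map_location="cpu")
total = sum(t.numel() for t in state.values())
print(f"{len(state)} tensors, {total / 1e6:.1f}M parameters")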
Large Model Instruction Fine-tuning - Learning Dialogue
python train_full_sft.py
Running supervised fine-tuning produces full_sft_*.pth as the instruction-tuned weights (where "full" means full-parameter fine-tuning).
Large Model Reinforcement Learning - Learning Preferences
python train_dpo.py
Train the Reasoning Model
The dataset source has been introduced above. Data format example:
{"conversations": [{"role":"user","content":"Hello, I'm Xiaofang, nice to meet you."}, {"role":"assistant","content":"
The reply template for the R1-style reasoning model is:
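Concretely, the reply wraps the chain of thought in a pair of thought tags followed by the final answer in a pair of reply tags; the tag names below follow the R1-style convention (these are the thought and reply tags referred to in the next paragraphs), and the placeholder text and exact whitespace are illustrative:

<think>
...reasoning process...
</think>
<answer>
...final reply...
</answer>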
In GRPO, a rule-based reward function is used to constrain the model to follow the thought and reply tags (and the reward for this should be set higher during the earlier cold-start phase).
Another issue: although the distillation process is similar to SFT, experiments show that the model struggles to always follow the template's reply format, i.e., it tends to drift away from the thought and reply tag constraints. A small trick is to add an extra loss penalty on the tokens at the tag positions; see train_distill_reason.py:
# Add an extra penalty at the positions indexed by sp_ids
loss_mask[sp_ids] = 10  # penalty coefficient
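Here is a toy, self-contained illustration of what that trick does (my own sketch with made-up token ids, not the project's exact code):

import torch

# Suppose the tokenizer maps the tags <think>, </think>, <answer>, </answer>
# to ids 1, 2, 3, 4. Up-weight exactly those positions in the loss mask so the
# model is penalized more heavily for getting the tag tokens wrong.
tag_ids = torch.tensor([1, 2, 3, 4])
Y = torch.tensor([1, 17, 23, 2, 3, 42, 4])         # target sequence containing the tags
loss_mask = torch.ones_like(Y, dtype=torch.float)  # 1.0 = normal token weight
sp_ids = torch.isin(Y, tag_ids)                    # boolean mask of tag positions
loss_mask[sp_ids] = 10                             # penalty coefficient
print(loss_mask)                                   # tensor([10., 1., 1., 10., 10., 1., 10.])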
For more details, please refer to:
https://github.com/jingyaogong/minimind
After training, we can obtain four model weights in the 'out' directory.
Test Model
Finally, we can test the model's performance using the eval_model.py script.
And with that, we've succeeded! The trained model can answer questions, but testing a few cases shows that its answers are quite short, which is largely down to the model's tiny parameter count.
Finally, I recommend reading the source code to understand the underlying principles and implementation. Below is an overview of the MiniMind model-definition code.
This code implements a language model named MiniMind, which is based on the Transformer architecture and includes Mixture-of-Experts (MoE) functionality. Below is a brief summary of the main components:
1. Configuration Class (MiniMindConfig)
Defines basic model parameters such as hidden layer size, number of attention heads, number of layers, etc.
Contains configuration parameters for the Mixture-of-Experts (MoE) system, such as the number of experts and the number of experts selected per token.
2. Base Modules
RMSNorm: Implements Root Mean Square normalization (a minimal reference implementation appears after this component list).
RoPE (Rotary Position Embedding): Encodes positional information for the attention computation.
Attention Mechanism: Includes multi-head self-attention implementation, supporting Flash Attention optimization.
3. Feedforward Network
FeedForward: Traditional feedforward network implementation, including SwiGLU activation function.
MOEFeedForward: Mixture-of-Experts feedforward network implementation.
MoEGate: Expert selection gating mechanism.
4. Model Structure
MiniMindBlock: The main building block of the model, containing self-attention and a feedforward network.
MiniMindModel: Stacks multiple Blocks to form the complete Transformer backbone.
MiniMindForCausalLM: The final model used for causal language modeling, including the decoding head.
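To make the base modules above more concrete, here is a minimal, generic RMSNorm in PyTorch. It is a textbook reference sketch, not code copied from the repository:

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # Normalize each feature vector by its root mean square, then apply a learned scale.
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

print(RMSNorm(dim=8)(torch.randn(2, 8)).shape)  # torch.Size([2, 8])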
Key features are as follows:
1. Mixture-of-Experts (MoE)
Allows dynamic selection of expert networks for each token.
Uses a gating mechanism to determine which experts should process each token.
Supports auxiliary loss functions to balance expert usage.
2. Optimized Implementation
Supports Flash Attention to accelerate attention computation.
Efficient KV caching mechanism for inference acceleration.
An optimized expert-dispatch path for MoE inference.
3. Integration with Hugging Face Ecosystem
Inherits from PreTrainedModel and GenerationMixin.
Compatible with Hugging Face model loading and generation interfaces.
The model's execution flow is as follows:
1. Input tokens are embedded as vectors.
2. Processed through multiple Transformer blocks.
3. Each block contains self-attention and a feedforward network (standard or MoE).
4. Finally, the output vocabulary distribution is generated via the language model head.
This implementation provides a lightweight yet full-featured language model framework, particularly enhancing model capacity while maintaining parameter efficiency through the Mixture-of-Experts system.
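To make the gating idea concrete, here is a generic top-k gate sketch in PyTorch. It illustrates the mechanism described above (score the experts per token, keep the top k, renormalize their weights); it is my own illustration, not the repository's MoEGate class:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGate(nn.Module):
    # Score every expert for each token, keep the top-k experts,
    # and renormalize their weights so they sum to 1 per token.
    def __init__(self, hidden_size: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.scorer = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        scores = F.softmax(self.scorer(x), dim=-1)          # (tokens, num_experts)
        topk_w, topk_idx = torch.topk(scores, self.top_k, dim=-1)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)  # renormalize the kept weights
        return topk_w, topk_idx

tokens = torch.randn(4, 512)                                # 4 tokens, hidden size 512
weights, experts = TopKGate(hidden_size=512, num_experts=4)(tokens)
print(experts)                                              # which 2 of the 4 experts each token routes to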
It is recommended that you further read the source code of the model training part.