ByteDance Seed's New Method! Open-Source 8B Code Model: Trains Itself by Curating Its Own Data, Achieves SoTA at Its Scale, and Even Surpasses 10 Billion Parameter Competitors

Have you ever considered that an LLM curating its own training data could be more efficient than human curation? Traditional code LLMs rely on manually crafted rules for data filtering, which is costly, inefficient, and can easily bias the model.

Paper: Seed-Coder: Let the Code Model Curate Data for Itself

Link: https://github.com/ByteDance-Seed/Seed-Coder/blob/master/Seed-Coder.pdf

The Seed-Coder team, however, directly “lets the LLM be the teacher”, using the model to filter data and train itself, creating a series of lightweight, open-source 8B-parameter code models whose performance even surpasses competitors with over ten billion parameters!

Seed-Coder

1. A Self-Sufficient Data Factory

Traditional models rely on manual rules to filter code data, such as “must include comments” or “must not have syntax errors”. But programmers have different styles, rules can conflict, and scalability is poor. Seed-Coder's solution is blunt: let another LLM be the judge! The team trained a code quality scorer with an LLM that rates code on readability, modularity, clarity, and reusability, automatically filtering out low-quality data.

This “LLM teaches LLM” approach improves data filtering efficiency a hundredfold, ultimately building a high-quality code training corpus of 6 trillion tokens covering 89 programming languages!
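To make the idea concrete, here is a minimal sketch of LLM-as-judge filtering in Python. The judge model name, prompt wording, 0–10 scale, and keep-threshold are all illustrative assumptions; the team's actual pipeline trains a dedicated quality scorer rather than calling a general judge on every snippet.

```python
# Minimal LLM-as-judge filtering sketch (illustrative, not Seed-Coder's pipeline).
from openai import OpenAI

client = OpenAI()  # assumes any OpenAI-compatible judge endpoint

JUDGE_PROMPT = (
    "Rate the following code from 0 to 10, considering readability, modularity, "
    "clarity, and reusability. Reply with a single number only.\n\n{snippet}"
)

def score_snippet(snippet: str, judge_model: str = "gpt-4o-mini") -> float:
    """Ask the judge LLM for a single 0-10 quality score."""
    reply = client.chat.completions.create(
        model=judge_model,  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(snippet=snippet)}],
        temperature=0.0,
    )
    content = (reply.choices[0].message.content or "").strip()
    try:
        return float(content)
    except ValueError:
        return 0.0  # unparsable reply: treat as low quality

def filter_corpus(snippets: list[str], threshold: float = 6.0) -> list[str]:
    """Keep only snippets scored at or above the threshold."""
    return [s for s in snippets if score_snippet(s) >= threshold]
```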

2. Small Body, Big Wisdom: Model Architecture

Seed-Coder is based on the Llama 3 architecture with 8.2B parameters:

Long Context Support: Through repository-level code concatenation, the model handles contexts of up to 32K tokens, easily coping with complex projects.

Fill-in-the-Middle (FIM) Training: Randomly splits code into a prefix, middle, and suffix, teaching the model to “complete the missing middle” and improving code completion; a sketch of how such an example is assembled follows after this list. The training string is formatted as follows:

<[fim-suffix]> SUFFIX <[fim-prefix]> PREFIX <[fim-middle]> MIDDLE

This training lets the model learn code logic like solving a puzzle, and is far more effective than traditional single-mode training.
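As a concrete illustration, here is a minimal sketch of how a FIM training example could be assembled from a code snippet. The sentinel spellings follow the format quoted above; the character-level cut points and the helper name `make_fim_example` are assumptions for illustration, not the paper's exact preprocessing.

```python
import random

# Sentinel spellings follow the format quoted above; a real tokenizer registers
# these as special tokens rather than plain strings.
FIM_SUFFIX, FIM_PREFIX, FIM_MIDDLE = "<[fim-suffix]>", "<[fim-prefix]>", "<[fim-middle]>"

def make_fim_example(code: str, rng: random.Random) -> str:
    """Cut the code at two random points and emit a suffix-prefix-middle string."""
    i, j = sorted(rng.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    # The model is trained to generate MIDDLE after seeing SUFFIX and PREFIX.
    return f"{FIM_SUFFIX}{suffix}{FIM_PREFIX}{prefix}{FIM_MIDDLE}{middle}"

print(make_fim_example("def add(a, b):\n    return a + b\n", random.Random(0)))
```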

3. Training Method for Reasoning Ability

Seed-Coder's reasoning model uses Long Chain-of-Thought (LongCoT) reinforcement learning, specialized for multi-step, complex coding problems. Simply put, the model first writes out its problem-solving steps, then generates code, and the logic chain is optimized through repeated trial and error. For example, when solving an algorithm problem, the model first breaks it down: “Step one, read the input; step two, sort; step three, compute the range…”, then writes the code step by step. This “think first, then do” strategy lets it perform astonishingly well on competitive programming problem sets.
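The paper's exact LongCoT reinforcement-learning recipe is not reproduced here, but RL on code generally needs a verifiable reward signal, and a common choice is to execute the generated code against unit tests. Below is a hedged sketch of that idea; the function name, pass/fail scoring, and unsandboxed subprocess execution are assumptions, not Seed-Coder's actual reward.

```python
import subprocess
import tempfile

def execution_reward(solution_code: str, test_code: str, timeout: float = 5.0) -> float:
    """Return 1.0 if the generated solution passes its tests, else 0.0.

    Illustrative only: real RL pipelines sandbox execution and may give
    partial credit per test case; this simply runs the tests in a subprocess.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0  # hanging or too-slow code gets no reward
```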

Actual Performance

Seed-Coder dominates competitors in multiple authoritative tests:

Code Generation: On HumanEval+, the 8B model scores 77.4, surpassing the 70B-parameter CodeLlama!

Code Completion: On cross-file completion tasks, Seed-Coder's Edit Similarity (ES) reaches 85.1%, crushing models of similar scale.

Software Engineering: On SWE-bench, a benchmark of real GitHub issue fixes, Seed-Coder achieves a 19.2% resolution rate, higher than the 32B model QwQ!

Even more astonishingly, it reaches a rating of 1553 on the competitive programming platform Codeforces, close to a human bronze-medal level!

Future Outlook: Are AI Programmers Taking Our Jobs?

Despite Seed-Coder's impressive performance, there are still limitations:

Lack of General Ability: Its focus on code leads to weaker common-sense understanding; for example, it cannot answer “how do you make scrambled eggs with tomatoes?”

Weaker Mathematical Ability: With less mathematical content in its training data, it struggles with complex math problems.

But the team has planned future directions:

Integrate more general-purpose corpora to build an “all-rounder” AI programmer

Explore MoE architecture to further compress model size

It is foreseeable that these lightweight, efficient code models will spread rapidly through development toolchains, and it won't be long before they become 24/7 online “super assistants” for programmers. (Feeling both relieved and threatened, right?)



