Open-R1: a fully open reproduction of DeepSeek-R1

The Training Recipe

  1. DeepSeek-R1-Zero:
    • Skipped supervised fine-tuning entirely.
    • Used Group Relative Policy Optimization (GRPO) to train the model efficiently with reinforcement learning (a minimal sketch of the idea follows this list).
    • Learned to break problems into steps and verify its own outputs.
    • However, its responses lacked clarity and were often hard to read.
  2. DeepSeek-R1:
    • Started with a "cold start" phase: supervised fine-tuning on a small set of carefully curated long chain-of-thought examples to fix the readability problems of R1-Zero.
    • Then applied the same reinforcement learning process, followed by rejection sampling to build a larger SFT dataset and a final RL stage for general alignment.
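
To make the GRPO step above concrete, here is a minimal sketch of its core trick: for each prompt, a group of completions is sampled and each one's advantage is computed relative to the group's own reward statistics, so no separate value network is needed. The function names and the rule-based reward below are illustrative assumptions, not Open-R1's actual implementation.

```python
import torch

def rule_based_reward(completion: str, answer: str) -> float:
    # Illustrative rule-based reward: 1.0 if the reference answer
    # appears in the completion, else 0.0.
    return 1.0 if answer in completion else 0.0

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: (num_prompts, group_size). Normalize each reward
    # against the mean/std of its own group of completions.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

def grpo_loss(logp_new, logp_old, logp_ref, advantages,
              eps: float = 0.2, kl_coef: float = 0.04) -> torch.Tensor:
    # Sequence-level sketch of the clipped surrogate objective plus a
    # KL penalty against a frozen reference model (the k3 estimator).
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    diff = logp_ref - logp_new
    kl = (torch.exp(diff) - diff - 1.0).mean()
    return policy_loss + kl_coef * kl
```

With a group size of, say, 16 completions per prompt, `rewards` has shape (batch, 16); a completion that beats its group's mean reward gets a positive advantage and is reinforced, which is what lets GRPO drop the value model that PPO would otherwise require.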

Missing Pieces

DeepSeek released the R1 model weights and a technical report, but not the datasets used to train the models or the training code, which makes the results hard to reproduce and build on.

Introducing Open-R1: Filling the Gaps

  1. Replicating R1-Distill Models:
    • Distill a high-quality reasoning dataset from DeepSeek-R1 and use it to fine-tune smaller student models (sketched below).
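
Distillation here means sampling long reasoning traces from the teacher and fine-tuning a smaller model on them with plain supervised fine-tuning. The sketch below is an assumption-laden illustration: the model ID, prompts, and sampling settings are placeholders, and in practice the 671B teacher would be served via an inference endpoint rather than loaded locally.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder teacher; swap in an inference endpoint for the real thing.
teacher_id = "deepseek-ai/DeepSeek-R1"
tok = AutoTokenizer.from_pretrained(teacher_id)
teacher = AutoModelForCausalLM.from_pretrained(teacher_id, device_map="auto")

prompts = ["Prove that the sum of two even integers is even."]
traces = []
for p in prompts:
    input_ids = tok.apply_chat_template(
        [{"role": "user", "content": p}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(teacher.device)
    out = teacher.generate(input_ids, max_new_tokens=2048,
                           do_sample=True, temperature=0.6)
    # Keep only the generated continuation as the training completion.
    completion = tok.decode(out[0, input_ids.shape[-1]:],
                            skip_special_tokens=True)
    traces.append({"prompt": p, "completion": completion})
```

The resulting prompt/completion pairs can then be fed to a standard SFT trainer (for example, TRL's SFTTrainer) to fine-tune a small student model on the teacher's reasoning traces.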
