Open-R1: a fully open reproduction of DeepSeek-R1

The Training Recipe

  1. DeepSeek-R1-Zero:
    • Skipped supervised fine-tuning entirely.
    • Used Group Relative Policy Optimization (GRPO) to train the model efficiently with reinforcement learning (a minimal sketch of the idea follows this list).
    • Learned to break problems into steps and verify its own outputs.
    • However, its responses lacked clarity and were often hard to read.
  2. DeepSeek-R1:
    • Started with a "cold start" phase: supervised fine-tuning on a small set of carefully curated long chain-of-thought examples to fix the readability problems of R1-Zero.
    • Then applied the same reinforcement learning process, followed by rejection sampling to build a larger SFT dataset and a final RL stage for general alignment.
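
To make the GRPO step above concrete, here is a minimal sketch of its core trick: for each prompt, a group of completions is sampled and each one's advantage is computed relative to the group's own reward statistics, so no separate value network is needed. The function names and the rule-based reward below are illustrative assumptions, not Open-R1's actual implementation.

```python
import torch

def rule_based_reward(completion: str, answer: str) -> float:
    # Illustrative rule-based reward: 1.0 if the reference answer
    # appears in the completion, else 0.0.
    return 1.0 if answer in completion else 0.0

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: (num_prompts, group_size). Normalize each reward
    # against the mean/std of its own group of completions.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

def grpo_loss(logp_new, logp_old, logp_ref, advantages,
              eps: float = 0.2, kl_coef: float = 0.04) -> torch.Tensor:
    # Sequence-level sketch of the clipped surrogate objective plus a
    # KL penalty against a frozen reference model (the k3 estimator).
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    diff = logp_ref - logp_new
    kl = (torch.exp(diff) - diff - 1.0).mean()
    return policy_loss + kl_coef * kl
```

With a group size of, say, 16 completions per prompt, `rewards` has shape (batch, 16); a completion that beats its group's mean reward gets a positive advantage and is reinforced, which is what lets GRPO drop the value model that PPO would otherwise require.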

Missing Pieces

DeepSeek released the R1 model weights and a technical report, but not the datasets used to train the models or the training code, which makes the results hard to reproduce and build on.

Introducing Open-R1: Filling the Gaps

  1. Replicating R1-Distill Models:
    • Distill a high-quality reasoning dataset from DeepSeek-R1 and use it to fine-tune smaller student models (sketched below).
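
Distillation here means sampling long reasoning traces from the teacher and fine-tuning a smaller model on them with plain supervised fine-tuning. The sketch below is an assumption-laden illustration: the model ID, prompts, and sampling settings are placeholders, and in practice the 671B teacher would be served via an inference endpoint rather than loaded locally.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder teacher; swap in an inference endpoint for the real thing.
teacher_id = "deepseek-ai/DeepSeek-R1"
tok = AutoTokenizer.from_pretrained(teacher_id)
teacher = AutoModelForCausalLM.from_pretrained(teacher_id, device_map="auto")

prompts = ["Prove that the sum of two even integers is even."]
traces = []
for p in prompts:
    input_ids = tok.apply_chat_template(
        [{"role": "user", "content": p}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(teacher.device)
    out = teacher.generate(input_ids, max_new_tokens=2048,
                           do_sample=True, temperature=0.6)
    # Keep only the generated continuation as the training completion.
    completion = tok.decode(out[0, input_ids.shape[-1]:],
                            skip_special_tokens=True)
    traces.append({"prompt": p, "completion": completion})
```

The resulting prompt/completion pairs can then be fed to a standard SFT trainer (for example, TRL's SFTTrainer) to fine-tune a small student model on the teacher's reasoning traces.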
