DeepSeek-V3.2: A Technical Deep Dive

What happens when a well-funded Chinese AI lab decides to close the open-source gap

Jan 08, 2026

I spent the last week diving into DeepSeek’s latest paper, and I’ll be honest, I wasn’t expecting much. We’ve seen this pattern before: open-source model releases with big claims, only to find they fall short in practice. But DeepSeek-V3.2 is different, and the technical approach is worth understanding, hype aside.

Quick summary: DeepSeek-V3.2 is an open-source model achieving near-parity with closed models on olympiad-level mathematics and competitive programming, while costing significantly less to run than GPT-5 or Claude-4.5.

A note on variants: DeepSeek-V3.2 comes in two main flavours:

DeepSeek-V3.2-Thinking (standard): Balanced performance with length constraints for cost efficiency
DeepSeek-V3.2-Speciale: Extended thinking mode for maximum accuracy, earning gold medals at IMO, IOI, and ICPC competitions

Most benchmarks refer to the standard Thinking variant unless explicitly noted.

Let me walk you through what actually matters here.

Quick Reference

Performance: 93.1% AIME 2025, 2386 Codeforces rating
Cost Reduction: 83% lower on long contexts via DSA
Key Innovation: 10x reinforcement learning compute investment
Trade-off: 7-23% more tokens but significantly lower cost per token
Availability: Open-source weights, inference code, architectural details

The Context: Where Open Models Actually Stand
Three Core Technical Innovations
Sparse Attention: The Efficiency Win
Reinforcement Learning at Scale
Teaching Tool Use Through Synthetic Tasks
Benchmark Results (With Context)
The Speciale Variant: Removing Constraints
Practical Implications
Limitations and Caveats

The Context: Where Open Models Actually Stand

Early 2025: clear 15-20 percentage point performance gap between open models (Qwen, GLM, MiniMax) and closed ones (GPT-5, Claude-4.5, Gemini-3.0) on hard reasoning tasks, with even wider gaps on complex multi-step problems requiring tool use.

DeepSeek (a Chinese AI company backed by quantitative trading firm High-Flyer) identified three specific bottlenecks:

Problem 1: Attention Complexity Kills Long Context

Traditional transformer attention is O(n²). For 128K token documents, decode costs hit $2.40 per million tokens, impractical at scale.

Note: Cost figures are measured inference costs from paper benchmarking on H800 GPU clusters (rental price: $2/GPU-hour), not speculative API pricing. See paper Figure 3, page 6.

Why decode is expensive: During generation, each new token must attend to all previous tokens. With 128K context, generating 1 token requires 128,000 attention operations. At scale, this quadratic cost dominates inference expenses.

Problem 2: Insufficient Post-Training Compute

Most open models spend ~1% of total compute on reinforcement learning. Closed models? Evidence suggests 10-20%. That’s where the reasoning gap comes from.

Problem 3: Poor Tool Use

Open models struggled with multi-step tool use: debugging code in real environments, web search and synthesis, chaining multiple tools to solve problems. This “agentic” capability was the widest gap of all.

DeepSeek’s approach: tackle all three directly.

Three Core Technical Innovations

1. DeepSeek Sparse Attention (DSA)

Use a fast “indexer” to identify the top k=2048 most relevant tokens, then compute full attention only for those. The indexer uses few heads in FP8 precision, making it far cheaper than full attention. Though the indexer is still O(n²), total cost is dominated by sparse attention at O(n·k).

Result: 3-6x speedup on long documents, costs drop from 2.40 to 0.40 per million tokens (measured at 128K context on H800 hardware), an 83% reduction based on actual measured inference costs from the paper’s deployment on H800 GPU clusters.

Calculation Methodology Box
Throughout this document, percentage changes are calculated as:
Reduction: (Old - New) / Old × 100%
Increase: (New - Old) / Old × 100%
Multiplier: New / Old
All cost figures are from paper benchmarking on H800 clusters, not API pricing.

2. Massive RL Investment

DeepSeek allocated >10% of pre-training compute to reinforcement learning, 10x more than typical open models. They used GRPO (Group Relative Policy Optimization) with five critical stability improvements:

Unbiased KL estimation for fair policy comparison
Off-policy sequence masking to handle stale data
Routing consistency for MoE models
Sampling mask preservation
Specialist model distillation

3. Synthetic Agent Training at Scale

Built a pipeline generating training environments: 1,827 synthetic environments (e.g., travel booking systems with 50 cities, 200 hotels, complex constraints) + 24,667 real GitHub repositories = 85,267 total tasks for RL training.

Key insight: tasks are hard to solve but easy to verify (code passes tests or doesn’t; itinerary meets constraints or doesn’t).

Let’s examine each of these innovations in detail.

Sparse Attention: The Technical Details

Let me explain how DSA actually works, this is the most interesting architectural contribution.

The Two-Stage Process

Stage 1: The Indexer

For each query token at position t, compute relevance scores for every possible source token s:

$I_{t,s} = \sum_{j=1}^{H^I} w_{t,j} \cdot \text{ReLU}(q_{t,j} \cdot k_s)$

Variable definitions:

I[t,s] = relevance score between query token at position t and key token at position s
H^I = number of indexer heads (typically 4-8, much smaller than the 128 full attention heads)
w[t,j] = learned weight for indexer head j at position t (controls how much each head contributes)
q[t,j] = query vector for position t from indexer head j (reduced dimension: d_I = d/16)
k[s] = key vector for position s (same reduced dimension)
q[t,j] · k[s] = dot product (scalar similarity score between query and key vectors)
ReLU(x) = max(0, x), filters out negative similarities (only positive relevance counts)

How it works: For each indexer head, compute dot product between query and key vectors (measuring semantic similarity), apply ReLU to keep only positive similarities, multiply by learned weights, then sum across all heads. The dot product naturally measures relevance—higher values indicate semantically related tokens. Multiple heads (H^I) identify different relevance patterns while learned weights allow more reliable heads to dominate.

Efficiency gains: Uses only 4-8 indexer heads (vs 128 full attention heads), runs in FP8 instead of BF16 (2x memory/compute savings), and operates on reduced dimensions (d_I = d/16, giving 16x smaller vectors). The indexer selects the top k=2,048 highest-scoring tokens for full attention.

Stage 2: Sparse Attention

Full attention computed only between each query and its selected 2,048 key-value pairs. Everything else ignored.

Training the Indexer

Two-stage approach:

Warm-up (1,000 steps, 2.1B tokens): Keep full dense attention running. Train indexer to predict which tokens dense attention focuses on most, knowledge distillation where indexer learns to mimic full attention’s selection patterns.
Sparse training (15,000 steps, 943B tokens): Remove dense attention entirely. Only sparse pattern used; both indexer and main model adapt to work together.

Why This Works

The indexer overhead is real, still O(n²). But few indexer heads in FP8 precision (versus many heads in BF16 for full attention) plus reduced dimensions makes it far cheaper. The indexer eliminates 90%+ of expensive attention computations by selecting only top k=2048 tokens.

Trade-off: On short contexts (<32K tokens), DSA is actually slower due to indexing overhead with no quality benefit. It only wins on long contexts where quadratic costs dominate.

Having understood the architecture innovation, let’s examine the training approach.

Reinforcement Learning at Scale

The RL approach is conceptually straightforward, GRPO with group-relative rewards, but the devil is in the engineering details.

How GRPO Works (Simplified)

Generate 4-8 different solutions to the same problem
Grade them (run code tests, check math answers)
Compute relative rewards: best solution gets positive reward, worst gets negative, middle solutions get scaled rewards
Update policy to increase probability of better solutions, decrease probability of worse ones

The “group relative” part: you’re not comparing to an absolute standard, but to other attempts at the same problem. This reduces variance and stabilizes training.

The Five Engineering Improvements

These five fixes made RL actually scale:

1. Unbiased KL estimate: Proper importance sampling to fairly compare old and new policies. Without this, rare behaviors get over-optimised and training diverges.

2. Off-policy masking: When data becomes “stale” (policy changed significantly since data generation), ignore negative examples but keep positive ones. Prevents learning from outdated mistakes.

3. Keep routing: For MoE models, save which experts were used during generation and force identical routing during training. Critical for stability.

4. Keep sampling mask: Save which tokens were available during sampling (top-p filtering) and use the same mask during training. Prevents distribution shift.

5. Specialist distillation: Train specialized models (math expert, code expert, agent expert), use them to generate high-quality training data, then train the general model on this data. Gets you 95-98% of specialist performance faster.

Compute Investment & Scaling

They spent >10% of pre-training compute on RL, and performance was still improving when they stopped, suggesting even more compute would help further. Most open models spend ~1% on RL. DeepSeek 10x’d that budget and closed the reasoning gap.

Teaching Tool Use Through Synthetic Tasks

The agentic capability gap was the hardest to close, and DeepSeek’s approach here is genuinely creative.

The Context Management Innovation

Previous models (including DeepSeek-R1) discarded reasoning traces after every tool call to save memory, meaning re-reasoning from scratch each time, wasteful. DeepSeek-V3.2’s smarter approach: keep reasoning traces across multiple tool calls within a single user request, only discard when a new user message arrives, always preserve tool call history (inputs and outputs). This significantly reduces redundant reasoning tokens during multi-step tasks.

Synthetic Task Generation Pipeline

They built a synthetic travel planning environment (50 cities, 200 hotels, complex constraints) with 15 Python functions and automated verification. Tasks are hard to solve but easy to verify. See Appendix B for the full pipeline.

Does Synthetic Training Transfer?

The key question: do synthetic tasks help on real-world problems? They tested directly: trained only on synthetic general-agent tasks, then evaluated on real benchmarks the model had never seen:

Synthetic training transfers surprisingly well, the structured nature teaches general planning and tool-use skills that generalize to real tasks.

Code Agent: Real GitHub Issues

For code tasks, they mined Issue→PR pairs from GitHub: find issues with linked PRs that fix them, filter for quality (passing tests in final state), build reproducible environments where tests fail on buggy code → apply PR fix → tests pass.

Success criteria: F2P (Fail-to-Pass) count > 0 and P2F (Pass-to-Fail) count = 0 (no regressions).

They built tens of thousands of these environments across Python, JavaScript, TypeScript, Java, C, C++, Go, and PHP.

Result: 73.1% success on SWE-Verified, 70.2% on SWE-Multilingual. Beats all open models and some closed ones too.

Benchmark Results (With Context)

Let me present the numbers with proper context and caveats.

Mathematics: AIME 2025

Olympiad-level difficulty problems that <1% of high school students can solve.

Human math PhD students average 40-60% on these. All models shown are superhuman.

Note: The “Speciale” variant scores 96.0% by using more tokens per problem.

Competitive Programming: Codeforces

Average professional programmer: ~1200-1400. DeepSeek-V3.2-Thinking is legitimately elite tier, but there’s a ~300 rating point gap to Gemini.

Real-World Coding & Agent Tasks

*Kimi-K2’s score might include context management strategies; unclear from published results.

First time an open model topped a real-world coding benchmark against all closed competitors (SWE-Multilingual).

Token Efficiency: A Critical Trade-off

Token efficiency is DeepSeek’s most significant limitation, impacting both performance comparisons and cost calculations. Let me consolidate all the efficiency data here:

The Numbers Across Tasks:

Why This Happens:

Reasoning Architecture: DeepSeek uses explicit chain-of-thought even when not requested, generating intermediate steps that improve accuracy but increase token count
Sparse Attention Trade-off: DSA requires more context tokens to maintain coherence across sparse connections
RL Training Bias: Optimised for correctness over brevity during GRPO training, rewarded for getting answers right, not for being concise

Economic Impact Analysis:

$\text{Effective Cost} = \text{Tokens Used} \times \text{Cost Per Token}$

Example: Processing 1,000 complex reasoning tasks
- GPT-5: 13M tokens × $15/M = $195
- DeepSeek: 16M tokens × $0.40/M = $6.40
- Net savings: 97% despite 23% more tokens

When Token Efficiency Matters:

Latency-sensitive applications: 23% more tokens = 23% longer generation time
Context window limits: May hit limits sooner on complex multi-turn conversations
Streaming applications: Slower perceived response due to more tokens
Batch processing with strict SLAs: Need to account for longer processing times

When It Doesn’t:

Bulk analysis: Cost savings outweigh token overhead
Offline processing: Time isn’t critical
Self-hosted deployments: You control hardware utilization

This efficiency gap represents the fundamental trade-off in DeepSeek’s design: they chose accuracy and cost-per-token optimization over token minimization.

With these benchmarks establishing performance levels, let’s examine the extended Speciale variant.

The Speciale Variant: Removing Constraints

DeepSeek-V3.2-Speciale optimizes purely for performance, ignoring token efficiency. Think “extended thinking” that trades cost for accuracy.

Training Changes

Removed length penalty during RL training, trained only on reasoning tasks (no general chat), incorporated advanced mathematical techniques from DeepSeekMath-V2, let the model think as long as needed.

Results

Mathematics & Coding:

Speciale beats Gemini on all math benchmarks. That 99.2% on HMMT is genuinely impressive, a college-level competition where 70% would place you among top teams.

Competition Results: Gold Medals

IOI 2025 (International Olympiad in Informatics):

Score: 492/600 (82%)
Rank: 10th place globally
Medal: Gold

ICPC World Finals 2025:

Problems solved: 10/12
Rank: 2nd place globally
Medal: Gold

IMO 2025 (International Math Olympiad):

Score: 35/42 (83%)
Medal: Gold

These scores place in the top tier of world competitions.

The Token Cost

See the Token Efficiency section above for detailed analysis. Speciale represents the extreme end of the efficiency trade-off: 3.5x more tokens than Gemini, but potentially still cost-effective due to lower per-token pricing and self-hosting options.

These benchmark results translate to concrete practical implications across different use cases.

Practical Implications

For Developers

What you can actually do now:

Deploy locally with full control: No API rate limits, no data leaving your servers, can fine-tune on proprietary code. (See FAQ for hardware requirements)
Build coding assistants that work: 73% success rate on real GitHub issues, 70% across 8 programming languages, ~0.40 𝑝𝑒𝑟 𝑚𝑖𝑙𝑙𝑖𝑜𝑛 𝑡𝑜𝑘𝑒𝑛s vs 15-20 for GPT-5.
Process long documents efficiently: Linear scaling up to 128K context, 0.40 vs 2.40 per million tokens (decode), 3-6x faster on long contexts.

Real use cases that make economic sense:

Automated code review on PRs
Large-scale document analysis
Internal coding assistants with proprietary codebase access
Research applications needing model introspection

For Researchers

Open Source Definition: DeepSeek released full model weights (671B parameters, MoE architecture), inference code with optimized kernels, architectural details, and evaluation code. However, they didn’t release pre-training data, exact RL composition, synthetic task generation code, or training hyperparameters. It’s “open” as weights-available and inspectable, but not fully reproducible without significant resources.

Research opportunities:

Can we improve sparse attention patterns? (dynamic k, learned sparsity)
What are the actual RL scaling laws? (DeepSeek used 10%, what about 20%?)
How to generate better synthetic training tasks?
Token efficiency improvements while maintaining performance?

For Businesses

See Appendix C for detailed cost comparison tables.

When DeepSeek makes sense:

High-volume applications (>10M tokens/day)
Privacy-sensitive use cases
Need for fine-tuning or customization
Long document processing
Math/code-heavy applications

When to stick with closed models:

Need absolute best performance (Gemini-3.0-Pro still leads on many benchmarks)
Token efficiency critical (see Token Efficiency section for detailed trade-offs)
Multi-modal required (DeepSeek is text-only)

Limitations and Caveats

Let me be direct about where DeepSeek falls short:

1. Token Efficiency

Covered comprehensively in the benchmarks section—this remains DeepSeek’s primary limitation.

2. Knowledge Breadth: The Data Gap

The knowledge limitations stem from fundamental differences in pre-training scale and data curation. Let me break down where and why these gaps appear:

Corpus Size Comparison:

DeepSeek-V3: ~15T tokens (estimated from compute budget)
GPT-5: ~25-30T tokens (industry estimates)
Gemini-3.0: ~20-25T tokens (based on public hints)

Where Gaps Are Most Noticeable:

Technical Root Causes:

Data Sourcing Constraints:
- Limited access to proprietary databases (LexisNexis, academic publishers)
- Smaller web crawl infrastructure compared to Google/Microsoft
- Less diverse data partnerships
Curation Trade-offs:
- Prioritized high-quality technical content (arxiv, GitHub, StackOverflow)
- De-emphasized social media, forums, entertainment sites
- Stricter deduplication (removed more redundant content)
Language Distribution:

DeepSeek-V3 Training Data (estimated):
- English: 85%
- Chinese: 10%
- Other languages: 5%

GPT-5/Gemini (estimated):
- English: 60-65%
- Multilingual: 35-40%

Measured Impact on Applications:

Mitigation Strategies:

Retrieval Augmented Generation (RAG): Essential for production deployments
Domain-specific fine-tuning: Can close gaps in specialized areas
Hybrid approaches: Use DeepSeek for reasoning, other models for knowledge

3. No Multi-Modal Support: Architectural Constraints

DeepSeek-V3.2 is fundamentally a text-only model, and this limitation has deeper implications than it might initially appear:

What’s Missing:

Technical Reasons for Text-Only:

Architecture Design:
- MoE architecture optimized specifically for text tokens
- DSA mechanism designed for sequential text attention patterns
- No vision encoder or cross-modal attention layers
Training Compute Allocation:
- Chose to maximize text performance with available compute
- Multi-modal training would require 2-3x more compute for similar text quality
- Prioritized reasoning depth over modality breadth
Data Pipeline Complexity:
- Text-only pipeline is significantly simpler to scale
- Multi-modal alignment requires expensive annotation
- Quality control easier with single modality

Real-World Impact Examples:

Workaround Strategies:

Pipeline Approach:

Image → OCR/Caption Model → DeepSeek → Response
Adds 200-500ms latency, potential information loss

Hybrid Systems:
- Use specialized vision models for extraction
- DeepSeek for reasoning on extracted information
- Increases complexity but maintains DeepSeek’s reasoning strength
Future Considerations:
- DeepSeek team has hinted at multi-modal V4
- Likely 6-12 months away based on development patterns
- Will require significant architecture changes

4. Agentic Task Gap: The Long-Horizon Challenge

While DeepSeek made significant progress on agent tasks, gaps remain on complex, long-horizon problems requiring 10+ tool interactions:

Performance Breakdown by Task Complexity:

Specific Benchmark Analysis:

TerminalBench (46.4% vs Gemini’s 54.2%):

Where it fails: Complex multi-package installations with dependency conflicts
Root cause: Loses context after 6-7 error recovery attempts
Pattern: Starts repeating failed approaches after step 8-9

BrowseComp (67.6% with context management):

Actually leads here (Kimi: 60.2%), showing context management works
Success pattern: Web tasks typically require fewer steps (3-5 average)
Key innovation: Retaining reasoning traces across tool calls

MCP-Universe (45.9%):

Challenge: Requires coordinating multiple Model Context Protocol servers
Failure mode: Confusion when switching between different tool contexts
Gap to leaders: Gemini achieves ~65% through better context switching

Technical Root Causes:

Working Memory Degradation:

Step 1-5: Clear goal tracking, coherent planning
Step 6-10: Begins losing track of attempted solutions
Step 11+: Often repeats failed attempts, forgets constraints

Error Recovery Limitations:
- First error: 80% recovery rate
- Second error: 60% recovery rate
- Third+ errors: 30% recovery rate
- Lacks robust backtracking mechanisms
Planning Horizon:
- Effective planning depth: ~7 steps
- Beyond this, falls back to reactive mode
- Gemini/GPT-5 maintain planning to 12-15 steps

Why This Matters for Production:

Improvement Strategies in V3.2:

Context Management (implemented):
- Keeps reasoning traces within single user request
- Improved performance by 15-20% on medium-complexity tasks
Specialist Distillation (partial):
- Agent-specific training helped but didn’t close the gap
- Suggests architectural limitations, not just training
What’s Still Needed:
- Explicit memory mechanisms for long contexts
- Better error attribution and backtracking
- Hierarchical planning capabilities

5. Evaluation Methodology Questions

The paper doesn’t fully specify consistent prompting across models, number of attempts allowed (IOI/ICPC results used up to 50 submissions per problem), exact system prompts used, or how context management strategies affect results. A model generating 50 attempts and taking the best differs significantly from single-shot performance.

6. Reproducibility Challenges: The Hidden Costs

While DeepSeek released model weights and inference code, true reproducibility faces significant barriers:

What’s Actually Required for Full Reproduction:

The Reproducibility Spectrum:

Level 0: Inference Only (Available)
- Use pre-trained weights
- Cost: $5-10K for hardware
- Timeline: 1 day setup
Level 1: Fine-tuning (Feasible)
- Adapt to specific domain
- Cost: $10-50K
- Timeline: 1-2 weeks
Level 2: Continued Pre-training (Challenging)
- Add new knowledge
- Cost: $100K-1M
- Timeline: 1-3 months
Level 3: RL from Checkpoint (Very Difficult)
- Replicate post-training
- Cost: $500K-2M
- Timeline: 3-6 months
- Missing: Reward models, synthetic data
Level 4: Full Reproduction (Practically Impossible)
- Train from scratch
- Cost: $10-15M
- Timeline: 6-12 months
- Missing: Pre-training data, exact hyperparameters

What’s Missing for True Reproduction:

Practical Implications:

For Researchers:

Can study architecture and do ablations
Cannot verify claimed training efficiency
Limited ability to improve training methods

For Companies:

Can deploy and customize the model
Cannot train competitive alternative from scratch
Dependent on DeepSeek for major updates

For Open Source Community:

Can build applications and tools
Cannot truly “own” the technology stack
Vulnerable to upstream changes

Comparison to Other “Open” Models:

The Reality: DeepSeek-V3.2 is "open" in terms of deployment and customization, but not truly reproducible. This is similar to getting a compiled binary with some source code, useful, but not complete.

7. Safety and Alignment: The Open Model Trade-off

Open-source power comes with responsibility. DeepSeek implemented safety measures, but the open nature fundamentally changes the risk profile:

Safety Implementation Details:

Known Safety Gaps:

Fine-tuning Vulnerabilities:

With just 1000 examples, you can:
- Remove refusal behaviors completely
- Amplify specific biases
- Optimize for harmful outputs
Time required: 2-4 hours on 4x A100s

Prompt Injection Susceptibility:
- Success rate of known jailbreaks: ~35% (vs 5-10% for GPT-5)
- New jailbreak discovery rate: Higher due to white-box access
- Community actively finding novel bypasses
Bias Profile (Community Testing):

Deployment Risk Matrix:

Mitigation Strategies:

Technical Safeguards:
- Deploy behind API with rate limiting
- Implement output filtering layer
- Use ensemble voting with safety models
- Sandbox execution environments
Operational Controls:
- Audit logs for all interactions
- Anomaly detection on usage patterns
- Regular bias testing
- Incident response procedures
Legal Considerations:
- Clear terms of service
- Liability disclaimers
- Compliance with regional AI regulations
- Consider IP implications of outputs

Comparison to Closed Models:

Bottom Line: DeepSeek-V3.2 is powerful enough to cause harm if misused. The open nature means YOU become the safety team. Budget for safety infrastructure if deploying in production.

My Take

DeepSeek-V3.2 is legitimately impressive work. The sparse attention mechanism is clever, the RL scaling is straightforward but effective, and the synthetic task generation is genuinely novel.

What this proves:

The open-closed gap is narrowing faster than expected
Compute scaling works (spending 10x more on RL closed the reasoning gap)
Synthetic training transfers to real tasks
You don’t need infinite resources, just substantial ones and smart engineering

What this doesn’t prove:

Open models have caught up completely (gaps remain in efficiency and complex tasks)
The approach is reproducible without major resources
Current performance levels are the ceiling (we don’t know the scaling limits yet)

If you’re building applications that need strong reasoning, especially math or code, and you process high volumes or need data privacy, DeepSeek is worth serious consideration.

If you need the absolute best performance, lowest token usage, or multi-modal capabilities, you’ll still want Gemini-3.0-Pro or GPT-5.

The future might be open, but it’s not uniformly open yet. We’re getting there, though.

Resources

Official:

For comparison:

Chatbot Arena (live rankings): lmsys.org
Artificial Analysis (cost/performance): artificialanalysis.ai

Related reading:

DeepSeek-V3 base model: arXiv:2412.19437
DeepSeek-R1 reasoning focus: deepseek.com
DeepSeekMath-V2: arXiv reference in main paper

Technical Appendix

A.1: Full GRPO Objective

The complete training objective with all five improvements. Let me break this down piece by piece before showing the full formula.

Component 1: Policy Gradient with Clipping

$\min\left(r_{i,t}(\theta) \cdot \hat{A}_{i,t}, \text{clip}(r_{i,t}(\theta), 1-\epsilon, 1+\epsilon) \cdot \hat{A}_{i,t}\right) \cdot M_{i,t} $

This is the core RL signal. Let’s understand each part:

r[i,t](θ) = importance sampling ratio = π_θ(o[i,t]|...) / π_old(o[i,t]|...)
- Measures how much more/less likely the new policy makes this action vs old policy
- Example: if r=2.0, new policy is 2x more likely to take this action
Â[i,t] = advantage = reward[i] - average_reward_in_group
- Positive advantage → this was a good action, increase its probability
- Negative advantage → this was a bad action, decrease its probability
clip(r[i,t](θ), 1-ε, 1+ε) = constrain ratio to [0.8, 1.2] (when ε=0.2)
- Prevents policy from changing too drastically in one update
min(...) = take the more conservative estimate between clipped and unclipped
- If advantage is positive and ratio > 1.2, clip it (don’t get too optimistic)
- If advantage is negative and ratio < 0.8, clip it (don’t get too pessimistic)
M[i,t] = off-policy mask (0 or 1)
- Set to 0 for stale negative examples (ignore outdated mistakes)
- Set to 1 otherwise (keep fresh data and positive examples)

Component 2: KL Penalty

$\beta \cdot \frac{\pi_\theta}{\pi_{\text{old}}} \cdot \left( \frac{\pi_{\text{ref}}}{\pi_\theta} - \log\left(\frac{\pi_{\text{ref}}}{\pi_\theta}\right) - 1 \right)$

This prevents the policy from drifting too far from a reference policy. Let’s simplify:

Let x = π_ref / π_θ (how much more likely the reference policy is)
The term (x - log(x) - 1) is a KL divergence approximation
When x = 1 (policies match): 1 - log(1) - 1 = 1 - 0 - 1 = 0 (no penalty)
When x ≠ 1 (policies diverge): penalty increases
β controls strength (0.0 for math = no penalty, 0.1 for general chat = moderate penalty)
Multiplied by π_θ / π_old for unbiased estimation (importance sampling correction)

The Complete Objective:

$J_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q), \{o_i\} \sim \pi_{\text{old}}(\cdot|q)} \left[ \text{Policy Gradient} - \text{KL Penalty} \right] $

$J_{\text{GRPO}}(\theta) = \mathbb{E} \left[ \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\left(r_{i,t}(\theta) \cdot \hat{A}_{i,t}, \text{clip}(r_{i,t}(\theta), 1-\epsilon, 1+\epsilon) \cdot \hat{A}_{i,t}\right) \cdot M_{i,t} - \beta \cdot \text{KL Term} \right]$

Where G = group size (4-8 solutions), β = KL penalty (0.0 for math, 0.01-0.1 for general), ε = clip range (0.2). The other variables are as defined in Components 1 and 2 above.

Critical MoE constraints: Use identical routing from generation (saved and replayed during training) and apply same top-p/top-k masks from generation to prevent distribution shift.

Intuitive summary: Maximize rewards for good solutions (policy gradient) while staying close to old policy (clipping) and not drifting from reference (KL penalty). Filter stale data (masking) and maintain MoE consistency.

A.2: DSA Complexity Analysis

Full attention: O(L² · d)

L² token pairs × d-dimensional dot products
Plus softmax and weighted sum

DSA attention:

$O(L^2 \cdot d_I + L \cdot k \cdot d) $

L² · d_I for indexer (d_I = d/16, in FP8)
L · k · log(L) for top-k selection (negligible)
L · k · d for sparse attention computation

Speedup condition:
DSA is faster when its total cost is less than full attention:

$L \cdot d_I + k \cdot d < L^2 \cdot d $

Divide both sides by L to simplify and then,

Substituting typical values (d=5120, d_I=320, k=2048, L=128000):

$d_I + k \cdot \frac{d}{L} < d $

$320 + 2048 · (5120/128000) < 5120$

Calculate the middle term step-by-step:

$2048 · (5120/128000) = 2048 · 0.04 = 81.92 ≈ 82$

Final check:

$320 + 82 < 5120$

$402 < 5120 $

Theoretical speedup calculation:

The simplified formula d_I + k · (d/L) = 402 represents the normalized per-query cost. To calculate actual speedup, we need to use the full cost formulas:

Let me calculate using the complete formulas:

Full attention per query: L · d = 128000 · 5120 = 655,360,000 operations
DSA per query: L · d_I + k · d = 128000 · 320 + 2048 · 5120
             = 40,960,000 + 10,485,760 = 51,445,760 operations
Speedup = 655,360,000 / 51,445,760 ≈ 12.7x

Actual speedup: 3-6x (real-world kernel overhead, memory bandwidth limitations, and top-k selection cost reduce the theoretical gains)

A.3: Training Configuration

Pre-training:

GPUs: 2048× H800 (80GB VRAM)
Precision: BF16 (main), FP8 (indexer)
Duration: ~3 days for sparse training phase
Context length: 4K→8K→16K→32K curriculum

RL Training:

GPUs: 1024× H800
Split: 512 rollout workers, 512 training workers
Batch size: 64-128 per update
Learning rate: 1e-6 with cosine decay

Data mixture (RL phase):

Reasoning (math, logic): 40%
Coding: 30%
Agent tasks: 20%
General chat: 10%

Appendix B: Synthetic Task Generation Pipeline

Here’s the actual process for the travel planning environment:

Step 1: Use an LLM to generate a synthetic database:

50 fictional cities with realistic details (climate, population, landmarks)
200 hotels with varying prices, ratings, amenities
500 restaurants across cuisines and price points
100 attractions with ticket prices and operating hours
Transport connections between cities

Step 2: Generate 15 Python functions to interact with this database:

def get_hotels_by_city(city: str) -> List[Hotel]
def get_city_by_hotel(hotel: str) -> str
def calculate_transport_cost(origin: str, destination: str) -> float
# ... etc

Step 3: Generate tasks with complex constraints:

Plan a 3-day trip from Hangzhou (Oct 1-3, 2025).

Budget rules:
- If day 2 hotel cost ≥800 CNY: restaurant total <350 CNY, attractions <120 CNY
- If day 2 hotel cost 500-800 CNY: different rules apply
- No repeated cities, hotels, restaurants, or attractions
- All locations must match (hotel and restaurants in same city)
- Output must be valid JSON with specific schema

Step 4: Write a verifier function that checks all constraints automatically. Task is hard to solve (large search space, complex constraints) but trivial to verify (just check the rules).

Step 5: Iteratively increase difficulty until tasks are challenging but solvable ~30-70% of the time.

Appendix C: Cost Comparison Details

Cost comparison (processing 100M tokens/day):

FAQ

Q: Can I really run this locally?

Technically yes, but you need ~400GB VRAM for full precision (5× A100 80GB). More practical: use 4-bit quantization on 4× A100 (reduces quality slightly), or use their API/hosted version.

Q: Is it actually as good as GPT-5?

See benchmark results section for detailed comparisons. Short answer: within 1-5% on reasoning tasks, GPT-5 leads on general knowledge and token efficiency.

Q: What’s the catch?

Higher token usage, no multi-modal support, and gaps on complex agent tasks. See Limitations section for details.

Q: Should I switch from GPT-5/Claude-4.5?

Depends on your use case:

High volume + cost sensitive + math/code heavy → Yes, try it
Need best efficiency or multi-modal → No, stick with closed models
Privacy concerns or need fine-tuning → Yes, DeepSeek wins

Q: How does Speciale compare to o1?

Both trade tokens for performance. o1 is likely more token-efficient, but we don’t have public numbers. Speciale is open and cheaper to run at scale.

Q: What’s the license?

Permissive open source. You can use commercially, modify, fine-tune, and redistribute. Check the specific license on HuggingFace for details.

Q: Will there be a V4?

Likely, based on their trajectory. Expect larger pre-training corpus, better efficiency, possibly multi-modal support. Timeline: probably 6-12 months.

This analysis is based on DeepSeek-AI’s paper (arXiv:2512.02556) and my own interpretation of the technical content. I have no affiliation with DeepSeek. For authoritative information, consult the original paper.

Corrections welcome, if I’ve misunderstood something technical, let me know.

Brewing Intelligence

Discussion about this post

Ready for more?