
DeepSeek AI Introduces DeepSeek V3.2 and V3.2 Speciale for Long Context Reasoning and Agent Workloads

By Marktechpost AI

Summary

Topics Covered

  • Sparse Attention Matches Dense Quality
  • GRPO Distills Specialist RL Domains
  • 85K Synthetic Tasks Train Agents
  • Thinking Mode Persists Across Tools
  • Olympiad Gold on Open Weights

Full Transcript

How do you get GPT-5 level reasoning on real long context, tool-using workloads without paying the quadratic attention and GPU cost that usually makes those systems impractical?

DeepSeek research introduces DeepSeek V3.2 and DeepSeek V3.2 Speciale. They are reasoning-first models built for agents, targeting high-quality reasoning, long context, and agent workflows with open weights and production APIs. The models combine DeepSeek Sparse Attention (DSA), a scaled GRPO reinforcement learning stack, and an agent-native tool protocol, and report performance comparable to GPT-5, with DeepSeek V3.2 Speciale reaching Gemini 3.0 Pro level reasoning on public benchmarks and competitions.

Both DeepSeek V3.2 and DeepSeek V3.2 Speciale use the DeepSeek V3 mixture of experts transformer, with about 671 billion total parameters and 37 billion active parameters per token, inherited from V3.1 Terminus. The only structural change is DeepSeek Sparse Attention (DSA), introduced through continued pre-training. DSA splits attention into two components. A lightning indexer runs a small number of low precision heads over all token pairs and produces relevance scores, and a fine-grained selector keeps the top-k key-value positions per query. The main attention path then runs multi-query attention and multi-head latent attention on this sparse set.
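To make the two-component design concrete, here is a minimal single-head sketch in PyTorch. The shapes and names are illustrative assumptions, not DeepSeek's fused low-precision kernels: the indexer scores all token pairs cheaply, the selector keeps the top-k keys per query, and full attention runs only over that subset.

```python
import torch
import torch.nn.functional as F

def sparse_attention_sketch(q, k, v, idx_q, idx_k, top_k=2048):
    """Toy single-head DSA forward pass (illustrative, not the real kernels).

    q, k, v:      [L, d]      main-path queries / keys / values
    idx_q, idx_k: [L, d_idx]  cheap lightning-indexer projections
    top_k:        key-value positions kept per query
    """
    L, d = q.shape
    # 1. Lightning indexer: low-cost relevance score for every token pair.
    scores = idx_q @ idx_k.T                                   # [L, L]
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    # 2. Fine-grained selector: keep the top-k key-value positions per query.
    k_eff = min(top_k, L)
    sel_scores, top_idx = scores.topk(k_eff, dim=-1)           # [L, k_eff]
    # 3. Main attention path runs only over the selected sparse set.
    k_sel, v_sel = k[top_idx], v[top_idx]                      # [L, k_eff, d]
    logits = torch.einsum("ld,lkd->lk", q, k_sel) / d ** 0.5
    # Re-mask positions whose indexer score was -inf (future tokens).
    logits = logits.masked_fill(sel_scores == float("-inf"), float("-inf"))
    attn = F.softmax(logits, dim=-1)
    return torch.einsum("lk,lkd->ld", attn, v_sel)             # [L, d]
```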

DSA is introduced by continued pre-training on top of DeepSeek V3.1 Terminus. In the dense warm-up stage, dense attention remains active, all backbone parameters are frozen, and only the lightning indexer is trained, with a Kullback-Leibler loss pushing it to match the dense attention distribution on 128K context sequences. This stage uses a small number of steps and about 2 billion tokens, enough for the indexer to learn useful scores.
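A minimal sketch of that warm-up objective, assuming hypothetical interfaces for the frozen backbone and the indexer: dense attention provides the target distribution, and only the indexer's parameters receive gradients from a KL divergence loss.

```python
import torch
import torch.nn.functional as F

def indexer_warmup_step(model, indexer, optimizer, batch):
    """One dense warm-up step: train only the lightning indexer.
    `model.dense_attention_distribution` and `indexer(batch)` are
    hypothetical interfaces standing in for the real training code."""
    for p in model.parameters():
        p.requires_grad_(False)          # backbone stays frozen

    with torch.no_grad():
        # Dense attention probabilities from the frozen model: [L, L]
        dense_probs = model.dense_attention_distribution(batch)

    # Indexer scores over the same token pairs: [L, L]
    log_sparse = F.log_softmax(indexer(batch), dim=-1)

    # KL divergence pushes the indexer toward the dense distribution.
    loss = F.kl_div(log_sparse, dense_probs, reduction="batchmean")

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```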

In the sparse stage, the selector keeps 2,048 key-value entries per query. The backbone is unfrozen and the model continues training on about 944 billion tokens. Gradients for the indexer still come only from the alignment loss against dense attention on the selected positions. This schedule makes DSA behave as a drop-in replacement for dense attention, with similar quality and lower long context cost.
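To make the cost claim concrete (this arithmetic is mine, not a figure from the report): at a 128K context, each query attends to at most 2,048 of up to 131,072 key-value positions, so the main attention path touches roughly 1.6% of the pairs a dense pass would.

```python
context_len = 128 * 1024   # 128K-token context
top_k = 2048               # key-value entries kept per query

dense_pairs = context_len * context_len   # O(L^2) dense attention
sparse_pairs = context_len * top_k        # O(L * k) after selection
print(sparse_pairs / dense_pairs)         # 0.015625, about 1.6%
```

The lightning indexer still scans all token pairs, but with a small number of low precision heads, which keeps its share of the cost modest.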

On top of the sparse architecture, DeepSeek V3.2 uses group relative policy optimization (GRPO) as the main reinforcement learning method. The research team states that post-training reinforcement learning (RL) compute exceeds 10% of pre-training compute. RL is organized around specialist domains: the team trains dedicated runs for mathematics, competitive programming, general logical reasoning, browsing and agent tasks, and safety, then distills these specialists into the shared 685B parameter base for DeepSeek V3.2 and DeepSeek V3.2 Speciale. GRPO is implemented with an unbiased KL estimator, off-policy sequence masking, and mechanisms that keep mixture of experts (MoE) routing and sampling masks consistent between training and sampling.
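The core of GRPO is simple to sketch: sample a group of responses per prompt and use each response's reward relative to the group as its advantage, so no learned value model is needed. The version below is a minimal, hypothetical illustration that omits the unbiased KL estimator and masking machinery described above.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages for one prompt.

    rewards: [G] scalar rewards for G sampled responses to the same prompt.
    Returns: [G] advantages, normalized within the group.
    """
    mean = rewards.mean()
    std = rewards.std().clamp_min(1e-6)   # guard against zero variance
    return (rewards - mean) / std

# Example: 4 sampled answers to one math problem, reward 1.0 if verified correct.
print(grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0])))
```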

The DeepSeek research team builds a large synthetic agent dataset by generating more than 1,800 environments and more than 85,000 tasks across code agents, search agents, general tools, and code interpreter setups. Tasks are constructed to be hard to solve and easy to verify, and are used as RL targets together with real coding and search traces.
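"Hard to solve, easy to verify" means the reward can be a cheap programmatic check even when producing the answer takes long reasoning. A hypothetical verifier for a code-agent task, in the spirit of that description:

```python
import os
import subprocess
import tempfile

def verify_code_task(candidate_solution: str, tests: str) -> float:
    """Reward 1.0 if the candidate program passes the task's unit tests.
    Verification is one cheap subprocess run; solving is the hard part."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_solution + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=10)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
    finally:
        os.unlink(path)
```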

At inference time, DeepSeek V3.2 introduces explicit thinking and non-thinking modes. The deepseek-reasoner endpoint exposes thinking mode by default, where the model produces an internal chain of thought before the final answer. The thinking-with-tools guide describes how reasoning content is kept across tool calls and cleared when a new user message arrives, and how tool calls and tool results stay in the context even when reasoning text is trimmed for budget. The chat template is updated around this behavior.
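Under one reading of that guide, the retention policy can be sketched as follows. The message schema is a hypothetical stand-in, not the shipped helpers: reasoning survives tool calls within the current turn, earlier turns lose it, and tool messages are never trimmed.

```python
def prepare_context(messages: list[dict]) -> list[dict]:
    """Apply the described retention rules to a message history (illustrative)."""
    # The most recent user message marks the start of the current turn.
    last_user = max(
        (i for i, m in enumerate(messages) if m["role"] == "user"), default=-1
    )
    out = []
    for i, msg in enumerate(messages):
        msg = dict(msg)
        if msg["role"] == "assistant" and i < last_user:
            msg.pop("reasoning_content", None)  # older turns: reasoning cleared
        out.append(msg)                         # tool calls/results always kept
    return out
```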

The DeepSeek V3.2 Speciale repository ships Python encoder and decoder helpers instead of a Jinja template. Messages can carry a reasoning_content field alongside content, controlled by a thinking parameter. A developer role is reserved for search agents and is not accepted in general chat flows by the official API, which protects this channel from accidental misuse.
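Assuming a message shape along the lines the repository describes (field and parameter names here are illustrative, not the exact helper API), a thinking-mode request and the developer-role restriction might look like this:

```python
# Illustrative request shape; the V3.2 Speciale repo ships the actual
# Python encode/decode helpers, so treat these names as assumptions.
request = {
    "thinking": True,  # thinking mode, the deepseek-reasoner default
    "messages": [
        {"role": "user", "content": "What does DSA change vs dense attention?"},
        {
            "role": "assistant",
            "reasoning_content": "Recall the indexer + top-k selector design ...",
            "content": "DSA adds a lightning indexer and a fine-grained selector ...",
        },
    ],
}

def validate_roles(messages: list[dict], is_search_agent: bool = False) -> None:
    """The official API reserves the developer role for search agents."""
    for m in messages:
        if m["role"] == "developer" and not is_search_agent:
            raise ValueError("developer role is rejected in general chat flows")
```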

On standard reasoning and coding benchmarks, DeepSeek V3.2 and especially DeepSeek V3.2 Speciale are reported as comparable to GPT-5 and close to Gemini 3.0 Pro on suites such as AIME 2025, HMMT 2025, GPQA, and LiveCodeBench, with improved cost efficiency on long context workloads. For formal competitions, the DeepSeek research team states that DeepSeek V3.2 Speciale achieves gold medal level performance at the International Mathematical Olympiad 2025, the Chinese Mathematical Olympiad 2025, and the International Olympiad in Informatics 2025, and competitive gold medal level performance at the ICPC World Finals 2025.
