
NEW Deepseek just dropped...

By Matthew Berman

Summary

## Key takeaways

- **DeepSeek V3.2 Wins IMO Gold**: DeepSeek V3.2 is officially the first open-source model to score gold at the International Mathematical Olympiad, beating the top models from closed-source frontier labs like OpenAI and Anthropic. [00:05], [00:26]
- **Special Tops GPT-5 High at 96**: On AIME 2025, GPT-5 High scores 94.6, Gemini 3.0 Pro 95, and DeepSeek V3.2 Special 96. The Special version uses a ton of tokens but scores incredibly well. [01:00], [01:24]
- **DSA Sparse Attention Breakthrough**: DeepSeek introduced DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios. It reduces core attention complexity from O(L²) to O(L·K), scaling roughly linearly instead of quadratically. [02:19], [06:47]
- **10% of Compute on RL for Agents**: They spent over 10% of their compute budget on reinforcement learning, exceeding 10% of the pre-training cost, and generated over 1,800 distinct environments and 85,000 complex prompts for agentic tasks. [05:32], [06:15]
- **671B MoE, 37B Active Params**: The model is frontier-class yet relatively small, coming in at 671 billion parameters; it is a mixture-of-experts model with 37 billion active parameters at inference time. [07:13], [07:40]
- **Fully Open Source, MIT License**: The model is fully open source, open weights, and MIT licensed. Check out the model on Hugging Face and give it a try. [07:29], [07:44]

Topics Covered

  • Open-source Beats Frontier Labs
  • DSA Slashes Attention Complexity
  • Quadratic Costs Stall Context Windows
  • 10% Compute Unlocks RL Power
  • Synthetic Agents Scale Tool Mastery

Full Transcript

DeepSeek 3.2 is here and we have another DeepSeek moment in artificial intelligence. 3.2 is officially the first open-source model that has scored gold at the International Math Olympiad. It is beating the top models by the closed-source frontier labs like OpenAI and Anthropic. And they've done this on a fraction of the budget and at insane efficiency. So, I'm going to break it all down for you right now.

Here's the tweet from the official DeepSeek account launching DeepSeek V3.2 and 3.2 Special: reasoning-first models built for agents. So, as it says, DeepSeek 3.2 comes in three flavors. We have the regular thinking model in 3.2, and then we have the max thinking model in 3.2 Special. So look at these benchmarks.

Here's GPT-5 High, Gemini 3.0 Pro, Kimi K2 Thinking; these are the top models. The only one missing is Opus 4.5. Here is AIME 2025: GPT-5 High 94.6, Gemini 3.0 Pro 95, and DeepSeek V3.2 Special 96. Now, in parentheses, what you're seeing is the number of tokens used. And so, as you can see, the regular thinking model is pretty darn token-efficient compared to GPT-5 High and Gemini 3.0 Pro, but when the Special version of the model is used, it uses a ton of tokens. Yes, it is not as token-efficient, but it does score incredibly well. Here's LiveCodeBench: 84.5, versus 88.7 for Special and 83.3 for the regular thinking model. Here's GPQA Diamond: 85.7 for GPT-5 High, 91.9 for 3.0 Pro, and 85.7 for DeepSeek V3.2 Special.

And so DeepSeek did a couple of really novel things with this new model, and that is what DeepSeek is known for: really pushing the bounds on the algorithmic side of building these models. The two major breakthroughs here are, one, DeepSeek Sparse Attention. They basically updated the attention mechanism (you know, "attention is all you need") to be incredibly efficient. So, they introduced DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios.

And so what improving the efficiency of the attention mechanism allows us to do is increase the context window of these models without sacrificing speed. One thing you might have noticed with all of the model releases over the last three years is that the context window hasn't really been growing. And that's because the math behind increasing the context window is quadratic: as you increase the context window, the compute cost to allow for that increased context window increases quadratically, which means, basically, by a lot.
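A quick back-of-the-envelope illustration of that quadratic blow-up (my own numbers, not from the video): with standard full attention, every token attends to every other token, so the score matrix has L × L entries.

```python
# Rough illustration: full attention builds an L x L score matrix,
# so its cost grows quadratically with context length L.
for L in [8_000, 16_000, 32_000, 64_000]:
    pairs = L * L  # token pairs scored by full attention
    print(f"context {L:>6}: {pairs:,} attention scores")
# Each doubling of L quadruples the number of scores, which is why
# context windows stalled until attention itself got cheaper.
```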

The second thing that they were able to do is invest in a scalable reinforcement learning framework by implementing a robust reinforcement learning protocol and scaling post-training compute. DeepSeek V3.2 performs comparably to GPT-5. Notably, our high-compute variant, 3.2 Special, surpasses GPT-5 and exhibits reasoning proficiency on par with Gemini 3.0 Pro, achieving gold medal performance in both the 2025 IMO and the IOI.

And then, number three: a large-scale agentic task synthesis pipeline. These models are all about agentic use, and they are especially good at tool calling. The way they were able to do that was by setting up environments specifically to create synthetic agentic data. So, to integrate reasoning into tool-use scenarios, they developed a novel synthesis pipeline that systematically generates training data at scale. Very cool. The more we can remove humans from the AI creation process, the more scalable they'll be.
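DeepSeek hasn't published the pipeline itself, so the following is only a rough sketch of what "systematically generating agentic training data at scale" might look like; every function name and the prompts-per-environment count here are hypothetical:

```python
import random

# Hypothetical sketch of a synthetic agentic-data pipeline; none of
# these names come from DeepSeek's (unpublished) code.
rng = random.Random(0)

def make_environment(seed):
    # Stand-in for building one distinct tool-use environment.
    return {"seed": seed, "tools": ["search", "calculator"]}

def propose_prompt(env):
    # Stand-in for an LLM generating a complex task prompt for this env.
    return f"task-{env['seed']}-{rng.randint(0, 10**6)}"

def run_agent(env, prompt):
    # Stand-in for rolling an agent through the env with tool calls.
    return {"prompt": prompt, "solved": rng.random() > 0.5}

environments = [make_environment(s) for s in range(1_800)]  # "1,800 distinct environments"
dataset = []
for env in environments:
    for _ in range(47):  # ~1,800 x 47 ≈ 85,000 prompts
        traj = run_agent(env, propose_prompt(env))
        if traj["solved"]:  # keep only verifiable successes for training
            dataset.append(traj)
```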

And by the way, if you like AI models that specialize in tool use, you can get over 8,000 tools to use with the sponsor of today's video, Zapier. What is Zapier? Zapier allows you to connect your apps and AI tools together so they work autonomously. You set a trigger, choose what you want the AI to do, and Zapier will run it automatically. Zapier lets you drop AI in the middle of your workflows to supercharge them, allowing you to do things like drafting emails, summarizing notes, generating content, and updating data, all without needing you to touch anything. I've been using Zapier for 10-plus years. We use it at my current company, and I just made an automation where I automate the creation of Instagram posts easily. I simply drag and drop an article about what I want that post to be about, and it not only creates the text, it creates the image and posts it all automatically. It's super seamless, and I will drop a link to that specific automation down below. You can click it, copy it, and add on to it. Zapier has over 8,000 different apps that it integrates with. It's basically MCP on steroids, and it actually has an MCP endpoint as well. So check them out, let them know I sent you, and go copy my workflow down below by clicking that link. And thanks again to Zapier for sponsoring this video. Now, back to the video.

Now, let's talk about reinforcement learning for a second. They spent over 10% of their compute budget on reinforcement learning. That 10% doesn't sound like a lot, but compared to essentially every model before it, it is quite a bit. Listen to this: "Notably, this framework allocates a post-training computational budget exceeding 10% of the pre-training cost, unlocking advanced capabilities."
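To put that ratio in concrete terms (the pre-training figure below is hypothetical, purely for scale):

```python
# Illustrative only: a hypothetical pre-training budget to show the
# scale of a >10% post-training allocation.
pretraining_gpu_hours = 2_800_000          # hypothetical figure
rl_budget = 0.10 * pretraining_gpu_hours   # ">10% of the pre-training cost"
print(f"RL post-training budget: over {rl_budget:,.0f} GPU-hours")
```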

And to make these models really good at agentic use cases, listen to this: they generated over 1,800 distinct environments and 85,000 complex prompts for agentic tasks. This extensive synthesized data drives the RL process, significantly enhancing the model's generalization and instruction-following capability in the agent context.
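And here is a hedged sketch of how that synthesized data could "drive the RL process"; rollout, reward, and update below are stand-ins for an unpublished training stack:

```python
import random

# Hypothetical RL loop over synthesized environments and prompts.
rng = random.Random(0)

def rollout(env, prompt):
    # Stand-in for the policy model acting in the environment.
    return {"env": env, "prompt": prompt, "n_tool_calls": rng.randint(1, 5)}

def reward(env, traj):
    # Stand-in for an environment-specific checker scoring the outcome.
    return 1.0 if traj["n_tool_calls"] <= 3 else 0.0

def update_policy(traj, r):
    # Stand-in for a policy-gradient update (e.g., a PPO/GRPO-style step).
    pass

for env in range(1_800):                 # one pass over the toy environments
    traj = rollout(env, f"prompt-{env}")
    update_policy(traj, reward(env, traj))
```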

Now, here is an incredibly important part: why the attention mechanism that they used to train this model is so important. DSA, DeepSeek Sparse Attention, reduces the core attention complexity of the main model from big O of L squared to big O of L times K. And if this sounds confusing, the easiest way to think about it is that rather than the attention complexity scaling quadratically with context length, it scales more linearly. It is the difference between exploding costs and controlled costs.
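The release doesn't include reference code for DSA, so this is only a generic top-k sparse-attention sketch of the O(L·K) idea; the function and shapes are illustrative, and DeepSeek's real selector (a lightweight indexer) is far cheaper than the full scoring shown here.

```python
import numpy as np

def sparse_attention(q, keys, values, k=64):
    """Toy top-k sparse attention for ONE query vector: score all keys,
    keep only the k best, and softmax over that subset.
    (A sketch of the O(L*K) idea, not DeepSeek's actual DSA.)"""
    scores = keys @ q                       # selector scores, shape (L,)
    top = np.argpartition(scores, -k)[-k:]  # indices of the k best keys
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                            # softmax over k entries only
    return w @ values[top]                  # weighted sum of k values

L, d = 4096, 64
rng = np.random.default_rng(0)
out = sparse_attention(rng.standard_normal(d),
                       rng.standard_normal((L, d)),
                       rng.standard_normal((L, d)))
```

The point is that the expensive softmax and value aggregation touch only K entries per query instead of all L.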

So really, the gist of it is that it is much less expensive to run. And as mentioned, it is exceedingly good at tool use. On tool-use benchmarks, DeepSeek V3.2 substantially narrows the performance gap between open-source and closed-source LLMs, though it remains below frontier models. So it isn't quite at the frontier for tool use, but it is still really good and really close.

Now, this model is frontier-class, but it is also relatively small, coming in at 671 billion parameters. It is a mixture-of-experts model with 37 billion active parameters at inference time.
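For intuition on how 671B total parameters can mean only 37B doing work per token, here is a toy mixture-of-experts layer; the router, gate math, and expert shapes are illustrative, not DeepSeek's actual architecture.

```python
import numpy as np

def moe_layer(x, experts, router_w, top_k=2):
    """Toy mixture-of-experts layer for one token: the router picks
    top_k experts and only those run, so most parameters stay idle."""
    logits = router_w @ x                 # one score per expert
    chosen = np.argsort(logits)[-top_k:]  # top-k expert indices
    gates = np.exp(logits[chosen])
    gates /= gates.sum()                  # normalized gate weights
    return sum(g * experts[i](x) for g, i in zip(gates, chosen))

d, n_experts = 16, 8
rng = np.random.default_rng(0)
experts = [lambda x, W=rng.standard_normal((d, d)): W @ x
           for _ in range(n_experts)]
out = moe_layer(rng.standard_normal(d), experts,
                rng.standard_normal((n_experts, d)))
```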

So, if you wanted to run this at FP8, you can do so with 700 GB of VRAM, and you need 1.3 TB of VRAM if you want to run it at the full BF16.
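Those VRAM figures fall out of simple arithmetic, since every one of the 671B weights must be resident even though only 37B are active per token:

```python
# Weight memory scales with TOTAL parameters (all experts must be
# loaded), not the 37B active per token.
params = 671e9
print(f"FP8  (1 byte/param):  {params * 1 / 1e9:,.0f} GB")  # ~671 GB, i.e. "700 GB" with overhead
print(f"BF16 (2 bytes/param): {params * 2 / 1e12:.2f} TB")  # ~1.34 TB, i.e. "1.3 TB"
```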

The model is fully open source, open weights, and MIT licensed. So, that's it. Check out the model, give it a try, and let me know what you think. If you enjoyed this video, please consider giving a like and subscribing, and I'll see you in the next one.
