DeepSeek is BACK & More AI News You Can Use
By The AI Advantage
Summary
## Key takeaways
- **DeepSeek-V3.2 Benchmarks Impressive**: DeepSeek-V3.2 scores 93 on benchmarks where Google Gemini 3.0 Pro is at 95, with agent capabilities on par with top models. But in real-world tests like creating a stunning website or a Death Star SVG, it lags behind Opus 4.5 and Gemini, producing kindergarten-level results. [00:51], [03:10]
- **DeepSeek-Math-V2 Hits Olympiad Gold**: DeepSeek-Math-V2 achieves gold level at the International Math Olympiad using a novel verifier system that generates over 10,000 answers and scores them for accuracy. This open-source model matches what Google's best model accomplished months ago. [03:54], [04:33]
- **Aristotle Cracks '90s Math Problem**: Harmonic's Aristotle resolved a math problem from the '90s that had gone unsolved until now, marking the first math problem an AI model actually cracked. Researchers call it the start of the era of "vibe proving," with AI solving problems humans couldn't. [05:16], [05:35]
- **Kling O1 Unifies Video Features**: Kling O1 combines every video feature, like editing, image-to-video, multi-frames, lip sync, and sound effects, into one interface. Demos show generating complex scenes, like a cat with a hat on a boat drinking wine with rose petals, that are easily refined. [06:19], [07:29]
- **AI Detectors Useless in Education**: All AI detectors are basically useless and won't improve, so teachers must rely on in-class work over homework, where students always use AI. Basic writing skills remain essential, like learning multiplication before using a calculator you always carry. [09:01], [09:31]
- **OpenAI API Data Leak Reminder**: A security leak between the OpenAI API and Mixpanel exposed some user data and accounts, showing nothing put into AI is 100% secure. Abstract sensitive details, like changing $4,300 to $4-4.5k, to get the same results while anonymizing. [09:52], [10:28]
Topics Covered
- Benchmarks deceive real performance
- Verifier scales math to Olympiad gold
- AI cracks unsolved math problems
- Unify video tools into one interface
- Abstract data to anonymize AI inputs
Full Transcript
Welcome to yet another week in AI. This week we'll be looking at the Chinese answer to all the frontier models that we got throughout the past few weeks: Gemini, Opus, and GPT-5.1. And there's a new video model that combines every video feature you might have seen out there into one thing, also Chinese. That and a few more really interesting stories in this week's episode of AI News You Can Use, the show that rounds up all the news releases in this crazy, crazy AI space. We filter for the ones that matter and I get to present them back to you.
Let's start by looking at DeepSeek and their new release, DeepSeek-V3.2. If you're not familiar, they were the first big model out of China where everybody was freaking out that, hey, the Chinese have the frontier models now and they're open-sourcing them. But ever since then, the attention on them was kind of declining, and now they've put out a big new release to compete with some of the frontier models out there. And the thing is, some of these benchmarks are actually on par with some of the top model makers. I mean, look at this: the thinking model scoring 93 on benchmarks where Google Gemini 3.0 Pro, which came out two weeks ago, is at 95. Agentic capabilities on par with some of the top models. Really impressive stats, but what matters at the end of the day is how it performs in the real world.
So, I want to show you: if you pull up their website, go to their platform, their application, and just log in with your Google account, you can use this for free, as you can with many of the competitors. Sure, there's a usage limit after a few dozen messages, but this is completely free. They don't even sell you a plan; you can kind of just use this. Now, as you'll see, my latest chats are from November and January, when they had their latest releases. I kind of just tried it out a little bit and didn't really go back to it. And I'm going to do the same thing here. I think the selling point of the big platforms isn't just the models, it's also the feature set that they bring, and here it's very limited. But let's give this a fair shot anyway.
And what I wanted to do today is pull up examples from the Opus 4.5 video last week, which, if you're not familiar, is the biggest and baddest model in the entire space that actually kept its initial hype. A lot of people are using it every day for a variety of tasks and they're absolutely loving it. And in that video, I compared it to the big release before that, which was Google's Gemini 3.0. I ran two prompts there, and I want to try the same two prompts so we can then compare Gemini versus Opus versus DeepSeek. Now, again, I just want to highlight that this model is actually open source, so people can download it, use it locally, and build it into their apps. But let's see how it performs. This rather open-ended prompt, "Create a visually stunning design website for a studio that will impress web front-end developers," will show us what kind of front end it creates on the first try.
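If you'd rather script this kind of test than click through the chat interface, here's a minimal sketch of sending the same prompt to DeepSeek's hosted API. It assumes their OpenAI-compatible endpoint and the `deepseek-chat` model name from their docs, so check those before relying on it:

```python
# Rough sketch: send the website prompt to DeepSeek's hosted API.
# Assumes the OpenAI-compatible endpoint and model name from DeepSeek's docs.
# Requires the `openai` Python package (v1+).
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",      # placeholder key
    base_url="https://api.deepseek.com",  # OpenAI-compatible endpoint
)

prompt = (
    "Create a visually stunning design website for a studio "
    "that will impress web front-end developers."
)

response = client.chat.completions.create(
    model="deepseek-chat",  # swap in the Death Star SVG prompt the same way
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```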
Obviously, this is no comprehensive test, but interesting nonetheless. While it's doing that, I'm also going to run the second prompt I tried there, which creates a visual: "Create an SVG of the Death Star in the sky above Los Angeles." Nice. Let's run this and see what we get. Okay, so we got our Death Star over LA. I'm really curious to see what this will look like. There's a quote: "The more you tighten your grip, the more traffic will slip through your fingers." There are some moving pieces. Are these cars? The Death Star... does that even look like a Death Star? I'd say Opus is the winner on this benchmark, with this probably being third. Hey, not a perfect test, but it's worth something.
All right, it took a good while to write this code, but let's see what it did here. I can see it did all of it in one HTML file. Okay, that's not bad. I mean, I like the particles, but honestly, compared to both what Gemini and Opus did, this is kindergarten level. Sure, it did all of it in one HTML file, but hey, I gave it free rein over how to do it. And yeah, compared to what the other models did, this is not even close. So, this is why benchmarks aren't everything. Honestly, after one week of usage and trying all the different models, and now seeing this, my personal recommendation right now would really be Opus 4.5. It's just so damn good and reliable. But that would just be my recommendation right now.
DeepSeek, I'm going to close out here and probably not touch for a good while until they release something new, like this new math model that they released this week. So, let's have a look at that, I guess. Usually I don't spend too much time on these specialized models, but they're really pushing what's possible and bringing results that Google was bragging about a few months ago, concretely achieving gold level at the International Math Olympiad. We only got that out of the best Google model a few months ago, and now an open-source model out of China does it, too. We won't actually go ahead and test a bunch of math prompts, because I don't think there's a point. But here's the interesting part.
When you read this, this math model uses a novel verifier system. What that means is that there's a reward model in the background, and the model verifies its own answers before it actually gets them to you. And it doesn't just create a few alternatives and then verify them; it actually creates thousands, over 10,000 in some cases, and then assigns scores to them on how accurate they could be.
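The general pattern here is best-of-N sampling with a learned verifier: generate many candidate solutions, score each one, keep the best. Here's a minimal sketch of that idea, where `generate_candidate` and `score_answer` are hypothetical stand-ins for the actual model and its reward model, not DeepSeek's real pipeline:

```python
# Sketch of verifier-scored best-of-N sampling (generate, then verify).
# The generator and verifier below are placeholders for illustration only.
import random

def generate_candidate(problem: str) -> str:
    # Placeholder: in practice this would be a sampled LLM solution.
    return f"candidate proof for: {problem} (seed {random.random():.4f})"

def score_answer(problem: str, answer: str) -> float:
    # Placeholder: in practice a reward/verifier model that rates correctness.
    return random.random()

def best_of_n(problem: str, n: int = 10_000) -> str:
    # Generate n candidates and keep the one the verifier scores highest.
    candidates = (generate_candidate(problem) for _ in range(n))
    return max(candidates, key=lambda ans: score_answer(problem, ans))

print(best_of_n("Prove that sqrt(2) is irrational.", n=100))
```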
And they really made it work, because, heck, this model is state-of-the-art in terms of math. I mean, gold at the International Math Olympiad was, well, no pun intended, the gold standard for these math models a few months ago, and now we've got an open-source version of that. So, I just wanted to quickly update you on that. If you want, it's available: you can download it and run it locally, but most people won't do that. I just wanted to inform you that China is still doing its thing and open-sourcing most of what they're doing. Hey, if you're finding this interesting, make sure to subscribe to the channel. I'm really hoping to hit the half-a-million mark. Every subscribe counts. And now let's get back to the next piece of AI news that you can use.
And as we're already on the topic of math, I actually want to show you one new story which I thought was almost unbelievable. I'm not a mathematician, okay? I'm just parroting what I read here. But basically, there's this new AI system built by a company named Harmonic, called Aristotle. And Aristotle resolved a math problem from the '90s that had not been solved until now. Researchers are calling this the very first math problem that an AI model actually cracked, and they're calling it the start of the era of "vibe proving," where they use AI models to solve mathematical problems that people couldn't solve before. I mean, this is a big milestone, no? And again, I'm just repeating this announcement that is making its rounds on the internet, and it might be a bit amplified for marketing reasons, but really, you're seeing this progress week by week, and it usually just takes that one domino and then all of a sudden a whole new world of possibilities opens up. This might be it for math. We'll keep an eye on it.
But for now, let's switch to something you might actually use this week, which is again a Chinese model that brought a bunch of these features into one interface. I always love seeing that. And in this case, it's for AI video: it's called Kling O1. If you're familiar with all the various video models and features, usually it's an interface with a bunch of buttons and a bunch of different modules, because you can just do so much. You can edit, you can generate images, you can turn the images into video, you can do starting and ending frames, all this different stuff. And for all the different functions, you had different modules. Kling brought it all together now with Kling O1: "input anything, understand anything, generate any vision," that's their headline here. And rather than just talking about it, let's go in here and try this for ourselves.

So let's do that example: a cat with a hat in a romantic scene on a boat. I added this image with rose petals, I added these rose petals falling from the sky, I say, comma, drinking wine, and then unicorn. I don't know, don't ask me what that is about, just: unicorn, important. Okay, I guess I could also add more. I could do transformations, video references, multiple frames. I'm just going to generate at this point and see what we get. Okay, time to review. Where's the cat sipping wine? Okay, maybe let's add a cat and remove the unicorn and stuff. It's a romantic scene. Let's give him a rose. Okay, just generate. Okay, second time is the charm, I suppose. Yep, that's everything I expected, including the rose in the water. Look at that.

Okay, but then there are all these features here, too. So, if I want to do AI sound, it's just a press of a button: it auto-generates a prompt and, voila, it makes the sound effect for it. If I want to do lip sync and have the cat start singing, that's a press of a button, too. I guess it's still kind of using all these tools under the hood, but they really brought it together and I like that, especially for people that are new to these tools. Kind of figuring out all the things that are possible can be a process, and having it in one interface is definitely the way to go.
Yep, that's audio. So, this is great if you're doing stuff with AI video; this is one of the frontier models. Anyway, sure, there are other features like virtual try-on, but this does bring a lot of the stuff that you want to do together into one place, and that's a trend we see across many tools. Okay, let's see what's next.

All right, so let's have a look at this week's quick hits. First of all is another AI video tool, Runway Gen-4.5. It looks good, but honestly, at this point, it's kind of hard to tell these models from each other. I mean, some of the demos are super impressive; Runway in particular always does a great job with these demo videos. Then Perplexity is adding new features, for example the memory feature, which you might know from other providers. But even more interestingly, they added an email assistant. The idea is connecting multiple calendars to one assistant so it can see everything, and people can talk to the assistant in order to find new slots to schedule in. Interesting. I'm going to give this one a look over the next week and let you know if it's any good. But the idea is really sound. Usually my experience with all the apps that connect to a calendar is that they're just not reliable enough; like, 85% is not good enough when it comes to telling me if I have time on a particular day, and that has been my experience with all the connectors and other apps. Let's see if this one is different. Time will tell.
Then I wanted to highlight this tweet from Andrej Karpathy where he talks about AI and education. First, he argues that all AI detectors are basically useless: no matter which one or how many of them you use, they basically don't work, and that's not going to change. So teachers have to rely on work that is done in class rather than homework, because on homework students will always be using AI and there's nothing that can be done about that. And then in his third point, he argues that while that might be the case, building a basic skill set of working with knowledge and writing yourself seems essential, just like with a calculator: you first learn how to calculate without it, and then they give you a calculator, and I guess now you always have one in your pocket. But it's good to know basic multiplication, just like it's good to know basic writing even though you have AI. I couldn't agree with this more. I just thought it was a really good take and I wanted to include it here.
And then lastly, a quick reminder that all of your data might not be as secure as most people are hoping for, even with the big providers. There was a small security leak involving the OpenAI API and one of their partners, Mixpanel. I've never used that application, but the point is that some of the user data between Mixpanel and OpenAI's API was actually leaked, along with the accounts, and that this can happen. I want to be clear: this does not pertain to ChatGPT, just this particular app using their developer models through the API. But that's the reality of it. They can only control so much, and nothing that you put into AI is 100% secure from outside eyes.
So keep that in mind, especially when you handle sensitive data. My recommendation would be that if you're in doubt, you can always abstract away from the specifics. What I mean by that is that instead of saying, "Hey, this invoice for $4,300," you can say, "This invoice ranged between $4,000 and $4,500." And the response will be the same as if you included the specifics. This is a great way to handle things like sensitive contracts or company data. You can just generalize certain figures or certain names and you get the same results while keeping the data anonymized.
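If you do this a lot, that generalization step can be scripted before anything is pasted into a chat. Here's a minimal sketch; the regex and the $500 rounding bucket are illustrative assumptions, not a standard:

```python
# Minimal sketch: abstract exact dollar amounts into ranges before sending
# text to an AI service. Rounding scheme and regex are illustrative only.
import re

def abstract_amounts(text: str, bucket: float = 500.0) -> str:
    """Replace exact dollar amounts with a rounded range, e.g. $4,300 -> $4,000-$4,500."""
    def to_range(match: re.Match) -> str:
        value = float(match.group(1).replace(",", ""))
        low = int(value // bucket * bucket)
        high = int(low + bucket)
        return f"${low:,}-${high:,}"

    return re.sub(r"\$(\d(?:[\d,]*\d)?(?:\.\d+)?)", to_range, text)

print(abstract_amounts("Please summarize this invoice for $4,300 from Acme."))
# -> "Please summarize this invoice for $4,000-$4,500 from Acme."
```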
Or you're just like me and you put almost everything in there. I wouldn't recommend that, though, but I've got to stay honest about these things. And that's really everything we have for this week. It was a rather short one after the tumultuous weeks that we've had. I sincerely hope that there was something interesting or inspiring for you. I remain dedicated to this weekly schedule no matter what the releases look like. And with that being said, my name is Igor Pagani, and I hope you have a wonderful day.