DeepSeek R1: What devs should know about AI’s biggest shakeup


Discover how DeepSeek R1 and V3 are redefining AI with groundbreaking innovations, unmatched efficiency, and open-source accessibility—challenging industry giants like OpenAI and Google.
18 mins read
Jan 29, 2025

It’s not every day that an open-source AI model shakes the foundations of the AI world, but that’s exactly what DeepSeek did.

Within a week of its launch on January 20, 2025, DeepSeek’s R1 model and app climbed to the #1 spot on the U.S. App Store, surpassing ChatGPT. Meanwhile, NVIDIA’s stock plunged by $590+ billion as the industry realized the implications: powerful AI can now be trained without massive budgets or premium chips.

And every developer should be paying attention to how this all plays out. Here’s why.

By making advanced, open-source AI more accessible and affordable than ever, DeepSeek has ushered in a new era of powerful, cutting-edge AI tools. Companies that previously couldn't afford to train their own models now can, and they'll need developers who can fine-tune, deploy, and innovate with these tools.

Here’s what we’re covering today:

  • What makes DeepSeek’s V3 and R1 models revolutionary?

  • How does DeepSeek stack up against AI heavyweights like GPT-4 and Claude?

  • What does DeepSeek’s rise indicate about a new era of open-source innovation (and what does it mean for the future of AI development)?

Let’s dive in!

How Did DeepSeek V3 Break the Mold?#

Imagine a giant workshop of specialists. Someone walks in with a job, and instead of the entire workshop firing up their tools, only the experts who handle that specific task jump in. That’s DeepSeek V3—a Mixture of Experts (MoE) architecture. Out of its colossal 671 billion parameters, only 37 billion wake up at a time to solve a problem. The rest get to nap, saving power and effort.

You might say, "Gemini is also built on MoE, so how is this different?" Think of Gemini like a high-end restaurant kitchen. It has separate expert chefs for pastries, seafood, and sauces, each focused on their craft. But DeepSeek V3 isn't just a kitchen; it's a food truck fleet.

Each truck is a modular specialist broken down into smaller, well-labeled compartments. Not only does every truck handle its specialized tasks, but some carry shared supplies (like salt, spices, and fuel) that help the whole fleet stay efficient and coordinated. The twist? These trucks can dynamically reassign tasks depending on demand. If a big order of desserts comes in, more trucks can shift to dessert-making mode. That’s DeepSeek V3’s upgraded MoE: modular, flexible, and resource-smart.

Perhaps most intriguingly, DeepSeek V3 forgoes traditional MoE load-balancing penalties. These penalties, used in older systems, forced models to distribute tasks evenly across experts—even when uneven allocation would be more efficient. 

Instead, DeepSeek employs a dynamic routing mechanism to distribute tasks on the fly. Imagine traffic controllers at a busy intersection. Older MoE systems relied on fixed schedules to direct traffic, which sometimes caused jams (bottlenecks) or left some routes underused. DeepSeek’s dynamic routing is like installing AI-powered traffic lights that instantly adapt to real-time conditions, redirecting vehicles to open lanes or faster paths. This eliminates inefficiencies and keeps things running smoothly. By skipping auxiliary-loss penalties (essentially side quests that older systems used to balance the load), DeepSeek focuses all its training effort on the main tasks, making it faster, more stable, and better suited to handling complex, multifaceted problems.
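The workshop-and-traffic-controller picture can be sketched in a few lines of code. This is a toy illustration of top-k routing with a shared expert, not DeepSeek's actual implementation; every size and weight below is invented for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_EXPERTS, TOP_K = 8, 4, 2  # hidden size, routed experts, experts active per token

# Each routed expert is reduced to a single weight matrix for illustration.
experts = [rng.normal(size=(D, D)) for _ in range(N_EXPERTS)]
shared_expert = rng.normal(size=(D, D))   # "shared supplies" every token passes through
router = rng.normal(size=(D, N_EXPERTS))  # scores each token against each expert

def moe_forward(x):
    """Route one token vector through its top-k experts plus the shared expert."""
    scores = x @ router
    top = np.argsort(scores)[-TOP_K:]                    # only TOP_K experts wake up
    w = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over the chosen few
    routed = sum(wi * (x @ experts[i]) for wi, i in zip(w, top))
    return routed + x @ shared_expert                    # shared expert always contributes

token = rng.normal(size=D)
out = moe_forward(token)
print(out.shape)  # same shape in and out, but only 2 of 4 routed experts did any work
```

The point to notice: the output shape never changes, yet most expert parameters stayed idle for this token, which is where the compute savings come from.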

DeepSeek's variation of mixture of experts

In short, DeepSeek V3 doesn't just fine-tune MoE; it redefines it. Modifying experts, sharing resources intelligently, and introducing real-time task routing have created a model that runs faster, learns smarter, and adapts better than anything we've seen before.

How DeepSeek-V3 Takes Innovation to the Next Level#

DeepSeek V3’s brilliance isn’t just about its experts—it’s also in how it trains and thinks. 

For starters, it uses Float8 (FP8) precision, which makes number crunching faster and lighter. Here’s the clever twist: FP8 calculations can lose some accuracy over time—like a photocopy of a photocopy. DeepSeek solves this by periodically syncing these calculations to a high-precision primary copy, keeping everything sharp and reliable.
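To see why periodic syncing matters, here's a minimal numeric sketch. It uses float16 as a stand-in for FP8 (NumPy has no FP8 dtype), so the magnitudes are illustrative only, but the photocopy-of-a-photocopy effect is real:

```python
import numpy as np

master = np.float64(0.0)   # high-precision "primary copy"
drifted = np.float16(0.0)  # low-precision value that is never corrected

for step in range(100_000):
    master += 1e-2
    drifted = np.float16(drifted + np.float16(1e-2))  # rounding error compounds

print(float(master))   # ~1000.0, essentially exact
print(float(drifted))  # stalls far below 1000: tiny additions round away entirely

# The remedy described above: periodically snap the working copy back to the
# high-precision running total.
synced = np.float16(0.0)
for step in range(100_000):
    synced = np.float16(synced + np.float16(1e-2))
    if step % 100 == 99:
        synced = np.float16((step + 1) * 1e-2)  # resync to the exact total
print(float(synced))   # ~1000.0 again
```

Once the accumulator grows large enough, each low-precision addition rounds to zero, so the un-synced copy simply stops moving; the periodic resync keeps it honest.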

Then there’s Latent Attention, a new way to handle memory during inference. Instead of storing large key-value pairs (as in traditional attention methods), DeepSeek compresses them into a more compact format, akin to swapping an encyclopedia for an index card. This saves space without sacrificing crucial details, making DeepSeek fast and well-suited for long text or code sequences.
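A rough sketch of the compression idea, with made-up sizes: instead of caching full keys and values per token, cache one small latent vector per token and reconstruct K and V from it on demand. This captures only the memory-layout intuition, not DeepSeek's exact multi-head latent attention:

```python
import numpy as np

rng = np.random.default_rng(0)
D, LATENT, T = 64, 8, 512  # hidden size, latent size, sequence length

W_down = rng.normal(size=(D, LATENT)) / np.sqrt(D)       # joint K/V compression
W_up_k = rng.normal(size=(LATENT, D)) / np.sqrt(LATENT)  # reconstruct keys
W_up_v = rng.normal(size=(LATENT, D)) / np.sqrt(LATENT)  # reconstruct values

x = rng.normal(size=(T, D))  # token hidden states

# Standard attention caches full K and V: 2 * T * D numbers per head.
full_cache = 2 * T * D

# Latent attention caches only the compressed latent: T * LATENT numbers.
latent = x @ W_down          # the "index card" stored per token
latent_cache = T * LATENT

# Keys and values are rebuilt from the latent only when needed.
k = latent @ W_up_k
v = latent @ W_up_v

print(full_cache // latent_cache)  # 16x smaller KV cache in this toy setup
```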

Multi-head latent attention

Another game-changer is multi-token prediction (MTP), where the model predicts multiple future tokens at once rather than one at a time. Imagine solving a puzzle while seeing several pieces ahead—it’s faster and more strategic. MTP can also reduce inference latency in production because multiple tokens are predicted in parallel—a major boost for real-time applications. This approach enriches training by providing denser learning signals, which improves efficiency and boosts inference quality.
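The idea can be caricatured with independent prediction heads, one per future offset. Real MTP training is more involved (DeepSeek's modules predict future tokens sequentially, not independently), so treat this purely as intuition; every size and weight below is invented:

```python
import numpy as np

rng = np.random.default_rng(0)
D, VOCAB, N_FUTURE = 16, 50, 3  # hidden size, vocab size, tokens predicted per step

# One toy output head per future position: predicts t+1, t+2, t+3.
heads = [rng.normal(size=(D, VOCAB)) for _ in range(N_FUTURE)]

def predict_next_tokens(hidden):
    """From one hidden state, draft the next N_FUTURE tokens in parallel."""
    return [int(np.argmax(hidden @ W)) for W in heads]

hidden = rng.normal(size=D)
tokens = predict_next_tokens(hidden)
print(len(tokens))  # 3 draft tokens from a single forward pass
```

Because several tokens come out of one pass, each training step carries a denser learning signal, and at inference time the drafts can cut latency.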

With these innovations under the hood, DeepSeek V3 emerges as a leader among open-source models, rivaling closed-source giants like GPT-4o and Claude 3.5 Sonnet while remaining accessible. Though exact performance metrics can vary by domain, DeepSeek V3’s open-source nature may foster faster community-driven improvements—and that’s the beauty of open source!

How Does DeepSeek V3 Stack Up?#

Now that we’ve explored DeepSeek V3’s architecture and key innovations, the big question is: How does it perform in real-world tests?

This section compares DeepSeek V3 against established industry heavyweights—GPT-4o, Claude 3.5 Sonnet, and others—on various popular benchmark evaluations.

On educational benchmarks like MMLU (Massive Multitask Language Understanding) and its advanced variant, MMLU-Pro, DeepSeek V3 scored 88.5 and 75.9, respectively. This outpaces other open-source models and nears the performance of giants like GPT-4o.

Beyond educational tasks, DeepSeek V3 also demonstrates strong coding-related performance in benchmarks such as CodeForces (a platform for algorithmic problem solving) and SWE Verified (focused on generating functional, bug-free code).

In fact, DeepSeek-V3 showed exceptional results.

On CodeForces-style challenges, it tackled competitive programming problems with precision and efficiency, leveraging MTP to reduce latency and navigate complex logic seamlessly. Meanwhile, its success on SWE Verified—where it generated high-quality code that passed stringent test cases—demonstrates how well it handles extended contexts (up to 128K tokens), a major advantage for software engineers.

DeepSeek V3 also excels in mathematics and reasoning. It achieves state-of-the-art results in the MATH-500 benchmark, surpassing GPT-4o on certain tasks. This robust handling of intricate problem-solving underscores its advanced architecture and meticulous training approach. DeepSeek V3 leads among open-source models for fact-based and general knowledge tasks and shows performance comparable to closed-source mainstays like Claude 3.5 Sonnet.

Although DeepSeek’s published metrics don’t explicitly pit V3 against o1, the available data shows DeepSeek V3 performing competitively in several key areas. On MMLU, for instance, V3 achieves an 88.5% pass@1 score vs. o1’s 92.3%, and in HumanEval, it scores 82.6% vs. o1’s 92.4%. While DeepSeek V3 trails more significantly in tasks like MATH (61.6% vs. 94.8% pass@1) and GPQA (59.1% vs. 77.3% pass@1), it still demonstrates impressive capability—especially considering it’s an open-source model with a fraction of the compute costs associated with many closed alternatives. o1 is roughly 178.6x more expensive than DeepSeek V3 for input and output tokens!

DeepSeek-V3 benchmark highlights

Ultimately, DeepSeek V3’s benchmark performance highlights its real value for developers. Whether tackling competitive programming puzzles, debugging complex software, or building new applications from the ground up, DeepSeek V3 offers more than just a tool—it’s a collaborative partner that empowers you to produce better code faster. Its blend of efficiency and open-source accessibility makes it a compelling alternative to closed-source titans, fueling innovation and community-driven progress.

With just $5.6 million and 2,048 GPUs over 55 days (2.8 million GPU-hours), the team built a frontier-grade LLM—using 11x less compute than Meta’s Llama 3 (30.8 million GPU-hours).

The Chain-of-Thought Debate#

Having established DeepSeek’s remarkable efficiency gains, let’s turn to a crucial new feature stirring up the AI landscape: chain-of-thought reasoning (CoT).

Unlike traditional models that jump straight to answers, CoT enables DeepSeek V3 to break problems into smaller, logical steps, mirroring how humans tackle complexity. The model can systematically reason through each solution stage, whether solving a math problem, debugging code, or navigating intricate queries.

With every disruptive innovation comes a bit of drama. Following DeepSeek’s announcement, OpenAI’s Founder and CEO, Sam Altman, took to social media—implying that improving existing ideas is “easy” compared to pioneering something entirely new.

But here’s the twist: isn’t that what OpenAI did when building on Google’s “Attention is All You Need” paper? By refining and extending the transformer architecture that paper introduced, OpenAI stood on the shoulders of giants. DeepSeek’s CoT implementation follows that same philosophy: take a great idea, refine it, and make it accessible.

DeepSeek V3’s CoT is particularly exciting because it levels the playing field in the open-source space.

Until now, OpenAI’s proprietary o1 was the only mainstream system capable of advanced CoT reasoning. DeepSeek matches this capability and improves on it by remaining fully open-source, giving developers unparalleled transparency and adaptability.

By introducing CoT reasoning, DeepSeek V3 bridges a critical gap in the capabilities of open-source models. For developers and researchers, this means access to an advanced tool that doesn’t just generate answers but also provides explainable, step-by-step solutions. It’s a feature that elevates the model’s utility and sets a new benchmark for what open-source AI can achieve—with or without the spicy commentary.

Cracking the “Strawberry” Problem#

Before diving into the model’s identity quirks, let’s look at an example that once tripped up many advanced language models: How many times does the letter “r” appear in the word “strawberry”?

When GPT-4o first debuted, it consistently answered two, a glaring mistake, as the correct answer is three. Newer updates reportedly fixed this, though it’s unclear whether that happened via explicit inclusion of the question in training data or through broader refinements.

Counting letters in a single word appears simple, but many LLMs rely on tokenization and pattern-matching rather than literal string processing. This architectural focus often causes them to overlook exact character counts or default to incomplete internal “memories,” leading to surprising mistakes like missing an extra “r” in “strawberry.”

DeepSeek V3, however, handles this question more reliably—even without enabling its advanced reasoning capabilities. It correctly identifies all three occurrences of “r” by breaking the problem into smaller, logical steps. This underscores the model’s capacity for detail-oriented tasks and its strong handling of language challenges that historically confounded other models.
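The underlying point is that this task is trivial when treated as literal string processing rather than token-level pattern matching, which is also how a chain-of-thought answer walks through it:

```python
word = "strawberry"

# Literal string processing: trivially correct.
print(word.count("r"))  # 3

# Step by step, the way a chain-of-thought answer would enumerate it:
positions = [i for i, ch in enumerate(word) if ch == "r"]
print(positions)        # [2, 7, 8]
```

An LLM never sees the word this way: its tokenizer may split "strawberry" into chunks like "straw" and "berry", so character-level facts have to be inferred rather than read off.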

Why Does DeepSeek-V3 Think It’s ChatGPT?#

Another curiosity has sparked a wave of online commentary: the model sometimes identifies itself as ChatGPT, OpenAI’s AI-powered chatbot. When pressed, it elaborates, claiming to be a version of OpenAI’s GPT-4 released in 2023. If asked about DeepSeek’s API, it might offer instructions for OpenAI’s API instead.

So, what’s going on here? The likely explanation lies in training data. While DeepSeek hasn’t disclosed the exact datasets used, the internet is brimming with public datasets containing synthetic text generated by GPT-4 via ChatGPT.

If DeepSeek V3 was trained on a mix of such datasets, it may have inadvertently absorbed some of GPT-4’s outputs. These models don’t know who they are; they’re statistical machines generating responses based on patterns in their training data.

Creating a model that accurately identifies itself requires additional fine-tuning. Without that deliberate step, the model has no inherent reason to respond accurately to questions about its own identity. If the data is saturated with references to ChatGPT, the model’s most statistically likely response may simply be to identify as ChatGPT.

From a technical perspective, this quirk isn’t a major failure—rather, it’s an emergent property of the data landscape. ChatGPT’s outputs are widely distributed online, and any model trained on these sources could easily echo those themes. It’s an amusing reminder that AI systems aren’t self-aware; they simply reflect the patterns and biases in their training data.

Is DeepSeek-V3 the Best Model Out There?#

While DeepSeek V3 has turned heads with its novel architecture, strong coding performance, and reasoning abilities, we should ask: Is it the best model currently available? The answer is nuanced. Although DeepSeek excels in many domains, certain evaluations reveal vulnerabilities—especially with trick questions or subtle reasoning tasks.

Consider the Misguided Attention eval, a set of prompts designed as subtle twists on well-known riddles or paradoxes.

These prompts twist classic riddles and logical puzzles (e.g., Trolley Problem, Monty Hall variants, simplified river crossings) in ways that expose how LLMs can latch onto memorized patterns rather than genuinely parse context. The questions often look deceptively simple yet require exact reasoning or careful parsing of counterintuitive details. As a result, many models overcomplicate the solution, revert to familiar but incorrect logic, or invent elaborate scenarios that fail to address the core challenge.

Slight modifications to a familiar riddle quickly reveal whether a model truly comprehends the query or merely regurgitates patterns from its training data.

DeepSeek V3’s performance here was underwhelming for a model of its size—solving only 22% of the 13 test questions. This result suggests that some architectural optimizations (like compressed KV caches or the MoE design) may make it more prone to overfitting when faced with unusual or subtly altered inputs. While these features excel in efficiency and long-context tasks, they could simultaneously cause the model to rely too heavily on pretraining shortcuts.

Still, this shortfall doesn’t overshadow DeepSeek V3’s overall achievements. It remains one of the most capable open-source models, proving that cutting-edge developments can flourish outside massive budgets. For developers, the main takeaway is that no single model fits every use case. If your project demands nuanced reasoning under tricky or deceptive prompts, DeepSeek V3 might need further customization—or a complementary tool—to handle those edge cases seamlessly.

On the other hand, pricing can be a decisive factor in determining a model’s best fit. Compared to its closest competitors—OpenAI's GPT-4o (at $2.50 per 1M input tokens and $10.00 per 1M output tokens) and Anthropic's Claude 3.5 Sonnet (at $3 per 1M input tokens and $15 per 1M output tokens)—DeepSeek V3’s usage costs are drastically lower: just $0.014 per 1M input tokens and $0.28 per 1M output tokens. For large-scale deployments, these savings are substantial. Whether DeepSeek V3 is the best depends on your project’s specific requirements, the tasks you need to solve, and your budget constraints.
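To make the price gap concrete, here's a quick back-of-the-envelope comparison using the per-million-token rates quoted above; the workload size (100M input and 20M output tokens per month) is an arbitrary example, not a benchmark:

```python
# Rates quoted above, in dollars per 1M tokens: (input, output).
prices = {
    "deepseek-v3": (0.014, 0.28),
    "gpt-4o": (2.50, 10.00),
    "claude-3.5-sonnet": (3.00, 15.00),
}

def monthly_cost(model, input_m=100, output_m=20):
    """Cost for a workload of input_m / output_m million tokens per month."""
    inp, out = prices[model]
    return input_m * inp + output_m * out

for model in prices:
    print(f"{model}: ${monthly_cost(model):,.2f}")
# deepseek-v3 comes to single-digit dollars; the others run into the hundreds.
```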

A New Frontier in Reasoning AI#

If DeepSeek V3 is like a factory where specialized teams seamlessly handle tasks, R1 is the self-taught prodigy in a library, debating imaginary friends to learn. Let me explain how it stands apart.

Most AI models learn like students memorizing flashcards: they rely on correct answers teachers provide. R1, however, learns like a homeschooled genius tackling puzzles with no answer key. It takes on the dual roles of both student and teacher:

  • It generates multiple solutions to a problem (e.g., Maybe 2+2=4 because...).

  • It critiques itself, finding flaws and debating the best approach (e.g., Solution B forgot carryover!).

  • It rewards logical reasoning that feels the most human-like, using GRPO (Group Relative Policy Optimization).

Think of a kid learning basketball by shooting 100 shots and analyzing which arm angles create the cleanest swishes—without a coach. While traditional AI models improve by consuming correct answers, R1 evaluates its outputs, selecting the most promising path without relying on external guidance.

Andrej Karpathy, a renowned AI researcher and former Director of AI at Tesla, emphasizes the importance of how AI learns. Imagine AI learning not just by copying answers but by experimenting and figuring things out on its own. Karpathy compares traditional AI's dependence on provided answers to students using flashcards, whereas advanced models like R1 engage in a trial-and-error process similar to how a child learns to play a game by trying different moves and learning from mistakes.

GRPO is a unique learning process where R1 organizes a competition among its outputs, selecting the most reasonable or effective solution. This self-ranking process lets R1 achieve reasoning levels on par with OpenAI o1.
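The core of GRPO's self-ranking can be sketched in a few lines: each sampled solution is scored relative to its own group's mean and spread, so no separate value model is needed. The reward numbers below are invented for illustration:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sample's reward against
    the mean and standard deviation of its own group of samples."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-spread group
    return [(r - mean) / std for r in rewards]

# Four candidate solutions to the same problem, scored by a reward function
# (e.g. 1.0 if the final answer checks out, partial credit for good format).
rewards = [1.0, 0.0, 0.5, 0.0]
advs = grpo_advantages(rewards)
print(advs)  # above-average answers get positive advantage and are reinforced
```

Solutions that beat their group average get a positive advantage and are made more likely; below-average ones are suppressed, with no human-labeled answer key involved.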


R1 benchmark performance against o1 and o1-mini

Even though DeepSeek R1 doesn’t outperform OpenAI’s o1 across all benchmarks, it offers 30x lower costs for developers—a critical factor for startups and small companies. Hosting an AI model of this caliber at a fraction of the cost allows developers to integrate powerful reasoning capabilities without stretching their budgets.

This affordability doesn’t compromise quality. Even smaller distilled R1 variants, like the 7B model, outperform much larger models on reasoning benchmarks, demonstrating how optimization can achieve more with less compute. For developers building applications that require reasoning-heavy tasks, R1 is both a high-performance and cost-effective option.

Pricing for DeepSeek and OpenAI models

Just as a self-taught musician might explore various styles and techniques to create unique music, R1’s ability to learn and adapt without strict guidance allows it to develop innovative solutions. This emergent behavior—where complex patterns arise from simple rules—is what makes R1’s reasoning capabilities both impressive and practical for real-world applications.

What About Data Protection?#

One common concern with DeepSeek R1 revolves around data protection—specifically, how user data might be used. To understand this, let’s first clarify the difference between the AI model and the app that provides access to it:

  1. The AI Model (DeepSeek R1) is a collection of matrices filled with floating-point numbers (weights). When you give it an input—like a sequence of characters—it processes this input through sequential matrix multiplications to generate an output sequence. The entire computation happens within the model; no external data transmission is required for reasoning or generation.

  2. The App (DeepSeek) is a mobile app that connects users to the AI model via a chat interface. When you use the app, your prompts and data are sent to the company’s servers.

The DeepSeek team has open-sourced the R1 model weights, allowing anyone to download and run them on their own servers. This means you can host the AI model locally or on servers in your preferred location, ensuring full control over data privacy. And because the model executes all computations internally, user data never has to leave the server, protecting sensitive information.
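The "collection of matrices" point can be made concrete with a toy forward pass. Nothing below resembles R1's real architecture, but it shows the essential property: inference is pure local computation, with no external data transmission required:

```python
import numpy as np

rng = np.random.default_rng(0)

# A language model, reduced to its essence: the weights are just matrices,
# and inference is a chain of matrix multiplications on your own machine.
W1 = rng.normal(size=(16, 32))
W2 = rng.normal(size=(32, 16))

def forward(x):
    h = np.maximum(0, x @ W1)  # one "layer": multiply, then a nonlinearity
    return h @ W2              # next layer: multiply again

x = rng.normal(size=16)        # an embedded input token
y = forward(x)                 # every step of this ran locally
print(y.shape)                 # (16,)
```

Download the weights, and this entire pipeline, scaled up enormously, runs wherever you choose to host it.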

Even other companies can use these weights to create chat interfaces, customize the model for specific tasks (e.g., code execution, web search), and even integrate additional tools. Big players like Groq already have their versions of DeepSeek V3 and R1 available on their platforms, so if you want to use the model through an established name in the industry, you can!

This setup empowers businesses and developers to leverage R1’s capabilities without worrying about their data being sent overseas, as long as the servers remain under their control.

More than a Clone#

There’s a widespread misconception that DeepSeek R1 is just a copy of OpenAI’s o1 models. This couldn’t be further from the truth and reflects a misunderstanding of how AI models are trained.

DeepSeek R1 is built on innovative methods like RL fine-tuning, which are far removed from simply mimicking outputs. The team behind R1 published a detailed paper describing DeepSeek-R1-Zero, a variant trained without supervised fine-tuning (SFT)—the process where models are typically taught using human-labeled examples.

Instead, R1 learned reasoning from scratch, using techniques like rejection sampling (filtering out poor responses) to refine its domain knowledge. By prioritizing reinforcement learning over human imitation, R1 developed reasoning skills that aren’t bound to copying human behavior or other AI models.

This fundamental difference—learning reasoning from scratch—sets R1 apart and explains why it excels in tasks that require logical deduction and problem-solving. DeepSeek R1 doesn’t just provide answers—it reasons.

Unlike most models, it generates outputs in a structured format that mirrors human deduction. This approach ensures transparent, verifiable problem-solving, making R1 ideal for mathematics, logic, and software engineering tasks. Its ability to transfer reasoning intelligence to smaller models while maintaining high performance further highlights its scalability and efficiency.

DeepSeek R1 exemplifies how AI evolves from imitation to discovery by focusing on innovation, reasoning, and adaptability. It paves the way for a future where intelligence isn’t just programmed—it’s cultivated.

What's Next for AI Development?#

DeepSeek models are a striking example of innovation under constraints, delivering world-class performance with a fraction of the resources other players deem necessary. Its open-source approach, combined with features like chain-of-thought reasoning, Mixture-of-Experts (MoE) routing, Float8 precision, and GRPO, demonstrates that there’s still enormous potential for creative solutions in AI.

While it’s not perfect—especially when confronted by trick questions—this model paves the way for more collaborative, transparent, and cost-effective advances in large language models.

Notably, DeepSeek’s success may shift the pendulum from relying heavily on fine-tuning toward prioritizing robust pretraining methodologies. Its strong performance at such a relatively low compute budget showcases how thoughtful architecture and data optimization can rival the results of massive fine-tuning efforts. As a result, major AI companies could be prompted to reevaluate where they invest the bulk of their training efforts and resources, potentially redefining the path toward the next generation of models.

A New Era of AI Innovation#

As the AI landscape continues to evolve, DeepSeek’s story is a case study in how impactful innovation doesn’t require colossal GPU clusters. Practical engineering, efficient training, and open collaboration are pushing the frontiers of what’s possible—challenging even the industry’s biggest players to keep up.

And the momentum doesn’t stop there. Just yesterday, DeepSeek unveiled Janus, its new multimodal model that promises to combine vision and language understanding into a single powerhouse system. But that’s a story for another time.


Written By:
Usama Ahmed