Alibaba’s ZeroSearch Magic: Zero API Costs, Maximum Insight
What if your LLM could search like Google?
Most large language models (LLMs) struggle to answer questions about recent events or obscure facts. Traditional fixes like retrieval-augmented generation (RAG) rely on external search APIs, which rack up costs and introduce new points of failure. Enter ZeroSearch, Alibaba’s internal search simulation framework that eliminates API dependencies while boosting answer accuracy.
ZeroSearch trains LLMs to simulate a search engine internally. By generating both relevant and noisy documents during training, the model learns to retrieve and synthesize grounded answers using only its pretraining memory.
In this post, we’ll explore:
Why traditional RAG and RL-based search incur heavy costs
How ZeroSearch works under the hood
What kind of performance gains it offers
Limitations and future directions for internal tool simulation
Let's get started.
Before ZeroSearch: RAG, RL, and rising API costs
LLMs only know what they learned up to their training cutoff. Ask about anything that happened after that date, and they may fill the gap with made-up (hallucinated) or outdated information, which undermines trust in real-world use.
Retrieval-augmented generation (RAG) tackles this by letting models use external knowledge to ground their answers.
Early RAG work relied on carefully crafted prompts to break questions into queries, decompose complex requests, and stitch together information across multiple steps. That design requires relentless prompt tuning and strong reasoning by the model.
Later methods used supervised fine-tuning (SFT) to strengthen smaller models and even applied Monte Carlo tree search (MCTS) at inference to explore more answer paths. (MCTS is an algorithm that builds a decision tree through iterative random simulations and selects actions based on their estimated outcomes.) Those techniques boosted accuracy but came with heavy computational overhead.
Reinforcement learning (RL) has also emerged as a way to teach models when and how to fetch information based purely on reward signals. Some approaches hook into live search engines during training to mimic real web queries. In that setup, the quality of returned documents can be wildly unpredictable, and the cost of making hundreds of thousands of API calls quickly becomes a financial roadblock that limits scale.
How does ZeroSearch work?
Let’s break down the steps:
1. Search engine
A separate LLM is fine-tuned to act as a search engine simulator. First, a training dataset is generated using a three-step template—think, search, answer—where the model produces candidate documents labeled positive or negative based on retrieval accuracy. After collecting a large set of these query–document examples, including both relevant and noisy results, the LLM is fine-tuned to enhance its ability to generate realistic search-style outputs. Here is the training template used during supervised fine-tuning and inference:
Training Template:

> Answer the given question. You must conduct reasoning inside <think> and </think> first every time you get new information. After reasoning, if you find you lack some knowledge, you can call a search engine by <search> query </search>, and it will return the top searched results between <information> and </information>. You can search as many times as you want. If you find no further external knowledge, you can directly provide the answer inside <answer> and </answer> without detailed illustrations. For example, <answer> Beijing </answer>. Question:
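Concretely, the interaction this template describes can be sketched as a simple rollout loop. This is a minimal sketch, not the paper's implementation: `policy_generate` and `simulate_search` are hypothetical stand-ins for calls to the policy model and the simulation LLM.

```python
import re

def extract_tag(text, tag):
    """Return the content of the last <tag>...</tag> span in text, or None."""
    matches = re.findall(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return matches[-1].strip() if matches else None

def rollout(question, policy_generate, simulate_search, max_turns=4):
    """Alternate between the policy model and the simulated search engine
    until the policy emits an <answer> or the turn budget runs out."""
    transcript = f"Question: {question}\n"
    for _ in range(max_turns):
        step = policy_generate(transcript)        # emits <think>/<search>/<answer>
        transcript += step
        answer = extract_tag(step, "answer")
        if answer is not None:                    # terminal: final answer produced
            return answer, transcript
        query = extract_tag(step, "search")
        if query is not None:                     # feed back simulated documents
            docs = simulate_search(query)
            transcript += f"\n<information> {docs} </information>\n"
    return None, transcript
```

The key point is that the "search engine" on the other side of the `<search>` tag is just another LLM call, so the whole loop runs without any external API.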
Fine-tuning for realistic search results
The model is given two slightly different prompts, one to produce relevant documents and one to generate noisy documents. It can produce each type on demand by tweaking just a few words. Each prompt also embeds example input–output pairs to strengthen the LLM’s retrieval knowledge. Here is the template for the search simulation:
Template for Search Simulation:

> You are the Google Search engine. Given a query, you need to generate five [useful/noisy] documents for the query. The user is trying to answer the question: [question] whose answer is [ground truth]. Each document should contain about 30 words, and these documents should contain [useful/noisy] information. Query: [query] [Useful/Noisy] Output:
By the end of this stage, the simulation LLM can generate both useful and noisy search results on demand for the main LLM.
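Because the two prompt variants differ in only a few words, a single helper can assemble either one from the template above. This is a sketch: the function name, argument layout, and the generalized document count are assumptions.

```python
def build_sim_prompt(query, question, ground_truth, useful=True, n_docs=5):
    """Fill the search-simulation template for a useful or noisy generation."""
    kind = "useful" if useful else "noisy"
    return (
        f"You are the Google Search engine. Given a query, you need to "
        f"generate {n_docs} {kind} documents for the query. "
        f"The user is trying to answer the question: {question} "
        f"whose answer is {ground_truth}. "
        f"Each document should contain about 30 words, and these documents "
        f"should contain {kind} information. "
        f"Query: {query} {kind.capitalize()} Output:"
    )
```

Flipping a single boolean switches the simulator between clean and noisy retrieval, which is exactly the control knob the curriculum in the next step relies on.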
2. Curriculum rollout
With the simulation LLM calibrated to produce both clean and noisy results on demand, the policy model is ready to learn how to use those documents effectively. The main LLM is trained via reinforcement learning. Early in training, it receives high-quality documents generated by the simulation LLM. A curriculum rollout then gradually increases the probability of noisy documents according to:
$$p_i = p_s + \frac{b^{\,i/m} - 1}{b - 1}\,(p_e - p_s)$$

Where $p_s$ and $p_e$ are the starting and ending noise probabilities, $i$ is the current training step, $m$ is the total number of training steps, and $b$ is the base of the exponential, which controls how quickly the noise level ramps up.
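A schedule of this kind can be sketched in a few lines. The exponential shape follows the curriculum described above; the specific constants used here are assumptions, not the paper's defaults.

```python
def noise_probability(step, total_steps, p_start=0.0, p_end=0.5, base=4.0):
    """Probability of serving noisy documents at a given training step.

    Ramps from p_start to p_end over total_steps; `base` controls how
    sharply the noise is back-loaded toward the end of training.
    """
    frac = (base ** (step / total_steps) - 1) / (base - 1)
    return p_start + frac * (p_end - p_start)

# Early training: almost all documents are clean; noise grows toward the end.
schedule = [noise_probability(i, 200) for i in (0, 50, 100, 150, 200)]
```

Because the ramp is exponential rather than linear, the policy spends most of its early steps on clean documents and only faces heavily degraded retrieval once its basic search behavior is in place.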
The policy model is trained using three reinforcement learning algorithms: proximal policy optimization (PPO), group relative policy optimization (GRPO), and REINFORCE++.
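To give a flavor of one of these, GRPO replaces PPO's learned value baseline with a group-relative one: rewards for several rollouts of the same prompt are normalized against each other. The sketch below is a simplification; real implementations add clipping, KL penalties, and token-level credit assignment.

```python
def grpo_advantages(rewards):
    """Group-relative advantages: normalize each rollout's reward against
    the mean and standard deviation of its own group (same prompt)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # avoid division by zero when all rewards match
    return [(r - mean) / std for r in rewards]
```

Rollouts that beat their group's average get a positive advantage and are reinforced; rollouts below it are suppressed, with no separate critic network needed.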
Sample of relevant and noisy document outputs
Below is a sample of documents generated by the simulation LLM in response to the query. The table contrasts the useful outputs, which contain relevant information about Óskar Jónasson’s birth date, with the noisy outputs, which are off-topic or irrelevant.
Query details:

| Component | Content |
| --- | --- |
| Question | When was the director of film Reykjavík-Rotterdam born? |
| Answer | June 30, 1963 |
| Query | Óskar Jónasson birth date |
Retrieved documents:

| Doc ID | Useful Output | Noisy Output |
| --- | --- | --- |
| 1 | Óskar Jónasson is a television and film director from Iceland. His feature debut, Reykjavík-Rotterdam, won the Golden Bear at the Berlin Festival. | The first single, “Láta ég allt,” was released on March 18, 2022. He wrote it along with Óskar Jónasson and is accompanied by a music video. |
| 2 | Director of Reykjavík-Rotterdam and Silence of My Father, Óskar Jónasson, was born on June 30, 1963, in Reykjavík, Iceland. He is a graduate of the Iceland Academy. | The children of the first generation include Ásgeir Trausti (born 1984) and Amiina (formed 2005), who also introduced the world. |
| 3 | A documentary about Iceland’s aborted attempt to build a nuclear power plant in the 1970s explains why the nation became largely nuclear-free. | Óskar Jónasson · 2022 · Iceland · Tracks: 13 · Genres: Pop · Languages: Icelandic. |
| 4 | Director of Reykjavík-Rotterdam and Silence of My Father, Óskar Jónasson, was born on June 30, 1963, in Reykjavík, Iceland. He is a graduate of the Iceland Academy. | Óskar Jónasson · Location: Reykjavík, Iceland · Industry: Music · Looking for: Jobs · Experience: 7 years · Education: 4 years. |
| 5 | Icelandic filmmaker Óskar Jónasson talks about his new film, the Golden Bear-winning Reykjavík-Rotterdam, at the Berlinale this year. | Jónasson was born in Iceland, and his family moved to Sweden when he was a child. He graduated from Stockholm University with a degree in […]. |
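For illustration, a naive way to separate outputs like these is to check whether a document contains the ground-truth answer string. This containment heuristic is only a stand-in for the labeling described earlier, which scores documents by retrieval accuracy.

```python
def label_documents(docs, ground_truth):
    """Naive labeling heuristic: mark a document 'useful' if it contains
    the ground-truth answer string, else 'noisy'."""
    return [
        ("useful" if ground_truth.lower() in doc.lower() else "noisy", doc)
        for doc in docs
    ]
```

On the sample above, documents 2 and 4 contain the birth date and would be kept as positives; the music-catalog entries would land in the noisy pool.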
ZeroSearch vs. Baselines: Benchmark results
Across seven diverse QA benchmarks, covering in-domain tasks such as NQ and HotpotQA and out-of-domain challenges like TriviaQA and Bamboogle, ZeroSearch consistently outperforms every baseline it was tested against, demonstrating its robustness across varied scenarios.
The following table compares ZeroSearch (base and instruction-tuned) with several baseline methods across seven QA datasets for Qwen-2.5-7B-Base/Instruct, showing that ZeroSearch (instruction-tuned) achieves the highest average score of 40.54 and that ZeroSearch variants lead on six of the seven tasks.
| Method | NQ | TriviaQA | PopQA | HotpotQA | 2Wiki | Musique | Bamboogle | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Direct Answer | 11.60 | 35.60 | 1.20 | 16.40 | 22.20 | 4.80 | 14.40 | 15.17 |
| CoT | 12.80 | 35.60 | 3.80 | 16.20 | 22.60 | 6.60 | 24.00 | 17.37 |
| RAG | 27.40 | 58.20 | 17.80 | 25.80 | 23.20 | 9.40 | 16.80 | 25.51 |
| RA-Agent | 21.20 | 40.20 | 8.80 | 19.60 | 19.60 | 7.60 | 28.00 | 20.71 |
| Search-o1 | 19.40 | 40.60 | 11.40 | 17.00 | 27.00 | 8.60 | 30.40 | 22.06 |
| R1-base | 25.15 | 43.18 | 22.29 | 21.02 | 28.46 | 9.76 | 24.80 | 24.95 |
| R1-instruct | 25.25 | 42.68 | 27.81 | 20.45 | 26.83 | 8.33 | 27.05 | 25.49 |
| Search-R1-base | 41.51 | 60.53 | 51.02 | 32.25 | 36.31 | 16.39 | 28.00 | 38.00 |
| Search-R1-inst | 41.46 | 62.17 | 49.80 | 34.55 | 34.22 | 19.43 | 33.06 | 39.24 |
| ZeroSearch-base | 41.84 | 63.54 | 51.72 | 30.30 | 40.33 | 12.25 | 30.25 | 38.61 |
| ZeroSearch-inst | 43.24 | 61.81 | 51.52 | 29.21 | 43.12 | 19.72 | 35.20 | 40.54 |
ZeroSearch still pulls ahead, underscoring its promise as a low-cost alternative to live retrieval in large-scale reinforcement learning.
ZeroSearch’s advantages apply across different model families and sizes. It delivers strong results for both base and instruction-tuned variants and shows further improvements as model size increases, illustrating its broad applicability and scalability. Performance of simulated search engines using different LLM configurations is shown in the table below:
| Search Engine | NQ | TriviaQA | PopQA | HotpotQA | 2Wiki | Musique | Bamboogle | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Prompt-3B | 35.77 | 52.74 | 41.62 | 25.76 | 26.02 | 8.47 | 10.57 | 28.71 |
| Prompt-7B | 39.71 | 57.93 | 38.24 | 29.47 | 27.64 | 7.93 | 7.26 | 29.74 |
| Prompt-14B | 40.40 | 58.82 | 37.78 | 26.26 | 29.01 | 10.14 | 15.45 | 31.12 |
| SFT-3B | 42.03 | 59.68 | 44.22 | 29.18 | 30.24 | 10.41 | 11.29 | 32.44 |
| SFT-7B | 41.70 | 61.18 | 46.46 | 30.66 | 28.98 | 11.76 | 10.66 | 33.06 |
| SFT-14B | 41.21 | 61.49 | 43.99 | 31.02 | 33.20 | 12.58 | 14.29 | 33.97 |
| Google Search | 41.13 | 61.22 | 40.73 | 27.64 | 31.97 | 12.17 | 12.40 | 32.47 |
From the table, it’s clear that SFT-7B matches Google Search’s performance, and SFT-14B even exceeds it, proving that a well-tuned LLM can stand in for a live engine.
Bigger simulation LLMs keep improving, too: as size increases, they offer more accurate retrievals and support stronger curriculum learning.
The moment of truth: GPU vs. API costs
ZeroSearch swaps commercial API fees for GPU infrastructure costs by running simulation LLMs on GPU servers. Using Qwen-2.5-7B with a batch size of 64, five rollout repetitions, and 200 training steps (approximately 12 hours and 64,000 search requests), the table below shows that:
One A100 GPU for SFT-3B costs $17.7
Two GPUs for SFT-7B cost $35.4
Four GPUs for SFT-14B cost $70.8
Compared with $586.7 in Google Search API charges for the same 64,000 requests
GPU costs are far lower than API costs; however, GPU utilization is uneven, peaking during rollouts and dropping during policy updates. Sharing a simulation server across multiple RL jobs can reduce idle time and cut costs further, and offering simulation LLMs of various sizes lets teams balance performance against resource use.
| Simulation LLM Size | A100 GPUs | GPU Cost (USD) | Google Search API Cost (USD) |
| --- | --- | --- | --- |
| SFT-3B | 1 | 17.7 | 0 |
| SFT-7B | 2 | 35.4 | 0 |
| SFT-14B | 4 | 70.8 | 0 |
| None | 0 | 0 | 586.7 |
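The reported GPU figures follow from simple arithmetic. The sketch below reproduces them; the hourly A100 rate is an assumption inferred from the numbers in the post, not a quoted price.

```python
GPU_HOURLY_RATE = 1.475       # assumed USD per A100-hour, inferred from the post
TRAIN_HOURS = 12              # one training run of 200 steps
API_COST_PER_1K = 586.7 / 64  # implied API price per 1,000 of the 64,000 queries

def gpu_cost(num_gpus, hours=TRAIN_HOURS, rate=GPU_HOURLY_RATE):
    """Total GPU cost in USD for one training run."""
    return round(num_gpus * hours * rate, 1)

# SFT-3B (1 GPU), SFT-7B (2 GPUs), SFT-14B (4 GPUs)
costs = [gpu_cost(n) for n in (1, 2, 4)]
```

Even the largest simulator (four A100s) comes in at roughly an eighth of the API bill for a single run, and the gap widens with every additional run, since the GPUs can be reused.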
Limits of simulated search
ZeroSearch depends entirely on the simulation LLM’s pretraining memory. Any request for information introduced after the model’s cutoff date or for highly specialized topics not seen in its training corpus will exceed its capabilities. For example, asking “Who won the 2025 Booker Prize?” may produce a plausible but incorrect answer or revert to previous winners instead of the actual recipient. Typical failure scenarios include breaking news events, ultra-niche technical findings, real-time financial data, and live sports scores. When up-to-the-minute accuracy or truly novel facts are required, a live search engine or external data source remains necessary.
Isn’t the knowledge cutoff still the ultimate limit, even with simulated search?
Large language models only know what they saw during pretraining, leaving them prone to hallucinations or stale facts when asked about anything beyond their cutoff date. ZeroSearch taps into that frozen knowledge more effectively by training the model to retrieve from its memory, surfacing relevant passages to ground its answers. This internal retrieval sharpens reasoning, reduces made-up content, and delivers Google-grade performance without live lookup fees. Even though it cannot fetch new information, ZeroSearch makes the most of what the model already knows, boosting reliability and accuracy across various tasks.
What lies beyond simulating search engines?
Simulating a search engine is only the beginning. This approach can internalize services such as code execution, translation, summarization, and structured data extraction.
For example:
An LLM fine-tuned as an internal Python interpreter could debug and optimize scripts.
Another could act as a multilingual translation engine, converting text in dozens of languages offline.
A summarization module could condense lengthy meeting transcripts into concise briefs.
Knowledge graph generators could map complex relationships in legal or medical documents.
Chaining these simulated tools, the model could orchestrate workflows from data analysis to personalized tutoring without external dependencies or unpredictable costs.
A future in simulated search
ZeroSearch transforms a model’s frozen knowledge into a dynamic in-house retrieval engine, cutting API costs and boosting answer quality. It delivers robust, scalable performance across diverse tasks by blending supervised fine-tuning, curriculum rollout, and reinforcement learning. By making every byte of pretraining count, ZeroSearch empowers LLMs to act like self-sufficient research assistants even with a static knowledge base.
Simulated search is just one part of the fine-tuning frontier. If you're ready to train models that reason better, retrieve smarter, and adapt faster, check out our hands-on course below.
This hands-on course will teach you the art of fine-tuning large language models (LLMs). You will also learn advanced techniques like Low-Rank Adaptation (LoRA) and Quantized Low-Rank Adaptation (QLoRA) to customize models such as Llama 3 for specific tasks. The course begins with fundamentals, exploring fine-tuning, the types of fine-tuning, comparison with pretraining, discussion on retrieval-augmented generation (RAG) vs. fine-tuning, and the importance of quantization for reducing model size while maintaining performance. Gain practical experience through hands-on exercises using quantization methods like int8 and bitsandbytes. Delve into parameter-efficient fine-tuning (PEFT) techniques, focusing on implementing LoRA and QLoRA, which enable efficient fine-tuning using limited computational resources. After completing this course, you’ll master LLM fine-tuning, PEFT fine-tuning, and advanced quantization parameters, equipping you with the expertise to adapt and optimize LLMs for various applications.