This post is by Salman Paracha, Co-Founder & CEO of Katanemo, building the modern infrastructure stack for AI. Before that, Salman built cloud infrastructure at AWS and Oracle, where he was at the forefront of two major technology waves: cloud computing and serverless. He is now at the forefront of the agentic AI wave.
He still considers himself a developer and focuses on building technologies that empower other developers to do more, move faster and create the next great application.
Hugging Face, an open-source community for AI projects, now hosts over 1.7M models (up from 1M in late 2024), and the pace of model releases is accelerating. On the proprietary side, OpenAI alone has shipped more than a dozen model families and versions since GPT-4, from the o-series (o1/o3/o4-mini) to GPT-5. Google continues to roll out Gemini versions behind stable aliases, xAI keeps iterating on Grok, and Amazon's Bedrock aggregates multiple leading LLMs under one platform. The takeaway is simple: model choice is exploding. And there's no sign of this trend slowing down, especially with the rise of smaller but increasingly capable models.
The image below shows the torrent of models that has hit the market over the past few years, a chaotic wall of names and versions that makes one thing clear: keeping up with model choice isn't optional, it's survival.
For developers, the real question is: how do I harness this abundance without drowning in complexity? In this post, I'll walk through (with code examples) why it's getting easier to do exactly that. All the examples are in Python, so if you have even a bit of programming experience, you'll be able to follow along and run them yourself.
While this article is primarily developer-focused, the following are key takeaways for non-developers and executive leaders: your teams can experiment faster, adopt the right model for the right scenario to deliver better user experiences, and easily keep up with the latest innovations, all without rewrites or vendor lock-in. Just as important, you can centrally govern AI usage across teams, gain visibility into which models are being used and how they perform, and establish guardrails for compliance and accountability. This ensures that AI adoption is not only fast, but also reliable, auditable, and enterprise-ready.
For anyone observing the AI landscape or building agents, the flurry of options can feel either overwhelming or empowering. For builders, I'd argue it's the latter. Choice means you're no longer stuck waiting on one provider for breakthroughs to trickle down. Instead, new intelligence is arriving at your doorstep at a pace that lets you experiment, swap in better fits, and move faster than ever before. In this post, I'll share why model choice is the closest thing to a "free lunch" in AI, and how you can take full advantage of it.
In most areas of tech, more choice usually means more complexity: different APIs, new standards to learn, and usually a steeper learning curve. With AI, the dynamic looks a bit different. The surge in model choice is mostly reducing friction for developers. Here's why:
(Near) drop-in compatibility: While the way you prompt GPT-4 vs. Claude vs. LLaMA yields different levels of performance, the atomic unit common among them is the prompt, which is more interoperable than ever. Sure, each API provider offers some unique features you might want to leverage, but the foundational block is the same (the sketch after this list shows the idea). And with the right tools and infrastructure around models, you can harness the full potential of these LLMs. We'll talk more about the right tools and infrastructure later.
Rapid iteration: New models arrive with better reasoning, faster throughput, or lower cost. You can test and adopt them quickly without waiting for a "next generation" cycle.
Fit to purpose: Instead of stretching one model to handle everything, you can select the right model for reasoning, summarization, coding, or creative tasks.
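To make the "same atomic unit" point concrete, here is a minimal sketch using the OpenAI Python client. The endpoints, keys, and model names are placeholders for illustration; the point is that the chat-style messages payload is identical whether you call a provider directly or go through an OpenAI-compatible proxy such as archgw.

```python
# Minimal sketch: the prompt (a list of chat messages) is the common atomic unit.
# Endpoints, keys, and model names below are illustrative placeholders.
from openai import OpenAI

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize this support thread in three bullets."},
]

# Direct to a provider...
openai_client = OpenAI(base_url="https://api.openai.com/v1", api_key="sk-...")
resp_direct = openai_client.chat.completions.create(model="gpt-4o-mini", messages=messages)

# ...or through an OpenAI-compatible proxy (e.g., archgw) using an intent-based alias.
proxy_client = OpenAI(base_url="http://localhost:12000/v1", api_key="n/a")
resp_proxied = proxy_client.chat.completions.create(model="arch.summarize.v1", messages=messages)
```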
The upshot: more choice doesn't mean more lock-in; it means more leverage. Developers get access to state-of-the-art intelligence at the pace it's invented, with less overhead than ever before. If "more model choice" is the opportunity, the way you turn it into day-to-day leverage is surprisingly simple: pair lightweight testing with an intelligent proxy server. No new SDKs, no rewrites, just two thin layers that make swapping and scaling safer and faster.
To take advantage of the near-constant stream of new models, you need an adoption infrastructure. That infrastructure consists of two parts:
Part 1 - Testing infrastructure: Keep a tiny, templatized prompt set per task and run it against a few candidate models. Validate with deterministic checks (factual anchors, schema) and sprinkle in an LLM-as-judge or a quick human review where needed; a minimal judge sketch follows this two-part list. The goal is a fast, repeatable signal to pick the right model.
Part 2 - Network infrastructure: Put a lightweight, high-performance proxy server (e.g., archgw) between your agents and LLM providers. Your app still owns prompts, but the proxy server gives you consistent APIs, metrics, guardrails (JSON/schema checks), and a single control point for model choice. Instead of hardcoding vendor names, this allows you to call intent-based aliases like arch.summarize.v1. Behind the scenes, the proxy can route that alias to the right model, shift traffic during canary tests, or fall back automatically on errors.
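Since Part 1 mentions an LLM-as-judge layer, here is a minimal sketch of what that could look like, assuming the same OpenAI-compatible proxy endpoint used later in this post. The judge prompt, the choice of arch.reason.v1 as the judging model, and the pass/fail JSON shape are assumptions for illustration, not a prescribed setup.

```python
# A minimal LLM-as-judge sketch (an optional layer on top of deterministic checks).
# The judge model alias, prompt, and verdict schema here are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:12000/v1", api_key="n/a")  # archgw endpoint

def judge(task_input: str, candidate_output: str) -> bool:
    resp = client.chat.completions.create(
        model="arch.reason.v1",  # any capable model alias can act as the judge
        messages=[
            {"role": "system", "content": 'Grade the summary. Reply with JSON: {"pass": true|false, "reason": "..."}'},
            {"role": "user", "content": f"Input:\n{task_input}\n\nCandidate summary:\n{candidate_output}"},
        ],
        response_format={"type": "json_object"},
    )
    verdict = json.loads(resp.choices[0].message.content or "{}")
    return bool(verdict.get("pass", False))
```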
How do aliases and routing work?
Think of an alias as a handle your code always calls. Routing is the set of rules that decide what actually happens when that handle is invoked: which model to send the request to, how to split traffic across candidates, and what to do if something fails. Together, aliases and routing make model selection programmable, safe, and observable.
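To make that concrete, here is a toy Python sketch of what a proxy conceptually does when it resolves an alias. This is not archgw's implementation or configuration format; the routing table, weights, and fallback behavior are made up purely to illustrate the idea.

```python
# Toy sketch of alias resolution: traffic splitting plus fallback (illustrative only,
# not archgw's actual implementation).
import random

ROUTES = {
    "arch.summarize.v1": {
        "split": [("gpt-4o-mini", 0.9), ("o3", 0.1)],  # e.g., 10% canary traffic
        "fallback": "gpt-4o-mini",
    },
}

def resolve(alias: str) -> str:
    """Pick a concrete model for an alias according to its traffic split."""
    roll, cumulative = random.random(), 0.0
    for model, weight in ROUTES[alias]["split"]:
        cumulative += weight
        if roll <= cumulative:
            return model
    return ROUTES[alias]["fallback"]

def call_with_fallback(alias: str, request_fn):
    """Send the request to the routed model; on error, retry once against the fallback."""
    try:
        return request_fn(resolve(alias))
    except Exception:
        return request_fn(ROUTES[alias]["fallback"])
```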
Let's walk through the setup. It's just two steps.
Write test fixtures per task: Keep 10-20 real examples for a single task (say, summarization). Each fixture has an input and a couple of must_include anchors. Optionally, specify a JSON schema.
```yaml
# evals_summarize.yaml
task: summarize
fixtures:
  - id: sum-001
    input: "Thread about a billing dispute…"
    must_include: ["invoice"]
    schema: SummarizeOut
  - id: sum-002
    input: "Thread about a shipping delay…"
    must_include: ["status"]
    schema: SummarizeOut
  ...
```
Pick a few candidate models: As you're routing through archgw, you don't need provider API keys in your test harness. Just point your client at the proxy server and list the models (or aliases like arch.summarize.v1) you want to evaluate your tasks against.
Now, let's build a minimal Python harness that we can use to test different models.
```python
# bench.py
import json, time, yaml, statistics as stats
from pydantic import BaseModel, ValidationError
from openai import OpenAI

# archgw endpoint (keys are handled by archgw)
client = OpenAI(base_url="http://localhost:12000/v1", api_key="n/a")

MODELS = ["arch.summarize.v1", "arch.reason.v1"]
FIXTURES = "evals_summarize.yaml"

# Expected output shape
class SummarizeOut(BaseModel):
    title: str
    bullets: list[str]
    next_actions: list[str]

def load_fixtures(path):
    with open(path, "r") as f:
        return yaml.safe_load(f)["fixtures"]

def must_contain(text: str, anchors: list[str]) -> bool:
    t = text.lower()
    return all(a.lower() in t for a in anchors)

def schema_fmt(model: type[BaseModel]):
    return {"type": "json_object"}  # Simplified for broad compatibility

def run_case(model, fx):
    t0 = time.perf_counter()
    schema = SummarizeOut.model_json_schema()
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": f"Be concise. Output valid JSON matching this schema:\n{json.dumps(schema)}"},
            {"role": "user", "content": fx["input"]},
        ],
        response_format=schema_fmt(SummarizeOut),
    )
    dt = time.perf_counter() - t0
    content = resp.choices[0].message.content or "{}"
    passed, reasons = True, []
    try:
        data = json.loads(content)
    except Exception:
        return {"ok": False, "lat": dt, "why": "json decode"}
    try:
        SummarizeOut(**data)
    except ValidationError:
        passed = False
        reasons.append("schema")
    if not must_contain(json.dumps(data), fx.get("must_include", [])):
        passed = False
        reasons.append("anchors")
    return {"ok": passed, "lat": dt, "why": ";".join(reasons)}

def main():
    fixtures = load_fixtures(FIXTURES)
    for model in MODELS:
        results = [run_case(model, fx) for fx in fixtures]
        ok = sum(r["ok"] for r in results)
        total = len(results)
        latencies = [r["lat"] for r in results]
        print(f"\n››› {model}")
        print(f" Success: {ok}/{total} ({ok/total:.0%})")
        if latencies:
            avg_lat = stats.mean(latencies)
            p95_lat = stats.quantiles(latencies, n=100)[94]
            print(f" Latency (ms): avg={avg_lat*1000:.0f}, p95={p95_lat*1000:.0f}")

if __name__ == "__main__":
    main()
```
What "good enough" could look like for your test cases (the sketch after this list shows one way to turn these thresholds into an automated gate):
≥90% schema-valid
≥80% anchors present
Latency within your SLO
Cost within budget
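Here is a small sketch of how those thresholds could become an automated gate over the bench.py results above. The threshold values and the latency SLO are placeholders, and cost is left out because the harness does not track token usage.

```python
# Sketch: turn the "good enough" thresholds into a pass/fail gate over bench.py results.
# Threshold values and the latency SLO are placeholders; cost is omitted because the
# harness above does not track token usage.
import statistics as stats

def good_enough(results, min_schema_ok=0.90, min_anchor_ok=0.80, slo_p95_s=1.0):
    total = len(results)
    schema_ok = sum(1 for r in results if "schema" not in r["why"] and "json decode" not in r["why"]) / total
    anchor_ok = sum(1 for r in results if "anchors" not in r["why"]) / total
    p95_latency = stats.quantiles([r["lat"] for r in results], n=100)[94]
    return schema_ok >= min_schema_ok and anchor_ok >= min_anchor_ok and p95_latency <= slo_p95_s
```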
Before we can run the harness, we need to set up archgw (the smart edge and LLM proxy for agents). Why bother with a proxy at all? It's about giving you a consistent API surface across providers, a unified view of logs and traces for every model call, centralized key management, and a single abstraction layer for model choice.
The key abstraction in the proxy is the model alias. Instead of wiring your app to vendor-specific names like gpt-4o-mini or claude-3-sonnet, you wire it to intent with aliases such as arch.summarize.v1 or arch.reason.v1. Your code always calls the alias; the proxy decides what model runs underneath, how traffic is shaped, and what fallback rules apply. Aliases may look like simple labels, but they encode intent rather than vendor details, and that shift unlocks several practical advantages:
Decoupled application code: Instead of hardcoding gpt-4o-mini or claude-3-sonnet throughout your codebase, you reference a single alias. Swap the underlying model once in config, and your entire app updates: no redeploys, no hunt-and-replace. (The snippet after this list shows what the call site looks like.)
Safe promotions: Want to test a new model? Point the alias at a candidate behind the scenes. If your canary tests or metrics hold, flip the pointer to 100% traffic. If not, roll back instantly without touching application code.
Central governance: Apply SLOs, guardrails, and policies at the alias level. For example, enforce JSON schema validation, timeout rules, or max token caps once, instead of scattering that logic across services.
Observability by task: Dashboards and traces group naturally by alias (arch.summarize.v1) rather than by shifting provider names. That makes it easier to track latency, cost, and error rates by intent, not by vendor string.
Quota & cost control: Throttle or cap usage per alias. For instance, you can enforce a daily budget on summarization tasks, regardless of which model powers them underneath.
Here's the simple archgw config that you will need to run the test harness:
```yaml
version: v0.1.0

listeners:
  egress_traffic:
    address: 0.0.0.0
    port: 12000
    message_format: openai
    timeout: 30s

llm_providers:
  - model: openai/gpt-4o-mini
    access_key: $OPENAI_API_KEY
    default: true
  - model: openai/o3
    access_key: $OPENAI_API_KEY

model_aliases:
  arch.summarize.v1:
    target: gpt-4o-mini
  arch.reason.v1:
    target: o3
```
```bash
## export OPENAI_API_KEY="sk-..."
## Please install Poetry: https://python-poetry.org/docs/#installation if not installed
## Install all dependencies as described in the main Arch README ([link]...
```
Run the test: Execute the Python benchmark script as follows:
```bash
python bench.py
```
After running the command above, you should see an output like this, giving you a clear signal on which model alias performs best for your task:
```
››› arch.summarize.v1
 Success: 16/20 (80%)
 Latency (ms): avg=450, p95=750

››› arch.reason.v1
 Success: 19/20 (95%)
 Latency (ms): avg=850, p95=1300
```
From the results above, you can quickly see that arch.summarize.v1 (GPT-4o Mini) is faster but less reliable, while arch.reason.v1 (o3) is slower but more reliable. This is simple, actionable data you can use to decide which models to use in your agentic application. And with the proxy in place, you can flip the switch to a different model tomorrow without a single line of code change in your application.
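For example, promoting the summarization alias to o3 is a one-line change to the config shown earlier, assuming the same config format; nothing in the application changes.

```yaml
# Re-point the alias in the archgw config from earlier (config-only change)
model_aliases:
  arch.summarize.v1:
    target: o3          # was: gpt-4o-mini
  arch.reason.v1:
    target: o3
```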
With the rate at which models are being launched, there is no time to be overwhelmed. With a handful of fixtures (inputs + anchors + optional schema), a tiny Python harness, and a lightweight proxy like archgw, you can turn the buffet of model choices into a superpower.
Here is a recap of how to quickly consider, test, and adopt new models without being overwhelmed:
Evaluate quickly: Run fixtures across a few candidates, get a clear pass/fail signal, and decide what's "good enough" for your use case.
Swap safely: Point your app at an alias (arch.summarize.v1) and update the underlying model in config, with no redeploys and no code changes.
Stay consistent: Centralize guardrails, metrics, and governance at the proxy layer so your application code stays clean.
This is about as close to a free lunch as you can get with AI. The ecosystem keeps producing better, faster, cheaper models, and you can actually take advantage of them without rewrites or lock-in. Drop a proxy in the middle, and the payoff compounds as the model landscape keeps shifting.
If this playbook resonates, check out archgw on GitHub, give it a star, and hop into our Discord to be part of our community. It's early days, but we're rethinking the network stack for agents: building an open-source edge and AI gateway, purpose-built for agentic apps. With routing, guardrails, and observability built in, you can focus on your agents' business logic while we ensure they run reliably in production.