Smart LLM Routing on a Budget: A NadirClaw Q&A Guide
NadirClaw provides a cost-effective way to route prompts to different language models without sacrificing quality. Instead of always hitting an expensive pro model, it uses a lightweight local classifier to decide if a prompt is simple or complex. Simple tasks stay with a cheap local model, while complex ones are forwarded to a more capable model like Gemini. This Q&A covers setup, local classification, live routing, and cost comparisons so you can build your own smart routing system.
What is NadirClaw and how does it make LLM routing cost-aware?
NadirClaw acts as an intelligent proxy layer between you and various language models. It inspects every incoming prompt and determines its complexity using a purely local, zero‑LLM‑call classifier. The classifier relies on pre‑computed centroid vectors that represent “simple” and “complex” tasks. When a new prompt arrives, NadirClaw calculates its similarity to each centroid. If the prompt is close to the simple centroid, it routes to a lightweight, inexpensive model – often a local one that runs without API costs. If it lands near the complex centroid, the system forwards the prompt to a more powerful and expensive model like Gemini. This way, you only pay full price for the hardest queries, dramatically reducing overall costs while maintaining response quality.
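Conceptually, the routing decision boils down to a few lines of comparison logic. The sketch below is illustrative rather than NadirClaw's actual source: the model names are placeholders, and it assumes you already have the prompt embedding and the two centroid vectors as numpy arrays.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def route(prompt_embedding, simple_centroid, complex_centroid):
    """Pick a model tier by comparing the prompt to the two centroids."""
    sim_simple = cosine_sim(prompt_embedding, simple_centroid)
    sim_complex = cosine_sim(prompt_embedding, complex_centroid)
    # Closer to the simple centroid -> cheap local model; otherwise -> Gemini.
    return "local-small-model" if sim_simple >= sim_complex else "gemini-pro"
```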

How do I install NadirClaw and set up the required environment?
To get started, you need Python and a few supporting libraries. Install NadirClaw along with packages for embeddings, plotting, and API calls using pip. After installation, import the necessary modules – this includes numpy, pandas, matplotlib, and requests. You also need a Gemini API key if you plan to use live routing. The system securely captures this key from the environment or via a hidden input prompt. If you skip the key, the local classifier sections still work perfectly; you just won’t be able to test live forwarding to Gemini. A boolean flag later decides whether to enable the live routing parts of the code.
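A minimal setup cell might look like the following. The PyPI package names and the GEMINI_API_KEY environment variable name are assumptions, so adjust them to match your installation.

```python
# pip install nadirclaw numpy pandas matplotlib requests   # package names assumed
import os
from getpass import getpass

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests

# Grab the Gemini key from the environment, or prompt for it without echoing.
api_key = os.environ.get("GEMINI_API_KEY") or getpass("Gemini API key (blank to skip): ")

# Live routing to Gemini is optional; the local classifier works without it.
ENABLE_LIVE_ROUTING = bool(api_key)
```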
How does the local prompt classifier work without making any live API calls?
The local classifier uses the command nadirclaw classify with a --format json flag. You feed it a prompt, and it returns a JSON object indicating whether the prompt is simple or complex – all computed offline. The classifier never contacts an external model; it compares the prompt’s embedding against two centroid vectors stored inside the NadirClaw package. These centroids were created by averaging many embeddings of known simple and complex queries. By measuring cosine similarity, the classifier decides which centroid the new prompt is closer to. This means you can test dozens of prompts and see the classifications instantly, making it ideal for experimentation without incurring any LLM cost.
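From Python you can wrap the CLI with subprocess to batch-classify prompts. Note the assumptions here: that the prompt is passed as a positional argument and that the JSON output parses into a small dict; check both against your NadirClaw version.

```python
import json
import subprocess

def classify(prompt: str) -> dict:
    """Run the local NadirClaw classifier and parse its JSON output."""
    result = subprocess.run(
        ["nadirclaw", "classify", prompt, "--format", "json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

for p in ["What is 2+2?", "Design a distributed event-sourced order pipeline"]:
    print(p, "->", classify(p))
```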
Why are centroid vectors important, and how do they separate simple from complex tasks?
Centroid vectors are the mathematical heart of the routing decision. Think of them as the archetypal “simple” and “complex” prompt embeddings. During development, NadirClaw computed these centroids from a large set of labeled examples. When you embed a new prompt, its vector lands somewhere in embedding space. The system calculates its similarity to each centroid – usually using cosine similarity. Simple tasks (e.g., “What is 2+2?”) cluster near the simple centroid, while complex tasks (e.g., “Design a distributed event‑sourced order pipeline…”) lie close to the complex centroid. You can inspect these centroids directly, embed your own prompts, and even plot the similarity scores to see the separation visually. This transparency helps you understand and trust the routing logic before it goes live.
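To see the separation yourself, embed a handful of prompts and score each one against both centroids. This sketch uses sentence-transformers as a stand-in embedding model and loads the centroids from hypothetical .npy files; substitute whatever embedding model and centroid files your NadirClaw installation actually ships.

```python
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

# Placeholder paths: load the two centroid vectors shipped with NadirClaw.
simple_centroid = np.load("simple_centroid.npy")
complex_centroid = np.load("complex_centroid.npy")

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in embedding model; NadirClaw's own embedder may differ.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

prompts = [
    "What is 2+2?",
    "Summarize this paragraph in one sentence.",
    "Design a distributed event-sourced order pipeline with exactly-once semantics.",
]

rows = []
for p in prompts:
    vec = embedder.encode(p)
    rows.append({
        "prompt": p,
        "sim_simple": cosine_sim(vec, simple_centroid),
        "sim_complex": cosine_sim(vec, complex_centroid),
    })

scores = pd.DataFrame(rows)
scores["tier"] = np.where(scores.sim_simple >= scores.sim_complex, "simple", "complex")
print(scores)
```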

How can I visualize similarity scores and adjust confidence thresholds?
After embedding your prompts and computing similarity to the two centroids, you can create a scatter plot or bar chart to see where each prompt falls. Tools like matplotlib make this straightforward – you can color points by their assigned tier (simple vs. complex) or by the raw similarity value. NadirClaw also lets you set a confidence threshold: if the similarity to the best‑match centroid is too low, the system can be configured to fall back to a more robust model or to abstain from routing. Experimenting with the threshold helps you balance cost and accuracy: a higher threshold means only clearly simple prompts skip the expensive model, so more borderline prompts go to Gemini, while a lower threshold saves more money but risks misclassifying borderline complex prompts as simple.
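Building on the scores DataFrame from the previous snippet, a scatter of the two similarity values makes the clusters and the effect of a confidence margin visible. The 0.05 margin below is only an example value, not a NadirClaw default.

```python
import matplotlib.pyplot as plt

margin = 0.05  # example confidence margin, not a NadirClaw default

# Flag low-confidence prompts: those whose two similarities are nearly equal.
scores["confident"] = (scores.sim_simple - scores.sim_complex).abs() >= margin

fig, ax = plt.subplots(figsize=(6, 5))
for (tier, confident), group in scores.groupby(["tier", "confident"]):
    label = f"{tier} ({'confident' if confident else 'low confidence'})"
    ax.scatter(group.sim_simple, group.sim_complex, label=label)

ax.plot([0, 1], [0, 1], linestyle="--", color="gray", linewidth=1)  # decision boundary
ax.set_xlabel("similarity to simple centroid")
ax.set_ylabel("similarity to complex centroid")
ax.legend()
plt.tight_layout()
plt.show()
```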
How do I launch the NadirClaw proxy server and send OpenAI‑compatible requests through it?
Once you’re satisfied with the local classifier, you start the NadirClaw proxy server. This server listens on a local port and accepts requests formatted just like OpenAI’s API (with a model parameter). When a request arrives, the proxy uses the local classifier to decide the actual model to call. If the prompt is simple, it might route to a local model running on your machine (like a small transformer). If complex, it forwards the request to Gemini using the API key you provided. You can test this by sending a cURL or Python request to http://localhost:<port>/v1/chat/completions. The response will come from whichever model the classifier chose, and you can compare the behavior of different routed prompts side‑by‑side.
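Because the proxy speaks the OpenAI chat-completions format, any OpenAI-style client or a plain requests call works. The port number below, and the idea that the proxy treats the model field as a placeholder it can override, are assumptions to check against your NadirClaw configuration.

```python
import requests

PROXY_URL = "http://localhost:8000/v1/chat/completions"  # port is an assumption

def ask(prompt: str) -> dict:
    """Send an OpenAI-style chat request through the NadirClaw proxy."""
    payload = {
        "model": "auto",  # placeholder; the proxy's classifier picks the real model
        "messages": [{"role": "user", "content": prompt}],
    }
    resp = requests.post(PROXY_URL, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()

simple_reply = ask("What is 2+2?")
complex_reply = ask("Design a distributed event-sourced order pipeline.")
print(simple_reply["choices"][0]["message"]["content"])
print(complex_reply["choices"][0]["message"]["content"])
```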
How can I estimate the cost savings compared to always using an expensive pro model?
To estimate savings, log every request along with its routed model and the token count. For each request, calculate what it would have cost if you always sent it to the most expensive model (e.g., Gemini Pro). Then sum up the actual costs: zero for prompts routed to the local model (assuming local inference is free) and the Gemini rate for the complex prompts that were forwarded. The difference between the “always‑pro” total and your actual total is your savings. Typically, if 70–80% of prompts are simple, you can cut costs by a similar percentage. NadirClaw’s built‑in logging makes this calculation easy – you can export the log to a CSV and run a simple Python script to compute the numbers. The result gives you a clear, data‑driven justification for deploying a routing system.
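Assuming you export the log to a CSV with (at least) a routed_model column and a total_tokens column, a rough savings script looks like this. The column names, the per-million-token rate, and the "local" substring check are all illustrative, so swap in your real log schema and current Gemini pricing.

```python
import pandas as pd

# Illustrative numbers only: use the real per-million-token price for your pro model.
PRO_COST_PER_MTOKEN = 5.00   # hypothetical $ per 1M tokens for the pro model
LOCAL_COST_PER_MTOKEN = 0.0  # local inference treated as free

log = pd.read_csv("nadirclaw_requests.csv")  # assumed columns: routed_model, total_tokens

def cost(tokens, rate_per_mtoken):
    return tokens / 1_000_000 * rate_per_mtoken

# What it would have cost to send every request to the pro model.
always_pro = cost(log.total_tokens.sum(), PRO_COST_PER_MTOKEN)

# What it actually cost, given the routing decisions in the log.
is_local = log.routed_model.str.contains("local", case=False)  # illustrative heuristic
actual = (
    cost(log.loc[is_local, "total_tokens"].sum(), LOCAL_COST_PER_MTOKEN)
    + cost(log.loc[~is_local, "total_tokens"].sum(), PRO_COST_PER_MTOKEN)
)

savings = always_pro - actual
print(f"Always-pro: ${always_pro:.2f}  Actual: ${actual:.2f}  "
      f"Saved: ${savings:.2f} ({savings / always_pro:.0%})")
```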