Maximising Copilot Token Efficiency 🚀

With GitHub Copilot shifting to usage-based billing, every token in an agentic session now carries real cost. The good news? GitHub and Microsoft have spent months engineering the Copilot harness to be smarter about how it spends those tokens. This guide walks you through the key efficiency improvements, along with practical habits that will help your credits go further.
Understanding the Cost of Agentic Sessions
When you use Copilot in a longer coding session, especially with agents performing multi-step tasks, token usage quickly compounds. Each request repeats:
- System instructions and tool definitions
- Repository context and conversation history
- Available tools and task state
- Model reasoning overhead
That repetition is the problem. But GitHub has tackled it with two complementary strategies: harness-level efficiency (reducing what Copilot repeats per turn) and model intelligence (choosing the right model for the job).
Strategy 1: Prompt Caching and Deferred Tools
Extended Prompt Caching (OpenAI)
Prompt caching lets the inference provider reuse cached model state for repeated prefixes instead of recomputing them from scratch. Here’s what makes this powerful:
- Cost impact: Cached input tokens cost up to 10 times less than uncached tokens
- Default behaviour: OpenAI caches the prompt prefix automatically, but the cache expires after ~5–10 minutes of inactivity
- Extended retention: Setting
prompt_cache_retention: "24h"moves the cache to GPU-local storage and preserves it for up to 24 hours
The benefit is dramatic when you pause and resume:
| Pause duration | Cache hit rate improvement |
|---|---|
| 10–20 min | +13% |
| 20–30 min | +135% |
| 30–40 min | +301% |
| 40–60 min | +338% to +919% |
After a 40–60 minute break, your cache is 3–10x more likely to remain warm. That’s the difference between picking up where you left off cheaply, or paying full price to reprocess your entire conversation history.
Smarter Prompt Caching (Anthropic)
Anthropic’s approach differs: you place explicit cache_control breakpoints at the prompt’s most stable boundaries.
- End of tool definitions and system prompt (least-changing elements)
- Two rolling anchors on the two most recent cacheable messages
- The older anchor serves as a safety net if the fresher one expires
Result: ~94% cache hit rate for agentic workloads, meaning only a small fraction of each request needs recomputation.
Tool Search and Deferred Loading
Agents can access dozens of tools, including MCP servers, built-in tools, and extensions. Historically, every tool definition was loaded into context on every request, even if unused.
Tool search flips this: tools are deferred, and the model sees only lightweight metadata (name and description) upfront. The heavy parameter schemas load on demand when the model actually calls a tool.
Measurable gains (4-day experiment with GPT-5.4 and GPT-5.5):
| Metric | Improvement |
|---|---|
| Tokens per turn (P50) | ~9% reduction |
| Time to first token | ~7% faster |
| Time to complete | ~5% faster |
| Session-level token usage | ~8–11% reduction |
For Anthropic models with client-side tool search, a 7-day experiment showed:
| Metric | Improvement |
|---|---|
| Prompt tokens (P50 per session) | ~18% reduction |
| Total tokens (P50 per session) | ~18% reduction |
| Time to first token | ~2% faster |
| User error rate (Sonnet 4.6) | ~4% lower |
The embedding-guided search (matching intent, not keywords) is smarter too: “find all references to this symbol” surfaces the right tool even when names don’t overlap literally.
WebSocket Persistence (OpenAI)
Agentic turns often involve many sequential requests as the model calls tools and works toward a solution. Switching between HTTP requests adds latency overhead; WebSockets keep a persistent connection open and enable connection-local caching of the most recent response state.
Latency improvements during rollout:
| Metric | Improvement |
|---|---|
| Time to first token (p50) | ~19% faster |
| Time to first token (p95) | ~13% faster |
| Time to complete per turn (p50) | ~13–14% faster |
Bonus: WebSockets also drove measurable engagement gains:
- Active users: +1.27% to +2.17%
- Two-day engagement: +1.90% to +3.14%
Strategy 2: Auto Model Selection with Task Intent
Quick explanations, focused edits, and multi-file refactorings don’t all benefit from the same reasoning level. GitHub’s Auto model selection uses two signals to choose the right model for the work:
Real-Time Model Health
A dynamic engine tracks model availability, utilisation, speed, error rates, and cost. A model may be capable of handling your task, but if it’s under heavy load, a more efficient model might deliver faster results at lower cost.
Task-Aware Routing (HyDRA)
A proprietary routing model considers:
- Reasoning depth required
- Code complexity
- Debugging difficulty
- Tool orchestration needs
Benchmark results (across a 5-model production pool):
| Operating point | Quality vs baseline | Cost savings |
|---|---|---|
| Peak (max quality) | Exceeds Sonnet | 12.9% savings |
| Balanced | Matched | 72.5% savings |
On SWEBench benchmarks, HyDRA’s conservative operating point ties commercial routers while delivering 3.3x the cost savings. The aggressive operating point outperforms both Azure Foundry modes.
Making Auto Work in Practice
Auto avoids cache-breaking mid-conversation by routing only at natural cache boundaries:
- On the first turn (no cache to lose)
- After compaction (when Copilot summarises older turns and resets the prefix)
Between these points, the selected model stays in place so cache can keep building.
Multilingual routing: HyDRA was trained on 16 language families (CJK, European, etc.). Routing accuracy stays within 4 points of the English baseline across all language groups.
Practical Habits to Stretch Your Credits
Start with Auto. Auto is the strong default because it chooses a model based on the task without manual tuning every time.
Keep context focused. Start a new session when you switch tasks, compact long-running sessions when needed, and mention the files you want Copilot to use. Less unnecessary context means more of your session goes toward actual work.
Avoid mid-session changes. Switching models, reasoning levels, context size, or tool configuration breaks cache reuse. Set up the session the way you want it, then keep related work together.
Plan before parallelising. For larger tasks, ask Copilot to plan first. Parallel agents consume credits in parallel, so use them deliberately.
Use only the tools you need. Tools and MCP servers are powerful, but broad toolsets add extra context. Enable what’s relevant to the task. Use Agent Finder to streamline your tool usage.
Monitor your usage. Your AI usage page shows where credits are going across features and models. In Copilot CLI, session-level usage flags expensive patterns while you work.
What’s Next for Copilot Efficiency
GitHub is continuing to invest:
- Specialised subagents for narrow tasks (workspace search, running commands, summarising results), each running on the smallest, cheapest model that can do the job
- Better transparency in the product: alerts when resuming after a long pause with expired cache, or changing reasoning mid-session
- Auto expansion to Copilot CLI, GitHub App, and additional IDEs
- Simplified plans: Copilot Free and Student plans will leverage Auto as the only model selection option
- Admin controls: Teams can set Auto as default or enforce it as the only option
Getting Started
Auto model selection is live across supported Copilot experiences today. To learn more:
The engineering investment is clear: every token now goes further. Pair that with deliberate habits: keep sessions focused, trust Auto to route intelligently, and monitor your usage. Then you will get more value from each credit you spend.
References
- Getting more from each token: How Copilot improves context handling and model routing – GitHub Blog
- Improving token efficiency for GitHub Copilot in VS Code – VS Code Blog
- Auto model selection docs – GitHub Docs