Written by Sujith Quintelier • Jun 19, 2026

Maximising Copilot Token Efficiency 🚀

Master token efficiency in Copilot sessions: learn prompt caching, auto model selection, and practical strategies to stretch your credits further.

Jun 19, 2026

With GitHub Copilot shifting to usage-based billing, every token in an agentic session now carries real cost. The good news? GitHub and Microsoft have spent months engineering the Copilot harness to be smarter about how it spends those tokens. This guide walks you through the key efficiency improvements, along with practical habits that will help your credits go further.

Understanding the Cost of Agentic Sessions

When you use Copilot in a longer coding session, especially with agents performing multi-step tasks, token usage quickly compounds. Each request repeats:

System instructions and tool definitions
Repository context and conversation history
Available tools and task state
Model reasoning overhead

That repetition is the problem. But GitHub has tackled it with two complementary strategies: harness-level efficiency (reducing what Copilot repeats per turn) and model intelligence (choosing the right model for the job).

Strategy 1: Prompt Caching and Deferred Tools

Extended Prompt Caching (OpenAI)

Prompt caching lets the inference provider reuse cached model state for repeated prefixes instead of recomputing them from scratch. Here’s what makes this powerful:

Cost impact: Cached input tokens cost up to 10 times less than uncached tokens
Default behaviour: OpenAI caches the prompt prefix automatically, but the cache expires after ~5–10 minutes of inactivity
Extended retention: Setting prompt_cache_retention: "24h" moves the cache to GPU-local storage and preserves it for up to 24 hours

The benefit is dramatic when you pause and resume:

Pause duration	Cache hit rate improvement
10–20 min	+13%
20–30 min	+135%
30–40 min	+301%
40–60 min	+338% to +919%

After a 40–60 minute break, your cache is 3–10x more likely to remain warm. That’s the difference between picking up where you left off cheaply, or paying full price to reprocess your entire conversation history.

Smarter Prompt Caching (Anthropic)

Anthropic’s approach differs: you place explicit cache_control breakpoints at the prompt’s most stable boundaries.

End of tool definitions and system prompt (least-changing elements)
Two rolling anchors on the two most recent cacheable messages
The older anchor serves as a safety net if the fresher one expires

Result: ~94% cache hit rate for agentic workloads, meaning only a small fraction of each request needs recomputation.

Tool Search and Deferred Loading

Agents can access dozens of tools, including MCP servers, built-in tools, and extensions. Historically, every tool definition was loaded into context on every request, even if unused.

Tool search flips this: tools are deferred, and the model sees only lightweight metadata (name and description) upfront. The heavy parameter schemas load on demand when the model actually calls a tool.

Measurable gains (4-day experiment with GPT-5.4 and GPT-5.5):

Metric	Improvement
Tokens per turn (P50)	~9% reduction
Time to first token	~7% faster
Time to complete	~5% faster
Session-level token usage	~8–11% reduction

For Anthropic models with client-side tool search, a 7-day experiment showed:

Metric	Improvement
Prompt tokens (P50 per session)	~18% reduction
Total tokens (P50 per session)	~18% reduction
Time to first token	~2% faster
User error rate (Sonnet 4.6)	~4% lower

The embedding-guided search (matching intent, not keywords) is smarter too: “find all references to this symbol” surfaces the right tool even when names don’t overlap literally.

WebSocket Persistence (OpenAI)

Agentic turns often involve many sequential requests as the model calls tools and works toward a solution. Switching between HTTP requests adds latency overhead; WebSockets keep a persistent connection open and enable connection-local caching of the most recent response state.

Latency improvements during rollout:

Metric	Improvement
Time to first token (p50)	~19% faster
Time to first token (p95)	~13% faster
Time to complete per turn (p50)	~13–14% faster

Bonus: WebSockets also drove measurable engagement gains:

Active users: +1.27% to +2.17%
Two-day engagement: +1.90% to +3.14%

Strategy 2: Auto Model Selection with Task Intent

Quick explanations, focused edits, and multi-file refactorings don’t all benefit from the same reasoning level. GitHub’s Auto model selection uses two signals to choose the right model for the work:

Real-Time Model Health

A dynamic engine tracks model availability, utilisation, speed, error rates, and cost. A model may be capable of handling your task, but if it’s under heavy load, a more efficient model might deliver faster results at lower cost.

Task-Aware Routing (HyDRA)

A proprietary routing model considers:

Reasoning depth required
Code complexity
Debugging difficulty
Tool orchestration needs

Benchmark results (across a 5-model production pool):

Operating point	Quality vs baseline	Cost savings
Peak (max quality)	Exceeds Sonnet	12.9% savings
Balanced	Matched	72.5% savings

On SWEBench benchmarks, HyDRA’s conservative operating point ties commercial routers while delivering 3.3x the cost savings. The aggressive operating point outperforms both Azure Foundry modes.

Making Auto Work in Practice

Auto avoids cache-breaking mid-conversation by routing only at natural cache boundaries:

On the first turn (no cache to lose)
After compaction (when Copilot summarises older turns and resets the prefix)

Between these points, the selected model stays in place so cache can keep building.

Multilingual routing: HyDRA was trained on 16 language families (CJK, European, etc.). Routing accuracy stays within 4 points of the English baseline across all language groups.

Practical Habits to Stretch Your Credits

Start with Auto. Auto is the strong default because it chooses a model based on the task without manual tuning every time.
Keep context focused. Start a new session when you switch tasks, compact long-running sessions when needed, and mention the files you want Copilot to use. Less unnecessary context means more of your session goes toward actual work.
Avoid mid-session changes. Switching models, reasoning levels, context size, or tool configuration breaks cache reuse. Set up the session the way you want it, then keep related work together.
Plan before parallelising. For larger tasks, ask Copilot to plan first. Parallel agents consume credits in parallel, so use them deliberately.
Use only the tools you need. Tools and MCP servers are powerful, but broad toolsets add extra context. Enable what’s relevant to the task. Use Agent Finder to streamline your tool usage.
Monitor your usage. Your AI usage page shows where credits are going across features and models. In Copilot CLI, session-level usage flags expensive patterns while you work.

What’s Next for Copilot Efficiency

GitHub is continuing to invest:

Specialised subagents for narrow tasks (workspace search, running commands, summarising results), each running on the smallest, cheapest model that can do the job
Better transparency in the product: alerts when resuming after a long pause with expired cache, or changing reasoning mid-session
Auto expansion to Copilot CLI, GitHub App, and additional IDEs
Simplified plans: Copilot Free and Student plans will leverage Auto as the only model selection option
Admin controls: Teams can set Auto as default or enforce it as the only option

Getting Started

Auto model selection is live across supported Copilot experiences today. To learn more:

The engineering investment is clear: every token now goes further. Pair that with deliberate habits: keep sessions focused, trust Auto to route intelligently, and monitor your usage. Then you will get more value from each credit you spend.

References

Getting more from each token: How Copilot improves context handling and model routing – GitHub Blog
Improving token efficiency for GitHub Copilot in VS Code – VS Code Blog
Auto model selection docs – GitHub Docs

GitHub Copilot Mastery

Development Tools AI-Powered Development

GitHub GitHub Copilot AI VS Code Development Tools AI Credits

Enjoyed it? Fuel my coffee