Cut LLM input tokens by 66% while improving accuracy by up to 1.1%
Full benchmark report for The Token Company's adaptive compression, powered by bear-1, evaluated on GPT-4o-mini with the LongBench v2 benchmark.
Large-context LLMs still stumble on long inputs: they lose facts in the middle of a sequence (the classic “lost-in-the-middle” failure mode), hit hard limits on usable context, and turn big prompts into higher costs and slower responses. Scaling context alone does not fix the attention, budget, or throughput constraints when every extra token compounds price and response time.
In this LongBench v2 study, The Token Company's bear-1 adaptive compression improved answer accuracy while discarding 23–66% of tokens, cutting latency and model spend in the process. bear-1 scores each token's importance and drops tokens below a configurable cutoff, so only the most important tokens are passed to the LLM.
In the test case, the input is first passed to The Token Company's bear-1 compression model via API, and the compressed input is then passed to GPT-4o-mini. In the control case, the input is passed directly to GPT-4o-mini without compression. Sign up to try it yourself.
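For reference, here is a minimal sketch of that pipeline in Python. The bear-1 endpoint URL, request fields, and response field are hypothetical placeholders (the real API may differ); the GPT-4o-mini call uses the standard OpenAI SDK.

```python
# Sketch of the evaluated pipeline. The bear-1 endpoint URL, request fields,
# and response field below are hypothetical placeholders; consult the API
# docs for the real interface. The GPT-4o-mini call uses the OpenAI SDK.
import requests
from openai import OpenAI

BEAR1_URL = "https://api.example.com/v1/compress"  # hypothetical endpoint


def compress(text: str, importance_cutoff: float, api_key: str) -> str:
    """Pass raw input to bear-1 and return the compressed text."""
    resp = requests.post(
        BEAR1_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"text": text, "importance_cutoff": importance_cutoff},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["compressed_text"]  # assumed response field


def answer(question: str, context: str, cutoff: float, bear1_key: str) -> str:
    """Test case: compress the context with bear-1, then query GPT-4o-mini."""
    compressed = compress(context, cutoff, bear1_key)
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content": f"{compressed}\n\n{question}"}],
    )
    return completion.choices[0].message.content
```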
Results Summary
| Cutoff | Mean Accuracy | Std Dev | Δ vs Baseline | Token Reduction | Significance |
|---|---|---|---|---|---|
| baseline | 28.2% | ±0.6% | — | — | |
| 0.1 | 28.4% | ±0.7% | +0.2% | 10.3% | |
| 0.2 | 27.7% | ±0.7% | -0.5% | 15.5% | |
| 0.3 | 29.2% | ±1.0% | +1.0% | 23.4% | ✓ |
| 0.4 | 29.1% | ±0.7% | +0.9% | 24.6% | ✓ |
| 0.5 | 28.9% | ±0.7% | +0.7% | 31.4% | ✓ |
| 0.6 | 28.5% | ±0.9% | +0.3% | 35.6% | ✓ |
| 0.7 | 27.8% | ±1.0% | -0.4% | 42.4% | |
| 0.8 | 29.0% | ±0.8% | +0.8% | 52.4% | ✓ |
| 0.9 | 29.2% | ±0.8% | +1.1% | 66.1% | ✓ |
| 0.95 | 27.7% | ±0.6% | -0.5% | 77.4% | |
Significance: ✓ = statistically significant improvement by a two-sample t-test, |t| > 2 (~95% confidence)
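The significance flags can be reproduced from the per-run accuracies. A minimal sketch, assuming the 50 per-run accuracy arrays are available (the arrays below are placeholder data):

```python
# Reproducing the significance flags from per-run accuracies: a two-sample
# t-test between the 50 baseline runs and the 50 runs of one cutoff,
# flagging |t| > 2 (~95% confidence). The arrays here are placeholder data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
baseline_runs = rng.normal(0.282, 0.006, size=50)  # substitute real runs
cutoff_runs = rng.normal(0.292, 0.008, size=50)    # substitute real runs

t_stat, p_value = stats.ttest_ind(cutoff_runs, baseline_runs)
significant = abs(t_stat) > 2  # the |t| > 2 rule used in the table above
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, significant = {significant}")
```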
Key Observations
- Six cutoffs (0.3, 0.4, 0.5, 0.6, 0.8, 0.9) showed statistically significant accuracy improvements over the baseline using bear-1.
- Best accuracy improvement: the 0.9 cutoff at +1.1%, with 66% token reduction.
- Non-monotonic relationship: some cutoffs (0.2, 0.7) underperform their neighbors despite bear-1 removing fewer tokens. This is likely an artifact of how specific thresholds interact with the importance-score distribution in this dataset, rather than a fundamental property of compression; different benchmarks may show slight dips at different thresholds.
- Aggressive compression (0.95) via bear-1 removes too much context and loses accuracy.
Recommended Configurations
Two cutoffs are most practical with bear-1. An importance_cutoff of 0.3 is a conservative setting that improves accuracy (+1.0%) while removing 23% of tokens, best for users who want a performance lift with minimal risk of losing context. An importance_cutoff of 0.9 is an aggressive setting that improved accuracy the most (+1.1%) while removing 66% of tokens, ideal for cost-sensitive or latency-bound workloads. Both configurations were statistically significant across 50 runs each on LongBench v2 with GPT-4o-mini.
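As a sketch, the two recommended presets could be encoded in client code like this; the constants and helper are illustrative, not part of the bear-1 API:

```python
# Illustrative presets only; these names are not part of the bear-1 API,
# just one way to encode the study's two recommended cutoffs.
RECOMMENDED_CUTOFFS = {
    "conservative": 0.3,  # +1.0% accuracy, 23% fewer tokens
    "aggressive": 0.9,    # +1.1% accuracy, 66% fewer tokens
}


def pick_cutoff(cost_or_latency_bound: bool) -> float:
    """Prefer the aggressive preset for cost- or latency-bound workloads."""
    key = "aggressive" if cost_or_latency_bound else "conservative"
    return RECOMMENDED_CUTOFFS[key]
```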
Bootstrap Analysis
10,000 bootstrap iterations. Baseline: 50 runs, mean accuracy 0.2817. Tested with bear-1.
| Config | Mean | Diff | 95% CI | P(better) | Verdict |
|---|---|---|---|---|---|
| 0.1 | 0.2842 | +0.0024 | [-0.0001, +0.0050] | 96.91% | not significant |
| 0.2 | 0.2770 | -0.0048 | [-0.0074, -0.0022] | 0.01% | significantly worse |
| 0.3 | 0.2917 | +0.0100 | [+0.0070, +0.0132] | 100.00% | significantly better |
| 0.4 | 0.2907 | +0.0090 | [+0.0063, +0.0116] | 100.00% | significantly better |
| 0.5 | 0.2890 | +0.0073 | [+0.0047, +0.0100] | 100.00% | significantly better |
| 0.6 | 0.2850 | +0.0033 | [+0.0003, +0.0062] | 98.42% | significantly better |
| 0.7 | 0.2781 | -0.0037 | [-0.0070, -0.0004] | 1.28% | significantly worse |
| 0.8 | 0.2901 | +0.0084 | [+0.0056, +0.0114] | 100.00% | significantly better |
| 0.9 | 0.2924 | +0.0107 | [+0.0079, +0.0135] | 100.00% | significantly better |
| 0.95 | 0.2768 | -0.0049 | [-0.0073, -0.0025] | 0.01% | significantly worse |

Verdict: based on whether the 95% CI excludes zero.
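A minimal sketch of the bootstrap procedure, assuming the 50 per-run accuracies for the baseline and one configuration are available as NumPy arrays:

```python
# Minimal sketch of the bootstrap comparison, assuming the 50 per-run
# accuracies for the baseline and one configuration are NumPy arrays.
import numpy as np


def bootstrap_diff(config_runs, baseline_runs, n_iter=10_000, seed=0):
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_iter)
    for i in range(n_iter):
        # Resample each group with replacement, then compare mean accuracy.
        c = rng.choice(config_runs, size=len(config_runs), replace=True)
        b = rng.choice(baseline_runs, size=len(baseline_runs), replace=True)
        diffs[i] = c.mean() - b.mean()
    ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])  # 95% CI
    p_better = (diffs > 0).mean()                        # P(better)
    return diffs.mean(), (ci_low, ci_high), p_better
```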
How to get started
The Token Company's bear-1 compression model provides significant advantages as middleware for compressing large LLM inputs. By removing less important tokens it improves accuracy, and by shrinking the input passed to the LLM it reduces latency and saves significantly on LLM costs.
Our bear-1 model is available in beta; access starts with joining the waitlist. Once you are approved, you will receive an API key and can start using the model immediately. Sign up for the waitlist.
Methodology
- Dataset: LongBench v2 multiple-choice questions (paper)
- Sampling: 230 questions stratified from 503, filtered to ≤100k tokens
- Compression: The Token Company bear-1 adaptive compression with the importance_cutoff parameter
- Token counting: tiktoken (gpt-4o-mini encoding); see the sketch after this list
- Runs: 50 independent evaluations per configuration
- Temperature: 0 (near-deterministic)
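A short sketch of the token-counting and filtering step using tiktoken's public API; the threshold mirrors the ≤100k-token sampling rule above:

```python
# Token counting as described above: tiktoken with the gpt-4o-mini encoding.
# The 100k-token filter mirrors the sampling rule in this methodology.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o-mini")  # resolves to o200k_base


def count_tokens(text: str) -> int:
    return len(enc.encode(text))


MAX_TOKENS = 100_000
keep_question = count_tokens("full question context here") <= MAX_TOKENS
```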
Limitations
- Results specific to GPT-4o-mini; may differ for other models when using bear-1
- LongBench v2 subset (230/503 questions due to token limits)
- Effect sizes are small (~1%); practical significance depends on use case
Token counts calculated with tiktoken. Compression performed using The Token Company bear-1.