Benchmarks

Accuracy & latency methodology · Last updated May 2026

Note: These benchmarks reflect internal testing on our development dataset. Independent third-party benchmarks are on our roadmap. We publish methodology here so results can be reproduced and challenged.

Methodology

All benchmarks were run against a synthetic dataset of intentionally garbled strings generated by applying each supported keyboard layout mapping in both directions. For each layout pair, 200 strings were generated covering: single words, multi-word phrases (3–8 words), short sentences, and mixed-length inputs including numerals and punctuation.
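As an illustration of the generation step described above, here is a minimal sketch for a single layout pair (English and Russian). The key map is a partial QWERTY/ЙЦУКЕН table written for this example, and the names `garble` and `make_cases` are hypothetical; the actual generator in the repository may be structured differently.

```python
# Partial key-position map between US QWERTY and Russian ЙЦУКЕН.
# Illustrative only: lowercase letter keys plus a few punctuation keys.
EN_KEYS = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
RU_KEYS = "йцукенгшщзхъфывапролджэячсмитьбю"

EN_TO_RU = str.maketrans(EN_KEYS, RU_KEYS)  # English typed on a Russian layout
RU_TO_EN = str.maketrans(RU_KEYS, EN_KEYS)  # Russian typed on an English layout


def garble(text: str, table: dict) -> str:
    """Return what appears on screen when `text` is typed on the wrong layout."""
    return text.lower().translate(table)


def make_cases(en_samples, ru_samples):
    """Build test cases in both directions for one layout pair (hypothetical shape)."""
    cases = []
    for s in en_samples:
        cases.append({"input": garble(s, EN_TO_RU), "expected": s.lower(),
                      "direction": "en-on-ru"})
    for s in ru_samples:
        cases.append({"input": garble(s, RU_TO_EN), "expected": s.lower(),
                      "direction": "ru-on-en"})
    return cases
```

For example, `garble("hello", EN_TO_RU)` gives `руддщ` and `garble("привет", RU_TO_EN)` gives `ghbdtn`. Characters outside the map, such as spaces and digits, pass through unchanged.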

Latency was measured as wall-clock time from the start of the detection pipeline until the response was written, on the Cloudflare Workers runtime (V8 isolate), with cold starts excluded. The latency table reports the median (p50), p95, and p99 across 10,000 requests per layout pair.
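A rough sketch of how such latency figures can be collected and summarized is shown below. It assumes the nearest-rank percentile method and a fixed number of unrecorded warm-up calls as a stand-in for excluding cold starts; this page does not say which percentile method the harness uses. The names `percentile` and `measure` are hypothetical.

```python
import math
import time


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def measure(handler, inputs, warmup=10):
    """Time `handler` on each input and report p50/p95/p99 in milliseconds."""
    for text in inputs[:warmup]:
        handler(text)  # warm-up calls are not recorded
    latencies_ms = []
    for text in inputs:
        start = time.perf_counter()
        handler(text)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

With the samples 1 through 100, `percentile` returns 50, 95, and 99 for p50, p95, and p99 respectively. Timing a local Python function this way is not comparable to the Workers figures; the sketch only shows the shape of the calculation.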

Detection accuracy

Layout pair	Accuracy	AI fallback rate
English typed on Russian (ru-RU)	99.2%	0.8%
English typed on Ukrainian (uk-UA)	98.7%	1.3%
English typed on Greek (el-GR)	98.5%	1.5%
English typed on Arabic (ar-SA)	97.9%	2.1%
English typed on Hebrew (he-IL)	97.4%	2.6%
English typed on German (de-DE)	96.1%	3.9%
English typed on Turkish-Q (tr-Q)	95.8%	4.2%
English typed on French (fr-FR)	94.3%	5.7%
All layout pairs (aggregate)	96.4%	3.6%

Accuracy is the fraction of inputs for which the correct layout was identified and the text was decoded without errors. AI fallback rate is the fraction of requests where rule-based confidence fell below the threshold and the AI layer was invoked (possible only when use_ai: true).
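The two metrics defined above can be computed from per-request records roughly as follows. The `Result` record and its field names are assumptions made for this sketch, not the harness's actual output format.

```python
from dataclasses import dataclass


@dataclass
class Result:
    # Hypothetical per-request record; field names are illustrative.
    expected_layout: str
    detected_layout: str
    expected_text: str
    decoded_text: str
    used_ai: bool


def summarize(results):
    """Return accuracy and AI fallback rate as fractions of all requests."""
    total = len(results)
    if total == 0:
        raise ValueError("no results")
    # A request counts as correct only if both the layout and the decoded text match.
    correct = sum(
        1 for r in results
        if r.detected_layout == r.expected_layout and r.decoded_text == r.expected_text
    )
    fallbacks = sum(1 for r in results if r.used_ai)
    return {"accuracy": correct / total, "ai_fallback_rate": fallbacks / total}
```

Note that a request with the right layout but a single wrong character in the decoded text counts as incorrect under this definition.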

Latency distribution

Path	p50	p95	p99
Rule-based only (use_ai: false)	4ms	9ms	14ms
Rule-based + AI fallback invoked	180ms	420ms	680ms
Already-correct input (fast-path)	1ms	3ms	5ms

Measured on the Cloudflare Workers runtime, excluding the network round trip between client and Worker. The AI path includes the latency of the external API call.

Detection pipeline overview

The 8-layer pipeline processes each request sequentially, short-circuiting as soon as a high-confidence result is found:

1. Native script fast-path: If input is already >70% Cyrillic/Arabic/etc. with correct vowel distribution, return immediately.
2. Character frequency analysis: Compare character distribution against per-layout profiles to produce initial candidate scores.
3. Bigram scoring: Score candidate decoded strings using language-specific bigram tables (English + Cyrillic).
4. Dictionary coverage: Measure what fraction of candidate words appear in the per-language word list.
5. Vowel ratio check: Validate decoded output has a linguistically plausible vowel-to-consonant ratio.
6. Reverse sweep: For Latin-script inputs, attempt reverse decoding into each non-Latin layout to detect Latin→Cyrillic/Arabic errors.
7. Word reconstruction: Apply phrase-level reconstruction to handle partial matches and word boundaries.
8. AI fallback: If confidence is still below threshold and use_ai is enabled, send to external LLM for final disambiguation.
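The short-circuit control flow described above can be sketched as follows. This is a simplified illustration, not the production pipeline: the `Detection` record, the 0.9 threshold, the layer signature, and the toy decode table are all assumptions. Two small scoring helpers show the general idea behind the dictionary coverage and vowel checks; `vowel_share` measures vowels as a share of all letters rather than a vowel-to-consonant ratio.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Detection:
    layout: str
    text: str
    confidence: float
    layer: str


def dictionary_coverage(candidate: str, words: set) -> float:
    """Fraction of whitespace-separated tokens found in the word list."""
    tokens = candidate.lower().split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in words) / len(tokens)


def vowel_share(candidate: str, vowels: str = "aeiouy") -> float:
    """Share of letters that are vowels (a simplification of a vowel ratio check)."""
    letters = [c for c in candidate.lower() if c.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for c in letters if c in vowels) / len(letters)


def make_dictionary_layer(decode: Callable[[str], str], words: set, layout: str):
    """Build a layer that decodes the input and uses coverage as its confidence."""
    def layer(text: str) -> Optional[Detection]:
        candidate = decode(text)
        return Detection(layout, candidate, dictionary_coverage(candidate, words),
                         "dictionary-coverage")
    return layer


def run_pipeline(text, layers, ai_fallback=None, use_ai=False, threshold=0.9):
    """Run layers in order; return as soon as one meets the confidence threshold."""
    best = None
    for layer in layers:
        result = layer(text)
        if result is None:
            continue
        if result.confidence >= threshold:
            return result  # short-circuit: later layers are never called
        if best is None or result.confidence > best.confidence:
            best = result
    if use_ai and ai_fallback is not None:
        return ai_fallback(text)  # invoked only when every layer stayed below threshold
    return best


# Toy usage: a four-key decode table and a two-word dictionary, both made up.
TOY_DECODE = str.maketrans("рудщ", "helo")
toy_layer = make_dictionary_layer(lambda s: s.translate(TOY_DECODE),
                                  {"hello", "world"}, "ru-RU")
result = run_pipeline("руддщ", [toy_layer])
```

Ordering cheap checks first and returning early keeps the common case fast, which is consistent with the gap between the rule-based and AI-fallback latencies reported above.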

Known limitations

Very short inputs (1–2 characters) have lower accuracy due to insufficient signal.
Mixed-script inputs (e.g. Latin + Cyrillic in the same string) may produce ambiguous results.
Phonetic/transliteration layouts (e.g. Russian Phonetic) can produce false positives on short Latin words.
CJK inputs rely on heuristics and benefit most from the AI fallback layer.
Custom or regional keyboard variants not in the 70+ layout database are not detected.
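Callers who want to flag inputs that fall under the first two limitations before sending them can pre-screen on the client side. The sketch below is one possible approach and is not part of the KeySense API; `script_counts` and `prescreen` are hypothetical names. Python's standard library has no Unicode script property, so the first word of each character's Unicode name (for example LATIN or CYRILLIC) is used as an approximation.

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Count letters per script, approximated by the first word of the Unicode name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            counts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    return counts


def prescreen(text: str) -> list:
    """Return warnings for inputs likely to be detected less reliably."""
    counts = script_counts(text)
    warnings = []
    if sum(counts.values()) <= 2:
        warnings.append("very-short-input")  # two letters or fewer
    if len(counts) > 1:
        warnings.append("mixed-script")  # e.g. Latin and Cyrillic together
    return warnings
```

For instance, `prescreen("ok")` returns `["very-short-input"]` and `prescreen("hello мир")` returns `["mixed-script"]`.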

Reproducing these results

The test harness and dataset generator are part of the KeySense AI repository. You can reproduce the benchmarks by running the detection worker locally and feeding it the same synthetic dataset. Contact support@keysense.tech if you'd like access to the full dataset or find results that differ from what we report here.