KeySense AI

Benchmarks

Accuracy & latency methodology · Last updated May 2026

Note: These benchmarks reflect internal testing on our development dataset. Independent third-party benchmarks are on our roadmap. We publish methodology here so results can be reproduced and challenged.

Methodology

All benchmarks were run against a synthetic dataset of intentionally garbled strings, generated by applying each supported keyboard layout mapping in both directions. For each layout pair, 200 strings were generated, covering single words, multi-word phrases (3–8 words), short sentences, and mixed-length inputs that include numerals and punctuation.
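To make the generation step concrete, here is a minimal sketch of garbling a string through a layout mapping and reversing it. It assumes a simplified, partial US QWERTY to Russian ЙЦУКЕН key table; the names QWERTY_TO_RU, invert, and applyLayout are hypothetical and not taken from the KeySense AI repository, and Shift-level punctuation is ignored.

```typescript
// Hypothetical partial table: US QWERTY key -> character produced by the
// same physical key on the standard Russian (ЙЦУКЕН) layout.
const QWERTY_TO_RU: Record<string, string> = {
  q: "й", w: "ц", e: "у", r: "к", t: "е", y: "н", u: "г", i: "ш", o: "щ", p: "з",
  "[": "х", "]": "ъ",
  a: "ф", s: "ы", d: "в", f: "а", g: "п", h: "р", j: "о", k: "л", l: "д",
  ";": "ж", "'": "э",
  z: "я", x: "ч", c: "с", v: "м", b: "и", n: "т", m: "ь", ",": "б", ".": "ю",
};

// Build the reverse direction (Russian character -> QWERTY key).
function invert(map: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [from, to] of Object.entries(map)) out[to] = from;
  return out;
}

// Map every character through the table, preserving letter case and
// leaving unmapped characters (spaces, digits) untouched.
function applyLayout(text: string, map: Record<string, string>): string {
  return Array.from(text, (ch) => {
    const lower = ch.toLowerCase();
    const mapped = map[lower];
    if (mapped === undefined) return ch;
    return ch === lower ? mapped : mapped.toUpperCase();
  }).join("");
}

const RU_TO_QWERTY = invert(QWERTY_TO_RU);
const garbled = applyLayout("hello", QWERTY_TO_RU);  // "руддщ"
const restored = applyLayout(garbled, RU_TO_QWERTY); // "hello"
```

Applying the forward table yields the garbled test input, and the inverted table gives the expected decoded text, so each generated pair carries its own ground truth.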

Latency was measured as wall-clock time from the start of the detection pipeline to the response being written, on a Cloudflare Workers runtime (V8 isolate) with cold-start excluded. Reported numbers are medians across 10,000 requests per layout pair.

Detection accuracy

Layout pair                            Accuracy    AI fallback rate
English typed on Russian (ru-RU)       99.2%       0.8%
English typed on Ukrainian (uk-UA)     98.7%       1.3%
English typed on Greek (el-GR)         98.5%       1.5%
English typed on Arabic (ar-SA)        97.9%       2.1%
English typed on Hebrew (he-IL)        97.4%       2.6%
English typed on German (de-DE)        96.1%       3.9%
English typed on Turkish-Q (tr-Q)      95.8%       4.2%
English typed on French (fr-FR)        94.3%       5.7%
All layout pairs (aggregate)           96.4%       3.6%

Accuracy = correct layout identified and text decoded without errors. AI fallback rate = fraction of requests where rule-based confidence fell below threshold and the AI layer was invoked (only when use_ai: true).
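The two definitions above can be sketched as a small scoring function. The Outcome record and its field names are assumptions made for illustration; the actual test harness may record results differently.

```typescript
// Hypothetical shape of one benchmark outcome; field names are assumptions.
interface Outcome {
  expectedLayout: string; // layout used to garble the string
  detectedLayout: string; // layout reported by the detector
  expectedText: string;   // original, ungarbled text
  decodedText: string;    // text returned by the detector
  aiInvoked: boolean;     // true when the AI fallback layer ran
}

function summarize(outcomes: Outcome[]): { accuracy: number; aiFallbackRate: number } {
  if (outcomes.length === 0) throw new Error("no outcomes to summarize");
  let correct = 0;
  let fallbacks = 0;
  for (const o of outcomes) {
    // A request counts as correct only if both the layout and the decoded
    // text match exactly.
    if (o.detectedLayout === o.expectedLayout && o.decodedText === o.expectedText) {
      correct++;
    }
    if (o.aiInvoked) fallbacks++;
  }
  return {
    accuracy: correct / outcomes.length,
    aiFallbackRate: fallbacks / outcomes.length,
  };
}
```

Both rates use the total number of requests as the denominator, which matches the "fraction of requests" wording above.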

Latency distribution

Path                                   p50      p95      p99
Rule-based only (use_ai: false)        4ms      9ms      14ms
Rule-based + AI fallback invoked       180ms    420ms    680ms
Already-correct input (fast-path)      1ms      3ms      5ms

Measured on Cloudflare Workers runtime, excluding network round-trip. AI path includes external API call latency.
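For readers reproducing the p50/p95/p99 columns, the following sketch computes percentiles from raw latency samples. It assumes the nearest-rank method; the page does not state which percentile definition was used, so other methods (for example linear interpolation) may give slightly different values. The function name percentile is hypothetical.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it. Assumed method, see note above.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b); // numeric sort, copy first
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1]!;
}

// Example with per-request wall-clock latencies in milliseconds.
const latenciesMs = [5, 1, 3, 2, 4];
const p50 = percentile(latenciesMs, 50); // 3
const p99 = percentile(latenciesMs, 99); // 5
```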

Detection pipeline overview

The 8-layer pipeline processes each request sequentially, short-circuiting as soon as a high-confidence result is found:

  1. Native script fast-path: If input is already >70% Cyrillic/Arabic/etc. with correct vowel distribution, return immediately.
  2. Character frequency analysis: Compare character distribution against per-layout profiles to produce initial candidate scores.
  3. Bigram scoring: Score candidate decoded strings using language-specific bigram tables (English + Cyrillic).
  4. Dictionary coverage: Measure what fraction of candidate words appear in the per-language word list.
  5. Vowel ratio check: Validate decoded output has a linguistically plausible vowel-to-consonant ratio.
  6. Reverse sweep: For Latin-script inputs, attempt reverse decoding into each non-Latin layout to detect Latin→Cyrillic/Arabic errors.
  7. Word reconstruction: Apply phrase-level reconstruction to handle partial matches and word boundaries.
  8. AI fallback: If confidence is still below threshold and use_ai is enabled, send to external LLM for final disambiguation.
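
The short-circuiting behaviour described above can be sketched as a loop over layers that stops once a confidence threshold is reached. This is an illustration only, not the KeySense AI implementation: the Detection and Layer types, the runPipeline function, the stub layers, the 0.9 threshold, and the 0.25 vowel floor are all hypothetical. Only the 70% script share comes from step 1, and the fast-path here handles Cyrillic alone.

```typescript
interface Detection {
  layout: string;     // e.g. "native" or "ru-RU"
  text: string;       // decoded (or unchanged) text
  confidence: number; // 0..1
}

// A layer inspects the input and may propose a detection.
type Layer = (input: string) => Detection | null;

const CYRILLIC = /[\u0400-\u04FF]/;
const CYRILLIC_VOWELS = new Set("аеёиоуыэюя");

// Simplified version of step 1: accept input that is mostly Cyrillic and
// has a plausible share of vowels (0.25 is an assumed floor).
const nativeScriptFastPath: Layer = (input) => {
  const chars = Array.from(input.toLowerCase()).filter((c) => c.trim() !== "");
  if (chars.length === 0) return null;
  const cyrillic = chars.filter((c) => CYRILLIC.test(c));
  const vowels = cyrillic.filter((c) => CYRILLIC_VOWELS.has(c));
  const scriptShare = cyrillic.length / chars.length;
  const vowelShare = cyrillic.length === 0 ? 0 : vowels.length / cyrillic.length;
  if (scriptShare > 0.7 && vowelShare >= 0.25) {
    return { layout: "native", text: input, confidence: 1 };
  }
  return null;
};

// Run layers in order, keep the best candidate so far, and stop as soon as
// it reaches the threshold.
function runPipeline(
  input: string,
  layers: Layer[],
  threshold: number,
): { result: Detection | null; layersRun: number } {
  let best: Detection | null = null;
  let layersRun = 0;
  for (const layer of layers) {
    layersRun++;
    const candidate = layer(input);
    if (candidate !== null && (best === null || candidate.confidence > best.confidence)) {
      best = candidate;
    }
    if (best !== null && best.confidence >= threshold) break;
  }
  return { result: best, layersRun };
}

// Stub layers standing in for the scoring steps; they return fixed confidences.
const stub = (confidence: number): Layer => (input) => ({
  layout: "ru-RU",
  text: input,
  confidence,
});
const layers: Layer[] = [nativeScriptFastPath, stub(0.4), stub(0.95), stub(0.99)];

const nativeRun = runPipeline("привет", layers, 0.9);  // stops after layer 1
const garbledRun = runPipeline("руддщ", layers, 0.9);  // fast-path rejects; stops at layer 3
```

The second call shows why the fast-path needs more than a script check: "руддщ" is entirely Cyrillic, yet its low vowel share marks it as a likely layout error, so the later layers still run.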

Known limitations

  • Very short inputs (1–2 characters) have lower accuracy due to insufficient signal.
  • Mixed-script inputs (e.g. Latin + Cyrillic in the same string) may produce ambiguous results.
  • Phonetic/transliteration layouts (e.g. Russian Phonetic) can produce false positives on short Latin words.
  • CJK inputs rely on heuristics and benefit most from the AI fallback layer.
  • Custom or regional keyboard variants not in the 70+ layout database are not detected.

Reproducing these results

The test harness and dataset generator are part of the KeySense AI repository. You can reproduce the benchmarks by running the detection worker locally and feeding it the same synthetic dataset. Contact support@keysense.tech if you'd like access to the full dataset or find results that differ from what we report here.