A light estimate, computed — not measured, not edited
Every Syftly answer carries a confidence label. Today it reads light estimate: the verdict is aggregated from public benchmarks — with sources and dates — and computed from structured fields, not produced by first-hand Syftly measurement and not chosen by an editor. Here is exactly how it works, and where it is weak.
- 1 category: transcription, built end-to-end (MVP-1)
- 6 provider offerings, each a callable API + model-tier
- 14 curated etalage queries, indexable and hand-written
- 7 computed decision axes the engine maps questions onto
Light estimate rests on public research and benchmarks (with attribution) plus AI-as-a-judge to synthesise. It is broad and useful, but always labelled as light. Hard tested would rest only on first-hand measured facts — latency, price and uptime probed by Syftly, accuracy scored against a verified ground truth. That tier does not exist yet; nothing here is presented as hard tested. AI-as-a-judge never counts as hard.
Each category is produced by a fixed, version-controlled research recipe: a source hierarchy (Tier-1 provider docs and recognised benchmarks carry the ranking; marketing never does), then a two-phase pipeline — deterministic extraction of the hard fields, then judged synthesis on top of that grounded data. The winner on each axis is then computed from those structured fields — “cheapest” is literally a min() over the prices. So a new recipe run changes the numbers and the winners recompute by themselves. Credibility comes from provenance — a dated source plus a confidence label plus attribution — not from human approval.
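To illustrate what "computed from structured fields" means, here is a minimal sketch. The `Offering` type, its field names, the provider names and every number are hypothetical, not Syftly's actual schema or data. It only shows that a winner is a `min()` over extracted fields, so a new recipe run that changes a field changes the winner with no editorial step.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Offering:
    """Hypothetical structured record produced by the extraction phase."""
    name: str
    price_per_minute: Optional[float]  # None when the model is billed per token
    wer: float                         # word error rate, lower is better


def cheapest(offerings):
    # Only directly comparable per-minute prices take part in the price axis.
    priced = [o for o in offerings if o.price_per_minute is not None]
    return min(priced, key=lambda o: o.price_per_minute)


def most_accurate(offerings):
    return min(offerings, key=lambda o: o.wer)


# Made-up illustrative numbers, not real benchmark data.
run_1 = [
    Offering("provider-a", 0.006, 0.08),
    Offering("provider-b", 0.004, 0.11),
    Offering("provider-c", None, 0.07),  # token-priced, kept out of "cheapest"
]

# A later recipe run in which provider-a lowered its price.
run_2 = [Offering("provider-a", 0.003, 0.08), *run_1[1:]]

print(cheapest(run_1).name)       # provider-b
print(cheapest(run_2).name)       # provider-a
print(most_accurate(run_1).name)  # provider-c
```

Nothing in the second run was edited by hand: the winner flipped because the input field changed.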
A free question is mapped — deterministically, with no LLM call — onto one of these axes; its winner is computed from the ranking. If nothing matches, the answer falls back to the category default (the top of the ordered ranking). An in-category question never 404s.
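One way such a deterministic mapping could work is plain keyword matching with a fallback. The sketch below is an assumption for illustration: the keyword table, the axis identifiers and the `map_question` function are hypothetical, and the real engine's matching rules are not described here.

```python
import re

# Hypothetical keyword table; dict order fixes the matching priority.
AXIS_KEYWORDS = {
    "cheapest": {"cheapest", "cheap", "price", "cost"},
    "most_accurate": {"accurate", "accuracy", "wer"},
    "most_multilingual": {"multilingual", "languages"},
    "lowest_latency": {"latency", "fastest", "realtime"},
}
DEFAULT_AXIS = "category_default"  # top of the ordered ranking


def map_question(question: str) -> str:
    """Map a free question onto an axis with no model call."""
    # Match whole words so that "wer" does not fire inside "answer".
    words = set(re.findall(r"[a-z]+", question.lower()))
    for axis, keywords in AXIS_KEYWORDS.items():
        if words & keywords:
            return axis
    return DEFAULT_AXIS  # an in-category question always gets an answer


print(map_question("What is the cheapest transcription API?"))  # cheapest
print(map_question("Which one has the lowest latency?"))        # lowest_latency
print(map_question("Which API should I use?"))                  # category_default
```

Because the function is pure and the table is ordered, the same question always maps to the same axis.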
| Axis | Computed by |
|---|---|
| Cheapest | lowest directly-comparable per-minute price |
| Most accurate | lowest word error rate (WER) |
| Most multilingual | highest supported-language count |
| Lowest latency | lowest published streaming latency |
| Best price-to-accuracy | lowest price × WER |
| Capability filter | top-ranked offering that has the requested capability (diarization, word timestamps or custom vocabulary) |
| Language | most accurate offering that supports the asked language (e.g. Dutch, Spanish) |
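The composite and filtered axes in the table can be sketched the same way. Again, the `Offering` fields, the function names, the ranking order and all values are hypothetical and only illustrate the computation described in each row. Returning `None` when nothing qualifies is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offering:
    """Hypothetical structured record; field names are illustrative."""
    name: str
    price_per_minute: float
    wer: float
    languages: frozenset
    capabilities: frozenset


def best_price_to_accuracy(offerings):
    # Lowest product of price and WER: cheap and accurate at the same time.
    return min(offerings, key=lambda o: o.price_per_minute * o.wer)


def capability_filter(ranking, capability):
    # First offering, in ranked order, that has the requested capability.
    return next((o for o in ranking if capability in o.capabilities), None)


def best_for_language(offerings, language):
    # Most accurate offering among those that support the language.
    supported = [o for o in offerings if language in o.languages]
    return min(supported, key=lambda o: o.wer) if supported else None


# Made-up illustrative values, not real benchmark data.
a = Offering("provider-a", 0.006, 0.08, frozenset({"en", "nl"}),
             frozenset({"diarization"}))
b = Offering("provider-b", 0.004, 0.11, frozenset({"en", "es"}),
             frozenset({"word_timestamps", "custom_vocabulary"}))
c = Offering("provider-c", 0.010, 0.07, frozenset({"en"}), frozenset())

ranking = [c, a, b]  # assumed category ordering, best first

print(best_price_to_accuracy(ranking).name)            # provider-b
print(capability_filter(ranking, "diarization").name)  # provider-a
print(best_for_language(ranking, "nl").name)           # provider-a
```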
- WER is English-leaning. Accuracy uses the Artificial Analysis aggregate WER, which weights English heavily. A “most accurate for Dutch” verdict is a light estimate, not a Dutch-specific measurement.
- Token-priced models are excluded from “cheapest”. For an LLM-based transcriber billed per input-audio token (e.g. Gemini), the token price understates real cost, so it is not directly comparable to per-minute pricing and is kept out of the price axis.
- Latency figures are heterogeneous. Some are independent P50 numbers, others vendor claims; they are not strictly comparable, so a latency verdict carries that caveat.
- Only the curated etalage is indexable. Hand-written published queries are crawlable; on-demand engine answers for the long tail are served noindex, so the site never fills with thin near-duplicate pages.
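The indexing rule in the last point could be expressed as a small per-page decision. This is a sketch under assumptions: the slugs and the `robots_headers` function are hypothetical, and whether the directive is sent as an `X-Robots-Tag` response header (used here) or as a robots meta tag is not stated.

```python
# Hypothetical slugs for hand-written, published queries.
CURATED_SLUGS = frozenset({
    "cheapest-transcription-api",
    "most-accurate-transcription-api",
})


def robots_headers(slug: str) -> dict[str, str]:
    """Extra response headers that control indexing for an answer page."""
    if slug in CURATED_SLUGS:
        return {}  # curated pages stay crawlable and indexable by default
    # On-demand long-tail answers are served but kept out of search indexes.
    return {"X-Robots-Tag": "noindex"}


print(robots_headers("cheapest-transcription-api"))  # {}
print(robots_headers("best-api-for-podcasts"))       # {'X-Robots-Tag': 'noindex'}
```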
- Artificial Analysis: Speech to Text (2026-06-19)
- Open ASR Leaderboard (2026-06-19)