Methodology: how we reach a verdict

A light estimate, computed — not measured, not edited

Every Syftly answer carries a confidence label. Today it reads light estimate: the verdict is aggregated from public benchmarks — with sources and dates — and computed from structured fields, not produced by first-hand Syftly measurement and not chosen by an editor. Here is exactly how it works, and where it is weak.

What we cover today

1 category: transcription, built end-to-end (MVP-1)
6 provider offerings, each a callable API + model-tier
14 curated etalage queries, indexable & hand-written
7 computed decision axes the engine maps questions onto

Light estimate vs hard tested

Light estimate rests on public research and benchmarks (with attribution) plus AI-as-a-judge to synthesise. It is broad and useful, but always labelled as light. Hard tested would rest only on first-hand measured facts — latency, price and uptime probed by Syftly, accuracy scored against a verified ground truth. That tier does not exist yet; nothing here is presented as hard tested. AI-as-a-judge never counts as hard.

The recipe: computed, not edited

Each category is produced by a fixed, version-controlled research recipe: a source hierarchy (Tier-1 provider docs and recognised benchmarks carry the ranking; marketing never does), then a two-phase pipeline — deterministic extraction of the hard fields, then judged synthesis on top of that grounded data. The winner on each axis is then computed from those structured fields — “cheapest” is literally a min() over the prices. So a new recipe run changes the numbers and the winners recompute by themselves. Credibility comes from provenance — a dated source plus a confidence label plus attribution — not from human approval.
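To make "computed, not edited" concrete, here is a minimal sketch of a price winner recomputing after a new recipe run. The `Offering` type, the offering names, the prices and the second run date are all hypothetical and for illustration only; the one thing taken from the text is that "cheapest" is a `min()` over the extracted prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offering:
    name: str             # hypothetical offering id
    price_per_min: float  # USD per audio minute (made-up value)
    source: str           # provenance: where the figure came from
    source_date: str      # provenance: ISO date of the source


def cheapest(offerings):
    """Winner on the price axis: a plain min() over the extracted prices."""
    return min(offerings, key=lambda o: o.price_per_min)


# First recipe run (illustrative numbers only).
run_1 = [
    Offering("offering-a", 0.0060, "provider docs", "2026-06-19"),
    Offering("offering-b", 0.0043, "provider docs", "2026-06-19"),
]

# A later run extracts a new price for offering-a; nobody edits the winner.
run_2 = [
    Offering("offering-a", 0.0040, "provider docs", "2026-07-01"),
    run_1[1],
]

winner_1 = cheapest(run_1).name
winner_2 = cheapest(run_2).name
print(winner_1, winner_2)
```

The winner changes between the two runs only because the extracted number changed, which is the property the recipe relies on.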

The decision axes

A free-text question is mapped deterministically, with no LLM call, onto one of these axes; the winner on that axis is computed from the ranking. If nothing matches, the answer falls back to the category default (the top of the ordered ranking). An in-category question never 404s.
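One way such a deterministic mapping could look is a fixed, ordered list of keyword stems. This is a sketch under assumptions: the stems, the axis identifiers and the `map_question` function are invented here, and the real engine's matching rules are not described in this page. It only illustrates the two stated properties: no LLM call, and a fallback to the category default when nothing matches.

```python
import re

# Hypothetical stem table: (axis id, word prefixes that select it).
# Order matters: the first axis with a matching stem wins.
AXIS_STEMS = [
    ("cheapest", ("cheap",)),
    ("most_accurate", ("accura",)),
    ("most_multilingual", ("multiling",)),
    ("lowest_latency", ("latenc",)),
]
DEFAULT_AXIS = "category_default"  # stands for the top of the ordered ranking


def map_question(question: str) -> str:
    """Map a free-text question onto an axis using plain string rules."""
    words = re.findall(r"[a-z]+", question.lower())
    for axis, stems in AXIS_STEMS:
        if any(word.startswith(stem) for word in words for stem in stems):
            return axis
    # Nothing matched: fall back instead of failing.
    return DEFAULT_AXIS


print(map_question("What is the cheapest transcription API?"))
print(map_question("Which transcription API should I use?"))
```

Because the function is pure string matching, the same question always resolves to the same axis, and an unmatched question still yields an answer.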

The computed decision axes and the rule each uses
Axis	Computed by
Cheapest	lowest directly-comparable per-minute price
Most accurate	lowest word error rate (WER)
Most multilingual	highest supported-language count
Lowest latency	lowest published streaming latency
Best price-to-accuracy	lowest price × WER
Capability filter	top-ranked offering that has the requested capability (diarization, word timestamps or custom vocabulary)
Language	most accurate offering that supports the asked language (e.g. Dutch, Spanish)
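Two of the rules in the table combine fields rather than reading a single one, so a short sketch may help. The `Row` type, the offering names and every number below are hypothetical. Returning `None` when no offering supports a language is an assumption of this sketch, not documented behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    name: str
    price_per_min: float  # USD per minute (made-up value)
    wer: float            # word error rate as a fraction (made-up value)
    languages: frozenset  # supported language codes (made-up values)


rows = [
    Row("offering-a", 0.006, 0.05, frozenset({"en", "nl"})),
    Row("offering-b", 0.004, 0.09, frozenset({"en", "es"})),
    Row("offering-c", 0.010, 0.04, frozenset({"en"})),
]


def best_price_to_accuracy(rows):
    """Lowest price x WER: cheap and accurate both pull the product down."""
    return min(rows, key=lambda r: r.price_per_min * r.wer)


def most_accurate_for(rows, language):
    """Filter to offerings that support the language, then take lowest WER."""
    supported = [r for r in rows if language in r.languages]
    return min(supported, key=lambda r: r.wer) if supported else None


print(best_price_to_accuracy(rows).name)
print(most_accurate_for(rows, "nl").name)
```

In this made-up data, offering-c has the lowest WER overall but does not support Dutch, so the language rule picks a different winner than the plain accuracy rule would.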

Where it is weak (read this)

WER is English-leaning. Accuracy uses the Artificial Analysis aggregate WER, which weights English heavily. A “most accurate for Dutch” verdict is a light estimate, not a Dutch-specific measurement.
Token-priced models are excluded from “cheapest”. For an LLM-based transcriber billed per input-audio token (e.g. Gemini), the headline token price understates real cost, so it is not directly comparable to per-minute pricing and is kept out of the price axis.
Latency figures are heterogeneous. Some are independent P50 numbers, others vendor claims; they are not strictly comparable, so a latency verdict carries that caveat.
Only the curated etalage is indexable. Hand-written published queries are crawlable; on-demand engine answers for the long tail are served noindex so the site never fills with thin near-duplicate pages.
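The token-pricing exclusion above amounts to filtering on the pricing unit before taking the minimum. The sketch below assumes a hypothetical `pricing_unit` field and made-up offerings and prices; it shows why a naive `min()` over raw price numbers would mislead.

```python
# Hypothetical records; names, units and prices are illustrative only.
offerings = [
    {"name": "offering-a", "pricing_unit": "per_minute", "price": 0.006},
    {"name": "offering-llm", "pricing_unit": "per_token", "price": 0.000001},
    {"name": "offering-b", "pricing_unit": "per_minute", "price": 0.004},
]


def cheapest_comparable(offerings):
    """Price axis: only per-minute prices compete; token-priced rows are skipped."""
    comparable = [o for o in offerings if o["pricing_unit"] == "per_minute"]
    return min(comparable, key=lambda o: o["price"])


# A naive min over the raw numbers mixes units and picks the token-priced row.
naive = min(offerings, key=lambda o: o["price"])["name"]
fair = cheapest_comparable(offerings)["name"]
print(naive, fair)
```

The token-priced offering still appears on other axes in this sketch; it is only kept out of the price comparison.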

Sources

Artificial Analysis: Speech to Text, 2026-06-19
Open ASR Leaderboard, 2026-06-19