I ran two SearchAPI queries permitted by the verification process. The only directly relevant hit is the blog post “How I built a sub-500ms latency voice agent from scratch” (ntik.me). That post explicitly describes aiming for perceptual latency under 500 ms and the practical work of wiring streaming STT into a token‑streaming LLM and back into TTS, using engineering “hacks” to meet latency targets. This is a concrete, practitioner-level artifact documenting a production‑oriented engineering effort to optimize end‑to‑end voice latency.
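The STT→LLM→TTS stitching the post describes can be sketched at a high level. The snippet below is a hedged illustration with stub stages (the function names, token strings, and clause-boundary flushing heuristic are all assumptions for demonstration, not the author's actual code); the key idea it shows is overlapping stages, starting TTS on the first clause of the LLM's reply instead of waiting for the full response.

```python
def stt_stream(audio_chunks):
    # Simulated streaming STT: emits growing partial transcripts as chunks arrive.
    text = ""
    for chunk in audio_chunks:
        text += chunk
        yield text

def llm_token_stream(prompt):
    # Simulated token-streaming LLM: yields reply tokens one at a time.
    for token in ("Sure,", " here", " is", " an", " answer."):
        yield token

def tts_flush(tokens):
    # Simulated incremental TTS feed: flush to the synthesizer at clause
    # boundaries (comma/period) rather than waiting for the whole reply.
    buffer = []
    for token in tokens:
        buffer.append(token)
        if token.rstrip().endswith((",", ".", "!", "?")):
            yield "".join(buffer)  # first flush = first audible response
            buffer = []
    if buffer:
        yield "".join(buffer)

def run_pipeline(audio_chunks):
    # Wire the stages together: take the final transcript from streaming STT,
    # stream LLM tokens into the incremental TTS feed.
    final_transcript = None
    for partial in stt_stream(audio_chunks):
        final_transcript = partial
    return list(tts_flush(llm_token_stream(final_transcript)))

print(run_pipeline(["hel", "lo ", "there"]))
# → ['Sure,', ' here is an answer.']
```

In a real system each stage would run concurrently over network streams; the latency win the post targets comes from the fact that the user hears the first clause while later tokens are still being generated.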
Beyond that single writeup, the two searches returned no additional independent community writeups, no GitHub issues that could corroborate recurring pain points, no arXiv or academic hits (e.g., VoXtream), and no vendor comparison threads (NVIDIA Riva, Deepgram, Whisper, etc.) or tutorial/architecture guides. Because the verification rule for this report requires two independent sources per factual claim, broader statements about systemic tooling gaps, widespread demand among indie founders, or the absence of a vendor-agnostic runtime cannot be treated as VERIFIED here.
Taken together, the verifiable evidence is narrow but specific: at least one practitioner did nontrivial engineering work to reach sub-500 ms perceptual latency and explicitly stitched STT→LLM→TTS together in custom ways. That single data point supports the cautious inference that targeting low latency in voice agents involves real engineering tradeoffs, but it does not establish how common those tradeoffs are across teams or whether existing vendor tooling already satisfies many production needs.