Essay

Are you using voice input? — On the next input interface after the keyboard

April 4, 2026 · by Masaki Kondo · 6 min read

Introduction — Caring about the input interface

Are you using voice input?

The AI paradigm is shifting rapidly these days, but I think something equally interesting is happening at the very front of the pipeline: the input side. Speech-to-Text has finally crossed into the territory of being a practical, everyday input method. To me, that feels like a real turning point.

Many engineers I know happily spend two or three hundred dollars on a keyboard — PFU's Happy Hacking Keyboard, custom-built mechanicals, you name it. I'm one of them; I went through a phase where I obsessed over keyboards. Custom keyboard meetups are thriving lately.

If we care that much about the keyboard as an input interface, maybe it's worth caring just as much about voice input as a new one. That's what this essay is about.

A bit about me

My name is Masaki Kondo. I am the CEO of Guide Inc. Vietnam, an IT company based in Vietnam. I'm involved in software development day to day, and lately I've put Claude Code at the very center of how I work.

Concretely, I keep a private repository called kondo-daily-ops where Claude Code helps me handle Backlog tickets from customers, internal team communication, daily work logs — basically everything. It pulls ticket context via the API, picks up history from past logs, runs my saved skills to draft a reply, and so on. I orchestrate it through natural-language instructions.

As you can imagine, these instructions are long natural-language sentences. Typing them all out on a keyboard is a chore.

Meeting AquaVoice

Around the end of 2025, I started using a voice input app called AquaVoice.

Being able to dictate those long instructions to Claude Code felt better than I expected. Voice input quickly became something I couldn't go without. "Read this ticket, pull context from the old logs, use this skill to draft a reply" — getting to speak that out loud is just unreasonably comfortable. Once you've felt it, you can't go back.

Why I decided to build my own

Honestly: I had no complaints about AquaVoice. I was perfectly happy with it.

So why build my own? Pure intellectual curiosity.

First, I'd been wanting to build something in Rust. I've done this kind of thing before: Guidebook (a Rust static site generator), our in-house VPN with Headscale, and so on. These are tools I use at work and built myself. Sharpening your own tools is just plain fun.

I was also curious about the technical machinery inside AquaVoice. Speech-to-Text first, then an LLM cleanup pass — that multi-stage pipeline fascinated me.
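That two-stage shape can be sketched in a few lines of Rust. This is only an illustration under stated assumptions, not how AquaVoice or koedesk is actually built: the trait names, the canned transcriber, and the rule-based cleaner are all hypothetical, and the cleaner uses simple rules (filler removal and dictionary substitution) in place of a real LLM call so the example runs without any API.

```rust
// Stage 1: turn audio into raw text. A real implementation would call an STT API.
trait Transcriber {
    fn transcribe(&self, audio: &[i16]) -> String;
}

// Stage 2: tidy the raw text. A real implementation would call an LLM.
trait Cleaner {
    fn clean(&self, raw: &str) -> String;
}

// Hypothetical stub: ignores the audio and returns a canned transcript.
struct CannedTranscriber(&'static str);

impl Transcriber for CannedTranscriber {
    fn transcribe(&self, _audio: &[i16]) -> String {
        self.0.to_string()
    }
}

// Hypothetical rule-based stand-in for the LLM cleanup pass.
struct RuleCleaner {
    fillers: Vec<&'static str>,
    dictionary: Vec<(&'static str, &'static str)>, // (heard, intended)
}

impl Cleaner for RuleCleaner {
    fn clean(&self, raw: &str) -> String {
        // Drop filler words, ignoring case and surrounding punctuation.
        let kept: Vec<&str> = raw
            .split_whitespace()
            .filter(|w| {
                let bare = w
                    .trim_matches(|c: char| !c.is_alphanumeric())
                    .to_lowercase();
                !self.fillers.iter().any(|f| *f == bare)
            })
            .collect();
        let mut text = kept.join(" ");
        // Replace misheard terms with their dictionary spelling.
        for (heard, intended) in &self.dictionary {
            text = text.replace(heard, intended);
        }
        text
    }
}

// The pipeline itself: stage 1 feeds stage 2.
fn dictate(stt: &dyn Transcriber, cleaner: &dyn Cleaner, audio: &[i16]) -> String {
    let raw = stt.transcribe(audio);
    cleaner.clean(&raw)
}

fn main() {
    let stt = CannedTranscriber("um, read this ticket and uh draft a reply with cloud code");
    let cleaner = RuleCleaner {
        fillers: vec!["um", "uh"],
        dictionary: vec![("cloud code", "Claude Code")],
    };
    println!("{}", dictate(&stt, &cleaner, &[]));
}
```

Keeping the two stages behind separate traits means either one can be swapped (a different STT model, or no cleanup at all) without touching the other.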

Even before I discovered AquaVoice, I was impressed by how accurately ChatGPT transcribed Japanese voice input. I used to do something pretty silly: dictate into the ChatGPT desktop app, then copy the transcript and paste it into Claude Code. That was the workflow.

Then OpenAI shipped gpt-4o-mini-transcribe as an API. "Wait, I could build this myself" — that was the spark.

A tour of Speech-to-Text models

In the course of building koedesk, my own voice input app, I tried a wide range of STT models.

Groq Whisper Large v3 Turbo — Fast. But hallucinates a bit.

OpenAI GPT-4o Transcribe — Accurate, but its hallucinations during silence are distracting. Strong on Japanese and English, but it falls apart the moment you mix English into Vietnamese. We develop software in Vietnam, so this one really hurt.

Mistral (Voxtral) — Not great.

Gemini — I tried letting it do STT and post-processing in one shot. It normalized so aggressively that it hallucinated content I never said. A wild horse.

The difficulty of LLM post-processing

To compensate for what STT alone can't fix, I also evaluated LLM-based post-processing (filler removal, dictionary substitution, formatting). I ran 20 benchmark patterns across 6 models, and each one clearly had its own personality.

OpenAI family: Conservative and careful. Applies the dictionary, but leaves unknown words alone. Zero hallucinations.
Gemini family: Aggressive and bold. Best-in-class for dictionary application, but rewrites words it thinks it knows into something "more correct." Say "Gemini 3 Flash" and it becomes "Gemini 1.5 Flash." The more familiar the word, the more dangerous the hallucination.
Claude family: Humble and safe. Doesn't break anything, but lacks confidence when applying the dictionary.
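The "Gemini 3 Flash" failure is the kind of thing a benchmark can catch mechanically. As a minimal sketch (the function name and the sample strings are hypothetical, and this is not the actual benchmark harness), one check is to list terms that must survive post-processing and flag any that were present in the raw transcript but are gone from the cleaned output:

```rust
/// Returns the protected terms that appear in `input` but are missing from `output`.
/// Matching is exact and case-sensitive, which is the point: a "corrected"
/// product name counts as a lost term.
fn lost_terms<'a>(input: &str, output: &str, protected: &[&'a str]) -> Vec<&'a str> {
    protected
        .iter()
        .copied()
        .filter(|term| input.contains(*term) && !output.contains(*term))
        .collect()
}

fn main() {
    let protected = ["Gemini 3 Flash", "koedesk"];
    let input = "compare Gemini 3 Flash output in koedesk";

    // A conservative model only fixes capitalization and punctuation.
    let faithful = "Compare Gemini 3 Flash output in koedesk.";
    // An aggressive model "corrects" the product name.
    let rewritten = "Compare Gemini 1.5 Flash output in koedesk.";

    println!("{:?}", lost_terms(input, faithful, &protected));
    println!("{:?}", lost_terms(input, rewritten, &protected));
}
```

The first call prints an empty list and the second reports "Gemini 3 Flash" as lost. A check like this only catches rewrites of terms you thought to protect, so it complements reading the outputs rather than replacing it.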

My conclusion: post-processing shouldn't really be necessary, and it'll fade as the STT models themselves get better.

Meeting ElevenLabs Scribe V2

And then I found ElevenLabs Scribe V2.

I'd never heard of the company before, but this model was a shock. Vietnamese, Japanese, English: accuracy was high across the board. For Japanese specifically, my subjective impression is that it has pulled ahead of OpenAI's models.

It also has a clean parameter for dictionary biasing — developer-friendly by design. Output quality is good enough that I don't need any post-processing at all. Today, koedesk uses Scribe V2 as its default model.

No post-processing, and quality that beats the other apps I've tried. At least for Japanese, I'm willing to say that with confidence.

My current development setup

After all this, the AI tools I actually use day to day are surprisingly few. Claude Code and koedesk. That's the lineup.

To put it another way, if Claude Code went down I couldn't even open a pull request on my own. That's how dependent I am.

How this very article was written

This article itself is a demonstration of voice input.

Dictate into koedesk to get a transcript
Have Claude Code clean up the prose
Have Claude Code drive the Git repo and push
Let Zenn Connect (GitHub integration) auto-publish

I barely touched the keyboard.

Why not try voice input?

If you already know AquaVoice you might be thinking, "Ah, that experience." But many people have never actually tried voice input.

I especially want Japanese-language users to feel the Japanese quality of ElevenLabs Scribe V2, the default model in koedesk. koedesk has a free plan with 5 minutes per day, no expiration, and no credit card required. If you like it, the Pro plan at $10/month gives you unlimited transcription.

I'd be happy if this article gives someone their first taste of voice input as a new kind of input interface.

To loop back to the opening: a small secret ambition of mine — I'd like to make koedesk the Happy Hacking Keyboard of voice input. …I'm kidding, sorry. But, well, half-kidding.