Why I Built My Own EPUB Media Overlay Editor

Take an ebook. Take an audiobook. Smash the two together and you get a “TalkingBook”: synchronized text and audio that highlights what’s being spoken and lets you click phrases to trigger their audio content.

In principle, it’s a simple idea. In fact, when I show it to people, it seems so intuitive: of course this is a thing. Why wouldn’t it do that? It’s like automated captions on a YouTube video, or how some social media apps automatically put spoken words on screen. A no-brainer. Doesn’t surprise anyone.

But here’s the catch: if you want to do this at scale, for book-length content, and you don’t only want the synchronizations to be “good enough”—you want them nailed down perfectly—you’re in for a world of hurt.

Turns out that in 2025, not only is this still quite an undertaking, but it’s also somewhat of a niche interest.

My journey with these TalkingBooks goes back quite a few years. I’ve gone from early experiments with forced-alignment tools like aeneas, to the crushing realization that support for EPUB+audio on major ebook platforms is basically non-existent, to more recent adventures with AI-supported auto-synchronization tools like Storyteller.

I documented (more like infodumped) a lot of this journey in this gargantuan blog post on the LearnOutLive blog. Earlier, I had already gotten fed up with existing TalkingBook reading software and built my own web-based reader that actually worked properly. But now I was facing a different problem: the editing side of things.

Building A TalkingBook Editing Suite That Doesn’t Suck

As I started working on my latest TalkingBook (currently #9 in a German learning series), I again turned to Storyteller (shout-out to Shane Friedman, what a cool project!) to give me a rough starting point for the synchronizations.

In a nutshell, Storyteller works like this: you run it locally, give it an EPUB file and some MP3s, and it uses Whisper (a speech-recognition model) to transcribe the audio and “automagically” synchronize the two.

And yes, it works better than forced alignment, and it’s certainly faster than doing it manually. But guess what: it’s just not accurate enough. Sometimes it will “draw” the highlights inaccurately, or it will synchronize the audio a few milliseconds off, so that when you click on the word “apple,” you just hear “-pple.”

For my use case of language-learning books, I need phrases perfectly synchronized, with no weird fragmentation of highlighted phrases and perfect timing.

And there’s a reason I’m so hellbent on getting this right. TalkingBooks aren’t just great for pronunciation practice (click on a phrase, repeat, follow along with eyes and ears). The real magic happens when you highlight phrases in just the right way. You can teach people about grammar and syntactical units almost subconsciously. This is especially powerful in German, where just visualizing sentence boundaries and clause structures can do more than hours of dry grammatical study. It’s incredibly intuitive! The brain picks up patterns when it can see and hear the language’s rhythm simultaneously.

But that only works if the synchronization is absolutely precise. Text fragments resulting from auto-synchronization will often drop a direct speech quotation mark or otherwise split phrases weirdly—this is common. But it’s mostly fine, as long as it follows the same pattern. Nothing a bit of regex can’t fix.
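To give a sense of what such a regex fix can look like, here’s a minimal TypeScript sketch. It assumes the auto-sync output wraps each fragment in a `<span>` and that the dropped character is a German opening quotation mark („ or ») left stranded right before the span. Both the markup and the function name `pullOpeningQuoteIntoSpan` are made up for illustration; real output may need a different pattern.

```typescript
// Matches a German opening quotation mark („ or ») that sits directly
// in front of an opening <span ...> tag (assumed fragment markup).
const QUOTE_BEFORE_SPAN = /([„»])(<span\b[^>]*>)/g;

// Hypothetical helper: move the stranded quotation mark inside the span
// so that it becomes part of the highlighted fragment.
function pullOpeningQuoteIntoSpan(html: string): string {
  // $1 is the quotation mark, $2 the opening tag; swap their order.
  return html.replace(QUOTE_BEFORE_SPAN, "$2$1");
}
```

Because the pattern only fires when the quotation mark touches the tag directly, text that is already correct passes through unchanged.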

That still leaves the timing issue. Until now, I’ve been using a FOSS tool called Tobi, which gives you a basic GUI to perfectly position audio boundaries where you need them.

I’ve been using it for the past few TalkingBooks, but the audio-adjustment GUI is the only good thing about it. Everything else is a bit painful. No disrespect to the devs, it just doesn’t fit my workflow.

I need to be able to load EPUBs, quickly adjust some fragments and timings, and export EPUB again. But with Tobi? The workflow is a mess: import the EPUB, convert all MP3s to WAV files, recreate all the EPUB content in some tool-specific format, dump hundreds of files everywhere, then compress everything back to MP3s on export. Every cycle loses audio quality. And don’t even get me started on trying to edit their Lovecraftian XML horror if you need to adjust how highlights are placed.

To make a long story short, I finally had enough last week and decided to build my own TalkingBook editor. Sometimes you just gotta do it yourself.

So yes, I very quickly put together a prototype in React that can parse EPUB files, display waveforms with WaveSurfer.js, draw regions over the waveforms, and let me select them.
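For context, the synchronization data in an EPUB with media overlays lives in SMIL files, where each `<par>` element pairs a text fragment with an audio clip via `clipBegin` and `clipEnd`. The sketch below is not the editor’s actual code, just a rough illustration of the parsing step. It uses regular expressions so it runs without a DOM (in a browser, `DOMParser` would be the more robust choice), and the names `Fragment`, `parseClock`, and `parseSmilFragments` are hypothetical.

```typescript
// One synchronized unit: a text fragment plus its audio clip.
interface Fragment {
  id: string;
  textSrc: string;   // e.g. "chapter1.xhtml#f001"
  audioSrc: string;  // e.g. "audio/chapter1.mp3"
  clipBegin: number; // seconds
  clipEnd: number;   // seconds
}

const UNIT_SECONDS: Record<string, number> = { h: 3600, min: 60, s: 1, ms: 0.001 };

// Convert a SMIL clock value ("0:00:01.500", "01:05.250", "12.5s", "800ms", "7")
// to seconds. A bare number is treated as seconds.
function parseClock(value: string): number {
  const v = value.trim();
  if (v.indexOf(":") !== -1) {
    // Clock form: fold hh:mm:ss or mm:ss into seconds, left to right.
    return v.split(":").reduce((total, part) => total * 60 + Number(part), 0);
  }
  const m = /^(\d+(?:\.\d+)?)(h|min|s|ms)?$/.exec(v);
  if (m === null) throw new Error(`Unrecognized clock value: ${value}`);
  return Number(m[1]) * (UNIT_SECONDS[m[2] ?? "s"] ?? 1);
}

// Read one double-quoted attribute out of a raw attribute string.
function attr(attrs: string, name: string): string | undefined {
  const m = new RegExp(`(?:^|\\s)${name}="([^"]*)"`).exec(attrs);
  return m === null ? undefined : m[1];
}

// Extract every <par> that has both a <text> and a fully timed <audio> child.
function parseSmilFragments(smil: string): Fragment[] {
  const fragments: Fragment[] = [];
  const parPattern = /<par\b([^>]*)>([\s\S]*?)<\/par>/g;
  let par: RegExpExecArray | null;
  while ((par = parPattern.exec(smil)) !== null) {
    const text = /<text\b([^>]*)>/.exec(par[2]);
    const audio = /<audio\b([^>]*)>/.exec(par[2]);
    if (text === null || audio === null) continue;
    const textSrc = attr(text[1], "src");
    const audioSrc = attr(audio[1], "src");
    const begin = attr(audio[1], "clipBegin");
    const end = attr(audio[1], "clipEnd");
    // Skip incomplete entries rather than guessing missing values.
    if (!textSrc || !audioSrc || !begin || !end) continue;
    fragments.push({
      id: attr(par[1], "id") ?? "",
      textSrc,
      audioSrc,
      clipBegin: parseClock(begin),
      clipEnd: parseClock(end),
    });
  }
  return fragments;
}
```

Once the fragments are plain objects with times in seconds, each one maps naturally onto a region drawn over the waveform.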

This application is designed for a single purpose: fine-tune TalkingBooks with surgical precision.

Key features:

  • Drag and drop audio regions: Move boundaries directly on the waveform until timing is pixel-perfect
  • Text-to-audio linking: Select text fragments to instantly hear their corresponding audio and vice versa
  • Fine-tune fragments: Edit timing values precisely, delete unwanted fragments directly from the GUI
  • Inline HTML editing: Make quick adjustments to highlight placement without leaving the app
  • Fragment splitting: A scissor tool for breaking up sprawling sentences into digestible chunks
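Under the hood, features like fragment splitting and typed-in timing values boil down to simple operations on time ranges. Here’s a minimal sketch of the idea in TypeScript, not the editor’s actual implementation. The `TimedFragment` shape, the `a`/`b` id suffixes, and the function names are assumptions made for illustration.

```typescript
// Simplified in-memory view of one highlighted phrase (assumed shape).
interface TimedFragment {
  id: string;
  text: string;
  begin: number; // seconds
  end: number;   // seconds
}

// Cut one fragment in two: the text is split at charIndex, the audio at splitTime.
function splitFragment(
  fragment: TimedFragment,
  charIndex: number,
  splitTime: number
): [TimedFragment, TimedFragment] {
  if (charIndex <= 0 || charIndex >= fragment.text.length) {
    throw new RangeError("charIndex must fall inside the fragment text");
  }
  if (splitTime <= fragment.begin || splitTime >= fragment.end) {
    throw new RangeError("splitTime must fall strictly between begin and end");
  }
  return [
    {
      id: `${fragment.id}a`, // hypothetical naming scheme for the two halves
      text: fragment.text.slice(0, charIndex).trim(),
      begin: fragment.begin,
      end: splitTime,
    },
    {
      id: `${fragment.id}b`,
      text: fragment.text.slice(charIndex).trim(),
      begin: splitTime,
      end: fragment.end,
    },
  ];
}

// Format seconds as an h:mm:ss.mmm clock value for writing back on export.
function formatClock(seconds: number): string {
  const totalMs = Math.round(seconds * 1000); // round once to avoid float drift
  const h = Math.floor(totalMs / 3600000);
  const m = Math.floor((totalMs % 3600000) / 60000);
  const s = Math.floor((totalMs % 60000) / 1000);
  const ms = totalMs % 1000;
  const pad = (n: number, width: number): string => ("000" + n).slice(-width);
  return `${h}:${pad(m, 2)}:${pad(s, 2)}.${pad(ms, 3)}`;
}
```

The two halves share the split time as their common boundary, so no gap or overlap sneaks into the audio. For a German sentence like “Er sagte, dass er morgen kommt.”, splitting after the comma gives the main clause and the subordinate clause their own highlights.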

The goal isn’t to replace the initial auto-sync process. Storyteller and similar tools are great for getting 80% there. But when you need that last 20% dialed in perfectly, you need something built for precision work.

Here’s the beautiful thing about building tools for your own use case: I’m developing this editor while actually using it on TalkingBook #9. No abstract specs or theoretical workflows, just real content that needs real solutions. When I notice something annoying or missing, I just implement it. The result is a tool that behaves exactly how I want it to, with an interface that feels intuitive because it evolved from actual use rather than pure guesswork.

Now, this tool probably won’t be a good fit for everyone. It’s designed specifically around my own idiosyncratic workflow and the particular needs of creating German language learning content. But who knows, maybe I’ll turn it into an open source project one day. After all, there are dozens of us who care about EPUB media overlays. Dozens!

Here are some screenshots:

We start by loading an EPUB file.

After parsing the EPUB structure, you can select audio and text fragments and adjust boundaries, or just type in values directly.

Quick fixes? Pop into the integrated HTML editor.