Why I Built My Own EPUB Media Overlay Editor


Take an ebook. Take an audiobook. Smash the two together and you get a “TalkingBook”: synchronized text and audio that highlights what’s being spoken and lets you click phrases to trigger their audio content.

In principle, it’s a simple idea. In fact, when I show it to people, it seems so intuitive: of course that’s a thing. Why wouldn’t it do that? It’s like automated captions on a YouTube video, or how some social media apps automatically put spoken words on screen. A no-brainer. Doesn’t surprise anyone.

But here’s the catch: this isn’t actually like those auto-generated captions at all. YouTube generates text as it listens to the audio. Here, you already have both pieces: the book text and the audio. It’s much more like syncing subtitles to movies. But while there’s a whole industry, established standards, and plenty of polished tools for film subtitling, the ebook world is still very much in experimental territory. The few tools that do exist are either outdated, produce inaccurate results, or require significant technical expertise to wrangle into something usable.

Turns out that in 2025, not only is this still quite an undertaking, but it’s also something of a niche interest.

My journey with these TalkingBooks goes back quite a few years. I’ve gone from early experiments with forced-alignment tools like aeneas, to the crushing realization that support for EPUB+audio on major ebook platforms is basically non-existent: even when you manage to create a perfectly synced TalkingBook, there’s nowhere decent to actually use it.

I documented (more like infodumped) various points on this journey in this old blog post on the LearnOutLive blog. Earlier, I had already gotten fed up with existing TalkingBook reading software and built my own web-based reader with full integration into our book shop. So I’d solved half the problem (at least for my readers): where to read TalkingBooks. But now I was facing the other half again: how to create them efficiently without tearing your hair out.

From Basic Sync To Surgical Precision

As I started working on my latest TalkingBook (currently #9 in a German learning series), I turned to Storyteller (shout-out to Shane Friedman, this project rocks!) to give me a basic starting point for the synchronizations.

In a nutshell, Storyteller works like this: it runs Whisper, a speech-recognition model, locally on an EPUB file and some MP3s you provide, and it tries to “automagically” synchronize the two.

And yes, it works. It’s better than forced alignment, and it’s certainly faster than doing it manually. Storyteller is genuinely one of the few bright spots in this ecosystem. But even with this excellent foundation, it’s just not always accurate enough. Sometimes it will “draw” the text regions inaccurately, or its synchronizations will be a few milliseconds off, splintering words or syllables. That’s not necessarily the fault of Storyteller; it just comes with the territory of letting speech-recognition models solve these problems. Better than doing it manually, but far from perfect.

For my use case of language-learning books, I need phrases perfectly synchronized, with no weird fragmentation of text phrases and perfect timing.

And there’s a reason I’m so hellbent on getting this right. TalkingBooks aren’t just great for pronunciation practice (click on a phrase, repeat, follow along with eyes and ears). The real magic happens when you highlight phrases in just the right way. You can teach people about grammar and syntactical units almost subconsciously. This is especially powerful in German, where just visualizing sentence boundaries and clause structures can do more than hours of dry grammatical study. It’s incredibly intuitive! The brain picks up patterns when it can see and hear the language’s rhythm simultaneously.

But that only works if the synchronization is absolutely precise. Text fragments resulting from auto-synchronization will often drop a direct-speech quotation mark or otherwise split phrases weirdly. This is common, but it’s mostly fine as long as the errors follow the same pattern. Nothing a bit of regex or a few quick edits in Sigil can’t fix.

That still leaves the timing issue. My tool for that has been Tobi, a FOSS application that gives you a basic GUI to position audio boundaries exactly where you need them.

I’ve been using it for the past few TalkingBooks, and the GUI for audio has been extremely helpful, but the overall workflow doesn’t quite fit my needs.

I need to be able to load EPUBs, quickly adjust some fragments and timings, and export EPUB again. But with Tobi? The workflow is a mess: import that EPUB, convert all MP3s to WAV files, recreate all the EPUB content in this unwieldy XUK format, dump hundreds of files everywhere, then compress everything back to MP3s on export. Every round trip re-encodes the audio and loses quality, unless you maintain XUK as the single source of truth. Want to adjust some text fragments? Sorry, not possible, unless you are willing to wade into a world of Lovecraftian XML:

Snapshot of code from a Tobi XUK project. Can you see the text content in that sea of markup?

And for a long while, this was as good as it got: bouncing between Sigil and Tobi, XUK and EPUB. But last week, I finally had enough and decided to try my hand at building my own TalkingBook editor. Because you can complain about existing solutions all you like; sometimes, you just gotta do it yourself.

A TalkingBook Editor That Doesn’t Suck

I very quickly put together a basic prototype in React that can parse EPUB files, display waveforms with WaveSurfer.js, draw regions over the waveforms, and let me select and adjust them. I expected to run into major obstacles, but surprisingly, it was pretty smooth sailing, and after a few iterations I had a robust solution.

This application is designed for a single purpose: fine-tune TalkingBooks with surgical precision.

Key features:

  • Drag and drop audio regions: Move boundaries directly on the waveform until timing is pixel-perfect
  • Text-to-audio linking: Select text fragments to instantly hear their corresponding audio and vice versa
  • Fine-tune fragments: Edit timing values precisely and delete unwanted fragments directly from the GUI
  • Inline HTML editing: Make quick adjustments to highlight placement without leaving the app
  • Fragment splitting: A scissor tool for breaking up sprawling sentences into digestible chunks
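
To make the scissor tool a little more concrete, here is a minimal sketch of the timing half of a split, written in Python for brevity rather than in the editor’s actual React code. The `Fragment` class, its field names, and `split_fragment` are all hypothetical; the text half (wrapping the second part of the phrase in its own element with the new id) is left out.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One text/audio pair. All field names here are hypothetical."""
    text_id: str       # id of the highlighted element in the XHTML
    clip_begin: float  # start of the audio clip, in seconds
    clip_end: float    # end of the audio clip, in seconds


def split_fragment(frag: Fragment, at: float, new_id: str) -> tuple[Fragment, Fragment]:
    """Cut one fragment into two adjacent fragments at time `at` (seconds)."""
    if not frag.clip_begin < at < frag.clip_end:
        raise ValueError("split point must lie strictly inside the fragment")
    # The first half keeps the original id; the second half gets the new one.
    first = Fragment(frag.text_id, frag.clip_begin, at)
    second = Fragment(new_id, at, frag.clip_end)
    return first, second


# Example: cut a 4.7-second fragment at the 5.1-second mark of the audio file.
first, second = split_fragment(Fragment("f12", 3.2, 7.9), at=5.1, new_id="f12b")
```

The two halves share the split point as a boundary, so no audio is dropped or played twice.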

The goal isn’t to replace the initial auto-sync process. Storyteller and similar tools are great for getting 80% there. But when you need that last 20% dialed in perfectly, you need something built for precision work.

Here’s the beautiful thing about building tools for your own use case: I’m developing this editor while actually using it on TalkingBook #9. No abstract specs or theoretical workflows, just real content that needs real solutions. When I notice something annoying or missing, I just implement it. The result is a tool that behaves exactly how I want it to, with an interface that feels intuitive because it evolved from actual use rather than pure guesswork.

Now, this tool probably won’t be a good fit for everyone. It’s designed specifically around my own idiosyncratic workflow and the particular needs of creating German language learning content. But who knows, maybe I’ll turn it into an open source project one day. After all, there are dozens of us who care about EPUB media overlays. Dozens!

Here are some screenshots:

Upload Screen:

The first step is to load the EPUB file from disk. I may add in some kind of project management system with persistent saving, “recent projects”, etc. later.

Main Workspace:

For the main workspace I went with a basic four-panel layout:

  • left panel shows all chapters in the EPUB (example in screenshot is a test EPUB with just one chapter)
  • middle panel shows the content of that chapter with highlighting of text fragments (CSS forces fragments onto separate lines for easier debugging)
  • right panel shows detailed timing information for the selected audio fragment, plus controls for splitting, deleting, etc.
  • bottom panel shows the associated audio file and all regions, i.e. audio fragments

After parsing the EPUB structure you can select audio and text fragments and adjust boundaries or just type in values directly.
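
For readers unfamiliar with media overlays: the text-to-audio mapping lives in SMIL files inside the EPUB, where each `<par>` element pairs a text reference with an audio clip. The following is a rough Python sketch of reading those pairs, not the editor’s own parser. The sample document, the file names in it, and the helper names `parse_clock` and `read_fragments` are made up for illustration.

```python
import xml.etree.ElementTree as ET

SMIL_NS = {"smil": "http://www.w3.org/ns/SMIL"}

# A tiny made-up overlay document; file names and ids are placeholders.
SAMPLE_SMIL = """<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">
  <body>
    <par id="p1">
      <text src="ch01.xhtml#f001"/>
      <audio src="audio/ch01.mp3" clipBegin="0:00:00.000" clipEnd="0:00:02.500"/>
    </par>
    <par id="p2">
      <text src="ch01.xhtml#f002"/>
      <audio src="audio/ch01.mp3" clipBegin="0:00:02.500" clipEnd="0:00:05.100"/>
    </par>
  </body>
</smil>"""


def parse_clock(value: str) -> float:
    """Convert a SMIL clock value to seconds."""
    value = value.strip()
    if ":" in value:
        # Full (hh:mm:ss.fff) or partial (mm:ss.fff) clock value.
        seconds = 0.0
        for part in value.split(":"):
            seconds = seconds * 60 + float(part)
        return seconds
    # Timecount value such as "500ms", "2min", "1h" or "12.5s".
    # "ms" is checked before "s" so that it is not mistaken for seconds.
    for suffix, factor in (("ms", 0.001), ("min", 60.0), ("h", 3600.0), ("s", 1.0)):
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * factor
    return float(value)  # a bare number means seconds


def read_fragments(smil_xml: str) -> list[dict]:
    """Return one dict per <par> that has both a text and an audio child."""
    root = ET.fromstring(smil_xml)
    fragments = []
    for par in root.iterfind(".//smil:par", SMIL_NS):
        text = par.find("smil:text", SMIL_NS)
        audio = par.find("smil:audio", SMIL_NS)
        if text is None or audio is None:
            continue
        end = audio.get("clipEnd")
        fragments.append({
            "id": par.get("id"),
            "text_src": text.get("src"),
            "audio_src": audio.get("src"),
            "clip_begin": parse_clock(audio.get("clipBegin", "0")),
            "clip_end": parse_clock(end) if end else None,  # None: plays to end
        })
    return fragments


fragments = read_fragments(SAMPLE_SMIL)
```

Each entry in `fragments` then has what a workspace like this needs: which element to highlight, which audio file to load, and where the region starts and ends.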

HTML editor:

Quick fixes? Pop into the integrated HTML editor. No extra messy markup (apart from the mess I wrote myself).

Done? Just hit Export and get an EPUB file containing all changes. Need more adjustments? Just load that EPUB file again and continue working. Throughout, the EPUB remains the single source of truth. The audio encoding never gets touched; only the SMIL timestamps (and, optionally, the XHTML) are updated.
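
As an illustration of why the audio can stay untouched, here is a minimal Python sketch of a timestamp-only export step. It rewrites the `clipBegin` and `clipEnd` attributes in a SMIL document and leaves everything else alone. This is an assumption about how such a step could look, not the editor’s real export code; `format_clock`, `update_clip_times`, and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

SMIL_URI = "http://www.w3.org/ns/SMIL"
# Keep SMIL as the default namespace when serializing (no "ns0:" prefixes).
ET.register_namespace("", SMIL_URI)

# A made-up one-fragment overlay; the file names are placeholders.
SAMPLE = """<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">
  <body>
    <par id="p1">
      <text src="ch01.xhtml#f001"/>
      <audio src="audio/ch01.mp3" clipBegin="0:00:00.000" clipEnd="0:00:02.500"/>
    </par>
  </body>
</smil>"""


def format_clock(seconds: float) -> str:
    """Format seconds as an h:mm:ss.mmm clock value."""
    # Round to whole milliseconds first so 59.9996 becomes 0:01:00.000
    # rather than an invalid 0:00:60.000.
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours}:{minutes:02d}:{secs:02d}.{ms:03d}"


def update_clip_times(smil_xml: str, new_times: dict[str, tuple[float, float]]) -> str:
    """Set clipBegin/clipEnd for each <par> whose id appears in `new_times`."""
    root = ET.fromstring(smil_xml)
    for par in root.iter(f"{{{SMIL_URI}}}par"):
        times = new_times.get(par.get("id"))
        audio = par.find(f"{{{SMIL_URI}}}audio")
        if times is None or audio is None:
            continue  # fragment was not edited, or has no audio clip
        begin, end = times
        audio.set("clipBegin", format_clock(begin))
        audio.set("clipEnd", format_clock(end))
    return ET.tostring(root, encoding="unicode")


# Pull the end of the first fragment in by 20 milliseconds.
updated = update_clip_times(SAMPLE, {"p1": (0.0, 2.48)})
```

Only two attribute values change in the output; the `src` reference still points at the same, never re-encoded MP3.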