Quelle: Extensible E-Book Scraper

Architectural overview

Web scraping is inherently brittle. Websites change their DOM structure, API endpoints, or pagination logic constantly. If your scraper is a monolithic application, every minor website update requires a patch to the core engine, a recompile, and a new release pushed to users.

I built Quelle to fix this coupling. It’s a CLI library manager that searches, downloads, and exports web novels to EPUB or PDF formats. The extraction logic doesn’t live in the core application; it’s isolated entirely in WebAssembly extensions.

Quelle is technically not the first attempt at this problem. novelsave came before it, a Python tool I maintained for a long time, but it never had a proper extension system. Scraper logic was baked in, which meant every source update was my problem to fix and ship. Quelle is the version I actually wanted to build.

I wanted an extension system that users could install and update independently. When a source website changes, only that specific .wasm extension needs updating. The core library manager stays untouched.

“If a user has to wait for a core application update just because a website changed a CSS class, the architecture has failed.”

The sandbox constraint

Allowing users to install arbitrary third-party scrapers introduces a real security risk. Standard plugins run in the same process as the host application; a malicious or poorly written scraper could read local files or exfiltrate environment variables.

WebAssembly solves this through a capability-based sandbox. The core engine, built with Rust and Wasmtime, defines a strict WebAssembly Interface Type (WIT) boundary. Extensions can return structured novel data, but they have zero access to the host filesystem or unapproved network sockets. Users can install extensions from a remote Git registry without worrying about what they’re actually running.
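
To make the capability model concrete, here is a minimal sketch of the pattern using Wasmtime’s component API. None of this is Quelle’s actual code; the `fetch` import and its string-in, string-out shape are assumptions for illustration.

```rust
use wasmtime::component::{Component, Linker};
use wasmtime::{Engine, Store};

fn load_extension(path: &str) -> wasmtime::Result<()> {
    let engine = Engine::default();
    let component = Component::from_file(&engine, path)?;

    // The linker is the whole capability surface. An import that is
    // never linked here simply does not exist inside the sandbox.
    let mut linker: Linker<()> = Linker::new(&engine);

    // Grant exactly one capability: an HTTP fetch mediated by the host.
    linker.root().func_wrap(
        "fetch",
        |_store, (url,): (String,)| -> wasmtime::Result<(String,)> {
            // Host-side policy (domain allowlists, rate limits) goes here.
            Ok((format!("<html>stubbed response for {url}</html>"),))
        },
    )?;

    let mut store = Store::new(&engine, ());
    let _instance = linker.instantiate(&mut store, &component)?;
    Ok(())
}
```

The guest can call `fetch` and nothing else. There is no ambient filesystem or network access to forget to revoke, which is what makes third-party extensions installable without trust.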

Getting the WIT layer right

This was the part I underestimated the most. I went through a couple of full rewrites trying to settle on a WIT design that felt right, one that was expressive enough to handle varied scraping patterns without leaking too much host-side detail into the extension API. Each rewrite taught me something, but it took longer than I’d like to admit. Where I eventually landed draws heavily from the Zed editor’s extension system; credit goes to them for the approach. It gave me a solid reference point to stop second-guessing the design and start building on top of it.
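
For a feel of what that boundary looks like in practice, here is a hypothetical WIT world inlined through Wasmtime’s `bindgen!` macro. The package name, record fields, and function are stand-ins, not Quelle’s real interface.

```rust
// Hypothetical contract between host and extension, expressed as WIT and
// fed to wasmtime's bindgen macro, which generates the host-side traits
// and types for everything declared in the world.
wasmtime::component::bindgen!({
    world: "extension",
    inline: r#"
        package quelle:scraper;

        interface source {
            record novel {
                title: string,
                authors: list<string>,
                chapter-urls: list<string>,
            }

            // The guest returns structured data; the host never sees
            // how the extension assembled it.
            fetch-novel: func(url: string) -> result<novel, string>;
        }

        world extension {
            export source;
        }
    "#,
});
```

The appeal of this style is that everything crossing the boundary is declared up front. If a capability or type isn’t in the world, no extension can reach for it, and neither side can drift without the bindings failing to compile.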

Wasmtime’s documentation didn’t help much either. It covers the basics well enough, but past a certain point you’re mostly learning by trying things and seeing what breaks. The component model in particular had a steep ramp; I pieced together a lot of it from GitHub issues, reading other projects’ source, and sheer trial and error.

Moving HTML parsing to the host

One decision that paid off significantly: I moved HTML parsing out of the extensions and into the host engine. Initially, each .wasm extension bundled its own parsing logic, which ballooned the file sizes considerably. After a lot of deliberation, partly because it meant tightening the host/extension contract, I shifted to a model where the host handles all HTML parsing and extensions receive structured data they can work with directly. The reduction in .wasm size was substantial, and extensions became meaningfully simpler to write.
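
As a rough sketch of the shape this takes, assuming the `scraper` crate on the host side; the `select_text` name and signature are mine for illustration, not Quelle’s actual contract.

```rust
use scraper::{Html, Selector};

// Host implementation behind a hypothetical `select-text(html, selector)`
// import: the host parses the document, the extension gets plain strings.
fn select_text(html: &str, css_selector: &str) -> Result<Vec<String>, String> {
    let selector = Selector::parse(css_selector)
        .map_err(|e| format!("invalid selector: {e:?}"))?;
    let document = Html::parse_document(html);
    Ok(document
        .select(&selector)
        .map(|element| element.text().collect::<String>())
        .collect())
}
```

An extension asking for chapter titles sends a CSS selector and gets back a list of strings. No DOM types cross the boundary, and no HTML parser gets compiled into every .wasm file.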

Where it stands

Quelle is still a work in progress. It’s also become more of a technical exploration than a practical tool for me personally; I don’t really use anything like this anymore. That’s fine by me. It functions as a CLI tool today, but the Wasm architecture sets it up well for what comes next. The plan is to carry the Rust runtime to every target platform (mobile, desktop, wherever) rather than rewrite the host layer in a different language for each one. The extensions stay the same, the core stays the same, and the surface that changes per platform is kept as thin as possible. The groundwork is there; it’s just a matter of building on top of it.