Your chunk boundary is an event

For your domain the atomic unit is almost certainly an event: something happened, to someone, at a time, with an outcome. Make that the chunk. Not sentences, not tokens, not paragraphs.

Define my atomic unit →

Why text spans fail at scale

Token windows, sentence splits, and overlap are guesses about where meaning ends. They drift the moment your corpus grows or your documents change shape. You end up tuning a splitter forever, chasing a target that keeps moving.

An event has natural edges

An event already knows where it begins and ends. A transaction, a diagnosis, a job change: each is one bounded thing with one outcome. The boundary lives in the world, not in your tokenizer.

What every event carries

A schema

Each event type has a fixed set of typed fields: who, what, when, how much, what result. Types make retrieval deterministic instead of fuzzy.
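As a rough illustration, a typed event could look like the sketch below. The `TransactionEvent` name, its fields, and the `Outcome` values are hypothetical, chosen only to mirror who, what, when, how much, and what result; your domain will have its own.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal
from enum import Enum


class Outcome(Enum):
    # Hypothetical outcome values, for illustration only.
    SETTLED = "settled"
    DECLINED = "declined"


@dataclass(frozen=True)
class TransactionEvent:
    actor_id: str          # who
    action: str            # what
    occurred_at: datetime  # when
    amount: Decimal        # how much
    outcome: Outcome       # what result


def find_by_outcome(events, outcome):
    """Exact match on a typed field: the same input always returns the same rows."""
    return [e for e in events if e.outcome is outcome]


events = [
    TransactionEvent("cust-1", "card_payment",
                     datetime(2024, 3, 1, 9, 30, tzinfo=timezone.utc),
                     Decimal("42.50"), Outcome.SETTLED),
    TransactionEvent("cust-2", "card_payment",
                     datetime(2024, 3, 1, 9, 45, tzinfo=timezone.utc),
                     Decimal("19.99"), Outcome.DECLINED),
]
declined = find_by_outcome(events, Outcome.DECLINED)
```

Because `outcome` is an enum rather than free text, the filter is an equality check, not a similarity guess.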

A timestamp

Every event is anchored in time, so you can rank by recency, build baselines, and reason about cause and sequence.
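A minimal sketch of what the timestamp buys you, assuming each event exposes an `occurred_at` field. The `Event` type and both helper names are made up for this example.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Event(NamedTuple):
    # Hypothetical minimal event: an id plus the time it happened.
    event_id: str
    occurred_at: datetime


def rank_by_recency(events, limit=None):
    """Newest first; optionally keep only the top `limit` events."""
    ranked = sorted(events, key=lambda e: e.occurred_at, reverse=True)
    return ranked if limit is None else ranked[:limit]


def in_sequence(events):
    """Oldest first, the order you need to reason about what led to what."""
    return sorted(events, key=lambda e: e.occurred_at)


events = [
    Event("diagnosis", datetime(2024, 1, 10, tzinfo=timezone.utc)),
    Event("follow_up", datetime(2024, 2, 2, tzinfo=timezone.utc)),
    Event("referral", datetime(2023, 12, 5, tzinfo=timezone.utc)),
]
latest_two = rank_by_recency(events, limit=2)
timeline = in_sequence(events)
```

Both orderings come from the same field, so recency ranking and sequence reconstruction need no extra index.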

Typed fields

Locked in week one

Migrations, not edits

You define the schema once and treat every change as a formal migration. That discipline turns chunking from an open research problem into a closed configuration decision.
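One way to picture "migrations, not edits" is a version number on every event plus a registry of upgrade steps. This is a sketch under stated assumptions: the `schema_version` field, the `MIGRATIONS` registry, and the v1 to v2 change (splitting a combined amount string) are all hypothetical.

```python
SCHEMA_VERSION = 2  # the version every stored event should end up at


def _v1_to_v2(event):
    # Hypothetical change: v1 stored "42.50 USD" in one field,
    # v2 keeps the amount and the currency in separate fields.
    amount, currency = event["amount"].split()
    return {**event, "amount": amount, "currency": currency, "schema_version": 2}


# Maps a source version to the function that upgrades it by one step.
MIGRATIONS = {1: _v1_to_v2}


def migrate(event):
    """Apply upgrade steps in order until the event reaches SCHEMA_VERSION."""
    current = dict(event)  # copy, so the stored original is left untouched
    while current["schema_version"] < SCHEMA_VERSION:
        step = MIGRATIONS[current["schema_version"]]
        current = step(current)
    return current


old_event = {"schema_version": 1, "actor_id": "cust-1", "amount": "42.50 USD"}
new_event = migrate(old_event)
```

Each schema change becomes one named, ordered function, so old events can be brought forward step by step instead of being patched by hand.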

The cost of leaving the unit undefined

Leave the atomic unit undefined and retrieval bugs have no stable place to be fixed. You cannot rank what you cannot name, so you keep tuning splitters and hoping. Each new document format reopens a debate you thought was closed.

Worse, the cost compounds in silence. By the time quality visibly drops, you already hold months of data chunked the wrong way. Re-chunking at that point is a migration under load, not a quiet tweak.

What you get when the unit is fixed

Retrieval becomes fast, deterministic, and cheap, because the system always knows exactly what one row of meaning is. You stop debating chunk size and start shipping. A new question becomes a query, not a re-indexing project.

Common questions

What if my data is messy prose, not clean events?

We extract events from prose at ingestion. The prose stays as the source of record, and the extracted event becomes the chunk you retrieve against.

Isn't this just structured logging?

It is structured logging with a retrieval contract: typed, timestamped, and built from day one to be embedded, ranked, and reasoned over.

Define your atomic unit in 20 minutes

Bring one domain. We name the event that is your chunk boundary and sketch its first schema on the call.

Book my technical call →