Your chunk boundary is an event
For your domain the atomic unit is almost certainly an event: something happened, to someone, at a time, with an outcome. Make that the chunk. Not sentences, not tokens, not paragraphs.
Why text spans fail at scale
Token windows, sentence splits, and overlap are guesses about where meaning ends. They drift the moment your corpus grows or your documents change shape. You end up tuning a splitter forever, chasing a target that keeps moving.
An event has natural edges
An event already knows where it begins and ends. A transaction, a diagnosis, a job change: each is one bounded thing with one outcome. The boundary lives in the world, not in your tokenizer.
What every event carries
A schema
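Each event type has a fixed set of typed fields: who, what, when, how much, what result. Typed fields let you filter and sort on exact values, so the same query returns the same rows every time instead of depending on fuzzy similarity alone.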
A timestamp
Every event is anchored in time, so you can rank by recency, build baselines, and reason about sequence and possible cause.
Typed fields
Locked in week one
Migrations, not edits
You define the schema once and treat every change as a formal migration. That discipline turns chunking from an open research problem into a closed configuration decision.
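As a rough illustration of a typed, timestamped event and a formal migration, here is a minimal Python sketch. Everything named in it is hypothetical: the `TransactionEvent` type, its fields, the version numbers, and the assumption that old rows were all in USD are invented for the example, not taken from any real schema.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal

SCHEMA_VERSION = 2  # hypothetical current version of this event type


@dataclass(frozen=True)
class TransactionEvent:
    """One hypothetical event type: who, what, when, how much, what result."""
    who: str
    what: str
    when: datetime
    amount: Decimal
    result: str
    currency: str = "USD"  # field introduced in the hypothetical v2
    schema_version: int = SCHEMA_VERSION


def migrate_v1_to_v2(row: dict) -> dict:
    """Formal migration step: v1 rows had no currency, v2 requires one.

    Assumption for this sketch: every v1 row was recorded in USD.
    """
    if row.get("schema_version", 1) != 1:
        return row  # already at v2 or later, nothing to do
    return {**row, "currency": "USD", "schema_version": 2}


def load_event(row: dict) -> TransactionEvent:
    """Upgrade a stored row to the current schema, then parse it into types."""
    row = migrate_v1_to_v2(row)
    return TransactionEvent(
        who=row["who"],
        what=row["what"],
        when=datetime.fromisoformat(row["when"]),
        amount=Decimal(row["amount"]),
        result=row["result"],
        currency=row["currency"],
        schema_version=row["schema_version"],
    )


# A row written before the currency field existed.
legacy_row = {
    "who": "customer-17",
    "what": "card_payment",
    "when": "2024-03-01T09:30:00+00:00",
    "amount": "42.50",
    "result": "settled",
}
event = load_event(legacy_row)
```

The point of the design is that old rows are never edited by hand. They pass through a named, versioned migration on the way in, so the schema change is recorded and repeatable.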
The cost of leaving the unit undefined
Leave the atomic unit undefined and retrieval bugs have no stable place to be fixed. You cannot rank what you cannot name, so you keep tuning splitters and hoping. Each new document format reopens a debate you thought was closed.
Worse, the cost compounds in silence. By the time quality visibly drops, you already hold months of data chunked the wrong way. Re-chunking then is a migration under load, not a quiet tweak.
What you get when the unit is fixed
Retrieval becomes fast, deterministic, and cheap, because the system always knows exactly what one row of meaning is. You stop debating chunk size and start shipping. A new question becomes a query, not a re-indexing project.
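To make "a new question becomes a query" concrete, here is a small sketch in Python. The `Event` type, its fields, and the `latest_events` helper are hypothetical names chosen for the example. It assumes events are already stored as typed rows and answers a new question by filtering on exact fields and ranking by timestamp, with no re-indexing.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """Hypothetical minimal event row: a kind, a subject, a time, an outcome."""
    kind: str
    who: str
    when: datetime
    result: str


def latest_events(events, *, kind, who, limit=2):
    """Filter on exact typed fields, then rank by recency (newest first)."""
    matches = [e for e in events if e.kind == kind and e.who == who]
    return sorted(matches, key=lambda e: e.when, reverse=True)[:limit]


events = [
    Event("job_change", "alice", datetime(2021, 5, 1), "promoted"),
    Event("job_change", "alice", datetime(2023, 9, 15), "moved_team"),
    Event("diagnosis", "alice", datetime(2022, 1, 10), "confirmed"),
    Event("job_change", "bob", datetime(2024, 2, 2), "hired"),
]

# New question: "what were Alice's most recent job changes?"
recent = latest_events(events, kind="job_change", who="alice")
```

In a real system the filter would likely run in a database or index rather than a list comprehension. The shape of the query stays the same either way.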
Common questions
What if my data is messy prose, not clean events?
We extract events from prose at ingestion. The prose stays as the source of record, and the extracted event becomes the chunk you retrieve against.
Isn't this just structured logging?
It is structured logging with a retrieval contract: typed, timestamped, and built from day one to be embedded, ranked, and reasoned over.
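For the messy-prose case, the sketch below shows only the shape of the idea: the prose is kept as the source of record and a typed event is extracted from it. The extraction method here is an assumption. A single hypothetical regular expression stands in for whatever extractor is actually used at ingestion, and `ExtractedEvent`, `PATTERN`, and `extract_event` are illustrative names.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ExtractedEvent:
    """Hypothetical extracted event that keeps a pointer back to its prose."""
    who: str
    what: str
    when: date
    source_text: str  # the original prose, kept as the source of record


# Stand-in extractor: matches one narrow sentence shape only.
PATTERN = re.compile(
    r"(?P<who>[A-Z][a-z]+) was (?P<what>hired|promoted|diagnosed) "
    r"on (?P<when>\d{4}-\d{2}-\d{2})"
)


def extract_event(prose: str) -> Optional[ExtractedEvent]:
    """Return a typed event if the prose contains one, otherwise None."""
    m = PATTERN.search(prose)
    if m is None:
        return None
    return ExtractedEvent(
        who=m["who"],
        what=m["what"],
        when=date.fromisoformat(m["when"]),
        source_text=prose,
    )


note = (
    "After two strong reviews, Alice was promoted on 2023-09-15 "
    "and moved to the platform team."
)
event = extract_event(note)
```

Retrieval then runs against the extracted event, while `source_text` lets you show or audit the passage it came from.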
Define your atomic unit in 20 minutes
Bring one domain. We name the event that is your chunk boundary and sketch its first schema on the call.
Book my technical call →