Haeru: Architecture of a Companion That Remembers
Haeru is a life-companion AI. Not a chatbot, not a therapist, not a journal with a language model stapled to it. A companion that witnesses your life — pays attention when you aren't, notices what you can't see about yourself, and reaches out when you go quiet.
The north star: "Know your life."
Without its intelligence layer, Haeru is a nice chatbot that forgets you tomorrow. With it, Haeru is the thing that was paying attention when you weren't.
This post is a look under the hood.
The Topology
Three pillars. Platform is the spine: Gleam on the BEAM VM, HTTP and WebSocket API, 17 OTP-supervised workers, 92 database migrations. Brain is where the thinking happens: Python/FastAPI, chat, memory extraction, knowledge graph, insight engine. iOS is the client: SwiftUI, MVVM, voice mode, the whole native experience.
Each one deserves its own post. But the interesting stuff lives in Brain.
Why Gleam
Haeru started in Rust.
I know, I know. But hear me out — the compiler errors made AI-assisted coding genuinely fast. Type errors were obvious, fixes were mechanical. The problem? Build times killed iteration speed. A change-compile-test cycle that should take seconds was taking minutes. When you're a solo founder trying to ship, minutes are not seconds. They're death.
Gleam offered the same proposition. Functional, compiled, exhaustive pattern matching, a compiler that catches real bugs, but with instant test runs. The BEAM VM gives you OTP supervision for free, which matters enormously when you're running 17 long-lived RabbitMQ consumers that absolutely cannot take down the HTTP listener if one crashes.
And they will crash. Trust me. I learned this the hard way exactly once. An unsupervised consumer crashed, cascaded to its parent, and took down the HTTP listener. Every long-lived process now runs under OTP supervision. OneForOne strategy. Non-negotiable.
The Intelligence Layer
This is the part that makes Haeru more than a chatbot. Everything else is plumbing. This is the soul.
Three engines, each grounded in specific psychology research:
| Engine | Mode | Core Question |
|---|---|---|
| Grounding | Observing | "Where are you right now?" |
| Insight | Observing | "What can't you see about yourself?" |
| Biographer | Asking | "What don't I know about your life?" |
The Research That Shaped Everything
Two findings shaped the system's design. The Bias Blind Spot (Pronin, 2007) — people reliably detect cognitive biases in others while remaining blind to their own, and intelligence amplifies this. An external observer with memory can see what you genuinely cannot. The 43% Problem — 43% of significant life events are systematically omitted from personal narratives. A system that tracks what you've mentioned, and what you haven't, can detect these structural holes.
The delivery principle: "I notice" not "You are." "You have a pattern of avoidance" triggers defensiveness. "I notice you pull back when things get close. What happens there?" opens a door.
The Knowledge Graph: Making an AI Actually Remember
Why Not Just RAG
Vanilla RAG — chunk text, embed it, retrieve by similarity — works fine for static documents. It does not work for a life companion. The problem is temporal. People change. You might tell Haeru in January that you're excited about your new job. In March, you're burned out. In June, you quit. A vector database treats these as three similar chunks about "work." A knowledge graph models them as an evolution — the same entity, three states, a trajectory.
That trajectory is the entire point. Haeru's insight engine needs to say "you were excited about this job five months ago — what changed?" You can't do that with cosine similarity over frozen embeddings. You need structure. You need nodes that update, relationships that carry timestamps, entities that evolve.
Graphiti and What I Built Around It
Haeru uses Graphiti (built on FalkorDB/Redis) as the foundation. Out of the box, Graphiti handles entity and relationship extraction from text and provides basic hybrid search — semantic similarity plus keyword matching. It's good infrastructure. But a life companion needs more than basic hybrid search.
Everything else is custom work built around Graphiti without forking the library.
Prompt interception. A custom InterceptingOpenAIClient wraps Graphiti's extraction calls to control how entities and relationships are extracted, giving me fine-grained control over what goes into the graph without modifying Graphiti's internals.
PCST for multi-hop retrieval. Standard retrieval fails on multi-hop inference. If a user asks "Why do I always get anxious before visiting my parents?", the answer might require traversing: User → anxiety → childhood home → parents → unresolved argument from six months ago. No single embedding similarity search will find that path. I added the Prize-Collecting Steiner Tree (PCST) algorithm via pcst-fast on top of Graphiti's search results. Given seed nodes from initial search, PCST finds the minimum-cost subgraph that connects them — discovering bridging nodes and relationships that pure similarity search misses completely. Benchmarks confirm it finds answers where reranking alone fails.
The 90x query optimization. Graphiti's default FalkorDB queries used naive pattern matching for edge traversal. The OptimizedFalkorSearchInterface replaces this with startNode()/endNode() calls. 90x speedup. That's not a typo.
Reranking layer. ZeroEntropy's zerank-2 model sits on top as an optional reranking pass. Benchmarks told a painful story: ~636ms additional latency on top of ~1088ms for candidate retrieval, with only 22% overlap with the baseline RRF ranking in top-5 results. The system is selective about when reranking triggers — not every memory lookup needs the expensive path.
Extraction model choice. GPT-4o mini. I tried smaller models first — they were unreliable. When a model hallucinates a relationship that doesn't exist, it poisons the graph. The cost difference wasn't worth the cleanup.
Voice Pipeline
iOS Microphone (16kHz, 80ms chunks)
→ AudioCaptureService → WebSocket → Platform → Deepgram Flux (STT)
→ Brain (FastAPI) → LLM (streaming)
→ SSE stream → Platform → WebSocket → iOS (TTS)The interesting part is endpoint detection. Two-phase system:
- Eager threshold (0.6) — start processing, begin thinking about a response
- Final threshold (0.9) — confirm the turn is complete, commit to responding
- TurnResumed — if the user keeps talking after eager detection, cancel in-flight processing
This gives Haeru fast response times — it starts thinking before you're done talking — while avoiding the painful experience of being interrupted by a premature response.
Getting this right took many iterations. The commit history tells that story better than I can.
Audio is pre-warmed at app launch — the microphone session initializes in the background so there's no cold-start delay when entering voice mode.
The Through Line
Every technical decision in Haeru comes back to one question: does this help the system pay closer attention? The knowledge graph exists because similarity search can't model change. PCST exists because people's lives aren't one hop away from an answer. The voice pipeline starts thinking before you finish talking because latency kills presence.
The hardest part of building Haeru was never the infrastructure. It was resisting the temptation to make the AI sound smart instead of making it actually observe. Those are very different engineering problems.