ARIA Platform Architecture

Personal Knowledge Engine
& Knowledge Graph

How ARIA builds a comprehensive, structured understanding of your life — extracting entities and facts from conversations, photos, emails, and proactive intelligence — and uses it to serve you better over time.

Three-Layer Knowledge Architecture

The Personal Knowledge Engine organizes understanding across three layers — from raw entities and facts, to semantic summaries, to a compact knowledge map. Each layer serves a different use case and query pattern.

1. Entities & Facts (knowledge_entities + knowledge_facts)

The foundation. Every person, place, topic, event, and thing ARIA knows about is an entity. Every piece of knowledge about those entities is a fact with confidence scoring, temporal validity, evidence tracking, and vector embeddings for semantic search. Facts link to entities via many-to-many relationships with typed roles (subject, object, location, topic).

2. Summaries (knowledge_summaries)

AI-generated narrative summaries that synthesize raw facts into human-readable profiles. Four types: entity profiles (person, place, topic), domain overviews (health, career, preferences), relationship summaries (Nic & Ryan), and life chapters (Career: 2020–2026). Marked stale when new facts arrive.

3. Knowledge Map (knowledge_map, a materialized view)

A compact, pre-computed view of "what ARIA knows" — entity names ranked by fact count. Provides instant lookup without scanning the full graph. Refreshed automatically after summary generation cycles.

Entity Types

Every named concept in ARIA's understanding is an entity. Entities have a canonical name, optional aliases (phone numbers, emails, nicknames), and are linked to the unified contacts table when applicable.

👤 Person: People in your life. Linked to contacts; aliases for phone/email.

📍 Place: Cities, landmarks, restaurants. Metadata for GPS coordinates.

💬 Topic: Career, hobbies, interests. Abstract concepts and domains.

📅 Event: Birthdays, trips, milestones. Temporal bounds tracked.

📦 Thing: Objects, devices, pets. Anything physical or conceptual.

Anatomy of a Knowledge Fact

Facts are the atomic units of knowledge. Each fact is a structured statement about the world with rich metadata — confidence, evidence, temporal bounds, semantic embedding, and version history.

// Example fact record
{
  domain: "people",
  category: "relationship",
  key: "alex_thompson_occupation",
  value: "As of 2024, Alex works as a software engineer at a major tech company.",

  // Quality signals
  confidence: 0.92,          // 0.0–1.0, rises with evidence
  evidence_count: 7,         // how many sources confirmed this
  sources: ["imessage", "chat", "photo_describe"],

  // Temporal validity
  valid_from: "2024-01-15",  // when this became true
  valid_until: null,         // null = still current

  // Versioning
  superseded_by: null,       // if replaced, points to new fact

  // Semantic search
  embedding: vector(768)     // text-embedding-004 for similarity
}

Fact Domains

Eight knowledge domains organize facts by topic.

● people
● preferences
● health
● events
● places
● lifestyle
● interests
● communication

Fact-Entity Roles

Facts link to entities via typed many-to-many relationships.

subject — who/what the fact is about
object — the target of the relationship
location — where the fact applies
topic — what domain it belongs to

Fact versioning preserves history. When a fact changes (e.g., a job title update), the old fact is superseded, not deleted. A superseded_by pointer links old→new, maintaining a complete audit trail. Queries filter with WHERE superseded_by IS NULL for current facts.
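The supersession mechanics can be sketched in a few lines. This is an illustrative in-memory model, not the production API: the real system stores these as Postgres rows, and the Fact shape and function names here are assumptions.

```typescript
// Hypothetical in-memory sketch of fact supersession (field names mirror
// the example record above).
interface Fact {
  id: string;
  key: string;
  value: string;
  confidence: number;
  supersededBy: string | null; // null = current version
}

// Replace a fact: the old row is kept and pointed at its replacement.
function supersede(facts: Fact[], oldId: string, replacement: Fact): Fact[] {
  return [
    ...facts.map(f => (f.id === oldId ? { ...f, supersededBy: replacement.id } : f)),
    replacement,
  ];
}

// Equivalent of `WHERE superseded_by IS NULL`.
function currentFacts(facts: Fact[]): Fact[] {
  return facts.filter(f => f.supersededBy === null);
}
```

Note that history is append-only: supersede never shrinks the array, so the audit trail survives every update.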

Entity Resolution

When new information arrives, ARIA must determine whether it refers to an existing entity or a new one. The resolution system uses a confidence-ordered cascade of matching signals — from definitive phone number matches to cautious fuzzy name comparisons.

Signal | Threshold | Action | Example
Phone number | definitive | Auto-merge, upgrade name | +1 (617) 555-1234 → 6175551234
Contact ID | definitive | Auto-merge (from iOS) | Linked via unified contacts table
Exact name | definitive | Auto-merge (case-insensitive) | "alex thompson" = "Alex Thompson"
Alias match | definitive | Auto-merge (nicknames, emails) | "AT" in aliases → Alex Thompson
Fuzzy name (high) | ≥ 0.85 | Auto-merge (bigram similarity) | "Jon Smith" ≈ "John Smith"
Fuzzy name (low) | 0.55–0.85 | Create new + flag for review | Near-match → merge queue
No match | — | Create new entity | Brand new person/place/thing
Family-aware disambiguation. Same last name + different first name scores only 0.45 — well below the auto-merge threshold. This prevents merging family members (e.g., "Alex Thompson" and "Sarah Thompson") who share a surname but are distinct people.
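The cascade and the family-aware guard can be sketched as follows. This is a toy version under stated assumptions: bigram (Dice) similarity stands in for the production fuzzy matcher, the contact-ID signal is omitted, and the guard treats any differing first token as a different first name, so nickname pairs like "Jon"/"John" would need extra handling that is not shown.

```typescript
// Illustrative resolution cascade (shapes and names are not the real API).
interface Candidate { name: string; phone?: string; aliases: string[] }

// Normalize a phone to bare digits, dropping a leading US country code.
const normPhone = (p?: string) =>
  p?.replace(/\D/g, "").replace(/^1(?=\d{10}$)/, "");

// Character-bigram Dice similarity, a common fuzzy-name measure.
function bigrams(s: string): Set<string> {
  const t = s.toLowerCase().trim();
  const out = new Set<string>();
  for (let i = 0; i < t.length - 1; i++) out.add(t.slice(i, i + 2));
  return out;
}
function similarity(a: string, b: string): number {
  const A = bigrams(a), B = bigrams(b);
  if (A.size === 0 || B.size === 0) return 0;
  let shared = 0;
  for (const g of A) if (B.has(g)) shared++;
  return (2 * shared) / (A.size + B.size);
}

type Resolution = "auto_merge" | "review_queue" | "create_new";

function resolve(incoming: Candidate, existing: Candidate): Resolution {
  // Definitive signals first.
  if (normPhone(incoming.phone) && normPhone(incoming.phone) === normPhone(existing.phone))
    return "auto_merge";
  const inName = incoming.name.toLowerCase();
  const exName = existing.name.toLowerCase();
  if (inName === exName) return "auto_merge";
  if (existing.aliases.some(a => a.toLowerCase() === inName)) return "auto_merge";

  // Fuzzy tier with the family-aware cap.
  let score = similarity(inName, exName);
  const it = inName.split(" "), et = exName.split(" ");
  if (it.length > 1 && et.length > 1 &&
      it[it.length - 1] === et[et.length - 1] && it[0] !== et[0]) {
    score = Math.min(score, 0.45); // same surname, different first name
  }
  if (score >= 0.85) return "auto_merge";
  if (score >= 0.55) return "review_queue";
  return "create_new";
}
```

Without the cap, "Sarah Thompson" vs. "Alex Thompson" scores in the review range on raw bigram similarity; the guard pushes it below 0.55 so a new entity is created instead.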

Knowledge Ingestion Pipeline

Knowledge flows into the graph from six primary sources. Each source has a specialized handler that extracts structured facts, resolves entities, and generates embeddings.

1. Extract (Signal Source): raw data from iMessage, photos, email, voice, chat, or imports.
2. Analyze (LLM Extraction): Claude identifies entities, facts, relationships, and temporal context.
3. Resolve (Entity Resolution): match to existing entities or create new ones; flag near-matches.
4. Store (Upsert + Embed): deduplicate, version, link, and generate vector embeddings.

💬 iMessage Analysis: Processes conversation history in watermarked batches. Extracts people, preferences, life events, and recurring patterns. Identity-aware: distinguishes your facts from others'.

📷 Photo Analysis: Clusters photos by date and GPS proximity (~50 km). Extracts places visited, trips, social events, activities, food preferences, and relationship patterns.

📧 Email Import: Parses Gmail Takeout .mbox archives. Filters spam, trash, and promotions. Groups by thread. Ingests as conversation chunks with entity resolution.

🎤 Voice Transcription: Transcribes audio files via OpenAI Whisper or AWS Transcribe. Extracts facts from transcripts with full entity resolution.

🗨 Chat Conversations: Ongoing conversations with ARIA are themselves a signal source. Memory updates extracted from Claude's responses feed into core memory and eventually the knowledge graph.

PIE Patterns (passive, auto-promoted): The Proactive Intelligence Engine detects behavioral patterns. High-confidence patterns are promoted to core memory, then backfilled into the knowledge graph.
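The photo handler's GPS-proximity step can be illustrated with a haversine distance check against the ~50 km radius mentioned above. The greedy single-pass clustering and the Photo shape are assumptions for the sketch; the date-grouping half of the real handler is omitted.

```typescript
// Illustrative GPS-proximity clustering (not the production algorithm).
interface Photo { lat: number; lon: number }

const EARTH_RADIUS_KM = 6371;

// Great-circle distance between two photos, in kilometers.
function haversineKm(a: Photo, b: Photo): number {
  const rad = (d: number) => (d * Math.PI) / 180;
  const dLat = rad(b.lat - a.lat);
  const dLon = rad(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(rad(a.lat)) * Math.cos(rad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Greedy single pass: each photo joins the first cluster whose anchor
// photo is within the radius, otherwise it starts a new cluster.
function clusterByProximity(photos: Photo[], radiusKm = 50): Photo[][] {
  const clusters: Photo[][] = [];
  for (const p of photos) {
    const hit = clusters.find(c => haversineKm(c[0], p) <= radiusKm);
    if (hit) hit.push(p);
    else clusters.push([p]);
  }
  return clusters;
}
```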

Multi-Tier Embedding Strategy

Not all knowledge requires the same embedding model. The system uses a content-aware router that selects the optimal model based on source type, balancing cost, dimensionality, and search quality.

Content Class | Model | Provider | Dimensions | Used For
Facts | text-embedding-004 | Google | 768 | Knowledge facts (compact, cost-effective)
Conversations | text-embedding-3-large | OpenAI | 1536 | iMessage and email conversation chunks
Email | text-embedding-3-large | OpenAI | 1536 | Email thread ingestion
Transcripts | text-embedding-3-large | OpenAI | 1536 | Voice memo transcription
Journal | text-embedding-3-large | OpenAI | 1536 | ARIA's reflective journal entries
Embeddings are generated asynchronously after transaction commit. Fact ingestion never blocks on embedding generation — if the embedding API is unavailable, the fact is still saved. Vector search uses HNSW indexes with cosine similarity for sub-millisecond queries.
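A minimal sketch of the content-aware router, using the model names, providers, and dimensions from the table above; the ContentClass labels and function name are illustrative, not the real API.

```typescript
// Content classes from the routing table.
type ContentClass = "facts" | "conversations" | "email" | "transcripts" | "journal";

interface EmbeddingRoute { model: string; provider: string; dimensions: number }

function routeEmbedding(cls: ContentClass): EmbeddingRoute {
  if (cls === "facts") {
    // Compact, cost-effective vectors for high-volume fact storage.
    return { model: "text-embedding-004", provider: "google", dimensions: 768 };
  }
  // Conversation-like content gets larger vectors for semantic nuance.
  return { model: "text-embedding-3-large", provider: "openai", dimensions: 1536 };
}
```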

Fact Ingestion Pipeline

When a new fact arrives — from any source — it passes through a six-step pipeline that handles deduplication, confidence updating, entity linking, and embedding generation.

1. Resolve Entities: match hints to canonical IDs; create new entities if no match.
2. Deduplicate: same value bumps evidence; a different value supersedes if confidence is higher.
3. Link Entities: create fact↔entity edges with typed roles.
4. Update Stats: refresh entity fact_count and last_seen_at; merge sources arrays.
5. Mark Stale: affected summaries are marked stale, triggering re-generation.
6. Generate Embedding: async, fire-and-forget, non-blocking.

Deduplication Rules

Same value → Increment evidence_count, merge sources, bump confidence
Different value, higher confidence → Supersede old fact, preserve history
Different value, lower confidence → Just bump evidence (don't downgrade)
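The three rules reduce to a small decision function. The record shapes and action names here are illustrative:

```typescript
// Stored and incoming fact shapes (illustrative).
interface StoredFact { value: string; confidence: number; evidenceCount: number }
interface NewFact { value: string; confidence: number }

type DedupAction =
  | "increment_evidence" // same value: bump evidence + confidence, merge sources
  | "supersede"          // different value, higher confidence: version the old fact
  | "evidence_only";     // different value, lower confidence: don't downgrade

function dedupeAction(existing: StoredFact, incoming: NewFact): DedupAction {
  if (incoming.value === existing.value) return "increment_evidence";
  return incoming.confidence > existing.confidence ? "supersede" : "evidence_only";
}
```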

Unique Constraint

Only one active fact per (domain, category, key) combination. Enforced by a unique index filtered on superseded_by IS NULL. Old versions are preserved for audit.

Knowledge Summaries

Every 6 hours, the knowledge-summarize handler generates AI-written narrative summaries from raw facts. Four summary types serve different query patterns.

Entity Profiles

100–300 word profiles for people, places, and topics. Summarizes all linked facts grouped by domain with confidence signals.

// Stored in knowledge_entities.summary
"Alex Thompson is a software engineer at a major tech company (as of 2024).
He is Nic's close friend since college, sharing interests in hiking, coffee,
and board games. They typically meet regularly..."

Domain Overviews

150–300 word snapshots of entire knowledge domains. Generated when a domain has 10+ facts.

summary_type: "domain_overview"
title: "Health & Wellness Overview"
// Covers top facts by confidence

Relationship Summaries

200–400 word relationship profiles for top people by fact count. Includes conversation chunk statistics.

summary_type: "relationship"
entity_ids: ["uuid-of-person"]
// Links facts + conversation history

Life Chapters

Temporal narrative summaries spanning significant periods — career transitions, relocations, relationship milestones.

summary_type: "life_chapter"
time_range: "2020-01 → 2026-03"
// Multi-year narrative arc

Automatic staleness tracking. When new facts arrive for entities linked to a summary, that summary is marked stale = true. The next summarize cycle regenerates stale summaries first, ensuring knowledge stays current without redundant work.
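A minimal in-memory model of staleness tracking and the stale-first regeneration order, assuming a hypothetical Summary shape:

```typescript
// Illustrative summary shape; the real rows live in knowledge_summaries.
interface Summary { id: string; entityIds: string[]; stale: boolean }

// A summary goes stale when a new fact touches any entity it covers.
function markStale(summaries: Summary[], touchedEntityIds: string[]): Summary[] {
  const touched = new Set(touchedEntityIds);
  return summaries.map(s =>
    s.entityIds.some(id => touched.has(id)) ? { ...s, stale: true } : s
  );
}

// Regeneration ordering: stale summaries first, fresh ones after.
function regenerationQueue(summaries: Summary[]): Summary[] {
  return [...summaries].sort((a, b) => Number(b.stale) - Number(a.stale));
}
```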

Three Generations of Memory

ARIA's knowledge system has evolved through three generations. Each builds on the last — the Knowledge Graph doesn't replace older systems, it unifies them.

Gen 1: Core Memory

Freeform (category, key, value) tuples. No confidence. No entities. Injected into every Claude request as system context. Still active and read by PIE Gate 1.

Table: core_memory

Gen 2: Owner Profile

Structured facts with confidence, evidence, sources, and temporal validity. Domain/category/key organization. Written by iMessage and photo analysis handlers.

Table: owner_profile

Gen 3: Knowledge Graph

Full entity-fact graph with semantic embeddings, version history, many-to-many entity links, AI summaries, and merge queue. Unifies all prior generations.

Tables: knowledge_*
core_memory (freeform facts)
  → knowledge-backfill (extracts entities, deduplicates)
  → knowledge_facts + knowledge_entities

owner_profile (structured facts)
  → knowledge-backfill (links to contacts, resolves names)
  → knowledge_facts + knowledge_entities

PIE ↔ Knowledge Graph Integration

The Proactive Intelligence Engine and the Knowledge Graph form a bidirectional feedback loop. PIE monitors for changes and detects patterns; the Knowledge Graph stores and organizes what's learned. Each makes the other more effective over time.

PIE → Knowledge Graph (promotion path):
Gate 3: Anticipation Analyze detects behavioral patterns
  → anticipation_patterns (confidence grows with observations)
  → at confidence ≥ 0.85 and observations ≥ 5, Pattern Maintenance (daily) promotes to core_memory
  → knowledge-backfill migrates to knowledge_facts + entities

Knowledge Graph → PIE (monitoring path):
core_memory updates (from KG promotion, chat, etc.)
  → Gate 1: Context Accumulate detects memory changes via watermark
  → Gate 2: Significance Check classifies memory change importance
  → Gate 3: Anticipation Analyze generates insights from memory evolution
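The promotion gate itself is a simple predicate over the thresholds stated above (confidence ≥ 0.85 and observations ≥ 5); the pattern shape is illustrative.

```typescript
// Illustrative shape for a row in anticipation_patterns.
interface AnticipationPattern { confidence: number; observations: number }

// Daily Pattern Maintenance promotes only patterns past both thresholds.
function eligibleForPromotion(p: AnticipationPattern): boolean {
  return p.confidence >= 0.85 && p.observations >= 5;
}
```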

User asks question → Chat endpoint → knowledge_search (vector similarity + filters) → Facts + Summaries injected as context → ARIA responds, informed by knowledge

PIE → Knowledge Graph

● Gate 3 detects behavioral patterns
● Patterns accumulate confidence over days
● Daily maintenance promotes high-confidence patterns
● Promoted patterns write to core_memory
● Backfill migrates to knowledge_facts
● Knowledge-summarize includes in entity profiles

Knowledge Graph → PIE

● Gate 1 monitors core_memory for changes
● New or updated facts trigger significance check
● Memory evolution itself is a PIE signal
● Gate 3 can reference entity summaries for context
● Pattern maintenance reads user feedback from insights
● Self-reinforcing: better knowledge → better patterns

Knowledge Graph Tools

ARIA can query the knowledge graph during conversations using three specialized tools. These enable semantic search, entity lookup, and graph exploration.

🔍 knowledge_search: Natural language semantic search across facts and conversation chunks. Filters by entity type and domain. Returns ranked results by vector similarity.

knowledge_search({
  query: "hiking trips in 2024",
  entity_type: "place",
  include_conversations: true
})
👤 knowledge_entity_lookup: Detailed profile for a specific entity. Includes all linked facts, summaries, and conversation statistics.

knowledge_entity_lookup({
  name: "Alex Thompson",
  include_facts: true
})
📋 knowledge_entity_list: Browse all entities with filtering and sorting. Useful for "who does ARIA know about" or "what places are tracked."

knowledge_entity_list({
  entity_type: "person",
  sort_by: "fact_count",
  min_facts: 5
})

Full System Architecture

Sources: iMessage · Photos · Email · Voice · Chat · Imports
  → LLM Extraction + Entity Resolution
  → knowledge_entities (People, Places, Topics, Events, Things)
  → knowledge_facts (Confidence, Evidence, Embeddings)
  → knowledge_summaries (AI-generated profiles)

core_memory + owner_profile → backfill migrates to the knowledge graph
PIE Pipeline ↔ core_memory (patterns promoted up; memory monitored down)
Chat Tools + API Routes → semantic search + entity lookup

Key Design Decisions

Family-Aware Resolution

Same last name + different first name = 0.45 score (below auto-merge). Prevents accidentally merging family members who share a surname.

Temporal Context in Values

All extracted facts require "As of DATE..." format. A fact without temporal context becomes stale without anyone knowing.

Two-Tier Embeddings

Facts use compact 768-dim vectors (cheaper). Conversations use 1536-dim for higher semantic precision where nuance matters most.

Watermarked Batch Processing

All analysis handlers use resumable watermarks. They self-chain through history — process a batch, save progress, enqueue the next batch.
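The self-chaining watermark loop can be sketched as follows. The Watermark shape and batch function are stand-ins: in production the handler re-enqueues itself with the new watermark instead of looping in process.

```typescript
// Illustrative resumable-watermark batch processing.
interface Watermark { lastProcessedId: number }
interface BatchResult { watermark: Watermark; done: boolean }

// Process the next batch of items past the watermark; assumes item IDs
// are sorted ascending.
function processBatch(items: number[], wm: Watermark, batchSize = 3): BatchResult {
  const batch = items.filter(id => id > wm.lastProcessedId).slice(0, batchSize);
  if (batch.length === 0) return { watermark: wm, done: true };
  // ...extract facts from `batch` here...
  const next = { lastProcessedId: batch[batch.length - 1] };
  // In production: save `next`, then enqueue the next batch job.
  return { watermark: next, done: false };
}
```

Because progress is saved after every batch, a crashed or interrupted run resumes from the last watermark instead of reprocessing history.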

Identity Disambiguation

LLM extraction prompts distinguish whose fact it is. "Mom had surgery" → fact about Nic's mother, not about Nic.

Version Preservation

Facts are superseded, never deleted. Old versions remain for audit trail. Current state filtered with superseded_by IS NULL.