ARIA Platform Architecture

Personal Knowledge Engine
& Knowledge Graph

How ARIA builds a comprehensive, structured understanding of your life — extracting entities and facts from conversations, photos, emails, and proactive intelligence — and uses it to serve you better over time.

Three-Layer Knowledge Architecture

The Personal Knowledge Engine organizes understanding across three layers — from raw entities and facts, to semantic summaries, to a compact knowledge map. Each layer serves a different use case and query pattern.

1. Entities & Facts (knowledge_entities + knowledge_facts)

The foundation. Every person, place, topic, event, and thing ARIA knows about is an entity. Every piece of knowledge about those entities is a fact with confidence scoring, temporal validity, evidence tracking, and vector embeddings for semantic search. Facts link to entities via many-to-many relationships with typed roles (subject, object, location, topic).

2. Summaries (knowledge_summaries)

AI-generated narrative summaries that synthesize raw facts into human-readable profiles. Four types: entity profiles (person, place, topic), domain overviews (health, career, preferences), relationship summaries (Nic & Ryan), and life chapters (Career: 2020–2026). Marked stale when new facts arrive.

3. Knowledge Map (knowledge_map, a materialized view)

A compact, pre-computed view of "what ARIA knows" — entity names ranked by fact count. Provides instant lookup without scanning the full graph. Refreshed automatically after summary generation cycles.

Entity Types

Every named concept in ARIA's understanding is an entity. Entities have a canonical name, optional aliases (phone numbers, emails, nicknames), and are linked to the unified contacts table when applicable.

👤 Person: People in your life. Linked to contacts; aliases for phone/email.

📍 Place: Cities, landmarks, restaurants. Metadata for GPS coordinates.

💬 Topic: Career, hobbies, interests. Abstract concepts and domains.

📅 Event: Birthdays, trips, milestones. Temporal bounds tracked.

📦 Thing: Objects, devices, pets. Anything physical or conceptual.

Anatomy of a Knowledge Fact

Facts are the atomic units of knowledge. Each fact is a structured statement about the world with rich metadata — confidence, evidence, temporal bounds, semantic embedding, and version history.

// Example fact record
{
  domain: "people",
  category: "relationship",
  key: "alex_thompson_occupation",
  value: "As of 2024, Alex works as a software engineer at a major tech company.",

  // Quality signals
  confidence: 0.92,          // 0.0–1.0, rises with evidence
  evidence_count: 7,         // how many sources confirmed this
  sources: ["imessage", "chat", "photo_describe"],

  // Temporal validity
  valid_from: "2024-01-15",  // when this became true
  valid_until: null,         // null = still current

  // Versioning
  superseded_by: null,       // if replaced, points to new fact

  // Semantic search
  embedding: vector(768)     // text-embedding-004 for similarity
}

Fact Domains

Eight knowledge domains organize facts by topic.

● people
● preferences
● health
● events
● places
● lifestyle
● interests
● communication

Fact-Entity Roles

Facts link to entities via typed many-to-many relationships.

subject — who/what the fact is about
object — the target of the relationship
location — where the fact applies
topic — what domain it belongs to

Fact versioning preserves history. When a fact changes (e.g., a job title update), the old fact is superseded, not deleted. A superseded_by pointer links old→new, maintaining a complete audit trail. Queries filter with WHERE superseded_by IS NULL for current facts.
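The supersession mechanics can be sketched in a few lines. This is an illustrative in-memory model, not the production API: the real system stores these as Postgres rows, and the Fact shape and function names here are assumptions.

```typescript
// Hypothetical in-memory sketch of fact supersession (field names mirror
// the example record above).
interface Fact {
  id: string;
  key: string;
  value: string;
  confidence: number;
  supersededBy: string | null; // null = current version
}

// Replace a fact: the old row is kept and pointed at its replacement.
function supersede(facts: Fact[], oldId: string, replacement: Fact): Fact[] {
  return [
    ...facts.map(f => (f.id === oldId ? { ...f, supersededBy: replacement.id } : f)),
    replacement,
  ];
}

// Equivalent of `WHERE superseded_by IS NULL`.
function currentFacts(facts: Fact[]): Fact[] {
  return facts.filter(f => f.supersededBy === null);
}
```

Note that history is append-only: supersede never shrinks the array, so the audit trail survives every update.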

Entity Resolution

When new information arrives, ARIA must determine whether it refers to an existing entity or a new one. The resolution system uses a confidence-ordered cascade of matching signals — from definitive phone number matches to cautious fuzzy name comparisons.

Signal | Threshold | Action | Example
Phone number | definitive | Auto-merge, upgrade name | +1 (617) 555-1234 → 6175551234
Contact ID | definitive | Auto-merge (from iOS) | Linked via unified contacts table
Exact name | definitive | Auto-merge (case-insensitive) | "alex thompson" = "Alex Thompson"
Alias match | definitive | Auto-merge (nicknames, emails) | "AT" in aliases → Alex Thompson
Fuzzy name (high) | ≥ 0.85 | Auto-merge (bigram similarity) | "Jon Smith" ≈ "John Smith"
Fuzzy name (low) | 0.55–0.85 | Create new + flag for review | Near-match → merge queue
No match | — | Create new entity | Brand new person/place/thing
Family-aware disambiguation. Same last name + different first name scores only 0.45 — well below the auto-merge threshold. This prevents merging family members (e.g., "Alex Thompson" and "Sarah Thompson") who share a surname but are distinct people.
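The cascade and the family-aware guard can be sketched as follows. This is a toy version under stated assumptions: bigram (Dice) similarity stands in for the production fuzzy matcher, the contact-ID signal is omitted, and the guard treats any differing first token as a different first name, so nickname pairs like "Jon"/"John" would need extra handling that is not shown.

```typescript
// Illustrative resolution cascade (shapes and names are not the real API).
interface Candidate { name: string; phone?: string; aliases: string[] }

// Normalize a phone to bare digits, dropping a leading US country code.
const normPhone = (p?: string) =>
  p?.replace(/\D/g, "").replace(/^1(?=\d{10}$)/, "");

// Character-bigram Dice similarity, a common fuzzy-name measure.
function bigrams(s: string): Set<string> {
  const t = s.toLowerCase().trim();
  const out = new Set<string>();
  for (let i = 0; i < t.length - 1; i++) out.add(t.slice(i, i + 2));
  return out;
}
function similarity(a: string, b: string): number {
  const A = bigrams(a), B = bigrams(b);
  if (A.size === 0 || B.size === 0) return 0;
  let shared = 0;
  for (const g of A) if (B.has(g)) shared++;
  return (2 * shared) / (A.size + B.size);
}

type Resolution = "auto_merge" | "review_queue" | "create_new";

function resolve(incoming: Candidate, existing: Candidate): Resolution {
  // Definitive signals first.
  if (normPhone(incoming.phone) && normPhone(incoming.phone) === normPhone(existing.phone))
    return "auto_merge";
  const inName = incoming.name.toLowerCase();
  const exName = existing.name.toLowerCase();
  if (inName === exName) return "auto_merge";
  if (existing.aliases.some(a => a.toLowerCase() === inName)) return "auto_merge";

  // Fuzzy tier with the family-aware cap.
  let score = similarity(inName, exName);
  const it = inName.split(" "), et = exName.split(" ");
  if (it.length > 1 && et.length > 1 &&
      it[it.length - 1] === et[et.length - 1] && it[0] !== et[0]) {
    score = Math.min(score, 0.45); // same surname, different first name
  }
  if (score >= 0.85) return "auto_merge";
  if (score >= 0.55) return "review_queue";
  return "create_new";
}
```

Without the cap, "Sarah Thompson" vs. "Alex Thompson" scores in the review range on raw bigram similarity; the guard pushes it below 0.55 so a new entity is created instead.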

Knowledge Ingestion Pipeline

Knowledge flows into the graph from six primary sources. Each source has a specialized handler that extracts structured facts, resolves entities, and generates embeddings.

1. Extract (Signal Source): raw data from iMessage, photos, email, voice, chat, or imports.
2. Analyze (LLM Extraction): Claude identifies entities, facts, relationships, and temporal context.
3. Resolve (Entity Resolution): match to existing entities or create new ones; flag near-matches.
4. Store (Upsert + Embed): deduplicate, version, link, and generate vector embeddings.

💬 iMessage Analysis: Processes conversation history in watermarked batches. Extracts people, preferences, life events, and recurring patterns. Identity-aware: distinguishes your facts from others'.

📷 Photo Analysis: Clusters photos by date and GPS proximity (~50 km). Extracts places visited, trips, social events, activities, food preferences, and relationship patterns.

📧 Email Import: Parses Gmail Takeout .mbox archives. Filters spam, trash, and promotions. Groups by thread. Ingests as conversation chunks with entity resolution.

🎤 Voice Transcription: Transcribes audio files via OpenAI Whisper or AWS Transcribe. Extracts facts from transcripts with full entity resolution.

🗨 Chat Conversations: Ongoing conversations with ARIA are themselves a signal source. Memory updates extracted from Claude's responses feed into core memory and eventually the knowledge graph.

PIE Patterns (passive, auto-promoted): The Proactive Intelligence Engine detects behavioral patterns. High-confidence patterns are promoted to core memory, then backfilled into the knowledge graph.
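The photo handler's GPS-proximity step can be illustrated with a haversine distance check against the ~50 km radius mentioned above. The greedy single-pass clustering and the Photo shape are assumptions for the sketch; the date-grouping half of the real handler is omitted.

```typescript
// Illustrative GPS-proximity clustering (not the production algorithm).
interface Photo { lat: number; lon: number }

const EARTH_RADIUS_KM = 6371;

// Great-circle distance between two photos, in kilometers.
function haversineKm(a: Photo, b: Photo): number {
  const rad = (d: number) => (d * Math.PI) / 180;
  const dLat = rad(b.lat - a.lat);
  const dLon = rad(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(rad(a.lat)) * Math.cos(rad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Greedy single pass: each photo joins the first cluster whose anchor
// photo is within the radius, otherwise it starts a new cluster.
function clusterByProximity(photos: Photo[], radiusKm = 50): Photo[][] {
  const clusters: Photo[][] = [];
  for (const p of photos) {
    const hit = clusters.find(c => haversineKm(c[0], p) <= radiusKm);
    if (hit) hit.push(p);
    else clusters.push([p]);
  }
  return clusters;
}
```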

Multi-Tier Embedding Strategy

Not all knowledge requires the same embedding model. The system uses a content-aware router that selects the optimal model based on source type, balancing cost, dimensionality, and search quality.

Content Class | Model | Provider | Dimensions | Used For
Facts | text-embedding-004 | Google | 768 | Knowledge facts (compact, cost-effective)
Conversations | text-embedding-3-large | OpenAI | 1536 | iMessage and email conversation chunks
Email | text-embedding-3-large | OpenAI | 1536 | Email thread ingestion
Transcripts | text-embedding-3-large | OpenAI | 1536 | Voice memo transcription
Journal | text-embedding-3-large | OpenAI | 1536 | ARIA's reflective journal entries
Embeddings are generated asynchronously after transaction commit. Fact ingestion never blocks on embedding generation — if the embedding API is unavailable, the fact is still saved. Vector search uses HNSW indexes with cosine similarity for sub-millisecond queries.
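A minimal sketch of the content-aware router, using the model names, providers, and dimensions from the table above; the ContentClass labels and function name are illustrative, not the real API.

```typescript
// Content classes from the routing table.
type ContentClass = "facts" | "conversations" | "email" | "transcripts" | "journal";

interface EmbeddingRoute { model: string; provider: string; dimensions: number }

function routeEmbedding(cls: ContentClass): EmbeddingRoute {
  if (cls === "facts") {
    // Compact, cost-effective vectors for high-volume fact storage.
    return { model: "text-embedding-004", provider: "google", dimensions: 768 };
  }
  // Conversation-like content gets larger vectors for semantic nuance.
  return { model: "text-embedding-3-large", provider: "openai", dimensions: 1536 };
}
```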

Fact Ingestion Pipeline

When a new fact arrives — from any source — it passes through a six-step pipeline that handles deduplication, confidence updating, entity linking, and embedding generation.

1. Resolve Entities: match hints to canonical IDs; create new entities if no match.
2. Deduplicate: same value bumps evidence; a different value supersedes if confidence is higher.
3. Link Entities: create fact↔entity edges with typed roles.
4. Update Stats: refresh entity fact_count and last_seen_at; merge sources arrays.
5. Mark Stale: affected summaries are marked stale, triggering re-generation.
6. Generate Embedding: async, fire-and-forget, non-blocking.

Deduplication Rules

Same value → Increment evidence_count, merge sources, bump confidence
Different value, higher confidence → Supersede old fact, preserve history
Different value, lower confidence → Just bump evidence (don't downgrade)
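The three rules reduce to a small decision function. The record shapes and action names here are illustrative:

```typescript
// Stored and incoming fact shapes (illustrative).
interface StoredFact { value: string; confidence: number; evidenceCount: number }
interface NewFact { value: string; confidence: number }

type DedupAction =
  | "increment_evidence" // same value: bump evidence + confidence, merge sources
  | "supersede"          // different value, higher confidence: version the old fact
  | "evidence_only";     // different value, lower confidence: don't downgrade

function dedupeAction(existing: StoredFact, incoming: NewFact): DedupAction {
  if (incoming.value === existing.value) return "increment_evidence";
  return incoming.confidence > existing.confidence ? "supersede" : "evidence_only";
}
```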

Unique Constraint

Only one active fact per (domain, category, key) combination. Enforced by a unique index filtered on superseded_by IS NULL. Old versions are preserved for audit.

Knowledge Summaries

Every 6 hours, the knowledge-summarize handler generates AI-written narrative summaries from raw facts. Four summary types serve different query patterns.

Entity Profiles

100–300 word profiles for people, places, and topics. Summarizes all linked facts grouped by domain with confidence signals.

// Stored in knowledge_entities.summary
"Alex Thompson is a software engineer at a major tech company (as of 2024).
He is Nic's close friend since college, sharing interests in hiking, coffee,
and board games. They typically meet regularly..."

Domain Overviews

150–300 word snapshots of entire knowledge domains. Generated when a domain has 10+ facts.

summary_type: "domain_overview"
title: "Health & Wellness Overview"
// Covers top facts by confidence

Relationship Summaries

200–400 word relationship profiles for top people by fact count. Includes conversation chunk statistics.

summary_type: "relationship"
entity_ids: ["uuid-of-person"]
// Links facts + conversation history

Life Chapters

Temporal narrative summaries spanning significant periods — career transitions, relocations, relationship milestones.

summary_type: "life_chapter"
time_range: "2020-01 → 2026-03"
// Multi-year narrative arc

Automatic staleness tracking. When new facts arrive for entities linked to a summary, that summary is marked stale = true. The next summarize cycle regenerates stale summaries first, ensuring knowledge stays current without redundant work.
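A minimal in-memory model of staleness tracking and the stale-first regeneration order, assuming a hypothetical Summary shape:

```typescript
// Illustrative summary shape; the real rows live in knowledge_summaries.
interface Summary { id: string; entityIds: string[]; stale: boolean }

// A summary goes stale when a new fact touches any entity it covers.
function markStale(summaries: Summary[], touchedEntityIds: string[]): Summary[] {
  const touched = new Set(touchedEntityIds);
  return summaries.map(s =>
    s.entityIds.some(id => touched.has(id)) ? { ...s, stale: true } : s
  );
}

// Regeneration ordering: stale summaries first, fresh ones after.
function regenerationQueue(summaries: Summary[]): Summary[] {
  return [...summaries].sort((a, b) => Number(b.stale) - Number(a.stale));
}
```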

Three Generations of Memory

ARIA's knowledge system has evolved through three generations. Each builds on the last — the Knowledge Graph doesn't replace older systems, it unifies them.

Gen 1: Core Memory

Freeform (category, key, value) tuples. No confidence. No entities. Injected into every Claude request as system context. Still active and read by PIE Gate 1.

Table: core_memory

Gen 2: Owner Profile

Structured facts with confidence, evidence, sources, and temporal validity. Domain/category/key organization. Written by iMessage and photo analysis handlers.

Table: owner_profile

Gen 3: Knowledge Graph

Full entity-fact graph with semantic embeddings, version history, many-to-many entity links, AI summaries, and merge queue. Unifies all prior generations.

Tables: knowledge_*
core_memory (freeform facts)
  → knowledge-backfill (extracts entities, deduplicates)
  → knowledge_facts + knowledge_entities

owner_profile (structured facts)
  → knowledge-backfill (links to contacts, resolves names)
  → knowledge_facts + knowledge_entities

PIE ↔ Knowledge Graph Integration

The Proactive Intelligence Engine and the Knowledge Graph form a bidirectional feedback loop. PIE monitors for changes and detects patterns; the Knowledge Graph stores and organizes what's learned. Each makes the other more effective over time.

PIE → Knowledge Graph (promotion path):
Gate 3: Anticipation Analyze detects behavioral patterns
  → anticipation_patterns (confidence grows with observations)
  → at confidence ≥ 0.85 and observations ≥ 5, Pattern Maintenance (daily) promotes to core_memory
  → knowledge-backfill migrates to knowledge_facts + entities

Knowledge Graph → PIE (monitoring path):
core_memory updates (from KG promotion, chat, etc.)
  → Gate 1: Context Accumulate detects memory changes via watermark
  → Gate 2: Significance Check classifies memory change importance
  → Gate 3: Anticipation Analyze generates insights from memory evolution
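The promotion gate itself is a simple predicate over the thresholds stated above (confidence ≥ 0.85 and observations ≥ 5); the pattern shape is illustrative.

```typescript
// Illustrative shape for a row in anticipation_patterns.
interface AnticipationPattern { confidence: number; observations: number }

// Daily Pattern Maintenance promotes only patterns past both thresholds.
function eligibleForPromotion(p: AnticipationPattern): boolean {
  return p.confidence >= 0.85 && p.observations >= 5;
}
```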

User asks question → Chat endpoint → knowledge_search (vector similarity + filters) → Facts + Summaries injected as context → ARIA responds, informed by knowledge

PIE → Knowledge Graph

● Gate 3 detects behavioral patterns
● Patterns accumulate confidence over days
● Daily maintenance promotes high-confidence patterns
● Promoted patterns write to core_memory
● Backfill migrates to knowledge_facts
● Knowledge-summarize includes in entity profiles

Knowledge Graph → PIE

● Gate 1 monitors core_memory for changes
● New or updated facts trigger significance check
● Memory evolution itself is a PIE signal
● Gate 3 can reference entity summaries for context
● Pattern maintenance reads user feedback from insights
● Self-reinforcing: better knowledge → better patterns

Knowledge Graph Tools

ARIA can query the knowledge graph during conversations using three specialized tools. These enable semantic search, entity lookup, and graph exploration.

🔍 knowledge_search: Natural language semantic search across facts and conversation chunks. Filters by entity type and domain. Returns ranked results by vector similarity.

knowledge_search({
  query: "hiking trips in 2024",
  entity_type: "place",
  include_conversations: true
})
👤 knowledge_entity_lookup: Detailed profile for a specific entity. Includes all linked facts, summaries, and conversation statistics.

knowledge_entity_lookup({
  name: "Alex Thompson",
  include_facts: true
})
📋 knowledge_entity_list: Browse all entities with filtering and sorting. Useful for "who does ARIA know about" or "what places are tracked."

knowledge_entity_list({
  entity_type: "person",
  sort_by: "fact_count",
  min_facts: 5
})

Full System Architecture

Sources: iMessage · Photos · Email · Voice · Chat · Imports
  → LLM Extraction + Entity Resolution
  → knowledge_entities (People, Places, Topics, Events, Things)
  → knowledge_facts (Confidence, Evidence, Embeddings)
  → knowledge_summaries (AI-generated profiles)

core_memory + owner_profile → backfill migrates to the knowledge graph
PIE Pipeline ↔ core_memory (patterns promoted up; memory monitored down)
Chat Tools + API Routes → semantic search + entity lookup

Key Design Decisions

Family-Aware Resolution

Same last name + different first name = 0.45 score (below auto-merge). Prevents accidentally merging family members who share a surname.

Temporal Context in Values

All extracted facts require "As of DATE..." format. A fact without temporal context becomes stale without anyone knowing.

Two-Tier Embeddings

Facts use compact 768-dim vectors (cheaper). Conversations use 1536-dim for higher semantic precision where nuance matters most.

Watermarked Batch Processing

All analysis handlers use resumable watermarks. They self-chain through history — process a batch, save progress, enqueue the next batch.
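The self-chaining watermark loop can be sketched as follows. The Watermark shape and batch function are stand-ins: in production the handler re-enqueues itself with the new watermark instead of looping in process.

```typescript
// Illustrative resumable-watermark batch processing.
interface Watermark { lastProcessedId: number }
interface BatchResult { watermark: Watermark; done: boolean }

// Process the next batch of items past the watermark; assumes item IDs
// are sorted ascending.
function processBatch(items: number[], wm: Watermark, batchSize = 3): BatchResult {
  const batch = items.filter(id => id > wm.lastProcessedId).slice(0, batchSize);
  if (batch.length === 0) return { watermark: wm, done: true };
  // ...extract facts from `batch` here...
  const next = { lastProcessedId: batch[batch.length - 1] };
  // In production: save `next`, then enqueue the next batch job.
  return { watermark: next, done: false };
}
```

Because progress is saved after every batch, a crashed or interrupted run resumes from the last watermark instead of reprocessing history.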

Identity Disambiguation

LLM extraction prompts distinguish whose fact it is. "Mom had surgery" → fact about Nic's mother, not about Nic.

Version Preservation

Facts are superseded, never deleted. Old versions remain for audit trail. Current state filtered with superseded_by IS NULL.