ContextFS Sync Service Architecture¶

Overview¶

The ContextFS sync service enables multi-device memory synchronization using vector clocks for conflict detection. This document describes the current implementation architecture.

System Components¶

┌─────────────────────────────────────────────────────────────────┐
│                         DEVICES                                  │
├─────────────────┬─────────────────┬─────────────────────────────┤
│   Laptop        │   Desktop       │   Linux Server              │
│   (macOS)       │   (Windows)     │   (Ubuntu)                  │
├─────────────────┼─────────────────┼─────────────────────────────┤
│ ┌─────────────┐ │ ┌─────────────┐ │ ┌─────────────┐             │
│ │  ContextFS  │ │ │  ContextFS  │ │ │  ContextFS  │             │
│ │    Core     │ │ │    Core     │ │ │    Core     │             │
│ └──────┬──────┘ │ └──────┬──────┘ │ └──────┬──────┘             │
│        │        │        │        │        │                     │
│ ┌──────┴──────┐ │ ┌──────┴──────┐ │ ┌──────┴──────┐             │
│ │   SQLite    │ │ │   SQLite    │ │ │   SQLite    │             │
│ │  + ChromaDB │ │ │  + ChromaDB │ │ │  + ChromaDB │             │
│ └─────────────┘ │ └─────────────┘ │ └─────────────┘             │
└────────┬────────┴────────┬────────┴────────┬────────────────────┘
         │                 │                 │
         │    Push/Pull    │    Push/Pull    │    Push/Pull
         │    (HTTP)       │    (HTTP)       │    (HTTP)
         ▼                 ▼                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                      SYNC SERVER                                 │
│                                                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    FastAPI Application                   │    │
│  │                      (Port 8766)                         │    │
│  ├─────────────────────────────────────────────────────────┤    │
│  │  /api/sync/register  │  Device registration             │    │
│  │  /api/sync/push      │  Push local changes              │    │
│  │  /api/sync/pull      │  Pull server changes             │    │
│  │  /api/sync/status    │  Get sync status                 │    │
│  └──────────────────────┬──────────────────────────────────┘    │
│                         │                                        │
│  ┌──────────────────────┴──────────────────────────────────┐    │
│  │              PostgreSQL + pgvector                       │    │
│  │  ┌─────────────┐ ┌─────────────┐ ┌─────────────────┐    │    │
│  │  │  memories   │ │  sessions   │ │  memory_edges   │    │    │
│  │  │  (synced)   │ │  (synced)   │ │    (synced)     │    │    │
│  │  └─────────────┘ └─────────────┘ └─────────────────┘    │    │
│  │  ┌─────────────┐ ┌─────────────┐                        │    │
│  │  │   devices   │ │ sync_state  │                        │    │
│  │  └─────────────┘ └─────────────┘                        │    │
│  └─────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘

Key Files¶

File	Purpose
`src/contextfs/sync/vector_clock.py`	VectorClock and DeviceTracker classes
`src/contextfs/sync/protocol.py`	Pydantic models for sync protocol
`src/contextfs/sync/client.py`	SyncClient for device-side operations
`src/contextfs/sync/cli.py`	CLI commands (`contextfs sync ...`)
`src/contextfs/sync/path_resolver.py`	Cross-machine path normalization
`service/api/sync_routes.py`	Server-side FastAPI endpoints
`service/db/models.py`	SQLAlchemy models for PostgreSQL

Vector Clock Implementation¶

Data Structure¶

class VectorClock(BaseModel):
    clock: dict[str, int]  # {device_id: counter}

Example:

{"laptop-abc123": 5, "desktop-def456": 3, "server-xyz789": 1}

Operations¶

Operation	Description
`increment(device_id)`	Increment counter for device
`merge(other)`	Take max of each component
`happens_before(other)`	Check if self causally precedes other
`concurrent_with(other)`	Check if clocks are concurrent (conflict)
`dominates(other)`	Check if self causally follows other
`prune(active_devices, max_devices)`	Limit clock size

Happens-Before Relation¶

VC1 < VC2  iff  ∀d: VC1[d] ≤ VC2[d]  AND  ∃d: VC1[d] < VC2[d]

Conflict Detection¶

Client Clock	Server Clock	Result
`{A:2, B:1}`	`{A:1, B:1}`	Accept (server behind)
`{A:1, B:1}`	`{A:2, B:1}`	Reject (client behind/stale)
`{A:2, B:1}`	`{A:1, B:2}`	Conflict (concurrent)
`{A:2, B:2}`	`{A:2, B:2}`	Accept (equal)

Sync Protocol¶

Push Flow¶

Client                                    Server
  │                                         │
  │  1. Query local memories (since last sync)
  │                                         │
  │  2. Increment vector clock for each     │
  │                                         │
  │  3. Extract embeddings from ChromaDB    │
  │                                         │
  │  POST /api/sync/push ─────────────────► │
  │  {device_id, memories[], sessions[]}    │
  │                                         │
  │                        4. For each memory:
  │                           - Compare clocks
  │                           - Accept/Reject/Conflict
  │                           - Store embedding
  │                                         │
  │ ◄───────────────────── Response ────────│
  │  {accepted, rejected, conflicts[]}      │
  │                                         │
  │  5. Update local vector clocks          │
  │                                         │

Pull Flow¶

Client                                    Server
  │                                         │
  │  POST /api/sync/pull ─────────────────► │
  │  {device_id, since_timestamp}           │
  │                                         │
  │                        1. Query memories
  │                           WHERE updated_at > since
  │                                         │
  │                        2. Include embeddings
  │                                         │
  │ ◄───────────────────── Response ────────│
  │  {memories[], sessions[], has_more}     │
  │                                         │
  │  3. Batch insert to SQLite (skip_rag)   │
  │                                         │
  │  4. Insert embeddings to ChromaDB       │
  │                                         │
  │  5. Update last_sync timestamp          │
  │                                         │

Data Models¶

SyncedMemory (Protocol)¶

class SyncedMemory(SyncableEntity):
    # Core content
    content: str
    type: str = "fact"
    tags: list[str]
    summary: str | None

    # Namespace
    namespace_id: str = "global"

    # Portable source (cross-machine)
    repo_url: str | None        # git@github.com:user/repo.git
    repo_name: str | None       # Human-readable name
    relative_path: str | None   # Path from repo root

    # Legacy fields
    source_file: str | None
    source_repo: str | None
    source_tool: str | None

    # Sync metadata (inherited)
    vector_clock: dict[str, int]
    content_hash: str | None
    deleted_at: datetime | None
    last_modified_by: str | None

    # Embedding (synced!)
    embedding: list[float] | None  # 384-dim vector

SyncedMemoryModel (PostgreSQL)¶

class SyncedMemoryModel(Base):
    __tablename__ = "memories"

    id: Mapped[str] = mapped_column(Text, primary_key=True)
    content: Mapped[str]
    type: Mapped[str]
    tags: Mapped[list[str]] = mapped_column(ARRAY(Text))

    # Sync fields
    vector_clock: Mapped[dict] = mapped_column(JSONB)
    content_hash: Mapped[str | None]
    deleted_at: Mapped[datetime | None]
    last_modified_by: Mapped[str | None]

    # Embedding (pgvector)
    embedding = mapped_column(Vector(384), nullable=True)

Embedding Synchronization¶

Problem¶

Traditional approaches require either: 1. Centralized queries - Network dependency for every search 2. Local recomputation - Expensive (10-50ms/memory on CPU)

Solution¶

Sync embeddings alongside content:

Push: ChromaDB → Extract → HTTP → PostgreSQL (pgvector)
Pull: PostgreSQL → HTTP → Insert → ChromaDB (no recompute!)

Implementation¶

Push (client.py:_get_embeddings_from_chroma)

def _get_embeddings_from_chroma(self, memory_ids: list[str]) -> dict[str, list[float]]:
    collection = self.ctx.rag._collection
    result = collection.get(ids=memory_ids, include=["embeddings"])
    return {id: list(emb) for id, emb in zip(result["ids"], result["embeddings"])}

Pull (client.py:_add_embeddings_to_chroma)

def _add_embeddings_to_chroma(self, embeddings: list[tuple]):
    collection = self.ctx.rag._collection
    collection.upsert(ids=ids, embeddings=vectors, documents=documents)

Device Management¶

Device Registration¶

class DeviceRegistration(BaseModel):
    device_id: str       # laptop-abc123 (auto-generated)
    device_name: str     # "My Laptop"
    platform: str        # darwin, linux, windows
    client_version: str  # "0.1.0"

Device ID Generation¶

def _get_or_create_device_id(self) -> str:
    hostname = socket.gethostname()
    mac = uuid.getnode()
    return f"{hostname}-{mac:012x}"[:32]

Stored in ~/.contextfs/device_id

Sync State Persistence¶

Stored in SQLite sync_state table:

Column	Type	Description
device_id	TEXT	Primary key
server_url	TEXT	Sync server URL
last_sync_at	TIMESTAMP	Last successful sync
last_push_at	TIMESTAMP	Last push
last_pull_at	TIMESTAMP	Last pull
device_tracker	TEXT (JSON)	DeviceTracker data

Path Normalization¶

Problem¶

Absolute paths differ across machines: - macOS: /Users/mlong/code/myrepo/src/file.py - Linux: /home/mlong/code/myrepo/src/file.py - Windows: C:\Users\mlong\code\myrepo\src\file.py

Solution: Portable Paths¶

Store repository URL + relative path:

class PortablePath:
    repo_url: str        # git@github.com:user/repo.git
    repo_name: str       # myrepo
    relative_path: str   # src/file.py

PathResolver¶

def normalize(self, absolute_path: str) -> PortablePath:
    """Convert absolute path to portable format."""
    repo_root = find_git_root(absolute_path)
    repo_url = get_git_remote(repo_root)
    relative = os.path.relpath(absolute_path, repo_root)
    return PortablePath(repo_url=repo_url, relative_path=relative)

def resolve(self, portable: PortablePath) -> Path | None:
    """Convert portable path to local absolute path."""
    # Look up local clone of repo_url
    local_root = find_local_clone(portable.repo_url)
    return local_root / portable.relative_path

CLI Commands¶

# Register device with sync server
contextfs sync register --server http://localhost:8766 --name "My Device"

# Push local changes
contextfs sync push --server http://localhost:8766
contextfs sync push --server http://localhost:8766 --all  # Push ALL memories

# Pull from server
contextfs sync pull --server http://localhost:8766
contextfs sync pull --server http://localhost:8766 --all  # Initial sync

# Full bidirectional sync
contextfs sync all --server http://localhost:8766

# Check status
contextfs sync status --server http://localhost:8766

# Run sync daemon
contextfs sync daemon --server http://localhost:8766 --interval 300

Deployment¶

Docker Compose¶

# docker-compose.sync.yml
services:
  sync-postgres:
    image: pgvector/pgvector:pg15
    environment:
      POSTGRES_DB: contextfs_sync
      POSTGRES_USER: contextfs
      POSTGRES_PASSWORD: contextfs
    ports:
      - "5432:5432"

  sync-server:
    build: .
    environment:
      DATABASE_URL: postgresql+asyncpg://contextfs:contextfs@sync-postgres/contextfs_sync
    ports:
      - "8766:8766"
    depends_on:
      - sync-postgres

Start Services¶

# Start infrastructure
docker-compose -f docker-compose.sync.yml up -d sync-postgres

# Run server locally (development)
python -m service.api.main

# Or run everything in Docker
docker-compose -f docker-compose.sync.yml up -d

Conflict Resolution¶

Current Strategy¶

Conflicts are returned to the client for manual resolution:

class ConflictInfo(BaseModel):
    entity_id: str
    entity_type: str  # "memory", "session", "edge"
    client_clock: dict[str, int]
    server_clock: dict[str, int]
    client_content: str | None
    server_content: str | None
    client_updated_at: datetime
    server_updated_at: datetime

Future Options¶

Last-Write-Wins (LWW) - Use timestamp to auto-resolve
LLM-Powered Merge - Use AI to intelligently merge versions
CRDT-Style - Automatic merge for compatible operations (tag unions)

Performance Characteristics¶

Operation	1K memories	10K memories
Push (with embeddings)	~350ms	~3.7s
Pull (with embeddings)	~390ms	~3.8s
Incremental sync	~50ms	~55ms

Optimizations¶

Batch Save - Single transaction for bulk inserts
Skip RAG on Pull - Embeddings inserted directly, no recompute
Pagination - Large syncs paginated (1000 items/page)
Content Hashing - Skip identical content (deduplication)

Security Considerations¶

Current State¶

Device ID-based identification (no authentication)
Plain HTTP (use HTTPS in production)

Recommended for Production¶

API key or OAuth authentication
TLS for transport security
Server-side encryption for sensitive memories
Rate limiting per device

Future Enhancements¶

CRDT Integration - Automatic conflict resolution for compatible ops
Selective Sync - Namespace-based sync policies
Federated Architecture - Multiple sync servers
End-to-End Encryption - Client-side encryption with key sharing
WebSocket Push - Real-time sync notifications