The Hybrid Philosophy
Most file management software forces a choice: fast, dumb directory listing (Explorer/Finder) or slow, heavy database ingestion (Lightroom/Photos). Spacedrive does both simultaneously by decoupling Discovery from Persistence.
The Ephemeral Layer (“File Manager” Mode)
When you open a location that hasn’t been added to your library—an external drive, network share, or local directory—Spacedrive runs only Phase 1 (Discovery) of the indexing pipeline:
- Memory-Resident: The index lives entirely in RAM
- Highly Optimized: Custom slab allocators (NodeArena) and string interning (NameCache) compress file entries down to ~50 bytes
- Massive Scale: Can index millions of files into RAM for accelerated local search
- Zero Database I/O: Bypasses SQLite entirely for maximum throughput
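The ~50-byte-per-entry figure comes from combining a small fixed-size node struct with a shared name pool. The sketch below is illustrative: the `FileNode` field layout and `NameCache` internals are assumptions, not Spacedrive's actual definitions; only the type names come from this document.

```rust
use std::collections::HashMap;

/// Hypothetical compact file node: 32-bit arena IDs in place of 64-bit
/// pointers, and a 32-bit interned-name ID in place of an owned String.
#[derive(Clone, Copy)]
struct FileNode {
    name: u32,   // index into the interned-name table
    parent: u32, // 32-bit arena ID of the parent node
    size: u64,
    mtime: i64,
}

/// Minimal string-interning pool in the spirit of NameCache: each distinct
/// filename is stored once and shared by every node that uses it.
#[derive(Default)]
struct NameCache {
    ids: HashMap<String, u32>,
    names: Vec<String>,
}

impl NameCache {
    fn intern(&mut self, name: &str) -> u32 {
        if let Some(&id) = self.ids.get(name) {
            return id; // already stored; reuse its ID
        }
        let id = self.names.len() as u32;
        self.names.push(name.to_string());
        self.ids.insert(name.to_string(), id);
        id
    }
}

fn main() {
    let mut cache = NameCache::default();
    // Thousands of node_modules entries share one "index.js" allocation.
    let a = cache.intern("index.js");
    let b = cache.intern("index.js");
    assert_eq!(a, b);
    assert_eq!(cache.names.len(), 1);
    // The per-node cost is a small fixed-size struct, in the ballpark of
    // the ~50 bytes the ephemeral index targets.
    assert!(std::mem::size_of::<FileNode>() <= 50);
}
```

Because nodes hold only integer IDs, the whole index packs into contiguous memory with no per-entry heap allocation for names.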
The Persistent Layer (“Library” Mode)
For files you want to track across devices, Spacedrive persists data to a synchronized SQLite database using the full multi-phase pipeline with deep content analysis, deduplication, and closure-table hierarchy management.
Seamless State Promotion
The critical innovation is how these two layers communicate. When you add a location to your library for a folder you’re currently browsing ephemerally, the system performs an Intelligent Promotion:
- UUID Preservation: The persistent indexer detects the existing ephemeral index and carries over UUIDs assigned during the browsing session into the database
- UI Consistency: Because UUIDs remain stable, the UI doesn’t flicker or reset. Selections, active tabs, and view states remain intact
- Phase Continuation: The indexer essentially “resumes” from Phase 1, flushing discovered entries to SQLite and proceeding to Phase 2 (Processing) and Phase 3 (Content Analysis)
This architecture allows Spacedrive to act as your daily driver file explorer. You get instant access to files immediately, with the option to progressively “deepen” the index for files that matter.
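The UUID-preservation step can be sketched as a merge that prefers session-assigned IDs. The path-keyed map and the `promote` helper below are hypothetical simplifications; the real indexer carries this mapping inside its resumable job state.

```rust
use std::collections::HashMap;

/// Promote ephemeral entries into the persistent store, reusing any UUID
/// already assigned during the browsing session. `next_id` stands in for
/// a real UUID generator.
fn promote(
    ephemeral: &HashMap<String, u128>,
    paths: &[&str],
    next_id: &mut u128,
) -> HashMap<String, u128> {
    let mut persistent = HashMap::new();
    for &path in paths {
        let id = match ephemeral.get(path) {
            // Reuse the session UUID so UUID-keyed UI state
            // (selections, active tabs) survives the promotion.
            Some(&id) => id,
            // Entry never seen ephemerally: mint a fresh ID.
            None => {
                *next_id += 1;
                *next_id
            }
        };
        persistent.insert(path.to_string(), id);
    }
    persistent
}

fn main() {
    let mut eph = HashMap::new();
    eph.insert("/mnt/usb/photo.jpg".to_string(), 42u128);
    let mut next = 100u128;
    let db = promote(&eph, &["/mnt/usb/photo.jpg", "/mnt/usb/new.txt"], &mut next);
    // The browsed file keeps its session ID; the unseen file gets a new one.
    assert_eq!(db["/mnt/usb/photo.jpg"], 42);
    assert_eq!(db["/mnt/usb/new.txt"], 101);
}
```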
Architecture Overview
The indexing system consists of specialized components working together:
IndexerJob orchestrates the entire indexing process as a resumable job. It maintains state across application restarts and provides detailed progress reporting.
IndexerState preserves all necessary information to resume indexing from any interruption point. This includes the current phase, directories to process, accumulated statistics, and ephemeral UUID mappings for preserving user metadata across browsing-to-persistent transitions.
DatabaseStorage provides the low-level database CRUD layer. All database operations (create, update, move, delete) flow through this module for consistency.
DatabaseAdapter implements both ChangeHandler (for filesystem watcher events) and IndexPersistence (for indexer job batches). Both pipelines use the same code to write entries to the database via DatabaseStorage.
MemoryAdapter implements both ChangeHandler (for filesystem watcher events) and IndexPersistence (for indexer job batches). Both pipelines use the same code to write entries to the in-memory EphemeralIndex.
This dual-implementation architecture unifies watcher and job pipelines, eliminating code duplication between real-time filesystem monitoring and batch indexing operations.
FileTypeRegistry identifies files through extensions, magic bytes, and content analysis.
The system integrates deeply with Spacedrive’s job infrastructure, which provides automatic state persistence through MessagePack serialization. When you pause an indexing operation, the entire job state is saved to a dedicated jobs database, allowing seamless resumption even after application restarts.
Indexing jobs can run for hours on large directories. The resumable
architecture ensures no work is lost if interrupted.
Database Architecture
The indexing system uses a closure table for hierarchy management instead of recursive queries.
Closure Table
Parent-child relationships are stored in the entry_closure table with precomputed ancestor-descendant pairs. This makes “find all descendants” a single indexed query with no recursion, regardless of nesting depth, at the cost of additional storage (worst-case N² rows for deeply nested trees).
For a file at /home/user/docs/report.pdf, closure entries exist for:
- (home_id, report_id, depth=3)
- (user_id, report_id, depth=2)
- (docs_id, report_id, depth=1)
- (report_id, report_id, depth=0)
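The maintenance rule behind those rows can be sketched in memory: inserting a child copies every ancestor row of its parent at depth + 1, plus the self-referencing depth-0 row. This is an illustrative stand-in for the SQLite entry_closure table, using small integer IDs.

```rust
/// In-memory sketch of a closure table: rows of (ancestor, descendant, depth).
#[derive(Default)]
struct ClosureTable {
    rows: Vec<(u32, u32, u32)>,
}

impl ClosureTable {
    /// Inserting `child` under `parent` copies every ancestor row of the
    /// parent with depth + 1, plus the self-referencing depth-0 row.
    fn insert(&mut self, child: u32, parent: Option<u32>) {
        if let Some(p) = parent {
            let inherited: Vec<_> = self
                .rows
                .iter()
                .filter(|&&(_, d, _)| d == p)
                .map(|&(a, _, depth)| (a, child, depth + 1))
                .collect();
            self.rows.extend(inherited);
        }
        self.rows.push((child, child, 0));
    }

    /// "All descendants" is one pass over the ancestor column (an indexed
    /// lookup in SQLite); no recursion over parent_id is needed.
    fn descendants(&self, ancestor: u32) -> Vec<u32> {
        self.rows
            .iter()
            .filter(|&&(a, d, _)| a == ancestor && a != d)
            .map(|&(_, d, _)| d)
            .collect()
    }
}

fn main() {
    // Mirror the example above: home=1, user=2, docs=3, report=4.
    let mut t = ClosureTable::default();
    t.insert(1, None);
    t.insert(2, Some(1));
    t.insert(3, Some(2));
    t.insert(4, Some(3));
    assert_eq!(t.descendants(1), vec![2, 3, 4]);
    // The depth-3 pair (home, report) exists precomputed.
    assert!(t.rows.contains(&(1, 4, 3)));
}
```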
Directory Paths Cache
The directory_paths table provides O(1) absolute path lookups for directories.
Entries Table
Indexing Phases
The pipeline is broken into atomic, resumable phases. The Ephemeral engine runs only Phase 1. The Persistent engine runs all five phases.
Phase 1: Discovery
Used by: Ephemeral & Persistent
A parallel, asynchronous filesystem walk designed for raw speed:
- Parallelism: Work-stealing architecture where workers consume directories and directly enqueue subdirectories. On systems with 8+ cores, multiple threads scan concurrently, communicating via channels to maximize disk throughput
- Rules Engine: Filters system files (.git, node_modules) at the discovery edge through IndexerRuler, which applies toggleable system rules (NO_HIDDEN, NO_DEV_DIRS) and dynamically loaded .gitignore patterns when inside a Git repository
- Output: A stream of lightweight DirEntry objects
Phase 2: Processing
Used by: Persistent Only
Converts discovered entries into database records:
- Topology Sorting: Entries are sorted by depth (parents before children) to maintain referential integrity during batch insertion
- Batching: Writes occur in transactions of 1,000 items to minimize SQLite locking overhead
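The batching strategy is easy to sketch: chunk the entries and commit once per chunk, trading many tiny transactions (and their SQLite lock acquisitions) for a few large ones. The `commit` callback below is a stand-in for the real transactional write.

```rust
/// Batch size the doc describes; one transaction per batch.
const BATCH_SIZE: usize = 1_000;

/// Group entries into batches and invoke one commit per batch.
/// Returns the number of transactions issued.
fn write_batched<T>(entries: &[T], mut commit: impl FnMut(&[T])) -> usize {
    let mut transactions = 0;
    for batch in entries.chunks(BATCH_SIZE) {
        // One transaction covers the whole batch instead of one per entry.
        commit(batch);
        transactions += 1;
    }
    transactions
}

fn main() {
    let entries: Vec<u32> = (0..2_500).collect();
    let mut written = 0usize;
    let tx = write_batched(&entries, |batch| written += batch.len());
    assert_eq!(tx, 3); // 1000 + 1000 + 500
    assert_eq!(written, 2_500);
}
```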
ChangeDetector loads existing database entries for the indexing path, then compares against filesystem state to identify:
- New: Paths not in database
- Modified: Size or mtime differs
- Moved: Same inode at different path
- Deleted: In database but missing from filesystem
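The four categories above can be expressed as a small classifier over path- and inode-keyed database state. The `Meta` record and lookup order here are illustrative assumptions, not ChangeDetector's actual implementation.

```rust
use std::collections::HashMap;

/// Stand-in metadata for comparison: the real detector compares database
/// rows against walked filesystem entries.
#[derive(Clone, Copy, PartialEq)]
struct Meta { size: u64, mtime: i64, inode: u64 }

#[derive(Debug, PartialEq)]
enum Change { New, Modified, Moved { from: &'static str }, Unchanged }

/// Classify one filesystem entry against database state: known path with
/// size/mtime drift is Modified; unknown path with a known inode is Moved;
/// otherwise New.
fn classify(
    db_by_path: &HashMap<&'static str, Meta>,
    db_by_inode: &HashMap<u64, &'static str>,
    path: &'static str,
    meta: Meta,
) -> Change {
    match db_by_path.get(path) {
        Some(&old) if old.size != meta.size || old.mtime != meta.mtime => Change::Modified,
        Some(_) => Change::Unchanged,
        None => match db_by_inode.get(&meta.inode) {
            // Same inode seen at a different path: a rename/move.
            Some(&from) => Change::Moved { from },
            None => Change::New,
        },
    }
}

fn main() {
    let mut by_path = HashMap::new();
    let mut by_inode = HashMap::new();
    by_path.insert("/a.txt", Meta { size: 10, mtime: 1, inode: 7 });
    by_inode.insert(7, "/a.txt");

    assert_eq!(classify(&by_path, &by_inode, "/b.txt", Meta { size: 1, mtime: 2, inode: 9 }), Change::New);
    assert_eq!(classify(&by_path, &by_inode, "/a.txt", Meta { size: 99, mtime: 1, inode: 7 }), Change::Modified);
    assert_eq!(classify(&by_path, &by_inode, "/moved.txt", Meta { size: 10, mtime: 1, inode: 7 }), Change::Moved { from: "/a.txt" });
    // Deleted = database paths never seen during the walk (a final pass).
}
```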
During promotion, UUIDs assigned in the browsing session are reapplied to matching entries (carried in state.ephemeral_uuids). This prevents orphaning user metadata like tags and notes attached during browsing sessions.
The processing phase validates that the indexing path stays within location boundaries, preventing catastrophic cross-location deletion if watcher routing bugs send events for the wrong path.
Phase 3: Aggregation
Used by: Persistent Only
To allow sorting folders by “True Size” (the size of all children recursively), we aggregate statistics from the bottom up:
- Closure Table: Uses the entry_closure table to perform O(1) descendant lookups
- Leaf-to-Root: Calculates sizes for the deepest directories first, bubbling totals up to the root
Each directory record stores:
- aggregate_size: Total bytes including subdirectories
- child_count: Direct children only
- file_count: Recursive file count
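The leaf-to-root pass can be sketched by sorting directories deepest-first, so every child's total is final before its parent reads it. The in-memory `Dir` tree is an illustrative stand-in for the closure-table-driven SQLite update.

```rust
use std::collections::HashMap;

/// Illustrative directory record: parent link, depth, and the bytes of
/// files directly inside it (not in subdirectories).
struct Dir { parent: Option<u32>, depth: u32, own_file_bytes: u64 }

/// Compute aggregate_size for every directory, bubbling totals upward.
fn aggregate(dirs: &HashMap<u32, Dir>) -> HashMap<u32, u64> {
    let mut totals: HashMap<u32, u64> =
        dirs.iter().map(|(&id, d)| (id, d.own_file_bytes)).collect();
    // Deepest directories first: a child's total is complete before its
    // parent consumes it.
    let mut order: Vec<u32> = dirs.keys().copied().collect();
    order.sort_by_key(|id| std::cmp::Reverse(dirs[id].depth));
    for id in order {
        if let Some(parent) = dirs[&id].parent {
            let bubbled = totals[&id];
            *totals.get_mut(&parent).unwrap() += bubbled;
        }
    }
    totals
}

fn main() {
    // root(0) -> docs(1) -> photos(2)
    let mut dirs = HashMap::new();
    dirs.insert(0, Dir { parent: None, depth: 0, own_file_bytes: 100 });
    dirs.insert(1, Dir { parent: Some(0), depth: 1, own_file_bytes: 200 });
    dirs.insert(2, Dir { parent: Some(1), depth: 2, own_file_bytes: 300 });
    let totals = aggregate(&dirs);
    assert_eq!(totals[&2], 300);
    assert_eq!(totals[&1], 500); // 200 + 300
    assert_eq!(totals[&0], 600); // 100 + 500
}
```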
Phase 4: Content Identification
Used by: Persistent Only
Enables Spacedrive’s deduplication capabilities through Content Addressable Storage (CAS):
- BLAKE3 Hashing: Generates content hashes for files, linking entries to content_identity records
- Globally Deterministic UUIDs: Uses v5 UUIDs (namespace hash of content_hash only) so any device can independently identify identical files and arrive at the exact same Content UUID without communicating. This enables offline duplicate detection across all devices and libraries
- Sync Order: Content identities must be synced before entries to avoid foreign key violations on receiving devices. The job system enforces this ordering
- File Type Identification: Runs via FileTypeRegistry to populate kind_id and mime_type_id fields for new content
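The property that makes the UUID scheme work is that the ID is a pure function of content. As a sketch of that property only: the real system derives a v5 UUID from the BLAKE3 content_hash, while here a fixed-key std hasher stands in for both, so two "devices" compute the same ID without communicating.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative deterministic content ID: a pure function of
/// (namespace, content_hash), so it is computable offline on any device.
/// DefaultHasher::new() uses fixed keys, giving the same result everywhere.
fn content_id(namespace: &str, content_hash: &str) -> u64 {
    let mut h = DefaultHasher::new();
    namespace.hash(&mut h);
    content_hash.hash(&mut h);
    h.finish()
}

fn main() {
    let hash = "blake3:deadbeef"; // placeholder content hash
    // "Device A" and "device B" never communicate, yet agree on the ID.
    let a = content_id("sd-content", hash);
    let b = content_id("sd-content", hash);
    assert_eq!(a, b);
    // Different content yields a different identity.
    assert_ne!(a, content_id("sd-content", "blake3:cafebabe"));
}
```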
Phase 5: Finalizing
Used by: Persistent Only
Finalizing handles post-processing tasks like directory aggregation updates and potential processor dispatch (thumbnail generation for Deep Mode).
Change Detection System
The indexing system includes both batch and real-time change detection.
Batch Change Detection
ChangeDetector compares database state against filesystem state during indexer job scans.
Real-Time Change Detection
Both DatabaseAdapter and MemoryAdapter implement the ChangeHandler trait, which defines the interface for responding to filesystem watcher events. The same events are routed to either the persistent library (DatabaseAdapter → database) or the ephemeral session (MemoryAdapter → memory).
Indexing Modes and Scopes
The system provides flexible configuration through modes and scopes.
Index Modes
Shallow Mode extracts only filesystem metadata (name, size, dates). Completes in under 500ms for typical directories.
Content Mode adds BLAKE3 hashing to identify files by content. Enables deduplication and content tracking.
Deep Mode performs full analysis including file type identification and metadata extraction. Triggers thumbnail generation for images and videos.
Index Scopes
Current Scope indexes only immediate directory contents. Used for responsive UI navigation.
Recursive Scope indexes the entire directory tree. Used for full location indexing.
Persistence and Ephemeral Indexing
Spacedrive supports both persistent and ephemeral indexing modes.
Persistent Indexing
Persistent indexing stores all data in the database permanently. This is the default for library locations:
- Full change detection and history
- Syncs across devices
- Survives application restarts
- Enables offline search
Ephemeral Indexing
Ephemeral indexing keeps data in memory only, perfect for browsing external drives without permanent storage. The system uses highly memory-optimized structures (detailed in the Data Structures section below):
- NodeArena: Slab allocator for FileNode entries with 32-bit entry IDs instead of 64-bit pointers
- NameCache: Global string interning pool where one copy of “index.js” serves thousands of node_modules files
- NameRegistry: BTreeMap for fast name-based lookups without full-text indexing overhead
EphemeralIndexCache tracks which paths have been indexed, are currently being indexed, or are registered for filesystem watching. When a watched path receives filesystem events, the system updates the in-memory index in real-time through the unified ChangeHandler trait (shared with persistent storage).
Ephemeral mode lets you explore USB drives or network shares without
permanently adding them to your library.
Data Structures & Optimizations
Specific low-level optimizations make the hybrid architecture viable.
NodeArena (Ephemeral)
The ephemeral index doesn’t use standard HashMaps. Instead, it uses a memory-mapped NodeArena—a contiguous slab of memory that addresses file nodes with 32-bit integer IDs rather than 64-bit pointers. This reduces memory overhead by 4-6x compared to a naive HashMap<PathBuf, Entry> implementation, enabling browsing of hundreds of thousands of files without database overhead.
Name Pooling (Ephemeral)
In typical filesystems, filenames like index.js, .DS_Store, or conf.yaml repeat thousands of times. The NameCache interns these strings, storing them once and referencing them by pointer. Multiple directory trees can coexist in the same EphemeralIndex (browsing both /mnt/nas and /media/usb simultaneously), sharing the string interning pool for maximum deduplication.
Future Roadmap: We plan to port the Name Pooling strategy from the ephemeral engine to the SQLite database schema. This will significantly reduce the storage footprint of the persistent library by deduplicating filename strings at the database level.
Directory Path Caching (Persistent)
While the database uses an adjacency list (parent_id) for structure, recursive queries are slow. The directory_paths table caches the full absolute path of every directory, enabling O(1) path resolution for any file without recursive parent traversal.
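The cache turns path resolution into a single map lookup plus a string join. The sketch below is an in-memory analogue of the directory_paths table; the IDs and paths are illustrative.

```rust
use std::collections::HashMap;

/// Resolve a file's absolute path from the cached path of its parent
/// directory: one lookup, no walk over parent_id links to the root.
fn resolve(
    dir_paths: &HashMap<u32, &str>, // directory_paths analogue: dir ID -> absolute path
    parent_dir: u32,
    file_name: &str,
) -> Option<String> {
    dir_paths
        .get(&parent_dir)
        .map(|dir| format!("{}/{}", dir, file_name))
}

fn main() {
    let mut cache = HashMap::new();
    cache.insert(3, "/home/user/docs");
    // O(1): no recursive traversal over docs -> user -> home.
    assert_eq!(
        resolve(&cache, 3, "report.pdf"),
        Some("/home/user/docs/report.pdf".to_string())
    );
    // Unknown directory ID: no cached path.
    assert_eq!(resolve(&cache, 99, "x"), None);
}
```

The trade-off is write amplification: moving or renaming a directory must rewrite the cached paths of its entire subtree.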
Indexer Rules
The IndexerRuler applies filtering rules during discovery to skip unwanted files.
System Rules are toggleable patterns like:
- NO_HIDDEN: Skip dotfiles (.git, .DS_Store)
- NO_DEV_DIRS: Skip node_modules, target, dist
- NO_SYSTEM: Skip OS folders (System32, Windows)
When inside a Git repository, the ruler also loads patterns from .gitignore files. This automatically excludes build artifacts and local configuration.
Rules return a RulerDecision (Accept/Reject) for each path during discovery, preventing unwanted entries from ever reaching the processing phase.
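Rule evaluation at the discovery edge can be sketched as a short-circuiting check per path. The rule names mirror this section; the matching below is a simplified name comparison, not the real glob engine.

```rust
/// Decision returned for each path during discovery.
#[derive(Debug, PartialEq)]
enum RulerDecision { Accept, Reject }

/// Evaluate the toggleable system rules plus loaded .gitignore patterns
/// (here reduced to exact-name matches for illustration).
fn evaluate(path: &str, no_hidden: bool, no_dev_dirs: bool, gitignore: &[&str]) -> RulerDecision {
    let name = path.rsplit('/').next().unwrap_or(path);
    if no_hidden && name.starts_with('.') {
        return RulerDecision::Reject; // NO_HIDDEN: dotfiles like .git, .DS_Store
    }
    if no_dev_dirs && matches!(name, "node_modules" | "target" | "dist") {
        return RulerDecision::Reject; // NO_DEV_DIRS
    }
    if gitignore.iter().any(|pat| name == *pat) {
        return RulerDecision::Reject; // patterns loaded from .gitignore
    }
    RulerDecision::Accept
}

fn main() {
    let ignored = ["build"]; // hypothetical .gitignore entry
    assert_eq!(evaluate("/repo/.git", true, true, &ignored), RulerDecision::Reject);
    assert_eq!(evaluate("/repo/node_modules", true, true, &ignored), RulerDecision::Reject);
    assert_eq!(evaluate("/repo/build", true, true, &ignored), RulerDecision::Reject);
    assert_eq!(evaluate("/repo/src", true, true, &ignored), RulerDecision::Accept);
}
```

Because rejected paths are dropped before enqueueing, their entire subtrees are never walked, which is where most of the savings come from.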
Index Integrity Verification
The IndexVerifyAction checks integrity by running a fresh ephemeral scan and comparing metadata against the existing persistent index:
- MissingFromIndex: Files created outside Spacedrive
- StaleInIndex: Deleted files not yet purged from database
- SizeMismatch: Files modified externally
- ModifiedTimeMismatch: Timestamp drift (with 1-second tolerance)
- InodeMismatch: File replacement or filesystem corruption
The action produces an IntegrityReport with per-file diagnostics.
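The comparison behind those findings can be sketched as two passes: disk entries checked against the index, then index entries checked for staleness. The (size, mtime) records and 1-second tolerance follow the list above; everything else is an illustrative simplification.

```rust
use std::collections::{HashMap, HashSet};

/// Illustrative subset of the verify findings listed above.
#[derive(Debug, PartialEq, Eq, Hash)]
enum Finding {
    MissingFromIndex(&'static str),
    StaleInIndex(&'static str),
    SizeMismatch(&'static str),
    ModifiedTimeMismatch(&'static str),
}

/// Compare index state (path -> (size, mtime)) against a fresh disk scan.
fn verify(
    index: &HashMap<&'static str, (u64, i64)>,
    disk: &HashMap<&'static str, (u64, i64)>,
) -> HashSet<Finding> {
    let mut findings = HashSet::new();
    for (&path, &(size, mtime)) in disk {
        match index.get(path) {
            None => { findings.insert(Finding::MissingFromIndex(path)); }
            Some(&(idx_size, idx_mtime)) => {
                if idx_size != size {
                    findings.insert(Finding::SizeMismatch(path));
                } else if (idx_mtime - mtime).abs() > 1 {
                    // 1-second tolerance absorbs timestamp drift.
                    findings.insert(Finding::ModifiedTimeMismatch(path));
                }
            }
        }
    }
    for &path in index.keys() {
        if !disk.contains_key(path) {
            findings.insert(Finding::StaleInIndex(path));
        }
    }
    findings
}

fn main() {
    let index = HashMap::from([("/a", (10u64, 100i64)), ("/gone", (5, 50))]);
    let disk = HashMap::from([("/a", (10u64, 101i64)), ("/new", (1, 1))]);
    let f = verify(&index, &disk);
    assert!(f.contains(&Finding::MissingFromIndex("/new")));
    assert!(f.contains(&Finding::StaleInIndex("/gone")));
    // /a's 1-second mtime drift is within tolerance: no finding for it.
    assert_eq!(f.len(), 2);
}
```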
Job System Integration
The indexing system leverages Spacedrive’s job infrastructure for reliability and monitoring.
State Persistence
When interrupted, the entire job state is serialized.
Progress Tracking
Real-time progress flows through multiple channels.
Error Handling
Non-critical errors are accumulated but don’t stop indexing:
- Permission denied on individual files
- Corrupted metadata
- Unsupported file types
Critical errors abort the job so it can be resumed safely:
- Database connection lost
- Filesystem unmounted
- Out of disk space
Performance Characteristics
Indexing performance varies by mode and scope:

| Configuration | Performance | Use Case |
|---|---|---|
| Current + Shallow | <500ms | UI navigation |
| Recursive + Shallow | ~10K files/sec | Quick scan |
| Recursive + Content | ~1K files/sec | Normal indexing |
| Recursive + Deep | ~100 files/sec | Media libraries |
Optimization Techniques
Batch Processing: Groups operations into transactions of 1,000 items, reducing database overhead by 30x.
Parallel Discovery: Work-stealing model with atomic counters for directory traversal, using half of available CPU cores by default.
Entry ID Cache: Eliminates redundant parent lookups during hierarchy construction, critical for deep directory trees.
Checkpoint Strategy: Checkpoints occur every 5,000 items or 30 seconds, balancing durability with performance.
Usage Examples
Quick UI Navigation
For responsive directory browsing, run a Current + Shallow index.
External Drive Browsing
Browse in ephemeral mode to explore without permanent storage.
Full Library Location
Add the directory as a library location for full indexing with content identification.
CLI Commands
The indexer is fully accessible through the CLI.
Troubleshooting
Common Issues
Slow Indexing: Check for large node_modules or build directories. System rules automatically skip common patterns, or use .gitignore to exclude project-specific artifacts.
High Memory Usage: Reduce batch size for directories over 1M files. Ephemeral mode uses around 50 bytes per entry, so 100K files requires roughly 5MB.
Resume Not Working: Ensure the jobs database isn’t corrupted. Check logs for serialization errors.
Debug Tools
Enable detailed logging to trace indexer activity.
Platform Notes
Windows: Uses file indices for change detection where available, falling back to path-only matching. Supports long paths transparently. Network drives may require polling.
macOS: Leverages FSEvents and native inodes. Integrates with Time Machine exclusions. APFS provides efficient cloning.
Linux: Full inode support with detailed permissions. Handles diverse filesystems from ext4 to ZFS. Symbolic links supported with cycle detection.
Best Practices
- Start shallow for new locations to verify configuration before deep scans
- Use Git repositories to automatically inherit .gitignore exclusions
- Monitor progress through the job system instead of polling the database
- Schedule deep scans during low-usage periods for large photo/video libraries
- Enable checkpointing for locations over 100K files to survive interruptions
Related Documentation
- Jobs - Job system architecture
- Locations - Directory management
- Search - Querying indexed data
- Performance - Optimization guide
