Architecture Overview
The indexing system consists of several key components working together:
- IndexerJob orchestrates the entire indexing process as a resumable job. It maintains state across application restarts and provides detailed progress reporting.
- IndexerState preserves all necessary information to resume indexing from any interruption point. This includes the current phase, directories to process, and accumulated statistics.
- EntryProcessor handles the complex task of creating and updating database records while maintaining referential integrity through materialized paths.
- FileTypeRegistry identifies files through a combination of extensions, magic bytes, and content analysis to provide accurate type detection.

The system integrates deeply with Spacedrive’s job infrastructure, which provides automatic state persistence through MessagePack serialization. When you pause an indexing operation, the entire job state is saved to a dedicated jobs database, allowing seamless resumption even after application restarts.
Indexing jobs can run for hours on large directories. The resumable architecture ensures no work is lost if interrupted.
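
To make the resumable design concrete, here is a minimal sketch of a state object that carries the current phase, the directories still to process, and a running statistic. The names `StateSketch`, `Phase`, `run`, and `new_state` are hypothetical and not taken from Spacedrive's source, and a plain `clone()` stands in for the MessagePack round trip through the jobs database.

```rust
use std::collections::VecDeque;
use std::path::PathBuf;

/// The four phases named in this document, plus a terminal marker.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    Discovery,
    Processing,
    Aggregation,
    ContentIdentification,
    Complete,
}

/// Hypothetical stand-in for IndexerState: everything needed to resume.
#[derive(Debug, Clone, PartialEq, Eq)]
struct StateSketch {
    phase: Phase,
    dirs_to_walk: VecDeque<PathBuf>,
    dirs_visited: u64,
}

/// Performs up to `budget` units of work and returns. The caller can
/// persist `state` at that point and call `run` again later.
fn run(state: &mut StateSketch, budget: usize) {
    for _ in 0..budget {
        state.phase = match state.phase {
            Phase::Discovery => match state.dirs_to_walk.pop_front() {
                Some(_dir) => {
                    state.dirs_visited += 1; // a real walker would read `_dir` here
                    Phase::Discovery
                }
                None => Phase::Processing,
            },
            // The later phases are placeholders that finish in one step.
            Phase::Processing => Phase::Aggregation,
            Phase::Aggregation => Phase::ContentIdentification,
            Phase::ContentIdentification | Phase::Complete => Phase::Complete,
        };
    }
}

fn new_state() -> StateSketch {
    StateSketch {
        phase: Phase::Discovery,
        dirs_to_walk: vec!["/a", "/b", "/c"].into_iter().map(PathBuf::from).collect(),
        dirs_visited: 0,
    }
}

fn main() {
    let mut state = new_state();
    run(&mut state, 2); // interrupted after two directories
    let saved = state.clone(); // stands in for serializing the job state
    let mut resumed = saved; // stands in for loading it after a restart
    run(&mut resumed, 10);
    println!("{:?}", resumed);
}
```

Because all progress lives in the state value, an interrupted run that is resumed ends in the same state as an uninterrupted one.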
Indexing Phases
The indexer operates through four distinct phases, each designed to be interruptible and resumable.
Phase 1: Discovery
The discovery phase walks your filesystem to build a list of all files and directories. This phase is optimized for speed, collecting just enough information to plan the work ahead.
Phase 2: Processing
Processing creates or updates database entries for each discovered item. This is where Spacedrive builds its understanding of your file structure.
Phase 3: Aggregation
Aggregation calculates sizes and counts for directories by traversing the tree bottom-up. This phase provides the statistics you see in the UI:
- Total size including subdirectories
- Direct child count
- Recursive file count
- Aggregate content types
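
The bottom-up traversal can be illustrated with a small in-memory sketch. It assumes entries are stored in a flat list in which every parent appears before its children, so a single reverse pass visits children first. `Node`, `DirStats`, `aggregate`, and `sample_tree` are hypothetical names, and content-type aggregation is left out for brevity.

```rust
#[derive(Debug, Clone)]
struct Node {
    parent: Option<usize>, // index of the parent directory, None for the root
    is_dir: bool,
    own_size: u64, // file size in bytes; 0 for directories
}

#[derive(Debug, Clone, Default, PartialEq, Eq)]
struct DirStats {
    total_size: u64,  // bytes, including subdirectories
    child_count: u64, // direct children only
    file_count: u64,  // files at any depth
}

/// Computes directory statistics in one reverse pass.
/// Assumes every parent appears before its children in `nodes`.
/// Slots that belong to files are left at their default value.
fn aggregate(nodes: &[Node]) -> Vec<DirStats> {
    let mut stats = vec![DirStats::default(); nodes.len()];
    for i in (0..nodes.len()).rev() {
        let node = &nodes[i];
        // A directory contributes what its children already accumulated;
        // a file contributes its own size and counts as one file.
        let (size, files) = if node.is_dir {
            (stats[i].total_size, stats[i].file_count)
        } else {
            (node.own_size, 1)
        };
        if let Some(p) = node.parent {
            stats[p].total_size += size;
            stats[p].file_count += files;
            stats[p].child_count += 1;
        }
    }
    stats
}

/// root/ { docs/ { a.txt (100), b.txt (50) }, c.bin (1000) }
fn sample_tree() -> Vec<Node> {
    vec![
        Node { parent: None, is_dir: true, own_size: 0 },
        Node { parent: Some(0), is_dir: true, own_size: 0 },
        Node { parent: Some(1), is_dir: false, own_size: 100 },
        Node { parent: Some(1), is_dir: false, own_size: 50 },
        Node { parent: Some(0), is_dir: false, own_size: 1000 },
    ]
}

fn main() {
    let stats = aggregate(&sample_tree());
    println!("root: {:?}", stats[0]);
    println!("docs: {:?}", stats[1]);
}
```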
Phase 4: Content Identification
The final phase generates content-addressed storage (CAS) identifiers and performs deep file analysis.
Indexing Modes and Scopes
The system provides flexible configuration through modes and scopes.
Index Modes
- Shallow Mode extracts only filesystem metadata (name, size, dates). Completes in under 500ms for typical directories. Perfect for responsive UI navigation.
- Content Mode adds cryptographic hashing to identify files by content. Enables deduplication and content tracking. Moderate performance impact.
- Deep Mode performs full analysis including thumbnails and media metadata extraction. Best for photo and video libraries.
Index Scopes
Current Scope indexes only the immediate directory contents. Recursive Scope indexes the entire directory tree beneath it.
Persistence and Ephemeral Indexing
One of Spacedrive’s key innovations is supporting both persistent and ephemeral indexing modes.
Persistent Indexing
Persistent indexing stores all data in the database permanently. This is the default for library locations:
- Full change detection and history
- Syncs across devices
- Survives application restarts
- Enables offline search
Ephemeral Indexing
Ephemeral indexing keeps data in memory only, perfect for browsing external drives:
- No database writes
- Session-based lifetime
- Memory-efficient storage
- Automatic expiration
Ephemeral mode lets you explore USB drives or network shares without
permanently adding them to your library.
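
As a rough illustration of memory-only storage with automatic expiration, the sketch below keeps directory listings in a `HashMap` and treats them as stale after a time-to-live. The `EphemeralIndex` type, its methods, and the 60-second TTL are assumptions made for this example; the actual expiry policy is not specified here.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::time::{Duration, Instant};

/// Hypothetical in-memory index: nothing here touches a database.
struct EphemeralIndex {
    ttl: Duration, // assumed expiry window
    dirs: HashMap<PathBuf, (Instant, Vec<String>)>,
}

impl EphemeralIndex {
    fn new(ttl: Duration) -> Self {
        EphemeralIndex { ttl, dirs: HashMap::new() }
    }

    /// Stores the listing for `dir`, stamped with the time it was indexed.
    fn insert(&mut self, dir: PathBuf, names: Vec<String>, now: Instant) {
        self.dirs.insert(dir, (now, names));
    }

    /// Returns the listing only while it is younger than the TTL.
    fn get(&self, dir: &Path, now: Instant) -> Option<&[String]> {
        match self.dirs.get(dir) {
            Some((indexed_at, names)) if now.duration_since(*indexed_at) < self.ttl => {
                Some(names.as_slice())
            }
            _ => None,
        }
    }

    /// Drops every listing older than the TTL to free memory.
    fn evict_expired(&mut self, now: Instant) {
        let ttl = self.ttl;
        self.dirs.retain(|_, entry| now.duration_since(entry.0) < ttl);
    }
}

fn main() {
    let start = Instant::now();
    let mut index = EphemeralIndex::new(Duration::from_secs(60));
    index.insert(PathBuf::from("/mnt/usb"), vec!["photo.jpg".to_string()], start);
    println!("{:?}", index.get(Path::new("/mnt/usb"), start + Duration::from_secs(10)));
    index.evict_expired(start + Duration::from_secs(61));
    println!("entries left: {}", index.dirs.len());
}
```

Passing `now` explicitly instead of calling `Instant::now()` inside the methods keeps the expiry logic easy to test.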
Job System Integration
The indexing system leverages Spacedrive’s job infrastructure for reliability and monitoring.
State Persistence
When interrupted, the entire job state is serialized.
Progress Tracking
Real-time progress flows through multiple channels:
- Sent to UI via channels
- Persisted to database
- Available through job queries
- Used for time estimates
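
The sketch below shows one way progress updates could travel over a channel and feed a time estimate. It uses only the standard library's `std::sync::mpsc`. The `Progress` struct, `spawn_worker`, the 250-item reporting interval, and the simulated 20 ms per item are all assumptions for illustration, not Spacedrive's actual types or timings.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq)]
struct Progress {
    processed: u64,
    total: u64,
    elapsed: Duration,
}

impl Progress {
    /// Remaining time, assuming the average rate so far continues.
    fn eta(&self) -> Option<Duration> {
        if self.processed == 0 || self.processed > self.total {
            return None; // no rate to extrapolate from yet
        }
        let remaining = (self.total - self.processed) as f64;
        let secs = self.elapsed.as_secs_f64() * remaining / self.processed as f64;
        Some(Duration::from_secs_f64(secs))
    }
}

/// Hypothetical worker: emits one update per 250 items over a channel.
fn spawn_worker(total: u64) -> mpsc::Receiver<Progress> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let mut processed = 0;
        while processed < total {
            processed = (processed + 250).min(total);
            // Elapsed time is simulated (20 ms per item) so the sketch is deterministic.
            let elapsed = Duration::from_millis(processed * 20);
            if tx.send(Progress { processed, total, elapsed }).is_err() {
                break; // receiver dropped, nobody is listening
            }
        }
    });
    rx
}

fn main() {
    // The loop ends when the worker finishes and drops its sender.
    for update in spawn_worker(1_000) {
        println!("{}/{} eta {:?}", update.processed, update.total, update.eta());
    }
}
```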
Error Handling
The job system provides structured error handling.
Non-critical errors are accumulated but don’t stop indexing:
- Permission denied on individual files
- Corrupted metadata
- Unsupported file types
Critical errors halt the job:
- Database connection lost
- Filesystem unmounted
- Out of disk space
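
One way to model this split is an error enum with a severity check. The sketch assumes that the last three conditions listed above (database connection lost, filesystem unmounted, out of disk space) are critical and stop the job, while the first three are only recorded. `IndexError` and `run_steps` are hypothetical names.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
enum IndexError {
    PermissionDenied(String),
    CorruptedMetadata(String),
    UnsupportedFileType(String),
    DatabaseConnectionLost,
    FilesystemUnmounted,
    OutOfDiskSpace,
}

impl IndexError {
    /// Assumption: these three conditions make further progress impossible.
    fn is_critical(&self) -> bool {
        matches!(
            self,
            IndexError::DatabaseConnectionLost
                | IndexError::FilesystemUnmounted
                | IndexError::OutOfDiskSpace
        )
    }
}

/// Walks step results in order. Non-critical errors are collected and
/// returned alongside the success count; the first critical error aborts.
fn run_steps(
    steps: Vec<Result<(), IndexError>>,
) -> Result<(usize, Vec<IndexError>), IndexError> {
    let mut succeeded = 0;
    let mut accumulated = Vec::new();
    for step in steps {
        match step {
            Ok(()) => succeeded += 1,
            Err(e) if e.is_critical() => return Err(e),
            Err(e) => accumulated.push(e),
        }
    }
    Ok((succeeded, accumulated))
}

fn main() {
    let outcome = run_steps(vec![
        Ok(()),
        Err(IndexError::PermissionDenied("/root/secret".to_string())),
        Ok(()),
    ]);
    println!("{:?}", outcome);
}
```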
Database Schema
The indexer populates several key tables designed for query performance.
Entries Table
The core table uses materialized paths for efficient queries.
Content Identities Table
Enables deduplication across your library.
Performance Characteristics
Indexing performance varies by mode and scope:
| Configuration | Performance | Use Case |
|---|---|---|
| Current + Shallow | <500ms | UI navigation |
| Recursive + Shallow | ~10K files/sec | Quick scan |
| Recursive + Content | ~1K files/sec | Normal indexing |
| Recursive + Deep | ~100 files/sec | Media libraries |
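
For planning purposes, the approximate throughput figures in the table translate into rough duration estimates. The helper below is a back-of-the-envelope sketch with hypothetical names (`Config`, `files_per_sec`, `estimated_secs`). Real throughput depends on hardware and file sizes, so treat the results as order-of-magnitude only.

```rust
#[derive(Debug, Clone, Copy)]
enum Config {
    RecursiveShallow,
    RecursiveContent,
    RecursiveDeep,
}

/// Approximate rates copied from the table above.
fn files_per_sec(config: Config) -> u64 {
    match config {
        Config::RecursiveShallow => 10_000,
        Config::RecursiveContent => 1_000,
        Config::RecursiveDeep => 100,
    }
}

/// Estimated duration in whole seconds, rounded up.
fn estimated_secs(files: u64, config: Config) -> u64 {
    let rate = files_per_sec(config);
    (files + rate - 1) / rate
}

fn main() {
    let files = 1_000_000;
    for config in [Config::RecursiveShallow, Config::RecursiveContent, Config::RecursiveDeep].iter() {
        println!("{:?}: ~{} s", config, estimated_secs(files, *config));
    }
}
```

At these rates a library of one million files takes on the order of 100 seconds for a shallow scan, about 17 minutes with content hashing, and close to three hours in deep mode.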
Optimization Techniques
- Batch Processing: Groups operations into transactions of 1,000 items, reducing database overhead by 30x.
- Parallel I/O: Content identification runs on multiple threads, saturating disk bandwidth on fast storage.
- Smart Caching: The entry ID cache eliminates redundant parent lookups, critical for deep directory trees.
- Checkpoint Strategy: Checkpoints occur every 5,000 items or 30 seconds, balancing durability with performance.
Change Detection
The indexer efficiently detects changes without full rescans:
- New files: Appear with unknown inodes
- Modified files: Same inode, different size/mtime
- Moved files: Known inode at new path
- Deleted files: Missing from filesystem walk
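
The four rules above can be sketched as a comparison between the previously indexed inode map and a fresh filesystem walk. This is an illustrative model, not the indexer's actual code: `FileMeta`, `Change`, `detect_changes`, and `meta` are hypothetical names, and a file that was both moved and modified is reported only as moved to keep the example short.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, Clone, PartialEq, Eq)]
struct FileMeta {
    path: String,
    size: u64,
    mtime: u64, // seconds since the epoch
}

#[derive(Debug, Clone, PartialEq, Eq)]
enum Change {
    New(String),
    Modified(String),
    Moved { from: String, to: String },
    Deleted(String),
}

/// Compares the known inode map against the result of a fresh walk.
fn detect_changes(known: &HashMap<u64, FileMeta>, walk: &[(u64, FileMeta)]) -> Vec<Change> {
    let mut changes = Vec::new();
    let mut seen = HashSet::new();
    for (inode, now) in walk {
        seen.insert(*inode);
        match known.get(inode) {
            // Unknown inode: the file is new.
            None => changes.push(Change::New(now.path.clone())),
            // Known inode at a different path: the file was moved.
            Some(old) if old.path != now.path => changes.push(Change::Moved {
                from: old.path.clone(),
                to: now.path.clone(),
            }),
            // Same inode and path, different size or mtime: modified.
            Some(old) if old.size != now.size || old.mtime != now.mtime => {
                changes.push(Change::Modified(now.path.clone()))
            }
            Some(_) => {} // unchanged
        }
    }
    // Known inodes that the walk never produced were deleted.
    let mut deleted: Vec<String> = known
        .iter()
        .filter(|(inode, _)| !seen.contains(*inode))
        .map(|(_, meta)| meta.path.clone())
        .collect();
    deleted.sort(); // HashMap order is arbitrary, so sort for stable output
    changes.extend(deleted.into_iter().map(Change::Deleted));
    changes
}

fn meta(path: &str, size: u64, mtime: u64) -> FileMeta {
    FileMeta { path: path.to_string(), size, mtime }
}

fn main() {
    let mut known = HashMap::new();
    known.insert(1, meta("a.txt", 10, 100));
    known.insert(2, meta("b.txt", 20, 100));
    let walk = vec![(1, meta("a.txt", 10, 100)), (3, meta("c.txt", 5, 300))];
    println!("{:?}", detect_changes(&known, &walk));
}
```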
Usage Examples
Quick UI Navigation
For responsive directory browsing, use the Current scope with Shallow mode.
External Drive Browsing
Explore without permanent storage by indexing in ephemeral mode.
Full Library Location
Comprehensive indexing with all features enabled.
CLI Commands
The indexer is fully accessible through the CLI.
Troubleshooting
Common Issues
Slow Indexing: Check for large node_modules or build directories. Use .spacedriveignore files to exclude them.
High Memory Usage: Reduce batch size or avoid ephemeral mode for very large directories.
Resume Not Working: Ensure the jobs database isn’t corrupted. Check logs for serialization errors.
Debug Tools
Enable detailed logging.
Platform Notes
- Windows: Uses file indices for change detection. Supports long paths transparently. Network drives may require polling.
- macOS: Leverages FSEvents and native inodes. Integrates with Time Machine exclusions. APFS provides efficient cloning.
- Linux: Full inode support with detailed permissions. Handles diverse filesystems from ext4 to ZFS. Symbolic links supported with cycle detection.
Best Practices
- Start shallow for new locations to verify configuration
- Use filters to exclude build artifacts and caches
- Monitor progress through the job system instead of polling
- Schedule deep scans during low-usage periods
- Enable checkpointing for locations over 100K files
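
To show what checkpointing involves, the sketch below models the policy described under Optimization Techniques: a checkpoint becomes due after 5,000 items or 30 seconds, whichever comes first. `CheckpointPolicy` and its methods are hypothetical names, and the current time is passed in explicitly so the behavior is easy to verify.

```rust
use std::time::{Duration, Instant};

struct CheckpointPolicy {
    max_items: u64,
    max_interval: Duration,
    items_since_last: u64,
    last_checkpoint: Instant,
}

impl CheckpointPolicy {
    /// Thresholds taken from the figures quoted in this document.
    fn new(now: Instant) -> Self {
        CheckpointPolicy {
            max_items: 5_000,
            max_interval: Duration::from_secs(30),
            items_since_last: 0,
            last_checkpoint: now,
        }
    }

    /// Records `items` processed and returns true when a checkpoint is due.
    /// The counters reset whenever a checkpoint is reported.
    fn record(&mut self, items: u64, now: Instant) -> bool {
        self.items_since_last += items;
        let due = self.items_since_last >= self.max_items
            || now.duration_since(self.last_checkpoint) >= self.max_interval;
        if due {
            self.items_since_last = 0;
            self.last_checkpoint = now;
        }
        due
    }
}

fn main() {
    let start = Instant::now();
    let mut policy = CheckpointPolicy::new(start);
    // Simulate batches of 1,000 items arriving one second apart.
    for batch in 1..=6u64 {
        let now = start + Duration::from_secs(batch);
        if policy.record(1_000, now) {
            println!("checkpoint after batch {}", batch);
        }
    }
}
```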
Related Documentation
- Jobs - Job system architecture
- Locations - Directory management
- Search - Querying indexed data
- Performance - Optimization guide
