Staging & Snapshots
How VersionForge stores and manages chunked snapshots in the staging area for diff comparison.
The staging area is the intermediate storage layer between extraction and diffing. Every sync writes a new snapshot to the staging area, and the change detection engine uses these snapshots to determine what changed. Understanding staging helps you debug unexpected diff results and manage storage costs.
What Is a Snapshot
A snapshot is a point-in-time copy of the data extracted from a source system during a single sync run. It contains:
- All rows returned by the connector during extraction
- Canonical hashes computed for every row
- Row keys used to identify rows across syncs
- Metadata including the connector ID, timestamp, and row count
Snapshots are immutable. Once written, a snapshot is never modified -- new syncs produce new snapshots.
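The contents listed above can be pictured as a simple record. This is a minimal sketch only; the class name and field names are illustrative assumptions, not VersionForge's actual storage schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen mirrors snapshot immutability
class Snapshot:
    connector_id: str
    timestamp: str           # ISO 8601 time of the sync run
    row_count: int
    rows: dict[str, dict]    # row key -> full row data from extraction
    hashes: dict[str, str]   # row key -> canonical SHA-256 hex digest

snap = Snapshot(
    connector_id="crm-prod",
    timestamp="2024-05-01T12:00:00Z",
    row_count=2,
    rows={"u1": {"name": "Ada"}, "u2": {"name": "Lin"}},
    hashes={"u1": "ab12...", "u2": "cd34..."},
)
assert snap.row_count == len(snap.rows)
```

Because the dataclass is frozen, attempting to reassign a field raises an error, which is the same guarantee immutable snapshots give you: a new sync produces a new `Snapshot`, never an edit to an old one.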
Chunked Storage
Large datasets are broken into chunks to keep memory usage predictable and enable parallel processing. The default chunk size is 5,000 rows, but you can adjust this per pipeline.
Snapshot #47 (250,000 rows)
├── chunk-001.json (rows 1–5,000)
├── chunk-002.json (rows 5,001–10,000)
├── ...
└── chunk-050.json (rows 245,001–250,000)
Chunking provides three benefits:
- Memory efficiency -- VersionForge processes one chunk at a time during diffing, so even million-row datasets stay within bounded memory
- Parallel hashing -- Multiple chunks can be hashed concurrently during the staging phase
- Incremental upload -- If staging is interrupted, completed chunks are retained and only remaining chunks need to be re-staged
You can view chunk details for any snapshot on the pipeline's Staging tab. Each chunk shows its row range, byte size, and hash count.
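The chunk layout shown above follows directly from the row count and chunk size. A minimal sketch of the range arithmetic (the function name is hypothetical, not a VersionForge API):

```python
def chunk_ranges(row_count: int, chunk_size: int = 5_000):
    """Yield (start, end) row ranges, 1-based and inclusive, one per chunk."""
    for start in range(1, row_count + 1, chunk_size):
        yield start, min(start + chunk_size - 1, row_count)

ranges = list(chunk_ranges(250_000))
# 50 chunks; the last covers rows 245,001-250,000, matching the tree above
```

Note that the final chunk simply shrinks to fit: `chunk_ranges(12, 5)` yields `(1, 5)`, `(6, 10)`, `(11, 12)`.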
Snapshot Retention
VersionForge retains a configurable number of previous snapshots per pipeline. The default retention is 10 snapshots. When writing a new snapshot would push the count past the retention limit, the oldest snapshot is automatically purged.
| Retention Setting | Use Case |
|---|---|
| 5 snapshots | Cost-sensitive environments with frequent syncs |
| 10 snapshots (default) | Standard setup, enough history for most debugging |
| 30 snapshots | Audit-heavy environments needing deep historical comparison |
You can adjust retention in Pipeline Settings > Staging > Snapshot Retention.
Reducing retention below 2 is not allowed. The change detection engine always requires at least one previous snapshot to compare against.
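The retention rule amounts to keeping only the newest N snapshots and rejecting any setting below 2. A minimal sketch, assuming snapshot IDs sorted oldest-first (the function name is illustrative, not VersionForge's implementation):

```python
def apply_retention(snapshot_ids: list[int], retention: int = 10) -> list[int]:
    """Return the snapshots to keep; ids are ordered oldest-first."""
    if retention < 2:  # the engine always needs at least one previous snapshot
        raise ValueError("retention must be at least 2")
    return snapshot_ids[-retention:]  # everything older is purged

kept = apply_retention(list(range(1, 14)))  # 13 snapshots, default retention 10
# snapshots 1-3 are purged; 4-13 remain
```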
How Staging Feeds the Change Detection Engine
When a sync reaches the Diff stage, the change detection engine performs four steps:
1. Load current snapshot hashes -- the engine reads the canonical hash index from the new snapshot (snapshot N). This index maps each row key to its SHA-256 hash.
2. Load previous snapshot hashes -- the engine reads the same hash index from the prior snapshot (snapshot N-1).
3. Compare hash sets -- row keys present in N but not N-1 are ADDs. Keys in both with different hashes are UPDATEs. Keys in N-1 but not N are DELETEs.
4. Hydrate changed rows -- for every UPDATE and DELETE, the engine fetches the full row data from both snapshots so the Safety Gate can display before/after values.
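The comparison of the two hash indexes described above can be sketched with plain set operations. This is illustrative only; `diff_snapshots` and its signature are assumptions, not VersionForge's API:

```python
def diff_snapshots(prev: dict[str, str], curr: dict[str, str]):
    """Classify row keys by comparing the canonical-hash indexes of two snapshots."""
    adds = curr.keys() - prev.keys()                                 # in N, not N-1
    deletes = prev.keys() - curr.keys()                              # in N-1, not N
    updates = {k for k in curr.keys() & prev.keys() if curr[k] != prev[k]}
    return adds, updates, deletes

prev = {"r1": "h1", "r2": "h2", "r3": "h3"}   # snapshot N-1 hash index
curr = {"r2": "h2", "r3": "h3x", "r4": "h4"}  # snapshot N hash index
adds, updates, deletes = diff_snapshots(prev, curr)
# adds == {"r4"}, updates == {"r3"}, deletes == {"r1"}; r2 is unchanged
```

Because only hashes are compared at this point, unchanged rows (like `r2`) are skipped entirely; full row data is fetched only for the keys that land in the UPDATE or DELETE buckets.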
Inspecting Snapshots for Debugging
When a sync produces unexpected results -- missing changes, phantom updates, or incorrect deletes -- snapshot inspection is your first debugging tool.
From the pipeline dashboard:
- Navigate to the Staging tab for the pipeline in question
- Select the snapshot you want to inspect
- Use the Search field to locate specific rows by key
- Compare any two snapshots side-by-side using the Compare button
You can also export a snapshot to CSV for offline analysis.
Canonical Hashing Details
The canonical hash for each row is computed as follows:
- Sort fields alphabetically by key name
- Normalize values -- trim whitespace, format numbers to fixed precision, convert dates to ISO 8601
- Serialize as a pipe-delimited string: key1=value1|key2=value2|...
- Hash the serialized string with SHA-256
This deterministic process ensures that field ordering or minor formatting variance in the source system does not produce false-positive changes.
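The four steps above can be sketched in a few lines. This is a simplified illustration, not VersionForge's actual implementation: date normalization is omitted, and the fixed numeric precision (six digits here) is an assumption:

```python
import hashlib

def canonical_hash(row: dict) -> str:
    """Sort fields, normalize values, serialize pipe-delimited, hash with SHA-256."""
    parts = []
    for key in sorted(row):                  # 1. sort fields alphabetically by key
        value = row[key]
        if isinstance(value, str):
            value = value.strip()            # 2. trim whitespace
        elif isinstance(value, float):
            value = f"{value:.6f}"           #    fixed precision (assumed 6 digits)
        parts.append(f"{key}={value}")
    serialized = "|".join(parts)             # 3. key1=value1|key2=value2|...
    return hashlib.sha256(serialized.encode()).hexdigest()  # 4. SHA-256

# Field order and surrounding whitespace in the source no longer matter:
assert canonical_hash({"a": 1, "b": " x "}) == canonical_hash({"b": "x", "a": 1})
```

This is exactly why reordered columns or trailing spaces in the source system do not register as phantom UPDATEs: both variants serialize to the same string and therefore hash to the same digest.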
If you change the row key configuration for a pipeline, the next sync will treat every row as an ADD and every row from the previous snapshot as a DELETE. Re-keying effectively resets the diff baseline. Plan accordingly.