Staging & Snapshots
How VersionForge stores and manages chunked snapshots in the staging area for diff comparison.
The staging area is the intermediate storage layer between extraction and diffing. Every sync writes a new snapshot to the staging area, and the change detection engine uses these snapshots to determine what changed. Understanding staging helps you debug unexpected diff results and manage storage costs.
What Is a Snapshot
A snapshot is a point-in-time copy of the data extracted from a source system during a single sync run. It contains:
- All rows returned by the connector during extraction
- Canonical hashes computed for every row
- Row keys used to identify rows across syncs
- Metadata including the connector ID, timestamp, and row count
Snapshots are immutable. Once written, a snapshot is never modified -- new syncs produce new snapshots.
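The contents listed above can be pictured as a simple record. This is a minimal sketch only; the class name and field names are illustrative assumptions, not VersionForge's actual storage schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen mirrors snapshot immutability
class Snapshot:
    connector_id: str
    timestamp: str           # ISO 8601 time of the sync run
    row_count: int
    rows: dict[str, dict]    # row key -> full row data from extraction
    hashes: dict[str, str]   # row key -> canonical SHA-256 hex digest

snap = Snapshot(
    connector_id="crm-prod",
    timestamp="2024-05-01T12:00:00Z",
    row_count=2,
    rows={"u1": {"name": "Ada"}, "u2": {"name": "Lin"}},
    hashes={"u1": "ab12...", "u2": "cd34..."},
)
assert snap.row_count == len(snap.rows)
```

Because the dataclass is frozen, attempting to reassign a field raises an error, which is the same guarantee immutable snapshots give you: a new sync produces a new `Snapshot`, never an edit to an old one.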
Chunked Storage
Large datasets are broken into chunks to keep memory usage predictable and enable parallel processing. The default chunk size is 5,000 rows, but you can adjust this per pipeline.
Snapshot #47 (250,000 rows)
├── chunk-001.json (rows 1–5,000)
├── chunk-002.json (rows 5,001–10,000)
├── ...
└── chunk-050.json (rows 245,001–250,000)
Chunking provides three benefits:
- Memory efficiency -- VersionForge processes one chunk at a time during diffing, so even million-row datasets stay within bounded memory
- Parallel hashing -- Multiple chunks can be hashed concurrently during the staging phase
- Incremental upload -- If staging is interrupted, completed chunks are retained and only remaining chunks need to be re-staged
You can view chunk details for any snapshot on the pipeline's Staging tab. Each chunk shows its row range, byte size, and hash count.
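The chunk layout shown above follows directly from the row count and chunk size. A minimal sketch of the range arithmetic (the function name is hypothetical, not a VersionForge API):

```python
def chunk_ranges(row_count: int, chunk_size: int = 5_000):
    """Yield (start, end) row ranges, 1-based and inclusive, one per chunk."""
    for start in range(1, row_count + 1, chunk_size):
        yield start, min(start + chunk_size - 1, row_count)

ranges = list(chunk_ranges(250_000))
# 50 chunks; the last covers rows 245,001-250,000, matching the tree above
```

Note that the final chunk simply shrinks to fit: `chunk_ranges(12, 5)` yields `(1, 5)`, `(6, 10)`, `(11, 12)`.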
Snapshot Retention
VersionForge retains a configurable number of previous snapshots per pipeline. The default retention is 10 snapshots. When writing a new snapshot would push the count past the retention limit, the oldest snapshot is automatically purged.
| Retention Setting | Use Case |
|---|---|
| 5 snapshots | Cost-sensitive environments with frequent syncs |
| 10 snapshots (default) | Standard setup, enough history for most debugging |
| 30 snapshots | Audit-heavy environments needing deep historical comparison |
You can adjust retention in Pipeline Settings > Staging > Snapshot Retention.
Reducing retention below 2 is not allowed. The change detection engine always requires at least one previous snapshot to compare against.
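The retention rule amounts to keeping only the newest N snapshots and rejecting any setting below 2. A minimal sketch, assuming snapshot IDs sorted oldest-first (the function name is illustrative, not VersionForge's implementation):

```python
def apply_retention(snapshot_ids: list[int], retention: int = 10) -> list[int]:
    """Return the snapshots to keep; ids are ordered oldest-first."""
    if retention < 2:  # the engine always needs at least one previous snapshot
        raise ValueError("retention must be at least 2")
    return snapshot_ids[-retention:]  # everything older is purged

kept = apply_retention(list(range(1, 14)))  # 13 snapshots, default retention 10
# snapshots 1-3 are purged; 4-13 remain
```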
How Staging Feeds the Change Detection Engine
When a sync reaches the Diff stage, the change detection engine performs four steps:
1. Load current snapshot hashes -- the engine reads the canonical hash index from the new snapshot (snapshot N). This index maps each row key to its SHA-256 hash.
2. Load previous snapshot hashes -- the engine reads the same hash index from the prior snapshot (snapshot N-1).
3. Compare hash sets -- row keys present in N but not N-1 are ADDs. Keys in both with different hashes are UPDATEs. Keys in N-1 but not N are DELETEs.
4. Hydrate changed rows -- for every UPDATE and DELETE, the engine fetches the full row data from both snapshots so the Safety Gate can display before/after values.
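The comparison of the two hash indexes described above can be sketched with plain set operations. This is illustrative only; `diff_snapshots` and its signature are assumptions, not VersionForge's API:

```python
def diff_snapshots(prev: dict[str, str], curr: dict[str, str]):
    """Classify row keys by comparing the canonical-hash indexes of two snapshots."""
    adds = curr.keys() - prev.keys()                                 # in N, not N-1
    deletes = prev.keys() - curr.keys()                              # in N-1, not N
    updates = {k for k in curr.keys() & prev.keys() if curr[k] != prev[k]}
    return adds, updates, deletes

prev = {"r1": "h1", "r2": "h2", "r3": "h3"}   # snapshot N-1 hash index
curr = {"r2": "h2", "r3": "h3x", "r4": "h4"}  # snapshot N hash index
adds, updates, deletes = diff_snapshots(prev, curr)
# adds == {"r4"}, updates == {"r3"}, deletes == {"r1"}; r2 is unchanged
```

Because only hashes are compared at this point, unchanged rows (like `r2`) are skipped entirely; full row data is fetched only for the keys that land in the UPDATE or DELETE buckets.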
Inspecting Snapshots for Debugging
When a sync produces unexpected results -- missing changes, phantom updates, or incorrect deletes -- snapshot inspection is your first debugging tool.
From the pipeline dashboard:
- Navigate to the Staging tab for the pipeline in question
- Select the snapshot you want to inspect
- Use the Search field to locate specific rows by key
- Compare any two snapshots side-by-side using the Compare button
You can also export a snapshot to CSV for offline analysis.
Canonical Hashing Details
The canonical hash for each row is computed as follows:
- Sort fields alphabetically by key name
- Normalize values -- trim whitespace, format numbers to fixed precision, convert dates to ISO 8601
- Serialize as a pipe-delimited string: key1=value1|key2=value2|...
- Hash the serialized string with SHA-256
This deterministic process ensures that field ordering or minor formatting variance in the source system does not produce false-positive changes.
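The four steps above can be sketched in a few lines. This is a simplified illustration, not VersionForge's actual implementation: date normalization is omitted, and the fixed numeric precision (six digits here) is an assumption:

```python
import hashlib

def canonical_hash(row: dict) -> str:
    """Sort fields, normalize values, serialize pipe-delimited, hash with SHA-256."""
    parts = []
    for key in sorted(row):                  # 1. sort fields alphabetically by key
        value = row[key]
        if isinstance(value, str):
            value = value.strip()            # 2. trim whitespace
        elif isinstance(value, float):
            value = f"{value:.6f}"           #    fixed precision (assumed 6 digits)
        parts.append(f"{key}={value}")
    serialized = "|".join(parts)             # 3. key1=value1|key2=value2|...
    return hashlib.sha256(serialized.encode()).hexdigest()  # 4. SHA-256

# Field order and surrounding whitespace in the source no longer matter:
assert canonical_hash({"a": 1, "b": " x "}) == canonical_hash({"b": "x", "a": 1})
```

This is exactly why reordered columns or trailing spaces in the source system do not register as phantom UPDATEs: both variants serialize to the same string and therefore hash to the same digest.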
If you change the row key configuration for a pipeline, the next sync will treat every row as an ADD and every row from the previous snapshot as a DELETE. Re-keying effectively resets the diff baseline. Plan accordingly.