Canonical Hashing for Change Detection
How VersionForge creates deterministic row hashes to efficiently detect what changed between sync snapshots.
The Problem with Naive Comparison
Comparing two full data extractions row by row, field by field, is expensive. For a NetSuite GL extraction with 500,000 rows and 20 fields, a brute-force comparison means 10 million field comparisons every sync. Worse, source systems often return the same data with different formatting -- extra whitespace, inconsistent casing, different null representations -- which causes false positives in naive equality checks.
VersionForge solves both problems with canonical hashing.
What Canonical Means
A canonical hash is deterministic: the same logical data always produces the same hash, regardless of superficial differences in how the data is formatted. Two representations of the same row must hash identically, and two rows with genuinely different data must hash differently.
VersionForge achieves this through a normalization pipeline that runs before hashing.
The Hashing Process
Normalize field values
Each field value is run through a normalization chain:
- Trim whitespace from both ends
- Lowercase string values (configurable per field for case-sensitive identifiers)
- Null handling -- null, empty string, and the literal "null" are all normalized to a canonical null token (\0)
- Number formatting -- numeric values are rounded to the configured precision and serialized without trailing zeros (4892341.00 becomes 4892341)
- Date formatting -- dates are converted to ISO 8601 format (2026-04-12T00:00:00Z) regardless of source format
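As a rough sketch, the normalization chain above might look like this in Python. The function name, the NULL_TOKEN constant, and the per-field options are illustrative assumptions, not VersionForge's actual API:

```python
from datetime import datetime, timezone

NULL_TOKEN = "\0"  # canonical null sentinel, per the null-handling rule above

def normalize_value(value, *, lowercase=True, precision=2):
    """Normalize one field value so equivalent inputs hash identically.
    Illustrative sketch of the chain described above."""
    # Null handling: None, empty string, and the literal "null"
    # all collapse to the canonical null token.
    if value is None:
        return NULL_TOKEN
    if isinstance(value, str):
        value = value.strip()  # trim whitespace from both ends
        if value == "" or value.lower() == "null":
            return NULL_TOKEN
        return value.lower() if lowercase else value
    # Number formatting: round to configured precision, drop trailing zeros.
    if isinstance(value, (int, float)):
        text = f"{round(float(value), precision):.{precision}f}"
        return text.rstrip("0").rstrip(".")
    # Date formatting: emit ISO 8601 in UTC regardless of source format.
    if isinstance(value, datetime):
        return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return str(value)
```

With this chain, "  NULL ", "", and None all produce the same token, and 4892341.00 serializes as 4892341, so formatting noise from the source system can no longer change the hash.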
Sort fields alphabetically
Field key-value pairs are sorted by field name in lexicographic order. This ensures that two extractions returning the same fields in different column orders produce the same hash.
// Before sorting (source order varies between API calls)
{ amount: "4892341", account: "6100", period: "2026-03" }

// After sorting (deterministic)
{ account: "6100", amount: "4892341", period: "2026-03" }

Serialize and hash
The sorted, normalized key-value pairs are serialized to a deterministic string representation and hashed with SHA-256. The resulting 64-character hex digest is the canonical hash for that row.
Serialized: account=6100|amount=4892341|period=2026-03
Hash: a3f8c1d2e9b0... (SHA-256 digest)
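Putting the sort, serialize, and hash steps together, a minimal Python sketch (the function name is an assumption; the pipe-separated layout mirrors the example above but is illustrative):

```python
import hashlib

def canonical_hash(row: dict) -> str:
    """Sort normalized field pairs by field name, serialize them
    deterministically, and return the SHA-256 hex digest."""
    serialized = "|".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()
```

Because the keys are sorted before serialization, the same row returned in a different column order produces byte-identical input to SHA-256, and therefore the same 64-character digest.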
How This Enables Efficient Diffing
With canonical hashes computed for both the previous snapshot and the current extraction, diffing becomes a hash-set comparison instead of a field-by-field comparison:
- Unchanged rows: Hash exists in both sets. Skip entirely -- no further inspection needed.
- Added rows: Hash exists in the current set but not in the previous. New record.
- Deleted rows: Hash exists in the previous set but not in the current. Record was removed from source.
- Modified rows: The diff key (e.g., employee ID) matches between sets, but the hash differs. VersionForge then does a field-level comparison only on these rows to identify exactly which fields changed.
For a typical sync where 95% of rows are unchanged, this reduces the comparison work by orders of magnitude.
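The four outcomes above can be sketched as a comparison of two {diff_key: canonical_hash} maps. This is a hypothetical helper for illustration, not VersionForge code:

```python
def diff_snapshots(previous: dict, current: dict) -> dict:
    """Classify rows given {diff_key: canonical_hash} maps for the
    previous snapshot and the current extraction."""
    added = [k for k in current if k not in previous]
    deleted = [k for k in previous if k not in current]
    modified = [k for k in current
                if k in previous and current[k] != previous[k]]
    unchanged = [k for k in current
                 if k in previous and current[k] == previous[k]]
    return {"added": added, "deleted": deleted,
            "modified": modified, "unchanged": unchanged}
```

Only the rows in the "modified" bucket need the expensive field-by-field comparison; everything in "unchanged" is skipped on hash equality alone.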
The diff key and the canonical hash serve different purposes. The diff key identifies which row you are looking at (like a primary key). The canonical hash tells you whether that row's data has changed.
Hash Collision Handling
SHA-256 collisions are theoretically possible but practically impossible -- the chance that any two distinct rows share a hash is about 1 in 2^256, and by the birthday bound you would need on the order of 2^128 rows before a collision becomes likely. VersionForge does not implement secondary verification for hash matches. If you are operating in a regulatory environment that requires belt-and-suspenders verification, you can enable fullFieldVerification in the sync profile, which performs a field-level comparison on a configurable sample percentage of "unchanged" rows per run.
diffing:
canonicalHash:
algorithm: "sha256"
normalizeCase: true
numberPrecision: 2
fullFieldVerification:
enabled: false
samplePercent: 5 # verify 5% of unchanged rows
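A sampled verification pass like the one fullFieldVerification describes could be sketched as follows. The helper name and the seeding parameter are assumptions for illustration, not part of VersionForge:

```python
import random

def sample_for_verification(unchanged_keys, sample_percent=5, seed=None):
    """Pick a random subset of 'unchanged' rows whose fields will be
    re-compared in full, as a belt-and-suspenders check on hash equality."""
    if not unchanged_keys:
        return []
    rng = random.Random(seed)  # seed only to make the sketch reproducible
    count = max(1, len(unchanged_keys) * sample_percent // 100)
    return rng.sample(list(unchanged_keys), count)
```

With samplePercent set to 5, a sync with 100,000 "unchanged" rows would re-verify 5,000 of them field by field each run, bounding the extra cost while still surfacing any hypothetical collision.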
Why This Matters
Canonical hashing is the foundation of VersionForge's row-level diffing. Without it, every sync would be a full replacement -- expensive, slow, and impossible to review meaningfully. With it, your Safety Gate shows only what actually changed, your cross-checks verify only the delta, and your audit trail is precise rather than noisy.