# Post-Extraction Validation
VersionForge runs a set of validation checks after pulling data from a source system to catch data quality issues before they enter the pipeline. This page describes each check and how to configure it.
## Why Validate After Extraction
Extraction is the first point where external data enters VersionForge. Everything downstream -- diffing, transforms, the Safety Gate, loading -- assumes the extracted data meets minimum quality standards. Post-extraction validation catches problems at the boundary, before bad data propagates through the pipeline and becomes expensive to diagnose.
## Validation Checks
VersionForge runs five categories of validation checks on every extraction. Each check produces a `PASS`, `WARN`, or `FAIL` result.
### Row Count Validation
Verifies that the number of extracted rows falls within expected bounds.
- Minimum threshold: Catches cases where the source API returned a partial result (e.g., pagination failed silently and only the first page arrived).
- Maximum threshold: Catches runaway extractions where a filter was misconfigured and the query returned the entire table instead of the expected subset.
- Delta threshold: Compares the current row count to the previous extraction. A sudden 40% drop in rows is suspicious even if it is above the minimum.
```yaml
validation:
  rowCount:
    min: 1000
    max: 100000
    maxDeltaPercent: 25  # fail if count changes by more than 25%
```
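The three thresholds can be sketched as a single check. This is an illustrative outline, not VersionForge's implementation; the function and parameter names are assumptions that mirror the config keys above.

```python
def check_row_count(current, previous, min_rows, max_rows, max_delta_percent):
    """Return "PASS" or "FAIL" for the row-count validation.

    previous is the row count from the last successful extraction,
    or None if no prior run exists (the delta check is then skipped).
    """
    if current < min_rows:
        return "FAIL"  # partial result, e.g. silent pagination failure
    if current > max_rows:
        return "FAIL"  # runaway extraction, e.g. misconfigured filter
    if previous is not None and previous > 0:
        delta_percent = abs(current - previous) / previous * 100
        if delta_percent > max_delta_percent:
            return "FAIL"  # suspicious swing even inside the min/max bounds
    return "PASS"
```

Note that a 40% drop, as in the example above, fails the delta check even though the count is still above `min`.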
### Required Field Completeness
Checks that key fields do not contain unexpected null or empty values. You define which fields are required in the sync profile. If more than the allowed percentage of rows have a null value for a required field, the check fails.
```yaml
validation:
  requiredFields:
    - field: "account_code"
      maxNullPercent: 0  # zero tolerance -- every row must have an account
    - field: "department"
      maxNullPercent: 5  # up to 5% null is acceptable (some accounts are undepartmented)
```
Required field validation runs on the raw extraction, before any transform rules apply. If a transform fills in default values for null fields, the pre-transform validation still catches the source-level gap.
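The per-field threshold logic amounts to a null-percentage comparison. A minimal sketch, assuming rows arrive as dictionaries and that empty strings count as missing (both assumptions for illustration):

```python
def check_required_field(rows, field, max_null_percent):
    """Return "PASS" if the null rate for `field` is within tolerance."""
    if not rows:
        return "FAIL"  # nothing to validate; treat as a failed extraction
    nulls = sum(1 for row in rows if row.get(field) in (None, ""))
    null_percent = nulls / len(rows) * 100
    return "PASS" if null_percent <= max_null_percent else "FAIL"
```

With `maxNullPercent: 0`, a single null row fails the check, matching the zero-tolerance example above.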
### Type Validation
Confirms that field values match their expected data types. Dates should parse as dates. Numeric fields should contain numbers. Boolean fields should not contain arbitrary strings.
Type validation catches issues like:
- An `amount` field containing the string `"N/A"` instead of a number
- A `posting_date` field returning `"00/00/0000"` for records with no date
- An `is_active` field containing `"Yes"` instead of `true`
When type validation fails, the report includes the offending row IDs and the actual values so you can trace the issue back to the source.
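Collecting offending row IDs and values can be sketched as a parse-and-record loop. The function name, the `id` field, and the specific parsers are illustrative assumptions, not VersionForge's API:

```python
from datetime import datetime

def find_type_violations(rows, field, parser):
    """Return (row_id, value) pairs whose value fails to parse."""
    violations = []
    for row in rows:
        try:
            parser(row[field])
        except (ValueError, TypeError):
            violations.append((row["id"], row[field]))
    return violations

# Example parsers for the field types mentioned above
def parse_amount(value):
    return float(value)

def parse_date(value):
    return datetime.strptime(value, "%m/%d/%Y")
```

`"N/A"` fails `float()` and `"00/00/0000"` fails date parsing (month `00` is invalid), so both examples above would appear in the report.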
### Referential Integrity
Validates that foreign key references in the extracted data point to valid targets. For example:
- Every `subsidiary_id` in GL journal entries should reference a subsidiary that exists in the dimension extraction.
- Every `employee_id` in a headcount extract should reference an active employee record.
Referential integrity checks require that the referenced dimension data has been extracted in the same run or is available from a previous snapshot. Configure your sync profile to extract dimensions before transactional data.
### Tenant Profiling
Compares the statistical shape of the current extraction to a historical profile built from previous successful runs. This catches anomalies that individual checks miss:
- Cardinality shifts: The number of unique values in a field changed dramatically (e.g., `cost_center` suddenly has 500 unique values instead of the typical 120).
- Distribution changes: The average `amount` value shifted by more than two standard deviations from the historical mean.
- New values: Values appear in a field that have never been seen before (useful for detecting test data or misconfigured environments).
Tenant profiling is opt-in and builds its baseline automatically over the first 10 successful runs.
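The three anomaly types can be sketched against a stored baseline. The function signature, the 2x cardinality cutoff, and the flag names are illustrative assumptions; only the two-standard-deviation rule comes from the description above:

```python
import statistics

def profile_check(amounts, baseline_mean, baseline_stdev,
                  categories, baseline_cardinality, known_values):
    """Return a list of anomaly flags for the current extraction."""
    flags = []
    # Cardinality shift: unique count far above the historical baseline
    if len(set(categories)) > 2 * baseline_cardinality:
        flags.append("cardinality-shift")
    # Distribution change: mean moved more than two baseline stdevs
    if abs(statistics.fmean(amounts) - baseline_mean) > 2 * baseline_stdev:
        flags.append("distribution-change")
    # New values: category values never seen in prior successful runs
    if set(categories) - set(known_values):
        flags.append("new-values")
    return flags
```

In production the baseline (`baseline_mean`, `baseline_stdev`, `baseline_cardinality`, `known_values`) would be the profile accumulated over the first 10 successful runs.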
## What Happens on Validation Failure
| Result | Behavior |
|--------|----------|
| PASS | Pipeline continues to diffing and transform |
| WARN | Pipeline continues, but the run is flagged with warnings visible in the Safety Gate |
| FAIL | Pipeline halts immediately. No data reaches the diff, transform, or load stages. The team is notified with the validation report. |
The fail-fast behavior is deliberate. It is always cheaper to re-run an extraction than to debug bad data after it has been loaded into a planning model and used in a forecast.
## Configuring Validation Rules
Validation rules are defined per sync profile under the `validation` key. You can also define global validation defaults that apply to all profiles and override them at the profile level. Every validation check can be set to `fail`, `warn`, or `off`.
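As an illustration, a profile-level override might relax a global default and downgrade a check's severity. The `severity` key and the check names other than `rowCount` and `requiredFields` are assumptions for illustration, not documented keys:

```yaml
# Hypothetical profile-level override of global validation defaults
validation:
  rowCount:
    severity: fail        # halt the pipeline on failure (default behavior)
    maxDeltaPercent: 40   # this source is volatile; loosen the global 25%
  typeValidation:
    severity: warn        # flag in the Safety Gate but keep running
  tenantProfiling:
    severity: off         # opt out of profiling for this profile
```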