Dataset Compositions: JOIN and UNION Across Sources

Combine multiple source connections into a single dataset with JOIN or UNION — saved once and reused across sync profiles. Alias-prefixed fields, connector-agnostic, full diff and HITL coverage.

Why Compositions Exist

Most planning workflows need data from more than one system. Bookings live in BigQuery, territory mappings live in Pigment, headcount lives in Workday, invoices are split between NetSuite and Stripe. Pushing that logic into the target system (or running an offline script between syncs) defeats the point of VersionForge — you lose row-level hashing, the Safety Gate review queue, and the audit trail.

Compositions bring the join back into the pipeline. A Dataset Composition is a saved, reusable definition that combines two or more source connections into a single output dataset before the pipeline stages, diffs, and pushes. Every Safety Gate guarantee — content hashing, the review queue, override enforcement — applies to the composed dataset.

JOIN vs UNION — Which One You Want

| If you want to… | Use | | --- | --- | | Look values up between tables (e.g. attach territory names to bookings) | JOIN (INNER / LEFT / RIGHT / FULL) | | Stack rows from multiple similar sources into one dataset (e.g. invoices from NetSuite + Stripe) | UNION (STRICT or PERMISSIVE) | | Combine three or more sources on a shared key | JOIN (N-way left-deep reduction — same join type at every step) | | De-duplicate rows that appear across sources | UNION with CONTENT_HASH or PRIMARY_KEY dedup |

The composition editor uses plain-language labels ("Look values up between them (like VLOOKUP)" vs "Stack them on top of each other (like UNION)") so finance teams don't have to reason in SQL terms.

How Composed Rows Flow Through the Pipeline

extract source A ─┐
                  ├─► compose (JOIN or UNION) ─► stage ─► diff ─► review ─► push
extract source B ─┘

Parallel extract. Every SOURCE endpoint in the composition runs in parallel (capped at four concurrent connectors to stay under API rate limits).
Composition stage. The composer merges records into a single RawRecord stream. Fields are prefixed with their source alias (e.g. territory.region, bookings.amount) so two sources with a shared column name never silently collide.
Stable identity. Each composed row gets a deterministic sourceRecordId — join:<composition>:<leftId>×<rightId> for joins, <alias>:<originalId> for unions — so re-running the sync detects real changes instead of treating every run as churn.
Recomputed hash. The row's content hash is computed over the composed payload, which is what the target system actually sees.

From there the pipeline is unchanged: the snapshot is staged, the streaming diff engine compares against the prior run, the Safety Gate queues material changes for review, and only approved rows push.

Saved, Reusable, Reviewable

Compositions are their own object (DatasetComposition) under Configure → Compositions. You build them once, then attach one to a sync profile by toggling the source step from "Connection" to "Composition". The same composition can back multiple profiles — one for a weekly Adaptive push, another for a monthly CSV backfill, a third for month-end reconciliation — without duplicating the join logic.

Every composition run emits:

Per-source telemetry on the SyncRun (which sources succeeded, how many rows each contributed, API calls made, bytes received).
DiffRecord.contributingSources — the aliases whose fields produced each change row, so reviewers can tell at a glance whether a movement was driven by the territory side, the bookings side, or both.
CloudWatch EMF metrics under the VersionForge/Composition namespace with dimensions for Kind, Halted, and per-alias Alias — row-count prediction deviation over 20% is your canary for a silently broken join key.

What Gets Saved, What Doesn't

Saved as composition config:

Which SOURCE endpoints participate, in what order, under which aliases.
Whether it's a JOIN (and what join type) or a UNION (and what mode + dedup).
The match-key pairs, per-key coercion hints (string / numeric / exact), null policy.
Row-count ceiling (hard fan-out cap) and soft review threshold.

Not saved:

Sample records or extract payloads. Previews and sync runs always hit the live source.
Credentials. Compositions reference credentials via the existing SOURCE endpoints — no secrets are stored on the composition itself.

When You Shouldn't Use a Composition

Single-source syncs. A plain profile with one SOURCE and one TARGET remains the right shape for 80% of workflows. Compositions add a stage you only need when two or more systems have to merge.
Transformations that aren't joins or stacks. A composition is not a general ETL engine — for cell-level rewrites, use field mappings and the transform config modal. The composition is there to decide which rows exist; field mapping decides what each row looks like.
One-off reports. If you're running a sync once and never again, export the raw sources and combine them in a spreadsheet. The value of a composition is in the fact that it's saved and re-runnable.