
Updated 2026-04-12

CSV Encoding and Delimiter Detection

How VersionForge auto-detects file encoding, delimiter characters, column types, and header rows in CSV files.

Overview

CSV files arrive from dozens of different source systems, each with its own encoding conventions, delimiter preferences, and formatting quirks. A file exported from a German SAP instance typically uses semicolons and Latin-1 encoding. A US-based Workday export typically uses commas and UTF-8. A tab-delimited file from an Excel "Save As" often contains Windows-1252 encoded characters. VersionForge's CSV connector handles all of these transparently through auto-detection.

Encoding Detection

VersionForge detects the character encoding of incoming CSV files before parsing. The detection process works in order of specificity:

BOM Detection

If the file starts with a Byte Order Mark (BOM), the encoding is determined immediately:

| BOM Bytes | Encoding |
|-----------|----------|
| EF BB BF | UTF-8 |
| FF FE | UTF-16 LE |
| FE FF | UTF-16 BE |
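As an illustration, the BOM lookup can be sketched in a few lines of Python. The helper name `detect_bom` and its return shape are assumptions made for this example, not part of VersionForge's API; only the byte-to-encoding mapping comes from the table.

```python
# Hypothetical helper: maps the BOM bytes from the table to an encoding name.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16le"),
    (b"\xfe\xff", "utf-16be"),
]


def detect_bom(head: bytes):
    """Return (encoding, bom_length) if head starts with a known BOM, else None."""
    for bom, encoding in BOMS:
        if head.startswith(bom):
            return encoding, len(bom)
    return None
```

Returning the BOM length lets the caller skip those bytes before parsing, so the marker does not end up inside the first header name.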

Statistical Detection

When no BOM is present, VersionForge analyzes the first 8 KB of the file to determine encoding:

UTF-8: Valid multi-byte sequences (continuation bytes follow leading bytes correctly). The most common encoding for modern exports.
Windows-1252: Contains bytes in the 0x80-0x9F range that do not form valid UTF-8 sequences. Common in files exported from Excel on Windows.
Latin-1 (ISO 8859-1): Contains bytes in the 0xA0-0xFF range that are valid single-byte characters but not part of valid UTF-8 multi-byte sequences.

If the file contains only ASCII characters (bytes 0x00-0x7F), all three encodings produce identical results. VersionForge defaults to UTF-8 in this case.
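The checks above could be approximated as follows. This is a minimal sketch, not VersionForge's implementation: the function name `guess_encoding`, the constant `SAMPLE_SIZE`, and the order of the fallback checks are assumptions made for illustration.

```python
import codecs

SAMPLE_SIZE = 8 * 1024  # first 8 KB, as described above


def guess_encoding(data: bytes) -> str:
    """Guess the encoding of BOM-less data (hypothetical helper)."""
    sample = data[:SAMPLE_SIZE]
    truncated = len(data) > SAMPLE_SIZE

    # Strict UTF-8 validation. If the sample was cut off, tolerate an
    # incomplete multi-byte sequence at the very end (final=False).
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=not truncated)
        return "utf-8"  # also covers pure ASCII
    except UnicodeDecodeError:
        pass

    # Not valid UTF-8: bytes in 0x80-0x9F suggest Windows-1252.
    if any(0x80 <= b <= 0x9F for b in sample):
        return "windows-1252"

    # Otherwise only 0xA0-0xFF high bytes remain: treat as Latin-1.
    return "latin-1"
```

Pure ASCII input passes the UTF-8 check, which matches the default described above.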

Manual Override

If auto-detection picks the wrong encoding (rare but possible with short files or ambiguous byte patterns), you can force a specific encoding in the sync profile:

{
  "filePath": "legacy-export.csv",
  "encoding": "windows-1252"
}

Supported values: utf-8, utf-16le, utf-16be, latin-1, windows-1252.

Delimiter Detection

VersionForge inspects the first 20 lines of the file to determine the delimiter character. The algorithm:

Count occurrences of each candidate delimiter (comma, semicolon, tab, and pipe) in each line
Check consistency -- the correct delimiter appears the same number of times in every line (since every row should have the same number of columns)
Select the candidate with the highest consistent count
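The three steps above can be sketched in Python as shown here. The function name `detect_delimiter`, the choice to skip empty lines, and returning `None` when no candidate is consistent are assumptions for this example. Like the description, it counts raw characters and does not parse quoted fields.

```python
CANDIDATES = [",", ";", "\t", "|"]
MAX_LINES = 20  # number of lines inspected, as described above


def detect_delimiter(text: str):
    """Return the most frequent consistent candidate, or None (hypothetical helper)."""
    # Assumption: empty lines are ignored when sampling.
    lines = [ln for ln in text.splitlines() if ln][:MAX_LINES]

    best, best_count = None, 0
    for cand in CANDIDATES:
        # Step 1: count occurrences of the candidate in each line.
        counts = {ln.count(cand) for ln in lines}
        # Step 2: consistent means every line has the same count.
        if len(counts) != 1:
            continue
        count = counts.pop()
        # Step 3: keep the candidate with the highest consistent count.
        if count > best_count:
            best, best_count = cand, count
    return best
```

For example, `detect_delimiter("a;b;c\n1;2,5;3\n4;5;6\n")` returns `";"`: the semicolon appears twice in every line, while the comma count varies between 0 and 1.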

Detection Confidence

| Scenario | Result |
|----------|--------|
| Comma appears 5 times in every line, semicolons appear 0-2 times | Comma selected |
| Semicolons appear 8 times in every line, commas only inside quoted fields | Semicolon selected |
| Tabs appear 12 times in every line, no other candidates are consistent | Tab selected |
| Pipes appear 6 times in every line | Pipe selected |

Auto-detection can be fooled if your data values frequently contain the delimiter character inside quoted fields. If you know the delimiter, set it explicitly in the sync profile configuration to skip detection entirely.

Manual Override

{
  "filePath": "pipe-delimited-export.csv",
  "delimiter": "|"
}

Column Type Inference

After parsing the CSV, VersionForge infers the data type of each column by examining sample values. This information feeds the schema discovery UI and helps the transform layer apply appropriate type coercion.

| Inferred Type | Detection Rule | Examples |
|---|---|---|
| number | Matches `-?\d+(\.\d+)?` (optional negative, optional decimal) | 120000, -45.50, 0 |
| date | Matches ISO 8601 pattern `YYYY-MM-DD` with optional time component | 2026-04-12, 2026-04-12T10:30:00 |
| boolean | Exact match for `true` or `false` (case-sensitive) | true, false |
| string | Everything else | Engineering, Jane Doe, US-CA |

Type inference uses the first data row as the sample. If the first row has an atypical value (e.g., an empty cell in a normally numeric column), the inferred type may be string. The transform layer handles type coercion gracefully regardless of the inferred type.
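A rough Python rendering of these rules is shown below. The function names and the exact date pattern are assumptions: the table only says "optional time component", so this sketch accepts just the `THH:MM:SS` form seen in the example. The order in which rules are tried is also assumed.

```python
import re

NUMBER_RE = re.compile(r"-?\d+(\.\d+)?")
# Assumption: the optional time component is limited to THH:MM:SS.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2})?")


def infer_type(value: str) -> str:
    """Classify a single cell value using the rules in the table."""
    if NUMBER_RE.fullmatch(value):
        return "number"
    if DATE_RE.fullmatch(value):
        return "date"
    if value in ("true", "false"):  # case-sensitive exact match
        return "boolean"
    return "string"


def infer_column_types(header, first_row):
    """Infer one type per column from the first data row only."""
    return {name: infer_type(value) for name, value in zip(header, first_row)}
```

Because only the first data row is sampled, an empty cell there falls through every rule and yields `string`, which is the atypical-value case described above.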

Header Row Detection

VersionForge assumes the first non-empty line of the CSV file is the header row. This follows the common convention for CSV files; RFC 4180 describes an optional header line as the first line of the file. Headers are trimmed of leading and trailing whitespace.

If your file does not have a header row, you can supply column names explicitly:

{
  "filePath": "headerless-export.csv",
  "columns": ["employee_id", "first_name", "last_name", "department", "salary"]
}

When explicit columns are provided, the first line of the file is treated as data, not a header.
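The difference between the two modes can be illustrated with Python's standard `csv` module. This is a sketch under stated assumptions: `read_rows` is a hypothetical function, not a VersionForge call, and dropping blank lines everywhere (not only before the header) is a simplification.

```python
import csv
import io


def read_rows(text: str, columns=None, delimiter=","):
    """Return rows as dicts keyed by header name (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    # csv.reader yields [] for blank lines; skip them.
    rows = [row for row in reader if row]
    if not rows:
        return []

    if columns is None:
        # First non-empty line is the header; trim surrounding whitespace.
        header = [name.strip() for name in rows[0]]
        data = rows[1:]
    else:
        # Explicit columns: the first line is data, not a header.
        header = list(columns)
        data = rows

    return [dict(zip(header, row)) for row in data]
```

With `columns=["id", "name"]`, the input `"1,Ann\n2,Bob\n"` produces two records; without `columns`, the same input would treat `1,Ann` as the header and produce one.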

Practical Examples

German SAP Export

{
  "filePath": "sap-gl-export.csv",
  "encoding": "latin-1",
  "delimiter": ";"
}

Handles files like:

Konto;Bezeichnung;Soll;Haben;Saldo
4100;Personalkosten;250.000,00;0,00;250.000,00

Excel Tab-Delimited Export

{
  "filePath": "excel-export.tsv",
  "encoding": "windows-1252",
  "delimiter": "\t"
}

Standard UTF-8 Comma-Delimited

{
  "filePath": "workforce.csv"
}

No overrides needed -- VersionForge auto-detects UTF-8 encoding, comma delimiter, and the header row.

When auto-detection works correctly, you do not need to specify encoding or delimiter at all. The detection runs on every file read, so it adapts automatically if the source system changes its export format.
