# Getting Started

Go from raw CSV to a typed, validated DuckDB table in five commands.
Every analyst knows the feeling: you open a CSV and every column is a string. Is that column of numbers a price or a postal code? Are the dates ISO 8601 or DD/MM/YYYY? Is that blank cell a missing value or an empty string?
This guide walks you through the full FineType pipeline — from raw CSV to a typed, validated DuckDB table — in five steps. Along the way, you will catch real data quality issues that would otherwise surface as silent errors downstream.
## Prerequisites
FineType and DuckDB run on macOS, Linux, and Windows (via WSL).
### Install DuckDB

```shell
# macOS
brew install duckdb

# Linux (apt)
sudo apt install duckdb

# Windows
winget install DuckDB.cli
```

### Install FineType

```shell
# macOS / Linux (Homebrew)
brew install meridian-online/tap/finetype

# Any platform with Rust
cargo install finetype-cli
```

Verify both are working:

```shell
duckdb --version
finetype --version
```

## The Sample Dataset
We will use a small contacts dataset with realistic quality issues — the kind you find in every export from a CRM, spreadsheet, or third-party API.
Download the file directly: contacts.csv
Or copy and save it locally:
```csv
id,name,email,created_at,ip_address,amount
1,Alice Chen,alice@example.com,2024-01-15T09:30:00Z,192.168.1.10,149.99
2,Bob Smith,bob@example.com,2024-02-20T14:15:00Z,10.0.0.42,2500.00
3,Carol Wu,carol@example.com,2024-03-08T11:00:00Z,172.16.0.1,89.50
4,Dan Reeves,dan@example.com,2024-04-12T16:45:00Z,192.168.0.5,1200.00
5,Eve Nakamura,eve@example.com,2024-05-01T08:00:00Z,10.10.10.1,340.75
6,Frank Osei,not-an-email,2024-06-15T10:20:00Z,203.0.113.7,75.00
7,Grace Patel,grace@example.com,15/07/2024,10.0.0.99,1850.25
8,Hiro Tanaka,hiro@example.com,2024-08-03T13:45:00Z,,425.00
9,Isla Morgan,isla@example.com,2024-09-19T07:30:00Z,192.168.1.200,N/A
10,Jin Li,jin@example.com,2024-10-05T16:00:00Z,999.999.999.999,3200.00
11,Kara Novak,kara@example,2024-11-11 09:00:00,10.0.1.50,620.00
12,Leo Santos,leo@example.com,2024-12-01T12:00:00Z,172.16.5.20,
```

This looks tidy at first glance, but there are problems hiding in plain sight:
- Row 6: `not-an-email` is not a valid email address
- Row 7: `15/07/2024` uses DD/MM/YYYY instead of ISO 8601
- Row 8: missing IP address
- Row 9: `N/A` in the amount column — not a number
- Row 10: `999.999.999.999` is not a valid IP address
- Row 11: `kara@example` has no TLD, and the timestamp uses a different format
- Row 12: missing amount
DuckDB will happily read this file and load everything as VARCHAR. The problems only surface later, when a CAST fails or an aggregate silently ignores nulls. FineType catches these issues upfront.
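Each of the problems above can be reproduced with nothing but the Python standard library. This is a minimal sketch, not FineType: the regex and format checks below are illustrative assumptions, but they show why all-VARCHAR loading hides errors that only a validity check can surface.

```python
import ipaddress
import re
from datetime import datetime

# A deliberately simple email pattern that requires a TLD after the "@".
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def check_email(value: str) -> bool:
    return EMAIL_RE.fullmatch(value) is not None

def check_ipv4(value: str) -> bool:
    try:
        ipaddress.IPv4Address(value)  # rejects out-of-range octets
        return True
    except ValueError:
        return False

def check_iso8601(value: str) -> bool:
    try:
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
        return True
    except ValueError:
        return False

def check_number(value: str) -> bool:
    try:
        float(value)
        return True
    except ValueError:
        return False

print(check_email("not-an-email"))     # row 6
print(check_iso8601("15/07/2024"))     # row 7
print(check_number("N/A"))             # row 9
print(check_ipv4("999.999.999.999"))   # row 10
print(check_email("kara@example"))     # row 11: no TLD
```

All five prints are `False`; DuckDB's VARCHAR load would accept every one of these values without complaint.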
## Step 1: Classify Individual Values
Before profiling the whole file, build intuition for how FineType thinks. The `infer` command classifies a single value:
```shell
$ finetype infer -i "alice@example.com" --confidence
identity.person.email          1.0000

$ finetype infer -i "192.168.1.10" --confidence
technology.internet.ip_v4      1.0000

$ finetype infer -i "2024-01-15T09:30:00Z" --confidence
datetime.timestamp.iso_8601    1.0000
```

Each value gets a three-part label: `domain.category.type`. The confidence score tells you how certain the model is.
Now try a bad value:
```shell
$ finetype infer -i "not-an-email" --confidence
identity.person.username       0.9963
```

FineType does not crash or return an error — it classifies what it sees. `not-an-email` looks more like a username than an email. This is exactly the kind of signal that matters when profiling a whole column: if most values are emails but one is a username, something is wrong with the data, not the column.
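A toy version of `infer` makes the three-part labels concrete. This is a sketch, not FineType's model: the label strings come from this guide, while the matching rules and their precedence are assumptions.

```python
import ipaddress
import re
from datetime import datetime

def infer(value: str) -> str:
    """Classify a single value into an assumed domain.category.type label."""
    try:
        ipaddress.IPv4Address(value)
        return "technology.internet.ip_v4"
    except ValueError:
        pass
    try:
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
        return "datetime.timestamp.iso_8601"
    except ValueError:
        pass
    if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value):
        return "identity.person.email"
    # Fallback: bare tokens like "not-an-email" read as usernames.
    return "identity.person.username"

print(infer("alice@example.com"))      # identity.person.email
print(infer("192.168.1.10"))           # technology.internet.ip_v4
print(infer("not-an-email"))           # identity.person.username
```

Note how the fallback mirrors the CLI output above: a value that fails every specific pattern still gets the most plausible label rather than an error.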
## Step 2: Profile the Dataset
The `profile` command analyses every column and picks the best semantic type for each:
```shell
finetype profile -f contacts.csv
```

```
FineType Column Profile — "contacts.csv" (12 rows, 6 columns)
════════════════════════════════════════════════════════════════════════════════
COLUMN       TYPE                                  BROAD      CONF
────────────────────────────────────────────────────────────────────────────────
id           representation.identifier.increment   BIGINT     50.0%   [sense_header_hint_generic:id]
name         identity.person.full_name             VARCHAR    60.0%   [sense_header_hint:name]
email        identity.person.email                 VARCHAR    93.3%   [sense_mask:format]
created_at   datetime.timestamp.iso_8601           TIMESTAMP  93.3%
ip_address   technology.internet.ip_v4             VARCHAR    100.0%  [ipv4_detection]
amount       finance.currency.amount               DECIMAL    60.0%   [sense_header_hint_cross_domain:amount]

6/6 columns typed, 12 rows analyzed
```

A few things to notice:
- **BROAD** is the DuckDB storage type FineType recommends. VARCHAR columns stay as strings (emails and IPs are semantically typed but stored as text); `created_at` becomes TIMESTAMP; `amount` becomes DECIMAL.
- **CONF** is the column-level confidence. The `email` column is at 93.3% — not 100% — because `not-an-email` in row 6 does not match the email pattern. That gap is a data quality signal.
- **Sense hints** (in brackets) show which detection strategy fired. `sense_header_hint:name` means the column header `name` helped resolve an ambiguous classification.
FineType has now told you what each column should be. The next step is to capture that as a formal schema.
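One way to read the CONF column is as the fraction of values that match the winning type. The sketch below assumes that simple definition; FineType's real score also blends header hints and mask detection, so the numbers here will not line up exactly with the profile output.

```python
import re

EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def column_confidence(values, pattern):
    """Fraction of values that fully match the winning type's pattern."""
    hits = sum(1 for v in values if pattern.fullmatch(v))
    return hits / len(values)

emails = ["alice@example.com", "bob@example.com", "not-an-email", "kara@example"]
print(f"{column_confidence(emails, EMAIL_RE):.1%}")  # 50.0%
```

In this toy column, two of four values match, so confidence is 50%. The principle carries over: any confidence below 100% points at specific rows worth inspecting.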
## Step 3: Generate a Schema
> **Coming Soon** — The `finetype schema` command for files is under active development. The syntax below shows the intended interface — it may change before release. See the CLI reference for the current status.
The `schema` command takes the profile results and produces a JSON Schema that describes the expected shape and types of your data:
```shell
finetype schema contacts.csv
```

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://meridian.online/schemas/contacts",
  "title": "contacts",
  "description": "Generated by FineType from contacts.csv (12 rows, 6 columns)",
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "id": {
        "type": "integer",
        "x-finetype-key": "representation.identifier.increment",
        "x-finetype-broad": "BIGINT"
      },
      "name": {
        "type": "string",
        "x-finetype-key": "identity.person.full_name",
        "x-finetype-broad": "VARCHAR"
      },
      "email": {
        "type": "string",
        "format": "email",
        "x-finetype-key": "identity.person.email",
        "x-finetype-broad": "VARCHAR"
      },
      "created_at": {
        "type": "string",
        "format": "date-time",
        "x-finetype-key": "datetime.timestamp.iso_8601",
        "x-finetype-broad": "TIMESTAMP",
        "x-finetype-expression": "strptime(created_at, '%Y-%m-%dT%H:%M:%SZ')"
      },
      "ip_address": {
        "type": "string",
        "format": "ipv4",
        "x-finetype-key": "technology.internet.ip_v4",
        "x-finetype-broad": "VARCHAR"
      },
      "amount": {
        "type": "number",
        "x-finetype-key": "finance.currency.amount",
        "x-finetype-broad": "DECIMAL(18,2)"
      }
    },
    "required": ["id", "name", "email", "created_at", "ip_address", "amount"]
  }
}
```

Save this to a file — it becomes the contract between your raw data and your typed tables:
```shell
finetype schema contacts.csv > schema.json
```

The `x-finetype-*` extension fields carry the semantic type key, the recommended DuckDB storage type, and (where applicable) the cast expression. Standard JSON Schema validators ignore these extensions, but FineType uses them in the next step.
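To see why the extension fields are useful, here is a sketch of how a downstream tool could consume them to build a SELECT list. The schema fragment is abridged from the output above; the traversal logic is an assumption about how such a tool might work, not FineType's actual implementation.

```python
import json

# Abridged schema with only the x-finetype-* fields we need.
schema = json.loads("""
{
  "items": {
    "properties": {
      "id": {"x-finetype-broad": "BIGINT"},
      "email": {"x-finetype-broad": "VARCHAR"},
      "created_at": {
        "x-finetype-broad": "TIMESTAMP",
        "x-finetype-expression": "strptime(created_at, '%Y-%m-%dT%H:%M:%SZ')"
      }
    }
  }
}
""")

def select_items(schema):
    """Turn each column's extension fields into a SELECT-list item."""
    items = []
    for col, spec in schema["items"]["properties"].items():
        if "x-finetype-expression" in spec:
            items.append(f"{spec['x-finetype-expression']} AS {col}")
        elif spec.get("x-finetype-broad", "VARCHAR") == "VARCHAR":
            items.append(col)  # already a string; no cast needed
        else:
            items.append(f"CAST({col} AS {spec['x-finetype-broad']}) AS {col}")
    return items

print(",\n".join(select_items(schema)))
```

Because the extensions live alongside the standard `type` and `format` keywords, one schema file can drive both validation and SQL generation.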
## Step 4: Validate the Data
> **Coming Soon** — The `finetype validate` command is under active development. The syntax below shows the intended interface — it may change before release. See the CLI reference for the current status.
Now that you have a schema, you can validate every row against it. The `validate` command reads your CSV and schema, then splits the data into valid and invalid rows:
```shell
finetype validate contacts.csv schema.json
```

```
FineType Validation — "contacts.csv" against "schema.json"
════════════════════════════════════════════════════════════
12 rows checked, 7 valid, 5 invalid

Output files:
  contacts.valid.csv     7 rows  (clean, ready to load)
  contacts.invalid.csv   5 rows  (need attention)
  contacts.errors.jsonl  5 error records
```

The five invalid rows are exactly the problems we spotted earlier:
| Row | Column | Issue |
|---|---|---|
| 6 | email | `not-an-email` fails email format validation |
| 7 | created_at | `15/07/2024` does not match ISO 8601 |
| 9 | amount | `N/A` is not a valid number |
| 10 | ip_address | `999.999.999.999` fails IPv4 range check |
| 11 | email | `kara@example` fails email format validation (no TLD) |
Rows 8 and 12 have missing values but are not invalid — they are valid rows with nullable fields. The distinction matters: a missing value is expected; `N/A` typed as a string where a number should be is a data quality error.
The sidecar files give you three options for what to do next:

- `contacts.valid.csv` — load directly into DuckDB; every cast will succeed
- `contacts.invalid.csv` — review and fix, or route to a data quality workflow
- `contacts.errors.jsonl` — machine-readable error records for logging or alerting
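The valid/invalid split and the JSONL error log can be sketched in a few lines of Python. The checks and the error-record shape below are assumptions for illustration, not FineType's actual output format.

```python
import csv
import io
import json
import re

EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def validate_row(row):
    """Return a list of error records for one row (empty list = valid)."""
    errors = []
    if not EMAIL_RE.fullmatch(row["email"]):
        errors.append({"column": "email", "value": row["email"],
                       "issue": "fails email format validation"})
    # An empty amount is a permitted null; a non-numeric string is an error.
    if row["amount"] and not re.fullmatch(r"\d+(\.\d+)?", row["amount"]):
        errors.append({"column": "amount", "value": row["amount"],
                       "issue": "not a valid number"})
    return errors

data = io.StringIO(
    "id,name,email,amount\n"
    "1,Alice Chen,alice@example.com,149.99\n"
    "6,Frank Osei,not-an-email,75.00\n"
    "9,Isla Morgan,isla@example.com,N/A\n"
)
valid, invalid, error_log = [], [], []
for row in csv.DictReader(data):
    errors = validate_row(row)
    (invalid if errors else valid).append(row)
    error_log.extend(json.dumps({"row": row["id"], **e}) for e in errors)

print(len(valid), "valid,", len(invalid), "invalid")  # 1 valid, 2 invalid
```

The key design point survives the simplification: invalid rows are quarantined with machine-readable reasons rather than silently coerced or dropped.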
## Step 5: Load into DuckDB
The `load` command generates a `CREATE TABLE AS SELECT` statement with the correct cast for every column:
```shell
finetype load -f contacts.valid.csv
```

```sql
-- Generated by FineType (250 types, Sense→Sharpen pipeline)
-- Source: contacts.valid.csv (7 rows, 6 columns)
CREATE TABLE contacts AS
SELECT
    CAST(id AS BIGINT) AS id,                                  -- representation.identifier.increment
    name,                                                      -- identity.person.full_name
    email,                                                     -- identity.person.email
    strptime(created_at, '%Y-%m-%dT%H:%M:%SZ') AS created_at,  -- datetime.timestamp.iso_8601
    ip_address,                                                -- technology.internet.ip_v4
    CAST(REGEXP_REPLACE(...) AS DECIMAL(18,2)) AS amount       -- finance.currency.amount
FROM read_csv('contacts.valid.csv', all_varchar=true);
```

Paste this into a DuckDB session and you have a fully typed table — no manual CAST statements, no guessing at date formats, no silent type coercion errors:
```shell
duckdb contacts.db < load.sql
```

```sql
-- Query with proper types
SELECT name, amount, created_at
FROM contacts
WHERE created_at > TIMESTAMP '2024-06-01'
ORDER BY amount DESC;
```

The full pipeline in five commands:
```shell
finetype infer -i "alice@example.com" --confidence   # Build intuition
finetype profile -f contacts.csv                     # Discover column types
finetype schema contacts.csv > schema.json           # Capture as JSON Schema
finetype validate contacts.csv schema.json           # Quality gate
finetype load -f contacts.valid.csv > load.sql       # Generate typed DDL
```

## Bonus: DuckDB Extension
If you prefer to stay in SQL, the FineType DuckDB extension lets you classify values directly in queries:
```sql
INSTALL finetype FROM community;
LOAD finetype;

-- Classify a single value
SELECT finetype('alice@example.com');
-- → identity.person.email
```

See the DuckDB Extension docs for the full function reference.
## What's Next?

Keep exploring: dive deeper into the tools and the type system.