
Getting Started

Go from raw CSV to typed, validated DuckDB table in five commands.

Every analyst knows the feeling: you open a CSV and every column is a string. Is that column of numbers a price or a postal code? Are the dates ISO 8601 or DD/MM/YYYY? Is that blank cell a missing value or an empty string?

This guide walks you through the full FineType pipeline — from raw CSV to a typed, validated DuckDB table — in five steps. Along the way, you will catch real data quality issues that would otherwise surface as silent errors downstream.

Prerequisites

FineType and DuckDB run on macOS, Linux, and Windows (via WSL).

  Tool      Purpose
  DuckDB    In-memory analytical SQL engine
  FineType  Semantic type detection (250 types)

Install DuckDB

# macOS
brew install duckdb

# Linux (apt)
sudo apt install duckdb

# Windows
winget install DuckDB.cli

Install FineType

# macOS / Linux (Homebrew)
brew install meridian-online/tap/finetype

# Any platform with Rust
cargo install finetype-cli

Verify both are working:

duckdb --version
finetype --version

The Sample Dataset

We will use a small contacts dataset with realistic quality issues — the kind you find in every export from a CRM, spreadsheet, or third-party API.

Download the file directly: contacts.csv

Or copy and save it locally:

id,name,email,created_at,ip_address,amount
1,Alice Chen,alice.chen@example.com,2024-01-15T09:30:00Z,192.168.1.10,149.99
2,Bob Smith,bob.smith@example.com,2024-02-20T14:15:00Z,10.0.0.42,2500.00
3,Carol Wu,carol.wu@example.com,2024-03-08T11:00:00Z,172.16.0.1,89.50
4,Dan Reeves,dan.reeves@example.com,2024-04-12T16:45:00Z,192.168.0.5,1200.00
5,Eve Nakamura,eve.nakamura@example.com,2024-05-01T08:00:00Z,10.10.10.1,340.75
6,Frank Osei,not-an-email,2024-06-15T10:20:00Z,203.0.113.7,75.00
7,Grace Patel,grace.patel@example.com,15/07/2024,10.0.0.99,1850.25
8,Hiro Tanaka,hiro.tanaka@example.com,2024-08-03T13:45:00Z,,425.00
9,Isla Morgan,isla.morgan@example.com,2024-09-19T07:30:00Z,192.168.1.200,N/A
10,Jin Li,jin.li@example.com,2024-10-05T16:00:00Z,999.999.999.999,3200.00
11,Kara Novak,kara@example,2024-11-11 09:00:00,10.0.1.50,620.00
12,Leo Santos,leo.santos@example.com,2024-12-01T12:00:00Z,172.16.5.20,

This looks tidy at first glance, but there are problems hiding in plain sight:

  • Row 6: not-an-email is not a valid email address
  • Row 7: 15/07/2024 uses DD/MM/YYYY instead of ISO 8601
  • Row 8: missing IP address
  • Row 9: N/A in the amount column — not a number
  • Row 10: 999.999.999.999 is not a valid IP address
  • Row 11: kara@example has no TLD, and the timestamp uses a different format
  • Row 12: missing amount

DuckDB will happily read this file and load everything as VARCHAR. The problems only surface later, when a CAST fails or an aggregate silently ignores nulls. FineType catches these issues upfront.
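To see that failure mode concretely, here is a minimal Python sketch (illustrative only, not part of FineType) of what an all-strings load does to a bad numeric cell:

```python
import csv
import io

# Two rows of the sample data, seen the way an all-VARCHAR load sees them:
# every field is a string, including the bad ones.
raw = io.StringIO("id,amount\n1,149.99\n9,N/A\n")
rows = list(csv.DictReader(raw))

# Nothing fails at load time...
assert all(isinstance(r["amount"], str) for r in rows)

# ...the error only surfaces later, when you finally cast.
try:
    total = sum(float(r["amount"]) for r in rows)
except ValueError as err:
    print("cast failed:", err)  # could not convert string to float: 'N/A'
```

The load succeeds, the aggregate fails, and nothing in between warned you.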

Step 1: Classify Individual Values

Before profiling the whole file, build intuition for how FineType thinks. The infer command classifies a single value:

$ finetype infer -i "alice.chen@example.com" --confidence
identity.person.email	1.0000

$ finetype infer -i "192.168.1.10" --confidence
technology.internet.ip_v4	1.0000

$ finetype infer -i "2024-01-15T09:30:00Z" --confidence
datetime.timestamp.iso_8601	1.0000

Each value gets a three-part label: domain.category.type. The confidence score tells you how certain the model is.
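Because the label is just a dotted string, downstream code can split it into its parts (a trivial Python sketch, not a FineType API):

```python
# Split a FineType label into its three parts; the third part may itself
# contain underscores or digits, so split on at most two dots.
label = "technology.internet.ip_v4"
domain, category, type_ = label.split(".", 2)
print(domain, category, type_)  # technology internet ip_v4
```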

Now try a bad value:

$ finetype infer -i "not-an-email" --confidence
identity.person.username	0.9963

FineType does not crash or return an error — it classifies what it sees. not-an-email looks more like a username than an email. This is exactly the kind of signal that matters when profiling a whole column: if most values are emails but one is a username, something is wrong with the data, not the column.
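That majority-versus-outlier idea is easy to sketch in Python. The snippet below is a toy illustration with hand-rolled regexes, not FineType's actual detectors:

```python
import re

# Toy detectors -- crude stand-ins for FineType's real classifiers.
PATTERNS = {
    "identity.person.email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "identity.person.username": re.compile(r"^[a-z0-9._-]+$"),
}

def classify(value: str) -> str:
    for label, pattern in PATTERNS.items():
        if pattern.match(value):
            return label
    return "unknown"

column = ["alice@example.com", "bob@example.com", "not-an-email", "carol@example.com"]
labels = [classify(v) for v in column]

# The majority label is the column's type; everything else is a quality signal.
majority = max(set(labels), key=labels.count)
outliers = [v for v, label in zip(column, labels) if label != majority]
print(majority)   # identity.person.email
print(outliers)   # ['not-an-email']
```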

Step 2: Profile the Dataset

The profile command analyzes every column and picks the best semantic type for each:

finetype profile -f contacts.csv
FineType Column Profile — "contacts.csv" (12 rows, 6 columns)
════════════════════════════════════════════════════════════════════════════════

  COLUMN                    TYPE                                      BROAD   CONF
  ──────────────────────────────────────────────────────────────────────────────
  id                        representation.identifier.increment      BIGINT  50.0% [sense_header_hint_generic:id]
  name                      identity.person.full_name               VARCHAR  60.0% [sense_header_hint:name]
  email                     identity.person.email                   VARCHAR  93.3% [sense_mask:format]
  created_at                datetime.timestamp.iso_8601            TIMESTAMP  93.3%
  ip_address                technology.internet.ip_v4               VARCHAR 100.0% [ipv4_detection]
  amount                    finance.currency.amount                 DECIMAL  60.0% [sense_header_hint_cross_domain:amount]

6/6 columns typed, 12 rows analyzed

A few things to notice:

  • BROAD is the DuckDB storage type FineType recommends. VARCHAR columns stay as strings (emails and IPs are semantically typed but stored as text); created_at becomes TIMESTAMP; amount becomes DECIMAL.
  • CONF is the column-level confidence. The email column is at 93.3% — not 100% — because not-an-email in row 6 does not match the email pattern. That gap is a data quality signal.
  • Sense hints (in brackets) show which detection strategy fired. sense_header_hint:name means the column header name helped resolve an ambiguous classification.

FineType has now told you what each column should be. The next step is to capture that as a formal schema.

Step 3: Generate a Schema

Coming Soon

The finetype schema command for files is under active development. The syntax below shows the intended interface — it may change before release. See the CLI reference for the current status.

The schema command takes the profile results and produces a JSON Schema that describes the expected shape and types of your data:

finetype schema contacts.csv
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://meridian.online/schemas/contacts",
  "title": "contacts",
  "description": "Generated by FineType from contacts.csv (12 rows, 6 columns)",
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "id": {
        "type": "integer",
        "x-finetype-key": "representation.identifier.increment",
        "x-finetype-broad": "BIGINT"
      },
      "name": {
        "type": "string",
        "x-finetype-key": "identity.person.full_name",
        "x-finetype-broad": "VARCHAR"
      },
      "email": {
        "type": "string",
        "format": "email",
        "x-finetype-key": "identity.person.email",
        "x-finetype-broad": "VARCHAR"
      },
      "created_at": {
        "type": "string",
        "format": "date-time",
        "x-finetype-key": "datetime.timestamp.iso_8601",
        "x-finetype-broad": "TIMESTAMP",
        "x-finetype-expression": "strptime(created_at, '%Y-%m-%dT%H:%M:%SZ')"
      },
      "ip_address": {
        "type": "string",
        "format": "ipv4",
        "x-finetype-key": "technology.internet.ip_v4",
        "x-finetype-broad": "VARCHAR"
      },
      "amount": {
        "type": "number",
        "x-finetype-key": "finance.currency.amount",
        "x-finetype-broad": "DECIMAL(18,2)"
      }
    },
    "required": ["id", "name", "email", "created_at", "ip_address", "amount"]
  }
}

Save this to a file — it becomes the contract between your raw data and your typed tables:

finetype schema contacts.csv > schema.json

The x-finetype-* extension fields carry the semantic type key, the recommended DuckDB storage type, and (where applicable) the cast expression. Standard JSON Schema validators ignore these extensions, but FineType uses them in the next step.
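To make that concrete, here is a short Python sketch of a FineType-aware consumer reading the x-finetype-broad keys back out of a schema fragment. The consumer code is hypothetical; the fragment is a subset of the schema above:

```python
import json

# A subset of the schema generated in Step 3.
schema = json.loads("""
{
  "items": {
    "properties": {
      "created_at": {
        "type": "string",
        "format": "date-time",
        "x-finetype-key": "datetime.timestamp.iso_8601",
        "x-finetype-broad": "TIMESTAMP"
      },
      "amount": {
        "type": "number",
        "x-finetype-key": "finance.currency.amount",
        "x-finetype-broad": "DECIMAL(18,2)"
      }
    }
  }
}
""")

# Standard validators ignore unknown "x-" keywords; a FineType-aware consumer
# can read them back to recover the recommended storage types.
broad_types = {
    name: spec["x-finetype-broad"]
    for name, spec in schema["items"]["properties"].items()
}
print(broad_types)  # {'created_at': 'TIMESTAMP', 'amount': 'DECIMAL(18,2)'}
```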

Step 4: Validate the Data

Coming Soon

The finetype validate command is under active development. The syntax below shows the intended interface — it may change before release. See the CLI reference for the current status.

Now that you have a schema, you can validate every row against it. The validate command reads your CSV and schema, then splits the data into valid and invalid rows:

finetype validate contacts.csv schema.json
FineType Validation — "contacts.csv" against "schema.json"
════════════════════════════════════════════════════════════

  12 rows checked, 7 valid, 5 invalid

  Output files:
    contacts.valid.csv     7 rows (clean, ready to load)
    contacts.invalid.csv   5 rows (need attention)
    contacts.errors.jsonl  5 error records

The five invalid rows are exactly the problems we spotted earlier:

  Row  Column      Issue
  6    email       not-an-email fails email format validation
  7    created_at  15/07/2024 does not match ISO 8601
  9    amount      N/A is not a valid number
  10   ip_address  999.999.999.999 fails IPv4 range check
  11   email       kara@example fails email format validation (no TLD)

Rows 8 and 12 have missing values but are not invalid — they are valid rows with nullable fields. The distinction matters: a missing value is expected; N/A typed as a string where a number should be is a data quality error.
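A minimal Python sketch of that distinction (illustrative, not FineType's validator):

```python
def check_amount(raw: str):
    """Distinguish a nullable empty cell from an unparseable value."""
    if raw == "":
        return ("null", None)      # missing value: valid, field is nullable
    try:
        return ("ok", float(raw))  # parseable number: valid
    except ValueError:
        return ("error", raw)      # junk like 'N/A': a data quality error

print(check_amount(""))        # ('null', None)   -- row 12
print(check_amount("N/A"))     # ('error', 'N/A') -- row 9
print(check_amount("425.00"))  # ('ok', 425.0)    -- row 8
```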

The sidecar files give you three options for what to do next:

  • contacts.valid.csv — load directly into DuckDB, every cast will succeed
  • contacts.invalid.csv — review and fix, or route to a data quality workflow
  • contacts.errors.jsonl — machine-readable error records for logging or alerting

Step 5: Load into DuckDB

The load command generates a CREATE TABLE AS SELECT statement with the correct cast for every column:

finetype load -f contacts.valid.csv
-- Generated by FineType (250 types, Sense→Sharpen pipeline)
-- Source: contacts.valid.csv (7 rows, 6 columns)
CREATE TABLE contacts AS
SELECT
    CAST(id AS BIGINT) AS id,                       -- representation.identifier.increment
    name,                                            -- identity.person.full_name
    email,                                           -- identity.person.email
    strptime(created_at, '%Y-%m-%dT%H:%M:%SZ') AS created_at,  -- datetime.timestamp.iso_8601
    ip_address,                                      -- technology.internet.ip_v4
    CAST(REGEXP_REPLACE(...) AS DECIMAL(18,2)) AS amount        -- finance.currency.amount
FROM read_csv('contacts.valid.csv', all_varchar=true);

Save the statement as load.sql and run it, and you have a fully typed table — no manual CAST statements, no guessing at date formats, no silent type coercion errors:

duckdb contacts.db < load.sql
-- Query with proper types
SELECT name, amount, created_at
FROM contacts
WHERE created_at > TIMESTAMP '2024-06-01'
ORDER BY amount DESC;
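The strptime() format string in the generated SQL is strict. The same directive in Python (an illustration, not part of the FineType toolchain) shows why the invalid rows had to be filtered out before the load:

```python
from datetime import datetime

# The same format directive as the generated strptime() call in the SQL.
FMT = "%Y-%m-%dT%H:%M:%SZ"

# A conforming timestamp parses cleanly...
print(datetime.strptime("2024-01-15T09:30:00Z", FMT))  # 2024-01-15 09:30:00

# ...but row 7's DD/MM/YYYY value would make the cast fail, which is why
# validation splits it out before the load step.
try:
    datetime.strptime("15/07/2024", FMT)
except ValueError as err:
    print("cast would fail:", err)
```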

The full pipeline in five commands:

finetype infer -i "alice.chen@example.com" --confidence  # Build intuition
finetype profile -f contacts.csv                     # Discover column types
finetype schema contacts.csv > schema.json           # Capture as JSON Schema
finetype validate contacts.csv schema.json           # Quality gate
finetype load -f contacts.valid.csv > load.sql       # Generate typed DDL

Bonus: DuckDB Extension

If you prefer to stay in SQL, the FineType DuckDB extension lets you classify values directly in queries:

INSTALL finetype FROM community;
LOAD finetype;

-- Classify a single value
SELECT finetype('alice.chen@example.com');
-- → identity.person.email

See the DuckDB Extension docs for the full function reference.

What's Next?

Keep exploring

Dive deeper into the tools and the type system.
