MeridianMERIDIAN

Build a Typed DuckDB Pipeline

Go from raw CSV to a fully typed DuckDB database — profile, generate SQL, load, and query with proper types.

Goal: Turn a raw CSV into a fully typed DuckDB database in three commands — profile, generate the load SQL, and run it.

Prerequisites

ToolPurpose
FineTypeSemantic type detection and SQL generation
DuckDBAnalytical SQL engine

Sample data

Save this as orders.csv:

order_id,customer_email,order_date,ship_date,amount,currency,country
ORD-001,[email protected],2024-01-15,2024-01-18,149.99,USD,United States
ORD-002,[email protected],2024-02-20,2024-02-22,2500.00,GBP,United Kingdom
ORD-003,[email protected],03/08/2024,03/10/2024,89.50,USD,United States
ORD-004,[email protected],2024-04-12,2024-04-15,1200.00,EUR,Germany
ORD-005,[email protected],2024-05-01,2024-05-03,340.75,AUD,Australia

Notice the date formats are inconsistent — rows 1, 2, 4, and 5 use ISO format (2024-01-15) while row 3 uses US slash format (03/08/2024). This is exactly the kind of problem FineType catches.

Steps

1. Profile the CSV

Run profile to discover the semantic type of every column:

finetype profile -f orders.csv
Column          Type                              Confidence
──────────────  ────────────────────────────────  ──────────
order_id        representation.identifier.custom   0.99
customer_email  identity.contact.email             0.98
order_date      datetime.date.iso_8601             0.87
ship_date       datetime.date.iso_8601             0.87
amount          representation.numeric.decimal     0.99
currency        finance.reference.currency_code    0.98
country         geography.location.country         0.96

FineType has classified all seven columns. The confidence on order_date and ship_date is lower because of the mixed date formats — FineType picks the majority format but flags the inconsistency through the confidence score.

2. Generate the load SQL

Run load to generate a CREATE TABLE AS statement with proper casts:

finetype load -f orders.csv
CREATE TABLE orders AS
SELECT
    order_id::VARCHAR AS order_id,
    customer_email::VARCHAR AS customer_email,
    strptime(order_date, '%Y-%m-%d')::TIMESTAMP AS order_date,
    strptime(ship_date, '%Y-%m-%d')::TIMESTAMP AS ship_date,
    amount::DECIMAL(18,2) AS amount,
    currency::VARCHAR AS currency,
    country::VARCHAR AS country
FROM read_csv('orders.csv', header=true, all_varchar=true);

Every cast is derived from FineType's type detection. The all_varchar=true flag tells DuckDB to read everything as text first, so FineType's casts control the conversion.

3. Load into DuckDB

Pipe the generated SQL directly into DuckDB:

finetype load -f orders.csv | duckdb orders.duckdb

The database file orders.duckdb now contains a fully typed orders table.

4. See the difference

Without FineType, DuckDB reads CSV columns as VARCHAR by default. Compare:

duckdb -c "SELECT column_name, column_type FROM (DESCRIBE SELECT * FROM read_csv('orders.csv'))"
┌─────────────────┬──────────────┐
│   column_name   │ column_type  │
│     varchar     │   varchar    │
├─────────────────┼──────────────┤
│ order_id        │ VARCHAR      │
│ customer_email  │ VARCHAR      │
│ order_date      │ VARCHAR      │
│ ship_date       │ VARCHAR      │
│ amount          │ VARCHAR      │
│ currency        │ VARCHAR      │
│ country         │ VARCHAR      │
└─────────────────┴──────────────┘

Now check the typed table:

duckdb orders.duckdb -c "SELECT column_name, column_type FROM (DESCRIBE orders)"
┌─────────────────┬───────────────┐
│   column_name   │  column_type  │
│     varchar     │    varchar    │
├─────────────────┼───────────────┤
│ order_id        │ VARCHAR       │
│ customer_email  │ VARCHAR       │
│ order_date      │ TIMESTAMP     │
│ ship_date       │ TIMESTAMP     │
│ amount          │ DECIMAL(18,2) │
│ currency        │ VARCHAR       │
│ country         │ VARCHAR       │
└─────────────────┴───────────────┘

Dates are TIMESTAMP, amounts are DECIMAL. Queries that filter by date or sum amounts will now work correctly.

5. Query with proper types

With proper types in place, date arithmetic and numeric aggregation just work:

-- Orders shipped within 2 days
SELECT order_id, customer_email, order_date, ship_date,
       ship_date - order_date AS shipping_time
FROM orders
WHERE ship_date - order_date <= INTERVAL 2 DAY;
┌──────────┬────────────────────┬─────────────────────┬─────────────────────┬───────────────┐
│ order_id │  customer_email    │     order_date      │     ship_date       │ shipping_time │
├──────────┼────────────────────┼─────────────────────┼─────────────────────┼───────────────┤
│ ORD-002  │ [email protected]    │ 2024-02-20 00:00:00 │ 2024-02-22 00:00:00 │ 2 days        │
│ ORD-003  │ [email protected]   │ 2024-03-08 00:00:00 │ 2024-03-10 00:00:00 │ 2 days        │
│ ORD-005  │ [email protected]   │ 2024-05-01 00:00:00 │ 2024-05-03 00:00:00 │ 2 days        │
└──────────┴────────────────────┴─────────────────────┴─────────────────────┴───────────────┘
-- Total revenue by currency
SELECT currency, sum(amount) AS total, count(*) AS orders
FROM orders
GROUP BY currency
ORDER BY total DESC;
┌──────────┬─────────┬────────┐
│ currency │  total  │ orders │
├──────────┼─────────┼────────┤
│ GBP      │ 2500.00 │      1 │
│ EUR      │ 1200.00 │      1 │
│ AUD      │  340.75 │      1 │
│ USD      │  239.49 │      2 │
└──────────┴─────────┴────────┘

These queries would fail or produce wrong results on VARCHAR columns. With typed data, sum() returns a decimal and date subtraction returns an interval.

What you learned

  • finetype profile discovers semantic types in a CSV
  • finetype load generates a DuckDB CREATE TABLE AS statement with proper casts
  • Piping load output directly into DuckDB creates a typed database in one step
  • Typed columns enable date arithmetic, numeric aggregation, and correct sorting

See also

On this page