Meridian

Profile a Parquet File

Profile data stored in Parquet format using FineType — via CSV export or the DuckDB extension.

Goal: Profile a Parquet file to discover the semantic types in your data, using either CSV export or the DuckDB extension.

Prerequisites

Tool             Purpose
───────────────  ─────────────────────────────────────
FineType         Semantic type detection
DuckDB           Reading Parquet files and running SQL
A .parquet file  Any Parquet file — a data warehouse export, a public dataset, your own data

Why Parquet needs a different path

FineType's CLI profiles CSV files. Parquet files store data in a columnar binary format that FineType can't read directly. You have two options:

  1. Export to CSV — use DuckDB to extract a sample, then profile with finetype profile
  2. Use the DuckDB extension — classify values directly inside SQL queries

Both approaches give you the same type labels. Choose whichever fits your workflow.

Option A: Export to CSV and profile

1. Sample and export

DuckDB reads Parquet natively. Extract a sample to CSV:

duckdb -c "COPY (SELECT * FROM 'data.parquet' LIMIT 1000) TO 'sample.csv' (HEADER)"
1000 rows exported.

The LIMIT 1000 keeps the export fast. FineType typically needs a few hundred rows to classify columns accurately — 1,000 is more than enough.

2. Profile the CSV

Run profile on the exported sample:

finetype profile -f sample.csv
Column        Type                            Confidence
────────────  ──────────────────────────────  ──────────
user_id       representation.numeric.integer  0.99
email         identity.contact.email          0.98
signup_date   datetime.date.iso_8601          0.97
country       geography.location.country      0.94
ip_address    technology.internet.ip_v4       0.99

You now know the semantic types in your Parquet file. From here you can generate a schema (finetype schema), create a typed table (finetype load), or simply use the profile as documentation.

3. Clean up

Remove the intermediate CSV when you're done:

rm sample.csv

Option B: Use the DuckDB extension

The FineType DuckDB extension classifies values directly inside SQL — no CSV export needed.

1. Install and load the extension

INSTALL finetype FROM community;
LOAD finetype;

2. Classify individual columns

Use the finetype() function on columns from your Parquet file:

SELECT finetype(email) AS detected_type, email
FROM 'data.parquet'
LIMIT 5;
┌────────────────────────┬───────────────────────┐
│     detected_type      │         email         │
│        varchar         │        varchar        │
├────────────────────────┼───────────────────────┤
│ identity.contact.email │ [email protected] │
│ identity.contact.email │ [email protected] │
│ identity.contact.email │ [email protected] │
│ identity.contact.email │ [email protected] │
│ identity.contact.email │ [email protected] │
└────────────────────────┴───────────────────────┘

3. Profile all columns with UNPIVOT

To classify every column at once, unpivot the table and pass each value through finetype():

WITH samples AS (
    SELECT * FROM 'data.parquet' LIMIT 100
),
unpivoted AS (
    UNPIVOT samples ON COLUMNS(*) INTO NAME column_name VALUE text_value
)
SELECT
    column_name,
    finetype(text_value) AS detected_type,
    count(*) AS n
FROM unpivoted
GROUP BY column_name, detected_type
ORDER BY column_name, n DESC;

This gives you a frequency table of detected types per column — the equivalent of finetype profile but running entirely inside DuckDB.

What you learned

  • FineType's CLI profiles CSV files; Parquet requires either a CSV export step or the DuckDB extension
  • DuckDB's COPY ... TO command extracts a sample from Parquet to CSV in one line
  • The DuckDB extension classifies values in-place — useful when you want to stay in SQL
  • Both paths produce the same FineType type labels
