# Profile a Parquet File
Profile data stored in Parquet format using FineType — via CSV export or the DuckDB extension.
**Goal:** Profile a Parquet file to discover the semantic types in your data, using either CSV export or the DuckDB extension.
## Prerequisites
| Tool | Purpose |
|---|---|
| FineType | Semantic type detection |
| DuckDB | Reading Parquet files and running SQL |
| A `.parquet` file | Any Parquet file — a data warehouse export, a public dataset, your own data |
## Why Parquet needs a different path
FineType's CLI profiles CSV files. Parquet files store data in a columnar binary format that FineType can't read directly. You have two options:
- **Export to CSV** — use DuckDB to extract a sample, then profile with `finetype profile`
- **Use the DuckDB extension** — classify values directly inside SQL queries
Both approaches give you the same type labels. Choose whichever fits your workflow.
## Option A: Export to CSV and profile
### 1. Sample and export
DuckDB reads Parquet natively. Extract a sample to CSV:
```shell
duckdb -c "COPY (SELECT * FROM 'data.parquet' LIMIT 1000) TO 'sample.csv' (HEADER)"
```

```
1000 rows exported.
```

The `LIMIT 1000` keeps the export fast. FineType typically needs a few hundred rows to classify columns accurately, so 1,000 is more than enough.
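If you'd rather stay in Python, the sample-and-export step can be sketched with the standard library's `csv` module. This is only a stand-in for the DuckDB one-liner above: it assumes the rows are already in memory as dicts (reading Parquet itself still needs DuckDB or a Parquet library), and the column names and values are hypothetical.

```python
import csv
import io

# Hypothetical rows standing in for data read from a Parquet file.
rows = [
    {"user_id": 1, "email": "[email protected]"},
    {"user_id": 2, "email": "[email protected]"},
    {"user_id": 3, "email": "[email protected]"},
]

SAMPLE_SIZE = 2  # analogous to LIMIT 1000 in the DuckDB query

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
writer.writeheader()                  # equivalent of the (HEADER) COPY option
writer.writerows(rows[:SAMPLE_SIZE])  # take only the first N rows

print(buf.getvalue(), end="")
```

In a real script you would write to a file path instead of `io.StringIO`, then point `finetype profile -f` at it.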
### 2. Profile the CSV
Run profile on the exported sample:
```shell
finetype profile -f sample.csv
```

```
Column        Type                            Confidence
────────────  ──────────────────────────────  ──────────
user_id       representation.numeric.integer  0.99
email         identity.contact.email          0.98
signup_date   datetime.date.iso_8601          0.97
country       geography.location.country      0.94
ip_address    technology.internet.ip_v4       0.99
```

You now know the semantic types in your Parquet file. From here you can generate a schema (`finetype schema`), create a typed table (`finetype load`), or simply use the profile as documentation.
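To use the profile as documentation, it can help to turn the table into a machine-readable structure. A minimal sketch in Python, assuming the profile text has been captured (the parsing logic here is illustrative, not part of FineType; only three of the columns are shown):

```python
# Profile text copied from the output above.
profile_output = """\
user_id      representation.numeric.integer  0.99
email        identity.contact.email          0.98
signup_date  datetime.date.iso_8601          0.97
"""

# Parse each row into {column: {"type": ..., "confidence": ...}}.
types = {}
for line in profile_output.splitlines():
    column, semantic_type, confidence = line.split()
    types[column] = {"type": semantic_type, "confidence": float(confidence)}

print(types["email"]["type"])  # identity.contact.email
```

From here the dict can be dumped to JSON and checked into your repo alongside the data.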
### 3. Clean up
Remove the intermediate CSV when you're done:
```shell
rm sample.csv
```

## Option B: Use the DuckDB extension
The FineType DuckDB extension classifies values directly inside SQL — no CSV export needed.
### 1. Install and load the extension
```sql
INSTALL finetype FROM community;
LOAD finetype;
```

### 2. Classify individual columns
Use the `finetype()` function on columns from your Parquet file:
```sql
SELECT finetype(email) AS detected_type, email
FROM 'data.parquet'
LIMIT 5;
```

```
┌──────────────────────────┬──────────────────────┐
│      detected_type       │        email         │
│         varchar          │       varchar        │
├──────────────────────────┼──────────────────────┤
│ identity.contact.email   │ [email protected]    │
│ identity.contact.email   │ [email protected]    │
│ identity.contact.email   │ [email protected]    │
│ identity.contact.email   │ [email protected]    │
│ identity.contact.email   │ [email protected]    │
└──────────────────────────┴──────────────────────┘
```

### 3. Profile all columns with UNPIVOT
To classify every column at once, unpivot the table and pass each value through `finetype()`:
```sql
WITH samples AS (
    SELECT * FROM 'data.parquet' LIMIT 100
),
unpivoted AS (
    UNPIVOT samples ON COLUMNS(*) INTO NAME column_name VALUE text_value
)
SELECT
    column_name,
    finetype(text_value) AS detected_type,
    count(*) AS n
FROM unpivoted
GROUP BY column_name, detected_type
ORDER BY column_name, n DESC;
```

This gives you a frequency table of detected types per column — the equivalent of `finetype profile`, but running entirely inside DuckDB.
## What you learned
- FineType's CLI profiles CSV files; Parquet requires either a CSV export step or the DuckDB extension
- DuckDB's `COPY ... TO` command extracts a sample from Parquet to CSV in one line
- The DuckDB extension classifies values in-place — useful when you want to stay in SQL
- Both paths produce the same FineType type labels
## See also
- `profile` command reference — all flags and output formats
- DuckDB Extension — full function reference for the SQL extension
- Build a Typed DuckDB Pipeline — take profiling results and create a fully typed table