Meridian

Profile a Parquet File

Profile data stored in Parquet format using FineType — via CSV export or the DuckDB extension.

Goal: Profile a Parquet file to discover the semantic types in your data, using either CSV export or the DuckDB extension.

Prerequisites

Tool             Purpose
───────────────  ─────────────────────────────────────
FineType         Semantic type detection
DuckDB           Reading Parquet files and running SQL
A .parquet file  Any Parquet file — a data warehouse export, a public dataset, your own data

Why Parquet needs a different path

FineType's CLI profiles CSV files. Parquet files store data in a columnar binary format that FineType can't read directly. You have two options:

  1. Export to CSV — use DuckDB to extract a sample, then profile with finetype profile
  2. Use the DuckDB extension — classify values directly inside SQL queries

Both approaches give you the same type labels. Choose whichever fits your workflow.

Option A: Export to CSV and profile

1. Sample and export

DuckDB reads Parquet natively. Extract a sample to CSV:

duckdb -c "COPY (SELECT * FROM 'data.parquet' LIMIT 1000) TO 'sample.csv' (HEADER)"
1000 rows exported.

The LIMIT 1000 keeps the export fast. FineType typically needs a few hundred rows to classify columns accurately — 1,000 is more than enough.

2. Profile the CSV

Run profile on the exported sample:

finetype profile -f sample.csv
Column        Type                            Confidence
────────────  ──────────────────────────────  ──────────
user_id       representation.numeric.integer  0.99
email         identity.contact.email          0.98
signup_date   datetime.date.iso_8601          0.97
country       geography.location.country      0.94
ip_address    technology.internet.ip_v4       0.99

You now know the semantic types in your Parquet file. From here you can generate a schema (finetype schema), create a typed table (finetype load), or simply use the profile as documentation.

3. Clean up

Remove the intermediate CSV when you're done:

rm sample.csv

Option B: Use the DuckDB extension

The FineType DuckDB extension classifies values directly inside SQL — no CSV export needed.

1. Install and load the extension

INSTALL finetype FROM community;
LOAD finetype;

2. Classify individual columns

Use the finetype() function on columns from your Parquet file:

SELECT finetype(email) AS detected_type, email
FROM 'data.parquet'
LIMIT 5;
┌────────────────────────┬───────────────────────┐
│     detected_type      │         email         │
│        varchar         │        varchar        │
├────────────────────────┼───────────────────────┤
│ identity.contact.email │ [email protected] │
│ identity.contact.email │ [email protected] │
│ identity.contact.email │ [email protected] │
│ identity.contact.email │ [email protected] │
│ identity.contact.email │ [email protected] │
└────────────────────────┴───────────────────────┘

3. Profile all columns with UNPIVOT

To classify every column at once, unpivot the table and pass each value through finetype():

WITH samples AS (
    SELECT * FROM 'data.parquet' LIMIT 100
),
unpivoted AS (
    UNPIVOT samples ON COLUMNS(*) INTO NAME column_name VALUE text_value
)
SELECT
    column_name,
    finetype(text_value) AS detected_type,
    count(*) AS n
FROM unpivoted
GROUP BY column_name, detected_type
ORDER BY column_name, n DESC;

This gives you a frequency table of detected types per column — the equivalent of finetype profile but running entirely inside DuckDB.

What you learned

  • FineType's CLI profiles CSV files; Parquet requires either a CSV export step or the DuckDB extension
  • DuckDB's COPY ... TO command extracts a sample from Parquet to CSV in one line
  • The DuckDB extension classifies values in-place — useful when you want to stay in SQL
  • Both paths produce the same FineType type labels
