DuckDB Extension
Profile, validate, and classify text data directly inside DuckDB using the FineType community extension.
The FineType DuckDB extension brings semantic type detection into your SQL queries. Profile a table's columns, validate against a JSON Schema, classify individual values, and cast with confidence — all without leaving DuckDB.
Early Release
FineType is under active development. Expect breaking changes to taxonomy labels, function signatures, and model formats between releases. Pin to a specific version if stability matters for your use case.
Installation
Install from the DuckDB community extension registry:
INSTALL finetype FROM community;
LOAD finetype;
SELECT ft_version();
-- → 'finetype 0.6.23'The extension is built against the stable DuckDB C extension API, so a single signed artifact loads on DuckDB 1.2 through 1.5+. Model weights are embedded at compile time — no external files or network requests at runtime.
Function names
As of 0.6.23 the SQL functions are prefixed with ft_ (ft_profile, ft_validate, ft_infer, …). The previous un-prefixed names (finetype(), finetype_cast(), finetype_validate(), …) remain as aliases for one release to ease the transition, then will be removed. New code should use the ft_ names.
Profile a table
ft_profile(table) is a table macro: pass a catalog table name and it returns one row per column with the detected type, confidence, and the recommended DuckDB type to cast to.
CREATE TABLE contacts AS
SELECT * FROM read_csv('contacts.csv', all_varchar = true);
FROM ft_profile('contacts');┌─────────────┬─────────────────────────────────────┬────────────────────┬─────────────┐
│ column_name │ type │ confidence │ duckdb_type │
├─────────────┼─────────────────────────────────────┼────────────────────┼─────────────┤
│ amount │ finance.currency.amount │ 0.9960123300552368 │ VARCHAR │
│ created_at │ datetime.timestamp.iso_8601 │ 0.9695983529090881 │ TIMESTAMP │
│ email │ identity.person.email │ 1.0 │ VARCHAR │
│ id │ representation.identifier.increment │ 0.9105000495910645 │ BIGINT │
│ ip_address │ technology.internet.ip_v4 │ 1.0 │ INET │
│ name │ identity.person.full_name │ 0.9888034462928772 │ VARCHAR │
└─────────────┴─────────────────────────────────────┴────────────────────┴─────────────┘Because it returns a normal relation, you can filter and join it like any other table — for example, to list only the columns DuckDB should store as something other than VARCHAR:
SELECT column_name, type, duckdb_type
FROM ft_profile('contacts')
WHERE duckdb_type <> 'VARCHAR';Nested columns
ft_profile reads each column as text. STRUCT and LIST columns can't be classified directly — instead of guessing, FineType returns a message telling you how to flatten them first:
┌─────────────┬────────────────────────────────────────────────────────────────────────┬────────────┬─────────────┐
│ column_name │ type │ confidence │ duckdb_type │
├─────────────┼────────────────────────────────────────────────────────────────────────┼────────────┼─────────────┤
│ loc │ nested STRUCT(lat …, lon …) column — unnest / to_json / extract … │ NULL │ NULL │
│ tags │ nested VARCHAR[] column — unnest / to_json / extract before profiling │ NULL │ NULL │
└─────────────┴────────────────────────────────────────────────────────────────────────┴────────────┴─────────────┘Use unnest(), to_json(), or struct.field extraction to project the nested data into text columns, then profile those.
Validate against a schema
ft_validate(table, schema) checks every column against a FineType JSON Schema and returns the row count, reject count, and a sample failure message per column. The schema argument accepts an inline JSON literal, a getvariable() reference, or a file path.
FROM ft_validate('contacts', 'schema.json');┌─────────────┬───────┬─────────┬──────────────────────────────────────────────────────┐
│ column_name │ total │ rejects │ sample_message │
├─────────────┼───────┼─────────┼──────────────────────────────────────────────────────┤
│ amount │ 11 │ 1 │ "N/A" does not match "^[\$£¥₹₩₿\p{Sc}]?…" │
│ created_at │ 12 │ 0 │ NULL │
│ email │ 12 │ 1 │ "not-an-email" does not match "^[a-zA-Z0-9.…]+@…" │
│ id │ 12 │ 0 │ NULL │
│ ip_address │ 11 │ 1 │ "999.999.999.999" does not match "^(?:(?:25[0-5]…" │
│ name │ 12 │ 0 │ NULL │
└─────────────┴───────┴─────────┴──────────────────────────────────────────────────────┘A non-zero rejects count points you straight at the offending value and the constraint it broke.
Scalar functions
For ad-hoc, value-level work the scalar functions operate on a single string.
ft_infer() — classify a value
SELECT ft_infer('192.168.1.10');
-- → 'technology.internet.ip_v4'
SELECT ft_infer('[email protected]');
-- → 'identity.person.email'ft_detail() — classification with confidence
Returns a JSON object with the type, confidence, recommended DuckDB type, and the votes behind the decision.
SELECT ft_detail('192.168.1.10');
-- → {"type": "technology.internet.ip_v4", "confidence": 0.795,
-- "duckdb_type": "INET", "samples": 1, "disambiguation": "multi-branch",
-- "votes": {"technology.internet.ip_v4": 0.795}}ft_validate_text() — check one value against a constraint
Returns a struct of valid, the constraint that was checked, and a message when it fails.
SELECT ft_validate_text('not-an-email', '{"pattern":"^[^@]+@[^@]+\\.[^@]+$"}');
-- → {'valid': false, 'constraint': pattern,
-- 'message': '"not-an-email" does not match "^[^@]+@[^@]+\\.[^@]+$"'}ft_cast() and ft_unpack()
ft_cast(value) normalises a value into a form DuckDB will accept for a TRY_CAST. ft_unpack(value) returns the full inference record (type, confidence, DuckDB type, and the normalised value) as JSON:
SELECT ft_unpack('149.99');
-- → {"confidence":0.98,"duckdb_type":"DOUBLE",
-- "type":"representation.numeric.decimal_number","value":"149.99"}See also
- FineType overview — CLI usage, installation, and profiling
- DuckDB Pipeline guide — profile → schema → validated table
- Type Registry — browse all 240 semantic types