MeridianMERIDIAN

DuckDB Extension

Profile, validate, and classify text data directly inside DuckDB using the FineType community extension.

The FineType DuckDB extension brings semantic type detection into your SQL queries. Profile a table's columns, validate against a JSON Schema, classify individual values, and cast with confidence — all without leaving DuckDB.

Early Release

FineType is under active development. Expect breaking changes to taxonomy labels, function signatures, and model formats between releases. Pin to a specific version if stability matters for your use case.

Installation

Install from the DuckDB community extension registry:

INSTALL finetype FROM community;
LOAD finetype;

SELECT ft_version();
-- → 'finetype 0.6.23'

The extension is built against the stable DuckDB C extension API, so a single signed artifact loads on DuckDB 1.2 through 1.5+. Model weights are embedded at compile time — no external files or network requests at runtime.

Function names

As of 0.6.23 the SQL functions are prefixed with ft_ (ft_profile, ft_validate, ft_infer, …). The previous un-prefixed names (finetype(), finetype_cast(), finetype_validate(), …) remain as aliases for one release to ease the transition, then will be removed. New code should use the ft_ names.

Profile a table

ft_profile(table) is a table macro: pass a catalog table name and it returns one row per column with the detected type, confidence, and the recommended DuckDB type to cast to.

CREATE TABLE contacts AS
  SELECT * FROM read_csv('contacts.csv', all_varchar = true);

FROM ft_profile('contacts');
┌─────────────┬─────────────────────────────────────┬────────────────────┬─────────────┐
│ column_name │                type                 │     confidence     │ duckdb_type │
├─────────────┼─────────────────────────────────────┼────────────────────┼─────────────┤
│ amount      │ finance.currency.amount             │ 0.9960123300552368 │ VARCHAR     │
│ created_at  │ datetime.timestamp.iso_8601         │ 0.9695983529090881 │ TIMESTAMP   │
│ email       │ identity.person.email               │                1.0 │ VARCHAR     │
│ id          │ representation.identifier.increment │ 0.9105000495910645 │ BIGINT      │
│ ip_address  │ technology.internet.ip_v4           │                1.0 │ INET        │
│ name        │ identity.person.full_name           │ 0.9888034462928772 │ VARCHAR     │
└─────────────┴─────────────────────────────────────┴────────────────────┴─────────────┘

Because it returns a normal relation, you can filter and join it like any other table — for example, to list only the columns DuckDB should store as something other than VARCHAR:

SELECT column_name, type, duckdb_type
FROM ft_profile('contacts')
WHERE duckdb_type <> 'VARCHAR';

Nested columns

ft_profile reads each column as text. STRUCT and LIST columns can't be classified directly — instead of guessing, FineType returns a message telling you how to flatten them first:

┌─────────────┬────────────────────────────────────────────────────────────────────────┬────────────┬─────────────┐
│ column_name │                                  type                                  │ confidence │ duckdb_type │
├─────────────┼────────────────────────────────────────────────────────────────────────┼────────────┼─────────────┤
│ loc         │ nested STRUCT(lat …, lon …) column — unnest / to_json / extract …       │       NULL │ NULL        │
│ tags        │ nested VARCHAR[] column — unnest / to_json / extract before profiling   │       NULL │ NULL        │
└─────────────┴────────────────────────────────────────────────────────────────────────┴────────────┴─────────────┘

Use unnest(), to_json(), or struct.field extraction to project the nested data into text columns, then profile those.

Validate against a schema

ft_validate(table, schema) checks every column against a FineType JSON Schema and returns the row count, reject count, and a sample failure message per column. The schema argument accepts an inline JSON literal, a getvariable() reference, or a file path.

FROM ft_validate('contacts', 'schema.json');
┌─────────────┬───────┬─────────┬──────────────────────────────────────────────────────┐
│ column_name │ total │ rejects │                   sample_message                     │
├─────────────┼───────┼─────────┼──────────────────────────────────────────────────────┤
│ amount      │    11 │       1 │ "N/A" does not match "^[\$£¥₹₩₿\p{Sc}]?…"            │
│ created_at  │    12 │       0 │ NULL                                                 │
│ email       │    12 │       1 │ "not-an-email" does not match "^[a-zA-Z0-9.…]+@…"    │
│ id          │    12 │       0 │ NULL                                                 │
│ ip_address  │    11 │       1 │ "999.999.999.999" does not match "^(?:(?:25[0-5]…"   │
│ name        │    12 │       0 │ NULL                                                 │
└─────────────┴───────┴─────────┴──────────────────────────────────────────────────────┘

A non-zero rejects count points you straight at the offending value and the constraint it broke.

Scalar functions

For ad-hoc, value-level work the scalar functions operate on a single string.

ft_infer() — classify a value

SELECT ft_infer('192.168.1.10');
-- → 'technology.internet.ip_v4'

SELECT ft_infer('[email protected]');
-- → 'identity.person.email'

ft_detail() — classification with confidence

Returns a JSON object with the type, confidence, recommended DuckDB type, and the votes behind the decision.

SELECT ft_detail('192.168.1.10');
-- → {"type": "technology.internet.ip_v4", "confidence": 0.795,
--    "duckdb_type": "INET", "samples": 1, "disambiguation": "multi-branch",
--    "votes": {"technology.internet.ip_v4": 0.795}}

ft_validate_text() — check one value against a constraint

Returns a struct of valid, the constraint that was checked, and a message when it fails.

SELECT ft_validate_text('not-an-email', '{"pattern":"^[^@]+@[^@]+\\.[^@]+$"}');
-- → {'valid': false, 'constraint': pattern,
--    'message': '"not-an-email" does not match "^[^@]+@[^@]+\\.[^@]+$"'}

ft_cast() and ft_unpack()

ft_cast(value) normalises a value into a form DuckDB will accept for a TRY_CAST. ft_unpack(value) returns the full inference record (type, confidence, DuckDB type, and the normalised value) as JSON:

SELECT ft_unpack('149.99');
-- → {"confidence":0.98,"duckdb_type":"DOUBLE",
--    "type":"representation.numeric.decimal_number","value":"149.99"}

See also

On this page