Finetype

Semantic type detection for text data. 245 types, DuckDB integration, pure Rust.

Finetype classifies text into 245 semantic types — dates, emails, IP addresses, coordinates, financial identifiers, and more.

Early Release

Finetype is under active development. Expect breaking changes to taxonomy labels, CLI arguments, library APIs, and model formats between releases. Pin to a specific version if stability matters for your use case.

Why it matters

You download a dataset. DuckDB reads it instantly, but every text column is VARCHAR. Is that column of numbers a postal code, a year, or a price? Are those dates US or European format?

Finetype answers these questions:

$ finetype profile -f orders.csv

Column        Type                                   Confidence
────────────  ─────────────────────────────────────  ──────────
order_date    datetime.date.mdy_slash                      0.97
amount        representation.numeric.decimal_number        0.98
customer      identity.person.full_name                    0.93
country       geography.location.country                   0.95
ip_address    technology.internet.ip_v4                    0.99

Every type maps to a DuckDB SQL expression. Finetype says order_date is datetime.date.mdy_slash — that means strptime(order_date, '%m/%d/%Y') will succeed on every matching value. Profile first, then cast with confidence.
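To make the mapping concrete, here is a small Python sketch of what a clean `%m/%d/%Y` parse looks like. It uses Python's `datetime.strptime` as a stand-in for DuckDB's `strptime` (the three format codes used here mean the same thing in both). The sample values and the helper name `parses_as` are invented for illustration and are not Finetype output.

```python
from datetime import datetime

# Format string taken from the example above (datetime.date.mdy_slash).
MDY_SLASH = "%m/%d/%Y"


def parses_as(value: str, fmt: str) -> bool:
    """Return True if `value` parses cleanly under `fmt`."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


# Illustrative values only; a real column would come from orders.csv.
order_dates = ["01/02/2024", "12/31/2023", "07/04/2022"]

all_ok = all(parses_as(v, MDY_SLASH) for v in order_dates)
first = datetime.strptime(order_dates[0], MDY_SLASH).date()

print(all_ok)  # True: every value fits the month/day/year layout
print(first)   # 2024-01-02: the first field is read as the month
```

A day-first value such as `31/12/2023` would fail this parse, which is why knowing the detected type before casting is useful.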

Installation

# Shell installer (curl + bash)
curl -fsSL https://install.meridian.online/finetype | bash

# Homebrew
brew install meridian-online/tap/finetype

# Cargo (builds from source)
cargo install finetype-cli

# Windows (PowerShell)
irm https://install.meridian.online/finetype/win | iex

CLI Usage

# Classify a single value (the bundled model is column-aware — pass --mode column)
finetype infer -i "bc89:60a9:23b8:c1e9:3924:56de:3eb1:3b90" --mode column

# Classify a column of values from a file, JSON output
finetype infer -f column_values.txt --mode column --output json

# Profile a CSV file — detect column types
finetype profile -f data.csv

# Export a JSON Schema for the whole file
finetype profile -f data.csv -o json-schema > schema.json

# Export a JSON Schema for a single type (or glob)
finetype taxonomy "datetime.timestamp.*" -o json-schema

# Export a Frictionless Data Package descriptor (portable schema + provenance)
finetype profile -f data.csv -o datapackage > datapackage.json

# Validate data against a schema — and materialise a typed DuckDB table
finetype validate data.csv schema.json --db out.db --table orders

# Start the MCP server for AI agent integration
finetype mcp

# Show taxonomy (filter by domain, category, priority)
finetype taxonomy --domain datetime

Column-Mode Inference

Single-value classification can be ambiguous: is 01/02/2024 a US date (Jan 2) or EU date (Feb 1)? Is 1995 a year, postal code, or plain number?

Column-mode analyses the distribution of values in a column and applies disambiguation rules:

Date format: US vs EU slash dates, short vs long dates
Year detection: 4-digit integers that fall predominantly in the 1900–2100 range
Coordinate resolution: latitude vs longitude, based on value ranges
Numeric types: ports, increments, postal codes, street numbers
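The first two rules can be sketched in a few lines of Python. This is an illustration of the general idea of column-level disambiguation, not Finetype's actual implementation: the function names, the voting scheme, and the 0.9 threshold are all assumptions.

```python
import re

SLASH_DATE = re.compile(r"(\d{1,2})/(\d{1,2})/(\d{4})", re.ASCII)


def slash_date_order(values: list[str]) -> str:
    """Guess 'mdy', 'dmy', or 'ambiguous' for a column of slash dates."""
    mdy_votes = dmy_votes = 0
    for v in values:
        m = SLASH_DATE.fullmatch(v.strip())
        if m is None:
            continue  # not a slash date; ignore it
        first, second = int(m.group(1)), int(m.group(2))
        if first > 12 and 1 <= second <= 12:
            dmy_votes += 1  # first field cannot be a month
        elif second > 12 and 1 <= first <= 12:
            mdy_votes += 1  # second field cannot be a month
    if mdy_votes > dmy_votes:
        return "mdy"
    if dmy_votes > mdy_votes:
        return "dmy"
    return "ambiguous"


def looks_like_years(values: list[str], threshold: float = 0.9) -> bool:
    """True if most values are 4-digit integers in the 1900-2100 range.

    The 0.9 threshold is an assumed reading of "predominantly".
    """
    if not values:
        return False
    hits = sum(
        1
        for v in values
        if len(v) == 4 and v.isascii() and v.isdigit() and 1900 <= int(v) <= 2100
    )
    return hits / len(values) >= threshold


print(slash_date_order(["01/02/2024", "03/25/2024"]))  # mdy
print(slash_date_order(["01/02/2024", "25/03/2024"]))  # dmy
print(slash_date_order(["01/02/2024"]))                # ambiguous
print(looks_like_years(["1995", "2003", "2021"]))      # True
print(looks_like_years(["1995", "90210", "0042"]))     # False
```

A single value like `01/02/2024` stays ambiguous on its own; one neighbour with a field above 12 is enough to settle the whole column.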

# CLI column-mode
finetype infer -f column_values.txt --mode column

# CSV profiling (uses column-mode automatically)
finetype profile -f data.csv

Schema Export

Once you know your column types, export a JSON Schema contract — for the whole file, or for an individual type:

# Whole-file schema (add --stats for observed-data constraints)
$ finetype profile -f orders.csv -o json-schema > schema.json

# Single-type schema (or a glob of types)
$ finetype taxonomy "datetime.date.*" -o json-schema

Each schema carries the regex pattern, length constraints, examples, and the DuckDB cast expression — a machine-readable contract you can commit and validate against.
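As a rough picture of how such a contract can be checked, the Python sketch below applies a pattern and length constraints to individual values. The schema fragment is hypothetical: `pattern`, `minLength`, `maxLength`, and `examples` are standard JSON Schema keywords, but the specific regex and the `x-duckdb-cast` key are placeholders invented here, not the names Finetype emits.

```python
import re

# Hypothetical fragment for one column; values are illustrative only.
order_date_schema = {
    "type": "string",
    "pattern": r"^\d{2}/\d{2}/\d{4}$",
    "minLength": 10,
    "maxLength": 10,
    "examples": ["01/02/2024"],
    # Placeholder key name for the cast expression (assumption).
    "x-duckdb-cast": "strptime({col}, '%m/%d/%Y')",
}


def conforms(value: str, schema: dict) -> bool:
    """Check a string against the length and pattern keywords of `schema`."""
    if not isinstance(value, str):
        return False
    if len(value) < schema.get("minLength", 0):
        return False
    if "maxLength" in schema and len(value) > schema["maxLength"]:
        return False
    pattern = schema.get("pattern")
    # JSON Schema patterns are unanchored searches, so the regex itself
    # carries the ^ and $ anchors.
    return pattern is None or re.search(pattern, value) is not None


print(conforms("01/02/2024", order_date_schema))  # True
print(conforms("2024-01-02", order_date_schema))  # False
```

In practice a full JSON Schema validator would do this work; the point is only that the exported constraints are plain data that any tool can evaluate.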

Need to hand the dataset to someone outside Finetype? Export a Frictionless Data Package — the open, tool-agnostic standard for shipping a dataset with its schema:

# Conformant Frictionless v2.0 descriptor: schema + resource provenance
$ finetype profile -f orders.csv -o datapackage > datapackage.json

Every column maps to a standard Frictionless type/format that any Frictionless-aware tool can read, and the resource's path, bytes, and sha256 hash are captured for provenance. Finetype's semantic detail is carried as x-finetype-* properties (label, confidence, pii, enum-domain), so the package stays portable by default and lossless when you want the full detail. See the profile command for the full descriptor.
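The provenance fields make it possible for a recipient to confirm the data file is the one that was profiled. The Python sketch below shows one way to do that check. The descriptor is built by hand for the demo; the `sha256:<hex>` hash layout and the exact `x-finetype-label` spelling are assumptions about the output, and `verify_resources` is a hypothetical helper.

```python
import hashlib
import tempfile
from pathlib import Path


def verify_resources(descriptor: dict, base_dir: Path) -> list[str]:
    """Return a list of mismatches between the descriptor and files on disk."""
    problems = []
    for res in descriptor.get("resources", []):
        data = (base_dir / res["path"]).read_bytes()
        if "bytes" in res and res["bytes"] != len(data):
            problems.append(f"{res['path']}: size mismatch")
        if "hash" in res:
            # Accept either "sha256:<hex>" or a bare hex digest (assumption).
            expected = res["hash"].split(":", 1)[-1]
            if hashlib.sha256(data).hexdigest() != expected:
                problems.append(f"{res['path']}: sha256 mismatch")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    payload = b"order_date,amount\n01/02/2024,19.99\n"
    (base / "orders.csv").write_bytes(payload)

    # Hand-written stand-in for a generated datapackage.json.
    descriptor = {
        "resources": [{
            "name": "orders",
            "path": "orders.csv",
            "bytes": len(payload),
            "hash": "sha256:" + hashlib.sha256(payload).hexdigest(),
            "schema": {"fields": [{
                "name": "order_date",
                "type": "date",
                "format": "%m/%d/%Y",
                "x-finetype-label": "datetime.date.mdy_slash",
            }]},
        }]
    }

    clean = verify_resources(descriptor, base)

    # Modify the file after the descriptor was written.
    (base / "orders.csv").write_bytes(payload + b"02/03/2024,5.00\n")
    modified = verify_resources(descriptor, base)

print(clean)     # []
print(modified)  # size and sha256 mismatches for orders.csv
```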

Validate & Materialise

Pass the schema back to validate to gate your data. Run check-only for a quality signal, or add --db/--table to cast the valid rows into a typed DuckDB table in the same pass:

# Check-only — exit code 1 if any row is rejected
$ finetype validate orders.csv schema.json

# Validate AND materialise a typed table + reject sidecar
$ finetype validate orders.csv schema.json --db out.db --table orders

The typed orders table holds valid rows with per-column transforms applied; a finetype_reject_errors sidecar captures everything that failed validation or the typed cast. Exit codes: 0 no rejects, 1 rejects, 2 error. duckdb must be on your PATH when --db is used.
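The exit codes make the check-only mode easy to wire into a pipeline. A minimal Python wrapper might look like the sketch below. It uses only the `finetype validate <data> <schema>` invocation shown above; the function names are hypothetical, and `run_validate` needs `finetype` on your PATH, so it is defined here but not called.

```python
import subprocess

# Exit-code meanings as documented above.
EXIT_MEANINGS = {0: "no rejects", 1: "rejects", 2: "error"}


def describe_exit(code: int) -> str:
    """Translate a finetype validate exit code into a short label."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_validate(data_path: str, schema_path: str) -> str:
    """Run check-only validation and return the outcome label."""
    result = subprocess.run(["finetype", "validate", data_path, schema_path])
    return describe_exit(result.returncode)


print(describe_exit(0))  # no rejects
print(describe_exit(1))  # rejects
print(describe_exit(2))  # error
```

A CI job could call `run_validate("orders.csv", "schema.json")` and fail the build on anything other than `no rejects`.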

Performance

Accuracy

Evaluated on 21 real-world datasets (116 annotated columns):

Metric	Result
Label accuracy	97.4%
Domain accuracy	98.3%
Actionability	99.7% (DuckDB casts succeed on real data)

The inference pipeline uses a two-stage architecture — Sense (broad classification) followed by Sharpen (fine-grained disambiguation) — with column-mode distribution analysis for ambiguous types like dates and coordinates.

Latency & Throughput

Metric	Value
Model load	66 ms cold, 25–30 ms warm
Single inference	p50 = 26 ms, p95 = 41 ms
Batch throughput	600–750 values/sec
Memory footprint	8.5 MB peak RSS
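For capacity planning, the batch throughput figure translates into wall-clock time with simple division. The Python sketch below does that arithmetic for a hypothetical one-million-value batch. It assumes throughput scales linearly and that every value is classified; neither assumption is confirmed by the figures above.

```python
def batch_seconds(n_values: int, values_per_sec: float) -> float:
    """Estimated wall-clock seconds to classify n_values at a given rate."""
    return n_values / values_per_sec


n = 1_000_000  # hypothetical batch size
fast = batch_seconds(n, 750)  # upper end of the reported range
slow = batch_seconds(n, 600)  # lower end of the reported range

print(f"{fast / 60:.1f} to {slow / 60:.1f} minutes")  # 22.2 to 27.8 minutes
```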

Acknowledgements

QSV — High-performance CSV toolkit that inspired Finetype's approach to data profiling