FineType
Semantic type detection for text data. 240 types, DuckDB integration, pure Rust.
FineType classifies text into 240 semantic types — dates, emails, IP addresses, coordinates, financial identifiers, and more.
Early Release
FineType is under active development. Expect breaking changes to taxonomy labels, CLI arguments, library APIs, and model formats between releases. Pin to a specific version if stability matters for your use case.
Why it matters
You download a dataset. DuckDB reads it instantly, but every text column is VARCHAR. Is that column of numbers a postal code, a year, or a price? Are those dates US or European format?
FineType answers these questions:
$ finetype profile -f orders.csv
Column Type Confidence
──────────── ───────────────────────────── ──────────
order_date datetime.date.mdy_slash 0.97
amount representation.numeric.decimal_number 0.98
customer identity.person.full_name 0.93
country geography.location.country 0.95
ip_address technology.internet.ip_v4 0.99
Every type maps to a DuckDB SQL expression. FineType says order_date is datetime.date.mdy_slash, which means strptime(order_date, '%m/%d/%Y') will succeed on every matching value. Profile first, then cast with confidence.
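To illustrate what "will succeed on every matching value" means, here is a minimal Python sketch that applies the same format string to a handful of values. It uses Python's standard `datetime.strptime` rather than DuckDB; for `%m/%d/%Y` both read the directives as month, day, and four-digit year. The function name `all_parse` and the sample values are hypothetical and not part of FineType.

```python
from datetime import datetime

def all_parse(values, fmt="%m/%d/%Y"):
    """Return True if every value parses under fmt, mirroring a cast that never fails."""
    for value in values:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            return False
    return True

# Hypothetical sample resembling an order_date column labelled mdy_slash.
order_dates = ["01/15/2024", "12/31/2023", "07/04/2022"]
print(all_parse(order_dates))     # every value is a valid month/day/year
print(all_parse(["31/12/2023"]))  # 31 is not a valid month, so parsing fails
```

A column that passes this kind of check can be cast in one step; a column that fails it needs a different format string or a cleanup pass first.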
Installation
# Shell installer
curl -fsSL https://install.meridian.online/finetype | bash
# Homebrew
brew install meridian-online/tap/finetype
# Cargo
cargo install finetype-cli
# PowerShell (Windows)
irm https://install.meridian.online/finetype/win | iex
CLI Usage
# Classify a single value (the bundled model is column-aware — pass --mode column)
finetype infer -i "bc89:60a9:23b8:c1e9:3924:56de:3eb1:3b90" --mode column
# Classify a column of values from a file, JSON output
finetype infer -f column_values.txt --mode column --output json
# Profile a CSV file — detect column types
finetype profile -f data.csv
# Export a JSON Schema for the whole file
finetype profile -f data.csv -o json-schema > schema.json
# Export a JSON Schema for a single type (or glob)
finetype taxonomy "datetime.timestamp.*" -o json-schema
# Validate data against a schema — and materialise a typed DuckDB table
finetype validate data.csv schema.json --db out.db --table orders
# Start the MCP server for AI agent integration
finetype mcp
# Show taxonomy (filter by domain, category, priority)
finetype taxonomy --domain datetime
Column-Mode Inference
Single-value classification can be ambiguous: is 01/02/2024 a US date (Jan 2) or EU date (Feb 1)? Is 1995 a year, postal code, or plain number?
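The ambiguity is easy to reproduce: a single value such as 01/02/2024 parses cleanly under both conventions, so no per-value check can settle the question. The sketch below uses Python's `datetime.strptime` only to demonstrate this; it is not how FineType classifies values.

```python
from datetime import datetime

value = "01/02/2024"

# Both parses succeed, but they denote different calendar days.
as_us = datetime.strptime(value, "%m/%d/%Y").date()  # month first
as_eu = datetime.strptime(value, "%d/%m/%Y").date()  # day first

print(as_us.isoformat(), as_eu.isoformat())
```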
Column-mode analyses the distribution of values in a column and applies disambiguation rules:
- Date format — US vs EU slash dates, short vs long dates
- Year detection — 4-digit integers predominantly in 1900–2100 range
- Coordinate resolution — latitude vs longitude based on value ranges
- Numeric types — ports, increments, postal codes, street numbers
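As a rough illustration of the first two rules, the sketch below looks at a whole column instead of one value. The heuristics, the 0.9 threshold, and the function names `guess_slash_date_order` and `looks_like_years` are assumptions made for this example; FineType's actual rules may differ.

```python
def guess_slash_date_order(values):
    """Guess 'mdy', 'dmy', or 'ambiguous' for a column of slash-separated dates.

    Assumes each value has exactly three numeric fields. A first field above 12
    cannot be a month, so it implies day-first; a second field above 12 implies
    month-first.
    """
    first_gt_12 = second_gt_12 = False
    for value in values:
        first, second, _year = value.split("/")
        first_gt_12 = first_gt_12 or int(first) > 12
        second_gt_12 = second_gt_12 or int(second) > 12
    if second_gt_12 and not first_gt_12:
        return "mdy"
    if first_gt_12 and not second_gt_12:
        return "dmy"
    return "ambiguous"


def looks_like_years(values, threshold=0.9):
    """True if at least `threshold` of values are 4-digit integers in 1900-2100."""
    if not values:
        return False
    hits = sum(
        1 for v in values
        if len(v) == 4 and v.isdigit() and 1900 <= int(v) <= 2100
    )
    return hits / len(values) >= threshold


print(guess_slash_date_order(["01/02/2024", "03/25/2024"]))  # 25 can only be a day
print(guess_slash_date_order(["01/02/2024", "02/03/2024"]))  # nothing rules either out
print(looks_like_years(["1995", "2003", "2021", "1987"]))
print(looks_like_years(["1995", "0042", "9021", "3100"]))
```

The point of the sketch is that one unambiguous value (such as 03/25/2024) resolves every ambiguous value in the same column, which is why a column-level view helps.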
# CLI column-mode
finetype infer -f column_values.txt --mode column
# CSV profiling (uses column-mode automatically)
finetype profile -f data.csv
Schema Export
Once you know your column types, export a JSON Schema contract — for the whole file, or for an individual type:
# Whole-file schema (add --stats for observed-data constraints)
$ finetype profile -f orders.csv -o json-schema > schema.json
# Single-type schema (or a glob of types)
$ finetype taxonomy "datetime.date.*" -o json-schema
Each schema carries the regex pattern, length constraints, examples, and the DuckDB cast expression: a machine-readable contract you can commit and validate against.
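To show how such a contract might be consumed, here is a small Python sketch that checks values against a regex pattern and length bounds. `pattern`, `minLength`, `maxLength`, and `examples` are standard JSON Schema keywords; the schema contents below, the `x-duckdb-cast` key, and the `conforms` helper are hypothetical stand-ins, since the exact shape of FineType's output is not shown here.

```python
import re

# Hypothetical schema fragment for one column; not actual FineType output.
column_schema = {
    "type": "string",
    "pattern": r"^\d{2}/\d{2}/\d{4}$",
    "minLength": 10,
    "maxLength": 10,
    "examples": ["01/15/2024"],
    "x-duckdb-cast": "strptime({col}, '%m/%d/%Y')",  # assumed key name
}

def conforms(value, schema):
    """Check a string against the schema's length bounds and regex pattern."""
    min_len = schema.get("minLength", 0)
    max_len = schema.get("maxLength", len(value))
    if not min_len <= len(value) <= max_len:
        return False
    # JSON Schema patterns are unanchored searches; this pattern anchors itself.
    return re.search(schema["pattern"], value) is not None

print([conforms(v, column_schema) for v in ["01/15/2024", "1/15/2024", "2024-01-15"]])
```

For real use, a full JSON Schema validator would be a better fit than hand-rolled checks; the sketch only shows which pieces of the contract do the work.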
Validate & Materialise
Pass the schema back to validate to gate your data. Run check-only for a quality signal, or add --db/--table to cast the valid rows into a typed DuckDB table in the same pass:
# Check-only — exit code 1 if any row is rejected
$ finetype validate orders.csv schema.json
# Validate AND materialise a typed table + reject sidecar
$ finetype validate orders.csv schema.json --db out.db --table orders
The typed orders table holds valid rows with per-column transforms applied; a finetype_reject_errors sidecar captures everything that failed validation or the typed cast. Exit codes: 0 (no rejects), 1 (rejects found), 2 (error). The duckdb binary must be on your PATH when --db is used.
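In a script or CI job, the exit codes can be turned into a gate. The following Python sketch wraps the check-only command shown above with `subprocess.run`; the helper names `describe_exit` and `gate` are hypothetical, and `gate` assumes `finetype` is on your PATH.

```python
import subprocess

# Meanings taken from the exit codes listed above.
EXIT_MEANINGS = {0: "no rejects", 1: "rejects found", 2: "error"}

def describe_exit(code):
    """Map a finetype validate exit code to a short description."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")

def gate(data_path, schema_path):
    """Run check-only validation and return True only when nothing was rejected."""
    result = subprocess.run(["finetype", "validate", data_path, schema_path])
    print(f"finetype validate: {describe_exit(result.returncode)}")
    return result.returncode == 0

# The mapping itself can be exercised without finetype installed.
print(describe_exit(1))
```

Calling `gate("orders.csv", "schema.json")` in a pipeline step and failing the step when it returns False keeps rejected rows from reaching downstream jobs.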
Performance
Accuracy
Evaluated on 21 real-world datasets (116 annotated columns):
| Metric | Result |
|---|---|
| Label accuracy | 97.4% |
| Domain accuracy | 98.3% |
| Actionability | 99.7% (DuckDB casts succeed on real data) |
The inference pipeline uses a two-stage architecture — Sense (broad classification) followed by Sharpen (fine-grained disambiguation) — with column-mode distribution analysis for ambiguous types like dates and coordinates.
Latency & Throughput
| Metric | Value |
|---|---|
| Model load | 66 ms cold, 25–30 ms warm |
| Single inference | p50 = 26 ms, p95 = 41 ms |
| Batch throughput | 600–750 values/sec |
| Memory footprint | 8.5 MB peak RSS |
Acknowledgements
- QSV — High-performance CSV toolkit that inspired FineType's approach to data profiling