profile

Scan a CSV file and detect the semantic type of every column using column-mode inference. The profile command is the fastest way to understand what your data contains, so run it before writing any queries.

Usage

finetype profile [OPTIONS] --file <FILE>

Options

Flag	Type	Default	Description
`-f`, `--file`	path	—	Input CSV file (single-file mode). Mutually exclusive with `--files`.
`--files`	path	—	File listing input paths, one per line (batch mode). Requires `--out-dir`.
`--out-dir`	path	—	Output directory for batch mode. One output per input is written as `<out_dir>/<stem>.<ext>`.
`-o`, `--output`	string	`plain`	Output format: `plain`, `json`, `csv`, `markdown`, `arrow`, `json-schema`
`--sample-size`	integer	`100`	Maximum values to sample per column
`--delimiter`	character	auto-detect	CSV delimiter character
`--no-header-hint`	flag	—	Disable column name header hints
`--enum-threshold`	integer	`32`	Cardinality threshold for ENUM columns (0 disables ENUM, shows VARCHAR)
`--stats`	flag	—	Attach observed-data constraints to JSON Schema output (`minLength`/`maxLength`, `minimum`/`maximum`, `enum`, `x-finetype-null-rate`, `x-finetype-cardinality`). Requires `-o json-schema`.
`-v`, `--verbose`	flag	—	Show additional detail and enable pipeline tracing
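
Batch mode needs two things: a list file with one input path per line, and an output directory. The Python sketch below prepares both using only the flags from the table above. The helper names (`write_input_list`, `batch_command`, `expected_output`) are hypothetical. The extension is supplied by the caller because the mapping from output format to `<ext>` is not specified here, and combining `-o` with `--files` is assumed to be allowed.

```python
from pathlib import Path


def write_input_list(paths, list_file):
    """Write one input path per line, the format --files expects."""
    Path(list_file).write_text("".join(f"{p}\n" for p in paths), encoding="utf-8")


def batch_command(list_file, out_dir, output="json"):
    """Build the argv for a batch run (assumes -o can be combined with --files)."""
    return ["finetype", "profile", "--files", str(list_file),
            "--out-dir", str(out_dir), "-o", output]


def expected_output(out_dir, input_path, ext):
    """Mirror the documented <out_dir>/<stem>.<ext> naming rule."""
    return Path(out_dir) / f"{Path(input_path).stem}.{ext}"
```

To run the batch, the argv list can be handed to `subprocess.run(batch_command(...), check=True)`.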

Examples

Profile a CSV file

$ finetype profile -f contacts.csv
FineType Column Profile — "contacts.csv" (12 rows, 6 columns)
════════════════════════════════════════════════════════════════════════════════

  COLUMN                    TYPE                                      BROAD   CONF
  ──────────────────────────────────────────────────────────────────────────────
  id                        representation.identifier.increment      BIGINT  97.6% [numeric_sequential_detection]
  name                      identity.person.full_name               VARCHAR  98.2%
  email                     identity.person.email                   VARCHAR 100.0%
  created_at                datetime.timestamp.iso_8601            TIMESTAMP  99.1%
  ip_address                technology.internet.ip_v4               VARCHAR 100.0% [ipv4_detection]
  amount                    finance.currency.amount                 DECIMAL  99.9% [header_hint_cross_domain:amount]

6/6 columns typed, 12 rows analyzed

The bracketed tokens are sense hints: each one names the detection strategy that settled its column, and the same name appears as disambiguation_rule in JSON output. Here, numeric_sequential_detection recognised the running id, ipv4_detection matched the address pattern, and header_hint_cross_domain:amount used the column header to land on a currency amount. Columns without a token were typed without a hint.

Profile with JSON output

$ finetype profile -f contacts.csv -o json

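JSON output is an object with a columns array. Each entry carries the semantic type, the broad_type (DuckDB storage type), the confidence (a fraction between 0 and 1, shown as a percentage in plain output), null and non-null counts, the number of samples used, and the transform expression used to cast the column. The example below shows only the entry for the id column: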

{
  "columns": [
    {
      "broad_type": "BIGINT",
      "column": "id",
      "confidence": 0.9756258726119995,
      "disambiguation_applied": true,
      "disambiguation_rule": "numeric_sequential_detection",
      "is_generic": true,
      "non_null": 12,
      "null": 0,
      "samples_used": 12,
      "transform": "CAST({col} AS BIGINT)",
      "type": "representation.identifier.increment"
    }
  ]
}

Pipe to jq to pull out just the types:

$ finetype profile -f contacts.csv -o json | jq '.columns[].type'
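
If jq is not available, the same extraction takes a few lines of Python. The sketch below embeds the sample entry shown above instead of calling finetype, so the input is illustrative only and the helper name `summarise` is hypothetical. In practice you would read the command's standard output.

```python
import json

# The single-column sample from above, trimmed to the fields used here.
SAMPLE = """
{
  "columns": [
    {
      "broad_type": "BIGINT",
      "column": "id",
      "confidence": 0.9756258726119995,
      "type": "representation.identifier.increment"
    }
  ]
}
"""


def summarise(profile_json):
    """Return (column, type, broad_type, confidence-as-percent) tuples."""
    profile = json.loads(profile_json)
    return [
        (c["column"], c["type"], c["broad_type"], f"{c['confidence']:.1%}")
        for c in profile["columns"]
    ]


for row in summarise(SAMPLE):
    print(*row, sep="\t")
```

Formatting the confidence with `.1%` reproduces the percentage shown in plain output (97.6% for the id column).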

Export a JSON Schema for the whole file

$ finetype profile -f contacts.csv -o json-schema > schema.json

This emits a machine-readable JSON Schema describing every column — the contract you pass to validate. Add --stats to attach observed-data constraints (length/range bounds, enum values, null rate, cardinality):

$ finetype profile -f contacts.csv -o json-schema --stats > schema.json

How it works

1. Sample: reads up to --sample-size values from each column (default: 100).
2. Classify: runs column-mode infer on each sample, using column names as header hints (unless --no-header-hint is set).
3. Report: outputs the detected type, broad DuckDB type, and confidence for every column.
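
The Sample step can be pictured with a short Python sketch. This illustrates per-column sampling under stated assumptions and is not FineType's implementation: it takes the first non-empty values in file order (the actual sampling strategy is not described here), and `sample_columns` is a hypothetical name.

```python
import csv
import io


def sample_columns(csv_text, sample_size=100, delimiter=","):
    """Collect up to sample_size non-empty values per column, keyed by header."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    samples = {name: [] for name in reader.fieldnames or []}
    for row in reader:
        for name, value in row.items():
            # Skip empty cells, overflow fields, and columns already full.
            if name in samples and value and len(samples[name]) < sample_size:
                samples[name].append(value)
    return samples


data = "id,email\n1,a@example.com\n2,\n3,c@example.com\n"
print(sample_columns(data, sample_size=2))
```

With a cap of 2, the id column stops after its first two values, while the email column skips the empty cell and still fills both slots.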

The broad type column (BIGINT, VARCHAR, TIMESTAMP, DECIMAL) tells you which DuckDB type each column can safely be cast to. To go further, export the schema with -o json-schema and validate the data against it; passing --db/--table to validate materialises a typed DuckDB table in the same pass.
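
One way to use the transform expressions from the JSON output is to assemble a typed SELECT by hand. The sketch below assumes `{col}` is a placeholder for the column name, as the sample output suggests, and substitutes a double-quoted identifier. The names `quote_ident` and `typed_select` and the table name `raw_contacts` are hypothetical.

```python
PLACEHOLDER = "{col}"  # assumed meaning: the column reference to substitute


def quote_ident(name):
    """Double-quote an SQL identifier, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def typed_select(columns, table):
    """Build a SELECT that applies each column's transform expression."""
    exprs = []
    for c in columns:
        ident = quote_ident(c["column"])
        cast = c["transform"].replace(PLACEHOLDER, ident)
        exprs.append(cast + " AS " + ident)
    return "SELECT " + ", ".join(exprs) + " FROM " + quote_ident(table)


columns = [{"column": "id", "transform": "CAST({col} AS BIGINT)"}]
print(typed_select(columns, "raw_contacts"))
# prints: SELECT CAST("id" AS BIGINT) AS "id" FROM "raw_contacts"
```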