Less Typing, More Discovery

By Hugh Cameron

TL;DR: FineType profiles your CSV and casts it into DuckDB with the right types — no manual casts, no format strings. One command to understand your columns, one command to load them.

The many CSVs on your computer hold many mysteries. Often, though, you're faced with solving the mundane before you unlock the magic.

Is that column of numbers a latitude or a postal code that's lost its leading zero? Is that text field a date, an address, or a free-form note? Is 4.6 a magnitude or a price? You open the file, squint at the first twenty rows, and start guessing. Sometimes those guesses hold. Sometimes they don't — and you find out three hours later when a join produces nonsense or a chart axis makes no sense.

That's three hours spent typing — casts, format strings, fixups — instead of discovering what the data actually has to say. We think the ratio should be the other way around.

Less typing

The tools available to analysts are changing fast. DuckDB has shown that serious analytical power can run on your laptop. The barrier to doing real analysis is lower than it's ever been.

But as the barrier drops, the foundations matter more, not less. When anyone can write a query, the question shifts from "can I write this?" to "can I trust this?" And trust starts with the most basic question of all: what type is this data, really?

This is what FineType does. It reads your data and infers the type of every column — not just VARCHAR or DOUBLE, but the semantic type: is that number a latitude, a magnitude, or a depth measurement? Is that string a timestamp, an email, or a category code?

Let's see what that looks like.

Profiling a real dataset

The USGS publishes a catalogue of every significant earthquake recorded worldwide — freely available, well-structured, and exactly the kind of data an analyst might pick up on any given morning. Here are the core measurements from the 2024 catalogue: when each quake struck, where, how deep, and how strong. Grab the sample and profile it:

curl -fsSL https://meridian.online/samples/earthquakes.csv -o earthquakes.csv
finetype profile -f earthquakes.csv
FineType Column Profile — "earthquakes.csv" (14132 rows, 9 columns)
════════════════════════════════════════════════════════════════════════════════

  COLUMN                    TYPE                                      BROAD   CONF
  ──────────────────────────────────────────────────────────────────────────────
  time                      datetime.timestamp.iso_8601_milliseconds TIMESTAMP 100.0%
  latitude                  geography.coordinate.latitude            DOUBLE 100.0% [coordinate_disambiguation]
  longitude                 geography.coordinate.longitude           DOUBLE 100.0% [coordinate_disambiguation]
  depth                     representation.numeric.decimal_number    DOUBLE  98.2% [coordinate_plausibility_gate:out_of_range=12/100]
  mag                       representation.numeric.decimal_number    DOUBLE 100.0%
  rms                       representation.numeric.decimal_number    DOUBLE 100.0%
  horizontalError           representation.numeric.decimal_number    DOUBLE 100.0%
  depthError                representation.numeric.decimal_number    DOUBLE  91.8%
  magError                  representation.numeric.decimal_number    DOUBLE  99.8%

9/9 columns typed, 14132 rows analyzed

In a few seconds, without writing a single query, you know something meaningful about every column.

The time column isn't just a string — it's an ISO 8601 timestamp with millisecond precision, and FineType is 100% confident. The real trick is latitude and longitude: four of these columns are plain DOUBLEs, but FineType distinguishes the two geographic coordinates from the magnitude, depth, and error measurements — and even works out which is latitude and which is longitude from their value ranges. None of this required you to know what rms means or how magnitude scales work. The profile gives you a map of the terrain before you start exploring — built from the data itself, not from assumptions.

Where the data comes from

The sample isn't ours to invent — it's pulled straight from the USGS public FDSN event feed, every magnitude-4+ event of 2024. That gives 22 columns; we kept the nine geophysical measurements (preserving the raw ISO timestamps) and hosted the result so the curl above is a one-liner. Here's exactly how we got it:

# 1. Pull the full 2024 catalogue from USGS (22 columns, 14,132 events)
curl -s "https://earthquake.usgs.gov/fdsnws/event/1/query?format=csv\
&starttime=2024-01-01&endtime=2024-12-31&minmagnitude=4\
&limit=20000&orderby=time" -o earthquakes_full.csv

# 2. Keep the nine measurement columns, leaving the timestamps untouched
duckdb -c "COPY (
  SELECT time, latitude, longitude, depth, mag, rms,
         horizontalError, depthError, magError
  FROM read_csv('earthquakes_full.csv', all_varchar = true)
) TO 'earthquakes.csv' (HEADER, QUOTE '')"

The USGS catalogue is revised as events are reviewed, so the live feed may drift a little over time — that's why we serve a frozen copy. We come back to those other 13 columns below.

More discovery

A profile is useful on its own, but the real payoff is what comes next. FineType can take what it's learned about your columns, export a schema, then validate and load the data straight into DuckDB — with the right types already applied:

finetype profile -f earthquakes.csv -o json-schema > quakes.schema.json
finetype validate earthquakes.csv quakes.schema.json --db quakes.db --table quakes
Validation Report
════════════════════════════════════════════════════════════
  Input:        earthquakes.csv
  Schema:       quakes.schema.json
  Output DB:    quakes.db
  Target table: quakes
  Scan ID:      1

  Total rows:         14132
  Valid rows:         14132
  Invalid rows:           0
  Rejects:                0
    SEMANTIC_TYPE:        0
    TRANSFORM_FAILED:     0
  Grade:             A
════════════════════════════════════════════════════════════

Every one of the 14,132 rows passed the schema and landed in a typed quakes table inside quakes.db. No CREATE TABLE, no CAST(time AS TIMESTAMP), no strptime format strings — FineType writes the SQL so you don't have to. Check the types it applied:

duckdb quakes.db -c "SELECT column_name, column_type FROM (DESCRIBE quakes)"
┌─────────────────┬─────────────┐
│   column_name   │ column_type │
├─────────────────┼─────────────┤
│ time            │ TIMESTAMP   │
│ latitude        │ DOUBLE      │
│ longitude       │ DOUBLE      │
│ depth           │ DOUBLE      │
│ mag             │ DOUBLE      │
│ rms             │ DOUBLE      │
│ horizontalError │ DOUBLE      │
│ depthError      │ DOUBLE      │
│ magError        │ DOUBLE      │
└─────────────────┴─────────────┘

time is a proper TIMESTAMP, so the questions you actually came to ask just work — no parsing, no fixups:

-- The five strongest quakes of the year
SELECT time, round(latitude, 2) AS lat, round(longitude, 2) AS lon, mag, depth
FROM quakes
ORDER BY mag DESC
LIMIT 5;
┌─────────────────────────┬────────┬────────┬────────┬─────────┐
│          time           │  lat   │  lon   │  mag   │  depth  │
├─────────────────────────┼────────┼────────┼────────┼─────────┤
│ 2024-01-01 07:10:09.476 │  37.49 │ 137.27 │    7.5 │    10.0 │
│ 2024-07-19 01:50:48.571 │ -23.08 │ -67.84 │    7.4 │ 127.291 │
│ 2024-04-02 23:58:12.173 │  23.84 │  121.6 │    7.4 │    40.0 │
│ 2024-12-17 01:47:25.741 │ -17.69 │ 168.08 │    7.3 │  54.372 │
│ 2024-06-28 05:36:36.902 │ -15.83 │ -74.45 │    7.2 │    24.0 │
└─────────────────────────┴────────┴────────┴────────┴─────────┘

The strongest was a magnitude 7.5 just after 7am on New Year's Day, off Japan's Noto Peninsula — date arithmetic, geographic filtering, and numeric aggregation all available the moment the data lands, because the types are right.

That's what we mean by less typing, more discovery. Less time wrestling with format strings and cast expressions. More time asking the questions you actually came to answer.

The harder frontier

We kept this example to the catalogue's measurement columns on purpose, because the full file is where the honest, interesting work lives. The USGS export has 22 columns, and alongside the clean geophysics it carries the awkward stuff: a two-letter network code like us, an event id like us6000pgkh, free-form place names like "80 km NW of Kandrian, Papua New Guinea". FineType reads the timestamps and coordinates beautifully — but those metadata columns are exactly the kind of ambiguous, half-documented fields it still trips on today.

That gap is the work we find most worthwhile. The promise of less typing, more discovery shouldn't hold only for the tidy columns; it should hold for the awkward, real, half-documented files that make up most of an analyst's day. Profiling and validating data like the full earthquake catalogue properly — every column, every time, with a quality pass you can actually trust — is a big goal. Closing that gap means lowering the barrier to insight for everyone, not just the people with the time and patience to wrangle their data by hand first. It's worth getting right, and it's where we're headed.

Try it yourself

Pick a CSV — one of yours, or the earthquake sample above — and try the pipeline:

curl -fsSL https://install.meridian.online/finetype | bash
curl -fsSL https://meridian.online/samples/earthquakes.csv -o earthquakes.csv
finetype profile -f earthquakes.csv
finetype profile -f earthquakes.csv -o json-schema > quakes.schema.json
finetype validate earthquakes.csv quakes.schema.json --db quakes.db --table quakes

You might be surprised what you learn. The Quick Start guide walks through installation and your first profile in a few minutes.

This is the first in a series of posts about building analysis on solid ground. Next, we'll look at the 240 semantic types FineType can detect — and why VARCHAR is not a type, it's a surrender.