MeridianMERIDIAN

Dovetail

Discovers how unfamiliar data loads and how tables relate — then compiles it to runnable SQL. Decides, never executes.

Dovetail is the modelling layer between raw files and a pipeline. Point it at an awkward file and it works out how to load it; point it at a loaded database and it works out how the tables relate. It emits runnable artifacts — a standalone .sql load, a Frictionless datapackage.json, foreign-key constraints — and never runs a pipeline itself.

Early Release

Dovetail is at v0.1.0 and under active development. Commands, output formats, and confidence thresholds may change between releases. There is no published binary yet — build from source.

Why it matters

Two questions eat the first hour with any new dataset.

How does this load? Is the file CSV, TSV, NDJSON, a JSON array, or records nested inside a field? What's the exact DuckDB read_* incantation? dovetail survey detects the format and row-level structure and prints the load — preferring a portable, dependency-free .sql and reporting how confident it was. Under-confident detections are surfaced for confirmation rather than emitted blind.

How do these tables relate? Once several tables are loaded, which columns actually reference which — and which "matches" are coincidence? dovetail relate scores every candidate foreign key, then verifies each one against the data. This is the load-bearing part: only a verified edge — zero orphan rows against a provably unique parent key — compiles into a constraint. A coincidental overlap never becomes a hard foreign key.

The boundary is deliberate. Dovetail decides; arcform and DuckDB execute. That keeps it a small, auditable compiler with no second pipeline engine inside it.

Installation

No release binaries yet — build from source with a recent Rust toolchain:

git clone https://github.com/meridian-online/dovetail
cd dovetail
cargo install --path crates/dovetail

Dovetail bundles DuckDB (statically linked), so relate's in-process discovery reads need no external DuckDB install. The .sql loads that survey emits are run with your own DuckDB. Note that dovetail currently builds against the sibling Finetype crates as a path dependency — check out the finetype repo alongside dovetail before building.

Commands

dovetail survey <paths...>              # detect format + structure, print the load SQL
dovetail relate <db> [--out <path>]     # discover + verify foreign keys in a DuckDB
dovetail jaq '<program>' <file>         # run a jq program through embedded, pinned jaq

A worked example: discover the foreign keys in a database

You've loaded a few tables into DuckDB. dovetail relate reads the database, scores every candidate foreign key on naming, value overlap, and how key-like the parent is, then verifies each against the data.

The database below is the repo's relate fixture (tests/relate-fixtures/build.sql) — four candidate relationships, one per outcome. Build it into a DuckDB and point relate at it:

$ dovetail relate warehouse.db
ACCEPT  orders.customer_id -> customers.id  (verified: 0 orphans, parent is unique, confidence 0.95 ≥ 0.60)
REVIEW  products.category_id -> categories.id  (plausible (parent not provably unique); surfaced for review — confidence 0.85)

dovetail relate: 1 accepted, 1 to review

-- Generated by dovetail relate — accepted foreign keys
ALTER TABLE "orders" ADD CONSTRAINT "fk_orders_customer_id" FOREIGN KEY ("customer_id") REFERENCES "customers" ("id");

Read that back:

  • orders.customer_id → customers.id auto-accepts: every child value resolves to a unique parent key, so referential integrity provably holds. No human sign-off — verification earns the "accepted".
  • products.category_id → categories.id is orphan-free and FK-shaped, but categories.id has a duplicate value, so it isn't provably unique. Plausible, not proven — surfaced for review, not compiled.
  • Two other candidates in the same database — an orphaned near-miss (shipments.order_id) and a coincidental boolean (widget_flags.active) — are rejected and not printed.

Only accepted edges compile to constraint DDL. That's the safety rail: no unverified guess becomes a hard constraint.

Add --out to fold the discovered foreignKeys — with their evidence, confidence, and status — into a Frictionless datapackage.json, dovetail's canonical manifest:

dovetail relate warehouse.db --out datapackage.json

Upstream: survey the file first

Before you can relate tables you have to load them. dovetail survey detects a file and prints the DuckDB load — the SQL-preferred rung of its fallback ladder:

$ dovetail survey customers.csv
customers.csv: rung=sql — Csv FlatTable detected at 95% confidence; loadable by DuckDB natively
-- Generated by dovetail survey
-- source: customers.csv
-- format: Csv  structure: FlatTable  rung: sql

CREATE OR REPLACE TABLE "customers" AS
SELECT * FROM read_csv('customers.csv', header = true, delim = ',');

Redirect that load to a file, run it with your own DuckDB, then point relate at the result — dovetail decides, DuckDB executes.

See also

Arcform runs the loads dovetail compiles · Finetype types the columns survey detects

On this page