SchemaForge
Raw CSVs almost never arrive analysis-ready
Turning a pile of messy CSVs into a governed, production data pipeline is slow, manual, and repetitive. Someone has to infer the schema, decide on types, work out how tables relate, write the transformation models, document everything, and wire up the ETL. It is the unglamorous work that sits between raw data and any useful analysis, and it gets redone on every new dataset.
Point it at the data, review the output
- Ingest. Feed SchemaForge one or more messy CSVs.
- Infer. A language model reads the data to infer the schema, column types, and likely relationships between tables.
- Generate. It produces dbt models, YAML configs, ER diagrams, and Python ETL code.
- Review. You stay in the loop: inspect the generated artifacts, adjust, and ship.
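To make the Infer step concrete, here is a minimal sketch of column type inference over a CSV. It is a rule-based stand-in, not how SchemaForge works internally: the project uses a language model for this step, and the names `guess_type` and `infer_schema` are hypothetical, chosen only for illustration.

```python
import csv
import io


def guess_type(values):
    """Guess a column type from raw string values.

    Rule-based stand-in for the LLM inference step (an assumption for
    illustration, not SchemaForge's actual logic).
    """
    # Ignore missing and blank cells so sparse columns can still be typed.
    cleaned = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not cleaned:
        return "unknown"
    # Try the narrowest type first, then widen.
    for type_name, caster in (("integer", int), ("float", float)):
        try:
            for v in cleaned:
                caster(v)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_schema(csv_text):
    """Return {column_name: guessed_type} for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            columns[name].append(row.get(name))
    return {name: guess_type(vals) for name, vals in columns.items()}


sample = "id,price,name\n1,9.5,apple\n2,,pear\n"
print(infer_schema(sample))
# {'id': 'integer', 'price': 'float', 'name': 'string'}
```

Rules like these break down on dates, currency strings, and mixed-format columns, which is the kind of ambiguity a model-driven inference step with human review is meant to handle.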
What it is built on
SchemaForge has a Python core that orchestrates an LLM for schema inference and code generation. It targets the modern data stack, so the output drops into real workflows rather than staying in a notebook.
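One way to picture that orchestration is a small pipeline function that takes the inference step as a pluggable callable and renders a dbt-style YAML file from whatever schema comes back. This is a sketch under assumptions: `run_pipeline`, `render_dbt_yaml`, and `fake_infer` are hypothetical names, and the real project's structure may differ. In practice the callable would wrap an LLM request; a trivial stub stands in for it here so the example runs offline.

```python
from typing import Callable, Dict

# Hypothetical contract: raw CSV text in, {column_name: type_name} out.
InferFn = Callable[[str], Dict[str, str]]


def render_dbt_yaml(model_name: str, schema: Dict[str, str]) -> str:
    """Render a minimal dbt-style properties file for one model."""
    lines = ["version: 2", "models:", f"  - name: {model_name}", "    columns:"]
    for column, data_type in schema.items():
        lines.append(f"      - name: {column}")
        lines.append(f"        data_type: {data_type}")
    return "\n".join(lines) + "\n"


def run_pipeline(model_name: str, csv_text: str, infer: InferFn) -> Dict[str, object]:
    """Infer a schema, then generate artifacts for a human to review."""
    schema = infer(csv_text)
    return {"schema": schema, "dbt_yaml": render_dbt_yaml(model_name, schema)}


def fake_infer(csv_text: str) -> Dict[str, str]:
    """Stub standing in for the LLM call: every header column is a string."""
    header = csv_text.splitlines()[0]
    return {name.strip(): "string" for name in header.split(",")}


result = run_pipeline("orders", "id,total\n1,20\n", fake_infer)
print(result["dbt_yaml"])
```

Passing the inference step in as a parameter keeps the generation code testable without a model, and leaves the generated YAML as plain text that can be inspected and edited before anything ships.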
Open source and evolving
SchemaForge is open source on GitHub. It is one of several tools I build around a single idea: take the slow, manual glue work in data and AI workflows and let a model do the first 80 percent, with a human in the loop for the rest.