Case study / 01

SchemaForge

An AI-powered data pipeline generator

Python • LLM • DBT • Data Engineering • ETL

The problem

Raw CSVs almost never arrive analysis-ready

Turning a pile of messy CSVs into a governed, production data pipeline is slow, manual, and repetitive. Someone has to infer the schema, decide on types, work out how tables relate, write the transformation models, document everything, and wire up the ETL. It is the unglamorous work that sits between raw data and any useful analysis, and it gets redone on every new dataset.

How it works

Point it at the data, review the output

Ingest. Feed SchemaForge one or more messy CSVs.
Infer. A language model reads the data to infer the schema, column types, and likely relationships between tables.
Generate. It produces DBT models, YAML configs, ER diagrams, and Python ETL code automatically.
Review. You stay in the loop: inspect the generated artifacts, adjust, and ship.
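As a rough illustration of the Ingest and Infer steps, the sketch below samples a CSV and guesses a type for each column. It is not SchemaForge's code: the function names (`sample_csv`, `guess_type`, `infer_schema`) are hypothetical, and a simple try-to-parse heuristic stands in for the language model so the example runs offline with only the standard library.

```python
import csv
import io


def sample_csv(text, max_rows=5):
    """Ingest: return the header and the first few data rows of a CSV string."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = [row for _, row in zip(range(max_rows), reader)]
    return header, rows


def guess_type(values):
    """Heuristic stand-in for the model: try integer, then float, else string."""
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return "string"
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            for v in non_empty:
                caster(v)
            return name
        except ValueError:
            continue
    return "string"


def infer_schema(header, rows):
    """Infer: map each column name to a guessed type from the sampled rows."""
    schema = {}
    for i, name in enumerate(header):
        values = [row[i] for row in rows if i < len(row)]
        schema[name.strip()] = guess_type(values)
    return schema


messy = "id,name,amount\n1,Ada,10.5\n2,Grace,\n"
header, rows = sample_csv(messy)
schema = infer_schema(header, rows)
print(schema)
```

In the real tool, the sampled header and rows would presumably be handed to the model rather than to `guess_type`; the Review step is where a human corrects whatever either approach gets wrong.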

Tech stack

What it is built on

A Python core orchestrates an LLM for schema inference and code generation. It targets the modern data stack, so the output drops into real workflows rather than staying in a notebook.
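To make "drops into real workflows" concrete, here is a minimal sketch of turning an inferred schema into a DBT-style `schema.yml`. The function name `render_dbt_schema`, the model name `stg_orders`, and the example columns are assumptions for illustration only; the layout SchemaForge actually emits may differ.

```python
def render_dbt_schema(model_name, columns):
    """Render a column-name -> type mapping as a minimal dbt schema.yml string."""
    lines = [
        "version: 2",
        "",
        "models:",
        f"  - name: {model_name}",
        "    columns:",
    ]
    for column, data_type in columns.items():
        lines.append(f"      - name: {column}")
        lines.append(f"        data_type: {data_type}")
    return "\n".join(lines) + "\n"


# Hypothetical inferred schema for one table.
inferred = {"id": "integer", "name": "string", "amount": "float"}
yaml_text = render_dbt_schema("stg_orders", inferred)
print(yaml_text)
```

Plain string formatting keeps the sketch dependency-free; a real generator would more likely use a YAML library so that quoting and escaping of unusual column names are handled correctly.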

Python • LLM • DBT • YAML • ER diagrams • ETL

Status

Open source and evolving

SchemaForge is open source on GitHub. It is one of several tools I build around a single idea: take the slow, manual glue work in data and AI workflows and let a model do the first 80 percent, with a human in the loop for the rest.

View on GitHub ↗ See other projects

Prudhvi Krovvidi • back to portfolio