How I Used AI Pair Programming to Build a Robust Evaluation System (Without Vibe Coding)


Before moving into product, I was a software engineer, and that background really helped me build Launchling - a product that helps non-technical founders get from idea to startup plan, fast. But after years in product, my Python knowledge was rusty (read: non-existent), and the idea of building a structured evaluation pipeline - spanning Airtable, synthetic test generation, prompt variations, scoring logic, and CI - felt overwhelming. Frankly, it would have taken me months.

But I didn’t want to “vibe code” (I loathe this term) my way through it.

Too many projects seem to break under the weight of brittle, AI-assisted glue code. They’re fast to write, look shiny, but are hard to maintain. I wanted Launchling’s evaluation system to be robust: easy to test and evolve, even as it grew in complexity.

AI Pair Programming

I used ChatGPT and Cursor extensively, but not as magic wands; I treated them as collaborators. They were, in effect, junior engineers who could work fast but needed structure, clarity, and constraints.

I brought engineering discipline: clear structure, acceptance criteria, good documentation, and a plan for test coverage and schema validation. In return, they helped me write months’ worth of code in a matter of days. But it wasn’t always smooth: I got part-way through before Cursor started falling over itself. This was a structured evaluation pipeline, not just a CRUD app, and Cursor began to falter once architectural design and multi-table logic came into play.

What I am building is bespoke, unlike a typical web app or CRUD backend, so there aren’t many prior examples for Cursor or ChatGPT to draw on. It was more architectural, more abstract, and more data-model-heavy than most AI-native workflows (or at least that’s what ChatGPT told me when I started questioning why Cursor was suddenly finding everything so hard).

This made me realise… I had to teach the system how to help me.

What I Put In Place (That Actually Worked)

I completely stopped work on the evaluation framework and instead spent a day putting things in place to make AI pair programming more effective.

1. Evaluation Goals + Acceptance Criteria

I wrote up:

  • A clear list of goals (why we’re evaluating plans at all)
  • Detailed acceptance criteria for what “working” looks like
  • A runbook for real-world workflows (dry runs, schema checks, synthetic generation)

→ This meant both Cursor and I could test not just whether something worked, but whether it was the right thing to do.

2. Schema Validation and Canonical Codes

We use Airtable as a backing store, but it’s really fragile: fields change, types shift, and enum values break silently.

So I:

  • Froze an Airtable schema as JSON
  • Added schema diffing into CI (to catch breaking Airtable changes)
  • Mapped all dropdown/multiselect fields to canonical codes like confidenceCode = high, so that UI text can evolve without breaking logic

→ Now Cursor can reason about inputs safely, and the whole pipeline is more deterministic.
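To make this concrete, here is a minimal sketch of the kind of canonical-code mapping and schema diff I mean. The labels, codes, and file paths are illustrative placeholders, not Launchling’s actual code.

```python
# Illustrative sketch only: labels, codes, and file paths are hypothetical.
import json

# UI labels (free to change in Airtable) map to stable canonical codes.
CONFIDENCE_CODES = {
    "I'm feeling confident": "high",
    "I'm not sure yet": "medium",
    "I have no idea where to start": "low",
}


def load_frozen_schema(path="schemas/airtable_schema.frozen.json") -> dict:
    """Load the schema snapshot committed to the repo."""
    with open(path) as f:
        return json.load(f)


def diff_schemas(frozen: dict, live: dict) -> list[str]:
    """Return breaking differences between the frozen and live schemas."""
    problems = []
    for table, fields in frozen.items():
        live_fields = live.get(table, {})
        for name, field_type in fields.items():
            if name not in live_fields:
                problems.append(f"{table}.{name}: missing in live schema")
            elif live_fields[name] != field_type:
                problems.append(
                    f"{table}.{name}: type changed {field_type} -> {live_fields[name]}"
                )
    return problems
```

The point is that downstream logic only ever sees the stable codes, so copy changes in Airtable can’t silently break scoring.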

3. A Fully Declarative Run Config

Instead of hardcoded values, every evaluation run reads from run_config.json:

  • Model (gpt-4, gpt-3.5)
  • Prompt version
  • Temperature
  • Field mappings
  • Dry run vs write mode

→ Makes tests reproducible, versioned, and easily comparable.
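As a rough illustration (the keys and values here are invented, not the real run_config.json), the loading side can be as simple as reading the file and failing fast on missing keys:

```python
# Hypothetical example of a declarative run config; the keys shown are illustrative.
import json

EXAMPLE_RUN_CONFIG = {
    "model": "gpt-4",
    "prompt_version": "v3.2",
    "temperature": 0.2,
    "field_mappings": {"confidence": "confidenceCode"},
    "dry_run": True,
}


def load_run_config(path="run_config.json") -> dict:
    """Read the run config so every evaluation run is reproducible and versioned."""
    with open(path) as f:
        config = json.load(f)
    # Fail fast on missing keys instead of silently falling back to hardcoded defaults.
    for key in ("model", "prompt_version", "temperature", "dry_run"):
        if key not in config:
            raise KeyError(f"run_config.json is missing required key: {key}")
    return config
```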

4. Proper Test Structure (With To-Do Plan)

I wrote:

  • TESTING_GUIDELINES.md (how to write meaningful, maintainable tests)
  • A TEST_TODO_PLAN.md with priority, required vs optional flags, and code-specific targets
  • Canonical test cases for real and synthetic data
  • Dummy Airtable schema extractors so we don’t hit the real API

→ We now have a growing suite of both unit and integration tests, with CI safeguards in place to flag regressions and schema drift; and Cursor is actively building out coverage using this structure.
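For illustration, a test against a dummy schema extractor might look something like the sketch below; the class, fixture, and field names are hypothetical, not Launchling’s actual test code.

```python
# Illustrative pytest sketch; the extractor, fixture, and field names are hypothetical.
import pytest


class DummySchemaExtractor:
    """Stands in for the real Airtable schema extractor so tests never hit the API."""

    def __init__(self, schema: dict):
        self._schema = schema

    def extract(self) -> dict:
        return self._schema


@pytest.fixture
def dummy_extractor():
    return DummySchemaExtractor(
        {"SyntheticTestCases": {"confidenceCode": "singleSelect", "budget": "number"}}
    )


def test_schema_contains_canonical_fields(dummy_extractor):
    schema = dummy_extractor.extract()
    assert "confidenceCode" in schema["SyntheticTestCases"]
```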

5. Synthetic Plan Variations

Launchling generates startup plans for non-technical founders and I wanted to test how prompt changes affect output quality.

So I:

  • Built a synthetic test generator that creates controlled test cases (e.g. same idea, but different budget/confidence/skills)
  • Logged these to a SyntheticTestCases table
  • Generated full plans via evaluate_plans.py, storing results in SyntheticTestResults
  • Tracked links across tables, so we can score and analyse variation impact

→ This lets us quantify whether a new prompt version helps founders with low confidence, or whether we’re overfitting to highly skilled users.
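A rough sketch of the generator idea, with an invented base idea and illustrative variation values rather than the real ones:

```python
# Hypothetical sketch of controlled synthetic test-case generation; the base idea
# and variation values are invented for illustration.
import itertools

BASE_IDEA = "A meal-planning app for busy parents"

# Vary one founder attribute per axis while holding the idea constant.
VARIATIONS = {
    "budget": ["none", "under_1k", "1k_to_10k"],
    "confidenceCode": ["low", "medium", "high"],
    "skills": ["non_technical", "some_design", "can_code"],
}


def generate_synthetic_cases():
    """Yield controlled test cases: same idea, different founder profile."""
    keys = list(VARIATIONS)
    for combo in itertools.product(*(VARIATIONS[k] for k in keys)):
        case = dict(zip(keys, combo))
        case["idea"] = BASE_IDEA
        yield case
```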

6. End-to-End Linking + Record Creation

Every evaluation run now creates and links:

  • EvaluationRuns (the context/config for the run)
  • PlanEvaluations (score + annotations)
  • SyntheticTestResults (the actual plans generated)
  • SyntheticTestCases (the controlled input)
  • All with correct linkage back to the prompt version and question set

→ This structure took time, but it’s what makes future scoring, analysis, and prompt optimisation possible.
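A simplified sketch of that linking step; the client interface, helper names, and field names here are placeholders, not the actual implementation.

```python
# Simplified sketch of per-run record creation and linking. The `airtable` client,
# helper names, and field names are placeholders, not the actual implementation.

def record_evaluation_run(airtable, run_config: dict, case_id: str, plan: str, score: dict):
    """Create one run's records and wire up the links between tables."""
    run = airtable.create("EvaluationRuns", {
        "promptVersion": run_config["prompt_version"],
        "model": run_config["model"],
    })
    result = airtable.create("SyntheticTestResults", {
        "plan": plan,
        "testCase": [case_id],           # link back to SyntheticTestCases
        "evaluationRun": [run["id"]],    # link back to EvaluationRuns
    })
    airtable.create("PlanEvaluations", {
        "score": score["total"],
        "annotations": score["notes"],
        "result": [result["id"]],        # link back to SyntheticTestResults
    })
```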

7. CI, Debug Utilities, and Schema Freezing

I added:

  • GitHub Actions CI on every PR/commit
  • Schema diffing to catch Airtable drift
  • Runbook-driven debugging tools
  • Automated token usage tracking (so I don’t burn money)
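The schema-drift guard in CI can be as simple as a script that exits non-zero when the frozen snapshot no longer matches a fresh export. A field-level diff (like the earlier sketch) gives friendlier errors; this minimal, illustrative version just shows the shape of it, and the paths are invented.

```python
# Hypothetical CI check: fail the build when the frozen schema snapshot no longer
# matches a freshly exported one. Paths are illustrative.
import json
import sys


def main(frozen_path="schemas/airtable_schema.frozen.json",
         live_path="schemas/airtable_schema.live.json"):
    with open(frozen_path) as f:
        frozen = json.load(f)
    with open(live_path) as f:
        live = json.load(f)

    if frozen != live:
        print("Schema drift detected: frozen snapshot no longer matches the live export.")
        sys.exit(1)  # a non-zero exit fails the GitHub Actions job
    print("Schema check passed.")


if __name__ == "__main__":
    main()
```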

What’s in the Docs?

To support long-term maintainability and help both human and AI contributors understand the system, I created a comprehensive internal documentation suite alongside the code. This includes everything from onboarding and system overviews to testing guidelines, versioning rules, and a full evaluation pipeline roadmap.

Inside /evaluation/:

  • README.md — Component-level overview
  • ONBOARDING.md — Step-by-step setup for contributors
  • EVALUATION_SYSTEM_OVERVIEW.md — Data flow, architecture, and linked tables
  • EVALUATION_PIPELINE_TODO.md — Work-in-progress logic and test roadmap
  • EVALUATION_GOALS.md — What success looks like and why it matters
  • TROUBLESHOOTING.md — Common issues and fixes
  • VERSIONING_GUIDE.md + VERSIONING_QUICK_REFERENCE.md — Prompt and schema version control

In project root:

  • TESTING_GUIDELINES.md — How tests should be structured and why
  • CANONICAL_CODES.md — Mapping UI labels to system logic
  • VERSION_TRACKING.md — All changes to prompt/question versions
  • CONTRIBUTING.md — AI and human collaboration conventions
  • Makefile — One-line commands to streamline evaluation, validation, and CI

From Des-Pairing to AI Pairing

Years ago, I wrote a tongue-in-cheek post about the joys and frustrations of pair programming called (Des)Pair Programming — the Marmite of Engineering.

Ironically, working with AI tools like ChatGPT and Cursor has brought me back to pairing… but this time, on my terms.

When pair programming with AI, you can pause, iterate, structure, or discard. You still get that fast feedback loop and collaboration, just without the interruptions, potential tension, or the pace mismatch.

Why This Matters

The result isn’t just a working evaluation system; it’s an AI-extendable platform, and putting these pieces in place will make extending it considerably easier and quicker. I’ve already seen Cursor write meaningful tests, find linkage bugs, and extend logic I scaffolded days earlier.

That only happens because the structure is solid, the schema is explicit, and the constraints are well-documented.

Lessons (If You’re Using AI Tools on Complex Projects)

  1. Don’t outsource the architecture.
     AI tools can write code, but they can’t design your system — yet.

  2. Treat ChatGPT like a senior, Cursor like a junior.
     ChatGPT helps you think. Cursor helps you move faster once you know what you want.

  3. Give your tools guardrails.
     The more structure, tests, and config you give them, the better they behave.

  4. It’s worth going slow to go fast.
     The upfront investment in schema design, testing strategy, and clarity has already paid back multiple times over.

If you’re building AI-native tools that aren’t just wrappers or frontends but involve real data, logic, and complexity, then you need structure. Give your AI tools that structure, and they can become genuinely powerful collaborators. I’d love to hear how others are structuring AI-assisted engineering work, especially when the domain is weird, custom, or evolving fast.

Try it now

Turn your idea into a tiny, tangible prototype - with a little help from Launchling.

👉 Try Launchling – no signup, no jargon, just a gentle push forward.