Benchmarks

Which AI is actually good at construction?

AI is being sold hard to construction right now, and there is very little evidence behind the noise. ContractorOS benchmarks put frontier models through real construction tasks and score them against validated answer keys, so you can see what to trust AI with and what still needs a human.

Test material

Real drawing sets

Architectural and structural sets from real projects, spanning trades and formats. Not textbook examples. The same messy PDFs contractors deal with every day.

Ground truth

Validated answer keys

Quantities measured and verified by experienced estimators before any model sees the drawings. Models get scored against what a competent human actually produced, not against another AI's opinion.
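For readers who want to picture what scoring against an answer key could look like, here is a minimal sketch in Python. It is an illustration only, not the ContractorOS scoring method: the `LineItem` structure, the `score_takeoff` function, the 5% tolerance and the sample quantities are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineItem:
    """One takeoff quantity (hypothetical structure, for illustration)."""
    name: str
    unit: str
    quantity: float


def score_takeoff(answer_key, model_output, tolerance=0.05):
    """Compare model quantities to the estimator-verified key.

    Returns (errors, accuracy). errors maps each key item to its relative
    error, or None when the model omitted the item or used a different unit.
    accuracy is the share of key items within the assumed tolerance.
    """
    predicted = {item.name: item for item in model_output}
    errors = {}
    for truth in answer_key:
        guess = predicted.get(truth.name)
        if guess is None or guess.unit != truth.unit:
            errors[truth.name] = None  # missing item or unit mismatch
            continue
        if truth.quantity == 0:
            # Relative error is undefined at zero; only an exact zero passes.
            errors[truth.name] = 0.0 if guess.quantity == 0 else float("inf")
        else:
            errors[truth.name] = (
                abs(guess.quantity - truth.quantity) / abs(truth.quantity)
            )
    passed = sum(1 for e in errors.values() if e is not None and e <= tolerance)
    accuracy = passed / len(errors) if errors else 0.0
    return errors, accuracy


# Made-up sample data, not real benchmark results.
answer_key = [
    LineItem("interior doors", "ea", 24),
    LineItem("slab concrete", "m3", 50.0),
    LineItem("gypsum board", "m2", 400.0),
]
model_output = [
    LineItem("interior doors", "ea", 24),
    LineItem("slab concrete", "m3", 55.0),    # 10% over the key
    LineItem("gypsum board", "sqft", 4300.0), # wrong unit, scored as a miss
]

errors, accuracy = score_takeoff(answer_key, model_output)
for name, err in errors.items():
    print(name, "no usable answer" if err is None else f"{err:.1%} off")
print(f"within tolerance: {accuracy:.0%}")
```

The point of the sketch is that the reference is the human-verified quantity, so a model is marked down for a wrong number, a wrong unit or a missing item alike. A real benchmark would likely need per-trade tolerances and a way to match items that are named differently.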

The field

Frontier models, same inputs

Claude, GPT, Gemini and Copilot run the same task on the same inputs. Same drawings, same instructions, same scoring. No tuning toward any vendor.

Cadence

Re-run on every major release

When a major model version ships, the benchmark runs again. Results stay current as the models change, and the history shows which lab is actually improving.

Status

Benchmark 01 — Drawing Takeoff

First run in progress. Results will be published on this page. Join the community to get them as soon as they are out, along with the full task-by-task breakdown.