Which AI is actually good at construction?
AI is being sold hard to construction right now, and there is very little evidence behind the noise. ContractorOS benchmarks put frontier models through real construction tasks and score them against validated answer keys, so you can see what to trust AI with and what still needs a human.
Real drawing sets
Architectural and structural sets from real projects, spanning trades and formats. Not textbook examples. The same messy PDFs contractors deal with every day.
Validated answer keys
Quantities measured and verified by experienced estimators before any model sees the drawings. Models get scored against what a competent human actually produced, not against another AI's opinion.
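As a rough illustration of what scoring against an answer key can look like, here is a minimal Python sketch. It is not the ContractorOS scoring method, which this page does not specify. The `LineItem` type, the `score_takeoff` function, the 5% tolerance, and the sample quantities are all hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineItem:
    """One takeoff quantity (hypothetical shape, for illustration only)."""
    name: str
    unit: str
    quantity: float


def score_takeoff(answer_key, model_output, tolerance=0.05):
    """Return the fraction of answer-key items the model matched.

    An item counts as matched when the model reported the same name and
    unit, and its quantity is within `tolerance` (relative error) of the
    estimator's figure. The 5% default is an assumption, not a published rule.
    """
    if not answer_key:
        return 0.0
    by_name = {item.name: item for item in model_output}
    hits = 0
    for expected in answer_key:
        got = by_name.get(expected.name)
        if got is None or got.unit != expected.unit:
            continue  # missing item or wrong unit scores zero
        if expected.quantity == 0:
            matched = got.quantity == 0
        else:
            error = abs(got.quantity - expected.quantity) / abs(expected.quantity)
            matched = error <= tolerance
        if matched:
            hits += 1
    return hits / len(answer_key)


# Made-up figures: the estimator's key versus one model's takeoff.
answer_key = [
    LineItem("interior doors", "EA", 24),
    LineItem("slab concrete", "CY", 100.0),
    LineItem("drywall", "SF", 500.0),
]
model_output = [
    LineItem("interior doors", "EA", 24),   # exact match
    LineItem("slab concrete", "CY", 103.0), # 3% off, inside tolerance
    LineItem("drywall", "SF", 560.0),       # 12% off, outside tolerance
]

score = score_takeoff(answer_key, model_output)
```

In this sketch the model matches two of the three key items. Scoring per item rather than on a project total keeps a large miss in one trade from being hidden by accurate counts elsewhere.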
Frontier models, same inputs
Claude, GPT, Gemini and Copilot run the identical task with the identical inputs. Same drawings, same instructions, same scoring. No tuning toward any vendor.
Re-run on every major release
When a major model version ships, the benchmark runs again. Results stay current as the models change, and the history shows which lab is actually improving.
Benchmark 01 — Drawing Takeoff
First run in progress. Results will be published on this page. Join the community to get them when they drop, along with the full task-by-task breakdown.