Posted by Varadh Kaushik, AI Engineer · 4 minute read

We Just Posted the Highest Score on ADE-Bench. We Didn't Build an Agent to Do It.

Every AI data tool on the market is racing to build the smartest agent. More tools. More skills files. More proprietary scaffolding. We took a different approach, and it turns out it's the one that works.

Sidecar Data just posted the highest score on ADE-bench, the open-source benchmark from Benn Stancil and dbt Labs that tests AI agents on real-world analytics and data engineering tasks. No custom agent. No proprietary skills files. No purpose-built harness. Just Sonnet 4.6 connected to the Sidecar MCP server, which provides the rich context.

We're not benchmarking our agent. We're benchmarking what happens when an AI actually understands your data.
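What does "no custom agent" mean in practice? Roughly this: point a stock MCP client at the server and hand the model whatever tools it advertises. Below is an illustrative sketch using the open-source MCP Python SDK; the sidecar-mcp command name is a placeholder for illustration, not our actual binary or harness.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Placeholder command for illustration; not the actual Sidecar server binary.
server = StdioServerParameters(command="sidecar-mcp", args=[])

async def main() -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Everything the model can do comes from the server's tool list;
            # there is no bespoke agent logic layered on top.
            tools = await session.list_tools()
            for tool in tools.tools:
                print(f"{tool.name}: {tool.description}")

asyncio.run(main())
```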


What Is ADE-Bench?

ADE-bench is the first serious benchmark purpose-built for data engineering work. Created by Benn Stancil (founder of Mode) in collaboration with dbt Labs, it throws AI agents into the deep end: messy dbt projects with hundreds of tables, broken models, vague prompts like "it's broken," and complex analytics questions that mirror what data teams actually deal with every day.

Each task runs inside a Docker container sandbox. The agent gets a project, a database, and a problem to solve. Success is binary. Either all the dbt tests pass or they don't.
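To get a feel for how unforgiving that is, here's a minimal sketch of the pass/fail idea, assuming the harness shells out to the dbt CLI. The real ADE-bench harness does more (per-task Docker provisioning, setup, and teardown), but success still reduces to whether dbt exits cleanly.

```python
import subprocess

def task_passed(project_dir: str) -> bool:
    """Sketch of ADE-bench-style grading: pass only if every dbt test passes.

    The real harness provisions a Docker sandbox per task; here the
    project directory stands in for that environment.
    """
    result = subprocess.run(
        ["dbt", "build"],  # builds the models, then runs their tests
        cwd=project_dir,
        capture_output=True,
        text=True,
    )
    # dbt exits non-zero if any model or test fails, so success is binary.
    return result.returncode == 0
```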


The Results


| Setup | Score |
| --- | --- |
| Sidecar Data (Sonnet 4.6 + Sidecar MCP) | 80% |
| Custom agent harness + 100 deterministic tools (Sonnet 4.6) | 74.4% |
| Snowflake Cortex Code CLI (Opus 4.6) | 65% |
| dbt Labs (Sonnet 4.5 + Fusion + MCP) | ~59% |
| Codex (GPT-5.1) | 56% |
| Sonnet 4.6 baseline (no context) | ~40% |

Look at the spread. Snowflake's dedicated CLI running Opus 4.6 (the most capable model on the market) gets 65%. We beat it by 15 points on Sonnet, a smaller, cheaper model. A custom-built harness with a hundred specialized tools lands at 74.4%. dbt Labs' own stack (Fusion, MCP servers, the works) comes in around 59%.

And the vanilla baseline? Same model, no Sidecar MCP: 40%. Add our context: 80%. That 40-point gap is the whole story, and it wasn't the model doing the heavy lifting.


Why Context Beats Tooling

The AI industry is pouring effort into agent scaffolding, and for good reason. Models without context make mistakes. But instead of asking "how do we catch those mistakes?" we asked a different question: "what if the model just didn't make them?"

Without context, models guess. They hallucinate column names, misread relationships between staging and mart models, and write SQL that's syntactically valid but semantically wrong. A model that doesn't know your naming conventions will generate dim_customers when your project uses stg_customer. A model that can't see lineage will rebuild a downstream table without understanding what feeds into it.

When you give an LLM deep, structured context about your data environment (your schema, your lineage, your naming conventions, your business logic, how your models relate to each other) it stops guessing. You don't need a deterministic tool to catch a bad table name if the model never writes the bad table name in the first place.
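Concretely, "deep, structured context" can be as simple as tools the model calls to look facts up instead of guessing them. Here's an illustrative sketch using the MCP Python SDK's FastMCP helper; the server name, tool name, and hard-coded lineage are hypothetical stand-ins, not Sidecar's actual implementation.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("context-server")  # hypothetical server, for illustration

# Stand-in for a real, continuously refreshed catalog.
LINEAGE = {
    "stg_customer": {
        "columns": ["customer_id", "email", "signed_up_at"],
        "upstream": ["raw.customers"],
        "downstream": ["customer_orders", "weekly_signups"],
    },
}

@mcp.tool()
def describe_model(name: str) -> dict:
    """Return the real columns and lineage for a dbt model, so the
    LLM never has to guess a table or column name."""
    return LINEAGE.get(name, {"error": f"unknown model: {name}"})

if __name__ == "__main__":
    mcp.run()  # serves over stdio so any MCP client can connect
```

With a lookup like that in the loop, the dim_customers-versus-stg_customer failure mode above doesn't arise: the model asks for the real name instead of inventing one.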


What ADE-Bench Doesn't Test (Yet)

Here's something worth noting: ADE-bench is scoped entirely to dbt projects. That's the right place to start: dbt is where most of the complexity in modern data stacks lives, and it's the hardest thing to benchmark well. ADE-bench is currently the best benchmark available for evaluating AI on data engineering work, and we're glad it exists.

But real data engineering doesn't happen inside dbt alone. When a model breaks in production, the first clue might be a Slack thread from three months ago explaining an upstream schema change. The fix might depend on knowing which dashboards are affected downstream, which team owns them, and whether there's an open Jira ticket about a related issue. The business logic you need might live in a spreadsheet someone shared in a Google Doc, not in a YAML file.

ADE-bench doesn't test any of that. No downstream dashboards. No Slack conversations. No Jira tickets. No cross-tool lineage.

Those are all things Sidecar surfaces, and all things an agent backed by our context can leverage. We scored 80% on a benchmark that tests the narrowest slice of what our context covers. That's our floor, not our ceiling.


Context You Don't Have to Build

The benchmark result raises an obvious question: if context matters this much, why doesn't everyone just… have better context?

Because it's brutally hard to maintain. Documentation goes stale the week after someone writes it. Lineage diagrams reflect last quarter's architecture. Naming conventions exist in one engineer's head. Business logic lives in Slack threads from 2023. Getting your data environment into a state where an AI can actually understand it is a massive lift, and keeping it current is even harder.

That's the problem Sidecar solves.

Sidecar is a consolidated DataOps platform that lives inside your data stack and keeps everything current without your team having to maintain it. We automatically catalog your data assets and keep documentation accurate as your environment changes. We map lineage, ownership, and definitions so they never go stale. We monitor data quality, flag broken dependencies, and catch issues before they become incidents. We identify cost spikes, unused tables, and inefficient queries, then surface clear actions to fix them. And we track governance and compliance across your entire platform.

The result is a data environment that's not just well-organized for your team, but deeply understood by any AI agent that touches it. That's the context that took us to the top of ADE-bench, and it's the same context your team gets out of the box.


What This Means for Data Teams

If you're evaluating AI for your data stack, the question isn't "which agent is smartest?" It's "does the agent actually understand my data?"

Most teams are trying to make AI work on top of messy, undocumented, poorly governed data platforms. That's like hiring a brilliant new analyst and handing them a laptop with no access, no documentation, and a Slack message that says "figure it out."

Sidecar gives every AI agent (and every human on your team) the context they need to do great work. The ADE-bench results are just one proof point. The real value is a data platform that's healthier, cheaper to run, and ready for whatever you throw at it next.


ADE-bench is open source and available on GitHub. We encourage everyone to run it themselves. The more the data community benchmarks transparently, the better we all get.

Want to see what Sidecar can do for your data team? Get in touch.
