Your Agent Passed the Exam. It Still Can't Do the Job.

Hack Session

About the session

Text-to-SQL is, by most leaderboards, a solved problem. Models post impressive scores and the progress feels real. Then the same architecture gets pointed at a real enterprise warehouse, and the numbers fall off a cliff.
Why? The benchmarks we celebrate are built on small, tidy, well-documented schemas with neatly phrased questions and a single obvious answer. Reality is the opposite: hundreds of tables with cryptic names, business terms that map to no column directly, joins that depend on knowledge living only in someone's head, and questions where the same word means three different things depending on who's asking. The way we score these systems doesn't survive the move either: a result that's subtly wrong gets treated the same as one that's obviously right, so the failures that hurt most are exactly the ones our metrics are blind to.

This session looks at where the gap actually lives: in the questions we test against, and in the scores we trust. We'll unpack why "it works in the demo" keeps turning into "it broke in production," and what better evaluation could look like.
