AtScale’s Public Leaderboard for Text-to-SQL Tasks

Generative AI has caused an explosion of interest in Text-to-SQL use cases. A Text-to-SQL solution would enable decision makers to ask freeform questions of their proprietary data without having to write code or work through data engineering and analyst groups. The biggest challenge is contending with the well-known problem of LLM hallucinations. While vendors claim various accuracies, they generally do not share their data or methods of evaluation. This makes it impossible to validate or compare solutions. To remedy this, AtScale is open-sourcing a benchmark and leaderboard for Text-to-SQL based on canonical data and evaluation metrics so that anyone can evaluate and compare Text-to-SQL solutions.

The Challenge of Comparing Text-to-SQL Solutions

Vendors often publish results of Text-to-SQL systems without sharing the data, schema, questions, or evaluation criteria used. A claim of 90% accuracy sounds great but is impossible to validate without this information. Further, systems cannot be compared with one another unless they are evaluated on the same inputs with the same criteria.

To address this issue, we set out to create an objective, quantitative method for evaluating and comparing Text-to-SQL systems. We were inspired by the evaluation criteria other researchers have set forth for benchmarking knowledge graphs for LLMs. To expand on their work, we wanted to more closely mirror real-world data by incorporating TPC-DS, a canonical dataset and schema for database benchmarking. We are now making this benchmark completely public so that others can leverage the exact same inputs and evaluation criteria and extend them for their own purposes. We’re also very open to feedback on how to improve the benchmark.

Benchmark and Text-to-SQL Leaderboard

The benchmark and Text-to-SQL Leaderboard materials hosted on GitHub are public and include the following:

  1. All information required to use the TPC-DS Text-to-SQL evaluation environment as a public benchmark. This includes download instructions for the TPC-DS dataset, the evaluation questions, the KPI definitions, and information on the evaluation methods.
  2. A public performance leaderboard to track industry leaders against this benchmark

The goal is for this benchmark and leaderboard to serve as a community resource, enabling the industry to establish a common evaluation method and progress more transparently.

For ease of use, the repository also provides tips on setting up the appropriate data and standards for evaluating Text-to-SQL solutions.

Objective Metrics for Evaluating Text-to-SQL

The evaluation is scored across two dimensions of complexity:

  1. Question Complexity: Captures how complicated the KPIs required to answer the question are. Selecting a column from a table is considered low complexity, while a derived calculation such as “ratio of profit to cost” is considered high complexity.
  2. Schema Complexity: Covers how many tables are required to answer the question correctly. A question requiring four or fewer tables is considered low schema complexity; any question requiring five or more tables is considered high schema complexity. Illustrative queries covering both dimensions follow this list.
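
To make the two dimensions concrete, here is a pair of illustrative queries written against the standard TPC-DS schema (table and column names follow the TPC-DS specification). These are sketches for illustration only, not queries drawn from the benchmark’s official question set, and the exact column choices are assumptions about how such a KPI might be expressed.

    -- Low question complexity, low schema complexity:
    -- a simple column selection from a single TPC-DS table.
    SELECT c_first_name, c_last_name
    FROM customer;

    -- High question complexity (a derived KPI: ratio of profit to cost),
    -- high schema complexity (five joined tables).
    SELECT i.i_category,
           SUM(ss.ss_net_profit)
             / NULLIF(SUM(ss.ss_wholesale_cost * ss.ss_quantity), 0) AS profit_to_cost_ratio
    FROM store_sales ss
    JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
    JOIN item i     ON ss.ss_item_sk = i.i_item_sk
    JOIN store s    ON ss.ss_store_sk = s.s_store_sk
    JOIN customer c ON ss.ss_customer_sk = c.c_customer_sk
    WHERE d.d_year = 2002
      AND s.s_state = 'TN'
      AND c.c_preferred_cust_flag = 'Y'
    GROUP BY i.i_category;

Under the rules above, the first query would fall into the low/low bucket, while the second would score as high on both question and schema complexity.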

Examples and further explanation of the evaluation metrics can be found in the GitHub repository.

Future Plans

As the industry matures, we hope to continuously improve upon this benchmark and make it a robust source of truth for Text-to-SQL solutions. Similarly, as AtScale’s Text-to-SQL solutions mature, we will post our new results to this same leaderboard.

If you have a solution you think should be included on the leaderboard or suggestions on how to improve the benchmark, we encourage you to email ailink@atscale.com.

Read the full whitepaper: Enabling NLP with AtScale’s Semantic Layer & GenAI.
