The Semantic Lakehouse for AI/BI

Estimated Reading Time: 11 minutes

Kyle Hale, Lead Product Architect at Databricks, coined the term “Semantic Lakehouse” in a 2022 blog post that explained how to extend the Lakehouse to users at the “top of the stack” using PowerBI.

The following year, Kyle Hale, Soham Bhatt, and I published a follow-up post, “Building a Semantic Lakehouse With AtScale and Databricks.” In it, we discussed a tool-agnostic approach: AtScale democratizes the Databricks Lakehouse for a broader range of business users in PowerBI, Excel, Tableau, and other tools; intelligently leverages the elastic compute of DBSQL; and integrates with Unity Catalog to achieve sub-second response times on data in the Delta Lake. The alternative was shrinking the data, moving it into proprietary formats, and relying on in-memory compute in the BI layer.

By combining the AtScale Semantic Layer with their Databricks Lakehouse, our customers have been able to achieve significant outcomes.

With many more companies getting up and running on their Semantic Lakehouse journey, it's time to revisit the Semantic Lakehouse and address the elephant in the room.

No, it’s not Hadoop, it’s LLMs and GenAI.

Generative BI, LLMs, NLQ, Semantic Layers, and Other Buzzwords

When the GenAI hype train started gaining momentum, every analytics company needed an LLM story to be relevant. Everyone began to tout some type of text-to-SQL, Natural Language Querying, or AI-powered co-pilot offering that aimed to disrupt the “legacy” BI stack.

Generative BI applications provide BI-like functionality, but instead of requiring users to be proficient in SQL, PowerBI, Excel, Tableau, or another specific tool, they enable users to ask questions in English and get answers. The value proposition of these NLQ solutions is that business users can self-serve answers to ad-hoc questions not covered by their dashboards, reports, or pivot tables, without relying on expert IT practitioners to fill in the gaps.

So, no more BI tools, right? There is no need for a Semantic Lakehouse, right? We finally got rid of Excel, right? Well, no, not really. 

While these LLMs have tremendous potential, and even with the week-over-week improvements in modern models, existing NLQ solutions have several known limitations:

  1. LLMs struggle when database schemas get complex and often fail to generate the required table joins correctly.
  2. LLMs are generative by design. This is great in some contexts, but when tasked with a complex, domain-specific question, an LLM can return inconsistent or incorrect results, even across repeated runs of the same question. From the user’s point of view, there is often no way to know when the model has hallucinated.
  3. Database table and column names are often defined in ways that make sense for a company’s data engineering practices, but these may differ from the terminology used by the business. Without context for mapping business terminology to database terminology, the LLM will struggle to generate correct queries.

Example – When posed with a complex, domain-specific business question such as “What was my ARR in EMEA in Q4 2024?”, statistical inference alone cannot produce a sufficiently accurate answer. What is ARR? How is ARR calculated? What the hell is EMEA? What is the company’s fiscal calendar? The information required to answer these questions likely doesn’t reside inside the database.

This is partially why Semantic Layers and the rich glossary of business metadata defined within them have seen a resurgence recently. The domain-specific metadata in Semantic Models is gold for LLMs as it enables them to perform both statistical and contextual inference to achieve the accuracy and domain specificity required to get these applications into the hands of decision-makers with confidence. 
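To make this concrete, here is a minimal sketch of the kind of metadata a Semantic Model could supply for the ARR question above. Every name, formula, and fiscal-calendar detail below is hypothetical and for illustration only, not drawn from any particular product’s specification:

```yaml
# Hypothetical semantic-model metadata for the ARR example.
# An LLM grounded in these definitions no longer has to guess what
# "ARR" or "EMEA" means, or how the fiscal calendar works.
metrics:
  - name: arr
    label: ARR (Annual Recurring Revenue)
    description: Annualized value of active recurring subscriptions.
    expression: SUM(monthly_recurring_revenue) * 12
dimensions:
  - name: region
    label: Sales Region
    description: Conformed geography; EMEA = Europe, Middle East, and Africa.
  - name: fiscal_quarter
    label: Fiscal Quarter
    description: Derived from the company fiscal calendar, not the calendar year.
```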

But don’t just take my word for it! 

  • Microsoft has put PowerBI at the forefront of their Fabric offering, and creating a Semantic Model is a prerequisite to using their Copilot.
  • Google has similar messaging around Looker and how creating LookML semantic models helps ground your AI applications in truth. 
  • Microstrategy (or Strategy, I guess?) highlights how their Semantic Graph enables accurate interpretation and computation of users’ questions.
  • Salesforce is also launching Tableau Semantics, which aims to improve the quality of AI models.
  • Snowflake has created its own YAML-based specification for Semantic Model creation to leverage with Cortex Analyst.
  • And Databricks recently announced Unity Catalog Metrics (more on this later!).

Articles and YouTube videos showing recent semantic-layer updates across top BI tools

However, with its renewed importance in the tech stack, the semantic layer has become rife with vendor-specific requirements and lock-in. That is a missed opportunity to embrace open standards and interoperability.

Setting the standard in Semantics for AI and BI

The emergence of these text-to-SQL applications was the Technology Trigger (to steal a term from the Gartner Hype Cycle) that has brought Semantic Layers back into the limelight. In response to this resurgence, AtScale announced the open-source release of the Semantic Modeling Language (SML), a universal standard designed to promote interoperability and foster a vibrant community of semantic model builders. At its core, it is a multidimensional semantic modeling language that supports metrics, dimensions, hierarchies, semi-additive measures, many-to-many relationships, cell-based expressions, and much more. Here are the key components of SML.

  • A YAML-based Language Specification: The SML specification is documented and encompasses tabular and multidimensional constructs.
  • Pre-built Semantic Models: The GitHub repository contains pre-built semantic models incorporating standard data models such as TPC-DS, common training models such as Worldwide Importers and AdventureWorks, and marketplace models such as Snowplow and CRISP. We expect to add semantic models for SaaS applications such as Salesforce, Google Analytics, and Jira soon.
  • Helper Classes (coming soon): We will release helper classes that will facilitate the programmatic reading and writing of SML syntax.
  • Semantic Translators (coming soon): We will release converters for migrating other semantic modeling languages to SML, including dbt Labs’ semantic layer and Power BI. Shortly after, we expect to release converters for legacy (e.g., Microstrategy, Business Objects, Cognos) and modern (e.g., Looker) semantic modeling tools.
Sample SML code for a Model object
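Since the sample above ships as an image, here is a rough YAML sketch of the general shape of an SML model object. Treat the specific field names and values as illustrative assumptions; the authoritative structure lives in the SML specification on GitHub:

```yaml
# Illustrative SML-style model object (field names are approximate;
# see the SML spec for the authoritative structure).
unique_name: internet_sales
object_type: model
label: Internet Sales
relationships:
  - unique_name: sales_to_date
    from:
      dataset: factinternetsales
      join_columns: [orderdatekey]
    to:
      dimension: date_dimension
metrics:
  - unique_name: total_sales_amount
    folder: Sales Metrics
dimensions:
  - date_dimension
```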

SML aims to set a standard for semantic model creation across AI and BI. Today, teams reinvent the wheel constantly: a PowerBI Semantic Model is created for Finance, a similar Semantic Model is built from scratch for a Tableau use case on the Marketing team, and then a new YAML semantic model is written to power a text-to-SQL experiment for the Sales team. With SML, every semantic object is open and interoperable.

Open and interoperable semantic objects enable distributed teams to use the same business logic

Open and interoperable semantic objects also hide the complexity of their implementations, making it easier for distributed teams to “plug and play” without needing to understand another business team’s logic. (SML is pretty neat and you can find out more here!)
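As a hypothetical illustration of that plug-and-play reuse (all object names below are made up), two teams’ models can reference the same shared dimension object without copying its logic:

```yaml
# finance_model.yml -- the Finance team's model
unique_name: finance_sales
object_type: model
dimensions:
  - customer_dimension   # shared object, defined once, owned centrally
---
# marketing_model.yml -- the Marketing team's model
unique_name: marketing_campaigns
object_type: model
dimensions:
  - customer_dimension   # same object, same business logic, no copy
```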

What does this mean for my Semantic Lakehouse and AI/BI initiatives? EVERYTHING.

The Semantic Lakehouse V2

So the “table stakes” for getting this far in the blog is that you have a basic understanding of AtScale and our integration with Databricks, or have at least read the prequel to this post. If not, learn more about Building a Semantic Lakehouse With AtScale and Databricks.

In the same way that the Databricks Lakehouse unified the Data Lake and the Data Warehouse, AtScale aims to unify the worlds of AI and BI by providing a common standard and interface to data.

AtScale dashboard

AI/BI Genie, Dashboard and SQL Editor

It doesn’t matter whether you’re a traditional business user interacting with data via a BI tool or are lucky enough to have access to AI/BI Genie for ad-hoc NLQ analysis; both users get a guided experience with the data, and if you ask the same question, you’ll get the same answer. If you’d like to see it in action, check out a demo with AtScale Founder/CTO David Mariani and Databricks Field CTO Dael Williamson.

So, to my Databricks admins out there, if a Genie appeared and granted you three wishes (see what I did there?) to make your AI/BI Genie space kick-ass, here’s what you need. 

A Semantic Model (open and interoperable)

Whether traditional BI or Generative BI, it begins with a Semantic Model. These applications share the same desired outcome: to give a broader range of users a self-guided experience built on accurate, actionable data.

LLMs are great, but bridging the gap between language and business data through statistical inference alone is extremely difficult. And I doubt a CFO will trust an “inferred calculation of gross margin.” Others will turn to manually recreating the logic in a variety of individual systems, but this leads back to the age-old data silo problem (a metadata silo, in this instance). This is the very reason Semantic Models were created. A well-built Semantic Model remains king.

For AI/BI Genie and Dashboards, the AtScale Semantic Model is available as a foreign table through our integration with Unity Catalog and Lakehouse Federation. This integration allows AtScale to present its Semantic Model to AI/BI Genie as a single flat table.

AtScale Model creating a single flat table on top of complex data

A Semantic Query Engine 

Traditionally, BI tools such as Power BI, Excel, and Tableau query the Semantic Model with DAX, MDX, or Postgres-dialect SQL, and the AtScale Query Engine translates those inbound queries into the SQL dialect required to execute against the underlying warehouse (DBSQL, in the case of Databricks).

With AI/BI Genie, inbound queries are made against a single logical table, which removes the need for the LLM to perform joins or multi-table operations altogether. And since the AtScale Query Engine handles the translation between logical column names and KPIs, the LLM never has to generate complex business logic. This simplifies the problem, transforming even the most complex NLQ question into something solvable for the LLM.
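As a conceptual illustration (every name here is hypothetical, and this is pseudo-structure rather than any product’s actual query format), here is how small the LLM’s job becomes against the flat logical table:

```yaml
# What the LLM must produce for "What was my ARR in EMEA in Q4 2024?"
# against the flat table: column selection and filters, nothing more.
question: What was my ARR in EMEA in Q4 2024?
llm_generated_query:
  select: [arr]               # metric expanded by the AtScale engine
  filters:
    region: EMEA              # conformed geography dimension
    fiscal_quarter: 2024-Q4   # fiscal calendar resolved by the model
# Joins, the ARR formula, and fiscal-calendar logic are all injected by
# the Semantic Query Engine when it rewrites this into DBSQL.
```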

AtScale Semantic Layer and Databricks AI/BI Genie integration

More metadata! (Data is the new oil, again)

With models slowly becoming commodities, it’s clear that one of the few remaining ways to gain a competitive edge is to feed these LLMs better data. The metadata defined in a Semantic Model is extremely valuable here, as it supports both statistical and contextual inference. Databricks recently announced Unity Catalog Metrics as a way to store certified metrics in the Lakehouse. These metrics will be fully governed and discoverable in Unity Catalog, like any other resource, with complete lineage visibility.

AtScale is one of the launch partners of Unity Catalog Metrics and will have a bi-directional read-and-write integration, allowing data teams to work with the same semantics as the business, ensuring that all teams use consistent definitions.

AtScale and Databricks Unity Catalog metrics integration

With our Semantic Translators, AtScale will be able to take your existing PowerBI Semantic Models, convert them into SML, and write them to Unity Catalog Metrics. This will enable you to hydrate Unity Catalog with metrics that are already defined, but locked into proprietary formats, in the BI layer.
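For a sense of what lands on the Databricks side, here is a minimal sketch of a metric definition in the YAML style Databricks has shown for Unity Catalog metric views. The table, column, and metric names are hypothetical, and the exact schema is Databricks’ and may differ from this sketch:

```yaml
# Hypothetical Unity Catalog metric view body, e.g. the target of an
# SML-to-UC translation. In Databricks this YAML is wrapped in SQL:
#   CREATE VIEW main.finance.arr_metrics
#   WITH METRICS LANGUAGE YAML AS $$ ... $$
version: 0.1
source: main.finance.subscriptions
dimensions:
  - name: region
    expr: region_code
  - name: fiscal_quarter
    expr: fiscal_quarter
measures:
  - name: arr
    expr: SUM(mrr) * 12
```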

Interested? See it in action.

Seeing is believing. If you’d like to see how we achieve a 4x improvement in query accuracy with Genie, request a demo below. (LLM text-to-SQL Benchmark)

PS – No ChatGPT was harmed in the making of this blog. All heart.

 
