December 17, 2019
Big Data Analytics in the Cloud for Today’s Distributed and Diverse Data
There is a turf war going on today among technology providers to move their enterprise customers to the cloud. IBM believes so strongly that its future success depends on its ability to get enterprises to the cloud that it just announced one of the largest technology deals in history, buying Red Hat for $34 billion. As IBM CEO Ginni Rometty pointed out, only 20% of enterprise data workloads have moved to the cloud. This statement highlights the massive opportunity for cloud technology providers competing for the rapidly growing cloud storage market, which is estimated to reach nearly $90 billion by 2022.
While some enterprises are still dipping their toes into leveraging the cloud for their analytical and OLAP workloads, it’s clear the benefits of moving to a hybrid-cloud architecture outweigh those of the traditional approach. Technology leaders at large enterprises (whether they are Chief Data Officers, Chief Information Officers or Chief Technology Officers) are being called upon by their businesses to develop and own a cloud strategy for data storage. These strategies typically span the better part of a decade, and often have seven- or eight-figure budgets. Virtually all of these plans have the goal of moving some or all of an organization’s data warehouse architecture to the cloud. However, the actual move to the cloud is often not penciled in for the near term. Understandably, many technology executives at large enterprises are hesitant to begin moving data to the cloud.
Some of these organizations have vast amounts of data in multiple legacy systems such as Teradata or Hadoop. A lot of time and money has been committed to implementing and maintaining these systems, and it is far from simple to unravel all of that work and recreate it in the cloud. Perhaps most importantly, moving petabytes of data from a legacy system to the cloud has the potential to wreak havoc on the business intelligence teams who are analyzing an enterprise’s data to provide insights on key business questions. Without a plan in place for how to rewrite the logic that exists between BI tools and on-premise data warehouses when data is moved to the cloud, a cloud transformation project could result in business interruption that prevents the delivery of key insights in a timely fashion.
Additionally, migrating to a cloud data warehouse inherently involves a key decision: choosing which data warehouse or combination of warehouses to migrate to. Given the amount of data, time, and money involved, Chief Data Officers are right to proceed with patience and to test out multiple options. Arguably the best strategy for selecting a cloud EDW would be to hedge by trying multiple products to see which best suits an organization’s needs. This strategy could also help an enterprise receive better pricing, as cloud data warehouse vendors will jockey for an enterprise’s business. Of course, the major obstacle to this approach is the lack of a solution to handle siloed data. Until a solution that nails data federation across multiple data stores hits the market, organizations are better served by focusing on a single cloud vendor.
Despite the roadblocks to a fast migration of enterprise data to the cloud, immediate benefits in both cost and performance should encourage large companies to start moving data into the cloud in the near term:
Cost
Most line-of-business executives are responsible for P&Ls, and Chief Data Officers are no different. As many of these individuals are industry veterans, they will be well aware that the cost of housing data in an on-premise warehouse is high. Indeed, legacy systems such as Netezza and Vertica can become cost-prohibitive when asked to host data on the scale that enterprises in 2018 require. These tools also carry high maintenance costs. Even newer data stores like Hadoop, which are cheaper than legacy warehouses, can still be extremely costly and complex to maintain. Hadoop’s open source nature exacerbates the high maintenance costs, to the point where they can offset the cheaper price of standing up the tool.
Even moving a portion of an enterprise’s information from older warehouses to a cloud system like Google BigQuery, Amazon Redshift, or Snowflake could provide significant short-term cost benefits. The pay-as-you-go pricing model employed by cloud providers is a great fit for the current gap between data storage and usage, as 60-70% of an enterprise’s data is never utilized for analytics. In a legacy data warehouse, this information still ends up costing money while lying dormant. Moving some of this data to a cloud environment would eliminate this cost, as an enterprise would only need to pay for data that it uses.
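To make the arithmetic behind this argument concrete, here is a minimal sketch in Python. It takes the simplified premise above at face value: a legacy warehouse bills for all data it holds, while a pay-as-you-go model bills only for the data that is actually used. The warehouse size and the per-terabyte price are hypothetical placeholders, not real vendor rates, and actual cloud pricing models differ in their details.

```python
def monthly_cost(total_tb, price_per_tb, dormant_pct=0):
    """Monthly cost when only the non-dormant share of the data is billed."""
    billed_tb = total_tb * (100 - dormant_pct) / 100
    return billed_tb * price_per_tb


# All figures below are illustrative assumptions, not vendor pricing.
TOTAL_TB = 1000        # hypothetical warehouse size in terabytes
PRICE_PER_TB = 100     # hypothetical monthly price per terabyte
DORMANT_PCT = 65       # midpoint of the 60-70% of data never used for analytics

# Legacy model: every terabyte costs money, used or not.
legacy_cost = monthly_cost(TOTAL_TB, PRICE_PER_TB)

# Pay-as-you-go model (simplified): only the data in use is billed.
cloud_cost = monthly_cost(TOTAL_TB, PRICE_PER_TB, dormant_pct=DORMANT_PCT)

savings = legacy_cost - cloud_cost

print(f"Legacy: {legacy_cost:.0f}, pay-as-you-go: {cloud_cost:.0f}, savings: {savings:.0f}")
```

Under these assumptions the savings scale directly with the dormant share of the data, which is why even a partial migration of rarely used data can pay off quickly.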
Performance
While economics is always top of mind for line-of-business executives, ensuring that the data architecture they are responsible for enables strong analytical performance is just as critical. Ultimately, data warehouses are central to a business’s operations because their data is used to solve critical challenges facing the enterprise. For that to be possible, the users of business intelligence tools like Tableau and Power BI need to be able to efficiently run queries against a data warehouse. Cloud data warehouses have an advantage over legacy systems, as their more modern architecture typically results in faster query response times, which in turn reduces the time it takes an organization to derive crucial insights from the data it possesses.
Organizations should begin moving data to the cloud in the short term to take advantage of this improvement in BI performance, as the discrepancy in agility between cloud and legacy warehouses will likely continue to increase. Google and Amazon both have massive amounts of resources to throw at improving BigQuery and Redshift respectively, while Snowflake has had no issue raising the level of funding needed to continue improving its product. Indeed, these three vendors are well placed to combine product improvements with the R&D efforts necessary to unlock the next big trend in the cloud data warehouse space.
Taking Advantage of New Modeling Methodologies, Avoiding a Forced Move
In addition to the widening gap in performance between cloud and legacy data warehouses, cloud solutions are better positioned to allow enterprises to run machine learning and predictive AI models. These types of advanced models will play a significant role in helping enterprises make critical business decisions proactively. The possible use cases for AI and machine learning are significant and exist in many industries. Financial services companies could use these models to pinpoint signs of fraud and identify investment opportunities earlier. Healthcare organizations could enhance medical experts’ abilities to identify trends or red flags that might ultimately improve diagnostics and treatment. Having the right enterprise data warehouse to support these types of models will be essential for an enterprise to see the benefits of increased automation in business intelligence.
Finally, as organizations continue to move data into the cloud, and companies with both legacy and cloud offerings put more resources into their cloud solutions, there is a distinct possibility that support for legacy solutions could expire. Finding out that support for a tool is ending can present significant challenges, as an organization using that tool will likely be forced to spend time and money to move data elsewhere. A forced move as opposed to a planned one will also be more disruptive to the business as a whole. If Ginni Rometty’s statement that enterprises have only moved 20% of their workloads to the cloud is directionally accurate, identifying systems where support could be discontinued represents a method to prioritize which workloads among the remaining 80% should be moved next.
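As a rough illustration of this prioritization, the sketch below orders workloads by the announced end-of-support date of the system they run on, so the most exposed workloads are migrated first. The `Workload` class, the workload names, the platform labels, and the dates are all hypothetical and serve only to show the idea.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Workload:
    name: str
    platform: str
    support_ends: Optional[date]  # None means no end of support has been announced


def migration_order(workloads: List[Workload]) -> List[Workload]:
    """Earliest announced end of support first; workloads with no announced date go last."""
    return sorted(
        workloads,
        key=lambda w: (w.support_ends is None, w.support_ends or date.max),
    )


# Hypothetical inventory of workloads still running on legacy systems.
workloads = [
    Workload("finance_reporting", "legacy_warehouse_a", date(2022, 6, 30)),
    Workload("clickstream_archive", "hadoop_cluster", None),
    Workload("sales_dashboards", "legacy_warehouse_b", date(2021, 1, 31)),
]

ordered_names = [w.name for w in migration_order(workloads)]
print(ordered_names)
```

In practice an organization would likely weigh other factors as well, such as data volume and business criticality, but support timelines offer a simple first cut.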
By 2025, it is not unreasonable to expect that a majority of enterprise data will sit in cloud warehouses. Organizations that move to the cloud earlier will enjoy its cost and performance benefits for longer than their competitors.