What is a data warehouse?
Data Warehouses: Overview, benefits, and how to choose a data warehouse.
In our guide on How to Build a Modern Data Stack, we explained which tools and systems you need to have to create a data stack that will grow with your business, your team, and the needs of your internal and external customers.
The data warehouse is the centerpiece of that stack; it’s where all of the data you need to run your business will be centralized.
What is a data warehouse?
A data warehouse stores data from multiple data sources in a single location to make it easier to access, blend, and analyze. These repositories are designed and optimized for large-scale storage of data and quick querying of that data, even if the queries you’re trying to answer are quite complicated.
Today, most of the providers that early stage companies consider are cloud data warehouses. Like many of the other tools you might use to run your business, these modern data warehouses are software as a service (SaaS) platforms that don’t require end-users to own or maintain the physical infrastructure.
Here’s a more technical definition from Snowflake, one of the top cloud data warehouses on the market:
“A data warehouse is a relational database that is designed for analytical rather than transactional work. It collects and aggregates data from one or many sources so it can be analyzed to produce business insights. It serves as a federated repository for all or certain data sets collected by a business’s operational systems.”
That’s a bit of a mouthful, but you don’t need to know what a federated repository is to make use of a data warehouse in your organization. It’s more important to get a solid handle on the basics, which include:
- When do I need a data warehouse?
- Benefits of a data warehouse
- How do I get data into a data warehouse?
- Data warehouses vs. data lakes
- How do I build a data warehouse?
- How much does a data warehouse cost?
- Which data warehouse provider should I choose?
Keep in mind that you won’t need to tackle each of the above items on your own – building a data warehouse is a technical process that requires specialized knowledge and skills!
As a non-technical business leader, it’s important for you to first understand the basics. Once you have a solid foundation of knowledge, you can source the technical resources needed to build and maintain your data warehouse.
When do I need a data warehouse?
In our How to Build a Modern Data Stack blog, we talked about when it’s time to implement a data stack in your company:
As a general rule of thumb, if your organization is frequently asking questions about data that sits in more than one place, it’s a good time to build out a data analytics infrastructure.
Since a data warehouse is the centerpiece of your data stack, the same guidelines apply here, too. As your company grows and becomes increasingly data-driven, you’ll want to consider building a data warehouse to centralize all of your important data.
To expand on that, here are some additional examples that signal the need for a data warehouse:
- You have large amounts of data being generated from multiple sources (e.g. SaaS tools, marketing tools, production databases, and CDPs).
- Data is overly granular and hard to work with, causing the business to spend a lot of time manually summarizing and pivoting the data to make it usable.
- People spend a lot of time manually updating existing management reports and pulling data from various sources to get a complete picture.
- Users in your organization regularly download data exports from multiple sources and combine them in spreadsheets for analysis.
- Different departments (theoretically) report on the same data but often arrive at different conclusions. This might happen because people are working from outdated data or because business definitions of important metrics are inconsistent.
- You need a complete historical record of data, but it isn’t possible within all of the platforms your company uses. For example, Google Analytics 4 only retains user-level data for a maximum of 14 months, but your digital marketing team wants to compare user data year-over-year for the last three years.
Finally, and perhaps most importantly, if your organization regularly delays decision-making because you’re waiting for answers to questions, it’s probably time for a data warehouse (and a full-fledged data analytics and business intelligence program, if we’re being honest).
Benefits of a data warehouse
A data warehouse is part of a system that allows your organization to make better decisions, faster. Here are some of the other top benefits of a data warehouse:
1. Improved data quality inside the data warehouse
A data warehouse alone doesn’t improve data quality – if you load bad data into a warehouse and don't transform it to improve quality and usability, it’ll still be bad data.
Traditionally, ETL (Extract, Transform, Load) processes transformed data before loading it into the warehouse so that it would arrive clean and organized. Nowadays, modern cloud data warehouses offer enough compute power and scalable storage to ingest raw data and transform it into clean, usable, analytics-ready models after it lands in the warehouse. This process is called ELT (Extract, Load, Transform).
2. Cost-effective, scalable access to historical insights
With a data warehouse, you can store large volumes of historical data (and quickly retrieve that data as needed) for analytics. As a result, stakeholders can easily and efficiently gain a comprehensive understanding of past trends, performance, and challenges, which allows for educated prediction and decision-making.
As we mentioned earlier, the historical completeness that a data warehouse provides also means you don’t have to worry about whether you’ll be able to access past data from your favorite SaaS tools.
We mentioned Google Analytics 4 as an example earlier, but historical insights can be a concern with any number of data sources. Here’s an example from Stripe: “In live mode, users can only access events created within the last 13 months from the Dashboard. Events older than 30 days will only provide a summary view. Events within the last 30 days will be fully visible via the API and Dashboard.”
3. Simplified cross-domain analysis
Cross-domain analysis refers to the process of analyzing data from multiple sources… and it’s rarely straightforward when you’re working with multiple data sources.
When you can join data from individual sources, like ad platforms and transaction platforms, you’ll be one step closer to answering questions like, “What is our ROAS (return on advertising spend) by channel?” without needing to download a bunch of CSVs and Index-Match them together in Excel.
4. Company-wide trust in data
“Single source of truth” is a buzzword that you’ll hear again and again to describe this benefit, but trust is the real value-add that it’s describing.
Here’s the thing about trust: When it comes to data, trust is easy to lose and difficult to get back. Because a data warehouse is your central repository of all data, business stakeholders know that they can trust that data to be consistent and accurate.
5. More accurate data analysis
Data warehouses also enhance your business intelligence capabilities by facilitating more accurate analysis.
For example, because data is stored in one location, the KPIs that are computed from that data are more likely to be consistent and reliable, which leads to improved decision-making across the business.
6. Scalable storage capacity and flexible pricing
Cloud data warehouses also make it easy to scale your storage capacity up or down as needed, based on the amount of data that needs to be stored.
Pricing is also flexible. Data warehouse providers have different billing structures for the services they provide, which means you can choose one that best suits the way your organization’s users work. For example, if you have large amounts of data and a steady usage pattern (i.e., users run queries at predictable intervals), you’ll want to keep that in mind when vetting potential data warehouses.
7. Optimal data granularity
A data warehouse can easily store data to the exact degree of detail, or granularity, that’s required for business analysis. More detailed data allows for more flexibility in your data analysis.
How a name field is subdivided is a common example that’s used to explain data granularity. A single field that contains a customer’s first, middle, and last name isn’t very granular. But if the name field has been subdivided into first name, middle name, and last name, you have access to more granular, and therefore more specific, data.
What’s the ideal level of data granularity? Well, it depends.
There’s no “right” answer to how granular the data that’s stored in your warehouse needs to be. That’s a question that will be answered during the data modeling process, which is a topic that we’ll explore in more detail in an upcoming blog.
8. Increased speed
Generating reports requires a ton of computational power, and running business intelligence tools directly on a user’s computer can slow their system down to a halt. Data warehouses allow for faster querying because they’re purpose-built for speedy data retrieval and analysis.
Cloud data warehouses have another advantage in the speediness department: They’re relatively quick and easy to build and deploy.
Conventional data warehousing required a team of database administrators and data modelers. Cloud data warehouse solutions like Snowflake and Google BigQuery have completely changed the game – they’re very easy to set up and scale up or down alongside your organization’s data needs.
9. Improved security
Since all of your data is in a single location, it’s possible to grant permission-based access to specific users who need specific data.
Modern cloud data warehouses also have strong permissioning, data security, and encryption practices in place, and because all of your data can be backed up automatically, the risk of accidental data loss is substantially reduced.
However, it’s important to note that “secure” might be a subjective term depending on the security posture of your company. Organizations that embrace cloud-based technologies are more likely to agree with the above statements. On the other hand, some organizations prefer to host all of their infrastructure in-house and will not opt for solutions that exist outside of their virtual private cloud (regulatory compliance requirements can also play a role).
10. Empowerment of end business users
This last benefit is our favorite: A data warehouse is the focal point of a modern data stack that makes it possible for people within your company to access data that allows them to make informed decisions and drive results within the business.
How do I get data into a data warehouse?
The process of getting data into a data warehouse is called data ingestion. We’re going to share more about data ingestion in the future if you’re interested in learning the ins and outs, but here’s the CliffsNotes version:
Data ingestion is generally accomplished via an ETL or ELT process. The letters stand for extract, transform, and load, and here’s what happens during each part:
Extract: Data is pulled from a source, like a SQL database or a SaaS tool.
Transform: The format or structure of a data set is converted to match the properties of the data warehouse (or another destination system) and the needs of the business.
Load: Data is moved into a data warehouse (or another destination system).
The main difference between the two is where data transformation occurs:
- With ETL, data is transformed on a separate server before it’s loaded into the storage system.
- With ELT, data is extracted from a source, loaded into a storage system, and transformed inside the storage system as it’s processed.
ETL is the more traditional process. However, ELT is widely used amongst modern companies that already store their data in the cloud, because modern data warehouses are powerful enough to process large-scale data transformations.
There are several different tools you can use to funnel data from various source systems into your data warehouse. Fivetran and Airbyte are two popular options that integrate with many commonly used third-party SaaS applications.
Data warehouses vs. data lakes
These terms are sometimes used interchangeably, and while each one stores large amounts of data, it’s worth noting that they aren’t the same thing.
A data warehouse stores structured, filtered data that has already been processed for a specific use. It takes a lot of computational power to process all of that raw data, which is why data warehouses tend to be more expensive than data lakes.
Data lakes store a mix of structured, semi-structured, and unstructured data. They’re like a “dumping ground” where huge amounts of data can be stored in a flexible format for future use. They tend to be more affordable when you have massive amounts of data. For example, Internet of Things (IoT) companies that generate and store a lot of sensor and transaction-level data are good candidates for data lakes.
It’s worth noting that as businesses collect an ever-increasing volume of data, data warehousing technologies have evolved to meet the demands of their users. Today’s today warehouses are getting better and better at querying large amounts of data.
Amazon Redshift Serverless is a good example of this evolution. The service automatically scales data warehouse capacity in response to workloads, which provides fast performance and helps users optimize their costs, because they only pay for what they use.
Speaking of costs, by this point you’re probably wondering what a data warehouse will cost.
How much does a data warehouse cost?
Designing, building, and maintaining a data warehouse is often expensive and time-consuming. If you’re a non-technical business leader who won’t be handling the implementation or maintenance side of the data warehouse, we recommend focusing your energy on becoming familiar with the different pricing models.
This part gets tricky, as you’ll see in a moment, but the primary factors that influence the final cost of your data warehouse are:
- Consumption-based pricing, which is based on the amount of data queried. This pricing model is easier to calculate, but costs can balloon very quickly depending on usage. Snowflake and Google BigQuery use this model (both providers also offer upfront pricing, which we’ll explain in more detail below.)
- Compute-based pricing, which is calculated based on CPU, RAM, and disk usage. Compute needs are generally more predictable but the upfront investment is higher. This is how Amazon RedShift determines your final costs.
You have plenty of cloud data warehouse vendors to choose from, but here’s a brief and basic overview of the pricing models for the three solutions mentioned above, plus some examples of their ideal usage scenarios:
Snowflake
Snowflake’s costs depend on compute time and storage needs.
For storage, you can pay upfront, which is cheaper ($23/TB/month), or opt for on-demand pricing that scales with usage but is slightly more expensive ($40/TB/month). Snowflake has a time-based pricing model for compute resources where you’re only charged for the time that your users are actively running queries.
This arrangement works for organizations that have steady and continuous usage patterns, but who also require frequent upscaling and downscaling of their data resources. Companies that provide data as a service are good candidates for Snowflake’s pricing model because their clients need access to that data via an analytics user interface or data APIs.
Download Snowflake’s pricing guide here.
Google BigQuery
The cost of BigQuery is also determined by storage needs and compute usage. Storage costs $20/terabyte/month. If data isn’t used within 90 days, it’s automatically moved into their long-term storage option with a reduced cost of $10/terabyte/month.
Calculating compute usage is slightly more complicated. By default, queries are billed with an on-demand pricing model where the price you pay is determined by the number of users per day, the number of queries, and the average data usage per query. Flat-rate pricing is also available if you want a predictable monthly cost, but this option is best for high-volume users.
This setup is a good option for organizations that run a large number of queries intermittently. For example, occasional complex queries for monthly business reports or ad-hoc analyses from sales and marketing teams.
Learn more about BigQuery’s pricing here.
Amazon Redshift
To determine how much Redshift will cost, you first need to estimate what type and how many “nodes” (aka a collection of computing resources) you’ll need. Here’s how to figure that out:
- For smaller storage needs, Amazon recommends DC2 nodes that cost $0.25/hour to $4.80/hour.
- If you’ll need to store large amounts of data, Amazon recommends RA3 nodes with managed storage. The cost for these nodes ranges from $0.85/hour to $6.80/hour, plus an extra $24/TB/month for storage.
These are the on-demand rates – discounts are available if you sign a longer-term contract.
In general, Redshift is a good fit for organizations that need near-constant computation resources, like live dashboards that require constant refreshing.
Use this calculator to estimate how much Redshift will cost your organization.
Which data warehouse provider should I choose?
As you may have guessed by now, the answer to this question is once again, “It depends.”
The three technologies mentioned above represent a small sampling of the available choices. Other data warehouse options include Microsoft’s Azure Synapse, Single Store, Firebolt, Exasol, and the list goes on and on. You can check out G2’s data warehouse reviews to find even more. We recommend reviewing documentation and requesting a quote for any solution you’re interested in. From there, you can compare, contrast, and consult the data experts on your team (or an outsourced consultant).
And while we can’t tell you which cloud data warehouse solution is the best fit for your company, we can connect you with the resources you need to make important decisions about your company’s data analytics and BI program. Browse the AirOps blog to find more insights, best practices, and real-world advice that you can use to empower data teams and business teams to get more value from data.