Data Demystified, Part 3: How to build a modern data stack
Part 3: The (non-technical) business leader’s guide to kicking off an effective data analytics & BI program.
When companies first start thinking about building out their data analytics infrastructure, they’re generally already using some SaaS tools, such as Salesforce or Shopify. That means data is already being generated in data stores within those tools. Many of those tools also have their own business intelligence capabilities, even if they don’t provide the whole picture. Additionally, it’s not uncommon for early-stage companies to track and store data in Google Sheets or Excel.
In this scenario, you have multiple sources of data that need to be centralized and combined somewhere. Oftentimes, that means downloading CSV files, combining everything in a spreadsheet, and creating static one-time models. This is a bad approach because:
- Data is siloed. As you’ll see below, most companies use a dizzying array of SaaS tools for their daily operations. Different teams use different tools, too. This leads to a lot of siloed data that can’t be accessed, analyzed, or used to advance the business.
- Data isn’t trustworthy. Do you know where the most recent version is? Is data latency (the time it takes for your data to become available in your database or data warehouse after an event occurs) high? How about data fidelity – has your data been vetted for accuracy and reliability?
- It’s easier to make mistakes. Manual processes mean more opportunities for errors, like people working from an outdated Google Sheet with financial metrics that don't match the most recent data collected in Stripe.
- It takes a ton of extra time. The right data stack reduces the need for monotonous work and repetitive data-related tasks. Since everyone in the organization is working from the same system, collaboration between teams and users gets much faster, too.
- It’s not scalable. A disjointed, siloed data system can’t grow as your data volume, analyses, and overall complexity increase.
The solution? A modern data stack that can grow with your business, your team, and the needs of your customers (both internal and external). Internally, the right data stack will help your organization’s business users get more out of data. A data stack will benefit your external customers, too, especially if they rely on data from your company in their own operations. For example, a property management company’s homeowner investor customers benefit from having access to data on property performance.
If you’re ready for a deeper dive, keep reading to learn everything you need to know about building a modern data stack that will grow with your business. Here are the questions we’ll answer:
- What is a modern data stack? When does my company need one?
- What are the parts of a modern data stack?
- How do I set up my data stack?
- How much does a modern data stack cost?
What is a modern data stack? When does my company need one?
Terms like “data stack” and “technology stack” are often used haphazardly, but they do have specific meanings. A data stack refers to the set of technologies and services (the “layers”) that organizations use to store, manage, access, analyze, and operationalize data.
As a general rule of thumb, if your organization is asking data questions in more than one place, it’s a good time to build out a data analytics infrastructure.
For example, if an ecommerce company handles all of its advertising in Google Analytics and Google AdWords, all of the advertising-related data it needs lives in one place, hurrah! Any questions someone might have are easily answered in a single platform. However, if a company is advertising on Facebook, LinkedIn, Google, and elsewhere, it needs to check multiple sources to answer basic questions about the overall performance of its ad campaigns. In this scenario, the right data infrastructure will make it possible to get a holistic view of advertising performance… no time-consuming, redundant processes needed.
With the right data stack, the right team, and the right operating framework, people within your company will have access to actionable data that allows them to make informed decisions and drive results within the business. We’ll share a step-by-step process that you can use to build your own data stack in more detail below, but here’s a quick rundown of what you need to do:
- Choose a data warehouse to centralize your data. We recommend a cloud data warehouse like Snowflake, Amazon Redshift, or Google BigQuery.
- Choose a data ingestion tool to move data into your data warehouse. We like Fivetran or Airbyte for this part.
- Transform the data with a transformation and orchestration tool. We recommend using dbt and Apache Airflow.
- Choose a data analytics/business intelligence (BI) tool that can visualize the data. There’s no shortage of options, so compare tools on a review site like G2 to find the right one for your needs. A word of caution, though: This is an area where costs can quickly balloon. We often recommend that companies start with a free tool like Google Data Studio, as long as it fits their needs. You can always upgrade later.
- Make your data operational via a reverse ETL solution that moves data from your data warehouse back into your business’s favorite SaaS tools.
Before we take a more detailed look at the “How” of building a modern data stack, let’s define each of the layers of the stack.
What are the parts of a modern data stack?
Modern data stacks are generally built on cloud-based services. Here’s a basic overview of the different parts within the stack and what they do:
Data Source
Aka where the data comes from. There are many potential data sources, like a production database that contains data about the product your company sells, systems that automatically generate logs and reports, and other tools that the business uses for daily operations, like third-party SaaS applications.
Most companies rely on a growing number of SaaS tools in their daily operations. According to data from Productiv, the average company’s SaaS app portfolio contains 254 apps. On a functional level, their research also found that most teams use 40-60 SaaS apps on average. That’s a lot of data sitting in silos that you could be using to improve the business.
Data Ingestion
Data ingestion tools collect data from your sources, process it, and move it into a data warehouse. As a result, clean data flows between your data sources and your data storage system.
There are two types of data ingestion methods that you can use to transfer data from a data source to a data warehouse:
- ETL (Extract, Transform, Load): Data is extracted from a source, transformed on a secondary processing server, and loaded into a data storage system. This is the more traditional process, with the main downside being a slower data ingestion process because the data is transformed on a separate server before it’s loaded into the storage system.
- ELT (Extract, Load, Transform): Data is extracted from a source, loaded into a storage system, and transformed inside the storage system. This process came on the scene with the evolution of cloud computing in the 2000s. Since the data is loaded straight into your warehouse and transformed there, with no stop on a separate processing server, the data ingestion process is faster with ELT.
Services like Fivetran and Airbyte specialize in this movement of data. Both tools use ELT and make it easy to transport raw data from production sources (like your favorite SaaS tools) into a data storage solution.
It’s worth noting that ELT has become the common standard for data ingestion, because companies already store their data in the cloud and modern data warehouses are powerful enough to do transformations at scale. ELT also integrates seamlessly with popular transformation tools like dbt.
Don’t discount ETL entirely, though. It's the right solution for some companies. For example, when you transform data before it’s loaded, it’s easier to remove personally identifiable information (PII), which can be helpful for companies that need to meet HIPAA (Health Insurance Portability and Accountability Act) compliance standards.
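To make the difference concrete, here’s a minimal, runnable sketch in Python. It’s purely illustrative: Python’s built-in sqlite3 stands in for the cloud warehouse, a list of dicts stands in for rows pulled from a SaaS source, and every table and column name is made up. The only thing worth noticing is where the transformation step runs.

```python
import sqlite3

def etl_sync(db, source_rows):
    # ETL: transform on a separate processing step first, then load the result.
    transformed = [
        (r["id"], r["amount"] / 100.0)   # e.g. convert cents to dollars
        for r in source_rows
        if r["amount"] > 0               # e.g. drop refunds
    ]
    db.executemany("INSERT INTO orders VALUES (?, ?)", transformed)

def elt_sync(db, source_rows):
    # ELT: load the raw data as-is, then transform inside the warehouse.
    db.executemany(
        "INSERT INTO raw_orders VALUES (?, ?)",
        [(r["id"], r["amount"]) for r in source_rows],
    )
    db.execute(
        "INSERT INTO orders "
        "SELECT id, amount / 100.0 FROM raw_orders WHERE amount > 0"
    )

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE raw_orders (id INTEGER, amount INTEGER)")
db.execute("CREATE TABLE orders (id INTEGER, amount_dollars REAL)")
elt_sync(db, [{"id": 1, "amount": 1999}, {"id": 2, "amount": -500}])
print(db.execute("SELECT * FROM orders").fetchall())  # -> [(1, 19.99)]
```

Either pattern ends with the same rows in the warehouse; the trade-offs above (speed, PII handling) come down to where that middle step runs.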
Data Storage
All the data coming in from your data sources is aggregated into a single storage location. Think of this part as the center of your business’s data program.
This is usually a data warehouse. Three of the most popular options for early-stage companies are Snowflake, Amazon Redshift, and Google BigQuery.
Data Transformation and Modeling
Data transformation and modeling tools take the data that’s stored in your data warehouse and convert it into a format that allows business value to be extracted. Those models ensure that business users in the organization use consistent, reliable, and accurate information in their analyses.
A good modeling layer is critical to the success of your data analytics and business intelligence program – before you can use data for analysis it needs to be modeled into something usable.
Well-known examples of data transformation, orchestration, and semantic modeling tools include dbt, Apache Airflow, and Looker (specifically LookML). It’s not just about choosing the right tool, though, and we’ll explain why below.
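To give you a flavor of the orchestration piece, here’s a hedged sketch of an Apache Airflow DAG. Airflow pipelines are defined in Python, and a common pattern is a scheduled job that rebuilds a dbt project and then runs dbt’s tests on the results. The schedule, project path, and task names below are placeholders, not recommendations.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A nightly pipeline: rebuild the dbt models, then run dbt's tests so bad
# data is caught before anyone builds a dashboard on top of it.
with DAG(
    dag_id="nightly_transformations",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",  # run once per day
    catchup=False,               # don't backfill missed runs
) as dag:
    run_models = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/analytics",  # path is hypothetical
    )
    test_models = BashOperator(
        task_id="dbt_test",
        bash_command="dbt test --project-dir /opt/analytics",
    )
    run_models >> test_models    # tests only run after the models rebuild
```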
Data Analytics and Business Intelligence
You might also know this part as “data visualization.” The data analytics component of the modern data stack allows users to derive insights from their data, often through visualizations like graphs, charts, tables, and dashboards.
This is where the data that’s been collected, structured, and modeled becomes actionable information that people can use to make more informed business decisions.
There are loads of BI tools that you can use to help facilitate data analytics. However, we’ve noticed an interesting trend: The “final mile” for data analytics and BI usually takes you to business tools such as GSuite (Sheets, Docs, Slides), MS Office, Notion/Coda, or Airtable.
This is an important consideration – BI tools don’t always offer users an easy, straightforward way to take their data the final mile into the user-friendly tools that they prefer. Keep this in mind when choosing a business intelligence and data analytics tool.
Data Operationalization
This part of the data stack goes by a few names, including data activation and reverse ETL (Extract, Transform, Load). Semantics aside, the end goal is similar: data operationalization, which is the process of moving data from the data warehouse back into third-party business tools that make the data actionable.
There are a few flavors of use cases for reverse ETL. One common use case is moving larger volumes of data, such as customer and product data, from a warehouse back into SaaS tools like a CRM. This is a more advanced use case, often led by Data / Data Engineering teams. Tools like Hightouch and Census are often recommended here.
At the same time, there is an increasing opportunity for business end-users to take advantage of reverse ETL technology. They can use reverse ETL to help drive their business data into the core operating tools and documents they use everyday to run their teams and execute their workflows. Ultimately, this empowers them to move the numbers they need to move.
This should be a goal for all data-driven companies: Everyone in the organization can seamlessly access, understand, and build with the data they need in the operating tools they already know and love, like Excel, GSuite, Notion, and even certain BI tools.
Increasingly, low- and no-code tools that allow anyone in the business to explore and use data are part of the modern data stack. At AirOps, we believe that this move toward self-service data analysis, where even non-technical employees can build analyses, workflows, and internal tools in a safe, scalable way, represents the future of work.
If that sounds like something that would help unlock the power of data in your organization, click here to make sure you receive our latest updates.
How do I set up my data stack?
Now that you understand the components of a modern data stack, let’s review the actual step-by-step process you’ll take to get your data stack up and running.
Step 1: Identify your data sources
First, figure out where your data is coming from. Make sure you have a global view of every data source before moving on to the next step.
Your end-users probably have preferred SaaS tools that they rely on to do their work. Teams might also be assessing new SaaS tools to add to the operations portfolio. Either way, here are some things to take note of when identifying and/or choosing your data sources:
- What data do you want to get out of the tool?
- Does it have an API that allows you to get data out of the tool? APIs allow different applications to “talk” to each other, and they’re a common feature of many SaaS tools. Tools that have an API are ideal, but it’s not a universal feature. Your company may use tools that lack one. (For a feel of what pulling data out of a tool via its API looks like, see the sketch after this list.)
- Is the tool’s API supported by an out-of-the-box connector/ETL? (We’ll talk about this more in Step #3.)
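Here’s a hedged Python sketch of what extracting data from a SaaS tool’s REST API typically looks like. The base URL, endpoint, pagination scheme, and field names are all hypothetical; every tool’s API docs spell out its own versions of these, along with authentication and rate limits.

```python
import requests

BASE_URL = "https://api.example-saas.com/v2"  # hypothetical endpoint
API_KEY = "your-api-key"                      # in practice, load from a secrets manager

def fetch_all_invoices():
    """Page through the (hypothetical) /invoices endpoint until it's empty."""
    invoices, page = [], 1
    while True:
        resp = requests.get(
            f"{BASE_URL}/invoices",
            headers={"Authorization": f"Bearer {API_KEY}"},
            params={"page": page, "per_page": 100},
        )
        resp.raise_for_status()  # fail loudly on auth or rate-limit errors
        batch = resp.json()["results"]
        if not batch:
            break
        invoices.extend(batch)
        page += 1
    return invoices
```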
Step 2: Choose a data warehouse
Next, choose the data warehouse that will store all of your organization’s data.
The big players in the cloud-based data warehouse space are Snowflake, Amazon Redshift, and Google BigQuery. Every tool has different features, benefits, and tech specs. The best data warehouse will depend on your company’s specific needs and you should consider factors like:
- Security
- Volume of data
- The location of the data warehouse (Some countries have restrictions on the types of data that can be stored outside of the country – learn more about how to navigate possible geographic limitations in the first Redshift vs BigQuery vs Snowflake article linked after this list.)
- Scalability (Consider how your database will scale with the growth of your organization’s data analysis demands. The articles linked below each have information you can use to assess whether a data warehouse is likely to meet your needs.)
- The technical engineering resources you have access to for setup and ongoing maintenance
- The company’s long-term BI and data analytics goals (Will you need to build complex machine-learning models with your data, for example?)
- The pricing structure
Selecting a data warehouse is a critical step, so don’t rush it. Here are two resources you can use to kickstart your vetting process:
- Redshift vs BigQuery vs Snowflake: A comparison of the most popular data warehouse for data-driven digital transformation and data analytics within enterprises
- Snowflake vs Redshift vs BigQuery: 16 Critical Differences
Step 3: Select a data ingestion tool
Once you have a data warehouse, you need to load data into it from all of the data sources that you identified in the first step. This happens via a data ingestion tool that handles the ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) process.
There are several different tools you can use that funnel data from various systems and store it in your data warehouse. We mentioned Fivetran and Airbyte earlier – they’re solid choices that integrate with many commonly used third-party SaaS applications.
When comparing data ingestion tools, consider things like pricing structure, the number of integrations available (including whether those integrations support the tools your organization uses), and whether the tool is open-source.
Open-source data ingestion solutions can be a good choice for organizations that use SaaS tools that aren’t covered by out-of-the-box connectors: They use a community-driven development approach where users can build connections between third-party tools and the data ingestion tool, then share them with other users.
Step 4: Decide how to model the data
Data modeling happens in the data transformation layer of your stack, but the process isn’t simple or straightforward. For this part, you’ll probably want to work with someone adept in SQL, such as a data engineer, analytics engineer, or data analyst with experience using modeling tools such as dbt.
That’s because proper data modeling requires a balance of technical skills and business acumen. We’ve noticed that more and more companies opt to hire an Analytics Engineer who understands both the tooling and the business to implement this part of the modern data stack. These experts know how to use tools like dbt, which makes it possible to model and transform data directly inside the data warehouse with only a knowledge of SQL (no full data engineering skill set required).
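To make that concrete, here’s a hedged sketch of the kind of business logic a data model encodes. In a real stack, SQL like this would live in a dbt model file and run inside Snowflake, Redshift, or BigQuery; here, Python’s built-in sqlite3 stands in for the warehouse so the example is self-contained, and the table, columns, and numbers are made up.

```python
import sqlite3

# The "model": raw payment events in, a clean monthly revenue table out.
# Encoding rules like "refunds don't count" in one place means every team
# that queries monthly_revenue counts revenue the same way.
MODEL_SQL = """
CREATE TABLE monthly_revenue AS
SELECT
    strftime('%Y-%m', created_at) AS month,
    SUM(amount) / 100.0           AS revenue_dollars
FROM raw_payments
WHERE status = 'succeeded'
GROUP BY 1
"""

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE raw_payments (created_at TEXT, amount INTEGER, status TEXT)")
db.executemany(
    "INSERT INTO raw_payments VALUES (?, ?, ?)",
    [
        ("2023-01-05", 1999, "succeeded"),
        ("2023-01-20", 4999, "succeeded"),
        ("2023-01-22", 4999, "refunded"),  # excluded by the model
    ],
)
db.execute(MODEL_SQL)
print(db.execute("SELECT * FROM monthly_revenue").fetchall())
# -> [('2023-01', 69.98)]
```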
In an upcoming blog, we’ll share a more detailed breakdown of data modeling, why it’s important, and a primer on how to set up your company’s foundational data models.
Step 5: Determine your data analytics process and tooling
Once your data has been ingested, stored, and modeled, teams can start analyzing it.
There’s no shortage of data analytics and business intelligence tools. As we mentioned earlier, you can use a site like G2 to compare tools and find the best fit for your organization and its users. We also created a guide to help startups choose the best BI tool: How to Choose a BI tool.
If you’re just getting started, however, consider a free tool like Google Data Studio. It's more than enough in the early days.
Here’s another thing we touched on earlier: A lot of data analysis and BI work happens in core operating tools like Excel, MS Office, and Google Sheets. This is even true for companies that have made significant investments in visualization tools.
We believe this is a good thing, and it ties into the last step of setting up your modern data stack.
Step 6: Choose a tool to make your data actionable
The final step is data activation (also known as data operationalization).
There are a few ways to activate your data. One option is a process called “reverse ETL” and here’s how it works: Data is extracted from your data warehouse, transformed inside the warehouse to become usable inside of a third-party system, and then loaded into the third-party system. This makes it easier for end-users to act on their data, without time-consuming manual processes or the need to have multiple tools open in multiple tabs.
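As a hedged Python sketch, the pattern looks something like this: query a modeled table in the warehouse, then push each row into a SaaS tool. The CRM endpoint, payload fields, and table name are hypothetical (sqlite3 again stands in for the warehouse), and real reverse ETL tools add batching, retries, and change detection that this sketch skips.

```python
import sqlite3

import requests

CRM_URL = "https://api.example-crm.com/v1/contacts"  # hypothetical endpoint
API_KEY = "your-api-key"

def sync_qualified_leads(warehouse: sqlite3.Connection):
    # "Extract" from the warehouse: read a table the modeling layer built.
    rows = warehouse.execute(
        "SELECT email, lifetime_value FROM qualified_leads"
    ).fetchall()
    # "Load" into the SaaS tool: one API call per row, shaped for the CRM.
    for email, ltv in rows:
        resp = requests.post(
            CRM_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"email": email, "custom_fields": {"lifetime_value": ltv}},
        )
        resp.raise_for_status()
```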
Here are some examples of reverse ETL in action:
- Automatically add email subscribers who downloaded a free resource into Salesforce
- Combine data from customer support, sales, and product into HubSpot to develop personalized marketing campaigns
- Link data from internal support channels into Zendesk to provide more effective customer service
- Get real-time sales forecasts in Google Sheets
- Streamline data analysis throughout the entire company by automating manual data processes, like pulling and importing data from CSV files
Traditionally, each of the above examples could only happen with the support of data and engineering teams.
Nowadays, reverse ETL tools make it possible for an organization’s non-technical end-users to self-serve their data needs. They do this through no-code tools that empower and encourage anyone to build the analyses, workflows, and data tools they need in a secure, scalable way.
At AirOps, we’re big believers in the self-service data revolution and we’re hard at work on something very exciting that will help more people unlock the power of data in their organizations. If you want to be the first to know about our upcoming product launch, click here to subscribe to our email list and receive the latest updates.
How much does a modern data stack cost?
Many variables will influence the total cost of your data stack (and the overall cost of your data analytics and BI program in general), including:
- The volume of data
- Whether you need to hire an in-house data team
- The tools that you choose (the cost of BI tools, in particular, can add up quickly)
- How many seat licenses you’ll need for different tools
- How complex your schema needs to be to get the metrics you want to report on
This is a significant investment. You can expect that your data stack will be a fairly large line item in your budget. Early-stage companies that have raised a Series A regularly spend 10-15% of their post-raise operating budgets on their data stack, including the tools, teams, and services required to operate that stack effectively. Unfortunately, these costs tend to balloon, and it's not uncommon for companies to see negative ROI from their data analytics and BI functions. Careful planning can help you avoid that outcome and turn your data stack into a value-generating asset.
This is the third article in our How to Run Business Intelligence and Data Analytics series, where you’ll get a clear roadmap to follow throughout your journey to data program excellence. If you missed the first two parts, be sure to check them out:
- Data Demystified, Part 1: Getting Started With Data Analytics & Business Intelligence
- Data Demystified, Part 2: Setting Your Data Strategy
We’ll continue to demystify all things data-related in upcoming blogs, so keep an eye on this space!