How to build your foundational data models

An overview of a modular data modeling process.

July 15, 2022
Kyle

Data modeling refers to the process of applying structure to raw data and making it usable. You can think of it as an abstraction layer that interacts with your raw data and enables you to find business answers within it.

If you want a quick overview of the ins and outs of data modeling (including types of data models, types of data modeling, and why data modeling is important), you can find that information in this post from the AirOps blog archives → What is Data Modeling? It explains the “what is?” part of data modeling, which is a helpful precursor to today’s topic: How to build your company’s foundational data models.

A quick disclaimer on building data models before we get started

Here’s the thing about data modeling: it’s not a rigid concept. Definitions of data modeling vary, and there’s no universally accepted, agreed-upon process that every organization uses.

However, no matter how you define or approach data models, the outcome of an analytics data model is the same: Data modeling takes raw data and helps make it usable for data analytics and business intelligence (BI).

So, without any further ado, let’s explore the data modeling process that we like to use here at AirOps when working with startups, early-stage companies, and other organizations. We’ll review each step in detail, but here’s a (very) simplified rundown of our preferred data modeling process:

  1. Pull data into a data warehouse with an ELT tool
  2. Transform data inside the warehouse (transformation = cleaning, joining, and altering the data so it's usable)
  3. Output tables that contain metrics and dimensions

In reality, there’s a bit more to it than those three steps, though.

Step 0: Set up your modern data stack

Before you can even think about how to model your data, you’ll need a modern data stack. Like many other common data modeling processes, the approach that we’re about to review is powered by a modern data stack that includes an ELT layer and a cloud data warehouse.

While it’s possible to build data models without a modern data stack, this is the method that we use and recommend most often. The following steps assume that you’ve already:

  1. Pulled source data into your data warehouse.
  2. Performed foundational transformations during the ELT process, including renaming columns, changing data types, removing personally identifiable information (PII), and filtering. The end result of this process will be staging tables, which we’ll talk about again in Step 4.

That last part is super important. Taking care of your foundational data transformations as early as possible saves time in the long run and ensures that you’re using uniform, consistent naming conventions and column definitions from here on out.
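
To make this concrete, here’s a minimal sketch of what one of those staging tables might look like as a dbt-style SQL model. The source, model, and column names (zendesk, tickets, requester_email, and so on) are hypothetical – they just stand in for whatever your ELT tool lands in the warehouse:

```sql
-- models/staging/stg_zendesk__tickets.sql (hypothetical example)
-- Rename columns, cast data types, exclude PII, and filter out junk rows
-- so every downstream model works from the same clean staging table.
-- Assumes a "zendesk" source is declared in your sources .yml file.

with source as (

    select * from {{ source('zendesk', 'tickets') }}

),

renamed as (

    select
        id                            as ticket_id,
        cast(created_at as timestamp) as created_at,
        lower(status)                 as ticket_status,
        assignee_id,
        via_channel                   as channel
        -- PII columns such as requester_email are intentionally left out here
    from source
    where status != 'deleted'  -- filter out records we never report on

)

select * from renamed
```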

Step 1: Gather business requirements

Once you have your modern data stack ready to go, you need to gather business requirements and outline how data will be used to gain insights and drive action.

There are a few things to keep in mind during the beginning steps of data modeling. For starters, try to anticipate future questions. Just as early data transformation sets you up for success, so does a future-minded approach. Don’t just think about the requirements you have today – also think about the questions you’ll need to answer in the future. That way, your data team can build models that do so much more than solve a short-term problem or answer a one-off question.

I’m also going to recommend that you take a modular approach to data modeling by thinking about the business requirements of a single function or subject area rather than the company as a whole. Many companies set out to create a monolithic, overarching data model that captures the entire business at once. While that can provide a unique, unified perspective, those types of data models are incredibly difficult to maintain.

Here’s an example of what I mean: Let’s say you want to switch customer support providers from Zendesk to HubSpot’s Service Hub. If you’ve built a massive overarching data model, that single switch in tooling means your entire data model needs to be redone. This approach isn’t very flexible, and it’s a pain to maintain.

However, if you’ve built several individual data models for customer support, marketing, sales, finance, and other functions, or even data models for individual data sources, it’s much easier to combine that data as needed to derive insights.

Step 2: Identify entities and their key attributes

The first drafts of your data model will identify entities and their unique attributes. Entities are real-world objects that can be tangible or intangible. Attributes describe the properties of an entity.

For example, an entity called employee might have attributes that include name, work location, health plan enrollment, and employee ID. Here’s a visual example:

A diagram showing the difference between entities and attributes in a data model.
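
In warehouse terms, an entity usually becomes a table and its attributes become columns. Here’s a minimal sketch of that employee example as a table definition – the table and column names are hypothetical:

```sql
-- A hypothetical "employee" entity expressed as a table:
-- each attribute from the example above becomes a column.
create table dim_employees (
    employee_id            integer,   -- unique identifier attribute
    employee_name          varchar,
    work_location          varchar,
    health_plan_enrollment varchar
);
```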

Step 3: Define relationships between entities and attributes

Next, you’ll map out the relationships between different entities (this is sometimes referred to as a conceptual data model). 

This is the "big picture" of what your data model will contain, how it will be organized, and which rules are involved. It will also serve as the blueprint for building your final data models.
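
Relationships usually show up downstream as keys: one entity’s table carries a reference to another’s. Here’s a hedged sketch, again with hypothetical names, of a many-to-one relationship between support tickets and customers:

```sql
-- Hypothetical example: each ticket belongs to exactly one customer, so the
-- tickets table carries the customer's key. Many cloud warehouses don't
-- enforce these constraints, but they still document the relationship.
create table dim_customers (
    customer_id   integer primary key,
    customer_name varchar
);

create table fct_tickets (
    ticket_id     integer primary key,
    customer_id   integer references dim_customers (customer_id),
    created_at    timestamp,
    ticket_status varchar
);
```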

Step 4: Join tables together and define metrics

You have a data modeling roadmap, so let’s go back to the staging tables that we talked about earlier.

In this step, you’ll take those staging tables, refine them a bit further, and re-deploy them. I like to call this the “intermediate” stage – it’s where you join multiple tables together and define metrics.

Once you have intermediate tables, you’ll join those tables together and draw relationships between them. These join operations enable pulling additional dimensions into fact tables, defining custom calculation logic, and performing aggregations over multiple components of the dataset.

During this step, you’ll notice that the lines between tables start to cross and come together. Your data model will start to look a little something like this:

A data model using data from Zendesk.

This is a model created from data extracted from Zendesk – the purple box on the right represents our final table for ticket_summary_daily. To get to this model, I had to consider all of the key pieces of information relevant to customer support tickets: quantity, number closed, response time, customer satisfaction score, and other important metrics (those are represented in the blue boxes).
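
Here’s a rough sketch of what one of those intermediate models could look like in dbt-style SQL. The model and column names are hypothetical – they’re only meant to show the shape of joining staging tables and defining metrics at the ticket level:

```sql
-- models/intermediate/int_zendesk__ticket_metrics.sql (hypothetical example)
-- Join ticket staging tables and define per-ticket metrics that the final
-- ticket_summary_daily table can aggregate later.

with tickets as (

    select * from {{ ref('stg_zendesk__tickets') }}

),

reply_times as (

    select * from {{ ref('stg_zendesk__reply_times') }}

),

joined as (

    select
        tickets.ticket_id,
        tickets.created_at,
        tickets.ticket_status,
        reply_times.first_reply_time_minutes,
        case when tickets.ticket_status = 'closed' then 1 else 0 end as is_closed
    from tickets
    left join reply_times
        on tickets.ticket_id = reply_times.ticket_id

)

select * from joined
```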

Step 5: Create a metric table/data mart

After the intermediate stage, I sometimes create a metrics table (also called a data mart). Metric tables are the final tables that combine multiple metrics and dimensions (represented by the purple box in the above data model). This makes it easy for the business to understand and use the data for analytics.

In the Zendesk data model example for ticket_summary_daily, we combined various zendesk_ticket_metrics (which includes reply_times and comment_metrics) with a zendesk_ticket_summary table. 

This process is how metric tables are created, and it’s how you get your first useful and understandable tables. 
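
Continuing the hypothetical Zendesk sketch from the previous step, the metrics table simply rolls those ticket-level intermediate models up to a daily grain:

```sql
-- models/marts/support/ticket_summary_daily.sql (hypothetical example)
-- Aggregate the intermediate ticket metrics to one row per day so the
-- business gets an easy-to-query summary table.

select
    cast(created_at as date)      as ticket_date,
    count(*)                      as tickets_created,
    sum(is_closed)                as tickets_closed,
    avg(first_reply_time_minutes) as avg_first_reply_time_minutes
from {{ ref('int_zendesk__ticket_metrics') }}
group by 1
```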

Step 6: Create physical tables using a data modeling tool

Physical tables can now be created and scheduled to automatically update in your database using a data modeling tool or by writing custom scripts. I’ll assume that you plan to use a data modeling tool to make the process easier. There are tons to choose from, but my favorite is dbt.

dbt is great for a lot of reasons, but its package hub is especially useful for building modular data models. If you’re connecting data from any of the sources listed in the hub, dbt offers predefined sets of transformation steps that will build a model for you… or at least get you 90% of the way there.
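
As a quick sketch of both ideas: a config block at the top of a dbt model tells it to build a physical table (rather than a view), and ref() lets your own models build on tables that an installed package has already modeled for you. The model name below is hypothetical – check the package’s documentation for the models it actually ships:

```sql
-- Hypothetical example: materialize a model as a physical table and build it
-- on top of a model provided by a package from the dbt package hub.
{{ config(materialized='table') }}

select *
from {{ ref('zendesk__ticket_metrics') }}  -- model name depends on the package you install
where created_at >= '2022-01-01'           -- add your own filters or logic on top
```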

Bonus step: Data operationalization

Once you have modular data models for subject area-specific use cases, business end-users can access data and combine it in a user-friendly tool to reach insights that span multiple data sources. 

This is called data operationalization, and it’s a term that you’ll hear the AirOps team use frequently (if you haven’t already!).

Data Operationalization: What happens when everyone in an organization can seamlessly access, understand, and build with the data they need in the core operating tools they already know and love, like Salesforce, Zendesk, and HubSpot 🙌.

As an example, I might want to take my ticket_summary_daily model and look at that data in relation to product usage. In the past, I would have needed engineering support to achieve this level of data operationalization.

That’s changed in the last few years, though, thanks to reverse ETL.

Reverse ETL tools make data operationalization possible. They replicate data from a storage system (like your cloud data warehouse, where all of your well-modeled data is stored) to third-party tools and applications that you use for everyday operations, like Salesforce, Zendesk, Google Sheets, Notion, Airtable, and others. 

These tools expand the scope of what people can do with data, especially non-technical end-users. With a reverse ETL solution in your data stack, those nice data models that you just made will be put to good use.

At AirOps, we believe that reverse ETL is the future of work 🔮. That’s why we’re building something exciting that will help analytics teams support and empower business end-users and help them go from insights to action. If that sounds like something your company could use, click here to sign up for email updates and we’ll keep you in the loop.

The link between data modeling and data analytics

The answers that you get out of your analytics are only going to be as good as the quality of data that you feed into the system. High-quality data models ensure that your data analytics and BI program runs off clean, well-structured data that’s designed to answer important business questions, now and well into the future. You’ll be closer to solving the data adoption problem, too. 

Just remember: There are many, many ways to build a data model. You don’t necessarily have to build your own models in this exact manner – this is simply how I like to approach the process.