Building Your Skytrax Data Warehouse: A Comprehensive Guide

Mar 30, 2022 | Programming

Welcome to the world of data storage and analysis! In this how-to article, we’ll navigate the intricate pathways of setting up a full data warehouse infrastructure using technologies like Docker, Apache Airflow, AWS Redshift, and Metabase. Buckle up, as we unravel how to orchestrate your data like a maestro directing a symphony!

What You’ll Need

  • An AWS account
  • Docker installed on your machine
  • Basic understanding of SQL and data pipelines

Understanding the Architecture

Imagine our data warehouse as a bustling airport. Here’s a breakdown of its core modules:

  • Apache Airflow: The air traffic controller, ensuring smooth data orchestrations.
  • AWS Redshift: The hangar where our data safely resides.
  • Metabase: The dashboard where travelers (data analysts) check flight statuses (data visualizations).
  • Docker: The container that keeps our airport facilities organized and functional.

Data Acquisition and ETL Overview

The journey starts with gathering the Skytrax reviews data (linked from the project repository). This data makes an initial landing on your local disk before taking a flight to the ‘Landing Bucket’ on AWS S3. Our ETL jobs, written in SQL and scheduled in Airflow to run every hour, keep this data fresh and up to date.
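To keep hourly runs from overwriting each other, a common approach is to partition the landing prefix by run hour. Here is a minimal sketch of how such a key could be built; the bucket name, `landing/` prefix, and file layout are my assumptions, not taken from the project itself:

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

# Hypothetical bucket name -- adjust to your own S3 setup.
LANDING_BUCKET = "skytrax-landing"

def landing_key(table, run_time):
    """Build an hourly-partitioned S3 key so each scheduled run
    writes to its own prefix instead of clobbering the last one."""
    return str(PurePosixPath(
        "landing", table,
        run_time.strftime("%Y/%m/%d/%H"),
        f"{table}.csv",
    ))

key = landing_key("ratings", datetime(2022, 3, 30, 13, tzinfo=timezone.utc))
print(key)  # landing/ratings/2022/03/30/13/ratings.csv
```

In the actual pipeline you would pass a key like this, together with the bucket name, to an S3 upload call such as boto3's `upload_file` from inside the Airflow task.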

Data Modeling

We categorize our data into:

Dimension Tables:

  • aircrafts
  • airlines
  • passengers
  • airports
  • lounges

Fact Tables:

  • fact_ratings

ETL Flow: Connecting the Dots

Now, let’s delve into our ETL flow, likening it to a series of checkpoints in our airport:

  • The collected Skytrax data is ushered to the landing zone (S3 buckets).
  • The ETL job safely transfers data from this zone to the staging area in Redshift.
  • A task in Airflow kicks in to transform the data, ensuring it’s flight-ready.
  • Dimensional and fact tables in our Data Warehouse are then updated using an UPSERT operation.
  • Data quality checks are conducted to ensure every piece of information is accurate and on point.
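The UPSERT checkpoint deserves a closer look. The standard pattern AWS documents for Redshift is delete-then-insert via a staging table, wrapped in a transaction so readers never see a half-applied update. Here is a sketch that builds such a statement; the table and key names are hypothetical stand-ins for whatever your staging step produces:

```python
def upsert_sql(target, staging, key):
    """Classic Redshift upsert: delete the target rows that are about
    to be replaced, insert everything from staging, then empty the
    staging table -- all inside a single transaction."""
    return f"""
BEGIN;
DELETE FROM {target}
USING {staging}
WHERE {target}.{key} = {staging}.{key};
INSERT INTO {target} SELECT * FROM {staging};
TRUNCATE {staging};
COMMIT;
""".strip()

sql = upsert_sql("fact_ratings", "staging_ratings", "rating_id")
print(sql)
```

An Airflow task would then hand this SQL to the Redshift connection to execute as one of the hourly checkpoints above.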

Environment Setup: Preparing for Takeoff

Hardware Configuration

For Redshift, we’ve utilized a 2-node cluster with dc2.large instance types.

Setting Up the Infrastructure

Follow these steps to set up your local infrastructure:

  1. Run git clone https://github.com/iam-mhaseeb/Skytrax-Data-Warehouse
  2. Navigate into the directory with cd Skytrax-Data-Warehouse
  3. With Docker service running, execute docker-compose up. This may take some time as it pulls the latest images and installs everything automatically.

Configuring Redshift

To run a Redshift cluster, please follow the detailed cluster-creation guide in the official AWS documentation.

How to Run Your Infrastructure

Running Airflow

  • Ensure Docker containers are up and running.
  • Access the Airflow UI by navigating to http://localhost:8080 in your browser.
  • Set up the required connections, and you should see the skytrax_etl_pipeline DAG at your disposal.

Explore different views of the DAG to visualize your data flow!
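If you prefer not to click through the UI, Airflow also reads connections from environment variables named `AIRFLOW_CONN_<CONN_ID>`. A minimal sketch of building such a connection URI for Redshift follows; the connection id, credentials, and cluster endpoint are placeholder assumptions:

```python
import os
from urllib.parse import quote

# Hypothetical credentials -- substitute your own cluster endpoint.
user, password = "awsuser", "s3cret/pass"
host, port, db = "example-cluster.abc123.us-east-1.redshift.amazonaws.com", 5439, "skytrax"

# Special characters in the password must be percent-encoded,
# or the URI parser will mangle the connection.
uri = f"postgres://{user}:{quote(password, safe='')}@{host}:{port}/{db}"
os.environ["AIRFLOW_CONN_REDSHIFT"] = uri
print(uri)
```

Setting this variable in the `docker-compose` environment makes the connection available to every Airflow container without any manual UI setup.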

Accessing Metabase

  • Ensure Docker containers are still running.
  • Open the Metabase UI by heading to http://localhost:3000, then set up your Metabase account and connect to the database.
  • Once your DAG has run successfully, you can start playing around with your data!

Troubleshooting Potential Issues

If you encounter issues during setup or execution, here are some troubleshooting ideas:

  • Ensure Docker is running properly and no containers are failing.
  • Check your Airflow logs for any errors in the DAG task execution.
  • Verify your AWS configuration and permissions if there are issues with the data pipeline.
  • If specific transformations are failing, review your SQL queries for any potential syntax errors.
  • For any persistent issues, consider reaching out for community insights.

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Final Thoughts

This Skytrax Data Warehouse is an exemplary demonstration of leveraging technology to manage and visualize an immense amount of data efficiently. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
