
Final Project for Data Engineering Zoomcamp Course 2024 πŸ§™πŸ”₯

This is the final project for the Data Engineering Zoomcamp by DataTalks.Club πŸš€πŸ˜€

Table of Contents

  • Introduction
  • Problem description
  • Dashboard
  • Technologies
  • Architecture diagram
  • Data Source
  • Reproducibility
  • Future Improvements
  • Credits
  • Feedback

Introduction

Overview

Movies and TV shows on streaming platforms πŸŽ₯🍿

I really enjoy watching movies and TV shows; I recently watched Dune: Part Two and The Gentlemen on Netflix, and they were fantastic!

For my final project as part of the Data Engineering Zoomcamp 2024 program, I have selected two TMDB (The Movie Database) datasets from Kaggle:

  • The Full TMDb TV Shows Dataset for 2024
  • The Full TMDb Movies Dataset for 2024

These datasets provide comprehensive information about various movies and TV shows, including ratings, genres, cast, and more.
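
The data is pulled directly from Kaggle. Below is a minimal sketch using the official `kaggle` Python package; the dataset slugs are assumptions, so check the exact slugs on the Kaggle dataset pages, and note that the package expects an API token in `~/.kaggle/kaggle.json`.

```python
import kaggle  # authenticates via ~/.kaggle/kaggle.json on import

# Hypothetical slugs -- verify against the actual Kaggle dataset pages.
DATASETS = [
    'asaniczka/full-tmdb-tv-shows-dataset-2023-150k-shows',
    'asaniczka/tmdb-movies-dataset-2023-930k-movies',
]

for slug in DATASETS:
    # Download and unzip each dataset's CSV files into ./data/
    kaggle.api.dataset_download_files(slug, path='data', unzip=True)
```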

I will be focusing my analysis on these platforms:

  • Netflix
  • Amazon Prime
  • Disney+
  • Hulu

Problem description

  1. Platform Content Overview: Analysis of TV shows and movies available on each platform.
  2. Decadal Distribution of Releases: Examination of the distribution of TV shows and movies by their release years over the last ten years, categorized by platform.
  3. Genre Dominance by Platform: Identification of the top five prevalent genres on each platform (a query sketch for this follows the list).
  4. Episodic Excellence: Ranking the top 10 TV shows on each platform based on the number of episodes.
  5. Film Length Leaders: Ranking the top 10 movies on each platform by their runtime duration.
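
To make question 3 concrete, here is a hedged sketch of a top-five-genres-per-platform query run through the BigQuery Python client. The table and column names (`platform`, and `genres` as a comma-separated string) are assumptions, not the project's actual dbt schema.

```python
from google.cloud import bigquery

client = bigquery.Client()

QUERY = """
SELECT
  platform,
  genre,
  COUNT(*) AS title_count
FROM `my-project.movies_tv.core_titles`           -- hypothetical core model
CROSS JOIN UNNEST(SPLIT(genres, ', ')) AS genre   -- genres stored as 'Drama, Comedy, ...'
GROUP BY platform, genre
QUALIFY ROW_NUMBER() OVER (PARTITION BY platform ORDER BY COUNT(*) DESC) <= 5
ORDER BY platform, title_count DESC
"""

for row in client.query(QUERY).result():
    print(row.platform, row.genre, row.title_count)
```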

Dashboard

You can view the dashboard here.

Note

After completing the course, I deleted all the data, so there is nothing left on the live dashboard.

It is also saved as a static PDF here.

Technologies

The selected technologies for this project include:

  • Cloud: Google Cloud Platform (GCP)
  • Data Lake (DL): Google Cloud Storage (GCS)
  • Data Warehouse (DWH): BigQuery
  • Infrastructure as Code (IaC): Terraform
  • Workflow Orchestration: Mage AI
  • Data Transformation: dbt (data build tool)
  • Data Visualization: Looker Studio

Architecture diagram

Note

For partitioning, tables larger than 100 GB are recommended, and for clustering, tables larger than 10 GB. Smaller tables are filtered out because the optimization benefit is smaller and less predictable. Since the volume of data in this project is relatively small, these optimizations are not applied here. Here is the source.
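
For reference, a minimal sketch of how a date-partitioned, clustered table could be declared with the BigQuery Python client; the table id, schema, and column choices are illustrative, not this project's actual setup.

```python
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    'my-project.movies_tv.movies_partitioned',  # hypothetical table id
    schema=[
        bigquery.SchemaField('title', 'STRING'),
        bigquery.SchemaField('platform', 'STRING'),
        bigquery.SchemaField('release_date', 'DATE'),
    ],
)
# Partition by release date (daily by default) and cluster by platform
# so that filtered scans read less data.
table.time_partitioning = bigquery.TimePartitioning(field='release_date')
table.clustering_fields = ['platform']

client.create_table(table)
```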

A few words about how the pipeline operates in Mage AI. Data is extracted from the Kaggle dataset and stored in the data directory on the server, where it is then archived. In the next block, the data is saved in Parquet format and uploaded to GCS (a sketch of this step follows below). The external tables and the staging- and core-level models are created within the dbt blocks. Once all upstream processes complete successfully, the data directory is removed. I used Mage's Dynamic Blocks to take the output of one block and dynamically create more blocks from that information.
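
A simplified sketch of the Parquet-to-GCS step, written as a Mage AI data exporter block. The bucket and object names are assumptions, and the real pipeline also archives the raw files and hands off to the dbt blocks afterwards.

```python
import pandas as pd
from google.cloud import storage

# Standard Mage template guard: the decorator is injected into the block's
# global scope when run inside Mage, so only import it if it's missing.
if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_to_gcs(df: pd.DataFrame, **kwargs) -> None:
    # Write the frame to a local Parquet file (requires pyarrow or
    # fastparquet), then upload it to the data lake bucket.
    local_path = 'data/movies.parquet'
    df.to_parquet(local_path)

    bucket = storage.Client().bucket('movies-tv-shows-datalake')  # hypothetical bucket
    bucket.blob('raw/movies.parquet').upload_from_filename(local_path)
```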

Data Source

The datasets were taken from Kaggle:

  • The Full TMDb TV Shows Dataset for 2024
  • The Full TMDb Movies Dataset for 2024

Reproducibility

How to Recreate the Project?

Recreating the project is a simple process and should only take about 15 minutes. You can find a detailed tutorial on how to do this here.

Future Improvements

  • Create tests for all code and SQL

Credits

A special thank you to the instructors, who provided guidance and support throughout the course. Their expertise and insights were incredibly valuable in developing the Movies and TV Shows project. The course taught me many useful skills and techniques, which have greatly enhanced my knowledge as a data engineer.

Thank you for organizing such a comprehensive and interesting course experience! πŸ™

Feedback

Feedback received after the verification (peer review) process in the course:

  • Looks nice!
  • The benefit of choosing a large dataset is that partitioning and clustering actually makes sense, you are the only person I evaluated that actually did this step! The steps to reproduce are clear, and the tiles are beautiful. It would be nice if you included more information on the steps you took, ie, information on data transformation with dbt, maybe a screenshot of your DAG, etc.
  • Good command of python and detailedly and logically segregated scripts. Seems to not have up-to-date data to be fetched regularly via batch? Would be cool to have that!
