This is the final project for the Data Engineering Zoomcamp by DataTalks.Club.
Movies and TV shows on streaming platforms
I really enjoy watching movies and TV shows; I recently watched Dune: Part Two and The Gentlemen on Netflix, and they were fantastic!
For my final project in the Data Engineering Zoomcamp 2024 program, I selected two TMDb (The Movie Database) datasets from Kaggle:
- The Full TMDb TV Shows Dataset for 2024
- The Full TMDb Movies Dataset for 2024
These datasets provide comprehensive information about various movies and TV shows, including ratings, genres, cast, and more.
I will be focusing my analysis on these platforms:
- Netflix
- Amazon Prime
- Disney+
- Hulu
The analysis covers five areas; a pandas sketch of the genre analysis follows this list.
- Platform Content Overview: Analysis of the TV shows and movies available on each platform.
- Decadal Distribution of Releases: Examination of the distribution of TV shows and movies by their release years over the last ten years, categorized by platform.
- Genre Dominance by Platform: Identification of the top five prevalent genres on each platform.
- Episodic Excellence: Ranking the top 10 TV shows on each platform based on the number of episodes.
- Film Length Leaders: Ranking the top 10 movies on each platform by their runtime duration.
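As a taste of what these analyses involve, here is a minimal pandas sketch of the genre analysis. The column names (`genres`, `platform`) and the comma-separated genre format are assumptions for illustration; the real models are built with dbt on BigQuery.

```python
import pandas as pd

def top_genres_per_platform(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Return the n most frequent genres for each streaming platform."""
    exploded = (
        df.assign(genre=df["genres"].str.split(", "))  # assumed genre format
          .explode("genre")
          .dropna(subset=["genre"])
    )
    counts = (
        exploded.groupby(["platform", "genre"])
                .size()
                .reset_index(name="titles")
    )
    # Rank genres within each platform and keep the top n.
    counts["rank"] = counts.groupby("platform")["titles"].rank(
        method="first", ascending=False
    )
    return counts[counts["rank"] <= n].sort_values(["platform", "rank"])
```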
You can view the dashboard here.
Note
After completing the course, I deleted all the data, so the dashboard no longer displays anything. A static PDF copy is saved here.
The selected technologies for this project include:
- Cloud: Google Cloud Platform (GCP)
- Data Lake (DL): Google Cloud Storage (GCS)
- Data Warehouse (DWH): BigQuery
- Infrastructure as Code (IaC): Terraform
- Workflow Orchestration: Mage AI
- Data Transformation: dbt (data build tool)
- Data Visualization: Looker Studio
Note
For partitioning, tables larger than 100 GB are chosen; for clustering, tables larger than 10 GB. Smaller tables are filtered out because the optimization benefit for them is smaller and less predictable, so this project leaves them out, given that the volume of data involved is relatively small. Here is the source.
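For illustration only, here is a hedged sketch of how a partitioned and clustered table could be declared with the google-cloud-bigquery Python client. The table id and schema below are placeholders, not this project's actual models.

```python
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "my-project.tmdb.movies",  # hypothetical table id
    schema=[
        bigquery.SchemaField("id", "INTEGER"),
        bigquery.SchemaField("title", "STRING"),
        bigquery.SchemaField("release_date", "DATE"),
        bigquery.SchemaField("platform", "STRING"),
    ],
)
# Partition by release date (worthwhile above roughly 100 GB)...
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.YEAR, field="release_date"
)
# ...and cluster by platform (worthwhile above roughly 10 GB).
table.clustering_fields = ["platform"]

client.create_table(table)
```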
A few words about how the pipeline operates in Mage AI. Data is extracted from the Kaggle dataset and stored in the data directory on the server, then archived. In the next block, it is saved in Parquet format and uploaded to GCS. External tables, staging models, and core-level models are created within the dbt block. Once all upstream processes complete successfully, the data directory is removed. A sketch of the GCS upload step follows.
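The upload step could look roughly like the following Mage data exporter block. The bucket and object names are placeholders, and the real pipeline may differ in details.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter

@data_exporter
def export_to_gcs(df: pd.DataFrame, **kwargs) -> None:
    """Save the DataFrame as Parquet directly into a GCS bucket."""
    bucket = "my-tmdb-bucket"          # placeholder bucket name
    object_key = "raw/movies.parquet"  # placeholder object path
    gcs = fs.GcsFileSystem()           # uses application default credentials
    pq.write_table(
        pa.Table.from_pandas(df),
        f"{bucket}/{object_key}",
        filesystem=gcs,
    )
```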
In my case, I used Dynamic Blocks to take the output of one block and dynamically create more blocks using that information.
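A minimal sketch of what such a dynamic block can look like in Mage (the block must also be marked as dynamic in the UI). The per-platform fan-out mirrors this project, but the exact loader below is illustrative.

```python
if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader

@data_loader
def load_platforms(**kwargs):
    """Dynamic block: Mage spawns one downstream block run per item."""
    platforms = ['netflix', 'amazon_prime', 'disney_plus', 'hulu']
    return [
        # First list: the data handed to each spawned child block.
        [{'platform': p} for p in platforms],
        # Second list: metadata; block_uuid labels each child run.
        [{'block_uuid': p} for p in platforms],
    ]
```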
The datasets were taken from Kaggle (a download sketch follows these details):
- The Full TMDb TV Shows Dataset for 2024
  - File Type: CSV
  - File Size: 78.3 MB
  - Rows: 166,383
  - Columns: 29
- The Full TMDb Movies Dataset for 2024
  - File Type: CSV
  - File Size: 463.1 MB
  - Rows: 1,011,520
  - Columns: 23
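One way to script the extraction step is with the official Kaggle API client, as sketched below. The dataset slugs are assumptions inferred from the dataset titles; check the actual Kaggle dataset URLs for the exact owner/slug values.

```python
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json

# Assumed slugs; verify against the real Kaggle dataset pages.
for slug in [
    "asaniczka/full-tmdb-tv-shows-dataset-2023-150000-shows",
    "asaniczka/tmdb-movies-dataset-2023-930k-movies",
]:
    api.dataset_download_files(slug, path="data", unzip=True)
```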
How to Recreate the Project?
Recreating the project is straightforward and should take about 15 minutes. You can find a detailed tutorial on how to do this here.
Possible improvements:
- Create tests for all code and SQL
A special thank you to the instructors, who provided guidance and support throughout the course. Their expertise and insights were incredibly valuable in developing the Movies and TV Shows project. The course taught me many useful skills and techniques that have greatly strengthened my work as a data engineer.
Thank you for organizing such a comprehensive and interesting course experience!
Feedback received after the verification process on the course:
- Looks nice!
- The benefit of choosing a large dataset is that partitioning and clustering actually makes sense, you are the only person I evaluated that actually did this step! The steps to reproduce are clear, and the tiles are beautiful. It would be nice if you included more information on the steps you took, ie, information on data transformation with dbt, maybe a screenshot of your DAG, etc.
- Good command of python and detailedly and logically segregated scripts. Seems to not have up-to-date data to be fetched regularly via batch? Would be cool to have that!