Big Data Warehouse and Data Governance Services Team at Netflix (2024)

Netflix continues to innovate on its product and has a goal to grow to 500 million members worldwide. In addition, Netflix has a strategic goal to build a globally connected, data-informed studio that can manage hundreds of productions simultaneously. The scale and challenges of integrating production and studio data with product data are unprecedented in the industry.

In this mission, the data platform team is at the core of Netflix's success.

The data platform team is responsible for building comprehensive data infrastructure and solutions to support various Netflix business functions. As we scale to serve fast-growing and increasingly complex business needs, we are adopting a product-oriented approach and devising a comprehensive data strategy for the whole of Netflix. Specifically, the big data warehouse team is responsible for two of the key missions.

Intelligent Big Data Warehouse

We envision an intelligent abstraction layer that could fully automate the maintenance and optimization of our big data warehouse, thereby improving the productivity of our users and the efficiency of our platform.

The big data platform is responsible for the analytical data infrastructure that enables our business stakeholders to make business decisions efficiently and with ease. Our big data warehouse is on AWS S3. It has hundreds of thousands of datasets and hundreds of petabytes of data. The platform also runs hundreds of thousands of big data workflows and jobs every day on compute engines like Presto, Spark, or Druid.

The big data platform has a few core components that we continue to innovate on. Iceberg is a table format for large analytical datasets in our data warehouse. Metacat is our metadata store that maps hundreds of thousands of logical datasets to their physical locations on AWS S3. We also have lineage information on workflow, job and dataset dependencies.

To take it to the next level, using the metadata and lineage information from the core components, we could auto-discover and prioritize the datasets most ‘beneficial’ to optimize based on different dimensions such as data cost, number of queries using the dataset, criticality to the business, etc. We could use machine learning algorithms to run experiments that auto-analyze the benefits of optimization by doing ‘explore and exploit’ on various tuning parameters. For example, we could tune a dataset to optimize storage (e.g., encoding or compression) or to optimize compute time (e.g., sort order, row grouping optimization, or file compaction). Based on the analysis results, we could auto-optimize datasets, prioritizing the largest gains. We could leverage excess compute resources to perform optimization during trough times. We could also have a feedback loop to measure the effectiveness of the optimized dataset and adjust the heuristics and parameters. We could rinse and repeat this lifecycle of prioritizing, analyzing, optimizing and benchmarking across all datasets in the data warehouse.
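As a rough illustration of the prioritize-and-experiment loop described above, here is a minimal Python sketch. It is not Netflix's implementation: the `DatasetStats` fields, the scoring weights, the configuration labels, and the choice of an epsilon-greedy strategy for ‘explore and exploit’ are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class DatasetStats:
    """Hypothetical per-dataset signals drawn from metadata and lineage."""
    name: str
    storage_cost: float   # assumed cost figure, arbitrary units
    query_count: int      # queries that read the dataset in some window
    criticality: float    # 0.0 (low) to 1.0 (business critical)


def rank_datasets(datasets, weights=(0.5, 0.3, 0.2)):
    """Order datasets by a weighted score; the weights are illustrative only."""
    max_cost = max(d.storage_cost for d in datasets) or 1.0
    max_queries = max(d.query_count for d in datasets) or 1
    w_cost, w_queries, w_crit = weights

    def score(d):
        return (w_cost * d.storage_cost / max_cost
                + w_queries * d.query_count / max_queries
                + w_crit * d.criticality)

    return sorted(datasets, key=score, reverse=True)


class EpsilonGreedyTuner:
    """Explore/exploit over candidate tuning configurations for one dataset."""

    def __init__(self, configs, epsilon=0.1, seed=0):
        self.configs = list(configs)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.trials = {c: 0 for c in self.configs}
        self.mean_gain = {c: 0.0 for c in self.configs}

    def choose(self):
        # Try every configuration once before exploiting.
        untried = [c for c in self.configs if self.trials[c] == 0]
        if untried:
            return untried[0]
        # Explore with probability epsilon, otherwise exploit the best so far.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.configs)
        return self.best()

    def record(self, config, gain):
        # Feedback loop: update the running mean of the measured gain.
        self.trials[config] += 1
        n = self.trials[config]
        self.mean_gain[config] += (gain - self.mean_gain[config]) / n

    def best(self):
        return max(self.configs, key=lambda c: self.mean_gain[c])


if __name__ == "__main__":
    ranked = rank_datasets([
        DatasetStats("events_raw", 100.0, 1000, 0.9),
        DatasetStats("tmp_scratch", 10.0, 100, 0.1),
    ])
    print([d.name for d in ranked])
```

In this sketch, a benchmarking job would call `record` with the measured gain (for example, the fraction of bytes or query time saved) after each experiment, and `choose` would pick the next configuration to try.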

We envision a fully automated, intelligent data warehouse as we scale. This way, our business users can focus on what they do best without worrying about wrangling details or managing the life cycle of their huge datasets in the big data warehouse.

Centralized Data Governance Infrastructure

This team is also responsible for the broader data strategy at Netflix. We aim to provide visibility into the taxonomy and business context of all Netflix datasets (including video assets, unstructured logs, etc.). This would allow Netflix to better organize its data, take better-informed actions on the data with increasing precision in security, privacy, and efficiency, improve its risk management capabilities, and increase confidence in the quality of the data.

We have just started on a journey to build the foundational components for the data governance initiative. We are building a Netflix-wide data catalog to capture and infer business metadata across all datasets at Netflix. We are also building a Netflix-wide schema registry to help datasets interoperate across systems and to manage the lifecycle of schemas in different data stores. We are designing a centralized yet customizable data detection framework that could sample, detect, and report violations across all datasets, which would give us holistic insight into the risk profiles and quality of our data. We are also building a centralized and pluggable policy engine that would allow our stakeholders to customize data policy rules for all datasets.
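To make the idea of a pluggable policy engine more concrete, here is a minimal Python sketch. It is purely illustrative: the `DatasetMetadata` fields, the `PolicyEngine` class, and the two example rules are hypothetical and are not taken from any Netflix system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class DatasetMetadata:
    """Hypothetical business metadata that a data catalog might hold."""
    name: str
    owner: Optional[str] = None
    tags: List[str] = field(default_factory=list)
    retention_days: Optional[int] = None


# A rule returns a violation message, or None when the dataset complies.
Rule = Callable[[DatasetMetadata], Optional[str]]


class PolicyEngine:
    """Central registry that stakeholders plug their own rules into."""

    def __init__(self) -> None:
        self._rules: Dict[str, Rule] = {}

    def register(self, rule_id: str) -> Callable[[Rule], Rule]:
        def decorator(fn: Rule) -> Rule:
            self._rules[rule_id] = fn
            return fn
        return decorator

    def evaluate(self, datasets: List[DatasetMetadata]) -> List[Tuple[str, str, str]]:
        """Return (dataset name, rule id, message) for every violation."""
        violations = []
        for ds in datasets:
            for rule_id, rule in self._rules.items():
                message = rule(ds)
                if message is not None:
                    violations.append((ds.name, rule_id, message))
        return violations


engine = PolicyEngine()


@engine.register("owner-required")
def owner_required(ds: DatasetMetadata) -> Optional[str]:
    return None if ds.owner else "dataset has no owner"


@engine.register("pii-retention")
def pii_retention(ds: DatasetMetadata) -> Optional[str]:
    # Assumed policy: datasets tagged as PII must declare a retention period.
    if "pii" in ds.tags and ds.retention_days is None:
        return "PII dataset has no retention period"
    return None


if __name__ == "__main__":
    report = engine.evaluate([
        DatasetMetadata("member_profiles", owner="growth", tags=["pii"]),
        DatasetMetadata("playback_logs"),
    ])
    for row in report:
        print(row)
```

The registry keeps rule definitions separate from the evaluation loop, so each stakeholder can add or change rules without touching the engine itself.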

At Netflix, data infrastructure runs like a power grid that supplies electricity across all organizations. However, there are many data sources beyond what is managed by the data platform team. There are also a few local data solutions built for fast innovation. Spearheading a cohesive and centralized data strategy across Netflix requires a lot of strategic alignment and partnership with different business stakeholders. We are in the middle of defining and evolving Netflix's data strategy across different organizations as we continue to grow as a global entertainment company.

This team is in the midst of building its long-term vision of an intelligent data warehouse and a centralized data governance platform for Netflix.

We are looking for an engineering manager to lead this team. Let us know if you are interested in leading the team through this journey.
