We have hundreds of platform users, ranging from casual users running occasional queries to ETL developers and data scientists running tens to hundreds of queries every day. However, as a user of the system, understanding where and how a particular job executes can be confusing. This cohesive infrastructure abstracts all of the orchestration from the execution and allows the platform team to be flexible and adapt to dynamic environments without impacting users of the system. Genie, our execution service, abstracts the configuration and resource management for job submissions by providing a centralized service to query across all big data resources.
We experiment with new software and perform live upgrades simply by diverting jobs from one cluster to another, and we adjust the size and number of clusters based on need rather than capacity.
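To make the idea of diverting jobs concrete, here is a minimal sketch of tag-based routing, in which a job names the kind of cluster it needs rather than a specific cluster. This is an illustration only and not Genie's actual API: the `Cluster` and `Registry` classes, the `accepting_jobs` flag, the tag strings, and the cluster names are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical cluster record; the fields are assumptions for illustration."""
    name: str
    tags: set[str]
    accepting_jobs: bool = True


@dataclass
class Registry:
    """Hypothetical central registry that picks a cluster on the user's behalf."""
    clusters: list[Cluster] = field(default_factory=list)

    def route(self, required_tags: set[str]) -> Cluster:
        # Return the first available cluster that carries every required tag.
        for cluster in self.clusters:
            if cluster.accepting_jobs and required_tags <= cluster.tags:
                return cluster
        raise LookupError(f"no available cluster matches tags {sorted(required_tags)}")


registry = Registry([
    Cluster("prod-a", {"type:hadoop", "sched:adhoc"}),
    Cluster("prod-b", {"type:hadoop", "sched:adhoc"}),
])

# The job asks for a kind of cluster, never for a cluster by name.
job_tags = {"sched:adhoc"}
first = registry.route(job_tags).name

# Take prod-a out of rotation (for example, during an upgrade);
# the same job request is now diverted to prod-b with no change on the user's side.
registry.clusters[0].accepting_jobs = False
second = registry.route(job_tags).name
```

Because the job only states its requirements, the platform team can retag, resize, or retire clusters in the registry while user submissions stay the same.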
Decentralizing the data warehouse frees us to explore new ways to manage big data infrastructure but also introduces a new set of challenges. From a platform management perspective, being able to run multiple clusters isolated by concerns is both convenient and effective. This differentiates us from the more traditional configuration where Hadoop’s distributed file system is the storage medium with data and compute residing in the same cluster.
One of the key points from the article is that Netflix leverages Amazon’s Simple Storage Service (S3) as the “source of truth” for all data warehousing.
In a post last year we discussed our big data architecture and the advantages of working with big data in the cloud (read more here).