Spark Performance / Test Harness
Apache Spark is a fascinating big data platform that combines ease of use for developers and admins with strong computational performance. We focus on improving the throughput of Spark-as-a-service in scenarios involving cluster sharing and multi-tenancy. We explore adaptations of classic cluster scheduling techniques to Spark, such as malleable/moldable workload scheduling and backfilling. These raise many interesting challenges and trade-offs, such as balancing per-workload latency/turnaround guarantees against total service throughput.
For the purposes of our research, we created Test Harness (TH), a unique tool for experimenting with multi-tenant Spark service performance. It allows designing experiments that encompass all the factors of a multi-tenancy setting: tenants, SLAs, workloads, datasets, and, most importantly, the arrival process (trace) by which tenants submit workloads and datasets to the service. TH then executes the experiments, collects Spark and infrastructure metrics in an orderly fashion, and supports reasoning about them: first through high-level goal metrics (scores), and then by drilling down with bundled tools.
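To make the arrival-process idea concrete, here is a minimal sketch of how a submission trace for such an experiment might be generated. This is an illustrative example only, not TH's actual API; the function name, field names, and Poisson-style inter-arrival model are all assumptions made for the sketch.

```python
import random

def generate_trace(tenants, num_jobs, mean_interarrival_s, seed=42):
    """Hypothetical trace generator: each entry records which tenant
    submits which workload at what time, with exponentially distributed
    inter-arrival times (a Poisson-style arrival process)."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    t = 0.0
    trace = []
    for i in range(num_jobs):
        t += rng.expovariate(1.0 / mean_interarrival_s)
        trace.append({
            "time_s": round(t, 2),        # submission time offset
            "tenant": rng.choice(tenants),  # which tenant submits
            "workload": f"job-{i}",        # workload identifier
        })
    return trace

# Example: two tenants sharing the service, five submissions,
# ~30 s average gap between arrivals.
trace = generate_trace(["tenant-A", "tenant-B"], num_jobs=5,
                       mean_interarrival_s=30.0)
for event in trace:
    print(event)
```

A replayer would then feed such a trace to the service under test, submitting each workload at its recorded offset, so that different services (or service versions) can be compared under an identical arrival process.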
TH serves several important use-cases, including analyzing service performance under particular scenarios and managing performance regression tests for a Spark service under development. Finally, since TH is independent of any particular service architecture, it can be used for competitive analysis between different Spark services using the same set of experiments.