DoK Talks #141 - Dossier: multi-tenant distributed Jupyter Notebooks // Iacoppo Colonnelli & Dario Tranchitella
Manage episode 334453420 series 2865115
When providing data analysis as a service, one must tackle several problems. Data privacy and protection by design are crucial when working on sensitive data. Performance and scalability are fundamental for compute-intensive workloads, e.g. training Deep Neural Networks. User-friendly interfaces and fast prototyping tools are essential to allow domain experts to experiment with new techniques. Portability and reproducibility are necessary to assess the actual value of results.
Kubernetes is the best platform to provide reliable, elastic, and maintainable services. However, Kubernetes alone is not enough to achieve large-scale multi-tenant reproducible data analysis. OOTB support for multi-tenancy is too rough, with only two levels of segregation (i.e. the single namespace or the entire cluster). Offloading computation to off-cluster resources is non-trivial and requires the user's manual configuration. Also, Jupyter Notebooks per se cannot provide much scalability (they execute locally and sequentially) and reproducibility (users can run cells in any order and any number of times).
The Dossier platform allows system administrators to manage multi-tenant distributed Jupyter Notebooks at the cluster level in the Kubernetes way, i.e. through CRDs. Namespaces are aggregated in Tenants, and all security and accountability aspects are managed at that level. Each Notebook spawns into a user-dedicated namespace, subject to all Tenant-level constraints. Users can rely on provisioned resources, either in-cluster worker nodes or external resources like HPC facilities. Plus, they can plug their computing nodes in a BYOD fashion. Notebooks are interpreted as distributed workflows, where each cell is a task that one can offload to a different location in charge of its execution.
Iacopo Colonnelli is a Computer Science research fellow. He received his Ph.D. with honours in Modeling and Data Science at Università di Torino with a thesis on novel workflow models for heterogeneous distributed systems, and his master’s degree in Computer Engineering from Politecnico di Torino with a thesis on a high-performance parallel tracking algorithm for the ALICE experiment at CERN. His research focuses on both statistical and computational aspects of data analysis at large scale and on workflow modeling and management in heterogeneous distributed architectures.
Dario is an SWE that turned DevOps, and he's regretting this choice day by day. Besides making memes on Twitter that gain more reactions than technical discussions, leading the development of Open Source projects at CLASTIX, an Open Source-based start-up focusing on Multi-Tenancy in Kubernetes.
KEY TAKE-AWAYS FROM THE TALK
From this talk, people will learn:
- The different requirements of Data analysis as a service
- How to configure for multi-tenancy at the cluster level with Capsule
- How to write distributed workflows as Notebooks with Jupyter Workflows
- How to combine all these aspects into a single platform: Dossier
All the software presented in the talk is OpenSource, so attendees can directly play with them and include them in their experiments with no additional restrictions.