ABSTRACT:

The n-dimensional array (ndarray) is a ubiquitous data structure in scientific computing, whether the task is analyzing time-varying movies of neural activity, collections of satellite images, or sensor time series. The ndarray generalizes the two-dimensional matrix to support data structures spanning multiple dimensions, and many applications could benefit from efficient distributed implementations. While Spark's distributed DataFrame provides rich support for large tabular data, handling ndarrays remains a challenge. This talk introduces Bolt, an open-source implementation of an ndarray built on PySpark. Bolt provides a familiar API enabling distributed computations across one or more array dimensions at a time. It also implements an efficient chunking scheme to minimize shuffle complexity and analyzes shape information to simplify error handling. In this talk, we look at Bolt's design and implementation before taking a deep dive into a particular use case: probing how decision making works in the brain by analyzing whole-brain neuroimaging and behavioral data from an animal engaged in a virtual reality environment.

BIO:

Jason Wittenbach is a Senior Machine Learning Engineer in the "Center for Machine Learning" at Capital One. He has a BS in Mathematics and Physics from the University of Notre Dame and a PhD in Physics from The Pennsylvania State University. During his doctoral work, he studied mathematical models of neural circuits to understand how the brain uses sensory-motor feedback in decision making, using the songbird as a model system. As a postdoc at the Janelia Research Campus (Howard Hughes Medical Institute), he built computational tools for processing large neuroimaging datasets and employed machine learning techniques to uncover links between neural activity and animal behavior. At Capital One, he has worked on projects in market intelligence and anti-money laundering, and has developed an automated hyperparameter tuning platform. Jason is interested in designing and leveraging cutting-edge machine learning techniques to solve important domain-specific problems, as well as architecting and building software tools that support these analytic methods.