My work in machine learning includes theory and algorithm synthesis, with application to signal processing, fault detection and prediction in regimes from health care to complex environmental systems. Past work includes constrained probabilistic clustering, hierarchical empirical Bayes models for classification and prediction of longitudinal data streams (with applications to medicine), fast neural net dynamical surrogates for model-data fusion (with applications to data assimilation in a large-scale river-estuary-ocean system), anomaly detection in medical tests and environmental sensor networks, computer-controlled stimuli for probing the sensory-motor system in weakly electric fish, ensemble dynamics of spike-timing-dependent plasticity, and nonlinear dimensionality reduction.
My current work is focused on approximation techniques for probability densities arising from Markov processes on continuous domains. The time evolution and steady-state densities for such processes are not solvable in closed form when the transition probabilities W(x' → x) are non-Gaussian and non-linear in the initial state x'. Over the past several years, my students and I have developed asymptotic expansions for densities from such processes. The approximation techniques have applications to spike-timing-dependent neural plasticity and to random walks (arising, for example, in stochastic gradient descent in statistical model fitting). The techniques also have application in statistical state estimation, and in chemical and gene expression networks.
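When no closed form exists, a brute-force numerical baseline is to discretize the state space and iterate the transition kernel directly. The sketch below is only an illustration of that baseline, not the asymptotic expansions described above. The Laplace jump density, the cubic update rule, the grid, and all function names are assumptions chosen for the example.

```python
import numpy as np

def transition_matrix(grid, mean_next, scale):
    """Discretized kernel: T[i, j] approximates W(x_j -> x_i) on the grid."""
    centers = mean_next(grid)                    # nonlinear in the initial state
    jumps = grid[:, None] - centers[None, :]     # rows: final state, columns: initial state
    T = np.exp(-np.abs(jumps) / scale)           # Laplace (non-Gaussian) jump density
    return T / T.sum(axis=0, keepdims=True)      # normalize so each column sums to one

def evolve(p0, T, steps):
    """Propagate a discretized density through `steps` applications of the kernel."""
    p = p0.copy()
    for _ in range(steps):
        p = T @ p
    return p

# Hypothetical process: cubic (nonlinear) mean update with Laplace noise.
grid = np.linspace(-3.0, 3.0, 301)
T = transition_matrix(grid, lambda x: 0.9 * x - 0.1 * x**3, scale=0.2)

p0 = np.zeros(grid.size)
p0[200] = 1.0                                    # all probability mass on one grid point
p_steady = evolve(p0, T, steps=2000)

# A steady state should be left unchanged by one more application of the kernel.
residual = np.abs(T @ p_steady - p_steady).max()
print(f"max change after one more step: {residual:.2e}")
```

Grid iteration of this kind scales poorly with dimension and gives no analytic insight, which is one reason approximation techniques for the densities are of interest.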
Robustly Detecting Clinical Laboratory Errors
Hospital clinical laboratory tests are a major source of medical information used to diagnose, treat, and monitor patients. Errors in these tests lead to delays, additional expense and clinical evaluation, and sometimes to erroneous treatments that increase risk to patients. Such errors compromise clinical utility, cost effectiveness and patient safety. One recent study suggests that errors in measured total blood calcium concentration due to instrument mis-calibration alone cost from $60M to $199M annually in the US; as noted below, the bulk of errors do not originate in instrument mis-calibration.
Clinical laboratory errors affect about 0.5% of samples collected. Approximately 75% of those errors originate during sample collection, transport, and storage (jointly called the pre-analytic phase), before samples reach the analysis instruments. However, the quality control measures standard in hospital clinical test labs only monitor instrument calibration against fiducial test materials. They are therefore completely blind to sample faults introduced in the pre-analytic phase, where most errors originate.
Data derived from patient samples, rather than instrument calibration checks, hold the key to detecting faults introduced in the pre-analytic phase. Attempts to date to use such information are primitive and grossly insufficient. Current methods are either so insensitive to errors that they do not detect sample faults reliably, or they routinely flag normal samples as being faulty.
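The trade-off between missed faults and false alarms is sharpened by the low error rate. The following sketch works through the base-rate arithmetic for a detector applied at the 0.5% error rate quoted above. The sensitivity and specificity values are hypothetical and do not describe any existing method.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Fraction of flagged samples that are truly faulty (Bayes' rule)."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    return true_pos / (true_pos + false_pos)

# Prevalence from the text (0.5% of samples); detector characteristics are assumed.
ppv = positive_predictive_value(prevalence=0.005, sensitivity=0.90, specificity=0.95)
print(f"Fraction of flagged samples that are truly faulty: {ppv:.3f}")
```

Under these assumed numbers only about 8% of flagged samples would actually be faulty, so even a detector that looks accurate on paper would mostly flag normal samples.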
This project develops and uses statistical machine learning technology to reliably detect errors in hospital clinical laboratory tests, using data derived from patient samples. In a preliminary study, the PI showed that multivariate statistical models of lab tests revealed errors that existing techniques missed. The primary obstacle to developing reliable statistical detectors for lab errors is the cost of labeling samples combined with the low error rate. Developing and evaluating any automated error-detection algorithm requires a sufficient number of samples, both faulty and non-faulty. Determining which tests are faulty requires review of the tests and other patient data (e.g. charts) by a clinical lab expert, a time-consuming and economically infeasible prospect given the low fault rate. The project addresses this challenge through active learning paradigms that select subsets of the data, with emphasis on rare classes, for labeling by human experts. The project focuses on chronic kidney disease because of its medical importance and the large data repository at the PI’s institution. This research will provide algorithms for clinical lab error detection that will extend to tests used in other disease entities (for example, diabetes and heart failure).
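One simple way to bias expert labeling toward rare faulty samples is to rank samples by an anomaly score from a multivariate model and send only the top-ranked ones for review. The sketch below shows a single round of such anomaly-score-based query selection on synthetic data. It is a stand-in for illustration, not the project's active learning method; the Gaussian model, the Mahalanobis score, the synthetic "analytes", and all names are assumptions.

```python
import numpy as np

def mahalanobis_scores(X):
    """Squared Mahalanobis distance of each row from the sample mean."""
    mean = X.mean(axis=0)
    precision = np.linalg.pinv(np.cov(X, rowvar=False))
    centered = X - mean
    return np.einsum("ij,jk,ik->i", centered, precision, centered)

def select_for_labeling(X, budget):
    """Indices of the `budget` most anomalous samples, most anomalous first."""
    scores = mahalanobis_scores(X)
    return np.argsort(scores)[::-1][:budget]

# Synthetic data: 2000 samples of three correlated, standardized analytes.
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.8, 0.5],
                [0.8, 1.0, 0.5],
                [0.5, 0.5, 1.0]])
X = rng.multivariate_normal(np.zeros(3), cov, size=2000)

# Hypothetical faults (0.5% of samples): one analyte shifted, breaking the correlation.
faulty_idx = np.arange(10)
X[faulty_idx, 0] += 5.0

queried = select_for_labeling(X, budget=50)
n_found = np.intersect1d(queried, faulty_idx).size
print(f"{n_found} of {faulty_idx.size} faulty samples among {queried.size} queried")
```

Random selection of 50 samples from 2000 would be expected to include well under one faulty sample, so score-guided selection concentrates expert effort where faults are likely. A full active learning loop would retrain the model on the returned labels and repeat the selection.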
Ultimately, the error-detection algorithms developed from this research will make their way into clinical laboratory information systems and further into commercialization, and thus into deployment on a scale significant enough to have widespread positive impact on laboratory costs and patient risk.
Stochastic Learning Dynamics
Funded by NSF
The discovery that synaptic plasticity is mediated by processes sensitive to the precise relative timing of pre- and post-synaptic events overturned models of synaptic change based on average activity levels (so-called rate-dependent models). The discovery of Spike-Timing-Dependent Plasticity (STDP) requires new theoretical tools for its description.
Individual STDP events have inherent random variability as well as variability from timing fluctuations due to circuit-level random factors. Computational synaptic dynamics in the new paradigm must therefore be based in the theory of stochastic processes. Previous work modeling the stochastic dynamics of STDP typically used the nonlinear Fokker-Planck equation (FPE) to approximate the intractable master equation governing the dynamics. Although often useful, the FPE is known to be deeply flawed and potentially misleading. The situation recalls the use of the FPE by machine learning theorists in the early to mid 1990s; the dynamics of both STDP and on-line machine learning algorithms follow a Markov process described by a master equation. This project establishes rigorous tools for treating the stochastic dynamics of learning systems based on spike-timing-dependent synaptic plasticity. It develops well-grounded approximation techniques (and exact solutions where available) for probability distributions on the synaptic weights and their moments, and applies the new techniques to synaptic dynamics in natural and artificial learning systems. The new methods are compared to the FPE used in recent literature to provide insight into the accuracy and appropriateness of the various methods. The techniques are relevant not only to computational neuroscience and machine learning, but more broadly to regimes whose Markov dynamics are described by a master equation, potentially including state estimation and the chemical master equation. The project provides software to the research community for computing distributions and moments using the new methods.
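For readers unfamiliar with the relation between the two equations, the standard textbook form can be sketched as follows. The notation is an assumption made for this illustration: a one-dimensional weight x, discrete update steps t, transition probability W(x' \to x), and jump moments \alpha_n. The project's own formulation may differ.

```latex
% Master equation for the weight density P(x,t) under discrete updates
P(x, t+1) = \int dx' \, W(x' \to x) \, P(x', t)

% Jump moments of the transition probability
\alpha_n(x) = \int d\xi \, \xi^n \, W(x \to x + \xi)

% Kramers-Moyal expansion: exact, but of infinite order
P(x, t+1) - P(x, t) = \sum_{n=1}^{\infty} \frac{(-1)^n}{n!}
    \frac{\partial^n}{\partial x^n} \left[ \alpha_n(x) \, P(x, t) \right]

% Fokker-Planck approximation: truncation at n = 2
P(x, t+1) - P(x, t) \approx
    - \frac{\partial}{\partial x} \left[ \alpha_1(x) \, P(x, t) \right]
    + \frac{1}{2} \frac{\partial^2}{\partial x^2} \left[ \alpha_2(x) \, P(x, t) \right]
```

In this form the FPE keeps only the drift and diffusion terms and discards every jump moment with n > 2, which is an uncontrolled step when the jumps are not small or the transition probability is far from Gaussian.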