Announcements

StatConnect@AI: Student Posters

Xin Li – PhD Student, Department of Mathematics and Statistics – Georgetown University

This paper introduces a computationally efficient algorithm, rooted in systems theory, for solving inverse problems governed by linear partial differential equations (PDEs). We model solutions of linear PDEs using Gaussian processes whose priors are constructed with tools from commutative algebra and algebraic analysis. The construction of these priors is algorithmic and implemented in the Macaulay2 computer algebra system. One example application is identifying the wave speed from noisy data for the classical wave equation, which is widely used in physics. The method achieves high accuracy while enhancing computational efficiency.
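
The construction below is a minimal sketch of the general idea, not the authors' Macaulay2-based prior: in one dimension, any u(x, t) = f(x - ct) + g(x + ct) solves the wave equation, so independent Gaussian process priors on f and g induce a prior whose samples satisfy the PDE exactly, and the wave speed c can then be estimated by maximizing the marginal likelihood. Kernels, grids, and noise levels are illustrative assumptions.

    # Sketch: estimate wave speed c in u_tt = c^2 u_xx from noisy samples,
    # using a GP prior that satisfies the PDE by construction (d'Alembert form).
    import numpy as np

    def k_rbf(a, b, ell=1.0):
        return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

    def wave_kernel(x, t, c):
        xi, eta = x - c * t, x + c * t          # characteristic coordinates
        return k_rbf(xi, xi) + k_rbf(eta, eta)  # covariance of f(xi) + g(eta)

    def log_marginal_likelihood(y, x, t, c, noise=0.1):
        K = wave_kernel(x, t, c) + noise**2 * np.eye(len(y))
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        return -0.5 * y @ alpha - np.log(np.diag(L)).sum()  # up to a constant

    rng = np.random.default_rng(0)
    c_true = 1.7
    x = rng.uniform(-2, 2, 60)
    t = rng.uniform(0, 2, 60)
    y = np.sin(x - c_true * t) + 0.1 * rng.standard_normal(60)  # traveling wave

    grid = np.linspace(0.5, 3.0, 26)
    c_hat = grid[np.argmax([log_marginal_likelihood(y, x, t, c) for c in grid])]
    print(f"estimated wave speed: {c_hat:.2f} (true {c_true})")

A coarse grid search suffices here; a gradient-based optimizer over c and the kernel length scales would be the natural refinement.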

Dana Kim – Master’s Student, Data Science and Analytics – Georgetown University

This study examines whether the New York Police Department’s implementation of a proprietary predictive policing system known as Patternizr exacerbated racial bias in policing practices. Using 9.49 million complaint records from NYC Open Data spanning 2006–2024 and focusing on violent felonies, we conducted statistical and time-series analyses to compare racial disparities across the NYPD precincts with the highest complaint counts before and after the rollout of Patternizr in 2013. We employed t-tests, chi-square tests, ANOVA, and ARIMA time-series analysis to evaluate changes in monthly complaint volumes and the racial composition of complaints, and forecasted reporting delays and monthly complaint volume from past trends. Across the 20 precincts we examined, 18 exhibited statistically significant shifts in racial composition following Patternizr’s implementation. Although Black suspects became less proportionally overrepresented relative to pre-Patternizr baselines, when changes in representation are compared across racial groups, Black suspects became more overrepresented relative to White suspects. Additionally, complaint counts involving Black suspects remained 5–7 times higher per month than those involving White suspects, and crimes involving Black suspects were reported approximately three times faster than those involving White suspects. Our results indicate that existing disparities in policing patterns persist in the post-implementation period and show evidence of at least relative exacerbation for historically marginalized racial groups. These findings suggest that predictive policing systems trained on historically biased data may reproduce (if not exacerbate) existing disparities, underscoring the importance of rigorous empirical evaluation, transparency, and algorithmic auditing in the deployment of AI-driven law enforcement tools.
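
As one concrete example of the kind of test in this pipeline, the sketch below runs a chi-square comparison of a precinct's racial complaint composition before and after the 2013 rollout. All counts are invented placeholders, not values from the study.

    # Illustrative chi-square test: does the racial composition of complaints
    # in one precinct differ pre- vs. post-2013? Counts are hypothetical.
    from scipy.stats import chi2_contingency

    #               Black  White  Other   (suspect race in complaint records)
    pre_counts  = [ 4200,   600,  1100]   # hypothetical pre-2013 totals
    post_counts = [ 3900,   500,  1300]   # hypothetical post-2013 totals

    chi2, p, dof, expected = chi2_contingency([pre_counts, post_counts])
    print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.3g}")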

Zixuan Zhao – PhD Student, Department of Statistics – George Washington University

Individual patient data (IPD) are essential for statistical analysis in clinical research, yet access is often limited due to privacy concerns, high data-sharing costs, and proprietary restrictions. Conventional approaches to synthetic data generation, such as generative adversarial networks (GANs), require a large amount of IPD as a training set. We propose an assumption-lean, three-step method to create synthetic IPD for survival data that requires no IPD for training. In brief, it digitizes reported Kaplan-Meier (KM) plots and generates covariates that match the reconstructed data. Compared with existing IPD reconstruction methods, our approach is the first to exploit Scalable Vector Graphics (SVG) for high-accuracy digitization and the first to provide covariate information. We demonstrate the method’s potential through two detailed case studies and complementary simulation studies.
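
A minimal sketch of the digitization step follows: it reads the vertex coordinates of a KM step curve from an SVG polyline and maps them from pixel space to (time, survival) space. The element structure and axis calibration points are assumptions for illustration; the actual pipeline is more elaborate.

    # Sketch: extract a Kaplan-Meier step curve from an SVG <polyline> and
    # convert pixel coordinates to data coordinates. The SVG layout and the
    # calibration points below are invented for illustration.
    import xml.etree.ElementTree as ET
    import numpy as np

    svg = """<svg xmlns="http://www.w3.org/2000/svg">
      <polyline points="0,0 50,0 50,20 120,20 120,55 200,55"/>
    </svg>"""

    root = ET.fromstring(svg)
    poly = root.find("{http://www.w3.org/2000/svg}polyline")
    px = np.array([list(map(float, p.split(",")))
                   for p in poly.get("points").split()])

    # Calibration: pixel coordinates of (t=0, S=1) and (t=24 months, S=0).
    origin_px, corner_px = np.array([0.0, 0.0]), np.array([200.0, 100.0])
    t = 24 * (px[:, 0] - origin_px[0]) / (corner_px[0] - origin_px[0])
    S = 1 - (px[:, 1] - origin_px[1]) / (corner_px[1] - origin_px[1])
    for ti, si in zip(t, S):
        print(f"t = {ti:5.1f} months, S(t) = {si:.2f}")

Because SVG stores exact path coordinates rather than rasterized pixels, this recovers the step locations without the interpolation error of bitmap digitizers.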

Han Su – PhD Student, Department of Statistics – George Washington University

Combining forecasts from multiple experts often yields more accurate results than relying on a single expert. In this paper, we introduce a novel regularized ensemble method that extends the traditional linear opinion pool by leveraging both current forecasts and historical performances to set the weights. Unlike existing approaches that rely only on either the current forecasts or past accuracy, our method accounts for both sources simultaneously. It learns weights by minimizing the variance of the combined forecast (or its transformed version) while incorporating a regularization term informed by historical performances. We also show that this approach has a Bayesian interpretation. Different distributional assumptions within this Bayesian framework yield different functional forms for the variance component and the regularization term, adapting the method to various scenarios. In empirical studies on Walmart sales and macroeconomic forecasting, our ensemble outperforms leading benchmark models both when experts’ full forecasting histories are available and when experts enter and exit over time, resulting in incomplete historical records. Throughout, we provide illustrative examples that show how the optimal weights are determined and, based on the empirical results, we discuss where the framework’s strengths lie and when experts’ past versus current forecasts are more informative.
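
One illustrative instance of the weight-learning idea, with the caveat that the paper's exact variance and penalty forms may differ: choose simplex weights minimizing the combined forecast's variance plus a penalty that shrinks weight on experts with poor historical accuracy.

    # Sketch: regularized linear opinion pool weights. Minimize w' Sigma w
    # (variance of the combined forecast) plus a penalty built from each
    # expert's historical MSE, over the probability simplex. All inputs are
    # hypothetical placeholders.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    K = 4
    Sigma = np.cov(rng.standard_normal((K, 300)))   # stand-in forecast covariance
    past_mse = np.array([0.8, 1.0, 1.6, 2.5])       # hypothetical historical MSEs
    lam = 0.5                                       # regularization strength

    def objective(w):
        return w @ Sigma @ w + lam * (w**2 @ past_mse)

    res = minimize(objective, np.full(K, 1 / K), bounds=[(0, 1)] * K,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1}],
                   method="SLSQP")
    print("ensemble weights:", np.round(res.x, 3))

With these placeholder inputs the weights tilt toward experts with smaller historical MSE, which is the intended effect of the penalty.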

Arman Azizyan – Master’s Student, Department of Statistics – George Mason University

Deep Learning Based Changepoint Detection Algorithm for Time Series Data with Smooth Local Fluctuations

Detecting mean shifts in time series is challenging when the underlying signal contains smooth structure that masks abrupt changes, causing standard changepoint algorithms to perform poorly. This work introduces a deep neural network based approach that isolates local mean shifts by explicitly modeling and removing the smooth component. For each time index, two feed-forward neural network (FNN) models are trained on left and right neighborhoods of fixed width, and a third FNN is trained on the combined two-window region. A detector statistic is constructed from the residuals of these three models, quantifying whether the two-model fit provides a significantly better local description than the single-model alternative. The statistic is smoothed, and local maxima are extracted under a minimum-distance constraint to identify candidate changepoints. A threshold is then selected using the empirical cumulative distribution function of the detector statistic, enabling data-driven identification of significant peaks. A k-means refinement step subsequently adjusts the estimated changepoint locations and estimates the magnitude of the mean shift. Finally, a global FNN is fitted to the corrected series to assess model adequacy and provide diagnostic evaluation. Simulation studies and a real-world application demonstrate that the proposed method substantially outperforms selected competing methods, yielding fewer false positives and more accurate localization of mean shifts, particularly in the presence of smooth underlying structure.
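
The sketch below illustrates the detector statistic at a single time index: fit FNNs on the left window, the right window, and the combined window, then compare residual sums of squares. Smoothing, peak extraction, thresholding, and the k-means refinement are omitted, and window width, network sizes, and training settings are placeholders rather than the paper's configuration.

    # Sketch of the three-model detector statistic at index t.
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def detector_stat(y, t, w=30):
        idx = np.arange(len(y), dtype=float)

        def rss(sl):
            X = idx[sl].reshape(-1, 1)
            m = MLPRegressor(hidden_layer_sizes=(16,), solver="lbfgs",
                             max_iter=2000, random_state=0).fit(X, y[sl])
            return np.sum((y[sl] - m.predict(X))**2)

        # Large values mean the two-model fit beats the single-model fit,
        # i.e. a likely mean shift near t.
        return rss(slice(t - w, t + w)) - (rss(slice(t - w, t)) + rss(slice(t, t + w)))

    rng = np.random.default_rng(2)
    n = 200
    y = np.sin(np.linspace(0, 4, n)) + 0.1 * rng.standard_normal(n)  # smooth part
    y[120:] += 1.5                                   # mean shift at t = 120
    candidates = range(40, 160, 10)
    stats = [detector_stat(y, t) for t in candidates]
    print("peak near t =", list(candidates)[int(np.argmax(stats))])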

Jilei Lin – PhD Student, Department of Statistics – George Washington University

Experimental autoimmune myasthenia gravis (EAMG) is an established animal model for studying the progression of myasthenia gravis (MG), a chronic autoimmune neuromuscular disorder, and for developing effective treatments. In EAMG studies, weight trajectories exhibit biologically distinct phases: an initial pre-chronic period followed by chronic sub-phases with differing rates of change. These structured transitions motivate modeling the relationship between weight and treatment time using a bent-line regression framework. However, the analysis is complicated by ethical and experimental protocols requiring early euthanasia of severely affected mice, which introduces monotone missingness in the weight data. Standard approaches that ignore this mechanism yield biased estimates of both slopes and change points. We develop a censored bent-line quantile regression framework with a simple estimation procedure. By leveraging the appealing identifiability of quantiles under censoring, the method consistently identifies pre-chronic and chronic sub-phases without making parametric distributional assumptions. We establish the consistency and asymptotic normality of the proposed estimator and develop a bias-corrected approach for constructing confidence intervals. Simulation studies and application to the EAMG data demonstrate that the proposed method substantially reduces bias and improves inference in evaluating the efficacy of an antigen-specific immunotherapeutic vaccine.
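
For intuition, the toy sketch below fits a single-kink bent-line quantile model by minimizing the check loss, handling censored responses Powell-style via min(q(t), C). The paper's estimator, missingness mechanism, and bias-corrected inference are more involved; this only conveys the shape of the problem.

    # Toy sketch: censored bent-line median regression
    # q(t) = b0 + b1*t + b2*(t - k)_+, with right-censoring handled by
    # replacing q(t) with min(q(t), C_i) inside the check loss (Powell-type).
    import numpy as np
    from scipy.optimize import minimize

    def check_loss(u, tau=0.5):
        return np.sum(u * (tau - (u < 0)))

    def fit(t, y, C, tau=0.5):
        def obj(theta):
            b0, b1, b2, k = theta
            q = b0 + b1 * t + b2 * np.maximum(t - k, 0)
            return check_loss(y - np.minimum(q, C), tau)
        return minimize(obj, x0=[y.mean(), 0.0, 0.0, 15.0],
                        method="Nelder-Mead").x

    rng = np.random.default_rng(3)
    t = np.linspace(0, 40, 300)                       # days on study
    y = 25 - 0.05 * t - 0.4 * np.maximum(t - 20, 0) \
        + rng.standard_normal(300)                    # weight, kink at day 20
    C = np.where(t > 30, 18.0, np.inf)                # censoring of late weights
    b0, b1, b2, k = fit(t, np.minimum(y, C), C)
    print(f"estimated change point: day {k:.1f} (true 20)")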

Roberta Southwood – Master’s Student, Data Science – American University

This study applies statistical methods to rigorously test mathematical sheaf modeling on the predator-prey dynamics of the wolf and moose populations of Isle Royale, Michigan. Data are from the contiguous collection years 1959–2019. The importance of this study is to improve understanding of sheaf modeling for ecological systems. Sheaf modeling is a topological tool that allows mathematicians to track data locally and globally while accounting for inconsistencies and forming hypotheses about future interactions. We apply these concepts to a study on wolves and moose previously modeled by James T. Thorson et al. with a dynamic structural equation model (DSEM). We reapply the netlist translation from DSEM models to sheaf structures developed by Michael Robinson et al. In our novel approach, we apply this method along various segmented timelines of data, allowing for intricate evaluation of the sheaf model framework. Results are quantified by minimizing and tracking consistency radii, which measure the distance between observed and hypothesized data. Exploring consistency radii allows verification of model performance and identifies areas in need of improvement or further study.
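
A toy illustration of the key diagnostic: local sections over overlapping time windows are restricted onto their shared overlaps, and the consistency radius is the largest disagreement after restriction. The wolf-moose sheaf in the study is far richer; all values and maps below are invented.

    # Toy consistency radius: per-window (wolf, moose) estimates as local
    # sections, identity restriction maps, and the radius as the largest
    # disagreement across overlaps.
    import numpy as np

    sections = {"w1": np.array([24.0, 900.0]),   # window 1 (wolf, moose)
                "w2": np.array([22.0, 950.0]),   # window 2
                "w3": np.array([20.0, 980.0])}   # window 3
    overlaps = [("w1", "w2"), ("w2", "w3")]      # adjacent windows overlap
    restrict = {k: (lambda s: s) for k in sections}  # identity maps in the toy

    radius = max(np.linalg.norm(restrict[a](sections[a]) - restrict[b](sections[b]))
                 for a, b in overlaps)
    print(f"consistency radius: {radius:.1f}")

A small radius means the local data nearly glue into a global section; large contributions flag the windows where the model and observations disagree.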

Zexin Ren – PhD Student, Department of Statistics – George Washington University

Systematic literature reviews of clinical trials are central to evidence synthesis and regulatory decision-making, but conventional workflows are time-consuming, labor-intensive, and vulnerable to selection bias. We propose two semi-automated multi-agent systems (MAS) to support systematic reviews: one for study screening and one for data extraction. In the screening MAS, multiple independent agents evaluate trials in parallel, and an inspector agent resolves disagreements ahead of human verification. The extraction MAS uses a standardize-then-extract architecture and reports a confidence score for each extracted field. Across the workflow, the proposed systems substantially reduce screening and extraction time while producing stable and reproducible decisions. In a real-world replication of a published network meta-analysis, the framework recovered all previously included studies and identified additional eligible trials missed by manual review. This work provides a reproducible, scalable, and regulatory-aligned framework for AI-assisted evidence synthesis in clinical research.
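
Structurally, the screening MAS can be pictured as below: independent screener agents vote in parallel and an inspector flags disagreements for human review. Here ask_llm is a hypothetical stand-in for whatever model API the system actually uses, and the voting rule is a simplification.

    # Structural sketch of the screening MAS (not the authors' implementation).
    from collections import Counter
    from concurrent.futures import ThreadPoolExecutor

    def ask_llm(prompt: str) -> str:           # hypothetical model call
        return "include"                        # placeholder response

    def screen(abstract: str, criteria: str, n_agents: int = 3):
        prompt = (f"Criteria:\n{criteria}\n\nAbstract:\n{abstract}\n\n"
                  "Answer 'include' or 'exclude'.")
        with ThreadPoolExecutor() as pool:      # agents vote in parallel
            votes = list(pool.map(ask_llm, [prompt] * n_agents))
        if len(Counter(votes)) > 1:             # inspector: disagreement found
            return {"decision": "needs_human_review", "votes": votes}
        return {"decision": votes[0], "votes": votes}

    print(screen("A randomized trial of drug X vs placebo...", "RCTs in adults"))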

Yang Long – PhD Student, Department of Statistics – George Mason University

Understanding how covariates, such as demographic, clinical and genetic variables, relate to complex spatial imaging outcomes is a fundamental problem in brain imaging studies, particularly when imaging responses are high-dimensional and only partially observed. Recent generative models allow missing imaging modalities to be synthesized from auxiliary data, but their incorporation into regression analysis requires careful statistical treatment. In this project, we propose a two-stage Synthetic Surrogate Functional Regression (SSFR) framework, a statistical safeguard for incorporating AI-generated neuroimages to learn spatially varying covariate effects in the presence of partially observed imaging modalities. SSFR addresses data sparsity by integrating surrogate images within a unified regression framework, yielding stable and efficient estimation when the primary imaging modality is incompletely observed and remaining robust to the choice of surrogate generation model. To accommodate neuroimaging data with dense voxel grids over complex 3D domains, we further develop a distributed version that utilizes a triangulation-based domain decomposition, partitioning the imaging space into subregions and enabling parallel estimation via trivariate penalized splines. The SSFR scheme preserves the statistical accuracy of the centralized estimator while substantially reducing computational and memory costs, enabling practical analysis of high-dimensional imaging data. Extensive simulations demonstrate the numerical accuracy, robustness, and scalability of the methods. Applied to spatially normalized PET images from ADNI, SSFR yields interpretable, high-resolution coefficient maps characterizing spatially varying associations between auxiliary covariates and cerebral metabolism, demonstrating the critical synergy between modern AI data synthesis and rigorous statistical methodology.
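
A greatly simplified stand-in for the two-stage flow, on a toy one-dimensional "image": stage 1 fills in missing images with surrogates predicted from an auxiliary modality, and stage 2 fits a voxelwise linear model for the covariate effect. The real SSFR uses trivariate penalized splines over triangulated 3D domains and is robust to the surrogate generator; this shows only the data flow.

    # Toy two-stage sketch: surrogate imputation, then voxelwise regression.
    import numpy as np

    rng = np.random.default_rng(4)
    n, v = 120, 50                              # subjects, "voxels"
    x = rng.standard_normal(n)                  # covariate of interest
    beta = np.sin(np.linspace(0, np.pi, v))     # spatially varying true effect
    Y = np.outer(x, beta) + 0.3 * rng.standard_normal((n, v))
    aux = Y.mean(axis=1) + 0.1 * rng.standard_normal(n)  # auxiliary modality

    observed = rng.random(n) > 0.4              # ~40% of images missing
    # Stage 1: surrogate generator = per-voxel regression of Y on aux,
    # trained on complete cases (stand-in for an AI image synthesizer).
    coef = np.linalg.lstsq(np.c_[np.ones(observed.sum()), aux[observed]],
                           Y[observed], rcond=None)[0]
    Y_filled = Y.copy()
    Y_filled[~observed] = np.c_[np.ones((~observed).sum()), aux[~observed]] @ coef

    # Stage 2: voxelwise effect of x using observed + surrogate images.
    X = np.c_[np.ones(n), x]
    beta_hat = np.linalg.lstsq(X, Y_filled, rcond=None)[0][1]
    print("max abs error of effect map:", np.abs(beta_hat - beta).max().round(2))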

Jiayi Zheng – PhD Student, Department of Statistics – George Mason University

In recent years, the size of datasets has dramatically increased. This has encouraged the use of subsampling, where only a subset of the full dataset is used to fit a model in a more computationally efficient manner. Existing methods do not provide much guidance on how to find optimal subsamples for a linear model when the variance of the errors depends on the model covariates through an unknown function. This paper presents three main contributions that aid in finding optimal subsamples in the case of heteroscedastic errors. First, a kernel-based method is proposed for estimating the error variances in the full dataset based on a Latin Hypercube subsample. Second, a generalized version of the Information-Based Optimal Subdata Selection (IBOSS) algorithm is introduced that uses the variance estimates to find subsamples with high D-efficiency. Third, an Approximate Nearest Neighbor Simulated Annealing (ANNSA) algorithm is used to find subsamples that are efficient under the I-optimality criterion, which seeks to minimize integrated prediction error variance. Simulations show that the proposed subsampling algorithms have better D- and I-efficiencies than existing methods. The subsampling methods are used to analyze an airline dataset with over 7 million rows.
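
For reference, below is a sketch of the basic IBOSS rule that the paper generalizes: for a linear model, a D-efficient subsample keeps the r smallest and r largest values of each covariate in turn. The heteroscedastic extension would incorporate the kernel-based variance estimates when selecting points; that step is omitted here.

    # Sketch of basic IBOSS subdata selection (Wang, Yang, and Stufken's rule):
    # take the r = k/(2p) most extreme rows on each covariate, skipping rows
    # already selected.
    import numpy as np

    def iboss(X, k):
        n, p = X.shape
        r = k // (2 * p)
        chosen = np.zeros(n, dtype=bool)
        for j in range(p):
            order = np.argsort(X[:, j])
            avail = order[~chosen[order]]      # rows not yet selected, sorted
            chosen[avail[:r]] = True           # r smallest on covariate j
            chosen[avail[-r:]] = True          # r largest on covariate j
        return np.flatnonzero(chosen)

    rng = np.random.default_rng(5)
    X = rng.standard_normal((100_000, 4))
    idx = iboss(X, k=1_000)
    print(f"selected {len(idx)} of {len(X)} rows")

The cost is one sort per covariate, so the rule scales easily to datasets with millions of rows like the airline data analyzed in the paper.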