StatConnect@AI: Student Posters

Xin Li – PhD Student, Dept. of Math/Statistics – Georgetown University

This paper introduces a computationally efficient algorithm in systems theory for solving inverse problems governed by linear partial differential equations (PDEs). We model solutions of linear PDEs using Gaussian processes with priors constructed using tools from commutative algebra and algebraic analysis. The implementation of these priors is algorithmic and is carried out in the Macaulay2 computer algebra system. An example application is identifying the wave speed from noisy data for classical wave equations, which are widely used in physics. The method achieves high accuracy while enhancing computational efficiency.

Dana Kim – Master’s Student, Data Science and Analytics – Georgetown University

This study examines whether the New York Police Department’s implementation of a proprietary predictive policing system known as Patternizr exacerbated racial bias in policing practices. Using 9.49 million complaint records from NYC Open Data spanning 2006–2024 and focusing on violent felonies, we conducted statistical and time-series analyses to compare racial disparities across the NYPD precincts with the highest complaint counts before and after the rollout of Patternizr in 2013. We employed t-tests, chi-square tests, ANOVA, and ARIMA time-series analysis to evaluate changes in monthly complaint volumes and the racial composition of complaints, and forecasted reporting delays and monthly complaint volume based on past trends. Across the 20 precincts we examined, 18 exhibited statistically significant shifts in racial composition following Patternizr’s implementation. Although Black suspects became less proportionally overrepresented relative to pre-Patternizr baselines, when changes in representation are compared across racial groups, Black suspects became more overrepresented relative to White suspects. Additionally, complaint counts involving Black suspects remained 5–7 times higher per month than those involving White suspects, and crimes involving Black suspects were reported approximately three times faster than those involving White suspects. Our results indicate that existing disparities in policing patterns persist in the post-implementation period and show evidence of at least relative exacerbation for historically marginalized racial groups. These findings suggest that predictive policing systems trained on historically biased data may reproduce (if not exacerbate) existing disparities, underscoring the importance of rigorous empirical evaluation, transparency, and algorithmic auditing in the deployment of AI-driven law enforcement tools.

Zixuan Zhao – PhD Student, Department of Statistics – George Washington University

Individual patient data (IPD) are essential for statistical analysis in clinical research, yet access is often limited by privacy concerns, high data-sharing costs, and proprietary restrictions. Conventional approaches to synthetic data generation, such as generative adversarial networks (GANs), require a large amount of IPD as a training set. We propose an assumption-lean, three-step method to create synthetic IPD for survival data that requires no IPD for training. In summary, it digitizes the reported Kaplan-Meier (KM) plots and generates covariates that match the reconstructed data. Compared with existing IPD reconstruction methods, our approach is the first to exploit Scalable Vector Graphics (SVG) for high-accuracy digitization, and the first to provide covariate information. We demonstrate the method’s potential through two detailed case studies and complementary simulation studies.
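To make the KM-inversion step concrete, the sketch below recovers per-step event counts from digitized (time, survival) step coordinates, using the identity d_i = n_i (1 - S(t_i)/S(t_{i-1})). It is a minimal stand-in for the authors' SVG-based pipeline; the function name and inputs are hypothetical, and censoring between steps is ignored for simplicity.

```python
def km_step_events(times, surv, n_at_risk0):
    """Invert KM steps into event counts.

    times, surv : coordinates of the digitized step function, starting at S(0)=1
    n_at_risk0  : initial number at risk (read from the risk table)
    Returns a list of (event_time, n_events) pairs.
    """
    n = n_at_risk0
    out = []
    for i in range(1, len(times)):
        if surv[i] < surv[i - 1]:
            # KM step: S(t_i) = S(t_{i-1}) * (1 - d_i / n_i)
            d = round(n * (1 - surv[i] / surv[i - 1]))
            out.append((times[i], d))
            n -= d  # events leave the risk set
    return out
```

For example, a curve dropping from 1.0 to 0.9 and then to 0.72 with 10 subjects initially at risk implies one event at the first step and two at the second.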

Han Su – PhD Student, Department of Statistics – George Washington University

Combining forecasts from multiple experts often yields more accurate results than relying on a single expert. In this paper, we introduce a novel regularized ensemble method that extends the traditional linear opinion pool by leveraging both current forecasts and historical performances to set the weights. Unlike existing approaches that rely only on either the current forecasts or past accuracy, our method accounts for both sources simultaneously. It learns weights by minimizing the variance of the combined forecast (or its transformed version) while incorporating a regularization term informed by historical performances. We also show that this approach has a Bayesian interpretation. Different distributional assumptions within this Bayesian framework yield different functional forms for the variance component and the regularization term, adapting the method to various scenarios. In empirical studies on Walmart sales and macroeconomic forecasting, our ensemble outperforms leading benchmark models both when experts’ full forecasting histories are available and when experts enter and exit over time, resulting in incomplete historical records. Throughout, we provide illustrative examples that show how the optimal weights are determined and, based on the empirical results, we discuss where the framework’s strengths lie and when experts’ past versus current forecasts are more informative.
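The weight-learning step described above admits a closed form under simple assumptions. The sketch below minimizes the combined-forecast variance plus a quadratic penalty pulling the weights toward history-based weights, subject to the weights summing to one; the specific penalty form and function name are illustrative choices, not necessarily the paper's exact formulation.

```python
import numpy as np

def regularized_pool_weights(Sigma, w_hist, lam):
    """Minimize w' Sigma w + lam * ||w - w_hist||^2 subject to sum(w) = 1.

    Sigma  : covariance of the experts' current forecast errors
    w_hist : weights suggested by historical accuracy
    lam    : regularization strength (lam = 0 ignores history)
    Closed form from the KKT conditions of this equality-constrained QP.
    """
    p = len(w_hist)
    A = Sigma + lam * np.eye(p)
    u = np.linalg.solve(A, lam * np.asarray(w_hist, dtype=float))
    v = np.linalg.solve(A, np.ones(p))
    c = (1.0 - u.sum()) / v.sum()  # enforces the sum-to-one constraint
    return u + c * v
```

With lam = 0 this reduces to the classical minimum-variance opinion pool; as lam grows, the weights shrink toward the history-based weights, mirroring the two information sources the method combines.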

Arman Azizyan – Master’s Student, Department of Statistics – George Mason University

Deep Learning Based Changepoint Detection Algorithm for Time Series Data with Smooth Local Fluctuations

Detecting mean shifts in time series is challenging when the underlying signal contains smooth structure that masks abrupt changes, causing standard changepoint algorithms to perform poorly. This work introduces a deep neural network based approach that isolates local mean shifts by explicitly modeling and removing the smooth component. For each time index, two feed-forward neural network (FNN) models are trained on left and right neighborhoods of a set width, and a third FNN is trained on the combined two-window region. A detector statistic is constructed from the residuals of these three models, quantifying whether the two-model fit provides a significantly better local description than the single-model alternative. The statistic is smoothed, and local maxima are extracted under a minimum-distance constraint to identify candidate changepoints. A threshold is then selected using the empirical cumulative distribution function of the detector statistic, enabling data-driven identification of significant peaks. A k-means refinement step subsequently adjusts the estimated changepoint locations and estimates the magnitude of the mean shift. Finally, a global FNN is fitted to the corrected series to assess model adequacy and provide diagnostic evaluation. Simulation studies and a real-world application demonstrate that the proposed method substantially outperforms selected competing methods, yielding fewer false positives and more accurate localization of mean shifts, particularly in the presence of smooth underlying structure.
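The two-window-versus-one-window residual contrast can be sketched as follows, with low-degree polynomial fits standing in for the feed-forward networks (the FNNs, smoothing, thresholding, and k-means refinement of the actual method are omitted; window width and degree are illustrative).

```python
import numpy as np

def detector(y, w=20, deg=2):
    """Residual-contrast statistic at each index.

    For each i, fit the left window, the right window, and the combined
    window separately; a large gap between the combined-fit RSS and the
    sum of the two single-window RSS values signals a local mean shift.
    Polynomial fits are stand-ins for the paper's FNNs.
    """
    n = len(y)
    t = np.arange(n, dtype=float)
    D = np.zeros(n)

    def rss(s):
        coef = np.polyfit(t[s], y[s], deg)
        return np.sum((y[s] - np.polyval(coef, t[s])) ** 2)

    for i in range(w, n - w):
        left, right = slice(i - w, i), slice(i, i + w)
        both = slice(i - w, i + w)
        D[i] = rss(both) - (rss(left) + rss(right))
    return D
```

On a smooth sinusoid with an abrupt level shift, the statistic peaks at the shift even though the smooth trend dominates the raw series.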

Jilei Lin – PhD Student, Department of Statistics – George Washington University

Experimental autoimmune myasthenia gravis (EAMG) is an established animal model for studying the progression of MG, a chronic autoimmune neuromuscular disorder, and for developing effective treatments. In EAMG studies, weight trajectories exhibit biologically distinct phases: an initial pre-chronic period followed by chronic sub-phases with differing rates of change. These structured transitions motivate modeling the relationship between weight and treatment time using a bent-line regression framework. However, the analysis is complicated by ethical and experimental protocols requiring early euthanasia of severely affected mice, which introduces monotone missingness in the weight data. Standard approaches that ignore this mechanism yield biased estimates of both slopes and change points. We develop a censored bent-line quantile regression framework with a simple estimation procedure. By leveraging the appealing identifiability of quantiles under censoring, the method consistently identifies pre-chronic and chronic sub-phases without making parametric distributional assumptions. We establish the consistency and asymptotic normality of the proposed estimator and develop a bias-corrected approach for constructing confidence intervals. Simulation studies and application to the EAMG data demonstrate that the proposed method substantially reduces bias and improves inference in evaluating the efficacy of an antigen-specific immunotherapeutic vaccine.
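The bent-line structure underlying the model can be illustrated with a simple grid search over the change point, using least squares as a stand-in for the paper's censored quantile loss (function name and grid are hypothetical; the censoring adjustment and bias correction are omitted).

```python
import numpy as np

def fit_bent_line(t, y, tau_grid):
    """Fit y ~ b0 + b1*t + b2*(t - tau)_+ by profiling over tau.

    The slope before tau is b1; after tau it changes to b1 + b2,
    mimicking the pre-chronic / chronic phase transition.
    Least squares stands in for the censored quantile loss.
    """
    best = None
    for tau in tau_grid:
        X = np.column_stack([np.ones_like(t), t, np.maximum(t - tau, 0.0)])
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ b) ** 2)
        if best is None or rss < best[0]:
            best = (rss, tau, b)
    return best[1], best[2]  # estimated change point and coefficients
```

On noiseless data with a slope change at t = 10, the grid search recovers the change point and the slope decrement exactly.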

Roberta Southwood – Master’s Student, Data Science – American University

This study applies statistical methods to rigorously test mathematical sheaf modeling on the wolf and moose predator–prey dynamics of Isle Royale, Michigan. Data are from the contiguous collection years 1959–2019. The importance of this study is to improve understanding of sheaf modeling of ecological systems. Sheaf modeling is a topological tool which allows mathematicians to track data locally and globally while accounting for inconsistencies and forming hypotheses about future interactions. We apply these concepts to a study on wolves and moose previously modeled by James T. Thorson et al. with a dynamic structural equation model (DSEM). We are reapplying the netlist translation from DSEM models to sheaf structures developed by Michael Robinson et al. In our novel approach, we apply this method along various segmented timelines of data, allowing for an intricate evaluation of the sheaf model framework. Results are quantified by minimizing and tracking consistency radii, which measure the distance between observed and hypothesized data. Exploring consistency radii allows verification of model performance and identifies areas in need of improvement or further study.

Zexin Ren – PhD Student, Department of Statistics – George Washington University

Systematic literature reviews of clinical trials are central to evidence synthesis and regulatory decision-making, but conventional workflows are time-consuming, labor-intensive, and vulnerable to selection bias. We propose two semi-automated multi-agentic systems (MAS) to support systematic reviews: one for study screening and one for data extraction. In the screening MAS, multiple independent agents evaluate trials in parallel, and an inspector agent resolves disagreements for human verification. The extraction MAS uses a standardize-then-extract architecture and reports extraction confidence for extracted fields. Across the workflow, the proposed systems substantially reduce screening and extraction time while producing stable and reproducible decisions. In a real-world replication of a published network meta-analysis, the framework recovered all previously included studies and identified additional eligible trials missed by manual review. This work provides a reproducible, scalable, and regulatory-aligned framework for AI-assisted evidence synthesis in clinical research.

Yang Long – PhD Student, Department of Statistics – George Mason University

Understanding how covariates, such as demographic, clinical and genetic variables, relate to complex spatial imaging outcomes is a fundamental problem in brain imaging studies, particularly when imaging responses are high-dimensional and only partially observed. Recent generative models allow missing imaging modalities to be synthesized from auxiliary data, but their incorporation into regression analysis requires careful statistical treatment. In this project, we propose a two-stage Synthetic Surrogate Functional Regression (SSFR) framework, a statistical safeguard for incorporating AI-generated neuroimages to learn spatially varying covariate effects in the presence of partially observed imaging modalities. SSFR addresses data sparsity by integrating surrogate images within a unified regression framework, yielding stable and efficient estimation when the primary imaging modality is incompletely observed and remaining robust to the choice of surrogate generation model. To enable neuroimaging data with dense voxel grids over complex 3D domains, we further develop a distributed version that utilizes a triangulation-based domain decomposition, partitioning the imaging space into subregions and enabling parallel estimation via trivariate penalized splines. The SSFR scheme preserves the statistical accuracy of the centralized estimator while substantially reducing computational and memory costs, enabling practical analysis of high-dimensional imaging data. Extensive simulations demonstrate the numerical accuracy, robustness, and scalability of the methods. Applied to spatially normalized PET images from ADNI, SSFR yields interpretable, high-resolution coefficient maps characterizing spatially varying associations between auxiliary covariates and cerebral metabolism, demonstrating the critical synergy between modern AI data synthesis and rigorous statistical methodology.

Jiayi Zheng – PhD Student, Department of Statistics – George Mason University

In recent years, the size of datasets has dramatically increased. This has encouraged the use of subsampling, where only a subset of the full dataset is used to fit a model in a more computationally efficient manner. Existing methods do not provide much guidance on how to find optimal subsamples for a linear model when the variance of the errors depends on the model covariates through an unknown function. This paper presents three main contributions that aid in finding optimal subsamples in the case of heteroscedastic errors. First, a kernel-based method is proposed for estimating the error variances in the full dataset based on a Latin Hypercube subsample. Second, a generalized version of the Information-Based Optimal Subdata Selection (IBOSS) algorithm is introduced that uses the variance estimates to find subsamples with high D-efficiency. Third, an Approximate Nearest Neighbor Simulated Annealing (ANNSA) algorithm is used to find subsamples that are efficient under the I-optimality criterion, which seeks to minimize integrated prediction error variance. Simulations show that the proposed subsampling algorithms have better D- and I-efficiencies than existing methods. The subsampling methods are used to analyze an airline dataset with over 7 million rows.
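For readers unfamiliar with IBOSS, the sketch below implements the basic (homoscedastic) selection rule that the paper generalizes: for each covariate, keep the rows with the most extreme remaining values, which are the most informative for the D-optimality criterion. This is a minimal sketch of the standard algorithm, not the paper's variance-weighted version.

```python
import numpy as np

def iboss(X, k):
    """Basic IBOSS subdata selection.

    For each of the p covariates, take the k/(2p) smallest and k/(2p)
    largest rows not yet chosen; extreme covariate values maximize the
    determinant of the information matrix (D-optimality heuristic).
    """
    n, p = X.shape
    r = k // (2 * p)
    chosen = []
    avail = np.ones(n, dtype=bool)
    for j in range(p):
        idx = np.where(avail)[0]
        order = idx[np.argsort(X[idx, j])]
        pick = np.concatenate([order[:r], order[-r:]])
        chosen.extend(pick.tolist())
        avail[pick] = False
    return np.array(chosen)
```

With a single covariate taking values 0 through 9 and a budget of k = 4, the rule selects the two smallest and two largest rows.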

Aakash Hariharan – Master’s Student – George Washington University

The study investigates the efficacy of satellite-derived atmospheric pollution data as high-frequency proxies for sectoral economic performance in regions where traditional national accounting data are limited or unreliable. Conventional indicators, such as Nighttime Lights, have long been used as proxies for aggregate economic activity. However, they often suffer from luminosity saturation in dense urban and industrial hubs, which provides limited insight into specific productive sectors. By synthesizing existing literature and empirical analysis, this research evaluates how distinct chemical signatures, including Nitrogen Dioxide (NO2), Sulfur Dioxide (SO2), Methane (CH4), and Particulate Matter (PM2.5), can be linked to sector-specific economic activities such as transportation, heavy manufacturing, energy production, and mining. The study first addresses the technical challenge of isolating PM2.5 levels from NO2 and SO2 precursors to derive a refined signature of transportation activity in Pakistan. Second, Algeria’s separation of hydrocarbon and non-hydrocarbon GDP is used to identify which atmospheric indicators are most closely associated with extractive activity relative to broader industrial production. Finally, a comparative analysis using subnational and quarterly GDP estimates from Myanmar examines the statistical relationships between these environmental indicators and observed sectoral economic performance. This multi-country framework offers a robust methodology for near-real-time monitoring of economic activity in data-scarce developing economies.

Naiska Buyandalai – Master’s Student – George Washington University

Food deserts, areas where residents have limited access to affordable and nutritious food, affect millions of Americans and are closely linked to economic inequality and poor health outcomes. This study analyzes the structural and socioeconomic factors associated with food desert prevalence across the United States using data from the USDA Food Access Research Atlas and the USDA Food Environment Atlas. After extensive data cleaning and feature engineering, tract and county level indicators were constructed to capture income levels, poverty rates, food assistance participation, transportation access, and food retail availability. Exploratory data analysis revealed strong geographic and socioeconomic disparities in food access, with higher concentrations of food desert tracts in areas characterized by lower median income, higher poverty rates, and greater reliance on SNAP benefits. To further examine these relationships, a Random Forest model was developed to identify the most influential predictors of food desert prevalence. The results highlight how structural economic conditions strongly shape food accessibility across communities. Based on these findings, this study proposes the Food Equity & Economic Development (FEED) Act, a data driven policy framework that combines targeted tax incentives for healthy food retailers, infrastructure investment in underserved areas, and workforce development initiatives. This work demonstrates how data science and machine learning can support evidence based policies aimed at reducing food access inequities in the United States.

Ziji Qin – PhD Student – George Washington University

In comparative studies, achieving balance in influential baseline covariates is fundamental to the credible assessment of treatment effects. Two prominent strategies for improving covariate balance are covariate-adaptive randomization (CAR) and rerandomization (RR). CAR was originally designed for settings in which subjects are enrolled sequentially, whereas RR requires that all subjects enter the trial simultaneously. Although both approaches aim to mitigate covariate imbalance, their underlying mechanisms differ substantially, which has hindered systematic comparisons and left important theoretical questions unresolved. This article develops a unified framework that enables a rigorous comparison of CAR and RR when both procedures target an identical imbalance criterion. Within this framework, we establish conditions under which RR achieves balance properties comparable to those of CAR and characterize how each design affects the asymptotic behavior of treatment-effect estimators. We further investigate the computational burden of RR, showing that it grows linearly with the sample size but exponentially with the covariate dimension. As a result, while RR can be effective in small to moderate samples, its computational cost escalates rapidly, limiting its scalability in large trials or experiments with a large number of covariates. In contrast, CAR retains computational efficiency while delivering strong balance guarantees in both sequential and simultaneous enrollment settings.
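The computational burden of RR comes from its acceptance-sampling structure, which the following sketch makes explicit: assignments are redrawn until a Mahalanobis imbalance measure falls below a threshold, so a stricter threshold or higher covariate dimension means more rejected draws. The function name, threshold, and 1:1 allocation are illustrative assumptions.

```python
import numpy as np

def rerandomize(X, a=0.1, seed=0, max_draws=100_000):
    """Rerandomization by acceptance sampling.

    Redraw a balanced 1:1 treatment assignment until the Mahalanobis
    imbalance between arms falls below the threshold a. Returns the
    accepted assignment and the number of draws it took.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Sinv = np.linalg.inv(np.cov(X, rowvar=False))
    for draws in range(1, max_draws + 1):
        z = rng.permutation(np.repeat([0, 1], n // 2))
        diff = X[z == 1].mean(axis=0) - X[z == 0].mean(axis=0)
        # Mahalanobis imbalance; approx. chi-squared(p) under randomization
        M = (n / 4.0) * diff @ Sinv @ diff
        if M < a:
            return z, draws
    raise RuntimeError("imbalance threshold too strict for max_draws")
```

The acceptance probability is roughly the chi-squared(p) tail mass below a, so the expected number of draws grows quickly as p increases or a shrinks, matching the scalability limitation discussed above.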

Ritu Patel – Master’s Student – George Washington University

Temperature in large language models (LLMs) is usually seen as a simple sampling control, but the impact on behavior is more nuanced than just a change from determinism to randomness. We address this in the first part by framing low-temperature generation as a dynamical systems problem and then testing if different LLMs have unique temperature-dependent “fingerprints.” We convert generated texts into a series of discrete symbols using symbolic dynamics to study tail periods, attractor-like motifs, and the transitions from fixed or low-period behavior to complex regimes. We perform a temperature sweep for several autoregressive models: DistilGPT-2, GPT-2, GPT-2 Medium, GPT-2 XL, GPT-Neo 1.3B, Qwen1.5-1.8B, and Spanish GPT-2. Results reveal that low-temperature behavior significantly differs depending on the model. Spanish GPT-2 and GPT-2 show the most stability, having dominant fixed-point patterns and exhibiting chaotic behavior much later. GPT-Neo 1.3B and GPT-2 XL have more distinct transition phases with temporary periodic motifs, whereas DistilGPT-2, GPT-2 Medium, and Qwen1.5-1.8B become unstable earlier and show more irregular symbolic expansion. Transition temperatures also vary depending on the prompt. To sum up, the study indicates that each LLM has a unique low-temperature behavioral pattern, and the symbolic period structure offers a human-understandable method for comparing generation dynamics among different models.
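One ingredient of the symbolic-dynamics analysis, detecting whether a symbol sequence has settled into a periodic tail, can be sketched as below. This is a minimal illustrative check on an already-symbolized sequence, not the study's full pipeline (symbolization scheme, motif extraction, and transition-temperature estimation are omitted).

```python
def tail_period(seq, max_p=10):
    """Return the smallest period p (up to max_p) with which the tail
    of seq repeats, or None if no short periodic tail is found.

    A period of 1 indicates a fixed point (the model emits one symbol
    forever); small p > 1 indicates an attractor-like periodic motif.
    """
    for p in range(1, max_p + 1):
        tail = seq[-3 * p:]  # inspect the last few candidate cycles
        if len(tail) >= 2 * p and all(
            tail[i] == tail[i + p] for i in range(len(tail) - p)
        ):
            return p
    return None
```

A sequence ending in a repeated single symbol yields period 1, while a repeating three-symbol motif yields period 3; sequences in the chaotic regime typically return None.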

Jing Tan & Yuhua Pang – Master’s Students – Georgetown University

This study analyzes changes in family formation patterns in the United States over the past four decades. It focuses on long-term trends in marriage, cohabitation and child-living arrangements, with variations in these trends across states. Unlike studies that rely solely on marriage or divorce rates, this paper focuses on the actual family structures in which children live, thereby providing a more comprehensive picture of contemporary changes in American families. The project combines long-term residential arrangement data from the Census Bureau, state-level marriage rate data from the CDC, and state-level unemployment rates and other socio-economic indicators from the BLS. The study describes long-term changes at the national level, including trends in the proportion of children living with married two-parent families, the proportion of children in single-parent families, and the proportion of families with children headed by unmarried couples. It subsequently examines differences in family structure at the state level and analyzes the influence of factors such as income, poverty, education and unemployment. Concurrently, the study employs spatial analysis methods to examine whether there is spatial clustering of family structures across states. Furthermore, as an extension, this paper plans to utilize the PSID or NLSY79 Child and Young Adult data to explore the intergenerational links between family-of-origin structure and marital, cohabitation, and fertility behaviors in adulthood. Through this research, the paper aims to gain a more comprehensive understanding of the diverse trends in family formation in the United States and their social implications.

Aditya Kanbargi & Sanjana Kadambe Muralidhar – Master’s Students – George Washington University

NutriMap is a data analytics project that examines the relationship between food access, grocery pricing, and inflation across the United States. The research combines county-level food environment data from the U.S. Department of Agriculture, over 48,000 real-time grocery prices from a major national retailer, product nutrition information from a public food database, and ten years of government food price indices. Using this integrated dataset, the analysis identifies 345 food desert counties nationwide, affecting an estimated 13.6 million people. The results reveal a striking affordability gap: residents in food deserts can afford roughly 11 grocery items per week on a typical food budget, compared with 35 items for those in well-served areas — a 2.5 times difference. The project also finds that healthier food products cost 136 percent more per unit than processed alternatives, creating a systematic nutrition disadvantage for lower-income households. To anticipate future trends, the project compares two forecasting methods across six food price categories and finds that four are expected to rise over the next six months, signaling continued pressure on household food budgets. All findings are presented through a three-page interactive dashboard featuring a national food desert map, a cost-of-nutrition comparison, and an inflation forecast view. NutriMap demonstrates how combining spatial, nutritional, and economic data into a single visual platform can make food access inequality more visible and support better-informed public health and policy discussions.

Ruishan Lin – PhD Student – George Mason University

High-dimensional genomic data play a pivotal role in pharmaceutical research and precision medicine. Conventional gene-level analyses often assume independence among genes, potentially missing coordinated biological processes driving disease progression and treatment response. Network-based approaches address this limitation by modeling gene–gene interactions and enabling the identification of functionally coherent molecular substructures linked to clinical phenotypes. We propose a pipeline that constructs a disease network for each sample, classifies responder and nonresponder samples accurately, and extracts the most influential subnetworks to drive interpretable biomarker discovery. The proposed method incorporates Graph Neural Networks (GNNs) and Graph Information Bottleneck (GIB) principles, striking a balance between prediction and dimension reduction. One key highlight of our approach is the explainability of the deep learning models. Rather than relying on post-hoc explainers, the optimization penalty itself quantifies node and edge significance. Edges and nodes that strongly resist the prior yield high KL scores, allowing for the direct extraction of a “minimal sufficient subgraph.” We demonstrate the efficacy of this pipeline in identifying group-discriminative features and shared biological hubs within complex disease networks. Ultimately, this framework provides a generalizable, end-to-end explainable pipeline for mapping drug response mechanisms and isolating true driver genes in precision medicine, while addressing current computational scalability limits in whole-genome applications.

Hannah Peterson – Master’s Student – American University

According to the National Institutes of Health, women experience a longer time between symptom onset and diagnosis than men do for most diseases (2023). Delayed diagnosis and underdiagnosis pose a public health threat to those who are missed, as the longer a disease goes untreated, the more a patient is susceptible to worsening symptoms and negative impacts on their health. Expert systems are AI-powered software meant to mimic the decision making of human experts, including medical professionals. Although limited in interactive capabilities, they are used by professionals to provide advice and develop solutions in appropriate settings. While useful, expert systems utilized in a medical setting require professional standards. That women are grossly underdiagnosed and misdiagnosed by professionals suggests that this misrepresentation could be relayed to expert systems. As expert systems are programmed and trained by the same experts making these mistakes, I wondered whether expert systems had a misdiagnosis rate similar to that of human experts. I analyzed two systems utilized by medical professionals based on four gender groups and four diseases, inputting the relevant symptoms per gender, looking for a specific diagnosis, and analyzing the output of these systems. The results of my experiment showed statistical significance and visual disparity between male and female diagnoses. Further, the data show a possible secondary discovery, with differences in diagnosis types favoring mental illness related diagnoses for women far more than reported for men.

Navid Nezamabadi – PhD Student – George Washington University

A zero-inflated Poisson–Gamma Dynamic System for forecasting intermittent demand in inventory systems is explored. The model incorporates a binary occurrence component governing zero versus nonzero demand alongside a dynamic Poisson–Gamma structure for demand size, enabling flexible Bayesian modeling of temporal dependence and heterogeneity in intermittent demand processes.
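The two-component structure (a binary occurrence model times a size model) can be illustrated with a static conjugate-updating sketch: a Beta posterior for the occurrence probability and a Gamma posterior for the Poisson rate of nonzero sizes. This is an illustrative simplification, not the dynamic (time-varying) model the abstract describes; all prior values and the function name are hypothetical.

```python
def zip_pg_forecast(demand, a=1.0, b=1.0, p1=1.0, p0=1.0):
    """One-step-ahead mean forecast for intermittent demand.

    Occurrence: Beta(p1, p0) prior on P(demand > 0), updated per period.
    Size: Gamma(a, b) prior on the Poisson rate of nonzero demands.
    The forecast is (posterior mean occurrence) * (posterior mean rate).
    """
    for d in demand:
        if d > 0:
            p1 += 1          # one more nonzero period
            a += d           # Gamma shape absorbs the demand size
            b += 1           # one more exposure period for the rate
        else:
            p0 += 1          # one more zero period
    occ = p1 / (p1 + p0)     # posterior mean occurrence probability
    size = a / b             # posterior mean demand size when nonzero
    return occ * size
```

For the series [0, 0, 3, 0, 2] with unit priors, the posterior occurrence probability is 3/7 and the posterior mean size is 2, giving a forecast of 6/7 units per period.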

Zixu Hao – Master’s Student – Georgetown University

Microbiome studies increasingly rely on publicly available sequencing data, yet contamination remains a major obstacle to reliable biological interpretation. Existing decontamination methods often depend on experimental metadata or laboratory measurements that are missing from shared datasets, limiting their use in secondary analyses. This study introduces a new perspective that treats contamination as an ecological coexistence problem between microbes and host samples. By modeling host–microbe relationships within a network framework, we identify microbial features that consistently associate with negative controls as likely contaminants. The same framework also defines biologically meaningful core microbes in disease-relevant settings. Because it requires only routinely available data, this approach enables principled, interpretable decontamination across diverse microbiome studies and supports more reliable large-scale and meta-analytic research.