Statistics

Advances in sensing technology have made it possible to collect large volumes of high-dimensional time-series data. In fields like genetics and neuroscience, key questions concern whether directed relationships between variables can be learned from these data. To this end, graphical vector autoregressions are a popular tool because zeros among the autoregressive coefficients and error precision matrix have natural interpretations in terms of Granger non-causality and contemporaneous conditional independence. In many applications where system dynamics are subject to functional or structural constraints, assuming the process is stable can be advantageous. However, enforcing stability demands restricting the autoregressive coefficients to lie in a constrained space with a complex geometry called the stationary region. The resulting inferential challenges are compounded when sparsity is also a requirement. Working in the Bayesian paradigm, we tackle the problem through a parameter expansion approach, constructing a spike-and-slab prior with support constrained to the stationary region. A mixture of G-Wishart distributions provides a sparse prior for the error precision matrix. Computational inference is carried out via a Metropolis-within-Gibbs scheme which exploits the No-U-Turn Sampler and reversible-jump steps. We demonstrate the benefits, both inferential and predictive, of our approach through simulation experiments and an application in neuroscience.
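
A minimal sketch of what the stationary region means in practice, assuming a VAR(p) written with coefficient matrices A_1, ..., A_p: the process is stable exactly when every eigenvalue of its companion matrix has modulus below one. The function names and example matrices are illustrative and not taken from the talk; the sketch only checks stability and does not implement the constrained prior or the sampler.

```python
import numpy as np


def companion_matrix(coefs):
    """Stack VAR(p) coefficient matrices A_1..A_p (each m x m)
    into the (m*p) x (m*p) companion matrix."""
    p = len(coefs)
    m = coefs[0].shape[0]
    top = np.hstack(coefs)  # first block row: [A_1, ..., A_p]
    if p == 1:
        return top
    # Lower block rows: identity blocks that shift the lagged state down.
    bottom = np.hstack([np.eye(m * (p - 1)), np.zeros((m * (p - 1), m))])
    return np.vstack([top, bottom])


def is_stable(coefs):
    """Return True if all companion eigenvalues lie strictly inside the unit circle."""
    eigvals = np.linalg.eigvals(companion_matrix(coefs))
    return bool(np.max(np.abs(eigvals)) < 1.0)


# Illustrative (hypothetical) coefficients for a bivariate VAR(2).
A1 = np.array([[0.5, 0.0], [0.2, 0.3]])
A2 = np.array([[0.1, 0.0], [0.0, 0.1]])
print(is_stable([A1, A2]))

# A VAR(1) with an explosive first component.
print(is_stable([np.array([[1.1, 0.0], [0.0, 0.5]])]))
```

The zero entries in A1 and A2 are the kind of sparsity the abstract interprets as Granger non-causality; the difficulty it describes is that a prior must place such zeros while keeping the coefficients inside this eigenvalue constraint.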

We propose a joint modeling framework that integrates zero-inflated longitudinal count data with time-to-event outcomes, explicitly accounting for a cure fraction. The longitudinal process is modeled using flexible mixed-effects Hurdle models to handle excess zeros and overdispersion, while the survival component combines a Cox model with a mixture cure formulation to distinguish susceptible and cured individuals. The two processes are linked through current longitudinal information, enabling dynamic risk prediction. Inference is performed using Hamiltonian Monte Carlo for robust estimation. We validate the approach through simulations and apply it to an HIV cohort, demonstrating its value for personalized risk assessment and clinical decision-making.
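
For readers unfamiliar with the mixture cure formulation, its standard form can be sketched as follows. The notation is introduced here for illustration and is not taken from the talk: \pi is the cure probability, S_u and h_u are the survival and hazard functions of susceptible individuals, h_0 is a baseline hazard, and m(s) is an assumed summary of the current longitudinal state with association parameter \alpha.

```latex
\[
  S_{\mathrm{pop}}(t) \;=\; \pi + (1 - \pi)\, S_u(t),
  \qquad
  S_u(t) \;=\; \exp\!\Big\{ -\int_0^t h_u(s)\, \mathrm{d}s \Big\},
\]
\[
  h_u(s) \;=\; h_0(s)\, \exp\!\big\{ x^\top \beta + \alpha\, m(s) \big\}.
\]
```

Because S_u(t) tends to zero for large t, the population survival curve levels off at \pi, which is how the cure fraction shows up in the data.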

A key objective in spatial epidemiology is to identify the drivers of elevated disease risks at a population level, using non-overlapping areal unit level data that comprise the total numbers of disease cases, exposures of interest and known confounders. The spatial pattern in disease risk is likely to be influenced by unmeasured confounders, whose omission induces spatial autocorrelation into the residuals from the chosen epidemiological model. A Poisson log-linear model fitted in a Bayesian paradigm is typically used for inference, which incorporates the known confounders with a linear or additive regression component and the unmeasured confounders via a set of spatially autocorrelated random effects. While such a model correctly allows for the inherent autocorrelation in the data, confounder interactions and the shapes of their functional relationships with disease risk have to be specified in advance rather than being directly learned from the data. Therefore, this paper proposes the SPAR-Forest-ERF algorithm for population-level epidemiological risk assessment, which is the first fusion in this context of random forests for capturing non-linear and interacting confounder-response effects with Bayesian spatial autocorrelation models that can estimate interpretable exposure response functions (ERF) with full uncertainty quantification. Methodologically, we extend existing methods set in a prediction context by correctly propagating uncertainty between the ML and statistical models, developing a new stopping criterion designed to ensure the stability of the primary inferential target, and incorporating a range of different ERFs for maximum model flexibility. This methodology is motivated by a new study of the impact of air pollution concentrations on self-rated health in Scotland, using data from the recently released 2022 national census.
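
The baseline Poisson log-linear model described above can be written, under assumed notation, as the sketch below. Here Y_k is the number of cases in areal unit k, E_k is an assumed expected count used as an offset, \theta_k is the relative risk, g is the exposure response function for exposure e_k, x_k holds the known confounders, and \phi_k are the spatially autocorrelated random effects. None of these symbols come from the talk itself.

```latex
\[
  Y_k \mid \theta_k \;\sim\; \mathrm{Poisson}(E_k\, \theta_k),
  \qquad
  \log \theta_k \;=\; \beta_0 + g(e_k) + x_k^\top \beta + \phi_k,
  \qquad k = 1, \dots, K.
\]
```

In this form the confounder term x_k^\top \beta is linear and must be specified in advance; the proposal in the abstract is to let a random forest learn that component from the data instead.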

Object oriented data analysis (OODA) provides a framework for the statistical analysis of a wide variety of highly structured data. Shapes, images, functions, covariance matrices, networks and other manifold-valued data are all examples of object data that can be analyzed using OODA methods, and colleagues at Leeds have been at the forefront of many methodological developments. One of the tasks in OODA is regression analysis for manifold-valued data, and we consider an approach to manifold regression using intermediate spaces, by mapping from complex response or predictor manifolds into simpler spaces where statistical analysis is more straightforward. We focus on two applications: compositional data analysis of environmental chemicals in South Florida, and the monitoring of peat bogs from satellite images in the Flow Country in northern Scotland.

Likelihood-based inference in stochastic non-linear dynamical systems, such as those found in chemical reaction networks and biological clock systems, is inherently complex and has largely been limited to small and unrealistically simple systems. Recent advances in analytically tractable approximations to the underlying conditional probability distributions enable long-term dynamics to be accurately modelled and make Bayesian inference much more feasible. We use preliminary analyses based on the Fisher Information Matrix of the model to guide the implementation of Bayesian inference. We show that this parameter sensitivity analysis can predict which parameters are practically identifiable. An asymptotically exact inference process based on Markov chain Monte Carlo methods is then implemented to accurately estimate the reaction rate parameters.

We study two foundational problems in distributed survival analysis: estimating Cox regression coefficients and cumulative hazard functions, under federated differential privacy constraints, allowing for heterogeneous per-server sample sizes and privacy budgets. To quantify the fundamental cost of privacy, we derive minimax lower bounds along with matching (up to poly-logarithmic factors) upper bounds. In particular, to estimate the cumulative hazard function, we design a private tree-based algorithm for nonparametric integral estimation. Our results reveal server-level phase transitions between the private and non-private rates, as well as the reduced estimation accuracy from imposing privacy constraints on distributed subsets of data.

To address scenarios with partially public information, we also consider a relaxed differential privacy framework and provide a corresponding minimax analysis. To our knowledge, this is the first treatment of partially public data in survival analysis, and it establishes a no-gain-in-accuracy phenomenon. Finally, we conduct extensive numerical experiments, with an accompanying R package FDPCox, validating our theoretical findings. These experiments also include a fully interactive algorithm with tighter privacy composition, which demonstrates improved estimation accuracy.

This paper studies the detection of a change in high-dimensional linear models. We derive the minimax lower bounds on the detection boundary and the rate of estimation, which exhibit a phase transition with respect to the sparsity of the covariance-weighted differential parameter, revealing a delicate interaction between the covariance of regressors and the change in regression parameters. We complement these results by proposing methods that achieve minimax (near-)optimality in the sparse and the dense regimes, respectively. Referred to as McScan and QcScan, they scan the maximum and the quadratic aggregations of the local covariances at strategically selected locations for change point detection; in particular, QcScan is the first method shown to achieve consistency in the dense regime. Further, a combined method is proposed which is adaptively optimal even when the sparsity is unknown, and we complete the study of the change point problem by considering post-detection estimation of the differential parameter and the refinement of the change point estimators. Numerical experiments confirm the new findings and demonstrate the computational and statistical efficiency of the covariance-scanning based methods. This is joint work with Tobias Kley and Housen Li (University of Göttingen).

The increasing availability of longitudinal data is set to yield scientific discoveries across various domains, yet methods for modelling complex multivariate functional dependencies remain limited.

Motivated by a COVID-19 study conducted in Cambridge hospitals, we propose a Bayesian approach for representing high-dimensional curves, combining latent factor modelling and functional principal component analysis (FPCA). This approach captures correlations across variables (e.g., biomarkers) and time, by positing that subsets of variables contribute to a small number of FPCA expansions (e.g., representing latent disease processes) through variable-specific loadings. Subject variability is modelled using a small number of functional principal components, each characterised by a smoothly varying temporal function. A variational inference algorithm with simulated annealing ensures efficient exploration of multimodal distributions.

Our numerical experiments demonstrate reliable parameter estimation and scalability to high-dimensional data (such as longitudinal measurements of 20,000 genes across a few hundred subjects). Through the COVID-19 study, we illustrate how our framework helps disentangle disease heterogeneity. It clarifies which biomarkers coordinate over time and predicts molecular trajectories at the subject level, towards personalised treatment strategies.

This is joint work with Salima Jaoua and Daniel Temko.

The continuous sampling plan (CSP) can only meet a quality constraint and is slow to terminate a deteriorating process. To address this, we propose two improved schemes: the optimal continuous sampling plan (OCSP) and an integrated process control scheme that combines CSP with the process yield index Spk. OCSP identifies the unique optimal scheme parameters for a specified in-line process based on an estimator of the conformance rate, and terminates a deteriorating process using an effective working interval determined by the quality requirement. The integrated process control scheme simultaneously satisfies four constraints (quality, cost, and two types of risk) and stops a worsening process through a risk control strategy, which makes it suitable for scenarios with limited inspection capability.

In many real-world settings, spatial point processes provide a natural framework for modelling data such as molecular locations, structural features within cells or materials, or dead pixels on X-ray detectors. Comparing the empirical distributions of these point patterns across populations can give insight into the underlying generative mechanisms.
We employ geometry-aware nonparametric methods and illustrate them through two applications to confocal microscopy data.
First, we analyse the spatial locations of microtubules within K-fibers in cells with an overexpressed protein (TACC3) versus control. This sheds light on the interplay between the structures and chemicals involved in cell division (mitosis), a fundamental process for most biological organisms and highly relevant for cancer research.
Second, we study the spatiotemporal dependencies between protein species. We construct an estimator of their bulk movement using the earth mover’s distance and propose a test statistic that aggregates local movement differences across a spatial partition. Our method is built on a novel null-hypothesis framework combining between- and within-sample independence of bulk movement patterns with distributional invariance under a geometrically defined subgroup of permutations. We apply this to EB3–TACC3 experiments in microtubules, providing evidence for the colocalisation of these proteins.
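
As a rough illustration of the ingredients named above, the sketch below runs a generic two-sample permutation test that uses the earth mover's distance as its statistic. It is deliberately simplified and is not the speakers' method: it works on one-dimensional samples of equal size, permutes over all relabellings rather than a geometrically defined subgroup, and does not aggregate over a spatial partition. All function names and the simulated data are hypothetical.

```python
import numpy as np


def emd_1d(x, y):
    """Earth mover's (Wasserstein-1) distance between two equal-size 1D samples.
    For equal sizes it is the mean absolute difference of the sorted values."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.mean(np.abs(x - y)))


def permutation_test(x, y, n_perm=2000, seed=0):
    """Return the observed EMD and a permutation p-value under the null
    hypothesis that both samples come from the same distribution."""
    rng = np.random.default_rng(seed)
    observed = emd_1d(x, y)
    pooled = np.concatenate([x, y])
    n = len(x)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)  # random relabelling of the pooled points
        if emd_1d(perm[:n], perm[n:]) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)


# Simulated example: two samples whose means differ by 2.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=50)
y = rng.normal(2.0, 1.0, size=50)
stat, p_value = permutation_test(x, y)
print(stat, p_value)
```

Restricting the permutations to a subgroup that respects the geometry of the data, as the abstract describes, changes which relabellings are treated as exchangeable under the null; the overall structure of the test stays the same.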

Sarah Heaps (Durham University) – Bayesian inference of sparsity in stable vector autoregressive processes

Taban Baghfalaki (The University of Manchester) – Dynamic Risk Prediction by Combining Zero-Inflated Longitudinal Data and Survival Outcomes with a Cure Fraction

Duncan Lee (University of Glasgow) – A spatial random forest algorithm for population-level epidemiological risk assessment

Ian Dryden (University of Nottingham) – Object data analysis and environmental monitoring

Ben Swallow (University of St Andrews) – Bayesian inference for oscillatory systems using the pcLNA algorithm

Yi Yu (University of Warwick) – Optimal Cox regression under federated differential privacy: coefficients and cumulative hazards

Haeran Cho (University of Bristol) – Adaptively optimal change point detection in high-dimensional linear models

Hélène Ruffieux (MRC Biostatistics Unit, University of Cambridge) – A Bayesian functional factor model for high-dimensional curves

Chunzhi Li (East China Jiao Tong University) – The improvement of the continuous sampling plan for the in-line process

Julia Brettschneider (University of Warwick) – Geometry-Aware Permutation Tests for Spatial and Spatiotemporal Point-Process Data in Microscopy
