Abstract: High-dimensional multivariate nonstationary time series are commonly observed in many scientific and industrial applications. In this work, we introduce a novel wavelet-domain dimension reduction technique that constructs time-scale adaptive principal components, as well as a new between-domain cross-coherence measure. The new tools are shown to successfully capture the salient dynamic features of the data as well as quantify the extent of its association with the new representation. We establish theoretical and numerical results for the technique, and demonstrate the tools on a dataset arising in a neuroscience study. This is joint work with Marina Knight and Jessica Hargreaves (University of York).
Abstract: Linear systems (e.g. Lx=f) occur throughout engineering and the sciences, notably as differential equations. In many cases the forcing function (f) for the system is unknown, and interest lies in using noisy observations of the system (y=Ax+e) to infer the forcing, as well as other unknown parameters. I will discuss our recent work that uses adjoints of linear systems to infer forcing functions modelled as Gaussian processes (GP). By using a truncated basis expansion, we can do conjugate Bayesian inference for the GP, in many cases with substantially lower computation than would be required using alternative methods. I'll demonstrate the approach using an advection-diffusion model that arises in our attempts to model the spread of air pollution in Kampala, Uganda.
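The computational payoff of the truncated basis expansion can be sketched in a few lines. This is a minimal illustration only: the cosine basis, the toy observation operator, and all numbers below are hypothetical stand-ins, not the adjoint machinery or the advection-diffusion model of the talk. The key point is that with a Gaussian prior on the basis weights and Gaussian noise, the posterior is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Truncated basis expansion: approximate the GP forcing f(s) by a small
# number of cosine features phi_k(s) = cos(k*pi*s) with Gaussian weights.
# (Illustrative basis; the talk's construction may differ.)
n_basis = 10

def features(s):
    return np.cos(np.outer(s, np.arange(1, n_basis + 1)) * np.pi)

# Synthetic observations y = A x + e, where for illustration the linear
# system maps the basis weights directly to observations at 25 locations.
s_obs = np.linspace(0, 1, 25)
Phi = features(s_obs)                      # observation operator in weight space
w_true = rng.normal(size=n_basis) / np.arange(1, n_basis + 1)
noise_sd = 0.05
y = Phi @ w_true + noise_sd * rng.normal(size=s_obs.size)

# Conjugate Bayesian update: Gaussian prior x Gaussian likelihood gives a
# Gaussian posterior over the weights in closed form -- no MCMC required.
prior_var = 1.0
S_inv = np.eye(n_basis) / prior_var + Phi.T @ Phi / noise_sd**2
S = np.linalg.inv(S_inv)                   # posterior covariance of weights
m = S @ (Phi.T @ y) / noise_sd**2          # posterior mean of weights

# Posterior mean of the inferred forcing on a fine grid.
s_grid = np.linspace(0, 1, 200)
f_post = features(s_grid) @ m
```

The linear algebra here is over the (small) number of basis functions rather than the number of observations, which is where the computational saving comes from.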
Abstract: In this talk I will present Integrated R², a novel statistic for quantifying the dependence of a scalar random variable Y on a vector of predictors X. Integrated R² has the desirable property that it vanishes if and only if Y and X are independent, and attains the maximum value of one precisely when Y is a measurable function of X. Unlike many dependence measures that require strong parametric assumptions or complex tuning, its estimator is as simple to compute and interpret as classical correlation coefficients such as Pearson’s, Spearman’s, or Chatterjee’s.
Building on this measure, I will introduce the algorithm FORD (Feature Ordering by Dependence), which orders candidate features by their incremental contribution to dependence. I will discuss theoretical guarantees, including asymptotic normality results, and demonstrate through experiments (on synthetic and real datasets) how Integrated R² and FORD often outperform existing methods.
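The abstract does not give the Integrated R² estimator itself, but it claims simplicity comparable to Chatterjee's coefficient, which it names explicitly. As a point of reference for that claim, Chatterjee's rank correlation can be computed in a handful of lines (assuming no ties, for simplicity):

```python
import numpy as np

def chatterjee_xi(x, y):
    """Chatterjee's rank correlation of y on x (continuous data, no ties).

    Sort the pairs by x, rank the y values in that order, and penalise
    large jumps between consecutive ranks: xi is near 0 under independence
    and approaches 1 when y is a (noiseless) function of x.
    """
    n = len(x)
    order = np.argsort(x)                          # sort pairs by x
    r = np.argsort(np.argsort(y[order])) + 1       # ranks of y in x-order
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n**2 - 1)

rng = np.random.default_rng(1)
x = rng.normal(size=2000)
y_dep = np.sin(3 * x)            # y a measurable function of x: xi near 1
y_ind = rng.normal(size=2000)    # independent of x: xi near 0
```

Like the Integrated R² described above, this statistic is zero-tuning and has a built-in scale from 0 (independence) to 1 (functional dependence), which is the sense in which such coefficients are "as simple as" classical correlations.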
Pre-eclampsia remains a significant public health issue globally, with a disproportionately high burden in Sub-Saharan Africa (SSA). Hypertensive disorders of pregnancy account for 22.1% of maternal deaths in the region, and the incidence of pre-eclampsia is higher than the global average. A key challenge is the lack of effective early detection and monitoring, particularly in low-resource settings. While technological solutions, such as smartwatches, show promise, their implementation faces significant hurdles, including device malfunctions, poor patient adherence, and incompatibility with existing infrastructure. This highlights the need to address critical research gaps in the region, spanning epidemiology, pathophysiology, diagnosis, and management. Solutions must be tailored to local contexts, taking into account unique risk factors and systemic challenges. The integration of robust, proven devices with community-based support systems is crucial. Furthermore, emerging technologies such as the Internet of Things (IoT) and Artificial Intelligence (AI) hold great potential. AI models, for instance, could enable early risk prediction, while federated learning and explainable AI (XAI) can help overcome privacy concerns and foster trust among clinicians. The future of maternal healthcare in SSA depends on a combination of context-sensitive, evidence-based interventions and the strategic adoption of innovative, sustainable technologies.
Air pollution is one of the most pressing environmental challenges, with significant implications for public health, policy, and sustainable development. Traditional statistical approaches have provided useful insights, but recent advances in data mining and machine learning offer powerful tools to improve prediction accuracy, interpretability, and decision support. This presentation will explore the application of data mining techniques, including tree-based models, ensemble learning, feature selection, and hybrid models for air pollution modelling, with a focus on PM10 concentrations during transboundary haze episodes in Malaysia. Case studies will demonstrate how these models contribute to early warning systems and policy interventions. This talk also discusses the challenges faced in this domain, including data quality, extreme haze event prediction, model generalizability, and integration with real-time monitoring systems. Finally, the talk aims to open discussion on collaborative opportunities to advance air pollution modelling, leveraging interdisciplinary expertise and international partnerships.
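A tree-based ensemble of the kind mentioned above can be sketched as follows. The data here are synthetic and the feature names (wind speed, humidity, lagged PM10) are illustrative assumptions, not the Malaysian monitoring inputs used in the talk; the sketch only shows the modelling pattern of fitting an ensemble and reading off feature importances as a rough form of feature selection.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for an hourly monitoring dataset.
n = 2000
X = np.column_stack([
    rng.uniform(0, 10, n),     # wind speed (hypothetical feature)
    rng.uniform(30, 100, n),   # relative humidity (hypothetical feature)
    rng.gamma(2.0, 20.0, n),   # PM10 lagged one hour (hypothetical feature)
])
y = 0.8 * X[:, 2] - 3.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 5, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Feature importances give a first, rough form of feature selection.
print(dict(zip(["wind", "humidity", "pm10_lag1"], model.feature_importances_)))
print("test R^2:", model.score(X_te, y_te))
```

In practice, as the abstract notes, the hard part is not fitting such a model but handling extreme haze events, data quality, and generalisation across episodes.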
Effective infectious disease surveillance is essential to initiating control measures to protect the public’s health in a timely manner. While traditional surveillance methods remain crucial, supplementing them with automated approaches, such as surveillance algorithms, can improve the system’s ability to detect potential outbreaks for investigation and control. In this project, we focused on the implementation of a multi-purpose outbreak detection algorithm to enhance the surveillance capabilities of the PHA Health Protection Surveillance team. Implemented in R, the improved version of the Farrington algorithm has been integrated into an Infectious Signal Detection tool, an interactive Shiny application that facilitates user-friendly exploration of time series for signal detection. Our tool allows users to import external data or use direct data pipelines, and to analyse the data under a range of parameter choices.
Important tasks in the study of genomic data include the identification of groups of similar cells (for example by clustering), and visualisation of data summaries (for example by dimension reduction). In this talk, I will present a novel view of these tasks in the context of single-cell genomic data. To do so, I propose modelling the observed count-matrices of genomic data by representing these measurements as a bipartite network with multi-edges. Starting with this first-principles network model of the raw data, I will show improvements in clustering single cells via a suitably identified d-dimensional Laplacian Eigenspace (LE) using a Gaussian mixture model (GMM-LE), and apply UMAP to non-linearly project the LE to two dimensions for visualisation (UMAP-LE). From this first-principles viewpoint, the LE representation of the data points estimates transformed latent positions (of genes and cells), under a latent position statistical model of nodes in a bipartite stochastic network. By applying this proposed methodology to data from three recent genomics studies in different biological contexts, I will show how clusters of cells independently learned by this proposed methodology are found to correspond to cells expressing specific marker genes that were independently defined by domain experts, with an accuracy that is competitive with the industry standard for these data. I will then show how this novel view of these data can provide unique insights, leading to the identification of an LE breast-cancer biomarker that significantly predicts long-term patient survival outcome in two independent validation cohorts with data from 1904 and 1091 individuals.
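The GMM-LE pipeline described above can be sketched schematically. The toy count matrix below (two planted cell groups over-expressing disjoint gene sets) and the choice of degree normalisation are illustrative assumptions, not the authors' exact construction; the sketch only shows the general pattern of embedding cells via a Laplacian-type eigenspace of the bipartite cell-gene graph and clustering the embedding with a Gaussian mixture model.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy cells-by-genes count matrix with two planted cell groups.
n_cells, n_genes, d = 200, 50, 2
rates = np.ones((n_cells, n_genes))
rates[:100, :25] *= 4.0   # group 1 over-expresses the first 25 genes
rates[100:, 25:] *= 4.0   # group 2 over-expresses the rest
A = rng.poisson(rates)

# Laplacian Eigenspace of the bipartite cell-gene graph: embed cells via the
# leading singular vectors of the degree-normalised adjacency matrix.
row_deg = A.sum(axis=1, keepdims=True).astype(float)
col_deg = A.sum(axis=0, keepdims=True).astype(float)
L = A / np.sqrt(row_deg) / np.sqrt(col_deg)
U, s, Vt = np.linalg.svd(L, full_matrices=False)
cell_embedding = U[:, :d] * s[:d]          # d-dimensional LE positions of cells

# Cluster the embedded cells with a Gaussian mixture model (GMM-LE).
labels = GaussianMixture(n_components=2, random_state=0).fit_predict(cell_embedding)
```

For visualisation, the talk's UMAP-LE step would then project the d-dimensional embedding (not the raw counts) down to two dimensions.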
Speaker: Najmeh Nakhaeirad, University of Pretoria
Title: Density estimation from biased circular data
Abstract: Sampling with errors provides observations that, instead of being drawn from the distribution of interest, are rather drawn from a biased version of it. New estimation approaches are developed to retrieve the true density in the presence of such data contaminated by measurement errors from the circular manifold. Since weighted distribution theory provides a unifying approach for the correction of biases that exist in data, we assume a class of weighted distributions on the circle as the distribution of the biased observations. Then, both frequentist and Bayesian methods are applied to capture the true density of the data from the data contaminated with errors. Numerical assessments support the findings via simulation and real data analysis.
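As background for the density-estimation problem, a standard circular kernel density estimate with a von Mises kernel can be written in a few lines. This is the uncorrected baseline only, under an assumed unbiased sample; the weighted-distribution corrections and Bayesian machinery of the talk are not reproduced here, and the concentration parameter below is a hypothetical choice.

```python
import numpy as np

def vonmises_kde(theta_grid, data, kappa=8.0):
    """Circular KDE: average of von Mises kernels centred at the data.

    Each kernel exp(kappa*cos(theta - x_i)) / (2*pi*I0(kappa)) integrates
    to one over the circle, so the average is itself a density.
    """
    diff = theta_grid[:, None] - data[None, :]
    k = np.exp(kappa * np.cos(diff)) / (2 * np.pi * np.i0(kappa))
    return k.mean(axis=1)

rng = np.random.default_rng(0)
data = rng.vonmises(mu=np.pi / 2, kappa=4.0, size=500)   # angles in (-pi, pi]
grid = np.linspace(-np.pi, np.pi, 400)
density = vonmises_kde(grid, data)
```

Under the biased-sampling setting of the talk, the observed angles would follow a weighted version w(θ)f(θ)/E[w] of the target density, and the estimator above would need to be reweighted accordingly to recover f.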
Speaker: Inger Fabris-Rotelli, University of Pretoria
Title: Spatial linear networks with applications
Abstract: This talk will introduce spatial linear networks and cover a number of application areas in Spatial Statistics. A variety of methods will be discussed involving analysis in a linear network space. Applications in informal roads, criminology and disease mapping will be presented.
As artificial intelligence grows ever more prominent within public discourse, record levels of investment and political pressure have been ploughed into applying these technologies in the healthcare space. Despite this, few AI innovations translate into the clinical setting or result in real-world patient benefit. In this talk Dr Zucker will provide an overview of the barriers to AI adoption within the UK healthcare system, offer practical advice to academics on how to maximise the impact of their work in health, and describe some of the projects and opportunities that can support AI and data science researchers within the region.
Bayesian emulation, and more generally Gaussian process models, have been successfully applied across a wide variety of scientific disciplines, both in the context of efficiently analysing computationally intensive models and as general statistical models for inference and prediction of the response at new predictors given a training dataset. In this talk, we introduce emulators as fast statistical approximators, providing a predicted value at any input, along with a corresponding measure of uncertainty. We then proceed to discuss developments of Known Boundary Emulation (KBE) strategies which utilise the fact that, for many computer models, there exist hyperplanes in the input parameter space for which the model output can be evaluated far more efficiently. For example, this may be because the response is known at such inputs; or in the context of a computer model, such inputs may yield an analytical solution or the potential for application of a much simpler, more efficient, numerical solver. We demonstrate how information on these known hyperplanes can be incorporated into the emulation process via analytical update, thus involving no additional computational cost, before illustrating our techniques on a scientifically relevant and high-dimensional systems biology model of hormonal crosstalk in the roots of an Arabidopsis plant.
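The basic emulator introduced at the start of the abstract (a fast prediction plus an uncertainty at any input) can be sketched as an ordinary Gaussian process conditioned on a handful of simulator runs. The one-dimensional "simulator", the squared-exponential kernel, and its length-scale are hypothetical stand-ins; the KBE analytical boundary update of the talk is not reproduced here.

```python
import numpy as np

def rbf(x1, x2, ls=0.3, var=1.0):
    """Squared-exponential covariance between two 1-D input grids."""
    return var * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ls**2)

# Expensive "simulator" (a cheap stand-in here) run at 8 design points.
f = lambda x: np.sin(4 * x) + x
X_train = np.linspace(0, 1, 8)
y_train = f(X_train)

# Emulator: the GP conditioned on the training runs gives a prediction and
# an uncertainty at any new input, at negligible cost per evaluation.
nugget = 1e-6                                   # jitter for numerical stability
K = rbf(X_train, X_train) + nugget * np.eye(8)
alpha = np.linalg.solve(K, y_train)

X_new = np.linspace(0, 1, 100)
K_s = rbf(X_new, X_train)
mean = K_s @ alpha                                          # emulator prediction
cov = rbf(X_new, X_new) - K_s @ np.linalg.solve(K, K_s.T)
sd = np.sqrt(np.clip(np.diag(cov), 0, None))                # predictive uncertainty
```

KBE would then adjust `mean` and `sd` analytically using model evaluations on known hyperplanes, shrinking the uncertainty near those boundaries at no extra computational cost.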