Abstract: High-dimensional multivariate nonstationary time series are commonly observed in many scientific and industrial applications. In this work, we introduce a novel wavelet-domain dimension reduction technique that constructs time-scale adaptive principal components, as well as a new between-domain cross-coherence measure. The new tools are shown to successfully capture the salient dynamic features of the data as well as quantify the extent of its association with the new representation. We establish theoretical and numerical results for the technique, and demonstrate the tools on a dataset arising in a neuroscience study. This is joint work with Marina Knight and Jessica Hargreaves (University of York).
Abstract: Linear systems (e.g. Lx=f) occur throughout engineering and the sciences, notably as differential equations. In many cases the forcing function (f) for the system is unknown, and interest lies in using noisy observations of the system (y=Ax+e) to infer the forcing, as well as other unknown parameters. I will discuss our recent work that uses adjoints of linear systems to infer forcing functions modelled as Gaussian processes (GP). By using a truncated basis expansion, we can do conjugate Bayesian inference for the GP, in many cases with substantially lower computation than would be required using alternative methods. I'll demonstrate the approach using an advection-diffusion model that arises in our attempts to model the spread of air pollution in Kampala, Uganda.
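The computational payoff of the truncated basis expansion can be sketched in a few lines. This is a minimal illustration only: the cosine basis, the toy observation operator, and all numbers below are hypothetical stand-ins, not the adjoint machinery or the advection-diffusion model of the talk. The key point is that with a Gaussian prior on the basis weights and Gaussian noise, the posterior is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Truncated basis expansion: approximate the GP forcing f(s) by a small
# number of cosine features phi_k(s) = cos(k*pi*s) with Gaussian weights.
# (Illustrative basis; the talk's construction may differ.)
n_basis = 10

def features(s):
    return np.cos(np.outer(s, np.arange(1, n_basis + 1)) * np.pi)

# Synthetic observations y = A x + e, where for illustration the linear
# system maps the basis weights directly to observations at 25 locations.
s_obs = np.linspace(0, 1, 25)
Phi = features(s_obs)                      # observation operator in weight space
w_true = rng.normal(size=n_basis) / np.arange(1, n_basis + 1)
noise_sd = 0.05
y = Phi @ w_true + noise_sd * rng.normal(size=s_obs.size)

# Conjugate Bayesian update: Gaussian prior x Gaussian likelihood gives a
# Gaussian posterior over the weights in closed form -- no MCMC required.
prior_var = 1.0
S_inv = np.eye(n_basis) / prior_var + Phi.T @ Phi / noise_sd**2
S = np.linalg.inv(S_inv)                   # posterior covariance of weights
m = S @ (Phi.T @ y) / noise_sd**2          # posterior mean of weights

# Posterior mean of the inferred forcing on a fine grid.
s_grid = np.linspace(0, 1, 200)
f_post = features(s_grid) @ m
```

The linear algebra here is over the (small) number of basis functions rather than the number of observations, which is where the computational saving comes from.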
Abstract: In this talk I will present Integrated R², a novel statistic for quantifying the dependence of a scalar random variable Y on a vector of predictors X. Integrated R² has the desirable property that it vanishes if and only if Y and X are independent, and attains the maximum value of one precisely when Y is a measurable function of X. Unlike many dependence measures that require strong parametric assumptions or complex tuning, its estimator is as simple to compute and interpret as classical correlation coefficients such as Pearson’s, Spearman’s, or Chatterjee’s.
Building on this measure, I will introduce the algorithm FORD (Feature Ordering by Dependence), which orders candidate features by their incremental contribution to dependence. I will discuss theoretical guarantees, including asymptotic normality results, and demonstrate through experiments (on synthetic and real datasets) how Integrated R² and FORD often outperform existing methods.
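The abstract does not give the Integrated R² estimator itself, but it claims simplicity comparable to Chatterjee's coefficient, which it names explicitly. As a point of reference for that claim, Chatterjee's rank correlation can be computed in a handful of lines (assuming no ties, for simplicity):

```python
import numpy as np

def chatterjee_xi(x, y):
    """Chatterjee's rank correlation of y on x (continuous data, no ties).

    Sort the pairs by x, rank the y values in that order, and penalise
    large jumps between consecutive ranks: xi is near 0 under independence
    and approaches 1 when y is a (noiseless) function of x.
    """
    n = len(x)
    order = np.argsort(x)                          # sort pairs by x
    r = np.argsort(np.argsort(y[order])) + 1       # ranks of y in x-order
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n**2 - 1)

rng = np.random.default_rng(1)
x = rng.normal(size=2000)
y_dep = np.sin(3 * x)            # y a measurable function of x: xi near 1
y_ind = rng.normal(size=2000)    # independent of x: xi near 0
```

Like the Integrated R² described above, this statistic is zero-tuning and has a built-in scale from 0 (independence) to 1 (functional dependence), which is the sense in which such coefficients are "as simple as" classical correlations.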
Pre-eclampsia remains a significant public health issue globally, with a disproportionately high burden in Sub-Saharan Africa (SSA). Hypertensive disorders of pregnancy account for 22.1% of maternal deaths in the region, and the incidence of pre-eclampsia is higher than the global average. A key challenge is the lack of effective early detection and monitoring, particularly in low-resource settings. While technological solutions, such as smartwatches, show promise, their implementation faces significant hurdles, including device malfunctions, poor patient adherence, and incompatibility with existing infrastructure. This highlights the need to address critical research gaps in the region, spanning epidemiology, pathophysiology, diagnosis, and management. Solutions must be tailored to local contexts, taking into account unique risk factors and systemic challenges. The integration of robust, proven devices with community-based support systems is crucial. Furthermore, emerging technologies such as the Internet of Things (IoT) and Artificial Intelligence (AI) hold great potential. AI models, for instance, could enable early risk prediction, while federated learning and explainable AI (XAI) can help overcome privacy concerns and foster trust among clinicians. The future of maternal healthcare in SSA depends on a combination of context-sensitive, evidence-based interventions and the strategic adoption of innovative, sustainable technologies.
Air pollution is one of the most pressing environmental challenges, with significant implications for public health, policy, and sustainable development. Traditional statistical approaches have provided useful insights, but recent advances in data mining and machine learning offer powerful tools to improve prediction accuracy, interpretability, and decision support. This presentation will explore the application of data mining techniques, including tree-based models, ensemble learning, feature selection, and hybrid models for air pollution modelling, with a focus on PM10 concentrations during transboundary haze episodes in Malaysia. Case studies will demonstrate how these models contribute to early warning systems and policy interventions. This talk also discusses the challenges faced in this domain, including data quality, extreme haze event prediction, model generalizability, and integration with real-time monitoring systems. Finally, the talk aims to open discussion on collaborative opportunities to advance air pollution modelling, leveraging interdisciplinary expertise and international partnerships.
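A tree-based ensemble of the kind mentioned above can be sketched as follows. The data here are synthetic and the feature names (wind speed, humidity, lagged PM10) are illustrative assumptions, not the Malaysian monitoring inputs used in the talk; the sketch only shows the modelling pattern of fitting an ensemble and reading off feature importances as a rough form of feature selection.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for an hourly monitoring dataset.
n = 2000
X = np.column_stack([
    rng.uniform(0, 10, n),     # wind speed (hypothetical feature)
    rng.uniform(30, 100, n),   # relative humidity (hypothetical feature)
    rng.gamma(2.0, 20.0, n),   # PM10 lagged one hour (hypothetical feature)
])
y = 0.8 * X[:, 2] - 3.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 5, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Feature importances give a first, rough form of feature selection.
print(dict(zip(["wind", "humidity", "pm10_lag1"], model.feature_importances_)))
print("test R^2:", model.score(X_te, y_te))
```

In practice, as the abstract notes, the hard part is not fitting such a model but handling extreme haze events, data quality, and generalisation across episodes.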
Effective infectious disease surveillance is essential to initiating control measures to protect the public’s health in a timely manner. While traditional surveillance methods remain crucial, supplementing them with automated approaches, such as surveillance algorithms, can improve the system’s ability to detect potential outbreaks for investigation and control. In this project, we focused on the implementation of a multi-purpose outbreak detection algorithm to enhance the surveillance capabilities of the PHA Health Protection Surveillance team. Implemented in R, the improved version of the Farrington algorithm has been integrated into an Infectious Signal Detection tool, an interactive Shiny application that facilitates user-friendly exploration of time series for signal detection. Our tool allows users to import external data or use direct data pipelines, and to analyse the data under a range of parameter choices.
Important tasks in the study of genomic data include the identification of groups of similar cells (for example by clustering), and visualisation of data summaries (for example by dimension reduction). In this talk, I will present a novel view of these tasks in the context of single-cell genomic data. To do so, I propose modelling the observed count-matrices of genomic data by representing these measurements as a bipartite network with multi-edges. Starting with this first-principles network model of the raw data, I will show improvements in clustering single cells via a suitably identified d-dimensional Laplacian Eigenspace (LE) using a Gaussian mixture model (GMM-LE), and apply UMAP to non-linearly project the LE to two dimensions for visualisation (UMAP-LE). From this first-principles viewpoint, the LE representation of the data points estimates transformed latent positions (of genes and cells), under a latent position statistical model of nodes in a bipartite stochastic network. By applying this proposed methodology to data from three recent genomics studies in different biological contexts, I will show how clusters of cells independently learned by this proposed methodology are found to correspond to cells expressing specific marker genes that were independently defined by domain experts, with an accuracy that is competitive with the industry standard for these data. I will then show how this novel view of these data can provide unique insights, leading to the identification of an LE breast-cancer biomarker that significantly predicts long-term patient survival outcome in two independent validation cohorts with data from 1904 and 1091 individuals.
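The GMM-LE pipeline described above can be sketched schematically. The toy count matrix below (two planted cell groups over-expressing disjoint gene sets) and the choice of degree normalisation are illustrative assumptions, not the authors' exact construction; the sketch only shows the general pattern of embedding cells via a Laplacian-type eigenspace of the bipartite cell-gene graph and clustering the embedding with a Gaussian mixture model.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy cells-by-genes count matrix with two planted cell groups.
n_cells, n_genes, d = 200, 50, 2
rates = np.ones((n_cells, n_genes))
rates[:100, :25] *= 4.0   # group 1 over-expresses the first 25 genes
rates[100:, 25:] *= 4.0   # group 2 over-expresses the rest
A = rng.poisson(rates)

# Laplacian Eigenspace of the bipartite cell-gene graph: embed cells via the
# leading singular vectors of the degree-normalised adjacency matrix.
row_deg = A.sum(axis=1, keepdims=True).astype(float)
col_deg = A.sum(axis=0, keepdims=True).astype(float)
L = A / np.sqrt(row_deg) / np.sqrt(col_deg)
U, s, Vt = np.linalg.svd(L, full_matrices=False)
cell_embedding = U[:, :d] * s[:d]          # d-dimensional LE positions of cells

# Cluster the embedded cells with a Gaussian mixture model (GMM-LE).
labels = GaussianMixture(n_components=2, random_state=0).fit_predict(cell_embedding)
```

For visualisation, the talk's UMAP-LE step would then project the d-dimensional embedding (not the raw counts) down to two dimensions.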
Speaker: Najmeh Nakhaeirad, University of Pretoria
Title: Density estimation from biased circular data
Abstract: Sampling with errors provides observations that, instead of being drawn from the distribution of interest, are rather drawn from a biased version of it. New estimation approaches are developed to retrieve the true density in the presence of such data contaminated by measurement errors from the circular manifold. Since weighted distribution theory provides a unifying approach for the correction of biases that exist in data, we assume a class of weighted distributions on the circle as the distribution of the biased observations. Then, both frequentist and Bayesian methods are applied to capture the true density of the data from the data contaminated with errors. Numerical assessments support the findings via simulation and real data analysis.
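As background for the density-estimation problem, a standard circular kernel density estimate with a von Mises kernel can be written in a few lines. This is the uncorrected baseline only, under an assumed unbiased sample; the weighted-distribution corrections and Bayesian machinery of the talk are not reproduced here, and the concentration parameter below is a hypothetical choice.

```python
import numpy as np

def vonmises_kde(theta_grid, data, kappa=8.0):
    """Circular KDE: average of von Mises kernels centred at the data.

    Each kernel exp(kappa*cos(theta - x_i)) / (2*pi*I0(kappa)) integrates
    to one over the circle, so the average is itself a density.
    """
    diff = theta_grid[:, None] - data[None, :]
    k = np.exp(kappa * np.cos(diff)) / (2 * np.pi * np.i0(kappa))
    return k.mean(axis=1)

rng = np.random.default_rng(0)
data = rng.vonmises(mu=np.pi / 2, kappa=4.0, size=500)   # angles in (-pi, pi]
grid = np.linspace(-np.pi, np.pi, 400)
density = vonmises_kde(grid, data)
```

Under the biased-sampling setting of the talk, the observed angles would follow a weighted version w(θ)f(θ)/E[w] of the target density, and the estimator above would need to be reweighted accordingly to recover f.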
Speaker: Inger Fabris-Rotelli, University of Pretoria
Title: Spatial linear networks with applications
Abstract: This talk will introduce spatial linear networks and cover a number of application areas in Spatial Statistics. A variety of methods will be discussed involving analysis in a linear network space. Applications in informal roads, criminology and disease mapping will be presented.
As artificial intelligence grows ever more prominent within public discourse, record levels of investment and political pressure have been ploughed into applying these technologies in the healthcare space. Despite this, few AI innovations translate into the clinical setting or result in real-world patient benefit. In this talk Dr Zucker will provide an overview of the barriers to AI adoption within the UK healthcare system, offer practical advice to academics on how to maximise the impact of their work in health, and describe some of the projects and opportunities that can support AI and data science researchers within the region.
Bayesian emulation, and more generally Gaussian process models, have been successfully applied across a wide variety of scientific disciplines, both in the context of efficiently analysing computationally intensive models and as general statistical models for inference and prediction of the response at new predictors given a training dataset. In this talk, we introduce emulators as fast statistical approximators, providing a predicted value at any input, along with a corresponding measure of uncertainty. We then proceed to discuss developments of Known Boundary Emulation (KBE) strategies which utilise the fact that, for many computer models, there exist hyperplanes in the input parameter space for which the model output can be evaluated far more efficiently. For example, this may be because the response is known at such inputs; or in the context of a computer model, such inputs may yield an analytical solution or the potential for application of a much simpler, more efficient, numerical solver. We demonstrate how information on these known hyperplanes can be incorporated into the emulation process via analytical update, thus involving no additional computational cost, before illustrating our techniques on a scientifically relevant and high-dimensional systems biology model of hormonal crosstalk in the roots of an Arabidopsis plant.
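The basic emulator introduced at the start of the abstract (a fast prediction plus an uncertainty at any input) can be sketched as an ordinary Gaussian process conditioned on a handful of simulator runs. The one-dimensional "simulator", the squared-exponential kernel, and its length-scale are hypothetical stand-ins; the KBE analytical boundary update of the talk is not reproduced here.

```python
import numpy as np

def rbf(x1, x2, ls=0.3, var=1.0):
    """Squared-exponential covariance between two 1-D input grids."""
    return var * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ls**2)

# Expensive "simulator" (a cheap stand-in here) run at 8 design points.
f = lambda x: np.sin(4 * x) + x
X_train = np.linspace(0, 1, 8)
y_train = f(X_train)

# Emulator: the GP conditioned on the training runs gives a prediction and
# an uncertainty at any new input, at negligible cost per evaluation.
nugget = 1e-6                                   # jitter for numerical stability
K = rbf(X_train, X_train) + nugget * np.eye(8)
alpha = np.linalg.solve(K, y_train)

X_new = np.linspace(0, 1, 100)
K_s = rbf(X_new, X_train)
mean = K_s @ alpha                                          # emulator prediction
cov = rbf(X_new, X_new) - K_s @ np.linalg.solve(K, K_s.T)
sd = np.sqrt(np.clip(np.diag(cov), 0, None))                # predictive uncertainty
```

KBE would then adjust `mean` and `sd` analytically using model evaluations on known hyperplanes, shrinking the uncertainty near those boundaries at no extra computational cost.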