Statistics seminar

To be added to the mailing list, please contact Linh Nghiem. All seminars are 2-3pm on Fridays, unless noted otherwise.

Seminars in 2024

Friday, 22 March

Title: Adjusted predictions in generalized estimating equations
Speaker: Francis Hui, Australian National University

Abstract: Generalized estimating equations (GEEs) are a popular regression approach that requires specification of the first two marginal moments of the data, along with a working correlation matrix to capture the covariation between responses, e.g., temporal correlations within clusters in longitudinal data. The majority of research on and applications of GEEs have focused on estimation and inference for the regression coefficients in the marginal mean. When it comes to prediction using GEEs, practitioners often simply, and quite understandably, also base it on the regression model characterizing the marginal mean.

In this talk, we propose a simple adjustment to predictions in GEEs based on utilizing information in the assumed working correlation matrix. Focusing on longitudinal data, and by viewing the GEE from the perspective of solving a working linear model, we borrow ideas from universal kriging to construct a “conditional” predictor that leverages temporal correlations between the new and current observations within the same cluster. We establish some theoretical conditions for the proposed adjusted GEE predictor to outperform the standard (unadjusted) predictor. Simulations show that even when we misspecify the working correlation structure, adjusted GEE predictors (combined with an information criterion for choosing the working correlation structure) can still improve on the predictive performance of standard GEE predictors as well as the so-called “oracle GEE predictor” using all observations.
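
For orientation, and without claiming this is the exact estimator of the talk, a universal-kriging-style conditional predictor for a new observation at time $t^*$ in cluster $i$ has the generic form
$$
\hat{y}_{i t^*} \;=\; \mu\big(x_{i t^*}^\top \hat{\beta}\big) \;+\; \hat{c}_{t^*}^\top \hat{V}_i^{-1}\big\{ y_i - \mu_i(\hat{\beta}) \big\},
$$
where $\hat{V}_i$ is the working covariance matrix of the observed responses in cluster $i$ and $\hat{c}_{t^*}$ is the working covariance between the new and observed responses; the standard (unadjusted) GEE predictor keeps only the first term.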

This is joint work with Samuel Mueller (Macquarie University) and A. H. Welsh (ANU).

Friday, 15 March

Title: On aspects of localized learning
Speaker: Andreas Christmann, Universität Bayreuth

Abstract: Many machine learning methods do not scale well with respect to computation time and computer memory as the sample size increases. Divide and conquer methods can be helpful in this respect and often allow for parallel computing. This is true, e.g., for distributed learning. This talk will focus on localized learning, which is a related but slightly different approach. We will mainly consider aspects of statistical robustness and stability questions. The first part of the talk will investigate the question of qualitative robustness without specifying a particular learning method. The second part will deal with total stability of kernel-based methods and is based on Koehler and Christmann (2022, JMLR, 23, 1-41).

Friday, 8 March

Title: Learning Deep Representations with Optimal Transport
Speaker: He Zhao, Data61, CSIRO

Abstract: Originating from the work of mathematicians, statisticians, and economists, Optimal Transport (OT) is a powerful tool for resource allocation. Recently, OT has gained significant attention and utility in machine learning and deep learning, particularly in areas where the comparison of probability measures is essential. In this talk, I will introduce two recent works of mine on applying OT for deep representation learning that captures essential structural information in the data, leading to improved generalisation and robustness. One is on the task of image data augmentation for imbalanced problems and the other is on missing value imputation.

Friday, 1 March

Title: A BLAST from the past: revisiting BLAST's E-value
Speaker: Uri Keich, University of Sydney

Abstract: The Basic Local Alignment Search Tool, BLAST, is an indispensable tool for genomic research. BLAST established itself as the canonical tool for sequence similarity search in large part thanks to its meaningful statistical analysis. Specifically, BLAST reports the E-value of each reported alignment, which is defined as the expected number of optimal local alignments that will score at least as high as the observed alignment score, assuming that the query and the database sequences are randomly generated. We critically reevaluate BLAST's E-values, showing that they can be at times significantly conservative while at others too liberal.
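
For reference, the classical Karlin–Altschul form of this expectation for ungapped local alignments is
$$
E \;=\; K m n\, e^{-\lambda S},
$$
where $S$ is the alignment score, $m$ and $n$ are the lengths of the query and the database, and $K$ and $\lambda$ depend on the scoring system and the background letter frequencies; BLAST applies the same functional form, with empirically estimated constants, to gapped alignments.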

We offer an alternative approach based on generating a small sample from the null distribution of random optimal alignments, and testing whether the observed alignment score is consistent with it. In contrast with BLAST, our significance analysis seems valid, in the sense that it did not deliver inflated significance estimates in any of our extensive experiments. Moreover, although our method is slightly conservative, it is often significantly less so than the BLAST E-value. Indeed, in cases where BLAST's analysis is valid (i.e., not too liberal), our approach seems to deliver a greater number of correct alignments. One advantage of our approach is that it works with any reasonable choice of substitution matrix and gap penalties, avoiding BLAST's limited options of matrices and penalties. In addition, we can formulate the problem using a canonical family-wise error rate control setup, thereby dispensing with E-values, which can at times be difficult to interpret.

Joint work with Yang Young Lu (Cheriton School of Computer Science, University of Waterloo) and William Stafford Noble (Department of Genome Sciences and Paul G. Allen School of Computer Science and Engineering, University of Washington).

Seminars in 2023

Friday, 3 November

Title: Post-Processing MCMC with Control Variates
Speaker: Leah South, Queensland University of Technology

Abstract: Control variates are valuable tools for improving the precision of Monte Carlo estimators but they have historically been challenging to apply in the context of Markov chain Monte Carlo (MCMC). This talk will describe several new and broadly applicable control variates that are suitable for post-processing of MCMC. The methods I'll speak about are in the class of estimators that use Langevin Stein operators to generate control variates or control functionals, so they are applicable when the gradients of the log target pdf are available (as is the case for gradient-based MCMC). I will first give an overview of existing methods in this field, as per [1], before introducing two new methods. The first method [2], referred to as semi-exact control functionals (SECF), is based on control functionals and Sard’s approach to numerical integration. The use of Sard’s approach ensures that our control functionals are exact on all polynomials up to a fixed degree in the Bernstein-von-Mises limit. SECF is also bias-correcting, in the sense that it is capable of removing asymptotic bias in biased MCMC samplers under some conditions. The second method [3] applies regularisation to improve the performance of popular Stein-based control variates for high-dimensional Monte Carlo integration. Several Bayesian inference examples will be used to illustrate the potential for reduction in mean square error.

[1] South, L. F., Riabiz, M., Teymur, O. and Oates, C. J. (2022). Postprocessing of MCMC. Annual Review of Statistics and Its Application, 9, 529-555.
[2] South, L. F., Karvonen, T., Nemeth, C., Girolami, M., & Oates, C. (2022). Semi-Exact Control Functionals From Sard's Method. Biometrika, 109(2), 351-367.
[3] South, L. F., Oates, C. J., Mira, A., & Drovandi, C. (2023). Regularized zero-variance control variates. Bayesian Analysis, 18(3), 865-888.
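
As a minimal illustration of the simplest member of this family (a first-order, "zero-variance" control variate built from the score, rather than the semi-exact or regularised estimators of the talk), the following Python sketch uses illustrative function and variable names:

```python
import numpy as np

def zv1_estimate(f_vals, scores):
    """First-order zero-variance control variate (minimal sketch).

    f_vals : (n,) values f(x_i) at MCMC samples x_i
    scores : (n, d) gradients of the log target density at the samples

    Under mild conditions E[grad log p(X)] = 0 under the target, so
    f(x) - a^T grad log p(x) has the same expectation for any a; choosing
    a by least squares reduces the Monte Carlo variance.
    """
    a, *_ = np.linalg.lstsq(scores, f_vals - f_vals.mean(), rcond=None)
    adjusted = f_vals - scores @ a   # same mean, typically lower variance
    return adjusted.mean()
```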

Friday, 27 October

Title: A tripartite decomposition of vector autoregressive processes into temporal flows
Speaker: Brendan Beare, University of Sydney School of Economics

Abstract: Every autoregressive process taking values in a finite-dimensional complex vector space is shown to be equal to the sum of three processes, which we call the forward, backward and outward temporal flows. Each of the three temporal flows may be decomposed further into the sum of a stochastic component and a deterministic component. The forward temporal flow consists of a stationary infinite weighted sum of past innovations (the stochastic component) and a term which decays/grows exponentially as we move forward/backward in time and is determined by the behaviour of the autoregressive process in the arbitrarily distant past (the deterministic component). The backward temporal flow consists of a stationary infinite weighted sum of future innovations (the stochastic component) and a term which grows/decays exponentially as we move forward/backward in time and is determined by the behaviour of the autoregressive process in the arbitrarily distant future (the deterministic component). The outward temporal flow consists of a nonstationary finite weighted sum of innovations going outward from time zero (the stochastic component) and a term which grows at a polynomial rate as we move away from time zero and is determined by the value taken by the autoregressive process at time zero (the deterministic component). Each of the three temporal flows is obtained by applying one of three complementary spectral projections to the autoregressive process. The three spectral projections correspond to a separation of the eigenvalues of the autoregressive coefficient into three regions: the open unit disk, the complement of the closed unit disk, and the unit circle.
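
A rough computational sketch of the last step, under the simplifying assumption that the autoregressive coefficient matrix is diagonalisable (the general construction uses Riesz projections), with illustrative names:

```python
import numpy as np

def unit_circle_projections(A, tol=1e-8):
    """Projections onto the invariant subspaces of A with eigenvalues inside,
    outside, and on the unit circle; the three projections sum to the identity."""
    eigvals, V = np.linalg.eig(A)
    Vinv = np.linalg.inv(V)
    inside = np.abs(eigvals) < 1 - tol
    outside = np.abs(eigvals) > 1 + tol
    circle = ~(inside | outside)
    # V diag(mask) V^{-1} projects onto the span of the selected eigenvectors
    return {name: (V * mask) @ Vinv
            for name, mask in (("inside", inside), ("outside", outside), ("circle", circle))}
```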

Friday, 15 September

Title: Forecasting intraday financial time series with sieve bootstrapping and dynamic updating
Speaker: Hanlin Shang, Macquarie University

Abstract: Intraday financial data often take the form of a collection of curves that can be observed sequentially over time, such as intraday stock price curves. These curves can be viewed as a time series of functions observed on equally spaced and dense grids. Due to the curse of dimensionality, high-dimensional data pose challenges from a statistical perspective; however, they also provide opportunities to analyze a rich source of information so that the dynamic changes within short time intervals can be better understood. We consider a sieve bootstrap method to construct 1-day-ahead point and interval forecasts in a model-free way. As we sequentially observe new data, we also implement two dynamic updating methods to update point and interval forecasts for achieving improved accuracy. The forecasting methods are validated through an empirical study of 5-min cumulative intraday returns of the S&P/ASX All Ordinaries Index.

Friday, 1 September

Title: Directions old and new: Palaeomagnetism and Fisher meet modern statistics
Speaker: Janice Scealy, Australian National University

Abstract: Most modern articles in the palaeomagnetism literature are based on statistics developed in Fisher's 1953 paper Dispersion on a sphere, which assumes independent and identically distributed (iid) spherical data. However, palaeomagnetic sample designs are usually hierarchical, where specimens are collected within sites and the data are then combined across sites to calculate an overall mean direction for a geological formation. The specimens within sites are typically more similar than specimens between different sites, and so the iid assumptions fail. We will first review, contrast and compare both the statistics and geophysics literature on the topic of analysis methods for clustered data on spheres. We will then present a new hierarchical parametric model, which avoids the unrealistic assumption of rotational symmetry in Fisher's 1953 paper Dispersion on a sphere and may be broadly useful in the analysis of many palaeomagnetic datasets. To help develop the model, we use publicly available data collected from the Golan Heights volcanic plateau as a case study. Next, we will explore different methods for constructing confidence regions for the overall mean direction based on clustered data. Two bootstrap confidence regions that we propose perform well and will be especially useful to geophysics practitioners.
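
For context, the rotationally symmetric Fisher (von Mises–Fisher) density on the sphere underlying the 1953 paper is
$$
f(x;\mu,\kappa) \;=\; \frac{\kappa}{4\pi \sinh\kappa}\, \exp\!\big(\kappa\, \mu^\top x\big), \qquad \|x\| = \|\mu\| = 1,
$$
with mean direction $\mu$ and concentration $\kappa$; relaxing this rotational symmetry, and the iid assumption, is what the hierarchical model in the talk addresses.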

Wednesday, 30 August

Title: Unsupervised Spatial-Temporal Decomposition for Feature Extraction and Anomaly Detection
Speaker: Jian Liu, University of Arizona

Abstract: The advancement of sensing and information technology has made it reliable and affordable to collect data continuously from many sensors that are spatially distributed, generating readily available Spatial-Temporal (ST) data. While the abundant ST information embedded in such high-dimensional ST data provides engineers with unprecedented opportunities to understand, monitor, and control the engineering processes, the complex ST correlation makes conventional statistical data analysis methods ineffective and inefficient. This is especially true for ST feature extraction and ST anomaly detection, where the features of interest or the anomalies possess ST characteristics subtly different from the normal routine background. In this seminar, I will introduce a generic unsupervised learning method based on ST decomposition. The high-dimensional ST data are modeled as a tensor, which is then decomposed into different tensor components represented as a combination of a series of lower-dimensional factors. Without relying on pre-annotated training data, these tensor components will be estimated to indicate the latent features and/or anomalies of interest. A regularization approach is adopted to incorporate the knowledge of the tensor components’ intrinsic ST characteristics into the algorithm to improve the accuracy and robustness of the model estimation. Multiple case studies were conducted to demonstrate the effectiveness of the proposed methods, including water burst detection in water distribution systems and video segmentation.
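
As a toy, matrix-level analogue of this decomposition (the talk's method is tensor-based and regularised, which this sketch does not attempt; all names are illustrative):

```python
import numpy as np

def background_anomaly_split(Y, rank=3, z_thresh=3.0):
    """Split a (locations x time) array into a low-rank 'background' and a
    thresholded residual flagged as candidate anomalies."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    background = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # smooth spatial-temporal component
    resid = Y - background
    z = (resid - resid.mean()) / resid.std()
    return background, np.abs(z) > z_thresh             # anomaly indicator map
```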

Friday, 18 August

Title: Conceptualizing experimental controls using the potential outcomes framework
Speaker: Kristen Hunter, University of New South Wales

Abstract: The goal of a well-controlled study is to remove unwanted variation when estimating the causal effect of the intervention of interest. Experiments conducted in the basic sciences frequently achieve this goal using experimental controls, such as “negative” and “positive” controls, which are measurements designed to detect systematic sources of such unwanted variation. Here, we introduce clear, mathematically precise definitions of experimental controls using potential outcomes. Our definitions provide a unifying statistical framework for fundamental concepts of experimental design from the biological and other basic sciences. These controls are defined in terms of whether assumptions are being made about a specific treatment level, outcome, or contrast between outcomes. We discuss experimental controls as tools for researchers to wield in designing experiments and detecting potential design flaws, including using controls to diagnose unintended factors that influence the outcome of interest, assess measurement error, and identify important subpopulations. We believe that experimental controls are powerful tools for reproducible research that are possibly underutilized by statisticians, epidemiologists, and social science researchers.

Thursday, 3 August

Title: On arbitrarily underdispersed discrete distributions
Speaker: Alan Huang, University of Queensland

Abstract: We survey a range of popular count distributions, investigating which (if any) can be arbitrarily underdispersed, i.e., can have variance arbitrarily small compared to the mean. A philosophical implication is that certain models failing this criterion should perhaps not be considered “statistical models” according to the extendibility criterion of McCullagh (2002). Four practical implications will be discussed. We suggest that all generalizations of the Poisson distribution be tested against this property.

Friday, June 2

Speaker: Dennis Leung, University of Melbourne
Location: Room 2020 Abercrombie Bldg H70
Time: 11 am - 12pm

Title: ZAP: Z-Value Adaptive Procedures for False Discovery Rate Control With Side Information

Abstract: Adaptive multiple testing with covariates is an important research direction that has gained major attention in recent years, as it has been widely recognized that leveraging side information provided by auxiliary covariates can improve the power of testing procedures for controlling the false discovery rate (FDR). For example, in the differential expression analysis of RNA-sequencing data, the average read depths across samples can provide useful side information alongside individual p-values, and incorporating such information promises to improve the power of existing methods. However, for two-sided hypotheses, the usual data processing step that transforms the primary statistics, generally known as z-values, into p-values not only leads to a loss of information carried by the main statistics but can also undermine the ability of the covariates to assist with the FDR inference. Motivated by this and building upon recent advances in FDR research, we develop ZAP, a z-value based covariate-adaptive methodology. It operates on the intact structural information encoded jointly by the z-values and covariates, to mimic an optimal oracle testing procedure that is unattainable in practice; the power gain of ZAP can be substantial in comparison with p-value based methods.
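
For a point of comparison, here is the standard p-value based Benjamini–Hochberg procedure, i.e. the covariate-free baseline that covariate-adaptive methods such as ZAP aim to outperform (this is not the ZAP algorithm itself; names are illustrative):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up procedure: reject the k smallest p-values, where k is the
    largest index with p_(k) <= alpha * k / m."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = np.nonzero(p[order] <= alpha * np.arange(1, m + 1) / m)[0]
    k = passed.max() + 1 if passed.size else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    return rejected
```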

Friday, May 19

Speaker: Andrew Zammit Mangion, University of Wollongong

Title: Neural Point Estimation for Fast Optimal Likelihood-Free Inference

Abstract: Neural point estimators are neural networks that map data to parameter point estimates. They are fast, likelihood-free and, due to their amortised nature, amenable to fast bootstrap-based uncertainty quantification. In this talk I give an overview of this relatively new inferential tool, giving particular attention to the ubiquitous problem of making inference from replicated data, which we address in the neural setting using permutation-invariant neural networks. Through extensive simulation studies we show that these neural point estimators can quickly and optimally (in a Bayes sense) estimate parameters in weakly-identified and highly-parameterised models, such as models of spatial extremes, with relative ease. We demonstrate their applicability through an analysis of extreme sea-surface temperature in the Red Sea where, after training, we obtain parameter estimates and bootstrap-based confidence intervals from hundreds of spatial fields in a fraction of a second. This is joint work with Matthew Sainsbury-Dale and Raphaël Huser.
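
A minimal sketch of a permutation-invariant (DeepSets-style) point estimator for replicated data, written in PyTorch; the architecture and names here are illustrative rather than those used in the talk:

```python
import torch
import torch.nn as nn

class NeuralPointEstimator(nn.Module):
    """Inner network applied to each replicate, mean-pooled (permutation
    invariant), then an outer network maps the summary to parameter estimates."""
    def __init__(self, data_dim, param_dim, hidden=64):
        super().__init__()
        self.inner = nn.Sequential(nn.Linear(data_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.outer = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                   nn.Linear(hidden, param_dim))

    def forward(self, x):                   # x: (batch, n_replicates, data_dim)
        pooled = self.inner(x).mean(dim=1)  # pooling makes the estimator permutation invariant
        return self.outer(pooled)

# Training on simulated (parameter, data) pairs with a squared-error loss
# targets the posterior mean, i.e. the Bayes estimator under that loss.
```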

Friday, April 21

Speaker: Bradley Rava, University of Sydney Business School

Title: A Burden Shared is a Burden Halved: A Fairness-Adjusted Approach to Classification

Abstract: We study fairness in classification, where one wishes to make automated decisions for people from different protected groups. When individuals are classified, the decision errors can be unfairly concentrated in certain protected groups. We develop a fairness-adjusted selective inference (FASI) framework and data-driven algorithms that achieve statistical parity in the sense that the false selection rate (FSR) is controlled and equalized among protected groups. The FASI algorithm operates by converting the outputs from black-box classifiers to R-values, which are intuitively appealing and easy to compute. Selection rules based on R-values are provably valid for FSR control, and avoid disparate impacts on protected groups. The effectiveness of FASI is demonstrated through both simulated and real data.

Friday, April 14

Speaker: Ryan Elmore, University of Colorado

Title: NBA Action, It’s FANtastic (and great for data analytics too!)

Abstract: In this talk, I will describe my two most recent statistical problems and solutions related to the National Basketball Association (NBA). In particular, I will discuss (1) the usefulness of a coach calling a timeout to thwart an opposition’s momentum and (2) a novel metric for rating the overall shooting effectiveness of players in the NBA. I will describe the motivation for each problem, how to find data for NBA analyses, modeling considerations, and our results. Lastly, I will describe why I think the analysis of sport, in general, provides an ideal venue for teaching/learning statistical or analytical concepts and techniques.

Friday, March 17

Speaker: Sarat Moka, UNSW
Location: Carslaw 175, or Zoom at https://uni-sydney.zoom.us/j/89991403193

Title: Best Subset Selection for Linear Dimension Reduction Models using Continuous Optimization

Abstract: Selecting the optimal variables in high-dimensional contexts is a challenging task in both supervised and unsupervised learning. This talk focuses on two popular linear dimension-reduction methods, principal components analysis (PCA) and partial least squares (PLS), with diverse applications in genomics, biology, environmental science, and engineering. PCA and PLS construct principal components that are combinations of original variables. However, interpreting principal components becomes challenging when the number of variables is large. To address this issue, we discuss a new approach that combines best subset selection with PCA and PLS frameworks. We use a continuous optimization algorithm to identify the most relevant variables for constructing principal components. Our approach is evaluated using two real datasets, one analysed using PCA and the other using PLS. Empirical results demonstrate the effectiveness of our method in identifying the optimal subset of variables. This is joint work with Prof. Benoit Liquet and Prof. Samuel Muller.

Friday, March 10

Speaker: Suojin Wang, Texas A&M University
Location: Carslaw 175, or Zoom at https://uni-sydney.zoom.us/j/84730883100

Title: Robust regression using probabilistically linked data

Abstract: There is growing interest in a data integration approach to survey sampling, particularly where population registers are linked for sampling and subsequent analysis. The reason for doing this is simple: it is only by linking the same individuals in the different sources that it becomes possible to create a data set suitable for analysis. But data linkage is not error-free. Many linkages are non-deterministic, based on how likely it is that a linking decision corresponds to a correct match, i.e., that it brings together the same individual in all sources. High-quality linking will ensure that the probability of this happening is high; when this is not the case, analysis of the linked data should take account of this additional source of error. This is especially true for secondary analysis carried out without access to the linking information, i.e., the often confidential data that agencies use in their record matching. We describe an inferential framework that allows for linkage errors when sampling from linked registers. After first reviewing current research activity in this area, we focus on secondary analysis and linear regression modelling, including the important special case of estimation of subpopulation and small area means. In doing so we consider both robustness and efficiency of the resulting linked data inferences.

Friday, March 3

Speaker: Akanksha Negi, Monash University
Location: Carslaw 175, or Zoom at https://uni-sydney.zoom.us/j/84730883100

Title: Difference-in-differences with a misclassified treatment

Abstract: This paper studies identification and estimation of the average treatment effect on the treated (ATT) in difference-in-differences designs when the variable that classifies individuals into treatment and control groups (treatment status, D) is differentially (or endogenously) mis-measured. We show that misclassification in D hampers consistent estimation of the ATT because (1) it prevents us from distinguishing the truly treated from those misclassified as being treated and (2) differential misclassification in counterfactual trends may result in parallel trends being violated with D even when they hold with the true but unobserved D*. We propose a two-step estimator to correct for this problem using a flexible parametric specification which allows for considerable heterogeneity in treatment effects. The solution uses a single exclusion restriction embedded in a partial observability probit to point-identify the true ATT. Subsequently, we derive the asymptotic properties of this estimator in panel and repeated cross-section settings. Finally, we apply this method to a large-scale in-kind transfer program in India which is known to suffer from targeting errors.
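
For orientation, with a correctly observed treatment indicator and parallel trends, the canonical two-period identification result that this setting builds on is
$$
\mathrm{ATT} \;=\; \big\{ E[Y_{\text{post}} \mid D=1] - E[Y_{\text{pre}} \mid D=1] \big\} \;-\; \big\{ E[Y_{\text{post}} \mid D=0] - E[Y_{\text{pre}} \mid D=0] \big\};
$$
the paper studies how this fails when $D$ is a misclassified version of the true status $D^*$ and how a single exclusion restriction restores point identification.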

Friday, February 24

Speaker: Lucy Gao, University of British Columbia
Location: Zoom at https://uni-sydney.zoom.us/j/86954517372

Title: Valid inference after clustering with application to single-cell RNA-sequencing data

Abstract: In single-cell RNA-sequencing studies, researchers often model the variation between cells with a latent variable, such as cell type or pseudotime, and investigate associations between the genes and the latent variable. As the latent variable is unobserved, a two-step procedure seems natural: first estimate the latent variable, then test the genes for association with the estimated latent variable. However, if the same data are used for both of these steps, then standard methods for computing p-values in the second step will fail to control the type I error rate.
In my talk, I will introduce two different approaches to this problem. First, I will apply ideas from selective inference to develop a valid test for a difference in means between clusters obtained from the hierarchical clustering algorithm. Then, I will introduce count splitting: a flexible framework that enables valid inference after latent variable estimation in count-valued data, for virtually any latent variable estimation technique and inference approach.
This talk is based on joint work with Jacob Bien (University of Southern California), Daniela Witten and Anna Neufeld (University of Washington), as well as Alexis Battle and Joshua Popp (Johns Hopkins University).
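
A minimal sketch of the count splitting idea for Poisson counts, via binomial thinning (names are illustrative; the speaker's framework is more general than this toy version):

```python
import numpy as np

rng = np.random.default_rng(0)

def count_split(X, eps=0.5):
    """If X ~ Poisson(lam), then X_train ~ Poisson(eps*lam) and
    X_test = X - X_train ~ Poisson((1-eps)*lam) are independent, so the latent
    variable can be estimated on X_train and tested against X_test."""
    X_train = rng.binomial(X, eps)
    return X_train, X - X_train

counts = rng.poisson(lam=5.0, size=(100, 20))   # toy cells-by-genes matrix
train, test = count_split(counts, eps=0.5)
```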

Seminars in 2022

Friday, October 21

Speaker: Howard Bondell, University of Melbourne

Title: Do you have a moment? Bayesian inference using estimating equations via empirical likelihood

Abstract: Bayesian inference typically relies on specification of a likelihood as a key ingredient. Recently, likelihood-free approaches have become popular to avoid specification of potentially intractable likelihoods. Alternatively, in the Frequentist context, estimating equations are a popular choice for inference corresponding to an assumption on a set of moments (or expectations) of the underlying distribution, rather than its exact form. Common examples are in the use of generalised estimating equations with correlated responses, or in the use of M-estimators for robust regression avoiding the distributional assumptions on the errors. In this talk, I will discuss some of the motivation behind empirical likelihood, and how it can be used to incorporate a fully Bayesian analysis into these settings where only specification of moments is desired. This allows one to then take advantage of prior distributions that have been developed to accomplish various shrinkage tasks, both theoretically and in practice. I will further discuss computational issues that arise due to non-convexity of the support of this likelihood and the corresponding posterior, and show how this can be rectified to allow for MCMC and variational approaches to perform posterior inference.
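
To fix ideas, given estimating functions $g(x,\theta)$ with $E\{g(X,\theta_0)\} = 0$, the profile empirical likelihood is
$$
L_{\mathrm{EL}}(\theta) \;=\; \max\Big\{ \prod_{i=1}^{n} w_i \;:\; w_i \ge 0,\ \sum_{i=1}^{n} w_i = 1,\ \sum_{i=1}^{n} w_i\, g(x_i,\theta) = 0 \Big\},
$$
and the Bayesian analyses discussed in the talk replace the ordinary likelihood with $L_{\mathrm{EL}}(\theta)$, so that the posterior is proportional to $\pi(\theta)\, L_{\mathrm{EL}}(\theta)$.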

Friday 16 Sep

Speaker: Mohammad Davoudabadi, University of Sydney

Title: Advanced Bayesian approaches for state-space models with a case study on soil carbon sequestration

Abstract: Sequestering carbon into the soil can mitigate the atmospheric concentration of greenhouse gases, improve crop productivity, and yield financial gains for farmers through the sale of carbon credits. In this work, we develop and evaluate advanced Bayesian methods for modelling soil carbon sequestration and quantifying uncertainty around predictions that are needed to fit more complex soil carbon models, such as multiple-pool soil carbon dynamic models. This study demonstrates efficient computational methods using a one-pool model of the soil carbon dynamics previously used to predict soil carbon stock change under different agricultural practices applied at Tarlee, South Australia. We focus on methods that can improve the speed of computation when estimating parameters and model state variables in a statistically defensible way. This study also serves as a tutorial on advanced Bayesian methods for fitting complex state-space models.

Friday 26 Aug

Speaker: Pauline O'Shaughnessy, University of Wollongong

Title: Multiverse of Madness: Multivariate Moment-Based Density Estimation and Its (Possible) Applications

Abstract: Density approximation is a well-studied topic in statistics, probability and applied mathematics. One of the recently popular methods is to construct a unique polynomial-based series expansion for a density function, an approach which previously lacked practical use due to its computational complexity. Recently, we developed a new approach to approximate the joint density function of multivariate distributions using a moment-based orthogonal polynomial expansion on a bounded space, which is based on a carefully defined hypergeometric differential equation with a generalized form. Exploring the applications of this moment-based density estimation is still underway; so far, we have tried to implement the density estimation method in data privacy, time series analysis, and missing data, which will be demonstrated in this talk. This is joint work with Bradley Wakefield, Prof Yan-Xia Lin, and Wei Mi from UOW.
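
As a simple univariate instance of a moment-based orthogonal polynomial expansion (the talk's construction is multivariate and based on a generalized hypergeometric differential equation, so treat this only as background), a square-integrable density $f$ on $[-1,1]$ admits the Legendre expansion
$$
f(x) \;=\; \sum_{k=0}^{\infty} \frac{2k+1}{2}\, E\big[P_k(X)\big]\, P_k(x),
$$
where each coefficient $E[P_k(X)]$ is a linear combination of the moments of $X$ up to order $k$; truncating the series gives a moment-based density approximation.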

Bio: Pauline is a lecturer in Statistics at the University of Wollongong. Before she joined UOW, Pauline was a statistical consultant at ANU. Her current research interests are in data privacy, statistical disclosure control and mixed modelling. Pauline spends (too much of) her spare time practising saxophone and trumpet for orchestra and concert band.

Friday 12 Aug

Speaker: Nhat Ho, University of Texas at Austin

Title: Instability, Computational Efficiency and Statistical Accuracy

Abstract: Many statistical estimators are defined as the fixed point of a data-dependent operator, with estimators based on minimizing a cost function being an important special case. The limiting performance of such estimators depends on the properties of the population-level operator in the idealized limit of infinitely many samples. We develop a general framework that yields bounds on statistical accuracy based on the interplay between the deterministic convergence rate of the algorithm at the population level, and its degree of (in)stability when applied to an empirical object based on n samples. Using this framework, we analyze both stable forms of gradient descent and some higher-order and unstable algorithms, including Newton's method and its cubic-regularized variant, as well as the EM algorithm. We provide applications of our general results to several concrete classes of singular statistical models, including Gaussian mixture estimation, single-index models, and informative non-response models. We exhibit cases in which an unstable algorithm can achieve the same statistical accuracy as a stable algorithm in exponentially fewer steps; namely, with the number of iterations being reduced from polynomial to logarithmic in sample size n.

Bio: Nhat Ho is currently an Assistant Professor of Statistics and Data Sciences at the University of Texas at Austin. He is also a core member of the Machine Learning Laboratory. His current research focuses on the interplay of four principles of statistics and data science: heterogeneity of data, interpretability of models, stability, and scalability of optimization and sampling algorithms.

Friday 5 Aug

Speaker: Ole Maneesoonthorn, Melbourne Business School, University of Melbourne

Title: Inference of Volatility Models Using Triple Information Sources: An Approximate Bayesian Computation Approach

Abstract: This paper utilizes three sources of information, daily returns, high frequency data and market option prices, to conduct inference about stochastic volatility models. The inferential method of choice is the approximate Bayesian computation (ABC) method, which allows us to construct posterior distributions of the model unknowns from data summaries without assuming a large dimensional measurement model from the three information sources. We employ ABC cut posteriors to dissect the information sources in posterior inference and show that it significantly reduces the computational burden compared to conventional posterior sampling. The benefit of utilizing multiple information sources in inference is explored in the context of predictive performance of financial returns.

Friday 27 May

Speaker: Tao Zou, Australian National University

Title: Quasi-score matching estimation for spatial autoregressive models with random weights matrix and regressors

Abstract: Due to the rapid development of social networking sites, the spatial autoregressive (SAR) model has played an important role in social network studies. However, the commonly used quasi-maximum likelihood estimation (QMLE) for the SAR model is not computationally scalable when the network size is large. In addition, when establishing the asymptotic distribution of the parameter estimators of the SAR model, both the weights matrix and the regressors are assumed to be nonstochastic in classical spatial econometrics, which is perhaps not realistic in real applications. Motivated by the machine learning literature, quasi-score matching estimation for the SAR model is proposed. This new estimation approach is still likelihood-based, but significantly reduces the computational complexity of the QMLE. The asymptotic properties of the parameter estimators under a random weights matrix and regressors are established, which provides a new theoretical framework for the asymptotic inference of SAR-type models. The usefulness of the quasi-score matching estimation and its asymptotic inference are illustrated via extensive simulation studies. This is joint work with Dr Xuan Liang at ANU.
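
For background only (whether quasi-score matching for the SAR model uses exactly this criterion is a detail of the paper), score matching estimators in the sense of Hyvärinen (2005) minimise
$$
J(\theta) \;=\; E_{x \sim p_{\mathrm{data}}}\!\Big[ \tfrac{1}{2}\,\big\|\nabla_x \log p_\theta(x)\big\|^2 \;+\; \Delta_x \log p_\theta(x) \Big],
$$
which does not involve the normalising constant of $p_\theta$; avoiding such intractable terms is what makes score-matching-type criteria attractive when the QMLE is computationally expensive.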

Friday 20 May

Speaker: Jing Cao, Southern Methodist University

Title:

Abstract: As a branch of machine learning, multiple instance learning (MIL) learns from a collection of labeled bags, each containing a set of instances. Each instance is described by a feature vector. Since its emergence, MIL has been applied to solve various problems including content-based image retrieval, object tracking/detection, and computer-aided diagnosis. In this study, we apply MIL to text sentiment analysis. The current neural-network-based approaches in text analysis enjoy high classification accuracies but usually lack interpretability. The proposed Bayesian MIL model treats each text document as a bag, where the words are the instances. The model has a two-layered structure. The first layer identifies whether a word is essential or not (i.e., a primary instance), and the second layer assigns a sentiment score over the individual words of a document. The motivation of our approach is that by combining the attention mechanism from neural networks with a relatively simple statistical model, we can hopefully obtain the best of both worlds: the interpretability of a statistical model and the high predictive performance of neural-network models.

Bio:

Dr. Jing Cao is a professor of statistics in the Department of Statistical Science at Southern Methodist University in the US. Her research interests include Bayesian methods (Bayesian hierarchical models, Bayesian spatial-temporal modeling, Bayesian latent variable modeling, etc.), high-throughput data analysis, clinical trial design and analysis, machine learning and text mining. She has received funding from NIH, NSF, and IES to support her research. Dr. Cao also serves as the chair of the ASA Professional Ethics Committee.

Friday 13 May

Speaker: Ziyang Lyu, University of New South Wales

Title:

Abstract: We derive asymptotic results for the maximum likelihood and restricted maximum likelihood (REML) estimators of the parameters in the nested error regression model when both the number of independent clusters and the cluster sizes (the number of observations in each cluster) go to infinity. A set of conditions is given under which the estimators are shown to be asymptotically normal. There are no restrictions on the rate at which the cluster size tends to infinity. We also show that the asymptotic distributions of the estimated best linear unbiased predictors (EBLUPs) of the random effects, with ML/REML estimated variance components, converge to the true distributions of the corresponding random effects when both the number of independent clusters and the cluster sizes go to infinity.

Bio:

Dr Ziyang Lyu is a postdoc at UNSW under the supervision of Prof. Scott Sisson. His main research fields are small area estimation, model prediction, linear mixed models, finite Gaussian mixture models, and asymptotic theory for maximum likelihood inference. He is also interested in exploring extreme value theory and machine learning from a statistical methodology perspective, especially semi-supervised learning (pattern recognition).

Friday 29 April

Speaker: Benoit Liquet-Weiland, Macquarie University

Title: Variable Selection and Dimension Reduction Methods for High-Dimensional and Big Data Sets

Abstract: It is well established that incorporation of prior knowledge on the structure existing in the data for potential grouping of the covariates is key to more accurate prediction and improved interpretability.
In this talk, I will present new multivariate methods incorporating grouping structure in frequentist and Bayesian methodology for variable selection and dimension reduction to tackle the analysis of high-dimensional and big data sets. We develop methods using both penalised likelihood methods and Bayesian spike-and-slab priors to induce structured sparsity. Illustrations on genomics datasets will be presented.

Friday 22 April

Speaker: Yingxin Lin, University of Sydney

Title: Statistical modelling and machine learning for single-cell data harmonisation and analysis

Abstract: Technological advances such as large-scale single-cell profiling have exploded in recent years and enabled unprecedented understanding of the behaviour of individual cells. Effectively harmonising multiple collections and different modalities of single-cell data and accurately annotating cell types using reference data, which we consider the intermediate data analysis step in this thesis, serve as a foundation for the downstream analysis to uncover biological insights from single-cell data. This thesis proposes several statistical modelling and machine learning methods to address challenges in intermediate data analysis in the single-cell omics era, including: (1) scMerge, to effectively integrate multiple collections of single-cell RNA-sequencing (scRNA-seq) datasets from a single modality; (2) scClassify, to annotate cell types for scRNA-seq data by capitalising on the large collection of well-annotated scRNA-seq datasets; and (3) scJoint, to integrate unpaired atlas-scale single-cell multi-omics data and transfer labels from scRNA-seq datasets to scATAC-seq data. We illustrate that the proposed methods enable a novel and scalable workflow to integratively analyse large-cohort single-cell data, demonstrated using a collection of single-cell multi-omics COVID-19 datasets.

Bio: Yingxin Lin is a final year PhD student in Statistics at The University of Sydney under the supervision of Prof. Jean Yang, Dr. Rachel Wang and A/Prof. John Ormerod. Since the beginning of this year, she has been working as a postdoctoral researcher at the University of Sydney. She is a member of the School of Mathematics and Statistics and Sydney Precision Bioinformatics Alliance. Her research interests lie broadly in statistical modelling and machine learning for various omics, biomedical and clinical data, specifically focusing on methodological development and data analysis for single-cell omics data.

Friday 8 April

Speaker: Michael Stewart, University of Sydney

Title: Detection boundaries for sparse gamma scale mixture models

Abstract: Mixtures of distributions from a parametric family are useful for various statistical problems, including nonparametric density estimation, as well as model-based clustering. In clustering, an enduringly difficult problem is choosing the number of clusters; when using mixture models for model-based clustering this corresponds (roughly) to choosing the number of components in the mixture. The simplest version of this model selection problem is choosing between a known single-component mixture, and a "contaminated" version where a second unknown component is added. Due to certain structural irregularities, many standard asymptotic results from hypothesis testing do not apply in these "mixture detection" problems, including those relating to power under local alternatives. Detection boundaries have arisen over the past few decades as useful ways to describe what kinds of local alternatives are and are not detectable (asymptotically) in these problems, in particular in the "sparse" case where the mixing proportion of the contaminant is very small. We review early work on simple normal location mixtures, some interesting generalisations and also recent results for a gamma scale mixture model.
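
In the classical sparse normal location case mentioned above, with $X_i \sim (1-\varepsilon_n)N(0,1) + \varepsilon_n N(\mu_n,1)$, $\varepsilon_n = n^{-\beta}$ and $\mu_n = \sqrt{2r\log n}$, the detection boundary of Ingster and of Donoho and Jin is
$$
\rho^*(\beta) \;=\;
\begin{cases}
\beta - \tfrac{1}{2}, & \tfrac{1}{2} < \beta \le \tfrac{3}{4},\\
\big(1-\sqrt{1-\beta}\big)^2, & \tfrac{3}{4} < \beta < 1,
\end{cases}
$$
so that consistent detection is possible when $r > \rho^*(\beta)$ (e.g., by higher criticism) and every test is asymptotically powerless when $r < \rho^*(\beta)$; the talk describes the analogous boundary for the gamma scale mixture model.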

Friday 1 April

Speaker: Song Zhang, University of Texas Southwestern Medical Center

Title: Power Analysis for Cluster Randomized Trials with Multiple Primary Endpoints

Abstract: Cluster randomized trials (CRTs) are widely used in different areas of medicine and public health. Recently, with increasing complexity of medical therapies and technological advances in monitoring multiple outcomes, many clinical trials attempt to evaluate multiple primary endpoints. In this study, we present a power analysis method for CRTs with K > 2 binary co-primary endpoints. It is developed based on the GEE (generalized estimating equation) approach, and three types of correlations are considered: inter-subject correlation within each endpoint, intra-subject correlation across endpoints, and inter-subject correlation across endpoints. A closed-form joint distribution of the K test statistics is derived, which facilitates the evaluation of power and type I error for arbitrarily constructed hypotheses. We further present a theorem that characterizes the relationship between various correlations and testing power. We assess the performance of the proposed power analysis method based on extensive simulation studies. An application example to a real clinical trial is presented.
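
For a single endpoint, the familiar starting point for such calculations is the design effect: with clusters of size $m$ and intraclass correlation $\rho$, the sample size required under individual randomization is inflated by
$$
\mathrm{DE} \;=\; 1 + (m-1)\rho;
$$
the method in this talk extends this style of calculation to $K$ correlated binary co-primary endpoints by working with the joint distribution of the $K$ GEE test statistics.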

Bio: Dr. Song Zhang is a professor of biostatistics in the Department of Population and Data Sciences, University of Texas Southwestern Medical Center. He received his Ph.D. in statistics from the University of Missouri-Columbia in 2005. His research interests include Bayesian hierarchical models with application to disease mapping, missing data imputation, joint modeling of longitudinal and survival outcomes, and genomic pathway analysis, as well as experimental design methods for clinical trials with clustered/longitudinal outcomes, different types of outcome measures, missing data patterns, correlation structures, and financial constraints. He has co-authored a book titled "Sample Size Calculations for Clustered and Longitudinal Outcomes in Clinical Research" (Chapman & Hall/CRC). As the principal investigator, Dr. Zhang has received funding from PCORI, NIH, and NSF to support his research.

Friday 11 March

Speaker: Nick Fisher, University of Sydney

Title: World University Ranking systems, Texas Target Practice, and a Gedankenexperiment

Abstract: World University Ranking (WUR) systems play a significant role in how universities are funded and whom they can attract as faculty and students. Yet, for the purpose of comparing universities as institutions of higher education, current systems are readily gamed, provide little guidance about what needs to be improved, and fail to allow for the diversity of stakeholder needs in making comparisons. We suggest a list of criteria that a WUR system should meet, and which none of the current popular systems appears satisfy. By using as a starting point the goal of creating value for the diverse and sometimes competing stakeholder requirements for a university, we suggest via a thought experiment a rating process that is consistent with all the criteria, and a way in which it might be trialled. Also, the resulting system itself adds value for individual users by allowing them to tune it to their own particular circumstances. However, an answer to the simple question “Which is the best university” may well be: there is no simple answer.