# University of Sydney Statistics Seminar Series

*Unless otherwise specified, seminars will be held on Fridays at 2pm in Carslaw 373.*

To be added to or removed from the mailing list, or for any other information, please contact Munir Hiabu.

## 2020 Semester 1

### Wednesday April 15 4pm, Location TBA

**Nancy Reid**

**Title: TBA**

**Abstract**

TBA

### Friday March 27 2pm, AGR Carslaw 829

**Wanchuang Zhu**

**Title: Integrated Partition-Mallows Model and Its Inference for Rank Aggregation**

**Abstract**

Learning how to rank and how to aggregate ranking lists has been an area of active research for many years and its advances have played a vital role in the recent boom of internet commerce. The problem of discerning reliability of rankers based only on the rank data is of great interest to many practitioners, but has received less attention from researchers. By dividing the ranked entities into relevant and irrelevant (background) groups and incorporating the Mallows model, we propose a framework that can not only distinguish quality differences among the rankers, but also provide the detailed ranking information for relevant entities. Advantages of the proposed approach in comparison with existing ones are demonstrated via simulation studies and real-data applications. Extensions of our method to handle partial ranking lists and conduct covariate-assisted rank aggregation are also discussed.
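As background, the Mallows component of the model assigns each ranking a probability that decays geometrically in its Kendall-tau distance from a central ranking. A toy sketch (brute-force normalisation, feasible only for a handful of items; illustrative background, not the authors' integrated partition model):

```python
import itertools

def kendall_tau_distance(pi, sigma):
    """Number of discordant pairs between two rankings (lists of items)."""
    pos = {item: i for i, item in enumerate(sigma)}
    n = len(pi)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if pos[pi[i]] > pos[pi[j]])

def mallows_pmf(pi, pi0, phi):
    """Probability of ranking `pi` under a Mallows model centred at `pi0`
    with dispersion phi in (0, 1]; normalised by brute force, so only
    suitable for small numbers of items."""
    items = list(pi0)
    z = sum(phi ** kendall_tau_distance(list(p), pi0)
            for p in itertools.permutations(items))
    return phi ** kendall_tau_distance(pi, pi0) / z

# the central ranking is the single most probable one
pi0 = [1, 2, 3, 4]
probs = {p: mallows_pmf(list(p), pi0, phi=0.5)
         for p in itertools.permutations(pi0)}
```

Smaller `phi` concentrates mass more tightly around the central ranking; `phi = 1` gives the uniform distribution over permutations.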

### Friday March 6 2pm, AGR Carslaw 829

**Clara Grazian** (UNSW, School of Mathematics and Statistics)

**Title: The importance of being conservative: Bayesian analysis for mixture models**

**Abstract**

From a Bayesian perspective, mixture models have been characterised by restrictive prior modelling, since their ill-defined nature makes most improper priors unacceptable. In particular, recent results have shown the inconsistency of the posterior distribution on the number of components when standard nonparametric prior processes are used. We propose an analysis of prior choices characterised by their conservativeness in the number of components. Among the proposals, we derive a prior distribution on the number of clusters which considers the loss one would incur if the true number of components were not considered. The prior has an elegant and easy-to-implement structure, which allows one to naturally include any prior information that may be available, or to opt for a default solution when such information is not available. The methods are then applied to two real data sets. The first consists of retrieval times for monitoring IP packets in computer network systems; the second consists of measures registered in antimicrobial susceptibility tests for 14 compounds used in the treatment of M. tuberculosis. In both situations the number of clusters is uncertain, and different solutions lead to different interpretations.

### Friday February 28 2pm, AGR Carslaw 829

**Hien Nguyen** (La Trobe University, Department of Mathematics and Statistics)

**Title: Shapley values for linear regression models and its application to explainable AI**

**Abstract**

The linear regression model is a mainstay in the toolkit of statisticians, and is a sufficiently powerful method for modelling the dependence structures between some variable of interest and its possible explanatory covariates. A common tool for quantifying the degree to which a linear regression model can explain some phenomenon is the coefficient of determination, or R^2 coefficient. A natural question is what proportion each explanatory variable contributes to the overall value of R^2. This can be answered using the Shapley value decomposition: a game-theoretic mechanism for decomposing the utility of a cooperative game in an axiomatic way. We demonstrate how the Shapley value decomposition can be applied to the R^2 coefficient, and describe a process for asymptotic computation of its variability. We then describe how the Shapley value can be used in the setting of explainable AI, where one wishes to quantify the importance of variables in a machine learning context, in a model-agnostic way.
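The subset-averaging behind the Shapley decomposition of R^2 can be made concrete for a handful of covariates. A rough sketch on synthetic data (exhaustive over subsets, so exponential in the number of covariates; the function names are illustrative, not from the talk):

```python
import itertools
import math
import numpy as np

def r_squared(X, y, subset):
    """R^2 of the OLS fit of y on the columns of X indexed by `subset`
    (plus an intercept). The empty subset gives R^2 = 0."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())

def shapley_r2(X, y):
    """Shapley decomposition of R^2: each covariate's average marginal
    contribution over all orderings, computed via weighted subsets."""
    p = X.shape[1]
    phi = np.zeros(p)
    for j in range(p):
        others = [k for k in range(p) if k != j]
        for size in range(p):
            w = math.factorial(size) * math.factorial(p - size - 1) / math.factorial(p)
            for S in itertools.combinations(others, size):
                phi[j] += w * (r_squared(X, y, S + (j,)) - r_squared(X, y, S))
    return phi

# synthetic example: coefficient 2 on x0, 1 on x1, 0 on x2
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(size=150)
phi = shapley_r2(X, y)
total = r_squared(X, y, (0, 1, 2))
```

By the efficiency axiom the shares sum exactly to the full-model R^2, and since adding a regressor never decreases in-sample R^2, each share is nonnegative.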

### Wednesday February 26 4pm, New Law School Lecture Theatre 106

**Jerome H. Friedman** (Stanford University, Department of Statistics)

**Title: Contrast Trees and Distribution Boosting**

**Abstract**

Often machine learning methods are applied and results reported in cases where there is little to no information concerning the accuracy of the output. Simply because a computer program returns a result does not ensure its validity. If decisions are to be made based on such results, it is important to have some notion of their veracity. Contrast trees represent a new approach for assessing the accuracy of many types of machine learning estimates that are not amenable to standard (cross-)validation methods. They are easily interpreted and can be used as diagnostic tools to reveal and then understand the inaccuracies of models produced by any learning method. In situations where inaccuracies are detected, boosted contrast trees can often improve performance. A special case, distribution boosting, provides an assumption-free method for directly estimating the full conditional distribution of an outcome variable y for any given set of joint predictor variable values x.

## 2019 Semester 2

### Friday November 1 2pm, AGR Carslaw 829

**Michael Stewart** (USyd, School of Mathematics and Statistics)

**Title: Estimating non-smooth scale functionals of random effect distributions**

**Abstract**

We report on recent progress in ongoing work, jointly with Professor Alan Welsh of ANU, on inference concerning scale parameters of latent distributions in random effect models. An overarching motivation is robustness which in turn motivates the following two aims: (a) that robust, possibly non-smooth scale functionals, like interquartile range or median absolute deviation (from the median) might be of interest; (b) that we guard as much as possible against model misspecification. This leads us to consider bringing the substantial literature on semiparametric theory to bear on the problem. Semiparametric theory leads to optimal inference for Euclidean parameters in the presence of infinite-dimensional nuisance parameters, i.e. nuisance functions. We can thus regard the shape of the latent distribution, as well as the densities of residuals as nuisance functions and try to develop methods which attain good performance regardless of the value of the nuisance functions. The talk will give an overview of semiparametric methods in the case of a standard location/scale problem and in particular point out models where so-called efficient score functions may be identified which do not depend on nuisance functions. We then detail our efforts at transferring such properties to the random effects model. We point out connections with recently developed semiparametric methods in missing data models and also the limitations of such methods in our setting. Unsurprisingly, since we are dealing with latent distributions, Bayesian theory also has some relevance to our problem and we also point out ways we have tried to exploit this. Finally we present and discuss the results of some simulations which indicate that our methods may indeed offer some improvement over existing “naïve” methods.

### Friday September 27 2pm, AGR Carslaw 829

**Kylie-Anne Richards** (UTS, Finance Discipline Group)

**Title: Score Test for Marks in Hawkes Processes**

**Abstract**

A score statistic for detecting the impact of marks in a linear Hawkes self-exciting point process is proposed, with its asymptotic properties, finite sample performance, power properties using simulation and application to real data presented. A major advantage of the proposed inference procedure is that the Hawkes process can be fitted under the null hypothesis that marks do not impact the intensity process. Hence, for a given record of a point process, the intensity process is estimated once only and then assessed against any number of potential marks without refitting the joint likelihood each time. Marks can be multivariate as well as serially dependent. The score function for any given set of marks is easily constructed as the covariance of functions of future intensities, fitted to the unmarked process, with functions of the marks under assessment. The asymptotic distribution of the score statistic is chi-squared, with degrees of freedom equal to the number of parameters required to specify the boost function. Model-based or non-parametric estimation of the required features of the marks' marginal moments and serial dependence can be used. The use of sample moments of the marks in the test statistic construction does not impact size and power properties.

### Friday August 30 2pm, AGR Carslaw 829

**Mohsen Pourahmadi** (Texas A&M University, Department of Statistics)

**Title: Modeling Structured Correlation Matrices**

**Abstract**

There has been a flurry of activity in the last two decades in modeling/reparametrizing correlation matrices going beyond the familiar Fisher z-transform of a single correlation coefficient. We present an overview of the developments focusing on reparametrizing Cholesky factors of correlation matrices using hyperspherical coordinates where the ensuing angles are meaningful geometrically. In spite of the lack of broadly accepted statistical interpretation, we demonstrate that these angles are quite flexible and effective for parsimonious modeling of large nearly block-structured correlation matrices commonly encountered in finance, environmental and biological sciences. Asymptotic normality of the maximum likelihood estimates of these angles as new parameters is established. Real examples will be used to demonstrate the flexibility and applicability of the methodology. The role of an order among the variables and connection with the recent surge of interest in sparse estimation of directed acyclic graphs (DAG) will be discussed time permitting. (Joint work with Ruey Tsay, U of Chicago)
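As background, the hyperspherical reparametrisation maps unconstrained angles in (0, π) to a valid correlation matrix: each row of a lower-triangular factor B is a unit vector built from products of sines and cosines, so R = BBᵀ automatically has unit diagonal. A minimal sketch, assuming the standard row-wise angle convention:

```python
import math

def angles_to_correlation(theta):
    """Build a correlation matrix R = B B^T from hyperspherical angles.

    `theta` is a list of lists: theta[i] holds the i+1 angles (in (0, pi))
    for row i+1 of the lower-triangular factor B. Row 0 of B is
    (1, 0, ..., 0); row i is a unit vector parameterised by theta[i-1],
    so R automatically has unit diagonal and is positive semi-definite.
    """
    d = len(theta) + 1
    B = [[0.0] * d for _ in range(d)]
    B[0][0] = 1.0
    for i in range(1, d):
        prod = 1.0
        for j in range(i):
            B[i][j] = math.cos(theta[i - 1][j]) * prod
            prod *= math.sin(theta[i - 1][j])
        B[i][i] = prod  # leftover product of sines keeps the row on the unit sphere
    # R = B B^T
    return [[sum(B[i][k] * B[j][k] for k in range(d)) for j in range(d)]
            for i in range(d)]

# a 3x3 example: note R[0][1] = cos(theta[0][0]) exactly
R = angles_to_correlation([[math.pi / 3], [math.pi / 4, math.pi / 5]])
```

Because the angles are (box-)unconstrained, they can be estimated by standard unconstrained optimisation while positive definiteness of R holds by construction.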

### Friday August 23, 2pm, New Law LT 024

**Rebecca Barter** (UC Berkeley, Department of Statistics)

**Title: Transitioning into the tidyverse**

**Abstract**

Most people who learned R before the tidyverse have likely started to feel a nibble of pressure to get aboard the tidyverse train. Sadly a fact of human nature is that once you’re comfortable doing something a certain way, it’s hard to find the motivation to learn a different way of doing something that you already know how to do. While the tidyverse is primarily made up of a set of super useful R packages (ggplot2, dplyr, purrr, tidyr, readr, tibble), it is also a way of thinking about implementing “tidy” data analysis. If you combine tidy thinking with the tidy packages, you will inevitably become a master of tidy analysis. This talk will provide a guide for easing into the tidyverse for new users by first focusing on piping, dplyr, and ggplot2, and then providing short summaries and references for the remaining packages that form the tidyverse (and how they play together). Additional resources for continuing your tidyverse journey will be provided.

**Target audience:**

People familiar with base R who want to familiarize themselves with the tidyverse, as well as those who have a little experience with the tidyverse and want a big picture of how it all fits together.

### Friday August 16, 2pm, AGR Carslaw 829

**Anthony Hayter** (University of Denver, Daniels College of Business, Department of Business Information and Analytics)

**Title: Recent Advances in Statistical Inference and Computational Methodologies**

**Abstract**

Some recent work in statistical inference and computational methodologies will be presented, with applications.
Recursive integration methods will be discussed, which provide an efficient way of evaluating high-dimensional integral expressions.
Discussions will also be provided of recent advances in win-probabilities, and latent score methodologies for modelling financial credit ratings.

**Bio**
Anthony Hayter is a Full Professor in the Department of Business Information and Analytics at the University of Denver.
He has an M.A. in mathematics from Cambridge University, and a Ph.D. in Statistics from Cornell University.
He is the author of the textbook “Probability and Statistics for Engineers and Scientists,” the 4th edition of which was published in 2012.
He has global interests and has taught statistics in some Japanese MBA programs.
He was awarded a Fulbright Foreign Scholarship Award in 2011-2012 and a Fulbright Specialist Grant in 2014 to assist the government, universities, and businesses in Thailand with surveys, data analysis, curriculum development and research projects.

### Friday August 16, 3pm, AGR Carslaw 829

**Saumen Mandal** (University of Manitoba, Department of Statistics)

**Title: Optimal designs with applications in estimation of parameters under constraints**

**Abstract**

There are a variety of problems in statistics, which demand the calculation of one or more optimizing probability distributions or measures. Examples include optimal regression design, maximum likelihood estimation and stratified sampling. In this talk, I will first give a brief introduction on optimal design theory. I will then present some potential applications of optimal design. In many regression designs it is desired to estimate certain parameters independently of others. Interest in this objective was motivated by a practical problem in Chemistry. We construct such designs by minimizing the squared covariances between the estimates of parameters or linear combinations of parameters of a linear regression model. As a second application, I will consider a problem of determining maximum likelihood estimates under a hypothesis of marginal homogeneity for data in a square contingency table. This is an example of an optimization problem with respect to variables which should be nonnegative and satisfy several linear constraints. The constraints are based on the marginal homogeneity conditions. We solve this problem by simultaneous optimization techniques. We apply the methodology in some real data sets and report the optimizing distributions. The methodology can be applied to a wide class of estimation problems where constraints are imposed on the parameters. This is based on joint work with B. Torsney and M. Chowdhury.

### Friday August 9, 2pm, AGR Carslaw 829

**Yundong Tu** (Peking University, Guanghua School of Management, Department of Business Statistics and Econometrics)

**Title: Spurious Functional-coefficient Regression Models and Robust Inference with Marginal Integration**

**Abstract**

Functional-coefficient cointegrating models have become popular for modelling non-linear nonstationarity in econometrics (Cai et al., 2009; Xiao, 2009). However, there has been little study of testing for the existence of functional-coefficient cointegration. Consequently, functional-coefficient regressions involving nonstationary regressors may be spurious. This paper investigates the effect that spurious functional-coefficient regression has on model diagnostics. We find that common characteristics of spurious regressions are manifest, including divergent local significance tests, random local goodness-of-fit, and a local Durbin-Watson ratio converging to zero, complementing those discovered in spurious linear and nonparametric regressions (Phillips, 1986, 2009). In addition, spuriousness causes divergence of the global significance tests proposed by Xiao (2009) and Sun et al. (2016), which are likely to produce misleading conclusions for practitioners. To resolve these problems, we propose a simple-to-implement inference procedure based on a semiparametric balanced regression, augmenting the regressors of the original spurious regression with the lagged dependent and independent variables, with the aid of marginal integration. This procedure achieves spurious regression detection via standard nonparametric inferential asymptotics, and is found to be robust to the true relationship between the integrated processes. Monte Carlo simulations show that the balanced-regression-based tests have very good size and power in finite samples.

## 2019 Semester 1

### Friday July 19, 2pm, Carslaw 373

**Dan Simpson** (University of Toronto, Department of Statistical Sciences)

**Title: Placating pugilistic pachyderms: proper priors prevent poor performance**

**Abstract**

Modern statistical inference finds itself caught between two charging elephants: an elephant named model complexity and the elephant who answers only to "expressivity". Within the Bayesian framework, prior distributions are a way to try to balance these angry pachyderms. In this talk, I will cover a bunch of methods for specifying and evaluating prior distributions for complex statistical models.

### Friday July 12, 2pm, Carslaw 373

**Weng Kee Wong** (UCLA School of Public Health, Department of Biostatistics)

**Title: Using Animal Instincts to Find Efficient Experimental Designs for Biomedical Studies**

**Abstract**

This talk reviews and discusses nature-inspired metaheuristic algorithms as general purpose optimization tools for solving problems in statistics. The approach works quite magically and frequently finds an optimal solution or a nearly optimal solution quickly. There is virtually no explicit assumption required for such methods to be applicable and the user only needs to input a few easy tuning parameters. We focus on one of the more popular algorithms, particle swarm optimization (PSO), and demonstrate, as an application, its ability to find various types of optimal experimental designs for dose response studies and other biomedical problems, including optimal designs for generalized linear models with many interacting factors and standardized maximin optimal designs, where effective algorithms to find them have remained stubbornly elusive until now.
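For a flavour of the metaheuristic, here is a bare-bones PSO minimiser on a toy objective (illustrative only: a real design problem would optimise a design criterion such as D-optimality over the design space, and the tuning constants below are conventional defaults, not values from the talk):

```python
import random

def pso_minimise(f, dim, n_particles=30, iters=200, seed=1,
                 lo=-5.0, hi=5.0, w=0.7, c1=1.5, c2=1.5):
    """Minimal particle swarm optimisation: each particle remembers its own
    best position, the swarm tracks a global best, and velocities blend
    inertia (w), a pull toward the personal best (c1), and a pull toward
    the global best (c2)."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

Note the appeal mentioned in the abstract: nothing here assumes differentiability or convexity of `f`; the user supplies only the objective and a few tuning parameters.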

### Friday June 21, 2pm, Carslaw 373

**Sergey Dolgov** (University of Bath, Department of Mathematical Sciences)

**Title: Low-rank tensor decompositions for sampling of high-dimensional probability distributions**

**Abstract**

Uncertainty quantification and inverse problems in many variables are pressing tasks in many applications, yet high-dimensional functions are notoriously difficult to integrate in order to compute desired quantities of interest. Functional approximations, in particular the low-rank separation of variables into tensor product decompositions, have become popular for reducing the computational cost of high-dimensional integration down to linear scaling in the number of variables. However, tensor approximations may be inefficient for non-smooth functions. Sampling-based Monte Carlo methods are more general, but they may exhibit very slow convergence, overlooking hidden structure in the function. In this talk we review tensor product approximations for the problem of uncertainty quantification and Bayesian inference. This allows efficient integration of smooth PDE solutions, posterior density functions and quantities of interest. Moreover, we can use the low-rank approximation of the density function to construct efficient proposals in the MCMC algorithm, as well as in importance weighting. This combined tensor-approximation/MCMC method remains accurate even when the quantity of interest is not smooth, such as the indicator function of an event.

### Friday May 10, 2pm, Carslaw 373

**Mehdi Maadooliat** (Marquette University, Department of Mathematics, Statistics and Computer Science)

**Title: Collective Nonparametric Density and Spectral Density Estimation with Applications in Bioinformatics**

**Abstract**

In this talk, I review a nonparametric method for collective estimation of multiple bivariate density functions for a collection of populations of protein backbone angles. This collective density estimation approach is widely applicable when there is a need to estimate multiple density functions from different populations with common features. In the second part of the talk, I present an extension of this approach for the simultaneous estimation of spectral density functions (SDFs) for a collection of stationary time series that share some common features. A collective estimation approach pools information and borrows strength across the SDFs to achieve better estimation efficiency. Also, each estimated spectral density has a concise representation using the coefficients of the basis expansion, and these coefficients can be used for visualization, clustering, and classification purposes. The Whittle pseudo-maximum likelihood approach is used to fit the model, and an alternating blockwise Newton-type algorithm is developed for the computation. A web-based shiny App is developed for visualization, training and learning the SDFs collectively using the proposed technique. Finally, we apply our method to cluster similar brain signals recorded by the electroencephalogram for identifying synchronized brain regions according to their spectral densities.

### Friday May 3, 2pm, Carslaw 373

**Ioannis Kasparis** (University of Cyprus, Department of Economics)

**Title: Regressions with Heavy Tailed Weakly Nonstationary Processes**

**Abstract**

We develop a limit theory for general additive functionals of Weakly Nonstationary Processes (WNPs) under heavy tailed innovations. In particular, we assume WNPs driven by innovations that are in the domain of attraction of an α-stable law with stability parameter α∈(0,2]. The current work generalises the recent limit theory of Duffy and Kasparis (2018), who consider WNPs under second moments. The defining characteristic of WNPs is that their empirical versions, upon standardisation, converge weakly to white noise processes rather than fractional Gaussian or fractional stable motions, which is typically the case under nonstationarity. As a consequence, the usual asymptotic methods (i.e. FCLTs) are not applicable, and different methods are required. The leading examples of WNPs under consideration are fractional (d = 1 - 1/α) and mildly integrated processes driven by heavy tailed errors. Our main limit results are utilised for the asymptotic analysis of both parametric and nonparametric regression estimators.

Joint work with S. Arvanitis (Athens University of Economics & Business) and J.A. Duffy (Oxford).

### Friday April 12, 2pm, Carslaw 373

**Noel Cressie** (University of Wollongong, School of Mathematics and Applied Statistics)

**Title: Inference for Spatio-Temporal Changes of Arctic Sea Ice**

**Abstract**

Arctic sea-ice extent has been of considerable interest to scientists in recent years, mainly due to its decreasing trend over the past two decades. In this talk, a hierarchical spatio-temporal generalized linear model (GLM) is fitted to binary Arctic-sea-ice data, where data dependencies are introduced in the model through a latent dynamic spatio-temporal linear mixed-effects model. By using a reduced number of spatial basis functions, the resulting model achieves both dimension reduction and non-stationarity for spatial fields at different time points. An EM algorithm is used to estimate model parameters, and an (empirical) hierarchical-statistical-modelling approach is used to obtain the predictive distribution of the latent spatio-temporal process. Spatial binary Arctic-sea-ice data for each September over the last 20 years are analysed in this way. Maps of predicted water-ice potential and their uncertainties and posterior summaries show the changes in Arctic sea-ice cover during this relatively short time period.

### Friday April 5, 2pm, Carslaw 373

**Linh Nghiem** (Australian National University, Research School of Finance, Actuarial Studies and Statistics)

**Title: Estimation in linear errors-in-variables models with unknown error distribution**

**Abstract**

Linear errors-in-variables models arise when some model covariates cannot be measured accurately. Although it is well known that not correcting for measurement errors leads to inconsistent and biased estimates, most correction methods typically require that the measurement error distribution be known (or estimable from replicate data). A generalized method of moments approach can be used to estimate model parameters in the absence of knowledge of the error distributions, but requires the existence of a large number of model moments. We propose a new estimation method that does not require either of these assumptions. The new estimator is based on the phase function, a normalized version of the characteristic function. This approach only requires the model covariates to have asymmetric distributions, while the error distributions are symmetric. We prove that the phase function-based estimator is asymptotically normal, and that it has competitive performance in finite samples compared to existing methods even while making fewer model assumptions with respect to measurement error. Furthermore, we propose a new modified bootstrap algorithm for fast computation of the standard error of the estimates. Finally, the proposed method is applied to a real dataset concerning the measurement of air pollution. This work represents a completely new way of approaching the linear errors-in-variables model.

### Friday March 8, 2pm, Carslaw 373

**Peter Robinson** (London School of Economics, Department of Economics)

**Title: Long-Range Dependent Curve Time Series**

**Abstract**

We introduce methods and theory for functional or curve time series with long range dependence. The temporal sum of the curve process is shown to be asymptotically normally distributed, the conditions for this covering a functional version of fractionally integrated autoregressive moving averages. We also construct an estimate of the long-run covariance function, which we use, via functional principal component analysis, in estimating the orthonormal functions spanning the dominant sub-space of the curves. In a semiparametric context, we propose an estimate of the memory parameter and establish its consistency. A Monte-Carlo study of finite sample performance is included, along with two empirical applications. The first of these finds a degree of stability and persistence in intra-day stock returns. The second finds similarity in the extent of long memory in incremental age-specific fertility rates across some developed nations.

*Joint work with Degui Li and Han Lin Shang.*

### Friday March 1, 2pm, Carslaw 373

**Munir Hiabu** (University of Sydney, School of Mathematics and Statistics)

**Title: Structured survival models with least squares backfitting**

**Abstract**

Linear models are great. They are very well understood, estimators can be quickly calculated via matrix algebra, and results have nice interpretations.

But there is one catch: They do not account for non-linearity.

The much-hyped solution today for dealing with incredibly complex data is deep learning. Deep neural networks provide flexibility for fitting functions of very general shape.

But the downside is twofold: firstly, they need large amounts of data to achieve an accurate fit, and secondly, interpretation of the entering covariates is lost.

A middle ground solution is structured models.

Structured models are not as flexible as deep learning, but they still allow enough non-linearity to be captured. Additionally, interpretation of the entering covariates is not lost. See the Wikipedia article on the most prominent representative, (generalized) additive models: https://en.wikipedia.org/wiki/Additive_model.

The models are most often solved via a backfitting algorithm proposed by Buja, Hastie & Tibshirani (1989, Ann. Stat.). However, their backfitting algorithm is not derived from a proper optimisation criterion/loss function and is heuristic in nature. One major weakness is that this backfitting algorithm suffers from problems with correlated covariates and multicollinearity.

In this talk I will propose least-squares motivated estimators which do not suffer this problem.

The least squares treatment leads to integral equations of the second kind, allowing for a nice mathematical treatment. I will mostly focus on survival models, where interpretation of the data is often key.
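For readers unfamiliar with backfitting, the classic Buja-Hastie-Tibshirani scheme cycles through the additive components, smoothing the partial residuals against each covariate in turn. A toy sketch with a crude box-kernel smoother (illustrative background only, not the least-squares estimator proposed in the talk):

```python
import math
import random

def smooth(x, y, bandwidth=0.3):
    """Box-kernel smoother: at each x_i, average the y-values whose x
    lies within `bandwidth` of x_i."""
    fitted = []
    for xi in x:
        vals = [yj for xj, yj in zip(x, y) if abs(xj - xi) <= bandwidth]
        fitted.append(sum(vals) / len(vals))
    return fitted

def backfit(x1, x2, y, n_iter=20):
    """Backfitting for y = alpha + f1(x1) + f2(x2) + noise: repeatedly
    smooth the partial residuals against each covariate in turn,
    centring each component so the decomposition is identifiable."""
    n = len(y)
    alpha = sum(y) / n
    f1 = [0.0] * n
    f2 = [0.0] * n
    for _ in range(n_iter):
        r1 = [y[i] - alpha - f2[i] for i in range(n)]       # partial residuals for f1
        f1 = smooth(x1, r1)
        m1 = sum(f1) / n
        f1 = [v - m1 for v in f1]                           # centre f1
        r2 = [y[i] - alpha - f1[i] for i in range(n)]       # partial residuals for f2
        f2 = smooth(x2, r2)
        m2 = sum(f2) / n
        f2 = [v - m2 for v in f2]                           # centre f2
    return alpha, f1, f2

# demo on synthetic additive data: y = sin(x1) + 0.5 * x2
rng = random.Random(0)
x1 = [rng.uniform(0.0, 3.0) for _ in range(200)]
x2 = [rng.uniform(0.0, 3.0) for _ in range(200)]
y = [math.sin(a) + 0.5 * b for a, b in zip(x1, x2)]
alpha, f1, f2 = backfit(x1, x2, y)
fitted = [alpha + a + b for a, b in zip(f1, f2)]
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
sst = sum((yi - alpha) ** 2 for yi in y)
```

With strongly correlated `x1` and `x2` the cycle above can converge slowly or to a poor decomposition, which is exactly the weakness the least-squares formulation in the talk addresses.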

## 2018 Semester 2

### Friday November 2, 2pm, Carslaw 173

**Ricardo Campello** (University of Newcastle, School of Mathematical and Physical Sciences)

**Title: Non-Parametric Density Estimates for Data Clustering, Visualisation, and Outlier Detection**

**Abstract**

Non-parametric density estimates are a useful tool for tackling different problems in statistical learning and data mining, most notably in the unsupervised and semi-supervised learning scenarios. In this talk, I elaborate on HDBSCAN*, a density-based framework for hierarchical and partitioning clustering, outlier detection, and data visualisation. Since its introduction in 2015, HDBSCAN* has gained increasing attention from both researchers and practitioners in data mining, with computationally efficient third-party implementations already available in major open-source software distributions such as R/CRAN and Python/SciKit-learn, as well as successful real-world applications reported in different fields. I will discuss the core HDBSCAN* algorithm and its interpretation from a non-parametric modelling perspective as well as from the perspective of graph theory. I will also discuss post-processing routines to perform hierarchy simplification, cluster evaluation, optimal cluster selection, visualisation, and outlier detection. Finally, I briefly survey a number of unsupervised and semi-supervised extensions of the HDBSCAN* framework currently under development along with students and collaborators, as well as some topics for future research.

*Prof. Ricardo Campello received his Bachelor degree in Electronics Engineering from the State University of São Paulo, Brazil, in 1994, and his MSc and PhD degrees in Electrical and Computer Engineering from the State University of Campinas, Brazil, in 1997 and 2002, respectively. Among other appointments, he was a Post-doctoral Fellow at the University of Nice, France (fall/winter 2002 - 2003), an Assistant/Associate Professor in computer science at the University of São Paulo, Brazil (2007 - 2016), and a Visiting Professor in computer science at the University of Alberta, Canada (2011 - 2013), where he is currently an Adjunct Professor (since 2017). Between 2016 and 2018 he was a Professor in applied mathematics, College of Science and Engineering, James Cook University (JCU), Australia, where he was co-responsible for the development of a professional Master of Data Science online programme. Currently he holds a position of Adjunct Professor at JCU. He has been a Professor of data science within the discipline of statistics at the University of Newcastle, Australia, since July 2018.*

### Friday October 26, 2pm, Carslaw 173

**Rachel Wang** (University of Sydney, School of Mathematics and Statistics)

**Title: Metropolis-Hastings MCMC with dual mini-batches**

**Abstract**

For many decades Markov chain Monte Carlo (MCMC) methods have been the main workhorse of Bayesian inference. However, traditional MCMC algorithms are computationally intensive. In particular, the Metropolis-Hastings (MH) algorithm requires passing over the entire dataset to evaluate the likelihood ratio in each iteration. We propose a general framework for performing MH-MCMC using two mini-batches (MHDB) of the whole dataset each time and show that this gives rise to approximately a tempered stationary distribution. We prove that MHDB preserves the modes of the original target distribution and derive an error bound on the approximation for a general class of models including mixtures of exponential family distributions, linear binary classification and regression. To further extend the utility of the algorithm to high dimensional settings, we construct a proposal with forward and reverse moves using stochastic gradient and show that the construction leads to reasonable acceptance probabilities. We demonstrate the performance of our algorithm in neural network applications and show that compared with popular optimisation methods, our method is more robust to the choice of learning rate and improves testing accuracy.
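For context, a textbook random-walk Metropolis sampler for the mean of a Gaussian model makes the full-data cost explicit: every iteration re-evaluates the likelihood over the whole dataset, which is the bottleneck the mini-batch scheme targets. This is the standard baseline, not the MHDB algorithm from the talk:

```python
import math
import random

def metropolis_hastings(data, n_steps=5000, step=0.1, seed=0):
    """Random-walk Metropolis for the mean theta of a N(theta, 1) model
    with a flat prior. Note that the full-data log-likelihood is
    evaluated at every iteration -- the per-step cost that mini-batch
    variants aim to avoid."""
    rng = random.Random(seed)

    def loglik(theta):
        return -0.5 * sum((x - theta) ** 2 for x in data)  # full pass over the data

    theta = 0.0
    ll = loglik(theta)
    samples = []
    for _ in range(n_steps):
        prop = theta + rng.gauss(0.0, step)
        ll_prop = loglik(prop)
        # accept with probability min(1, exp(ll_prop - ll))
        if math.log(rng.random()) < ll_prop - ll:
            theta, ll = prop, ll_prop
        samples.append(theta)
    return samples

# demo: 500 observations centred at 3.0
rng = random.Random(1)
data = [3.0 + rng.gauss(0.0, 1.0) for _ in range(500)]
samples = metropolis_hastings(data)
posterior_mean = sum(samples[1000:]) / len(samples[1000:])
```

Each of the 5000 steps touches all 500 observations; replacing `loglik` with a mini-batch estimate is what changes the stationary distribution and motivates the tempering analysis described in the abstract.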

*
Rachel Wang is currently a lecturer and DECRA fellow in the School of Mathematics and Statistics. She received her PhD in Statistics from UC Berkeley in 2015 and subsequently spent two years as a Stein Fellow / Lecturer in the Department of Statistics at Stanford University. Her research interests include statistical network theory, statistical machine learning, and their applications to complex genomic datasets.
*

### Friday October 19, 2pm, Carslaw 173

**
Yuguang Ipsen
**

Australian National University, Research School of Finance, Actuarial Studies & Statistics

## Abstract (click to expand)

**
New Class of Random Discrete Distributions on Infinite Simplex Derived from Negative Binomial Processes
**

The Poisson-Kingman distributions, PK(ρ), on the infinite simplex, can be constructed from a Poisson point process having intensity density ρ or by taking the ranked jumps up to a specified time of a subordinator with Lévy density ρ, as proportions of the subordinator. As a natural extension, we replace the Poisson point process with a negative binomial point process having parameter r > 0 and Lévy density ρ, thereby defining a new class PK^{(r)}(ρ) of distributions on the infinite simplex. The new class contains the two-parameter generalisation PD(α,θ) of Pitman and Yor (1997) when θ > 0. It also contains a class of distributions, PD_α(r), which arises naturally from the trimmed stable subordinator. We derive properties of the new distributions, including the joint density of their size-biased permutation, a stick-breaking representation, the exchangeable partition probability function, and an analogous Ewens sampling formula for PD_α(r).

*
Joint work with Prof. Ross Maller and Dr. Soudabeh Shemehsavar.
*
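For orientation, the classical two-parameter PD(α, θ) family mentioned in the abstract has a well-known stick-breaking representation: weights P_i = V_i ∏_{j<i}(1 − V_j) with independent V_i ~ Beta(1 − α, θ + iα). The sketch below simulates only this classical case (the representation for the new PD_α(r) class is the talk's contribution and is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(4)

def pitman_yor_weights(alpha, theta, n=1_000):
    """First n stick-breaking weights of PD(alpha, theta):
    V_i ~ Beta(1 - alpha, theta + i*alpha), P_i = V_i * prod_{j<i} (1 - V_j)."""
    i = np.arange(1, n + 1)
    v = rng.beta(1.0 - alpha, theta + i * alpha)
    # Stick left over before the i-th break.
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining

w = pitman_yor_weights(0.5, 1.0)
print(round(w.sum(), 3))  # total mass of the first n weights approaches 1 as n grows
```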

### Friday October 5, 2pm, Carslaw 173

**
Steph de Silva
**

PricewaterhouseCoopers

## Abstract (click to expand)

**
The spiral of normalcy: on communicating in the data sciences
**

Communicating technical concepts around data science and statistics is a difficult, underrated, but entirely essential skill for applied work in industry and beyond. Steph will talk about some of the reasons why we, as statisticians, mathematicians, data scientists and the like, find ourselves in this position, what we might do about it, and why it matters.

*
Steph has a Ph.D. in theoretical econometrics. Her first job after leaving academia was with the World Bank, working on survey data from the developing world. Real data came as a tremendous shock after years of simulating her own. She recovered eventually and went on to live a full and varied life.
*

*
These days, Steph is a data scientist working for a major consulting house where she’s living the generalist’s dream. She has no idea what she’ll be doing next week, and she’s absolutely sure she has no answers to the problems they’re going to throw at her when she gets there. Yet.
*

### Friday September 21, 2pm, Carslaw 173

**
Peter Radchenko
**

The University of Sydney Business School, Discipline of Business Analytics

## Abstract (click to expand)

**
Grouped Variable Selection with Discrete Optimization
**

We will discuss a new tractable framework for grouped variable selection with a cardinality constraint on the number of selected groups, leveraging tools in modern mathematical optimization. The proposed methodology covers both the case of high-dimensional linear regression and nonparametric sparse additive modelling. Computational experiments demonstrate the effectiveness of our proposal as an alternative method for sparse grouped variable selection - in terms of better predictive accuracy and greater model sparsity, at the cost of increased, but still reasonable, computation times. Empirical and theoretical evidence shows that the proposed estimators outperform their Group Lasso type counterparts in a wide variety of regimes.

*
Peter Radchenko is an Associate Professor of Business Analytics at the University of Sydney Business School. Prior to joining the University of Sydney in 2017, he held academic positions at the University of Chicago and the University of Southern California. He has a PhD in Statistics from Yale University, and an undergraduate degree in Mathematics and Applied Mathematics, from the Lomonosov Moscow State University. Peter Radchenko's primary research focus is on developing new methodology for dealing with massive and complex modern data. In particular, he has worked extensively in the area of high-dimensional regression, where the number of predictors is large relative to the number of observations.
*

### Friday September 14, 2pm, Carslaw 173

**
Stephan Huckemann
**

University of Göttingen, Institute for Mathematical Stochastics

## Abstract (click to expand)

**
Dirty Central Limit Theorems on Non-Euclidean Spaces
**

For inference on means of random vectors, the central limit theorem (CLT) is a central tool. Fréchet (1948) extended the notion of means to arbitrary metric spaces, as minimizers of expected squared distance. For such means, under mild conditions, a strong law was provided in 1977 by Ziezold and, in the case of manifolds under additional stronger conditions, a CLT was derived by Bhattacharya and Patrangenaru (2005). In a local chart, this CLT features a classical normal limiting distribution with the classical rate of inverse square root of sample size. If these additional stronger conditions are not satisfied, CLTs may still hold but feature different rates and different limit distributions. We give examples of such "dirty" limit theorems featuring faster rates (stickiness) and slower rates (smeariness). The former may occur on NNC (nonnegative curvature) spaces, where the distribution around cut loci of means plays a central role; the latter on NPC (nonpositive curvature) spaces. Both effects may have serious practical implications.

*
Stephan Huckemann received his degree in mathematics from the University of Giessen (Germany) in 1987. He was a visiting lecturer and scholar at the University of Michigan, Ann Arbor (1987 - 1989) and a postdoctoral research fellow at the ETH Zürich, Switzerland (1989 - 1990). He then worked as a commercial software developer (1990 - 2001) and returned to academia as a contributor to the computer algebra system MuPAD at Sciface Software and the University of Paderborn, Germany (2001 - 2003). Working with H. Ziezold (2004 - 2006, Univ. of Kassel), P. Mihailescu (2007, Univ. of Göttingen), P.T. Kim (2009, Univ. of Guelph, Canada) and A. Munk (2007 - 2010, Univ. of Göttingen) he completed his Habilitation at the Univ. of Göttingen (Germany) in 2010 and was awarded a DFG Heisenberg research fellowship. As such he continued at the Institute for Mathematical Stochastics, Univ. of Göttingen, while being a research group leader at the Statistical and Applied Mathematical Sciences Institute (2010/11 SAMSI, Durham, NC, USA). After substituting (2012 - 2013) for the Chair of Stochastics and Applications at the Univ. of Göttingen, he now holds the new Chair for Non-Euclidean Statistics.
*

### Friday September 7, 2pm, Carslaw 173

**
Mark Girolami
**

Department of Mathematics, Imperial College London

## Abstract (click to expand)

**
Markov Transition Operators defined by Hamiltonian Symplectic Flows and Langevin Diffusions on the Riemannian Manifold Structure of Statistical Models
**

The use of Differential Geometry in Statistical Science dates back to the early work of C.R. Rao in the 1940s, when he sought to assess the natural distance between population distributions. The Fisher-Rao metric tensor defined the Riemannian manifold structure of probability measures, and from this local manifold geodesic distances between probability measures could be properly defined. This early work was then taken up by many authors within the statistical sciences, with an emphasis on the study of the efficiency of statistical estimators. The area of Information Geometry has developed substantially and has had major impact in areas of applied statistics such as Machine Learning and Statistical Signal Processing. A different perspective on the Riemannian structure of statistical manifolds can be taken to make breakthroughs in contemporary statistical modelling problems. Langevin diffusions and Hamiltonian dynamics on the manifold of probability measures are defined to obtain Markov transition kernels for Monte Carlo based inference.

*
Mark Girolami holds the Chair of Statistics within the Department of Mathematics at Imperial College London where he is also Professor of Computing Science in the Department of Computing. He is an adjunct Professor of Statistics at the University of Warwick and is Director of the Lloyd’s Register Foundation Programme on Data Centric Engineering at the Alan Turing Institute where he served as one of the original founding Executive Directors. He is an elected member of the Royal Society of Edinburgh and previously was awarded a Royal Society - Wolfson Research Merit Award. Professor Girolami has been an EPSRC Research Fellow continuously since 2007 and in 2018 he was awarded the Royal Academy of Engineering Research Chair in Data Centric Engineering. His research focuses on applications of mathematical and computational statistics.
*

### Friday August 17, 2pm, Carslaw 173

**
Pavel Krivitsky
**

University of Wollongong, School of Mathematics and Applied Statistics

## Abstract (click to expand)

**
Inference for Social Network Models from Egocentrically-Sampled Data
**

Egocentric network sampling observes the network of interest from the point of view of a set of sampled actors, who provide information about themselves and anonymised information on their network neighbours. In survey research, this is often the most practical, and sometimes the only, way to observe certain classes of networks, with the sexual networks that underlie HIV transmission being the archetypal case. Although methods exist for recovering some descriptive network features, there is no rigorous and practical statistical foundation for estimation and inference for network models from such data. We identify a subclass of exponential-family random graph models (ERGMs) amenable to being estimated from egocentrically sampled network data, and apply pseudo-maximum-likelihood estimation to do so and to rigorously quantify the uncertainty of the estimates. For ERGMs parametrised to be invariant to network size, we describe a computationally tractable approach to this problem. We use this methodology to help understand persistent racial disparities in HIV prevalence in the US. Lastly, we discuss how questionnaire design affects what questions can and cannot be answered with this analysis.

This work is joint with Prof Martina Morris (University of Washington).

*
Dr Pavel N. Krivitsky received his PhD in Statistics in 2009 from University of Washington, and has been a Lecturer in Statistics at the University of Wollongong since 2013. His research interests include statistical modelling of social network data and processes for applications in epidemiology and the social sciences, statistical computing, and data privacy. He has contributed to theory and practice of latent variable and of exponential-family random graph models for networks, particularly models for network evolution and for valued relations, understanding effects of changing network size and composition, and estimation of complex network models from difficult (perturbed, egocentrically-sampled, etc.) data. He develops and maintains a number of popular R packages for social network analysis.
*

### Friday August 10, 2pm, Carslaw 173

**
Yuya Sasaki
**

Vanderbilt University, Department of Economics

## Abstract (click to expand)

**
Inference for Moments of Ratios with Robustness against Large Trimming Bias and Unknown Convergence Rate
**

We consider statistical inference for moments of the form E[B/A]. A naive sample mean is unstable when the denominator, A, can be close to zero. This paper develops a method of robust inference and proposes a practical, data-driven choice for trimming observations with small A. Our sense of robustness is twofold. First, bias correction allows for robustness against large trimming bias. Second, adaptive inference allows for robustness against an unknown convergence rate. The proposed method allows for closer-to-optimal trimming and more informative inference results in practice. This practical advantage is demonstrated for inverse propensity score weighting through simulation studies and real data analysis.
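The instability, and the stabilising effect of a fixed trimming threshold, can be seen in a toy simulation (an illustration only; the talk's contributions, the bias correction for the trimming bias and the data-driven, rate-adaptive choice of the threshold, are omitted here):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
a = rng.uniform(0.0, 1.0, size=n)            # denominator, can be near zero
b = 2.0 * a + rng.normal(scale=0.1, size=n)  # numerator; B/A centres on 2

naive = np.mean(b / a)                       # destabilised by tiny values of a

h = 0.01                                     # fixed (non-adaptive) threshold
keep = a > h
trimmed = np.mean(b[keep] / a[keep])         # stable, at the cost of some bias

print(round(trimmed, 2))
```

Observations with a below h contribute terms like eps/a with enormous variance, which is why the naive mean can be arbitrarily far off while the trimmed mean concentrates near 2.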

*
Yuya Sasaki is an Associate Professor of Economics at Vanderbilt University. He received his bachelor’s and master’s degrees at Utah State University with majors in economics, geography, and mathematics, and his Ph.D. in economics from Brown University. He was an assistant professor of economics at Johns Hopkins University before moving to Vanderbilt University as an associate professor. His field of specialization is econometrics. He is currently an associate editor of the Journal of Econometric Methods.
*

### Friday July 20, 2pm, Carslaw 829

**
Hongyuan Cao
**

University of Missouri, Department of Statistics

## Abstract (click to expand)

**
Statistical Methods for Integrative Analysis of Multi-Omics Data
**

Genome-wide complex trait analysis (GCTA) was developed for and applied to heritability analyses of complex traits, and more recently extended to mental disorders. However, besides the intensive computation, the previous literature also limits its scope to a univariate phenotype, which ignores mutually informative but partially independent pieces of information provided by other phenotypes. Our goal is to use such auxiliary information to improve power. We show that the proposed method leads to a large power increase while controlling the false discovery rate, both empirically and theoretically. Extensive simulations demonstrate the advantage of the proposed method over several state-of-the-art methods. We illustrate our methods on a dataset from a schizophrenia study.

*
Dr. Cao is an assistant professor of statistics at the University of Missouri-Columbia. She received her Ph.D. in statistics from UNC-Chapel Hill in 2010. She has published over 20 papers, several of them in top statistics journals such as Biometrika, the Journal of the American Statistical Association and the Journal of the Royal Statistical Society, Series B. She serves as an associate editor of Biometrics. Her research interests include high-dimensional and large-scale statistical analysis, survival analysis, longitudinal data analysis and bioinformatics.
*

### Friday July 13, 2pm, Carslaw 829

**
Johann Gagnon-Bartsch
**

University of Michigan, Department of Statistics

## Abstract (click to expand)

**
The LOOP Estimator: Adjusting for Covariates in Randomized Experiments
**

When conducting a randomized controlled trial, it is common to specify in advance, as part of the trial protocol, the statistical analyses that will be used to analyze the data. Typically these analyses will involve adjusting for small imbalances in baseline covariates. However, this poses a dilemma, since adjusting for too many covariates can hurt precision more than it helps, and it is often unclear which covariates are predictive of outcome prior to conducting the experiment. For example, both post-stratification and OLS regression adjustments can actually increase variance (relative to a simple difference in means) if too many covariates are used. OLS is also biased under the Neyman-Rubin model. Here we introduce the LOOP ("Leave-One-Out Potential outcomes") estimator of the average treatment effect. We leave out each observation and then impute that observation's treatment and control potential outcomes using a prediction algorithm, such as a random forest. This estimator is exactly unbiased under the Neyman-Rubin model, generally performs at least as well as the unadjusted estimator, and the experimental randomization largely justifies the statistical assumptions made. Importantly, the LOOP estimator also enables us to take advantage of automatic variable selection, and thus eliminates the guesswork of selecting covariates prior to conducting the trial.
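A minimal version of the leave-one-out idea can be sketched as follows. This is a hedged illustration, not the talk's exact estimator: the leave-one-out predictor is a within-arm mean rather than a random forest, and the combination used is a standard unbiased AIPW-style form with known treatment probability p = 1/2, which is unbiased for the same reason the LOOP construction is (the imputations for unit i do not use unit i's treatment assignment):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
x = rng.normal(size=n)                   # a baseline covariate
tau = 1.0                                # true average treatment effect
t = rng.integers(0, 2, size=n)           # Bernoulli(1/2) randomisation
y = 0.5 * x + tau * t + rng.normal(scale=0.3, size=n)

def loo_predict(i, arm):
    """Leave-one-out potential-outcome prediction for unit i:
    the mean outcome of the *other* units in `arm` (a stand-in
    for the random forest mentioned in the abstract)."""
    mask = (t == arm)
    mask[i] = False
    return y[mask].mean()

# AIPW-style unbiased combination with known p = 1/2:
# tau_i = (t_hat - c_hat) + 2*T_i*(Y_i - t_hat) - 2*(1 - T_i)*(Y_i - c_hat)
contribs = []
for i in range(n):
    t_hat = loo_predict(i, 1)            # imputed treated outcome for unit i
    c_hat = loo_predict(i, 0)            # imputed control outcome for unit i
    contribs.append((t_hat - c_hat)
                    + 2 * t[i] * (y[i] - t_hat)
                    - 2 * (1 - t[i]) * (y[i] - c_hat))
tau_hat = np.mean(contribs)
print(round(tau_hat, 2))
```

A better leave-one-out predictor (e.g. one that actually uses x) reduces the variance of each contribution without affecting unbiasedness, which is the point of plugging in a flexible algorithm such as a random forest.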

*
Johann Gagnon-Bartsch is an Assistant Professor of Statistics in the Department of Statistics at the University of Michigan. Gagnon-Bartsch received his bachelor’s degree from Stanford University with majors in Math, Physics, and International Relations. He completed a PhD at Berkeley in Statistics, and then spent three more years as a visiting assistant professor in the Berkeley Statistics department. Gagnon-Bartsch’s research focuses on causal inference, machine learning, and nonparametric methods with applications in the biological and social sciences.
*

## 2018 Semester 1

### Friday June 8, 2pm, Carslaw 173

**
Janice Scealy
**

Australian National University, Research School of Finance, Actuarial Studies & Statistics

## Abstract (click to expand)

**
Scaled von Mises-Fisher distributions and regression models for palaeomagnetic directional data
**

We propose a new distribution for analysing palaeomagnetic directional data that is a novel transformation of the von Mises-Fisher distribution. The new distribution has ellipse-like symmetry, as does the Kent distribution; however, unlike the Kent distribution the normalising constant in the new density is easy to compute and estimation of the shape parameters is straightforward. To accommodate outliers, the model also incorporates an additional shape parameter which controls the tail-weight of the distribution. We also develop a general regression model framework that allows both the mean direction and the shape parameters of the error distribution to depend on covariates. To illustrate, we analyse palaeomagnetic directional data from the GEOMAGIA50.v3 database. We predict the mean direction at various geological time points and show that there is significant heteroscedasticity present. It is envisaged that the regression structures and error distribution proposed here will also prove useful when covariate information is available with (i) other types of directional response data; and (ii) square-root transformed compositional data of general dimension. This is joint work with Andrew T. A. Wood.

*
Dr Janice Scealy is a senior lecturer in statistics in the Research School of Finance, Actuarial Studies and Statistics, ANU and she is currently an ARC DECRA fellow. Her research interests include developing new statistical analysis methods for data with complicated constraints, including compositional data defined on the simplex, spherical data, directional data and manifold-valued data defined on more general curved surfaces.
*

### Friday June 1, 2pm, Carslaw 173

**
Subhash Bagui
**

University of West Florida, Department of Mathematics and Statistics

## Abstract (click to expand)

**
Convergence of Known Distributions to Normality or Non-normality: An Elementary Ratio Technique
**

This talk presents an elementary, informal technique for deriving the convergence of known discrete or continuous distributions to limiting normal or non-normal distributions. The technique utilizes the ratio of the pmf/pdf at hand at two consecutive or nearby points. The presentation should be of interest to teachers and students of first-year graduate-level courses in probability and statistics.
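As a flavour of the ratio technique (a standard illustrative example, assumed here rather than taken from the talk), consider the Poisson(λ) pmf p(k) = e^{-λ} λ^k / k!:

```latex
% Ratio of the Poisson pmf at consecutive points:
\[
\frac{p(k+1)}{p(k)} = \frac{\lambda}{k+1}.
\]
% Centre and scale: set k = \lambda + x\sqrt{\lambda}, so that a unit step
% in k is a step \Delta x = 1/\sqrt{\lambda} in x. For large \lambda,
\[
\log\frac{p(k+1)}{p(k)}
  = \log\frac{\lambda}{\lambda + x\sqrt{\lambda} + 1}
  \approx -\frac{x}{\sqrt{\lambda}}
  = -x\,\Delta x .
\]
% Summing these increments from the mode out to x turns the sum into the
% integral -\int_0^x u\,du = -x^2/2, so that
\[
p\bigl(\lambda + x\sqrt{\lambda}\bigr) \propto e^{-x^2/2},
\]
% which is the standard normal shape, recovering the normal limit of the
% standardised Poisson as \lambda \to \infty.
```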

*
Subhash C. Bagui received his B.Sc. in Statistics from the University of Calcutta, M.Stat. from the Indian Statistical Institute, and Ph.D. from the University of Alberta, Canada. He is currently a University Distinguished Professor at the University of West Florida. He has authored a book titled "Handbook of Percentiles of Non-central t-distribution" and published many high-quality peer-reviewed journal articles. He currently serves as an associate editor or editorial board member of several statistics journals. His research interests include nonparametric classification and clustering, statistical pattern recognition, machine learning, the central limit theorem, and experimental designs. He is also a fellow of the American Statistical Association (ASA) and the Royal Statistical Society (RSS).
*

### Friday May 18, 2pm, Room TBA

**
Honours talks
**

University of Sydney, School of Mathematics and Statistics

### Friday May 4, 2pm, Carslaw 173

**
Nicholas Fisher
**

University of Sydney and ValueMetrics Australia

## Abstract (click to expand)

**
The Good, the Bad, and the Horrible: Interpreting Net-Promoter Score and the Safety Attitudes Questionnaire in the light of good market research practice
**

Net-Promoter Score (NPS) is a ubiquitous, easily-collected market research metric, having displaced many complete market research processes. Unfortunately, this has been its sole success. It possesses few, if any, of the characteristics that might be regarded as highly desirable in a high-level market research metric; on the contrary, it’s done considerable damage to companies, to their shareholders and to their customers. Given the current focus on the financial services sector and its systemic failures in delivering value to customers, it is high time to question reliance on NPS.

The Safety Attitudes Questionnaire is an instrument for assessing Safety Culture in the workplace, and is similarly wide-spread throughout industries where Safety is a critical issue. It has now been adapted to assess other forms of culture, such as Risk Culture. Unfortunately, it is also highly flawed, albeit for quite different reasons.

Examining these two methodologies through the lens of good market research practice brings their fundamental flaws into focus.

*
Nick Fisher has an honorary position as Visiting Professor of Statistics at the University of Sydney, and runs his own R&D consultancy specialising in Performance Measurement. Prior to taking up these positions in 2001, he was a Chief Research Scientist in CSIRO Mathematical and Information Sciences.
*

### Friday March 23, 2pm, Carslaw 173

**
Mikaela Jorgensen
**

Australian Institute of Health Innovation, Macquarie University, Sydney, Australia

## Abstract (click to expand)

**
Using routinely collected data in aged care research: a grey area
**

When the Department of Health launched the My Aged Care website in 2013 they “severely under-estimated the proportion of enquiries and referrals they would receive by fax”. Yes, that's fax machines in *2013*. However, electronic data systems are increasingly starting to be used in aged care.

This presentation will discuss the joys of using messy routinely collected datasets to examine the care and outcomes of people using aged care services.

Does pressure injury incidence differ between residential aged care facilities? Is home care service use associated with time to entry into residential aged care? These questions, and more, will be discussed.

We'll take a dive into some multilevel mixed effects models, and resurface with some risk-adjusted funnel plots. People from all backgrounds with an interest in data analysis welcome.

*
Dr Mikaela Jorgensen is a health services researcher at the Australian Institute of Health Innovation, Macquarie University. She has followed the traditional career pathway from speech pathologist to analyst of linked routinely collected health datasets for the last five years.
*

### Friday March 16, 2pm, Carslaw 173

**
Jake Olivier
**

School of Mathematics and Statistics, UNSW, Sydney, Australia

## Abstract (click to expand)

**
The importance of statistics in shaping public policy
**

Statisticians have an important role to play in shaping public policy. Public discourse can often be divisive and emotive, and it can be difficult for the uninitiated to sift through the morass of "fake" and "real" news. Decisions need to be well-informed and statisticians should be leaders in identifying relevant data for a testable hypothesis using appropriate methodology. It is also important for a statistician to identify when the data or methods used are not up to the task or when there is too much uncertainty to make accurate decisions. I will discuss some examples from my own research. This includes, in ascending order of controversy, graduated licensing schemes, claims made by Australian politicians, gun control, and bicycle helmet laws. I will also discuss some methodological challenges in evaluating interventions including regression to the mean for Poisson processes.

*
Associate Professor Jake Olivier is a member of the School of Mathematics and Statistics at UNSW Sydney. He is originally from New Orleans and spent many years living in Mississippi for graduate school and early academic appointments. A/Prof Olivier is the incoming president of the NSW Branch of the Statistical Society of Australia and immediate past chair of the Biostatistics Section. He serves on the editorial boards of BMJ Open, PLOS ONE, Cogent Medicine and the Journal of the Australasian College of Road Safety. His research interests are cycling safety, the analysis of categorical data and methods for evaluating public health interventions.
*

### Friday March 9

**
Tim Swartz
**

Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC Canada

## Abstract (click to expand)

**
A buffet of problems in sports analytics
**

This talk explores some work that I have done and some topics that you may find interesting with respect to statistics in sport. I plan on discussing a number of problems with almost no discussion of technical details. Some of the sports include hockey, cricket, highland dance, soccer and golf.

### Friday February 9

**
Maria-Pia Victoria-Feser
**

Research Center for Statistics, Geneva School of Economics and Management, University of Geneva

## Abstract (click to expand)

**
A prediction divergence criterion for model selection and classification in high dimensional settings
**

A new class of model selection criteria is proposed that is suited to stepwise approaches or can be used as selection criteria in penalized-estimation-based methods. This new class, called the d-class of error measure, generalizes Efron's q-class. The class not only contains classical criteria such as Mallows' Cp or the AIC, but also enables one to define new, more general criteria. Within this new class, we propose a model selection criterion based on a prediction divergence between two nested models' predictions, which we call the Prediction Divergence Criterion (PDC). The PDC provides a different measure of prediction error than a criterion associated with each potential model within a sequence, for which the selection decision is based on the sign of differences between the criteria: the PDC directly measures the prediction error divergence between two nested models. As examples, we consider linear regression models and (supervised) classification. We show that a selection procedure based on the PDC, compared to the Cp (in the linear case), has a smaller probability of overfitting, hence leading to parsimonious models for the same out-of-sample prediction error. The PDC is particularly well suited to high-dimensional and sparse situations and also under (small) model misspecification. Examples on a malnutrition study and on acute leukemia classification will be presented.

## 2017 Semester 2

### 8th of December

**
Richard Hunt
**

University of Sydney

## Abstract (click to expand)

Location: Carslaw 173

**
A New Look at Gegenbauer Long Memory Processes
**

In this presentation we will look at Long Memory and Gegenbauer Long Memory processes, and methods for estimation of the parameters of these models. After a review of the history of the development of these processes, and some of the personalities involved, we will introduce a new method for the estimation of almost all the parameters of a k-factor Gegenbauer/GARMA process. The method essentially attempts to find parameters for the spectral density to ensure it most closely matches the (smoothed) periodogram. Simulations indicate that the new method has a similar level of accuracy to existing methods (Whittle, Conditional Sum-of-squares), but can be evaluated considerably faster, whilst making few distributional assumptions on the data.

### 24th of November

**
Prof. Sally Cripps
**

University of Sydney

## Abstract (click to expand)

Location: Carslaw 173

**
A spatio-temporal mixture model for Australian daily rainfall, 1876--2015:
Modeling daily rainfall over the Australian continent
**

Daily precipitation has an enormous impact on human activity, and the study of how it varies over time and space, and what global indicators influence it, is of paramount importance to Australian agriculture. The topic is complex and would benefit from a common and publicly available statistical framework that scales to large data sets. We propose a general Bayesian spatio-temporal mixture model accommodating mixed discrete-continuous data. Our analysis uses over 294 million daily rainfall measurements since 1876, spanning 17,606 rainfall measurement sites. The size of the data calls for a parsimonious yet flexible model as well as computationally efficient methods for performing the statistical inference. Parsimony is achieved by encoding spatial, temporal and climatic variation entirely within a mixture model whose mixing weights depend on covariates. Computational efficiency is achieved by constructing a Markov chain Monte Carlo sampler that runs in parallel in a distributed computing framework. We present examples of posterior inference on short-term daily component classification, monthly intensity levels, offsite prediction of the effects of climate drivers and long-term rainfall trends across the entire continent. Computer code implementing the methods proposed in this paper is available as an R package.

### 22nd of November

**
Charles Gray
**

La Trobe

## Abstract (click to expand)

Location: Carslaw 173

**
The Curious Case of the Disappearing Coverage: a detective story in visualisation
**

Do you identify as a member of the ggplot cohort of statisticians? Did you or your students learn statistics in the era of visualisation tools such as R's ggplot package? Would it have made a difference to how you engaged with statistical theory? In this talk, I'll reflect on learning statistics at the same time as visualisation, at the half-way point in my doctoral studies. I'll share how we solved some counterintuitive coverage probability simulation results through visualisation. I see this as an opportunity to generate discussion and learn from you: questions, comments, and a generally rowdy atmosphere are most welcome.
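As an example of the kind of coverage probability simulation involved (a generic textbook illustration assumed here, not the specific setting of the talk), the empirical coverage of the standard Wald interval for a binomial proportion can fall well below its nominal 95%:

```python
import numpy as np

rng = np.random.default_rng(3)

def wald_coverage(p, n, reps=20_000, z=1.96):
    """Empirical coverage of the Wald interval phat +/- z*sqrt(phat(1-phat)/n)."""
    x = rng.binomial(n, p, size=reps)
    phat = x / n
    se = np.sqrt(phat * (1 - phat) / n)
    covered = (phat - z * se <= p) & (p <= phat + z * se)
    return covered.mean()

# With a small p and modest n the interval badly undercovers,
# well below the nominal 0.95.
print(round(wald_coverage(0.05, 30), 3))
```

Plotting coverage as a function of p (rather than staring at one number) is exactly where visualisation tools like ggplot earn their keep.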

Date: 20th of October, 2017

Time: 1.15-3pm

Location: Access Grid Room

*Interview seminar*

Date: 13th of October, 2017

Time: 10-12pm

Location: Carslaw 535

*Interview seminar*

Date: 13th of October, 2017

No seminar (Honours presentations date)


### 6th of October, 2017

**
Kim-Anh Lê Cao
**

University of Melbourne

## Abstract (click to expand)

Location: Carslaw 173

**
Challenges in microbiome data analysis (also known as "poop analyses")
**

Our recent breakthroughs and advances in culture-independent techniques, such as shotgun metagenomics and 16S rRNA amplicon sequencing, have dramatically changed the way we can examine microbial communities. But does the hype around the microbiome outweigh the potential of our understanding of this ‘second genome’? There are many hurdles to tackle before we are able to identify and compare the bacteria driving changes in their ecosystem. In addition to the bioinformatics challenges, current statistical methods are limited in making sense of these complex data, which are inherently sparse, compositional and multivariate.

I will discuss some of the topical challenges in 16S data analysis, including the presence of confounding variables and batch effects, some experimental design considerations, and share my own personal story on how a team of rogue statisticians conducted their own mice microbiome experiment leading to somewhat surprising results! I will also present our latest analyses to identify multivariate microbial signatures in immune-mediated diseases and discuss what are the next analytical challenges I envision.

This presentation will combine the results of exciting and highly collaborative work between a team of eager data analysts, immunologists and microbiologists. For once, the speaker will abstain from talking about data integration or mixOmics (oops! but if you are interested, keep an eye out in PLOS Comp Biol).

*
Dr Kim-Anh Lê Cao (NHMRC Career Development Fellow, Senior Lecturer) recently joined the University of Melbourne (Centre for Systems Genomics and School of Mathematics and Statistics). She was awarded her PhD from the Université de Toulouse, France, and moved to Australia as a postdoctoral research fellow at the Institute for Molecular Bioscience, University of Queensland. She then worked as a researcher and consultant at QFAB Bioinformatics, where she developed a multidisciplinary approach to her research. Between 2014 and 2017 she led a computational biostatistics group at the UQ Diamantina Institute, a biomedical research institute. Dr Lê Cao is an expert in multivariate statistical methods and novel developments. Since 2009, her team has been developing the R toolkit mixOmics, dedicated to the integrative analysis of `omics' data, to help researchers mine and make sense of biological data (http://www.mixOmics.org).
*

### 22nd of September, 2017

**
Speaker: Sharon Lee
**

University of Queensland

## Abstract (click to expand)

Location: Carslaw 173

**
Clustering and classification of batch data
**

Motivated by the analysis of batch cytometric data, we consider the problem of jointly modelling and clustering multiple heterogeneous data samples. Traditional mixture models cannot be applied directly to these data. Intuitive approaches such as pooling and post-hoc cluster matching fail to account for the variations between the samples. In this talk, we consider a hierarchical mixture model approach to handle inter-sample variations. The adoption of a skew mixture model with random-effects terms for the location parameter allows for the simultaneous clustering and matching of clusters across the samples. In the case where data from multiple classes of objects are available, this approach can be further extended to perform classification of new samples into one of the predefined classes. Examples with real cytometry data will be given to illustrate this approach.

### 15th of September, 2017

**
Speaker: Emi Tanaka
**

University of Sydney

## Abstract (click to expand)

Location: Carslaw 173

**
Outlier detection for a complex linear mixed model: an application to plant breeding trials
**

Outlier detection is an important preliminary step in data analysis, often conducted through a form of residual analysis. Complex data, such as those analysed by linear mixed models, give rise to distinct levels of residuals and thus offer additional challenges for the development of an outlier detection method. Plant breeding trials are routinely conducted over years and multiple locations with the aim of selecting the best genotypes as parents or for commercial release. These so-called multi-environmental trials (MET) are commonly analysed using linear mixed models, which may include cubic splines and autoregressive processes to account for spatial trends. We consider some statistics derived from the mean and variance shift outlier models (MSOM/VSOM) and the generalised Cook's distance (GCD) for outlier detection. We present a simulation study based on a set of real wheat yield trials.

### 11th of August, 2017

**
Speaker: Ming Yuan
**

University of Wisconsin-Madison

## Abstract (click to expand)

Location: Carslaw 173

**
Quantitation in Colocalization Analysis: Beyond "Red + Yellow = Green"
**

“I see yellow; therefore, there is colocalization.” Is it really so simple when it comes to colocalization studies? Unfortunately, and fortunately, no. Colocalization is in fact a supremely powerful technique for scientists who want to take full advantage of what optical microscopy has to offer: quantitative, correlative information together with spatial resolution. Yet, methods for colocalization have been put into doubt now that images are no longer considered simple visual representations. Colocalization studies have notoriously been subject to misinterpretation due to difficulties in robust quantification and, more importantly, reproducibility, which results in a constant source of confusion, frustration, and error. In this talk, I will share some of our effort and progress to ease such challenges using novel statistical and computational tools.

*
Ming Yuan is a Senior Investigator at the Morgridge Institute for Research and Professor of Statistics at Columbia University and the University of Wisconsin-Madison. He was previously the Coca-Cola Junior Professor in the H. Milton Stewart School of Industrial and Systems Engineering at the Georgia Institute of Technology. He received his Ph.D. in Statistics and M.S. in Computer Science from the University of Wisconsin-Madison. His main research interests lie in the theory, methods and applications of data mining and statistical learning. Dr. Yuan has served on the editorial boards of various top journals, including The Annals of Statistics, Bernoulli, Biometrics, Electronic Journal of Statistics, Journal of the American Statistical Association, Journal of the Royal Statistical Society Series B, and Statistical Science. Dr. Yuan was awarded the John van Ryzin Award by ENAR in 2004, a CAREER Award by the NSF in 2009, and the Guy Medal in Bronze from the Royal Statistical Society in 2014. He was also named a Fellow of the IMS in 2015, and an IMS Medallion Lecturer.
*

### 13th of July, 2017

**
Speaker: Irene Gijbels
**

University of Leuven (KU Leuven)

## Abstract (click to expand)

Location: AGR Carslaw 829

**
Robust estimation and variable selection in linear regression
**

In this talk the interest is in robust procedures for selecting variables in a multiple linear regression context. Throughout the talk the focus is on how to adapt the nonnegative garrote selection method to obtain a robust variable selection method. We establish estimation and variable selection consistency properties of the developed method, and discuss robustness properties such as the breakdown point and the influence function. In the second part of the talk the focus is on heteroscedastic linear regression models, in which one also wants to select the variables that influence the variance part. Methods for robust estimation and variable selection are discussed, and illustrations of their influence functions are provided. Throughout the talk examples are given to illustrate the practical use of the methods.

### 30th of June, 2017

**
Speaker: Ines Wilms
**

University of Leuven (KU Leuven)

## Abstract (click to expand)

Location: AGR Carslaw 829

**
Sparse cointegration
**

Cointegration analysis is used to estimate the long-run equilibrium relations between several time series. The coefficients of these long-run equilibrium relations are the cointegrating vectors. We provide a sparse estimator of the cointegrating vectors. Sparsity means that some elements of the cointegrating vectors are estimated as exactly zero. The sparse estimator is applicable in high-dimensional settings, where the time series length is short relative to the number of time series. Our method achieves better estimation accuracy than the traditional Johansen method in sparse and/or high-dimensional settings. We use the sparse method for interest rate growth forecasting and consumption growth forecasting. We show that forecast performance can be improved by sparsely estimating the cointegrating vectors.

(Joint work with Christophe Croux.)

### 19th of May, 2017

**
Speaker: Dianne Cook
**

Monash University

## Abstract (click to expand)

**
The glue that binds statistical inference, tidy data, grammar of graphics, data visualisation and visual inference
**

Buja et al. (2009) and Majumder et al. (2012) established and validated protocols that place data plots into the statistical inference framework. This, combined with the grammar of graphics initiated by Wilkinson (1999) and refined and popularised in the R package ggplot2 (Wickham, 2016), builds plots using a functional language. The tidy data concepts made popular by the R packages tidyr (Wickham, 2017) and dplyr (Wickham and Francois, 2016) complete the mapping from random variables to plot elements.

Visualisation plays a large role in data science today. It is important for exploring data and detecting unanticipated structure. Visual inference provides the opportunity to assess discovered structure rigorously, using p-values computed by crowd-sourcing lineups of plots. Visualisation is also important for communicating results, and we often agonise over different choices in plot design to arrive at a final display. Treating plots as statistics, we can make power calculations to objectively determine the best design.

This talk will be interactive. Email your favourite plot to dicook@monash.edu ahead of time. We will work in groups to break the plot down in terms of the grammar, relate this to random variables using tidy data concepts, determine the intended null hypothesis underlying the visualisation, and hence structure it as a hypothesis test. Bring your laptop, so we can collaboratively do this exercise.

Joint work with Heike Hofmann, Mahbubul Majumder and Hadley Wickham

### 5th of May, 2017

**
Speaker: Peter Straka
**

University of New South Wales

## Abstract (click to expand)

**
Extremes of events with heavy-tailed inter-arrival times
**

Heavy-tailed inter-arrival times are a signature of "bursty" dynamics, and have been observed in financial time series, earthquakes, solar flares and neuron spike trains. We propose to model extremes of such time series via a "Max-Renewal process" (aka "Continuous Time Random Maxima process"). Due to geometric sum-stability, the inter-arrival times between extremes are attracted to a Mittag-Leffler distribution: As the threshold height increases, the Mittag-Leffler shape parameter stays constant, while the scale parameter grows like a power-law. Although the renewal assumption is debatable, this theoretical result is observed for many datasets. We discuss approaches to fit model parameters and assess uncertainty due to threshold selection.

### 28th of April, 2017

**
Speaker: Botond Szabo
**

Leiden University

## Abstract (click to expand)

**
An asymptotic analysis of nonparametric distributed methods
**

In recent years, in certain applications, datasets have become so large that it is infeasible, or computationally undesirable, to carry out the analysis on a single machine. This has given rise to divide-and-conquer algorithms, where the data are distributed over several `local' machines and the computations are done on these machines in parallel. The outcomes of the local computations are then aggregated into a global result on a central machine. Over the years various divide-and-conquer algorithms have been proposed, many of them with limited theoretical underpinning. First we compare the theoretical properties of a (non-exhaustive) list of proposed methods on the benchmark nonparametric signal-in-white-noise model. Most of the investigated algorithms use information on aspects of the underlying true signal (for instance its regularity), which is usually not available in practice. A central question is whether one can tune the algorithms in a data-driven way, without using any additional knowledge about the signal. We show that a list of standard data-driven techniques (both Bayesian and frequentist) cannot recover the underlying signal at the minimax rate. This, however, does not imply the non-existence of an adaptive distributed method. To address the theoretical limitations of data-driven divide-and-conquer algorithms, we consider a setting where the amount of information sent between the local and central machines is expensive and limited. We show that it is not possible to construct data-driven methods that adapt to the unknown regularity of the underlying signal and at the same time communicate the optimal amount of information between the machines. This is joint work with Harry van Zanten.

*
Botond Szabo is an Assistant Professor at Leiden University, The Netherlands. He received his PhD in Mathematical Statistics from the Eindhoven University of Technology, the Netherlands, in 2014 under the supervision of Prof. dr. Harry van Zanten and Prof. dr. Aad van der Vaart.
His research interests cover nonparametric Bayesian statistics, adaptation, asymptotic statistics, operations research and graph theory. He was runner-up for the Savage Award (Theory and Methods) for the best PhD dissertation in Bayesian statistics and econometrics, and received the "Van Zwet Award" for the best PhD dissertation in statistics and operations research in the Netherlands in 2015. He is an Associate Editor of Bayesian Analysis. You can find more about him at http://math.bme.hu/~bszabo/index_en.html.
*

### 7th of April, 2017

**
Speaker: John Ormerod
**

University of Sydney

## Abstract (click to expand)

**
Bayesian hypothesis tests with diffuse priors: Can we have our cake and eat it too?
**

We introduce a new class of priors for Bayesian hypothesis testing, which we name "cake priors". These priors circumvent Bartlett's paradox (also called the Jeffreys-Lindley paradox): the problem of diffuse priors leading to nonsensical statistical inferences. Cake priors allow the use of diffuse priors (having one's cake) while achieving theoretically justified inferences (eating it too). Lindley's paradox will also be discussed. A novel construct involving a hypothetical data-model pair will be used to extend cake priors to handle the case where there are zero free parameters under the null hypothesis. The resulting test statistics take the form of a penalized likelihood ratio test statistic. By considering the sampling distribution under the null and alternative hypotheses, we show (under certain assumptions) that these Bayesian hypothesis tests are strongly Chernoff-consistent, i.e., achieve zero type I and II errors asymptotically. This sharply contrasts with classical tests, where the level of the test is held constant, so they are not Chernoff-consistent.

Joint work with: Michael Stewart, Weichang Yu, and Sarah Romanes.

### 7th of April, 2017

**
Speaker: Shige Peng
**

Shandong University

## Abstract (click to expand)

**
Data-based Quantitative Analysis under Nonlinear Expectations
**

Traditionally, a real-life random sample is often treated as measurements resulting from an i.i.d. sequence of random variables or, more generally, as an outcome of either linear or nonlinear regression models driven by an i.i.d. sequence. In many situations, however, this standard modeling approach fails to address the complexity of real-life random data. We argue that it is necessary to take into account the uncertainty hidden inside random sequences that are observed in practice.

To deal with this issue, we introduce a robust nonlinear expectation to quantitatively measure and calculate this type of uncertainty. The corresponding fundamental concept of a `nonlinear i.i.d. sequence' is used to model a large variety of real-world random phenomena. We give a robust and simple algorithm, called `phi-max-mean,' which can be used to measure such types of uncertainty, and we show that it provides an asymptotically optimal unbiased estimator of the corresponding nonlinear distribution.

### 17th of March, 2017

**
Speaker: Joe Neeman
**

University of Texas Austin

## Abstract (click to expand)

**
Gaussian vectors, half-spaces, and convexity
**

Let \(A\) be a subset of \(R^n\) and let \(B\) be a half-space with the same Gaussian measure as \(A\). For a pair of correlated Gaussian vectors \(X\) and \(Y\), \(\mathrm{Pr}(X \in A, Y \in A)\) is smaller than \(\mathrm{Pr}(X \in B, Y \in B)\); this was originally proved by Borell, who also showed various other extremal properties of half-spaces. For example, the exit time of an Ornstein-Uhlenbeck process from \(A\) is stochastically dominated by its exit time from \(B\).

We will discuss these (and other) inequalities using a kind of modified convexity.

### 3rd of March, 2017

**
Speaker: Ron Shamir
**

Tel Aviv University

## Abstract (click to expand)

**
Modularity, classification and networks in analysis of big biomedical data
**

Supervised and unsupervised methods have been used extensively to analyze genomics data, with mixed results. On one hand, new insights have led to new biological findings. On the other hand, analysis results have often not been robust. Here we take a look at several such challenges from the perspectives of networks and big data. Specifically, we ask if and how the added information from a biological network helps with these challenges. We show examples where the network's added information is invaluable, and others where it is questionable. We also show that by collectively analyzing omic data across multiple studies of many diseases, robustness greatly improves.

### 31st of January, 2017

**
Speaker: Genevera Allen
**

Rice University

## Abstract (click to expand)

**
Networks for Big Biomedical data
**

Cancer and neurological diseases are among the top five causes of death in Australia. The good news is new Big Data technologies may hold the key to understanding causes and possible cures for cancer as well as understanding the complexities of the human brain.

Join us on a voyage of discovery as we highlight how data science is transforming medical research: networks can be used to visualise and mine the big biomedical data behind cancer and neurological diseases. Using real case studies, see how cutting-edge data science is bringing us closer than ever before to major medical breakthroughs.

### Information for visitors

Enquiries about the Statistics Seminar should be directed to the organiser Munir Hiabu.