# University of Sydney Statistics Seminar Series

Unless otherwise specified, seminars are held on Fridays at 2pm in Carslaw 173.

## 2018 Semester 2

### Friday November 23, 2pm, Carslaw 173

Munir Hiabu
University of Sydney, School of Mathematics and Statistics

Structured survival models with least squares backfitting

Linear models are great. They are very well understood, estimators can be quickly calculated via matrix algebra, and results have nice interpretations.

But there is one catch: They do not account for non-linearity.

The much-hyped solution today for dealing with incredibly complex data is deep learning. Deep learning models provide the flexibility to fit functions of very general shape.

But the downside is twofold: firstly, they need an incredible amount of replication to achieve an accurate fit and, secondly, interpretation of the entering covariates is lost.

A middle ground solution is structured models.

Structured models are not as flexible as deep learning, but they still allow for enough non-linearity to be captured. Additionally, interpretation of the entering covariates is not lost. See the Wikipedia article on the most prominent representative, (generalized) additive models: https://en.wikipedia.org/wiki/Additive_model.

These models are most often fitted via the backfitting algorithm proposed by Buja, Hastie & Tibshirani (1989, Ann. Stat.). However, their backfitting algorithm is not derived from a proper optimisation criterion or loss function and is heuristic in nature. One major weakness that follows is that it has problems with correlated covariates and multicollinearity.

In this talk I will propose least-squares motivated estimators which do not suffer this problem.

The least squares treatment will lead to integral equations of the second kind, allowing for a nice mathematical treatment. I will mostly focus on survival models, where interpretation of the data is often key.

### Friday November 2, 2pm, Carslaw 173

Ricardo Campello
University of Newcastle, School of Mathematical and Physical Sciences

Non-Parametric Density Estimates for Data Clustering, Visualisation, and Outlier Detection

Non-parametric density estimates are a useful tool for tackling different problems in statistical learning and data mining, most noticeably in the unsupervised and semi-supervised learning scenarios. In this talk, I elaborate on HDBSCAN*, a density-based framework for hierarchical and partitioning clustering, outlier detection, and data visualisation. Since its introduction in 2015, HDBSCAN* has gained increasing attention from both researchers and practitioners in data mining, with computationally efficient third-party implementations already available in major open-source software distributions such as R/CRAN and Python/SciKit-learn, as well as successful real-world applications reported in different fields. I will discuss the core HDBSCAN* algorithm and its interpretation from a non-parametric modelling perspective as well as from the perspective of graph theory. I will also discuss post-processing routines to perform hierarchy simplification, cluster evaluation, optimal cluster selection, visualisation, and outlier detection. Finally, I briefly survey a number of unsupervised and semi-supervised extensions of the HDBSCAN* framework currently under development along with students and collaborators, as well as some topics for future research.

Prof. Ricardo Campello received his Bachelor's degree in Electronics Engineering from the State University of São Paulo, Brazil, in 1994, and his MSc and PhD degrees in Electrical and Computer Engineering from the State University of Campinas, Brazil, in 1997 and 2002, respectively. Among other appointments, he was a Post-doctoral Fellow at the University of Nice, France (fall/winter 2002 - 2003), an Assistant/Associate Professor in computer science at the University of São Paulo, Brazil (2007 - 2016), and a Visiting Professor in computer science at the University of Alberta, Canada (2011 - 2013), where he is currently an Adjunct Professor (since 2017). Between 2016 and 2018 he was a Professor in applied mathematics, College of Science and Engineering, James Cook University (JCU), Australia, where he was co-responsible for the development of a professional Master of Data Science online programme. He currently holds a position of Adjunct Professor at JCU. He has been a Professor of data science within the discipline of statistics at the University of Newcastle, Australia, since July 2018.

### Friday October 26, 2pm, Carslaw 173

Rachel Wang
University of Sydney, School of Mathematics and Statistics

Metropolis-Hastings MCMC with dual mini-batches

For many decades Markov chain Monte Carlo (MCMC) methods have been the main workhorse of Bayesian inference. However, traditional MCMC algorithms are computationally intensive. In particular, the Metropolis-Hastings (MH) algorithm requires passing over the entire dataset to evaluate the likelihood ratio in each iteration. We propose a general framework for performing MH-MCMC using two mini-batches (MHDB) of the whole dataset each time and show that this gives rise to approximately a tempered stationary distribution. We prove that MHDB preserves the modes of the original target distribution and derive an error bound on the approximation for a general class of models including mixtures of exponential family distributions, linear binary classification and regression. To further extend the utility of the algorithm to high dimensional settings, we construct a proposal with forward and reverse moves using stochastic gradient and show that the construction leads to reasonable acceptance probabilities. We demonstrate the performance of our algorithm in neural network applications and show that compared with popular optimisation methods, our method is more robust to the choice of learning rate and improves testing accuracy.
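The cost that motivates the talk can be seen in a plain random-walk Metropolis-Hastings sampler. The following is a minimal sketch of the standard algorithm only, not the speaker's MHDB method; the unit-variance Gaussian model, the flat prior, and the function names are all assumptions chosen for illustration. Every iteration calls `log_likelihood` on the whole dataset, which is the full pass the abstract refers to.

```python
import numpy as np


def log_likelihood(theta, data):
    # Gaussian log-likelihood with unit variance, up to an additive constant.
    # The sum runs over the entire dataset.
    return -0.5 * np.sum((data - theta) ** 2)


def metropolis_hastings(data, n_iter=3000, step=0.1, theta0=0.0, seed=0):
    # Random-walk MH for the mean of the data under a flat prior (illustrative).
    rng = np.random.default_rng(seed)
    theta = theta0
    current_ll = log_likelihood(theta, data)
    chain = np.empty(n_iter)
    for k in range(n_iter):
        proposal = theta + step * rng.standard_normal()
        proposal_ll = log_likelihood(proposal, data)  # full pass over the data
        # Symmetric proposal and flat prior: accept on the likelihood ratio alone.
        if np.log(rng.uniform()) < proposal_ll - current_ll:
            theta, current_ll = proposal, proposal_ll
        chain[k] = theta
    return chain


data = np.random.default_rng(1).normal(1.5, 1.0, size=500)
chain = metropolis_hastings(data)
print(chain[500:].mean(), data.mean())
```

After discarding a burn-in, the chain average should sit close to the sample mean, which is the posterior mean under these assumptions.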

Rachel Wang is currently a lecturer and DECRA fellow in the School of Mathematics and Statistics. She received her PhD in Statistics from UC Berkeley in 2015 and subsequently spent two years as a Stein Fellow / Lecturer in the Department of Statistics at Stanford University. Her research interests include statistical network theory, statistical machine learning, and their applications to complex genomic datasets.

### Friday October 19, 2pm, Carslaw 173

Yuguang Ipsen
Australian National University, Research School of Finance, Actuarial Studies & Statistics

New Class of Random Discrete Distributions on Infinite Simplex Derived from Negative Binomial Processes

The Poisson-Kingman distributions, PK(ρ), on the infinite simplex, can be constructed from a Poisson point process having intensity density ρ or by taking the ranked jumps up to a specified time of a subordinator with Levy density ρ, as proportions of the subordinator. As a natural extension, we replace the Poisson point process with a negative binomial point process having parameter r > 0 and Levy density ρ, thereby defining a new class PK^{(r)}(ρ) of distributions on the infinite simplex. The new class contains the two-parameter generalisation PD(α,θ) of Pitman and Yor (1997) when θ > 0. It also contains a class of distributions, PD_α(r), which occurs naturally from the trimmed stable subordinator. We derive properties of the new distributions, including the joint density of the size-biased permutation, a stick-breaking representation, the exchangeable partition probability function, and an analogue of the Ewens sampling formula for PD_α(r).

Joint work with Prof. Ross Maller and Dr. Soudabeh Shemehsavar.

### Friday October 5, 2pm, Carslaw 173

Steph de Silva
PricewaterhouseCoopers

The spiral of normalcy: on communicating in the data sciences

Communicating technical concepts around data science and statistics is a difficult, underrated but entirely essential skill for applied work in industry and beyond. Steph will talk about some of the reasons why we, as statisticians, mathematicians, data scientists and the like, find ourselves in this position, what we might do about it and why it matters.

Steph has a Ph.D. in theoretical econometrics. Her first job after leaving academia was with the World Bank, working on survey data from the developing world. Real data came as a tremendous shock after years of simulating her own. She recovered eventually and went on to live a full and varied life.

These days, Steph is a data scientist working for a major consulting house where she’s living the generalist’s dream. She has no idea what she’ll be doing next week, and she’s absolutely sure she has no answers to the problems they’re going to throw at her when she gets there. Yet.

### Friday September 21, 2pm, Carslaw 173

Peter Radchenko
University of Sydney Business School

Grouped Variable Selection with Discrete Optimization

We will discuss a new tractable framework for grouped variable selection with a cardinality constraint on the number of selected groups, leveraging tools in modern mathematical optimization. The proposed methodology covers both the case of high-dimensional linear regression and nonparametric sparse additive modelling. Computational experiments demonstrate the effectiveness of our proposal as an alternative method for sparse grouped variable selection - in terms of better predictive accuracy and greater model sparsity, at the cost of increased, but still reasonable, computation times. Empirical and theoretical evidence shows that the proposed estimators outperform their Group Lasso type counterparts in a wide variety of regimes.

Peter Radchenko is an Associate Professor of Business Analytics at the University of Sydney Business School. Prior to joining the University of Sydney in 2017, he held academic positions at the University of Chicago and the University of Southern California. He has a PhD in Statistics from Yale University, and an undergraduate degree in Mathematics and Applied Mathematics, from the Lomonosov Moscow State University. Peter Radchenko's primary research focus is on developing new methodology for dealing with massive and complex modern data. In particular, he has worked extensively in the area of high-dimensional regression, where the number of predictors is large relative to the number of observations.

### Friday September 14, 2pm, Carslaw 173

Stephan Huckemann
University of Göttingen, Institute for Mathematical Stochastics

Dirty Central Limit Theorems on Noneuclidean Spaces

For inference on means of random vectors, the central limit theorem (CLT) is a central tool. Fréchet (1948) extended the notion of means to arbitrary metric spaces, as minimizers of expected squared distance. For such means, under mild conditions, a strong law was provided by Ziezold (1977) and, in the case of manifolds and under additional stronger conditions, a CLT was derived by Bhattacharya and Patrangenaru (2005). In a local chart, this CLT features a classical normal limiting distribution with a classical rate of inverse square root of sample size. If these additional stronger conditions are not satisfied, CLTs may still hold but feature different rates and different limit distributions. We give examples of such "dirty" limit theorems featuring faster rates (stickiness) and slower rates (smeariness). The former may occur on NNC (nonnegative curvature) spaces, where the distribution around cut loci of means plays a central role; the latter may occur on NPC (nonpositive curvature) spaces. Both effects may have serious practical applications.

Stephan Huckemann received his degree in mathematics from the University of Giessen (Germany) in 1987. He was a visiting lecturer and scholar at the University of Michigan, Ann Arbor (1987 - 1989) and a postdoctoral research fellow at the ETH Zürich, Switzerland (1989 - 1990). He then worked as a commercial software developer (1990 - 2001) and returned to academia as a contributor to the computer algebra system MuPAD at Sciface Software and the University of Paderborn, Germany (2001 - 2003). Working with H. Ziezold (2004 - 2006, Univ. of Kassel), P. Mihailescu (2007, Univ. of Göttingen), P.T. Kim (2009, Univ. of Guelph, Canada) and A. Munk (2007 - 2010, Univ. of Göttingen) he completed his Habilitation at the Univ. of Göttingen (Germany) in 2010 and was awarded a DFG Heisenberg research fellowship. As such he continued at the Institute for Mathematical Stochastics, Univ. of Göttingen while being a research group leader at the Statistical and Applied Mathematical Sciences Institute (2010/11 SAMSI, Durham, NC, USA). After substituting (2012 - 2013) for the Chair of Stochastics and Applications at the Univ. of Göttingen he holds the new Chair for Non-Euclidean Statistics.

### Friday September 7, 2pm, Carslaw 173

Mark Girolami
Department of Mathematics, Imperial College London

Markov Transition Operators defined by Hamiltonian Symplectic Flows and Langevin Diffusions on the Riemannian Manifold Structure of Statistical Models

The use of Differential Geometry in Statistical Science dates back to the early work of C.R. Rao in the 1940s, when he sought to assess the natural distance between population distributions. The Fisher-Rao metric tensor defined the Riemannian manifold structure of probability measures, and from this local manifold geodesic distances between probability measures could be properly defined. This early work was then taken up by many authors within the statistical sciences with an emphasis on the study of the efficiency of statistical estimators. The area of Information Geometry has developed substantially and has had major impact in areas of applied statistics such as Machine Learning and Statistical Signal Processing. A different perspective on the Riemannian structure of statistical manifolds can be taken to make breakthroughs in contemporary statistical modelling problems. Langevin diffusions and Hamiltonian dynamics on the manifold of probability measures are defined to obtain Markov transition kernels for Monte Carlo based inference.

Mark Girolami holds the Chair of Statistics within the Department of Mathematics at Imperial College London where he is also Professor of Computing Science in the Department of Computing. He is an adjunct Professor of Statistics at the University of Warwick and is Director of the Lloyd’s Register Foundation Programme on Data Centric Engineering at the Alan Turing Institute where he served as one of the original founding Executive Directors. He is an elected member of the Royal Society of Edinburgh and previously was awarded a Royal Society - Wolfson Research Merit Award. Professor Girolami has been an EPSRC Research Fellow continuously since 2007 and in 2018 he was awarded the Royal Academy of Engineering Research Chair in Data Centric Engineering. His research focuses on applications of mathematical and computational statistics.

### Friday August 17, 2pm, Carslaw 173

Pavel Krivitsky
University of Wollongong, School of Mathematics and Applied Statistics

Inference for Social Network Models from Egocentrically-Sampled Data

Egocentric network sampling observes the network of interest from the point of view of a set of sampled actors, who provide information about themselves and anonymised information on their network neighbours. In survey research, this is often the most practical, and sometimes the only, way to observe certain classes of networks, with the sexual networks that underlie HIV transmission being the archetypal case. Although methods exist for recovering some descriptive network features, there is no rigorous and practical statistical foundation for estimation and inference for network models from such data. We identify a subclass of exponential-family random graph models (ERGMs) amenable to being estimated from egocentrically sampled network data, and apply pseudo-maximum-likelihood estimation to do so and to rigorously quantify the uncertainty of the estimates. For ERGMs parametrised to be invariant to network size, we describe a computationally tractable approach to this problem. We use this methodology to help understand persistent racial disparities in HIV prevalence in the US. Lastly, we discuss how questionnaire design affects what questions can and cannot be answered with this analysis. This work is joint with Prof Martina Morris (University of Washington).

Dr Pavel N. Krivitsky received his PhD in Statistics in 2009 from University of Washington, and has been a Lecturer in Statistics at the University of Wollongong since 2013. His research interests include statistical modelling of social network data and processes for applications in epidemiology and the social sciences, statistical computing, and data privacy. He has contributed to theory and practice of latent variable and of exponential-family random graph models for networks, particularly models for network evolution and for valued relations, understanding effects of changing network size and composition, and estimation of complex network models from difficult (perturbed, egocentrically-sampled, etc.) data. He develops and maintains a number of popular R packages for social network analysis.

### Friday August 10, 2pm, Carslaw 173

Yuya Sasaki
Vanderbilt University, Department of Economics

Inference for Moments of Ratios with Robustness against Large Trimming Bias and Unknown Convergence Rate

We consider statistical inference for moments of the form E[B/A]. A naive sample mean is unstable with small denominator, A. This paper develops a method of robust inference, and proposes a data-driven practical choice of trimming observations with small A. Our sense of the robustness is twofold. First, bias correction allows for robustness against large trimming bias. Second, adaptive inference allows for robustness against unknown convergence rate. The proposed method allows for closer-to-optimal trimming, and more informative inference results in practice. This practical advantage is demonstrated for inverse propensity score weighting through simulation studies and real data analysis.

Yuya Sasaki is an Associate Professor of Economics at Vanderbilt University. He received his bachelor’s and master’s degrees at Utah State University with majors in economics, geography, and mathematics. He received a Ph.D. in economics from Brown University. Yuya Sasaki was an assistant professor of economics at Johns Hopkins University, and then moved to Vanderbilt University as an associate professor. His field of specialization is econometrics. He is currently an associate editor of the Journal of Econometric Methods.

### Friday July 20, 2pm, Carslaw 829

Hongyuan Cao
University of Missouri, Department of Statistics

Statistical Methods for Integrative Analysis of Multi-Omics Data

Genome-wide complex trait analysis (GCTA) was developed and applied to heritability analyses on complex traits and more recently extended to mental disorders. However, besides the intensive computation, previous literature also limits the scope to univariate phenotype, which ignores mutually informative but partially independent pieces of information provided in other phenotypes. Our goal is to use such auxiliary information to improve power. We show that the proposed method leads to a large power increase, while controlling the false discovery rate, both empirically and theoretically. Extensive simulations demonstrate the advantage of the proposed method over several state-of-the-art methods. We illustrate our methods on a dataset from a schizophrenia study.

Dr. Cao is an assistant professor of statistics at the University of Missouri-Columbia. She received her Ph.D. in statistics from UNC-Chapel Hill in 2010. She has published over 20 papers, several of which are in top statistics journals such as Biometrika, Journal of the American Statistical Association and Journal of The Royal Statistical Society, Series B. She serves as an associate editor of Biometrics. Her research interests include high dimensional and large scale statistical analysis, survival analysis, longitudinal data analysis and bioinformatics.

### Friday July 13, 2pm, Carslaw 829

Johann Gagnon-Bartsch
University of Michigan, Department of Statistics

The LOOP Estimator: Adjusting for Covariates in Randomized Experiments

When conducting a randomized controlled trial, it is common to specify in advance, as part of the trial protocol, the statistical analyses that will be used to analyze the data. Typically these analyses will involve adjusting for small imbalances in baseline covariates. However, this poses a dilemma, since adjusting for too many covariates can hurt precision more than it helps, and it is often unclear which covariates are predictive of outcome prior to conducting the experiment. For example, both post-stratification and OLS regression adjustments can actually increase variance (relative to a simple difference in means) if too many covariates are used. OLS is also biased under the Neyman-Rubin model. Here we introduce the LOOP ("Leave-One-Out Potential outcomes") estimator of the average treatment effect. We leave out each observation and then impute that observation's treatment and control potential outcomes using a prediction algorithm, such as a random forest. This estimator is exactly unbiased under the Neyman-Rubin model, generally performs at least as well as the unadjusted estimator, and the experimental randomization largely justifies the statistical assumptions made. Importantly, the LOOP estimator also enables us to take advantage of automatic variable selection, and thus eliminates the guesswork of selecting covariates prior to conducting the trial.
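The leave-one-out idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the speaker's implementation: assignment is taken to be Bernoulli with a known probability `p`, the imputed potential outcomes are combined as `(1 - p) * t_hat + p * c_hat` (one possible weighting, assumed here), and a leave-one-out least squares fit stands in for the random forest mentioned in the abstract. The names `ols_predict` and `loop_estimate` are hypothetical.

```python
import numpy as np


def ols_predict(x_train, y_train, x_new):
    # Least squares fit with an intercept; stands in for any prediction algorithm.
    design = np.column_stack([np.ones(len(x_train)), x_train])
    coef, *_ = np.linalg.lstsq(design, y_train, rcond=None)
    return coef[0] + x_new @ coef[1:]


def loop_estimate(y, t, x, p=0.5):
    # y: outcomes, t: 0/1 treatment indicators, x: (n, d) covariates,
    # p: assumed known probability of assignment to treatment.
    n = len(y)
    contributions = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i  # leave observation i out
        treated = keep & (t == 1)
        control = keep & (t == 0)
        t_hat = ols_predict(x[treated], y[treated], x[i])
        c_hat = ols_predict(x[control], y[control], x[i])
        m_hat = (1 - p) * t_hat + p * c_hat  # assumed weighting of the imputations
        u = 1.0 / p if t[i] == 1 else -1.0 / (1.0 - p)
        contributions[i] = (y[i] - m_hat) * u
    return contributions.mean()


rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 2))
t = rng.integers(0, 2, size=n)
y = 1.0 + x @ np.array([1.0, -1.0]) + 2.0 * t + 0.1 * rng.normal(size=n)
print(loop_estimate(y, t, x))
```

Because `m_hat` is computed without observation `i`, it does not depend on that observation's own assignment, and the signed weight `u` has mean zero under the assumed design. The imputation therefore changes the variance of the estimate but not its expectation.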

Johann Gagnon-Bartsch is an Assistant Professor of Statistics in the Department of Statistics at the University of Michigan. Gagnon-Bartsch received his bachelor’s degree from Stanford University with majors in Math, Physics, and International Relations. He completed a PhD at Berkeley in Statistics, and then spent three more years as a visiting assistant professor in the Berkeley Statistics department. Gagnon-Bartsch’s research focuses on causal inference, machine learning, and nonparametric methods with applications in the biological and social sciences.

## 2018 Semester 1

### Friday June 8, 2pm, Carslaw 173

Janice Scealy
Australian National University, Research School of Finance, Actuarial Studies & Statistics

Scaled von Mises-Fisher distributions and regression models for palaeomagnetic directional data

We propose a new distribution for analysing palaeomagnetic directional data that is a novel transformation of the von Mises-Fisher distribution. The new distribution has ellipse-like symmetry, as does the Kent distribution; however, unlike the Kent distribution the normalising constant in the new density is easy to compute and estimation of the shape parameters is straightforward. To accommodate outliers, the model also incorporates an additional shape parameter which controls the tail-weight of the distribution. We also develop a general regression model framework that allows both the mean direction and the shape parameters of the error distribution to depend on covariates. To illustrate, we analyse palaeomagnetic directional data from the GEOMAGIA50.v3 database. We predict the mean direction at various geological time points and show that there is significant heteroscedasticity present. It is envisaged that the regression structures and error distribution proposed here will also prove useful when covariate information is available with (i) other types of directional response data; and (ii) square-root transformed compositional data of general dimension. This is joint work with Andrew T. A. Wood.

Dr Janice Scealy is a senior lecturer in statistics in the Research School of Finance, Actuarial Studies and Statistics, ANU and she is currently an ARC DECRA fellow. Her research interests include developing new statistical analysis methods for data with complicated constraints, including compositional data defined on the simplex, spherical data, directional data and manifold-valued data defined on more general curved surfaces.

### Friday June 1, 2pm, Carslaw 173

Subhash Bagui
University of West Florida, Department of Mathematics and Statistics

Convergence of Known Distributions to Normality or Non-normality: An Elementary Ratio Technique

This talk presents an elementary informal technique for deriving the convergence of known discrete/continuous type distributions to limiting normal or non-normal distributions. The technique utilizes the ratio of the pmf/pdf at hand at two consecutive/nearby points. The presentation should be of interest to teachers and students of first year graduate level courses in probability and statistics.
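As an illustration of how a ratio argument of this kind might go (a standard textbook case chosen here as an assumption, not necessarily one of the speaker's examples), take the Binomial(n, p) pmf with p = λ/n and compare the pmf at two consecutive points x and x + 1:

```latex
\frac{p_n(x+1)}{p_n(x)}
  = \frac{\binom{n}{x+1} p^{x+1} (1-p)^{n-x-1}}{\binom{n}{x} p^{x} (1-p)^{n-x}}
  = \frac{n-x}{x+1} \cdot \frac{p}{1-p}
  = \frac{\lambda}{x+1} \cdot \frac{n-x}{n-\lambda}
  \;\xrightarrow[n \to \infty]{}\; \frac{\lambda}{x+1}.
```

The limit equals the corresponding ratio for the Poisson(λ) pmf, since e^{-λ} λ^{x+1}/(x+1)! divided by e^{-λ} λ^{x}/x! is λ/(x+1). This informally suggests that the limiting distribution is Poisson(λ).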

Subhash C. Bagui received his B.Sc. in Statistics from the University of Calcutta, M. Stat. from the Indian Statistical Institute and Ph.D. from the University of Alberta, Canada. He is currently a University Distinguished Professor at the University of West Florida. He has authored a book titled "Handbook of Percentiles of Non-central t-distribution", and published many high quality peer reviewed journal articles. He currently serves as an associate editor or editorial board member of several statistics journals. His research interests include nonparametric classification and clustering, statistical pattern recognition, machine learning, central limit theorem, and experimental designs. He is also a fellow of the American Statistical Association (ASA) and the Royal Statistical Society (RSS).

### Friday May 18, 2pm, Room TBA

Honours talks
University of Sydney, School of Mathematics and Statistics

### Friday May 4, 2pm, Carslaw 173

Nicholas Fisher
University of Sydney and ValueMetrics Australia

The Good, the Bad, and the Horrible: Interpreting Net-Promoter Score and the Safety Attitudes Questionnaire in the light of good market research practice

Net-Promoter Score (NPS) is a ubiquitous, easily-collected market research metric, having displaced many complete market research processes. Unfortunately, this has been its sole success. It possesses few, if any, of the characteristics that might be regarded as highly desirable in a high-level market research metric; on the contrary, it’s done considerable damage to companies, to their shareholders and to their customers. Given the current focus on the financial services sector and its systemic failures in delivering value to customers, it is high time to question reliance on NPS.

The Safety Attitudes Questionnaire is an instrument for assessing Safety Culture in the workplace, and is similarly widespread throughout industries where Safety is a critical issue. It has now been adapted to assess other forms of culture, such as Risk Culture. Unfortunately, it is also highly flawed, albeit for quite different reasons.

Examining these two methodologies through the lens of good market research practice brings their fundamental flaws into focus.

Nick Fisher has an honorary position as Visiting Professor of Statistics at the University of Sydney, and runs his own R&D consultancy specialising in Performance Measurement. Prior to taking up these positions in 2001, he was a Chief Research Scientist in CSIRO Mathematical and Information Sciences.

### Friday March 23, 2pm, Carslaw 173

Mikaela Jorgensen
Australian Institute of Health Innovation, Macquarie University, Sydney, Australia

Using routinely collected data in aged care research: a grey area

When the Department of Health launched the My Aged Care website in 2013, they “severely under-estimated the proportion of enquiries and referrals they would receive by fax”. Yes, that's fax machines in *2013*. However, electronic data systems are increasingly starting to be used in aged care.

This presentation will discuss the joys of using messy routinely collected datasets to examine the care and outcomes of people using aged care services.

Does pressure injury incidence differ between residential aged care facilities? Is home care service use associated with time to entry into residential aged care? These questions, and more, will be discussed.

We'll take a dive into some multilevel mixed effects models, and resurface with some risk-adjusted funnel plots. People from all backgrounds with an interest in data analysis welcome.

Dr Mikaela Jorgensen is a health services researcher at the Australian Institute of Health Innovation, Macquarie University. She has followed the traditional career pathway from speech pathologist to analyst of linked routinely collected health datasets for the last five years.

### Friday March 16, 2pm, Carslaw 173

Jake Olivier
School of Mathematics and Statistics, UNSW, Sydney, Australia

The importance of statistics in shaping public policy

Statisticians have an important role to play in shaping public policy. Public discourse can often be divisive and emotive, and it can be difficult for the uninitiated to sift through the morass of "fake" and "real" news. Decisions need to be well-informed and statisticians should be leaders in identifying relevant data for a testable hypothesis using appropriate methodology. It is also important for a statistician to identify when the data or methods used are not up to the task or when there is too much uncertainty to make accurate decisions. I will discuss some examples from my own research. This includes, in ascending order of controversy, graduated licensing schemes, claims made by Australian politicians, gun control, and bicycle helmet laws. I will also discuss some methodological challenges in evaluating interventions including regression to the mean for Poisson processes.

Associate Professor Jake Olivier is a member of the School of Mathematics and Statistics at UNSW Sydney. He is originally from New Orleans and spent many years living in Mississippi for graduate school and early academic appointments. A/Prof Olivier is the incoming president of the NSW Branch of the Statistical Society of Australia and immediate past chair of the Biostatistics Section. He serves on the editorial boards of BMJ Open, PLOS ONE, Cogent Medicine and the Journal of the Australasian College of Road Safety. His research interests are cycling safety, the analysis of categorical data and methods for evaluating public health interventions.

### Friday March 9

Tim Swartz
Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC Canada

A buffet of problems in sports analytics

This talk explores some work that I have done and some topics that you may find interesting with respect to statistics in sport. I plan on discussing a number of problems with almost no discussion of technical details. Some of the sports include hockey, cricket, highland dance, soccer and golf.

### Friday February 9

Maria-Pia Victoria-Feser
Research Center for Statistics, Geneva School of Economics and Management, University of Geneva

A prediction divergence criterion for model selection and classification in high dimensional settings

A new class of model selection criteria is proposed which is suited for stepwise approaches or can be used as selection criteria in penalized estimation based methods. This new class, called the d-class of error measure, generalizes Efron's q-class. This class not only contains classical criteria such as Mallows' Cp or the AIC, but also enables one to define new criteria that are more general. Within this new class, we propose a model selection criterion based on a prediction divergence between two nested models' predictions that we call the Prediction Divergence Criterion (PDC). The PDC provides a different measure of prediction error than a criterion associated to each potential model within a sequence and for which the selection decision is based on the sign of differences between the criteria. The PDC directly measures the prediction error divergence between two nested models. As examples, we consider the linear regression models and (supervised) classification. We show that a selection procedure based on the PDC, compared to the Cp (in the linear case), has a smaller probability of overfitting, hence leading to parsimonious models for the same out-of-sample prediction error. The PDC is particularly well suited in high dimensional and sparse situations and also under (small) model misspecifications. Examples on a malnutrition study and on acute leukemia classification will be presented.

## 2017 Semester 2

Date: 8th of December, 2017
Richard Hunt
University of Sydney
Location: Carslaw 173
Title: A New Look at Gegenbauer Long Memory Processes
Abstract:
In this presentation we will look at Long Memory and Gegenbauer Long Memory processes, and methods for estimation of the parameters of these models. After a review of the history of the development of these processes, and some of the personalities involved, we will introduce a new method for the estimation of almost all the parameters of a k-factor Gegenbauer/GARMA process. The method essentially attempts to find parameters for the spectral density to ensure it most closely matches the (smoothed) periodogram. Simulations indicate that the new method has a similar level of accuracy to existing methods (Whittle, Conditional Sum-of-squares), but can be evaluated considerably faster, whilst making few distributional assumptions on the data.

Date: 24th of November, 2017
Prof. Sally Cripps
University of Sydney
Location: Carslaw 173
Title: A spatio-temporal mixture model for Australian daily rainfall, 1876-2015: Modeling daily rainfall over the Australian continent
Abstract:
Daily precipitation has an enormous impact on human activity, and the study of how it varies over time and space, and what global indicators influence it, is of paramount importance to Australian agriculture. The topic is complex and would benefit from a common and publicly available statistical framework that scales to large data sets. We propose a general Bayesian spatio-temporal mixture model accommodating mixed discrete-continuous data. Our analysis uses over 294 million daily rainfall measurements since 1876, spanning 17,606 rainfall measurement sites. The size of the data calls for a parsimonious yet flexible model as well as computationally efficient methods for performing the statistical inference. Parsimony is achieved by encoding spatial, temporal and climatic variation entirely within a mixture model whose mixing weights depend on covariates. Computational efficiency is achieved by constructing a Markov chain Monte Carlo sampler that runs in parallel in a distributed computing framework. We present examples of posterior inference on short-term daily component classification, monthly intensity levels, offsite prediction of the effects of climate drivers and long-term rainfall trends across the entire continent. Computer code implementing the methods proposed in this paper is available as an R package.

Date: 22nd of November, 2017
Speaker: Charles Gray
La Trobe
Location: Carslaw 173
Title: The Curious Case of the Disappearing Coverage: a detective story in visualisation
Abstract:
Do you identify as a member of the ggplot cohort of statisticians? Did you or your students learn statistics in the era of visualisation tools such as R's ggplot package? Would it have made a difference to how you engaged with statistical theory? In this talk, I'll reflect on learning statistics at the same time as visualisation, at the half-way point in my doctoral studies. I'll share how we solved some counterintuitive coverage probability simulation results through visualisation. I see this as an opportunity to generate discussion and learn from you: questions, comments, and a generally rowdy atmosphere are most welcome.

Date: 20th of October, 2017
Time: 1.15-3pm
Location: Access Grid Room
Interview seminar

Date: 13th of October, 2017
Time: 10am-12pm
Location: Carslaw 535
Interview seminar

Date: 13th of October, 2017
No seminar (Honours presentations date)

Date: 6th of October, 2017
Kim-Anh Le Cao
University of Melbourne
Location: Carslaw 173
Time: 2-3pm
Title: Challenges in microbiome data analysis (also known as "poop analyses")
Abstract:

Recent breakthroughs and advances in culture independent techniques, such as shotgun metagenomics and 16S rRNA amplicon sequencing, have dramatically changed the way we can examine microbial communities. But does the hype of the microbiome outweigh the potential of our understanding of this ‘second genome’? There are many hurdles to tackle before we are able to identify and compare bacteria driving changes in their ecosystem. In addition to the bioinformatics challenges, current statistical methods are limited in their ability to make sense of these complex data, which are inherently sparse, compositional and multivariate.

I will discuss some of the topical challenges in 16S data analysis, including the presence of confounding variables and batch effects, some experimental design considerations, and share my own personal story on how a team of rogue statisticians conducted their own mouse microbiome experiment leading to somewhat surprising results! I will also present our latest analyses to identify multivariate microbial signatures in immune-mediated diseases and discuss the next analytical challenges I envision.

This presentation will combine the results of exciting and highly collaborative works between a team of eager data analysts, immunologists and microbiologists. For once, the speaker will abstain from talking about data integration, or mixOmics (oops! but if you are interested keep an eye out in PLOS Comp Biol).

Dr Kim-Anh Lê Cao (NHMRC career development fellow, Senior Lecturer) recently joined the University of Melbourne (Centre for Systems Genomics and School of Mathematics and Statistics). She was awarded her PhD from the Université de Toulouse, France and moved to Australia as a postdoctoral research fellow at the Institute for Molecular Bioscience, University of Queensland. She was hired as a researcher and consultant at QFAB Bioinformatics, where she developed a multidisciplinary approach to her research. Between 2014 and 2017 she led a computational biostatistics group at the UQ Diamantina Institute, a biomedical research institute. Dr Kim-Anh Lê Cao is an expert in multivariate statistical methods and novel developments. Since 2009, her team has been working on implementing the R toolkit mixOmics, dedicated to the integrative analysis of 'omics data, to help researchers mine and make sense of biological data (http://www.mixOmics.org).

Date: 22nd of September, 2017
Speaker: Sharon Lee
University of Queensland
Location: Carslaw 173
Title: Clustering and classification of batch data
Abstract:
Motivated by the analysis of batch cytometric data, we consider the problem of jointly modelling and clustering multiple heterogeneous data samples. Traditional mixture models cannot be applied directly to these data. Intuitive approaches such as pooling and post-hoc cluster matching fail to account for the variations between the samples. In this talk, we consider a hierarchical mixture model approach to handle inter-sample variations. The adoption of a skew mixture model with random effects terms for the location parameter allows for the simultaneous clustering and matching of clusters across the samples. In the case where data from multiple classes of objects are available, this approach can be further extended to perform classification of new samples into one of the predefined classes. Examples with real cytometry data will be given to illustrate this approach.

Date: 15th of September, 2017
Speaker: Emi Tanaka
University of Sydney
Location: Carslaw 173
Title: Outlier detection for a complex linear mixed model: an application to plant breeding trials
Abstract:
Outlier detection is an important preliminary step in data analysis, often conducted through a form of residual analysis. Complex data, such as those that are analysed by linear mixed models, give rise to distinct levels of residuals and thus offer additional challenges for the development of an outlier detection method. Plant breeding trials are routinely conducted over years and multiple locations with the aim of selecting the best genotypes as parents or for commercial release. These so-called multi-environmental trials (MET) are commonly analysed using linear mixed models which may include cubic splines and autoregressive processes to account for spatial trends. We consider some statistics derived from the mean and variance shift outlier models (MSOM/VSOM) and the generalised Cook's distance (GCD) for outlier detection. We present a simulation study based on a set of real wheat yield trials.

Date: 11th of August, 2017
Speaker: Ming Yuan
Location: Carslaw 173
Title: Quantitation in Colocalization Analysis: Beyond "Red + Green = Yellow"
Abstract:
"I see yellow; therefore, there is colocalization." Is it really so simple when it comes to colocalization studies? Unfortunately, and fortunately, no. Colocalization is in fact a supremely powerful technique for scientists who want to take full advantage of what optical microscopy has to offer: quantitative, correlative information together with spatial resolution. Yet, methods for colocalization have been put into doubt now that images are no longer considered simple visual representations. Colocalization studies have notoriously been subject to misinterpretation due to difficulties in robust quantification and, more importantly, reproducibility, which results in a constant source of confusion, frustration, and error. In this talk, I will share some of our effort and progress to ease such challenges using novel statistical and computational tools.

Bio: Ming Yuan is Senior Investigator at Morgridge Institute for Research and Professor of Statistics at Columbia University and University of Wisconsin-Madison. He was previously Coca-Cola Junior Professor in the H. Milton Stewart School of Industrial and Systems Engineering at Georgia Institute of Technology. He received his Ph.D. in Statistics and M.S. in Computer Science from University of Wisconsin-Madison. His main research interests lie in theory, methods and applications of data mining and statistical learning. Dr. Yuan has been serving on editorial boards of various top journals including The Annals of Statistics, Bernoulli, Biometrics, Electronic Journal of Statistics, Journal of the American Statistical Association, Journal of the Royal Statistical Society Series B, and Statistical Science. Dr. Yuan was awarded the John van Ryzin Award in 2004 by ENAR, CAREER Award in 2009 by NSF, and Guy Medal in Bronze from the Royal Statistical Society in 2014. He was also named a Fellow of IMS in 2015, and a Medallion Lecturer of IMS.

Date: 13th of July, 2017
Speaker: Irene Gijbels
University of Leuven (KU Leuven)
Location: AGR Carslaw 829
Title: Robust estimation and variable selection in linear regression
Abstract:
In this talk the interest is in robust procedures to select variables in a multiple linear regression modeling context. Throughout the talk the focus is on how to adapt the nonnegative garrote selection method to get to a robust variable selection method. We establish estimation and variable selection consistency properties of the developed method, and discuss robustness properties such as breakdown point and influence function. In a second part of the talk the focus is on heteroscedastic linear regression models, in which one also wants to select the variables that influence the variance part. Methods for robust estimation and variable selection are discussed, and illustrations of their influence functions are provided. Throughout the talk examples are given to illustrate the practical use of the methods.

Date: 30th of June, 2017
Speaker: Ines Wilms
University of Leuven (KU Leuven)
Location: AGR Carslaw 829
Title: Sparse cointegration
Abstract:
Cointegration analysis is used to estimate the long-run equilibrium relations between several time series. The coefficients of these long-run equilibrium relations are the cointegrating vectors. We provide a sparse estimator of the cointegrating vectors. Sparsity means that some elements of the cointegrating vectors are estimated as exactly zero. The sparse estimator is applicable in high-dimensional settings, where the time series length is short relative to the number of time series. Our method achieves better estimation accuracy than the traditional Johansen method in sparse and/or high-dimensional settings. We use the sparse method for interest rate growth forecasting and consumption growth forecasting. We show that forecast performance can be improved by sparsely estimating the cointegrating vectors.

Joint work with Christophe Croux.

Date: 19th of May, 2017
Speaker: Dianne Cook
Monash University
Title: The glue that binds statistical inference, tidy data, grammar of graphics, data visualisation and visual inference
Abstract:

Buja et al (2009) and Majumder et al (2012) established and validated protocols that place data plots into the statistical inference framework. This, combined with the conceptual grammar of graphics initiated by Wilkinson (1999) and refined and made popular in the R package ggplot2 (Wickham, 2016), builds plots using a functional language. The tidy data concepts made popular with the R packages tidyr (Wickham, 2017) and dplyr (Wickham and Francois, 2016) complete the mapping from random variables to plot elements.

Visualisation plays a large role in data science today. It is important for exploring data and detecting unanticipated structure. Visual inference provides the opportunity to assess discovered structure rigorously, using p-values computed by crowd-sourcing lineups of plots. Visualisation is also important for communicating results, and we often agonise over different choices in plot design to arrive at a final display. Treating plots as statistics, we can make power calculations to objectively determine the best design.

This talk will be interactive. Email your favourite plot to dicook@monash.edu ahead of time. We will work in groups to break the plot down in terms of the grammar, relate this to random variables using tidy data concepts, determine the intended null hypothesis underlying the visualisation, and hence structure it as a hypothesis test. Bring your laptop, so we can collaboratively do this exercise.

Joint work with Heike Hofmann, Mahbubul Majumder and Hadley Wickham

Date: 5th of May, 2017
Speaker: Peter Straka
University of New South Wales
Title: Extremes of events with heavy-tailed inter-arrival times
Abstract:

Heavy-tailed inter-arrival times are a signature of "bursty" dynamics, and have been observed in financial time series, earthquakes, solar flares and neuron spike trains. We propose to model extremes of such time series via a "Max-Renewal process" (aka "Continuous Time Random Maxima process"). Due to geometric sum-stability, the inter-arrival times between extremes are attracted to a Mittag-Leffler distribution: As the threshold height increases, the Mittag-Leffler shape parameter stays constant, while the scale parameter grows like a power-law. Although the renewal assumption is debatable, this theoretical result is observed for many datasets. We discuss approaches to fit model parameters and assess uncertainty due to threshold selection.

Date: 28th of April, 2017
Speaker: Botond Szabo
Leiden University
Title: An asymptotic analysis of nonparametric distributed methods
Abstract:

In recent years, datasets in certain applications have become so large that it becomes infeasible, or computationally undesirable, to carry out the analysis on a single machine. This gave rise to divide-and-conquer algorithms where the data is distributed over several 'local' machines and the computations are done on these machines in parallel. Then the outcomes of the local computations are somehow aggregated to a global result in a central machine. Over the years various divide-and-conquer algorithms were proposed, many of them with limited theoretical underpinning. First we compare the theoretical properties of a (not complete) list of proposed methods on the benchmark nonparametric signal-in-white-noise model. Most of the investigated algorithms use information on aspects of the underlying true signal (for instance regularity), which is usually not available in practice. A central question is whether one can tune the algorithms in a data-driven way, without using any additional knowledge about the signal. We show that (a list of) standard data-driven techniques (both Bayesian and frequentist) cannot recover the underlying signal with the minimax rate. This, however, does not imply the non-existence of an adaptive distributed method. To address the theoretical limitations of data-driven divide-and-conquer algorithms we consider a setting where the amount of information sent between the local and central machines is expensive and limited. We show that it is not possible to construct data-driven methods which adapt to the unknown regularity of the underlying signal and at the same time communicate the optimal amount of information between the machines. This is joint work with Harry van Zanten.
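As a rough illustration of the divide-and-conquer pattern described in this abstract (and not of any specific method from the talk), here is a minimal Python sketch. It assumes the simplest possible task, estimating a mean, and the helper name `divide_and_conquer_mean` and the simulated data are hypothetical.

```python
import random
import statistics


def divide_and_conquer_mean(data, n_machines):
    """Estimate a mean by averaging local means (hypothetical helper)."""
    # Split the data into n_machines equally sized chunks ("local machines").
    chunk = len(data) // n_machines
    local_estimates = [
        statistics.fmean(data[i * chunk:(i + 1) * chunk])
        for i in range(n_machines)
    ]
    # The "central machine" aggregates the local results by simple averaging.
    return statistics.fmean(local_estimates)


# Simulated data: an assumption made purely for illustration.
random.seed(0)
data = [random.gauss(1.0, 1.0) for _ in range(1000)]

global_estimate = statistics.fmean(data)
distributed_estimate = divide_and_conquer_mean(data, n_machines=10)
```

For a linear estimator with equally sized chunks, the aggregated estimate agrees with the full-data estimate up to rounding. The questions raised in the abstract concern nonparametric estimators, where each local machine must also choose tuning parameters without knowing the regularity of the signal.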

Botond Szabo is an Assistant Professor at the University of Leiden, The Netherlands. Botond received his PhD in Mathematical Statistics from the Eindhoven University of Technology, the Netherlands in 2014 under the supervision of Prof.dr. Harry van Zanten and Prof.dr. Aad van der Vaart. His research interests cover Nonparametric Bayesian Statistics, Adaptation, Asymptotic Statistics, Operations Research and Graph Theory. He was runner-up for the Savage Award in Theory and Methods, for the best PhD dissertation in the field of Bayesian statistics and econometrics, and received the "Van Zwet Award" for the best PhD dissertation in the Netherlands in Statistics and Operations Research in 2015. He is an Associate Editor of Bayesian Analysis. You can find more about him here: http://math.bme.hu/~bszabo/index_en.html.

Date: 7th of April, 2017
Speaker: John Ormerod
Sydney University
Title: Bayesian hypothesis tests with diffuse priors: Can we have our cake and eat it too?
Abstract:

We introduce a new class of priors for Bayesian hypothesis testing, which we name "cake priors". These priors circumvent Bartlett's paradox (also called the Jeffreys-Lindley paradox), the problem associated with the use of diffuse priors leading to nonsensical statistical inferences. Cake priors allow the use of diffuse priors (having one's cake) while achieving theoretically justified inferences (eating it too). Lindley's paradox will also be discussed. A novel construct involving a hypothetical data-model pair will be used to extend cake priors to handle the case where there are zero free parameters under the null hypothesis. The resulting test statistics take the form of a penalized likelihood ratio test statistic. By considering the sampling distribution under the null and alternative hypotheses we show (under certain assumptions) that these Bayesian hypothesis tests are strongly Chernoff-consistent, i.e., achieve zero type I and II errors asymptotically. This sharply contrasts with classical tests, which hold the level of the test constant and so are not Chernoff-consistent.

Joint work with: Michael Stewart, Weichang Yu, and Sarah Romanes.

Date: 7th of April, 2017
Speaker: Shige Peng
Shandong University
Title: Data-based Quantitative Analysis under Nonlinear Expectations
Abstract:

Traditionally, a real-life random sample is often treated as measurements resulting from an i.i.d. sequence of random variables or, more generally, as an outcome of either linear or nonlinear regression models driven by an i.i.d. sequence. In many situations, however, this standard modeling approach fails to address the complexity of real-life random data. We argue that it is necessary to take into account the uncertainty hidden inside random sequences that are observed in practice.

To deal with this issue, we introduce a robust nonlinear expectation to quantitatively measure and calculate this type of uncertainty. The corresponding fundamental concept of a 'nonlinear i.i.d. sequence' is used to model a large variety of real-world random phenomena. We give a robust and simple algorithm, called 'phi-max-mean', which can be used to measure this type of uncertainty, and we show that it provides an asymptotically optimal unbiased estimator of the corresponding nonlinear distribution.

Date: 17th of March, 2017
Speaker: Joe Neeman
University of Texas Austin
Title: Gaussian vectors, half-spaces, and convexity
Abstract:

Let $$A$$ be a subset of $$\mathbb{R}^n$$ and let $$B$$ be a half-space with the same Gaussian measure as $$A$$. For a pair of correlated Gaussian vectors $$X$$ and $$Y$$, $$\mathrm{Pr}(X \in A, Y \in A)$$ is smaller than $$\mathrm{Pr}(X \in B, Y \in B)$$; this was originally proved by Borell, who also showed various other extremal properties of half-spaces. For example, the exit time of an Ornstein-Uhlenbeck process from $$A$$ is stochastically dominated by its exit time from $$B$$.

We will discuss these (and other) inequalities using a kind of modified convexity.
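Written out, the first inequality above reads as follows. The normalisation is an assumption not stated in the abstract: $$X$$ and $$Y$$ are taken to be standard Gaussian vectors in $$\mathbb{R}^n$$ with correlation $$\rho \in (0,1)$$, and $$\gamma_n$$ denotes the standard Gaussian measure.

$$
\gamma_n(A) = \gamma_n(B) \quad\Longrightarrow\quad \mathrm{Pr}(X \in A,\, Y \in A) \le \mathrm{Pr}(X \in B,\, Y \in B).
$$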

Date: 3rd of March, 2017
Speaker: Ron Shamir
Tel Aviv University
Title: Modularity, classification and networks in analysis of big biomedical data
Abstract:

Supervised and unsupervised methods have been used extensively to analyze genomics data, with mixed results. On one hand, new insights have led to new biological findings. On the other hand, analysis results were often not robust. Here we take a look at several such challenges from the perspectives of networks and big data. Specifically, we ask if and how the added information from a biological network helps in these challenges. We show both examples where the network-added information is invaluable, and others where it is questionable. We also show that by collectively analyzing omic data across multiple studies of many diseases, robustness greatly improves.

Date: 31st of January, 2017
Speaker: Genevera Allen
Rice University
Title: Networks for Big Biomedical data
Abstract:

Cancer and neurological diseases are among the top five causes of death in Australia. The good news is new Big Data technologies may hold the key to understanding causes and possible cures for cancer as well as understanding the complexities of the human brain.

Join us on a voyage of discovery as we highlight how data science is transforming medical research: networks can be used to visualise and mine big biomedical data disrupting neurological diseases. Using real case studies, see how cutting-edge data science is bringing us closer than ever before to major medical breakthroughs.

### Information for visitors

Enquiries about the Statistics Seminar should be directed to the organiser Garth Tarr.