University of Sydney Statistics Seminar Series

Unless otherwise specified, seminars will be held on Fridays at 2pm in Carslaw 173.

To be added to or removed from the mailing list, or for any other information, please contact Garth Tarr.

2018 Semester 2

Friday September 14, 2pm, Carslaw 173

Stephan Huckemann
University of Göttingen, Institute for Mathematical Stochastics

Dirty Central Limit Theorems on Noneuclidean Spaces

For inference on means of random vectors, the central limit theorem (CLT) is a central tool. Fréchet (1948) extended the notion of means to arbitrary metric spaces, as minimizers of expected squared distance. For such means, under mild conditions, a strong law was provided by Ziezold (1977) and, in the case of manifolds and under additional stronger conditions, a CLT was derived by Bhattacharya and Patrangenaru (2005). In a local chart, this CLT features a classical normal limiting distribution with the classical rate of the inverse square root of the sample size. If these additional stronger conditions are not satisfied, CLTs may still hold but feature different rates and different limit distributions. We give examples of such "dirty" limit theorems featuring faster rates (stickiness) and slower rates (smeariness). The former may occur on NPC (nonpositive curvature) spaces, the latter on NNC (nonnegative curvature) spaces, where the distribution around cut loci of means plays a central role. Both effects may have serious practical applications.
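As a rough illustration of the Fréchet mean described in the abstract (a minimizer of expected squared distance), the following sketch approximates a sample Fréchet mean on the circle by a grid search. The sample angles, the grid size, and the function names are assumptions chosen for illustration; this is not the speaker's method, and it says nothing about the limit theorems themselves.

```python
import math

def circle_distance(a, b):
    """Arc-length (geodesic) distance between two angles given in radians."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def frechet_mean_circle(angles, grid_size=3600):
    """Grid point minimising the average squared geodesic distance to the sample."""
    best_point, best_value = None, float("inf")
    for k in range(grid_size):
        candidate = 2 * math.pi * k / grid_size
        value = sum(circle_distance(candidate, a) ** 2 for a in angles) / len(angles)
        if value < best_value:
            best_point, best_value = candidate, value
    return best_point, best_value

# Hypothetical sample: angles clustered around 0, one of them just below 2*pi.
sample = [0.1, 0.2, 0.3, 6.2]
mean_angle, frechet_value = frechet_mean_circle(sample)
print(mean_angle, frechet_value)
```

Note that the ordinary arithmetic mean of these angles (about 1.7) would land far from the cluster, which is why the minimization uses the circle's own distance.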

Stephan Huckemann received his degree in mathematics from the University of Giessen (Germany) in 1987. He was a visiting lecturer and scholar at the University of Michigan, Ann Arbor (1987 - 1989) and a postdoctoral research fellow at the ETH Zürich, Switzerland (1989 - 1990). He then worked as a commercial software developer (1990 - 2001) and returned to academia as a contributor to the computer algebra system MuPAD at Sciface Software and the University of Paderborn, Germany (2001 - 2003). Working with H. Ziezold (2004 - 2006, Univ. of Kassel), P. Mihailescu (2007, Univ. of Göttingen), P.T. Kim (2009, Univ. of Guelph, Canada) and A. Munk (2007 - 2010, Univ. of Göttingen) he completed his Habilitation at the Univ. of Göttingen (Germany) in 2010 and was awarded a DFG Heisenberg research fellowship. As such he continued at the Institute for Mathematical Stochastics, Univ. of Göttingen while being a research group leader at the Statistical and Applied Mathematical Sciences Institute (2010/11 SAMSI, Durham, NC, USA). After substituting (2012 - 2013) for the Chair of Stochastics and Applications at the Univ. of Göttingen he holds the new Chair for Non-Euclidean Statistics.

Friday September 7, 2pm, Carslaw 173

Steph de Silva




Friday August 10, 2pm, Carslaw 173

Yuya Sasaki
Vanderbilt University, Department of Economics

Inference for Moments of Ratios with Robustness against Large Trimming Bias and Unknown Convergence Rate

We consider statistical inference for moments of the form E[B/A]. A naive sample mean is unstable with small denominator, A. This paper develops a method of robust inference, and proposes a data-driven practical choice of trimming observations with small A. Our sense of the robustness is twofold. First, bias correction allows for robustness against large trimming bias. Second, adaptive inference allows for robustness against unknown convergence rate. The proposed method allows for closer-to-optimal trimming, and more informative inference results in practice. This practical advantage is demonstrated for inverse propensity score weighting through simulation studies and real data analysis.
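To make the instability concrete, here is a minimal sketch contrasting a naive sample mean of B/A with a simple fixed-threshold trimmed version. The data, the threshold, and the function names are hypothetical, and the sketch deliberately omits the bias correction and adaptive inference that the abstract proposes.

```python
def naive_ratio_mean(a_values, b_values):
    """Plain sample mean of B/A; unstable when some A is close to zero."""
    return sum(b / a for a, b in zip(a_values, b_values)) / len(a_values)

def trimmed_ratio_mean(a_values, b_values, threshold):
    """Sample mean of (B/A) * 1{|A| >= threshold}.

    Trimmed observations contribute zero, which stabilises the estimate
    but introduces a trimming bias. Returns (estimate, number trimmed).
    """
    kept = [b / a for a, b in zip(a_values, b_values) if abs(a) >= threshold]
    return sum(kept) / len(a_values), len(a_values) - len(kept)

# Hypothetical data: one denominator is very close to zero.
a = [0.001, 0.5, 1.0, 2.0]
b = [1.0, 1.0, 1.0, 1.0]

naive = naive_ratio_mean(a, b)
trimmed, n_trimmed = trimmed_ratio_mean(a, b, threshold=0.1)
print(naive, trimmed, n_trimmed)
```

A single small denominator dominates the naive mean, while the trimmed mean ignores it at the cost of bias; choosing the threshold and correcting that bias is the problem the talk addresses.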

Yuya Sasaki is an Associate Professor of Economics at Vanderbilt University. He received his bachelor’s and master’s degrees at Utah State University with majors in economics, geography, and mathematics. He received a Ph.D. in economics from Brown University. Yuya Sasaki was an assistant professor of economics at Johns Hopkins University, and then moved to Vanderbilt University as an associate professor. His field of specialization is econometrics. He is currently an associate editor of the Journal of Econometric Methods.

Friday July 20, 2pm, Carslaw 829

Hongyuan Cao
University of Missouri, Department of Statistics

Statistical Methods for Integrative Analysis of Multi-Omics Data

Genome-wide complex trait analysis (GCTA) was developed and applied to heritability analyses on complex traits and more recently extended to mental disorders. However, besides the intensive computation, previous literature also limits the scope to univariate phenotype, which ignores mutually informative but partially independent pieces of information provided in other phenotypes. Our goal is to use such auxiliary information to improve power. We show that the proposed method leads to a large power increase, while controlling the false discovery rate, both empirically and theoretically. Extensive simulations demonstrate the advantage of the proposed method over several state-of-the-art methods. We illustrate our methods on a dataset from a schizophrenia study.

Dr. Cao is an assistant professor of statistics at the University of Missouri-Columbia. She received her Ph.D. in statistics from UNC-Chapel Hill in 2010. She has published over 20 papers, several of which are in top statistics journals such as Biometrika, Journal of the American Statistical Association and Journal of the Royal Statistical Society, Series B. She serves as an associate editor of Biometrics. Her research interests include high dimensional and large scale statistical analysis, survival analysis, longitudinal data analysis and bioinformatics.

Friday July 13, 2pm, Carslaw 829

Johann Gagnon-Bartsch
University of Michigan, Department of Statistics

The LOOP Estimator: Adjusting for Covariates in Randomized Experiments

When conducting a randomized controlled trial, it is common to specify in advance, as part of the trial protocol, the statistical analyses that will be used to analyze the data. Typically these analyses will involve adjusting for small imbalances in baseline covariates. However, this poses a dilemma, since adjusting for too many covariates can hurt precision more than it helps, and it is often unclear which covariates are predictive of outcome prior to conducting the experiment. For example, both post-stratification and OLS regression adjustments can actually increase variance (relative to a simple difference in means) if too many covariates are used. OLS is also biased under the Neyman-Rubin model. Here we introduce the LOOP ("Leave-One-Out Potential outcomes") estimator of the average treatment effect. We leave out each observation and then impute that observation's treatment and control potential outcomes using a prediction algorithm, such as a random forest. This estimator is exactly unbiased under the Neyman-Rubin model, generally performs at least as well as the unadjusted estimator, and the experimental randomization largely justifies the statistical assumptions made. Importantly, the LOOP estimator also enables us to take advantage of automatic variable selection, and thus eliminates the guesswork of selecting covariates prior to conducting the trial.
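The leave-one-out idea can be sketched as follows. This is only an illustration under stated assumptions: the imputation step uses leave-one-out group means rather than a random forest, the assignment probability `p` is assumed known and equal for all units, and the rule for combining the imputed potential outcomes (the weighted average `m_hat` and the inverse-probability sign `u`) is an assumption that is not spelled out in the abstract. The function name and data are hypothetical.

```python
def loop_estimate(outcomes, treated, p):
    """Leave-one-out style estimate of the average treatment effect.

    outcomes: observed outcomes Y_i
    treated:  0/1 treatment indicators T_i
    p:        assumed known probability of assignment to treatment
    Requires at least two units in each arm so that every
    leave-one-out group is non-empty.
    """
    n = len(outcomes)
    total = 0.0
    for i in range(n):
        # Impute unit i's potential outcomes from all other units only.
        rest_t = [y for j, (y, t) in enumerate(zip(outcomes, treated)) if j != i and t == 1]
        rest_c = [y for j, (y, t) in enumerate(zip(outcomes, treated)) if j != i and t == 0]
        t_hat = sum(rest_t) / len(rest_t)  # imputed outcome under treatment
        c_hat = sum(rest_c) / len(rest_c)  # imputed outcome under control
        m_hat = (1 - p) * t_hat + p * c_hat
        # Inverse-probability weight with sign given by the actual assignment.
        u = 1 / p if treated[i] == 1 else -1 / (1 - p)
        total += (outcomes[i] - m_hat) * u
    return total / n

# Hypothetical data with a constant treatment effect of 2.
y = [3.0, 3.0, 1.0, 1.0]
t = [1, 1, 0, 0]
estimate = loop_estimate(y, t, p=0.5)
print(estimate)
```

Because unit i's own outcome never enters its imputed values, the imputation is independent of that unit's treatment assignment, which is the property the abstract relies on for unbiasedness.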

Johann Gagnon-Bartsch is an Assistant Professor of Statistics in the Department of Statistics at the University of Michigan. Gagnon-Bartsch received his bachelor’s degree from Stanford University with majors in Math, Physics, and International Relations. He completed a PhD at Berkeley in Statistics, and then spent three more years as a visiting assistant professor in the Berkeley Statistics department. Gagnon-Bartsch’s research focuses on causal inference, machine learning, and nonparametric methods with applications in the biological and social sciences.

2018 Semester 1

Friday June 8, 2pm, Carslaw 173

Janice Scealy
Australian National University, Research School of Finance, Actuarial Studies & Statistics

Scaled von Mises-Fisher distributions and regression models for palaeomagnetic directional data

We propose a new distribution for analysing palaeomagnetic directional data that is a novel transformation of the von Mises-Fisher distribution. The new distribution has ellipse-like symmetry, as does the Kent distribution; however, unlike the Kent distribution the normalising constant in the new density is easy to compute and estimation of the shape parameters is straightforward. To accommodate outliers, the model also incorporates an additional shape parameter which controls the tail-weight of the distribution. We also develop a general regression model framework that allows both the mean direction and the shape parameters of the error distribution to depend on covariates. To illustrate, we analyse palaeomagnetic directional data from the GEOMAGIA50.v3 database. We predict the mean direction at various geological time points and show that there is significant heteroscedasticity present. It is envisaged that the regression structures and error distribution proposed here will also prove useful when covariate information is available with (i) other types of directional response data; and (ii) square-root transformed compositional data of general dimension. This is joint work with Andrew T. A. Wood.

Dr Janice Scealy is a senior lecturer in statistics in the Research School of Finance, Actuarial Studies and Statistics, ANU and she is currently an ARC DECRA fellow. Her research interests include developing new statistical analysis methods for data with complicated constraints, including compositional data defined on the simplex, spherical data, directional data and manifold-valued data defined on more general curved surfaces.

Friday June 1, 2pm, Carslaw 173

Subhash Bagui
University of West Florida, Department of Mathematics and Statistics

Convergence of Known Distributions to Normality or Non-normality: An Elementary Ratio Technique

This talk presents an elementary informal technique for deriving the convergence of known discrete/continuous type distributions to limiting normal or non-normal distributions. The technique utilizes the ratio of the pmf/pdf at hand at two consecutive/nearby points. The presentation should be of interest to teachers and students of first year graduate level courses in probability and statistics.
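As a small numerical illustration of working with the ratio of a pmf at two consecutive points, the sketch below compares that ratio for a Binomial(n, λ/n) distribution with the corresponding Poisson(λ) ratio as n grows. The choice of distributions, the values of λ and k, and the function names are assumptions made for illustration; the talk's actual derivations are not reproduced here.

```python
def binomial_ratio(n, q, k):
    """pmf(k + 1) / pmf(k) for a Binomial(n, q) distribution."""
    return (n - k) / (k + 1) * q / (1 - q)

def poisson_ratio(lam, k):
    """pmf(k + 1) / pmf(k) for a Poisson(lam) distribution."""
    return lam / (k + 1)

lam, k = 3.0, 2
sizes = (10, 100, 1000, 10000)

# Gap between the binomial and Poisson consecutive-point ratios for each n.
gaps = [abs(binomial_ratio(n, lam / n, k) - poisson_ratio(lam, k)) for n in sizes]
for n, gap in zip(sizes, gaps):
    print(n, gap)
```

The gap shrinks as n increases, consistent with the familiar Poisson limit of the binomial when nq is held fixed.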

Subhash C. Bagui received his B.Sc. in Statistics from the University of Calcutta, M. Stat. from the Indian Statistical Institute and Ph.D. from the University of Alberta, Canada. He is currently a University Distinguished Professor at the University of West Florida. He has authored a book titled "Handbook of Percentiles of Non-central t-distribution", and published many high quality peer reviewed journal articles. He currently serves as an associate editor or editorial board member of several statistics journals. His research interests include nonparametric classification and clustering, statistical pattern recognition, machine learning, central limit theorem, and experimental designs. He is also a fellow of the American Statistical Association (ASA) and the Royal Statistical Society (RSS).

Friday May 18, 2pm, Room TBA

Honours talks
University of Sydney, School of Mathematics and Statistics

Friday May 4, 2pm, Carslaw 173

Nicholas Fisher
University of Sydney and ValueMetrics Australia

The Good, the Bad, and the Horrible: Interpreting Net-Promoter Score and the Safety Attitudes Questionnaire in the light of good market research practice

Net-Promoter Score (NPS) is a ubiquitous, easily-collected market research metric, having displaced many complete market research processes. Unfortunately, this has been its sole success. It possesses few, if any, of the characteristics that might be regarded as highly desirable in a high-level market research metric; on the contrary, it’s done considerable damage to companies, to their shareholders and to their customers. Given the current focus on the financial services sector and its systemic failures in delivering value to customers, it is high time to question reliance on NPS.

The Safety Attitudes Questionnaire is an instrument for assessing Safety Culture in the workplace, and is similarly widespread throughout industries where Safety is a critical issue. It has now been adapted to assess other forms of culture, such as Risk Culture. Unfortunately, it is also highly flawed, albeit for quite different reasons.

Examining these two methodologies through the lens of good market research practice brings their fundamental flaws into focus.

Nick Fisher has an honorary position as Visiting Professor of Statistics at the University of Sydney, and runs his own R&D consultancy specialising in Performance Measurement. Prior to taking up these positions in 2001, he was a Chief Research Scientist in CSIRO Mathematical and Information Sciences.

Friday March 23, 2pm, Carslaw 173

Mikaela Jorgensen
Australian Institute of Health Innovation, Macquarie University, Sydney, Australia

Using routinely collected data in aged care research: a grey area

When the Department of Health launched the My Aged Care website in 2013 they "severely under-estimated the proportion of enquiries and referrals they would receive by fax". Yes, that's fax machines in *2013*. However, electronic data systems are increasingly starting to be used in aged care.

This presentation will discuss the joys of using messy routinely collected datasets to examine the care and outcomes of people using aged care services.

Does pressure injury incidence differ between residential aged care facilities? Is home care service use associated with time to entry into residential aged care? These questions, and more, will be discussed.

We'll take a dive into some multilevel mixed effects models, and resurface with some risk-adjusted funnel plots. People from all backgrounds with an interest in data analysis welcome.

Dr Mikaela Jorgensen is a health services researcher at the Australian Institute of Health Innovation, Macquarie University. She has followed the traditional career pathway from speech pathologist to analyst of linked routinely collected health datasets for the last five years.

Friday March 16, 2pm, Carslaw 173

Jake Olivier
School of Mathematics and Statistics, UNSW, Sydney, Australia

The importance of statistics in shaping public policy

Statisticians have an important role to play in shaping public policy. Public discourse can often be divisive and emotive, and it can be difficult for the uninitiated to sift through the morass of "fake" and "real" news. Decisions need to be well-informed and statisticians should be leaders in identifying relevant data for a testable hypothesis using appropriate methodology. It is also important for a statistician to identify when the data or methods used are not up to the task or when there is too much uncertainty to make accurate decisions. I will discuss some examples from my own research. This includes, in ascending order of controversy, graduated licensing schemes, claims made by Australian politicians, gun control, and bicycle helmet laws. I will also discuss some methodological challenges in evaluating interventions including regression to the mean for Poisson processes.

Associate Professor Jake Olivier is a member of the School of Mathematics and Statistics at UNSW Sydney. He is originally from New Orleans and spent many years living in Mississippi for graduate school and early academic appointments. A/Prof Olivier is the incoming president of the NSW Branch of the Statistical Society of Australia and immediate past chair of the Biostatistics Section. He serves on the editorial boards of BMJ Open, PLOS ONE, Cogent Medicine and the Journal of the Australasian College of Road Safety. His research interests are cycling safety, the analysis of categorical data and methods for evaluating public health interventions.

Friday March 9

Tim Swartz
Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC Canada

A buffet of problems in sports analytics

This talk explores some work that I have done and some topics that you may find interesting with respect to statistics in sport. I plan on discussing a number of problems with almost no discussion of technical details. Some of the sports include hockey, cricket, highland dance, soccer and golf.

Friday February 9

Maria-Pia Victoria-Feser
Research Center for Statistics, Geneva School of Economics and Management, University of Geneva

A prediction divergence criterion for model selection and classification in high dimensional settings

A new class of model selection criteria is proposed which is suited for stepwise approaches or can be used as selection criteria in penalized estimation based methods. This new class, called the d-class of error measure, generalizes Efron's q-class. This class not only contains classical criteria such as Mallows' Cp or the AIC, but also enables one to define new criteria that are more general. Within this new class, we propose a model selection criterion based on a prediction divergence between two nested models' predictions that we call the Prediction Divergence Criterion (PDC). The PDC provides a different measure of prediction error than a criterion associated with each potential model within a sequence and for which the selection decision is based on the sign of differences between the criteria. The PDC directly measures the prediction error divergence between two nested models. As examples, we consider the linear regression models and (supervised) classification. We show that a selection procedure based on the PDC, compared to the Cp (in the linear case), has a smaller probability of overfitting, hence leading to parsimonious models for the same out-of-sample prediction error. The PDC is particularly well suited in high dimensional and sparse situations and also under (small) model misspecifications. Examples on a malnutrition study and on acute leukemia classification will be presented.

2017 Semester 2

Date: 8th of December, 2017
Richard Hunt
University of Sydney
Location: Carslaw 173
Title: A New Look at Gegenbauer Long Memory Processes
In this presentation we will look at Long Memory and Gegenbauer Long Memory processes, and methods for estimation of the parameters of these models. After a review of the history of the development of these processes, and some of the personalities involved, we will introduce a new method for the estimation of almost all the parameters of a k-factor Gegenbauer/GARMA process. The method essentially attempts to find parameters for the spectral density to ensure it most closely matches the (smoothed) periodogram. Simulations indicate that the new method has a similar level of accuracy to existing methods (Whittle, Conditional Sum-of-squares), but can be evaluated considerably faster, whilst making few distributional assumptions on the data.

Date: 24th of November, 2017
Prof. Sally Cripps
University of Sydney
Location: Carslaw 173
Title: A spatio-temporal mixture model for Australian daily rainfall, 1876–2015: Modeling daily rainfall over the Australian continent
Daily precipitation has an enormous impact on human activity, and the study of how it varies over time and space, and what global indicators influence it, is of paramount importance to Australian agriculture. The topic is complex and would benefit from a common and publicly available statistical framework that scales to large data sets. We propose a general Bayesian spatio-temporal mixture model accommodating mixed discrete-continuous data. Our analysis uses over 294 million daily rainfall measurements since 1876, spanning 17,606 rainfall measurement sites. The size of the data calls for a parsimonious yet flexible model as well as computationally efficient methods for performing the statistical inference. Parsimony is achieved by encoding spatial, temporal and climatic variation entirely within a mixture model whose mixing weights depend on covariates. Computational efficiency is achieved by constructing a Markov chain Monte Carlo sampler that runs in parallel in a distributed computing framework. We present examples of posterior inference on short-term daily component classification, monthly intensity levels, offsite prediction of the effects of climate drivers and long-term rainfall trends across the entire continent. Computer code implementing the methods proposed in this paper is available as an R package.

Date: 22nd of November, 2017
Speaker: Charles Gray
La Trobe
Location: Carslaw 173
Title: The Curious Case of the Disappearing Coverage: a detective story in visualisation
Do you identify as a member of the ggplot cohort of statisticians? Did you or your students learn statistics in the era of visualisation tools such as R's ggplot package? Would it have made a difference to how you engaged with statistical theory? In this talk, I'll reflect on learning statistics at the same time as visualisation, at the half-way point in my doctoral studies. I'll share how we solved some counterintuitive coverage probability simulation results through visualisation. I see this as an opportunity to generate discussion and learn from you: questions, comments, and a generally rowdy atmosphere are most welcome.

Date: 20th of October, 2017
Time: 1.15-3pm
Location: Access Grid Room
Interview seminar

Date: 13th of October, 2017
Time: 10-12pm
Location: Carslaw 535
Interview seminar

Date: 13th of October, 2017
No seminar (Honours presentations date)

Date: 6th of October, 2017
Kim-Anh Le Cao
University of Melbourne
Location: Carslaw 173
Time: 2-3pm
Title: Challenges in microbiome data analysis (also known as "poop analyses")

Our recent breakthroughs and advances in culture independent techniques, such as shotgun metagenomics and 16S rRNA amplicon sequencing, have dramatically changed the way we can examine microbial communities. But does the hype of microbiome outweigh the potential of our understanding of this ‘second genome’? There are many hurdles to tackle before we are able to identify and compare bacteria driving changes in their ecosystem. In addition to the bioinformatics challenges, current statistical methods are limited in their ability to make sense of these complex data that are inherently sparse, compositional and multivariate.

I will discuss some of the topical challenges in 16S data analysis, including the presence of confounding variables and batch effects, some experimental design considerations, and share my own personal story on how a team of rogue statisticians conducted their own mice microbiome experiment leading to somewhat surprising results! I will also present our latest analyses to identify multivariate microbial signatures in immune-mediated diseases and discuss what are the next analytical challenges I envision.

This presentation will combine the results of exciting and highly collaborative works between a team of eager data analysts, immunologists and microbiologists. For once, the speaker will abstain from talking about data integration, or mixOmics (oops! but if you are interested keep an eye out in PLOS Comp Biol).

Dr Kim-Anh Lê Cao (NHMRC career development fellow, Senior Lecturer) recently joined the University of Melbourne (Centre for Systems Genomics and School of Mathematics and Statistics). She was awarded her PhD from the Université de Toulouse, France and moved to Australia as a postdoctoral research fellow at the Institute for Molecular Bioscience, University of Queensland. She was hired as a researcher and consultant at QFAB Bioinformatics where she developed a multidisciplinary approach to her research. Between 2014 and 2017 she led a computational biostatistics group at the biomedical research UQ Diamantina Institute. Dr Kim-Anh Lê Cao is an expert in multivariate statistical methods and novel developments. Since 2009, her team has been working on implementing the R toolkit mixOmics dedicated to the integrative analysis of 'omics' data to help researchers mine and make sense of biological data.

Date: 22nd of September, 2017
Speaker: Sharon Lee
University of Queensland
Location: Carslaw 173
Title: Clustering and classification of batch data
Motivated by the analysis of batch cytometric data, we consider the problem of jointly modelling and clustering multiple heterogeneous data samples. Traditional mixture models cannot be applied directly to these data. Intuitive approaches such as pooling and post-hoc cluster matching fail to account for the variations between the samples. In this talk, we consider a hierarchical mixture model approach to handle inter-sample variations. The adoption of a skew mixture model with random effects terms for the location parameter allows for the simultaneous clustering and matching of clusters across the samples. In the case where data from multiple classes of objects are available, this approach can be further extended to perform classification of new samples into one of the predefined classes. Examples with real cytometry data will be given to illustrate this approach.

Date: 15th of September, 2017
Speaker: Emi Tanaka
University of Sydney
Location: Carslaw 173
Title: Outlier detection for a complex linear mixed model: an application to plant breeding trials
Outlier detection is an important preliminary step in data analysis, often conducted through a form of residual analysis. Complex data, such as those analysed by linear mixed models, give rise to distinct levels of residuals and thus offer additional challenges for the development of an outlier detection method. Plant breeding trials are routinely conducted over years and multiple locations with the aim of selecting the best genotypes as parents or for commercial release. These so-called multi-environmental trials (MET) are commonly analysed using linear mixed models which may include cubic splines and autoregressive processes to account for spatial trends. We consider some statistics derived from the mean and variance shift outlier models (MSOM/VSOM) and the generalised Cook's distance (GCD) for outlier detection. We present a simulation study based on a set of real wheat yield trials.

Date: 11th of August, 2017
Speaker: Ming Yuan
University of Wisconsin-Madison
Location: Carslaw 173
Title: Quantitation in Colocalization Analysis: Beyond "Red + Green = Yellow"
"I see yellow; therefore, there is colocalization." Is it really so simple when it comes to colocalization studies? Unfortunately, and fortunately, no. Colocalization is in fact a supremely powerful technique for scientists who want to take full advantage of what optical microscopy has to offer: quantitative, correlative information together with spatial resolution. Yet, methods for colocalization have been put into doubt now that images are no longer considered simple visual representations. Colocalization studies have notoriously been subject to misinterpretation due to difficulties in robust quantification and, more importantly, reproducibility, which results in a constant source of confusion, frustration, and error. In this talk, I will share some of our effort and progress to ease such challenges using novel statistical and computational tools.

Bio: Ming Yuan is Senior Investigator at Morgridge Institute for Research and Professor of Statistics at Columbia University and University of Wisconsin-Madison. He was previously Coca-Cola Junior Professor in the H. Milton Stewart School of Industrial and Systems Engineering at Georgia Institute of Technology. He received his Ph.D. in Statistics and M.S. in Computer Science from University of Wisconsin-Madison. His main research interests lie in theory, methods and applications of data mining and statistical learning. Dr. Yuan has served on the editorial boards of various top journals including The Annals of Statistics, Bernoulli, Biometrics, Electronic Journal of Statistics, Journal of the American Statistical Association, Journal of the Royal Statistical Society Series B, and Statistical Science. Dr. Yuan was awarded the John van Ryzin Award in 2004 by ENAR, the CAREER Award in 2009 by NSF, and the Guy Medal in Bronze from the Royal Statistical Society in 2014. He was also named a Fellow of IMS in 2015, and a Medallion Lecturer of IMS.

Date: 13th of July, 2017
Speaker: Irene Gijbels
University of Leuven (KU Leuven)
Location: AGR Carslaw 829
Title: Robust estimation and variable selection in linear regression
In this talk the interest is in robust procedures to select variables in a multiple linear regression modeling context. Throughout the talk the focus is on how to adapt the nonnegative garrote selection method to get to a robust variable selection method. We establish estimation and variable selection consistency properties of the developed method, and discuss robustness properties such as breakdown point and influence function. In a second part of the talk the focus is on heteroscedastic linear regression models, in which one also wants to select the variables that influence the variance part. Methods for robust estimation and variable selection are discussed, and illustrations of their influence functions are provided. Throughout the talk examples are given to illustrate the practical use of the methods.

Date: 30th of June, 2017
Speaker: Ines Wilms
University of Leuven (KU Leuven)
Location: AGR Carslaw 829
Title: Sparse cointegration
Cointegration analysis is used to estimate the long-run equilibrium relations between several time series. The coefficients of these long-run equilibrium relations are the cointegrating vectors. We provide a sparse estimator of the cointegrating vectors. Sparsity means that some elements of the cointegrating vectors are estimated as exactly zero. The sparse estimator is applicable in high-dimensional settings, where the time series length is short relative to the number of time series. Our method achieves better estimation accuracy than the traditional Johansen method in sparse and/or high-dimensional settings. We use the sparse method for interest rate growth forecasting and consumption growth forecasting. We show that forecast performance can be improved by sparsely estimating the cointegrating vectors.

Joint work with Christophe Croux.

Date: 19th of May, 2017
Speaker: Dianne Cook
Monash University
Title: The glue that binds statistical inference, tidy data, grammar of graphics, data visualisation and visual inference

Buja et al (2009) and Majumder et al (2012) established and validated protocols that place data plots into the statistical inference framework. This is combined with the conceptual grammar of graphics initiated by Wilkinson (1999), and refined and made popular in the R package ggplot2 (Wickham, 2016), which builds plots using a functional language. The tidy data concepts made popular with the R packages tidyr (Wickham, 2017) and dplyr (Wickham and Francois, 2016) complete the mapping from random variables to plot elements.

Visualisation plays a large role in data science today. It is important for exploring data and detecting unanticipated structure. Visual inference provides the opportunity to assess discovered structure rigorously, using p-values computed by crowd-sourcing lineups of plots. Visualisation is also important for communicating results, and we often agonise over different choices in plot design to arrive at a final display. Treating plots as statistics, we can make power calculations to objectively determine the best design.

This talk will be interactive. Email your favourite plot to ahead of time. We will work in groups to break the plot down in terms of the grammar, relate this to random variables using tidy data concepts, determine the intended null hypothesis underlying the visualisation, and hence structure it as a hypothesis test. Bring your laptop, so we can collaboratively do this exercise.

Joint work with Heike Hofmann, Mahbubul Majumder and Hadley Wickham

Date: 5th of May, 2017
Speaker: Peter Straka
University of New South Wales
Title: Extremes of events with heavy-tailed inter-arrival times

Heavy-tailed inter-arrival times are a signature of "bursty" dynamics, and have been observed in financial time series, earthquakes, solar flares and neuron spike trains. We propose to model extremes of such time series via a "Max-Renewal process" (aka "Continuous Time Random Maxima process"). Due to geometric sum-stability, the inter-arrival times between extremes are attracted to a Mittag-Leffler distribution: As the threshold height increases, the Mittag-Leffler shape parameter stays constant, while the scale parameter grows like a power-law. Although the renewal assumption is debatable, this theoretical result is observed for many datasets. We discuss approaches to fit model parameters and assess uncertainty due to threshold selection.

Date: 28th of April, 2017
Speaker: Botond Szabo
Leiden University
Title: An asymptotic analysis of nonparametric distributed methods

In recent years, datasets in certain applications have become so large that it is infeasible, or computationally undesirable, to carry out the analysis on a single machine. This gave rise to divide-and-conquer algorithms, where the data is distributed over several 'local' machines and the computations are done on these machines in parallel. The outcomes of the local computations are then aggregated into a global result on a central machine. Over the years various divide-and-conquer algorithms have been proposed, many of them with limited theoretical underpinning. First we compare the theoretical properties of a (non-exhaustive) list of proposed methods on the benchmark nonparametric signal-in-white-noise model. Most of the investigated algorithms use information on aspects of the underlying true signal (for instance regularity), which is usually not available in practice. A central question is whether one can tune the algorithms in a data-driven way, without using any additional knowledge about the signal. We show that (a list of) standard data-driven techniques (both Bayesian and frequentist) cannot recover the underlying signal at the minimax rate. This, however, does not imply the non-existence of an adaptive distributed method. To address the theoretical limitations of data-driven divide-and-conquer algorithms, we consider a setting where the amount of information sent between the local and central machines is expensive and limited. We show that it is not possible to construct data-driven methods that adapt to the unknown regularity of the underlying signal and at the same time communicate the optimal amount of information between the machines. This is joint work with Harry van Zanten.
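The divide-and-conquer scheme in the abstract can be sketched in a few lines. This is a minimal illustration and not the speaker's method. It assumes a sequence form of the signal-in-white-noise model in which each of `m` machines sees every coefficient of the signal with noise standard deviation `sqrt(m / n)`, a simple truncation estimator on each machine, and plain averaging on the central machine. All function names and the `cutoff` parameter are hypothetical.

```python
import random


def local_estimate(observations, cutoff):
    """Local machine: keep the first `cutoff` coefficients, zero the rest."""
    return [y if i < cutoff else 0.0 for i, y in enumerate(observations)]


def aggregate(local_estimates):
    """Central machine: average the local estimates coefficient by coefficient."""
    m = len(local_estimates)
    return [sum(column) / m for column in zip(*local_estimates)]


def simulate(theta, n, m, cutoff, seed=0):
    """Distribute noisy copies of `theta` over m machines and aggregate."""
    rng = random.Random(seed)
    noise_sd = (m / n) ** 0.5  # assumed noise level per machine
    estimates = []
    for _ in range(m):
        observed = [t + noise_sd * rng.gauss(0.0, 1.0) for t in theta]
        estimates.append(local_estimate(observed, cutoff))
    return aggregate(estimates)


estimate = simulate(theta=[1.0, 0.5, 0.25, 0.125], n=1000, m=10, cutoff=2)
```

Here `cutoff` is fixed by hand. Choosing it well requires knowing how quickly the coefficients of the true signal decay, which is the regularity information the abstract notes is usually unavailable in practice.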

About the speaker:

Botond Szabo is an Assistant Professor at the University of Leiden, The Netherlands. Botond received his PhD in Mathematical Statistics from the Eindhoven University of Technology, the Netherlands, in 2014 under the supervision of Prof.dr. Harry van Zanten and Prof.dr. Aad van der Vaart. His research interests cover Nonparametric Bayesian Statistics, Adaptation, Asymptotic Statistics, Operations Research and Graph Theory. He received the Savage Award in Theory and Methods: Runner up for the best PhD dissertation in the field of Bayesian statistics and econometrics in the category Theory and Methods, and the "Van Zwet Award" for the best PhD dissertation in the Netherlands in Statistics and Operations Research 2015. He is an Associate Editor of Bayesian Analysis.

Date: 7th of April, 2017
Speaker: John Ormerod
Sydney University
Title: Bayesian hypothesis tests with diffuse priors: Can we have our cake and eat it too?

We introduce a new class of priors for Bayesian hypothesis testing, which we name "cake priors". These priors circumvent Bartlett's paradox (also called the Jeffreys-Lindley paradox), the problem that diffuse priors can lead to nonsensical statistical inferences. Cake priors allow the use of diffuse priors (having one's cake) while achieving theoretically justified inferences (eating it too). Lindley's paradox will also be discussed. A novel construct involving a hypothetical data-model pair will be used to extend cake priors to handle the case where there are zero free parameters under the null hypothesis. The resulting test statistics take the form of a penalized likelihood ratio test statistic. By considering the sampling distribution under the null and alternative hypotheses we show (under certain assumptions) that these Bayesian hypothesis tests are strongly Chernoff-consistent, i.e., achieve zero type I and II errors asymptotically. This contrasts sharply with classical tests, where the level of the test is held constant and which are therefore not Chernoff-consistent.

Joint work with: Michael Stewart, Weichang Yu, and Sarah Romanes.

Date: 7th of April, 2017
Speaker: Shige Peng
Shandong University
Title: Data-based Quantitative Analysis under Nonlinear Expectations

Traditionally, a real-life random sample is often treated as measurements resulting from an i.i.d. sequence of random variables or, more generally, as an outcome of either linear or nonlinear regression models driven by an i.i.d. sequence. In many situations, however, this standard modeling approach fails to address the complexity of real-life random data. We argue that it is necessary to take into account the uncertainty hidden inside random sequences that are observed in practice.

To deal with this issue, we introduce a robust nonlinear expectation to quantitatively measure and calculate this type of uncertainty. The corresponding fundamental concept of a 'nonlinear i.i.d. sequence' is used to model a large variety of real-world random phenomena. We give a robust and simple algorithm, called 'phi-max-mean', which can be used to measure this type of uncertainty, and we show that it provides an asymptotically optimal unbiased estimator of the corresponding nonlinear distribution.
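The abstract does not spell out the algorithm, so the following is only one possible reading of the name 'phi-max-mean', offered as an assumption: split the sample into consecutive blocks, average a test function phi over each block, and report the largest block average as an estimate of the upper expectation of phi. The function name, the `block_size` parameter and the blocking rule are hypothetical.

```python
def phi_max_mean(sample, block_size, phi=lambda x: x):
    """Largest block average of phi over consecutive, non-overlapping blocks.

    Assumed reading of 'phi-max-mean'; a trailing incomplete block is dropped.
    """
    if block_size <= 0 or block_size > len(sample):
        raise ValueError("block_size must be between 1 and len(sample)")
    blocks = [
        sample[start:start + block_size]
        for start in range(0, len(sample) - block_size + 1, block_size)
    ]
    return max(sum(phi(x) for x in block) / len(block) for block in blocks)


# Blocks [1, 2], [3, 4], [5, 6] have means 1.5, 3.5 and 5.5.
upper_mean = phi_max_mean([1, 2, 3, 4, 5, 6], block_size=2)
```

For a classical i.i.d. sample the block averages all estimate the same mean. When the mean drifts between blocks, the maximum reflects that spread, which is the kind of uncertainty the abstract argues a single linear expectation misses.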

Date: 17th of March, 2017
Speaker: Joe Neeman
University of Texas Austin
Title: Gaussian vectors, half-spaces, and convexity

Let \(A\) be a subset of \(R^n\) and let \(B\) be a half-space with the same Gaussian measure as \(A\). For a pair of correlated Gaussian vectors \(X\) and \(Y\), \(\mathrm{Pr}(X \in A, Y \in A)\) is smaller than \(\mathrm{Pr}(X \in B, Y \in B)\); this was originally proved by Borell, who also showed various other extremal properties of half-spaces. For example, the exit time of an Ornstein-Uhlenbeck process from \(A\) is stochastically dominated by its exit time from \(B\).

We will discuss these (and other) inequalities using a kind of modified convexity.
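The first inequality in the abstract can be checked numerically in one dimension. This is a rough Monte Carlo sketch, with all specific choices made for illustration only: \(A = [-a, a]\) is a symmetric interval, \(B = (-\infty, b]\) is a half-line with the same Gaussian measure, and \(X\), \(Y\) are standard normal with positive correlation `rho`. None of these choices comes from the talk.

```python
import random
from statistics import NormalDist


def joint_probabilities(rho=0.5, measure=0.5, n_samples=100_000, seed=1):
    """Estimate Pr(X in A, Y in A) and Pr(X in B, Y in B) by simulation."""
    std_normal = NormalDist()
    # A = [-a, a] and B = (-inf, b] both have Gaussian measure `measure`.
    a = std_normal.inv_cdf(0.5 + measure / 2)
    b = std_normal.inv_cdf(measure)
    rng = random.Random(seed)
    in_a = in_b = 0
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)
        # Y is standard normal with correlation rho to X.
        y = rho * x + (1 - rho**2) ** 0.5 * rng.gauss(0.0, 1.0)
        in_a += (abs(x) <= a) and (abs(y) <= a)
        in_b += (x <= b) and (y <= b)
    return in_a / n_samples, in_b / n_samples


prob_interval, prob_half_line = joint_probabilities()
```

For `rho = 0.5` and measure one half, the half-line probability has the closed form \(1/4 + \arcsin(\rho)/(2\pi) = 1/3\), which gives a quick sanity check on the simulation.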

Date: 3rd of March, 2017
Speaker: Ron Shamir
Tel Aviv University
Title: Modularity, classification and networks in analysis of big biomedical data

Supervised and unsupervised methods have been used extensively to analyze genomics data, with mixed results. On one hand, new insights have led to new biological findings. On the other hand, analysis results were often not robust. Here we take a look at several such challenges from the perspectives of networks and big data. Specifically, we ask if and how the added information from a biological network helps in these challenges. We show both examples where the network-added information is invaluable, and others where it is questionable. We also show that by collectively analyzing omic data across multiple studies of many diseases, robustness greatly improves.

Date: 31st of January, 2017
Speaker: Genevera Allen
Rice University
Title: Networks for Big Biomedical data

Cancer and neurological diseases are among the top five causes of death in Australia. The good news is new Big Data technologies may hold the key to understanding causes and possible cures for cancer as well as understanding the complexities of the human brain.

Join us on a voyage of discovery as we highlight how data science is transforming medical research: networks can be used to visualise and mine big biomedical data, disrupting neurological diseases. Using real case studies, see how cutting-edge data science is bringing us closer than ever before to major medical breakthroughs.

Enquiries about the Statistics Seminar should be directed to the organiser Garth Tarr.