University of Sydney Statistics Seminar Series 2014

Unless otherwise specified, seminars will be held on Fridays at 2pm in Carslaw 173.

7th March 2014: Sanjaya Dissanayake
School of Mathematics and Statistics, The University of Sydney
Title: Gegenbauer Processes with Long Memory and Extensions
This talk presents an approximation of a Gegenbauer autoregressive moving average (GARMA) process driven by Gaussian white noise. The process, which is characterised by long memory, is approximated by a finite order moving average. Using a derived state space form, the parameters are estimated by pseudo maximum likelihood via the Kalman filter. This approximation is first compared with a finite order autoregressive approximation in terms of feasibility and ease of establishing the order of the model. An extensive Monte Carlo experiment shows that the optimal order is not very large (around 35) and is rather insensitive to the series length. A rolling forecasting experiment is performed to validate the choice of the order of approximation in terms of predictive accuracy. The proposed state space methodology is applied to two different yearly sunspot series and compared with other conventional and hybrid time series methods in the literature. The effect of a seasonal filter on GARMA processes is also examined.

This methodology is extended to the class of GARMA models with Generalized Autoregressive Conditionally Heteroskedastic (GARCH) errors. Finally, a unit root test of the index of a Gegenbauer polynomial in terms of the pseudo maximum likelihood estimator is also considered.

Joint work with Professors M.S. Peiris, T. Proietti and Q. Wang.

21st March 2014: Zdravko Botev
The University of New South Wales
Title: When normality is a problem: A reliable method for evaluating the cumulative distribution function of the multivariate normal distribution.
The normal distribution is one of the most significant and beautiful mathematical objects. Had the normal law been known to the ancient Pythagoreans, they would have venerated it on a par with prime numbers. For a single Gaussian random variable, a mathematician can use the error function erf(x) to find any probability. However, there is no simple, effective method to calculate probabilities associated with a Gaussian vector with a given full covariance matrix. To find these probabilities, one has to compute an intractable high-dimensional integral. This talk suggests a good Monte Carlo estimator that leverages two different variational approximations.
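
As a rough illustration of the contrast drawn in this abstract, the sketch below evaluates the univariate normal CDF through the error function and then estimates a bivariate normal probability by crude Monte Carlo. It is not the estimator proposed in the talk; the function names `normal_cdf` and `bivariate_normal_cdf_mc` and the choice of a two-dimensional example are assumptions made here for illustration only.

```python
import math
import random


def normal_cdf(x):
    """Standard normal CDF via the error function: Phi(x) = (1 + erf(x / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bivariate_normal_cdf_mc(a, b, rho, n=20000, rng=None):
    """Crude Monte Carlo estimate of P(X1 <= a, X2 <= b) for standard
    normal X1, X2 with correlation rho (hypothetical helper, not the
    method from the talk)."""
    rng = rng or random.Random()
    scale = math.sqrt(1.0 - rho * rho)
    hits = 0
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        # Correlate the two coordinates using the 2x2 Cholesky factor.
        x1 = z1
        x2 = rho * z1 + scale * z2
        if x1 <= a and x2 <= b:
            hits += 1
    return hits / n
```

Crude sampling of this kind becomes unreliable when the target probability is very small or the dimension is large, which is the setting where a more careful estimator is needed.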

28th March 2014: Pierre Del-Moral
The University of New South Wales
Title: Particle Monte Carlo methods in statistical learning and rare event simulation
In the last three decades, there has been a dramatic increase in the use of particle methods as a powerful tool in real-world applications of Monte Carlo simulation in computational physics, population biology, computer sciences, and statistical machine learning. Ideally suited to parallel and distributed computation, these advanced particle algorithms include nonlinear interacting jump diffusions; quantum, diffusion, and resampled Monte Carlo methods; Feynman-Kac particle models; genetic and evolutionary algorithms; sequential Monte Carlo methods; adaptive and interacting Markov chain Monte Carlo models; bootstrapping methods; ensemble Kalman filters; and interacting particle filters.

This lecture presents a comprehensive treatment of mean field particle simulation models and interdisciplinary research topics, including sequential Monte Carlo methodologies, genetic particle algorithms, genealogical tree-based algorithms, and quantum and diffusion Monte Carlo methods.

Along with covering refined convergence analysis of particle algorithms, we also discuss applications related to parameter estimation in hidden Markov chain models, stochastic optimization, nonlinear filtering and multiple target tracking, calibration and uncertainty propagation in numerical codes, rare event simulation, financial mathematics, and free energy and quasi-invariant measures arising in computational physics and dynamic population biology.

This presentation shows how mean field particle simulation has revolutionized the field of Monte Carlo integration and stochastic algorithms. It will help theoretical probability researchers, applied statisticians, biologists, statistical physicists, and computer scientists work better across their own disciplinary boundaries.

11th April 2014: Yuliya Karpievitch
The University of Tasmania
Title: Quantitative Analysis of Mass Spectrometry Data
Quantification of liquid chromatography mass spectrometry (LC-MS) data is complicated by missing values. Some are easier to deal with than others. For example, values missing completely at random can be ignored or imputed based on the observed peptide abundances. Left-censored values that fall below the detection limit are harder to deal with, as observed values are not representative of the missing ones. Here I will present a statistical model that takes into account the 'missingness' mechanism and imputes the values accordingly. I will also discuss a normalisation method that removes biases of arbitrary complexity.

16th May 2014: Ngoc Tram
Austin, Texas
Title: Size-biased permutation for a finite i.i.d sequence
Line up n blocks of lengths X_1, ..., X_n, where the X_i's are independent and identically distributed (i.i.d.) positive random variables. Throw a ball uniformly at random on the interval from 0 to X_1 + ... + X_n, record the length of the block X_i that the ball falls in, and then remove this block. This is called size-biased sampling, since the bigger blocks are more likely to be discovered earlier. Do this recursively n times. This yields a size-biased permutation of the finite i.i.d. sequence (X_1, ..., X_n).
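
The sampling procedure just described can be sketched in a few lines of Python. This is a minimal illustration written for this listing, not code from the talk; the function name `size_biased_permutation` and the optional `rng` argument are assumptions.

```python
import random


def size_biased_permutation(lengths, rng=None):
    """Return the block lengths in the order they are discovered
    by repeated size-biased sampling without replacement."""
    rng = rng or random.Random()
    remaining = list(lengths)
    order = []
    while remaining:
        total = sum(remaining)
        # Throw a ball uniformly on the interval [0, total].
        u = rng.uniform(0.0, total)
        cumulative = 0.0
        chosen = len(remaining) - 1  # fallback guards against float round-off
        for i, x in enumerate(remaining):
            cumulative += x
            if u < cumulative:
                chosen = i
                break
        # Record the block the ball fell in, then remove it.
        order.append(remaining.pop(chosen))
    return order
```

For example, `size_biased_permutation([rng.expovariate(1.0) for _ in range(10)], rng)` with `rng = random.Random(0)` returns the ten simulated block lengths in size-biased order; longer blocks tend to appear earlier in the result.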

This model is known as Kingman's paintbox, put forward by Kingman in 1976 to study random infinite partitions. In the infinite version of the above model, the X_i's are jumps of a subordinator. In 1992, Perman, Pitman and Yor derived various distributional properties of this infinite size-biased permutation. Their work found applications in species sampling, oil and gas discovery, topic modeling in Bayesian statistics, amongst many others.

In this talk, we will derive distributional properties of finite i.i.d size-biased permutation, both for fixed and asymptotic n. We have multiple derivations using tools from Perman-Pitman-Yor, as well as the induced order statistics literature. Their comparisons lead to new results, as well as simpler proofs of existing ones. Our main contribution describes the asymptotic distribution of the last few terms in a finite i.i.d size-biased permutation via a Poisson coupling with its few smallest order statistics. For example, we will answer the question: what is the asymptotic probability that the smallest block is discovered last?

30th May 2014: Ian Marschner
Macquarie University
Title: Constrained GLMs using combinatorial EM algorithms with biostatistical applications
Motivated by applications in biostatistics, I will discuss computational methods for various constrained generalized linear models (GLMs) in which the linear predictor cannot range over the entire real line. Common examples include the binomial model with log or identity link, and the Poisson model with identity link. These models are important in biostatistics for obtaining adjusted relative risks, risk differences and rate differences. I will begin by illustrating the surprisingly unstable iterative behavior exhibited by conventional GLM software (primarily using R). This instability stems from the fact that Fisher scoring may have a repelling fixed point for such non-canonical models, which can induce periodicity and chaos in the iterative sequence. I will then discuss a class of algorithms called combinatorial EM (CEM) algorithms, which are an extension of the standard EM algorithm. CEM algorithms provide a stable alternative to standard GLM algorithms and are particularly suited to semi-parametric extensions through generalized additive models. I will primarily use the log link binomial model as a case study, including some practical data analysis examples, but I will also mention how CEM algorithms apply to other models.

6th June 2014: Michael Stewart
Sydney University
Title: Mixture Detection for Exponential Families
We discuss the deceptively rich parametric testing problem of distinguishing between one-component and two-component mixtures. We start with the simple case of normal location mixtures, tracing the history of work on this problem from the discovery of a slowly diverging log-likelihood ratio statistic in the 1980s, through to its limiting extreme-value behaviour in the 1990s and early 2000s. We then discuss some interesting applications related to signal detection, multiple testing and variable selection in high-dimensional regression and classification problems, including the introduction of the higher criticism procedure. Finally, we discuss recent work extending some limiting distribution and local power results beyond normal location mixtures to mixtures of general one-parameter exponential families, including a special role played by variance-stabilising transformations. We also indicate how certain results may be extended to a more general class of models which include change-point and hidden Markov models.

13th June 2014: Ciprian A. Tudor
Paris 1
Title: Stein method and Malliavin calculus: the basics and some new results
The Stein method allows one to measure the distance between the laws of two random variables. Recently, this method, combined with the Malliavin calculus, has led to several interesting results related to the approximation of the normal distribution. We will present the basic facts related to this theory and give some recent applications to limit theorems.

1st August 2014: Rachel Wang
Title: New gene coexpression measures in large heterogenous samples using count statistics
With the advent of high-throughput technologies making large-scale gene expression data readily available, developing appropriate computational tools to process these data and distill insights into systems biology has been an important part of the Big Data challenge. Gene coexpression is one of the earliest techniques developed that is still widely in use for functional annotation, pathway analysis and, most importantly, the reconstruction of gene regulatory networks, based on gene expression data. However, most coexpression measures do not specifically account for local features in expression profiles. For example, it is very likely that the patterns of gene association may change or only exist in a subset of the samples, especially when the samples are pooled from a range of experiments. We propose two new gene coexpression statistics based on counting local patterns of gene expression ranks to take into account the potentially diverse nature of gene interactions. In particular, one of our statistics is designed for time-course data with local dependence structures, such as time series coupled over a subregion of the time domain. We provide asymptotic analysis of their distributions and power, and evaluate their performance against a wide range of existing coexpression measures on simulated and real data. Our new statistics are fast to compute, robust against outliers, and show comparable if not better general performance. They have the important advantage of detecting subtle functional relationships that could be easily missed by other methods while remaining sensitive to common types of dependence relationships.

8th August 2014: Jingjing Wu
University of Calgary
Title: An Efficient and Robust Estimation Based on Profiling
The successful application of the Hellinger distance approach to fully parametric models is well known. The corresponding optimal estimators, known as minimum Hellinger distance (MHD) estimators, are efficient and have excellent robustness properties (Beran, 1977). This combination of efficiency and robustness makes MHD estimators appealing in practice. However, their application to semiparametric statistical models, which have a nuisance parameter (typically of infinite dimension), has not been fully studied. In this paper, we investigate a methodology to extend the MHD approach to general semiparametric models. We introduce the profile Hellinger distance and use it to construct a minimum profile Hellinger distance (MPHD) estimator of the finite-dimensional parameter of interest. This approach is analogous in some sense to the profile likelihood approach. We investigate the asymptotic properties such as the asymptotic normality, efficiency, and adaptivity of the proposed estimator. We also investigate its robustness properties. We present its small-sample properties using a Monte Carlo study.

22nd August 2014: Ganes S Ganeslingam
Massey University
Title: Ranked Set Sampling versus Simple Random Sampling in the Estimation of the Mean and the Ratio.
In practice, experimental units can often be ranked easily using a cheaply measurable covariate, whereas quantifying the main variable of interest requires expensive measurements. In such situations, ranked set sampling is more beneficial and cost effective. Environmental monitoring and assessment, for example, requires observational data, and ranked set sampling has been shown to achieve observational economy there when compared to traditional simple random sampling. Ranked set sampling employs judgment ordering to obtain the actual sample and hence yields a sample of observations that is more representative of the underlying population. Therefore, either greater confidence is gained for a fixed number of observations or, for a desired level of confidence, a smaller number of observations is needed. Either way, it is a big gain to the researcher. In this paper, we introduce the basic concepts of ranked set sampling and its application in the estimation of the population mean.
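
To make the sampling scheme concrete, here is a minimal simulation sketch of balanced ranked set sampling, assuming each unit comes with a cheap covariate used only for ranking. The names `ranked_set_sample` and `example_unit`, and the simulated population, are hypothetical and are not taken from the talk.

```python
import random


def ranked_set_sample(draw_unit, set_size, cycles, rng=None):
    """Balanced ranked set sample.

    draw_unit(rng) must return a (covariate, value) pair. In each cycle,
    for every rank r, a fresh set of `set_size` units is drawn and ranked
    by the cheap covariate, and only the unit of rank r is "measured".
    """
    rng = rng or random.Random()
    measured = []
    for _ in range(cycles):
        for rank in range(set_size):
            units = [draw_unit(rng) for _ in range(set_size)]
            units.sort(key=lambda unit: unit[0])  # rank by covariate only
            measured.append(units[rank][1])       # measure one unit per set
    return measured


def example_unit(rng):
    """Hypothetical population: value ~ N(10, 2^2), covariate = value + noise."""
    value = rng.gauss(10.0, 2.0)
    covariate = value + rng.gauss(0.0, 0.5)
    return covariate, value
```

Averaging the list returned by `ranked_set_sample(example_unit, 3, 200)` gives an estimate of the population mean based on 600 measured units. In a real study the value would be measured only for the selected unit; the simulation generates it for every unit purely for convenience.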

5th September 2014: Inge Koch
University of Adelaide
Title: Analysis of Spatial Data from Proteomics Imaging Mass Spectrometry
Mass spectrometry (MS) has become a versatile and powerful tool in proteomics for the analysis of complex biological systems. Unlike the common MS techniques, the more recent imaging mass spectrometry (IMS) preserves the spatial distribution inherent in tissue samples. IMS data consist of tens of thousands of spectra measured over a large range of masses, the variables. Each spectrum arises from a grid point on the surface of a tissue section. Motivated by the requirements in cancer research to differentiate cell populations and tissue types of such data accurately and efficiently, we consider two approaches -- normalisation and feature extraction -- and we illustrate these approaches on IMS data obtained from tissue sections of patients with ovarian cancer. In proteomics, normalisation refers to the process of scaling spectra in order to correct for artefacts occurring during data acquisition. Normalising the mass spectra is essential in IMS for an interpretation of the data. We propose a new and efficient normalisation of the mass spectra, based on peak intensity correction (PIC), and illustrate its effect for individual mass images and cluster maps. The selection of mass variables which distinguish cancer tissue from non-cancerous tissue regions -- or responders from non-responders -- is an important step towards identification of biomarkers. We consider a combined cluster analysis and feature extraction approach for derived binary mass data. This approach exploits the difference in proportions of occurrence (DIPPS) statistic of subsets of data in the selection and ranking of variables. We apply these ideas to the cancer and non-cancerous regions of the tissue sections, and we summarise the 'best' variables in a single image which has a natural interpretation.

26th September 2014: Dongsheng Wu
University of Alabama in Huntsville
Title: Regularity of Local Times of Fractional Brownian Sheets
As typical anisotropic Gaussian random fields, fractional Brownian sheets have been intensively studied in recent years. In this talk, we study the regularity of local times of fractional Brownian sheets, including the existence, joint continuity and smoothness (in the Meyer-Watanabe sense) of the local times. As applications, we derive regularity results of their collision local times, intersection local times and self-intersection local times. The main tools applied in our derivation are sectorial local nondeterminism of fractional Brownian sheets, Fourier analysis and chaos expansion of the local times. This talk is based on joint works with A. Ayache, Z. Chen and Y. Xiao.

Enquiries about the Statistics Seminar should be directed to the organizer John Ormerod.
Last updated on 22 July 2014.