Sydney Precision Bioinformatics Alliance

The goal of the group's work is to develop methods to improve the interpretation of omics data, leading to tangible translational benefits to important biological problems. You could become part of the team working on one of the projects detailed below.

Future Projects

Integrative Omics

Embryonic stem cell (ESC)-specific Pathway Identification and Annotation Using Multi-layered Omics Data and Statistical Learning (1 Honours Student)

While all cells from a given organism share the same DNA sequence coding for the same genes, each cell type has only a subset of those genes "turned on". Genes are commonly annotated into pathways to summarise their collective effect in biological systems. One of the main drawbacks of current pathway annotations is that they are not cell type-specific. We propose to identify and curate cell type-specific pathways for embryonic stem cells (ESCs) using our multi-layered omics data. The key assumption is that genes within a pathway should show correlated changes in their expression profiles when perturbed. We have collected ESC differentiation data profiled in a time-course at both the proteome and transcriptome levels. Following the above assumption, we aim to (1) identify pathways that are regulated specifically in ESC differentiation; and (2) curate ESC-specific pathways using statistical learning. This project will expose the honours student to the development and application of cutting-edge statistical learning methods to state-of-the-art biomolecular applications. It sits at the heart of interdisciplinary research.

Contact: Dr. Pengyi Yang to discuss and/or apply.

Is integrative-omics the new currency for solutions to complex diseases? Exploring Correlations in High-dimensional Data (1 Honours Student)

With large volumes of high-throughput biological data now being generated, more researchers are looking to integrate data of different types to inform hypotheses. For example, in complex metabolic diseases such as type 2 diabetes (T2D) and obesity, it is crucial to interrogate multiple data types to gain a comprehensive picture of the system defects, which may eventually lead to the identification of T2D or obesity markers. In this project, we aim to apply multivariate statistical approaches to integrate the data and build better predictors from multiple data sources. We will explore ways of weighting the relatively sparse proteomics data with information borrowed from the transcriptomics data. This involves exploring or developing methods to correlate multiple high-dimensional datasets to identify common and differentiating patterns.

Contact: Prof. Jean Yang to discuss and/or apply.

Machine Learning for Trans-omics Data (1 PhD Student)

A PhD position is available in developing and applying computational and machine learning models for multilayered trans-omics data sets generated by state-of-the-art mass spectrometers (MS) and next-generation sequencers (NGS). Our research project, funded by the Australian Research Council (ARC), aims to develop novel machine learning algorithms to analyse and integrate large-scale MS-based omics data with ultra-fast NGS-based omics data generated from complex biological systems. Characterising the signalling cascades, transcriptional networks, and translational protein networks, together with their cross-talk, is critical for a comprehensive understanding of complex biological systems. Our large-scale multilayered trans-omics data, generated from state-of-the-art technological platforms, provide a unique opportunity to uncover the novel biology and molecular mechanisms that are critical for treating complex diseases and for personalised medicine. In this project, we aim to develop novel machine learning methods that are capable of extracting key patterns from each omic layer and integrating such information across multiple omic layers.

Contact: Dr. Pengyi Yang to discuss and/or apply.

Variable Selection

Ultra-high Variable Screening (1 Student)

This project will review recent literature on ultra-high dimensional variable screening, that is, computationally fast and ingenious methods to sort out the 'good from the ugly'. There are millions of variables, too many for any regression procedure to be run with the full data. The task is to safely eliminate that part of the design matrix which is guaranteed to be uninformative. Such variable screening is an essential preprocessing step for successful model selection methods.
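To give a feel for what a screening step can look like, here is a minimal sketch of marginal correlation screening, in the spirit of sure independence screening. It is an illustration under stated assumptions, not the method this project will necessarily study: the simulated data, the function names, and the choice of keeping 20 variables are all hypothetical. Note also that ranking by marginal correlation is a heuristic; it does not by itself guarantee that every discarded variable is uninformative.

```python
import random


def pearson(a, b):
    """Pearson correlation of two equal-length lists (assumes non-constant inputs)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((y - mean_b) ** 2 for y in b)
    return s_ab / (s_aa * s_bb) ** 0.5


def screen_by_marginal_correlation(columns, response, keep):
    """Return the indices of the `keep` columns most correlated with the response."""
    scores = [abs(pearson(col, response)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical simulation: 1000 candidate variables, 200 samples,
# and only three variables that truly drive the response.
rng = random.Random(1)
n_samples, n_variables = 200, 1000
columns = [[rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_variables)]
informative = [3, 17, 256]
response = [
    sum(2.0 * columns[j][i] for j in informative) + rng.gauss(0, 1)
    for i in range(n_samples)
]

selected = screen_by_marginal_correlation(columns, response, keep=20)
print("kept", len(selected), "of", n_variables, "variables:", selected)
```

Each variable is scored independently, so the cost grows linearly in the number of variables and no joint regression is ever fitted. A model selection method can then be run on the reduced set only.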

Contact: A. Prof. Samuel Mueller to discuss and/or apply.

Robust Model Selection Criteria (1 Student)

Mueller and Welsh introduced methods to robustly select variables in a regression-type model using the bootstrap. This project would revisit their methods, and identify and investigate additional special cases. One aim of the project could be to make available an R package, or at least an R function. There are also additional algorithms to explore that do not need to consider all possible submodels. That is, an important question is how to robustly reduce the power set of models with fast and robust methods before turning attention to more computationally expensive but more efficient model selectors.

Contact: A. Prof. Samuel Mueller to discuss and/or apply.

Heterogeneous Data Classification

Bayesian Approaches to Differential Distribution (1 Honours Student)

The distribution of a gene's expression levels is potentially informative when trying to distinguish between healthy samples and diseased samples. Traditionally this has been done via a hypothesis testing approach which tests for differences in the mean gene expression levels between healthy samples and diseased samples, which is called differential expression. In this project we will develop an analogous Bayesian test for differences across the whole distribution of gene expression levels between two states. A multiple testing approach will be developed to take into account false discoveries. This work will be motivated by real gene expression data from melanoma patients, where it is hoped that this new approach will be able to uncover new biomarkers for the disease.
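The difference between comparing means and comparing whole distributions can be seen in a small simulation. The sketch below is only an illustration: it uses the classical two-sample Kolmogorov-Smirnov statistic, which is a frequentist stand-in and not the Bayesian test this project proposes, and the simulated "healthy" and "diseased" expression values are hypothetical.

```python
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    sorted_a, sorted_b = sorted(a), sorted(b)
    gap = 0.0
    for value in sorted_a + sorted_b:
        cdf_a = bisect_right(sorted_a, value) / len(sorted_a)
        cdf_b = bisect_right(sorted_b, value) / len(sorted_b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical gene: same mean expression in both groups,
# but much more variable in the diseased samples.
rng = random.Random(0)
healthy = [rng.gauss(5.0, 1.0) for _ in range(2000)]
diseased = [rng.gauss(5.0, 3.0) for _ in range(2000)]

mean_gap = abs(sum(healthy) / len(healthy) - sum(diseased) / len(diseased))
ks_gap = ks_statistic(healthy, diseased)

print("difference in means:", round(mean_gap, 3))
print("KS statistic:", round(ks_gap, 3))
```

A test on the means alone would see little here, whereas a statistic that compares the full distributions picks up the change in spread.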

Contact: Dr. John Ormerod to discuss and/or apply.

Rare cell type discovery using AdaSampling (1 Honours Student)

Single-cell RNA sequencing (scRNA-seq) is a revolutionary technique that enables the gene expression profiling of thousands of cells. One of the key tasks in scRNA-seq data analysis is to identify rare cell types that are hiding in tissue samples, such as various cancer stem cells in tumour tissues. AdaSampling is a semi-supervised machine learning approach that we developed recently for detecting noisy samples in a given dataset. In this project, we will look into transferring the AdaSampling technique to the identification of rare cell types that are previously unknown and therefore labelled incorrectly in the initial dataset.

Contact: Dr. Pengyi Yang or Prof. Jean Yang to discuss and/or apply.

Brain Omics and the Connectome

Molecular-pathology: Exploring the cellular composition of diseased tissue through 'Omics technology. (1 Student)

Gene expression profiles of post-mortem tissue from whole brain are composed of multiple cell types, providing a snapshot of the cellular pathology of the tissue. We can explore how these bulk tissue profiles can be integrated with single-cell sequencing data or high-throughput imaging data to describe the presence, behaviour and interaction of various cell types and how this affects disease. Integrating these data types will require the development of novel unsupervised clustering methods and image-based segmentation and analysis approaches to quantify cell subpopulations.

Contact: Dr. Ellis Patrick to discuss and/or apply.

Causal associations in Alzheimer's disease (AD) (1 Student)

Mendelian randomization (MR) exploits the natural genetic variability in a population to make causal inferences about the relationships between certain classes of molecules and disease. With a few assumptions, MR seeks to identify associations between genes and disease that are independent of potential confounders while also establishing the causal direction of these associations. Recent advances in sparse regression approaches could extend the capacity of MR to use the information in thousands of SNPs near a gene to increase the likelihood of identifying strong gene-phenotype relationships. We could explore the use of the MR framework to develop tools for filtering the list of genes whose expression is associated with AD. First, we could integrate sparse regression techniques such as the elastic net to improve the power of MR, making it appropriate for use in a high-throughput setting. This will allow us to apply MR to the matched genotyping and gene expression data to begin prioritising causal associations in Alzheimer's disease for further investigation. We could also seek to refine our MR algorithm by including information on cell-type proportions. This will include using the cell-type proportion estimates to reduce noise in our models and also to prioritise the cell types through which the AD genes may be acting. Alternatively, we could extend MR to provide a causal understanding of gene expression changes in AD at a system level. We will use MR to contribute to a directional prediction in various annotated and PPI networks to begin to establish a direction of information flow throughout the network.
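The basic MR idea can be sketched with a single SNP used as an instrument. The example below is a minimal illustration on simulated data, not the sparse-regression extension described above: the simulation, the effect sizes, and the function names are all hypothetical. It uses the Wald ratio, which estimates the causal effect of expression on an outcome as the SNP-outcome covariance divided by the SNP-expression covariance.

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


def wald_ratio(genotype, expression, outcome):
    """Single-instrument MR estimate of the effect of expression on outcome."""
    return covariance(genotype, outcome) / covariance(genotype, expression)


def simulate(n, true_effect, seed=0):
    """Hypothetical data: a SNP shifts expression; an unobserved confounder
    affects both expression and outcome."""
    rng = random.Random(seed)
    genotype, expression, outcome = [], [], []
    for _ in range(n):
        g = sum(rng.random() < 0.3 for _ in range(2))  # allele count 0, 1 or 2
        u = rng.gauss(0, 1)                            # unobserved confounder
        x = 0.5 * g + u + rng.gauss(0, 1)
        y = true_effect * x + u + rng.gauss(0, 1)
        genotype.append(g)
        expression.append(x)
        outcome.append(y)
    return genotype, expression, outcome


true_effect = 0.4
genotype, expression, outcome = simulate(50000, true_effect)

# Naive regression slope of outcome on expression, biased by the confounder.
naive_estimate = covariance(expression, outcome) / covariance(expression, expression)
mr_estimate = wald_ratio(genotype, expression, outcome)

print("true effect:", true_effect)
print("naive estimate:", round(naive_estimate, 3))
print("MR (Wald ratio) estimate:", round(mr_estimate, 3))
```

In this simulation the SNP is independent of the confounder and affects the outcome only through expression, which are the assumptions that make the ratio a valid causal estimate. With real data those assumptions have to be argued for rather than built in.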

Contact: Dr. Ellis Patrick to discuss and/or apply.

Dimension Reduction for Resting State fMRI Data (1 Honours Student)

Anatomical, functional and effective networks within the brain are currently being elucidated at fine temporal and spatial resolution using magnetic resonance imaging, including functional MRI (fMRI). The concepts behind local region clustering, such as superpixels, are becoming increasingly popular for use in computer vision applications, data visualisation and dimension reduction strategies. This project involves exploring ideas and models for segmenting fMRI data by borrowing information across multiple samples. Specific applications of this information sharing may be to improve the identification of interesting biological features or to improve sample classification in large p, small n datasets.

Contact: Prof. Jean Yang or Dr. John Ormerod to discuss and/or apply.


Methods Towards Personalised Medicine (1 Honours Student)

Over the past decade, new and more powerful genomic tools have been applied to the study of complex diseases such as cancer and have generated a myriad of complex data. However, our general ability to analyse these data lags far behind our ability to produce them. This project is to develop statistical methods that deliver better prediction of response to drug therapy. In particular, this project investigates whether it is possible to establish a patient- or sample-specific network (matrix) by integrating public repositories and gene expression data.

Contact: Prof. Jean Yang to discuss and/or apply.

Classification Using Statistical Networks (1 Honours Student)

Classical approaches in classification are primarily based on single features that exhibit an effect size difference between classes. In omics data, this is equivalent to finding differential expression of genes or proteins between different treatment classes. Recently, network-based approaches utilising interaction information between genes have emerged, and our recent work (Barter et al. 2014) further reveals that simple network-based methods are able to classify alternate subsets of patients compared to gene-based approaches. This suggests that next-generation methods of gene expression signature modelling may benefit from harnessing data from external networks. This project will further explore the strengths and weaknesses of using statistical networks as features in classification. The project will also extend Barter et al. (2014) by examining robust networks obtained from external databases or complementary datasets and evaluating their effect in a classification (prognostic) setting.

Contact: Prof. Jean Yang to discuss and/or apply.