Western University Computer Science
Western Science

PhD Defense


Ruipeng Lu

Computational Modelling of Human Transcriptional Regulation by an Information Theory-based Approach


Thesis Examiners:

External Examiner:
Thursday, April 12, 2018
1:00 p.m.
Medical Sciences Bldg, MSB 384
Dr. Peter Rogan
Dr. Bob Mercer
Dr. Dan Lizotte

Dr. Ilka Kenemann (Biochemistry)
Dr. Wyeth Wasserman



ChIP-seq experiments can identify the genome-wide binding site motifs of a transcription factor (TF) and determine its sequence specificity. Multiple algorithms have been developed to derive TF binding site (TFBS) motifs from ChIP-seq data, including the entropy minimization-based Bipad, which can derive both contiguous and bipartite motifs. Prior studies applying these algorithms to ChIP-seq data analyzed only a small number of top peaks with the highest signal strengths, which biased the resulting position weight matrices (PWMs) towards consensus-like, strong binding sites. Nor did they derive bipartite motifs, which prevented accurate modelling of the binding behavior of dimeric TFs.

This thesis presents a novel motif discovery pipeline that adds recursive masking and thresholding functionality to Bipad to improve detection of primary binding motifs. Analyzing 765 ENCODE ChIP-seq datasets with this pipeline generated contiguous and bipartite information theory-based PWMs (iPWMs) for 93 sequence-specific TFs, discovered 23 cofactor motifs for 127 TFs, and revealed six high-confidence novel motifs. The accuracy of these iPWMs was determined via four independent validation methods: detection of experimentally proven TFBSs, explanation of the effects of characterized SNPs, comparison with previously published motifs, and statistical analyses. Novel cofactor motifs supported previously unreported TF coregulatory interactions. This thesis further presents a unified framework to identify variants in hereditary breast and ovarian cancer (HBOC), successfully applying these iPWMs to prioritize TFBS variants in 20 complete genes of HBOC patients.
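As a rough illustration of what an information theory-based weight matrix looks like, the sketch below builds per-position weights of the form 2 + log2 f(b, l) bits from a set of aligned sites and scores a candidate sequence by summing them. It is not the Bipad pipeline described above: it assumes a uniform background, omits the small-sample correction, and uses a hypothetical pseudocount and made-up example sites purely for demonstration.

```python
import math

BASES = "ACGT"

def information_weight_matrix(sites, pseudocount=0.5):
    """Return one dict per position mapping base -> weight in bits.

    Weight = 2 + log2(frequency of the base at that position).
    The pseudocount is an illustrative assumption to avoid log2(0).
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all aligned sites must have the same length")
    matrix = []
    for pos in range(length):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append({b: 2.0 + math.log2(counts[b] / total) for b in BASES})
    return matrix

def site_information(matrix, seq):
    """Sum the per-position weights of seq (its individual information, in bits)."""
    if len(seq) != len(matrix):
        raise ValueError("sequence length must match the matrix length")
    return sum(column[base] for column, base in zip(matrix, seq))

# Made-up aligned sites, for demonstration only.
example_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGAGTCA", "TGACTAA", "TTACTCA"]
iwm = information_weight_matrix(example_sites)
strong = site_information(iwm, "TGACTCA")   # matches the consensus
weak = site_information(iwm, "CCCCCCC")     # mostly unseen bases
print(f"consensus-like site: {strong:.2f} bits, unrelated sequence: {weak:.2f} bits")
```

Under this scoring, sequences resembling the aligned sites receive positive information and dissimilar sequences receive negative information, which is why weaker but genuine sites can still be ranked rather than discarded.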

The spatial distribution and information composition of cis-regulatory modules (e.g. TFBS clusters) in promoters substantially determine gene expression patterns and TF target genes. Multiple algorithms, such as information density-based clustering (IDBC), have been developed to detect TFBS clusters. Prior studies predicting tissue-specific gene expression levels and differentially expressed (DE) TF targets used the signal strengths and counts of ChIP-seq peaks as inaccurate proxies for TFBS strengths and counts, and relied on log odds-based PWMs, which quantify TFBS strengths inaccurately. This thesis presents a general machine learning framework that uses the Bray-Curtis similarity measure to quantify the similarity between tissue-wide expression profiles of genes, and IDBC-identified clusters from iPWM-detected TFBSs to predict gene expression profiles and DE direct TF targets. Multiple clusters enabled gene expression to be robust against deleterious TFBS mutations.
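The Bray-Curtis similarity between two expression profiles can be sketched as follows. This is a minimal example of the standard measure (one minus the Bray-Curtis dissimilarity) for non-negative values, not the thesis's framework; the gene profiles are invented, and treating two all-zero profiles as identical is an assumption made here for convenience.

```python
def bray_curtis_similarity(u, v):
    """Bray-Curtis similarity of two non-negative profiles of equal length.

    similarity = 1 - sum(|u_i - v_i|) / sum(u_i + v_i)
    """
    if len(u) != len(v):
        raise ValueError("profiles must have the same number of tissues")
    total = sum(u) + sum(v)
    if total == 0:
        # Assumption: two all-zero profiles are treated as identical.
        return 1.0
    difference = sum(abs(a - b) for a, b in zip(u, v))
    return 1.0 - difference / total

# Invented expression values across four tissues, for demonstration only.
gene_a = [10, 0, 5, 20]
gene_b = [8, 1, 5, 25]
gene_c = [0, 30, 0, 1]

sim_ab = bray_curtis_similarity(gene_a, gene_b)
sim_ac = bray_curtis_similarity(gene_a, gene_c)
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

Genes with similar tissue-wide profiles score close to 1, while genes expressed in different tissues score close to 0.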