Seminar

Qing Zhou

Extracting Sequence Features to Predict Protein-DNA Interactions: A Comparative Study

Predicting where proteins, especially transcription factors (TFs), interact with DNA is an important problem in biology. The most influential computational strategies for predicting TF binding sites are based on generative models in the form of position-specific weight matrices (PWMs). We present a systematic study of predictive modeling approaches to the TF-DNA binding problem; such approaches have frequently been shown to be more efficient than methods based only on PWMs. Predictive modeling approaches integrate genomic sequence information with expression or ChIP-binding information through sequence feature extraction and selection. We examine a few state-of-the-art learning methods, including stepwise linear regression, multivariate adaptive splines, neural networks, support vector machines, boosting, and Bayesian additive regression trees (BART). These methods are applied to simulated datasets and to two whole-genome ChIP-chip datasets, one for each of the TFs Oct4 and Sox2, in human embryonic stem cells. We find that, with proper learning methods, predictive modeling approaches can significantly improve predictive power and identify more biologically interesting features, such as TF-TF interactions, than the PWM approach. In particular, BART and boosting show the best and most robust overall performance among the methods examined.



Seminar Date:
April 16, 2008