Seminar
Qing Zhou
Extracting Sequence Features to Predict Protein-DNA Interactions: A
Comparative Study
Predicting where proteins, especially transcription factors (TFs), interact
with DNA is an important problem in biology. Most influential
computational strategies for predicting TF binding sites are based on
generative models in the form of position-specific weight matrices (PWMs).
We present a systematic study of predictive modeling approaches to the
TF-DNA binding problem; such approaches have frequently been shown to be
more efficient than methods based only on PWMs. Predictive modeling
approaches integrate genomic sequence information with expression or
ChIP-binding information through sequence feature extraction and
selection. We examine several state-of-the-art learning methods, including
stepwise linear regression, multivariate adaptive regression splines,
neural networks, support vector machines, boosting, and Bayesian additive
regression trees (BART). These methods are applied to simulated datasets
and to two whole-genome ChIP-chip datasets, one for each of the TFs Oct4
and Sox2, in human embryonic stem cells. We find that, with proper
learning methods, predictive modeling approaches can significantly improve
the predictive power and identify more biologically interesting features,
such as TF-TF interactions, than the PWM approach. In particular, BART and
boosting show the best and most robust overall performance among all
the methods.
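
As a rough illustration of the sequence feature extraction step mentioned above, the sketch below scans a DNA sequence with a PWM and returns the best log-odds match, a quantity that could serve as one input feature to a learning method. The motif probabilities, the uniform background, and the names PWM, BACKGROUND, window_score, and best_match are all hypothetical and chosen for illustration; this is not the procedure used in the study.

```python
import math

# Hypothetical 4-position motif (illustrative values only, not from the study).
# Each entry gives the probability of each base at that motif position.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]

# Assumed uniform background base frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def window_score(window):
    """Log-odds score of one motif-width window against the background."""
    return sum(
        math.log(column[base] / BACKGROUND[base])
        for column, base in zip(PWM, window)
    )


def best_match(seq):
    """Return (start index, score) of the highest-scoring window in seq."""
    width = len(PWM)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the motif")
    scores = [
        window_score(seq[i:i + width])
        for i in range(len(seq) - width + 1)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


if __name__ == "__main__":
    position, score = best_match("CCATGCAA")
    print(position, round(score, 3))
```

In a predictive modeling setting, a score of this kind would typically be computed for each candidate motif and each sequence, and the resulting feature matrix passed, together with the expression or ChIP measurements, to a regression method.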