Seminars

Application of Scan Statistics on Genomic Data: Searching Palindrome Clusters

I-Ping Tu

2012-10-12
12:30:00 - 14:30:00

103 , Mathematics Research Center Building (ori. New Math. Bldg.)

A DNA palindrome is a special DNA letter pattern in which the segment on one strand is identical to the segment on the other strand under inversion symmetry. This pattern enables the formation of a secondary structure that possibly confers regulatory functions in varied biological processes, including transcription, replication, and gene deletion. Searching for non-random clustering of DNA palindromes along the chromosome is therefore an important bioinformatic task, and it depends on the null palindrome occurrence rate. The conventional approach is to use the average rate. However, by computationally inserting a hot-spot of 3000 bp into a simulated herpes virus genome, we found that the average rate method reported twice the actual rate. To deal with this overestimation problem, we propose a Markov chain based estimator that bypasses the direct counting of palindromes and thus reduces the influence of the hot-spot. Compared to the average rate method, our method is shown to be more robust against hot-spots. Furthermore, our method can be readily generalized to a higher-order or a segmented Markov model, and extended to calculate the occurrence rate for palindromes with gaps. Finally, we provide a p-value approximation for various scan statistics to test non-random palindrome clustering under a Markov model. (This is joint work with Yuan-Fu Huang and Shao-Hsuan Wang.)
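
To make the contrast between the two kinds of rate estimate concrete, the following is a minimal Python sketch. It is not the speakers' estimator: the function names, the fixed palindrome length, the add-one smoothing, and the use of a first-order chain with empirical base frequencies as the starting distribution are all assumptions made for illustration. The first estimate counts palindromic windows directly (the average rate); the second fits a Markov chain to the sequence and sums the model probability of every possible palindrome, so it never counts observed palindromes.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
BASES = "ACGT"


def reverse_complement(segment):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return "".join(COMPLEMENT[b] for b in reversed(segment))


def is_palindrome(segment):
    """A DNA palindrome has even length and equals its own reverse complement."""
    return len(segment) % 2 == 0 and segment == reverse_complement(segment)


def average_rate(sequence, length):
    """Fraction of windows of the given length that are palindromes (direct counting)."""
    windows = len(sequence) - length + 1
    hits = sum(is_palindrome(sequence[i:i + length]) for i in range(windows))
    return hits / windows


def fit_markov(sequence):
    """Estimate base frequencies and first-order transition probabilities.

    Add-one smoothing is an illustrative assumption that avoids zero rows.
    """
    counts = {a: {b: 1 for b in BASES} for a in BASES}
    for a, b in zip(sequence, sequence[1:]):
        counts[a][b] += 1
    trans = {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in BASES}
        for a in BASES
    }
    freq = {a: sequence.count(a) / len(sequence) for a in BASES}
    return freq, trans


def markov_palindrome_rate(sequence, length):
    """Model probability that a window of even `length` is a palindrome.

    Enumerates all 4**(length // 2) left halves, so it is only practical
    for short palindromes.
    """
    freq, trans = fit_markov(sequence)
    total = 0.0
    for left in product(BASES, repeat=length // 2):
        half = "".join(left)
        word = half + reverse_complement(half)  # the unique palindrome with this left half
        p = freq[word[0]]
        for a, b in zip(word, word[1:]):
            p *= trans[a][b]
        total += p
    return total
```

For example, `is_palindrome("GAATTC")` holds because the reverse complement of GAATTC is GAATTC itself. Calling `average_rate(genome, 10)` and `markov_palindrome_rate(genome, 10)` on the same sequence string gives the two estimates side by side; the model-based value depends on the sequence only through base and dinucleotide counts, not through a direct count of palindromes.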