Inspired by concepts from ergodic theory, we give a new approach to coding
sequence (CDS) density estimation for the human genome. Our approach is
based on the introduction and study of topological pressure: a numerical
quantity assigned to any finite sequence based on an appropriate notion of
"weighted information content". For human DNA sequences, each
codon is assigned a suitable weight, and using a window size of
approximately 60,000 bp, we obtain a very strong positive correlation
between CDS density and topological pressure. Inspired again by ergodic
theory, we use the weightings on the codons to define a probability
distribution on finite sequences, which is effective in distinguishing
between coding and non-coding human DNA sequences of length approximately
5,000 bp. The theoretical underpinning for our approach is the theory of
thermodynamic formalism from the dynamical systems literature. This is
joint work with David Koslicki (OSU).
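
To make the idea of "weighted information content" concrete, here is a minimal sketch of one way such a quantity could be computed for a finite sequence. The abstract does not state the definition, so everything in it is an assumption made for illustration: the function name `topological_pressure`, the choice of summing a product of codon weights over the distinct length-n subwords, the base-4 logarithm, the normalization by n, and the example weights.

```python
import math


def topological_pressure(seq, weight, n, k=3):
    """Illustrative (assumed) pressure of a finite sequence.

    seq    -- DNA string over the alphabet A, C, G, T
    weight -- dict mapping k-mers (codons when k = 3) to positive weights;
              k-mers missing from the dict get weight 1.0 (an assumption)
    n      -- length of the subwords to enumerate
    k      -- length of the weighted blocks (3 for codons)
    """
    # Collect the distinct subwords of length n that occur in seq.
    subwords = {seq[i:i + n] for i in range(len(seq) - n + 1)}
    if not subwords:
        raise ValueError("sequence is shorter than the subword length n")

    total = 0.0
    for u in subwords:
        # Weight of a subword: product of the weights of its overlapping k-mers.
        prod = 1.0
        for j in range(n - k + 1):
            prod *= weight.get(u[j:j + k], 1.0)
        total += prod

    # Normalized log of the weighted count of distinct subwords.
    return math.log(total, 4) / n


# Hypothetical weights: only the codon ACG is up-weighted.
example = topological_pressure("ACGTAC", {"ACG": 2.0}, n=3)
```

With every weight equal to 1, the sum simply counts distinct subwords, so the sketch reduces to a normalized log of subword complexity; non-uniform weights let some codons contribute more than others.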