Molecular evolution results from DNA-level modifications, produced by a variety of mutagenic mechanisms, followed by fixation of a small subset of those mutations at the population level. In other words, there are two steps: mutation generation and mutation selection.
I am interested in the second step, mutation selection, because selection results from functional and/or structural constraints. If we can discover the fingerprint of selective pressures, it tells us about the role(s) a given residue may play. Moreover, if the structure of a protein is also known, we can find putative functionally important residues by searching for evolutionary conservation that cannot be explained by the structure alone.
Most current protein phylogeny methods use empirical rate matrices that capture the combined effects of mutation generation and mutation selection. Therefore they cannot distinguish between those two processes. In addition, phylogeny methods almost always assume that all residues in a protein evolve in the same manner even though we know this is far from the truth. To remedy this I have developed tools that model different aspects of amino acids (volume, charge, hydrophobicity, ...) directly, and can do so in a site-specific manner.
Improving protein rate matrices by separating mutation and selection processes
Despite their many shortcomings, empirical rate matrices are still the
best method to determine phylogenetic trees for protein sequences. These
matrices are made by estimating the relative mutation rates for any pair of
amino acids based on very large sequence alignments for many proteins and many
different organisms. This procedure averages out any variation in mutation
behaviour between different groups of organisms, different genes, and different
residues in the protein.
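The pooling effect described above can be illustrated with a deliberately crude Python sketch. It is only a counting stand-in, not the estimation procedure behind any published matrix, and the function names `pair_counts` and `pooled_counts` are hypothetical. It tallies which amino acid pairs share an alignment column and then sums those tallies over all alignments, which is the step where differences between organisms, genes, and sites get averaged away.

```python
from collections import Counter
from itertools import combinations


def pair_counts(alignment):
    """Count unordered amino acid pairs seen together in each alignment column.

    `alignment` is a list of equal-length sequences; '-' marks a gap.
    """
    counts = Counter()
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a != b and a != "-" and b != "-":
                counts[frozenset((a, b))] += 1
    return counts


def pooled_counts(alignments):
    """Sum pair counts over many alignments, discarding where each came from."""
    total = Counter()
    for alignment in alignments:
        total.update(pair_counts(alignment))
    return total
```

Once the counts are pooled, nothing in the result records which genome, gene, or site contributed a given exchange, so a matrix derived from them describes an average protein rather than any particular one.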
Based on preliminary data, I hypothesize that the mutation rates at the protein level are strongly influenced by mutation processes at the DNA level, including the genetic code, relative base mutation rates, and base composition. In other words, if an empirical rate matrix is based on sequences from a mix of genomes with high GC content and others with low GC content, the end result will average out those differences and will not be ideal for either case. If, instead, we can split the rate matrix into a GC-content-independent part and a part that accounts for GC content, then we can optimize that matrix for any sequence by adjusting for its GC content.
An important side effect of accounting for DNA-level mutation processes is that the DNA-independent part of the rate matrix becomes more strongly dominated by selection forces at the protein level. This will provide a cleaner look at the selection on amino acid properties.
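A minimal Python sketch of such a split is given here, under strong simplifying assumptions that are mine and not necessarily those of the actual tools. The DNA-level part considers only single-nucleotide changes under the standard genetic code, each new base arises in proportion to its frequency, and base frequencies are set by GC content alone (G = C and A = T). The observed amino acid rate is assumed to be the product of this mutation part and a selection part. The names `mutation_flux` and `selection_part` are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def base_freqs(gc):
    """Base frequencies implied by GC content, assuming G = C and A = T."""
    return {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}


def mutation_flux(gc):
    """DNA-level exchange flux between amino acid pairs at a given GC content.

    Sums, over all single-nucleotide codon changes that alter the amino acid,
    the source codon frequency times the frequency of the new base.
    Codon frequencies are not renormalized after excluding stop codons.
    """
    pi = base_freqs(gc)
    flux = {}
    for codon, aa in CODE.items():
        if aa == "*":
            continue
        codon_freq = pi[codon[0]] * pi[codon[1]] * pi[codon[2]]
        for pos in range(3):
            for new in BASES:
                if new == codon[pos]:
                    continue
                target = CODE[codon[:pos] + new + codon[pos + 1:]]
                if target in ("*", aa):
                    continue  # skip stop codons and synonymous changes
                key = (aa, target)
                flux[key] = flux.get(key, 0.0) + codon_freq * pi[new]
    return flux


def selection_part(observed, gc):
    """Divide observed pair rates by the mutation flux (assumed multiplicative).

    Pairs that need more than one nucleotide change have no flux in this
    sketch and are left out.
    """
    flux = mutation_flux(gc)
    return {pair: rate / flux[pair] for pair, rate in observed.items() if pair in flux}
```

In this toy model the lysine/asparagine exchange, which involves AT-rich codons, receives more mutational flux at 30% GC than at 70% GC, while the glycine/alanine exchange (GGN to GCN) shows the opposite trend. Dividing an observed rate by `mutation_flux(gc)` for the relevant genome leaves a residual that, under the stated assumptions, reflects selection at the protein level rather than base composition.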