I’ve been playing with some sequence analysis and phylogenetic tree construction programs recently because I would like to introduce these sorts of data analysis into my biology classes. As a sample protein, I decided to use BLyS / BAFF, a protein important in regulating B cell numbers. I’ve wondered about the origin of this kind of molecule ever since working on it in grad school, and this seemed like a decent way to get some ideas about where it might come from.
The first thing I did was go to the NCBI website, which is part of the NIH’s National Library of Medicine: http://www.ncbi.nlm.nih.gov
It’s easy to search for any protein / gene / whole genome you are interested in examining. Knowing that BLyS is vital in humans and mice, I chose to start with the human sequence. I retrieved it as the following:
>gi|20196464|dbj|BAB90856.1| BLyS [Homo sapiens]
MDDSTEREQSRLTSCLKKREEMKLKECVSILPRKESPSVRSSKDGKLLAATLLLALLSCCLTVVSFYQVA
ALQGDLASLRAELQGHHAEKLPAGAGAPKAGLEEAPAVTAGLKIFEPPAPGEGNSSQNSRNKRAV
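For classroom use it can help to see that a FASTA record is just a header line starting with ">" followed by sequence lines. Here is a minimal Python sketch that reads the record above; the function name `parse_fasta` is my own, and pulling the accession out of the fourth pipe-delimited field is an assumption that only holds for headers written in this style.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


human_blys = """>gi|20196464|dbj|BAB90856.1| BLyS [Homo sapiens]
MDDSTEREQSRLTSCLKKREEMKLKECVSILPRKESPSVRSSKDGKLLAATLLLALLSCCLTVVSFYQVA
ALQGDLASLRAELQGHHAEKLPAGAGAPKAGLEEAPAVTAGLKIFEPPAPGEGNSSQNSRNKRAV
"""

header, sequence = parse_fasta(human_blys)[0]
# Assumption: in this header style the accession is the fourth "|" field.
accession = header.split("|")[3]
print(accession, len(sequence))
```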
The easiest tool to find similar proteins in other animals is the Basic Local Alignment Search Tool for proteins, or BLASTp. Using just the default settings, I pasted the sequence into the search field and hit go. (Note: I actually used just the accession number, not the whole sequence.)
This retrieved tons of proteins with similar sequences from the vast database of sequence information, from which I chose several model species. One thing I wanted to do was to include several primates as a sort of internal calibration (assuming that their sequences would all be very similar to one another compared with those of more distantly related species). I also wanted sequences from a few animals that are quite distantly related to humans (frog and ground tit fit that bill).
Once I had a list, I put them all into a single text file and then used that file in a second program. This time, I decided that the best ‘multiple alignment tool’ would be CLUSTALX. It’s been around for a while and can create data in a number of different forms. Besides, it’s free, and versions are available for both Mac and PC.
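Building the combined text file is easy to do by hand, but it can also be scripted. The sketch below is one hypothetical way to do it: `to_multi_fasta` is a name I made up, the short labels are my own choice to keep the alignment and tree readable, and the second entry is an invented placeholder, not a real database record.

```python
def to_multi_fasta(named_sequences, width=60):
    """Build multi-FASTA text from {label: sequence}, wrapping sequence lines."""
    blocks = []
    for label, seq in named_sequences.items():
        # Drop whitespace left over from copy/paste and normalise the case.
        seq = "".join(seq.split()).upper()
        lines = [seq[i:i + width] for i in range(0, len(seq), width)]
        blocks.append(">" + label + "\n" + "\n".join(lines))
    return "\n".join(blocks) + "\n"


sequences = {
    # First residues of the human record, used here only as a short example.
    "HUMAN": "MDDSTEREQSRLTSCLKKREEM",
    # Invented placeholder to show the format, not a real sequence.
    "TOY_SPECIES": "mddstere qsrltscl",
}

print(to_multi_fasta(sequences, width=10))
```

The printed text can be saved to a file and opened in the alignment program.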
Again, for starters, I just accepted the default parameters and did a quick alignment:
Obviously, there’s something odd about the Canis familiaris (dog) sequence, but before I did anything about that, I just wanted to see what a phylogenetic tree looked like. This is another thing that Clustal does well: it will turn your sequence alignment into tree data and export it in a number of formats. I could then plug that data into one final program. This last one is a web-based program that I access through a French site (but you can probably find it in a number of places). The program is called DRAWGRAM. It accepts the tree data and outputs a graphical representation of the tree.
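Tree data of this kind is usually plain text in the parenthesized Newick style, which is worth showing students before they see the picture. The sketch below assumes that format and pulls out the leaf names and branch lengths with a regular expression. The tree string and its branch lengths are invented for illustration and are not my actual results, and `newick_leaves` is a hypothetical helper, not part of Clustal or DRAWGRAM.

```python
import re


def newick_leaves(newick):
    """Return {leaf_name: branch_length} for leaves written as name:length."""
    # A leaf name directly follows "(" or ","; internal-node lengths follow ")".
    pattern = re.compile(r"[(,]\s*([^(),:;\s]+)\s*:\s*([0-9.eE+-]+)")
    return {name: float(length) for name, length in pattern.findall(newick)}


# Invented toy tree: nested parentheses group the most similar sequences.
toy_tree = "((HUMAN:0.01,CHIMP:0.02):0.10,(DOG:0.08,CAT:0.07):0.05,FROG:0.60);"

for name, length in newick_leaves(toy_tree).items():
    print(name, length)
```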
This is an important logical step… What I’m doing is asking for a family tree of sorts to be displayed that represents the relationship of the sequences I provided. We might want to assume that this also tells us how related the organisms that have these proteins are – and that’s not wrong, but it’s also not thorough as we’re only using ONE protein to make that assumption.
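To make the point concrete, here is a rough sketch of the kind of pairwise comparison that distance-based tree methods start from: the fraction of aligned positions at which two sequences differ. This is a simplification and not necessarily the exact calculation the alignment program performs. The three short aligned strings are invented toy data, and `p_distance` is my own name for the helper.

```python
from itertools import combinations


def p_distance(a, b, gap="-"):
    """Fraction of differing residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = differing = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        differing += x != y
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


# Invented toy alignment, not real BLyS data.
toy_alignment = {
    "SEQ_A": "MDDSTEREQS",
    "SEQ_B": "MDDSTEKEQS",
    "SEQ_C": "MD-STQKEHS",
}

for (name1, seq1), (name2, seq2) in combinations(toy_alignment.items(), 2):
    print(name1, name2, round(p_distance(seq1, seq2), 3))
```

Every number here comes from a single stretch of a single protein, which is exactly why a tree built this way describes the sequences first and the organisms only by inference.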
Here’s my first tree:
Finally, I went back and truncated the Canis sequence to a place where I suspect the protein actually starts. The sequence I got from the NCBI had a string of amino acids at the front of the protein that I think are probably not really there, but were added by some computer algorithm without proper human oversight.
Once I did that, Canis (by the way, I renamed the sequence ‘DOG’ so I was sure it was the new one) fell in line, with a sequence more similar to that seen in cats (Felis):
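I did this trimming by hand, but the same idea can be scripted. The sketch below is hypothetical: it assumes the true start can be recognized by a short anchor motif (here the first residues of the human record), which may not hold for every species, and the "dog" string is an invented stand-in with a made-up extension, not the real NCBI entry.

```python
def trim_to_motif(sequence, motif):
    """Drop residues before the first occurrence of motif; raise if it is absent."""
    start = sequence.find(motif)
    if start == -1:
        raise ValueError("motif not found; inspect the alignment by hand")
    return sequence[start:]


human_start = "MDDSTEREQ"  # first residues of the human record shown earlier

# Invented example: a made-up N-terminal extension followed by the anchor.
toy_dog = "MAAAAPLLLGGG" + "MDDSTEREQSRLTS"

trimmed = trim_to_motif(toy_dog, human_start)
print(len(toy_dog) - len(trimmed), "residues removed")
print(trimmed)
```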
- Dereeper A., Audic S., Claverie J.M., Blanc G. BLAST-EXPLORER helps you building datasets for phylogenetic analysis. BMC Evol Biol. 2010 Jan 12;10:8.
- Dereeper A.*, Guignon V.*, Blanc G., Audic S., Buffet S., Chevenet F., Dufayard J.F., Guindon S., Lefort V., Lescot M., Claverie J.M., Gascuel O. Phylogeny.fr: robust phylogenetic analysis for the non-specialist. Nucleic Acids Res. 2008 Jul 1;36(Web Server issue):W465-9. Epub 2008 Apr 19. (*: joint first authors)
- Felsenstein J. PHYLIP – Phylogeny Inference Package (Version 3.2). 1989, Cladistics 5: 164-166
- Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R., McGettigan, P.A., McWilliam, H., Valentin, F., Wallace, I.M., Wilm, A., Lopez, R., Thompson, J.D., Gibson, T.J., Higgins, D.G. (2007) Clustal W and Clustal X version 2.0. Bioinformatics, 23:2947-2948.
- Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F. and Higgins, D.G. (1997) The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research, 25:4876-4882.