BLAST
- why study internals of BLAST?
- more thorough understanding of results
- better troubleshooting for unexpected results
- refining results
- BLAST was conceived in 1989 and published in 1990 by a team of biologists and computer scientists who had an interesting idea
- there were already many sequence matching programs that did a pretty decent job given the available set of sequences
- the problem was these programs took a while to run
- the original idea for BLAST: look for obvious sequence similarities, not subtle ones, and do it quickly
- if we ratchet up the threshold for similarity, we should find pretty clear homologies, and do it before "you take more than one sip of coffee"
- BLAST was never intended to compete with earlier, more precise matching programs, but its authors found that it could also quickly find some weak similarities
How Does it Work?
- consider how we might compare these two sentences:
"Fred ran around like a maniac, but he actually won the game!"
"I saw that Fred won the match despite doing a lot of dumb things."
- matching pieces, not the whole
- matching similar pieces, not just exact matches
- matching meaning, not just syntax
- consider how Google ranks and matches web pages against keywords (a mix of analytical and empirical)
- well, let's go back to basics: think about how we might compare two sequences that don't match exactly
- the general idea is to use a 2D grid, with one sequence along each axis, to figure out the optimal alignment; each diagonal represents a possible alignment
- we can score each alignment based on how many pairs match
- but, what about insertions and deletions? we need a way to allow gaps in the alignment
- this means there's more than one interpretation for each diagonal (for each cell actually)
- another way to look at it is the alignment may be a jagged diagonal in the grid if gaps are allowed
- the good news is we can still score the jagged diagonal more or less the same way (with penalty for gaps) and a computer can fill this grid in pretty quickly
- here is an example that shows how we can keep track of a score and also keep track of the "flow" of these possibly jagged diagonals
- let's try our hand at manually following a variation of the Smith-Waterman algorithm
- here is a practice problem
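The grid-filling idea above can be sketched in a few lines of Python. This is a minimal variation of Smith-Waterman, assuming simple made-up scores (match = 2, mismatch = -1, gap = -2) rather than a biological matrix; the function name and parameters are my own, not from any library. The `score` grid keeps the running score and the `move` grid keeps the "flow" used to trace the jagged diagonal back.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j]: best score of a local alignment ending at a[i-1] and b[j-1]
    score = [[0] * cols for _ in range(rows)]
    # move[i][j]: which neighbor that score came from (None = alignment starts here)
    move = [[None] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap    # a[i-1] is aligned to a gap in b
            left = score[i][j - 1] + gap  # b[j-1] is aligned to a gap in a
            cell = max(0, diag, up, left)  # never go below zero (local alignment)
            score[i][j] = cell
            if cell == 0:
                move[i][j] = None
            elif cell == diag:
                move[i][j] = "diag"
            elif cell == up:
                move[i][j] = "up"
            else:
                move[i][j] = "left"
            if cell > best:
                best, best_pos = cell, (i, j)

    # Trace back from the best cell until we reach a cell where the score reset to 0.
    i, j = best_pos
    out_a, out_b = [], []
    while move[i][j] is not None:
        if move[i][j] == "diag":
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("ACGTACGT", "ACGTTACGT"))
```

Filling the grid takes time proportional to the product of the two sequence lengths, which is fine for a pair of sequences but is part of why exhaustive methods are slow against a whole database.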
Biology Knows the Score
- now what we need is a better, more biologically relevant, way to score these alignments
- in the 1970s, some biologists (Margaret Dayhoff and colleagues) estimated how likely one amino acid is to be substituted by another by counting the substitutions observed in alignments of closely related proteins
- they created a matrix called PAM1 (Point Accepted Mutation, sometimes expanded as Percent Accepted Mutation), built from protein families whose members were 85% or more identical; it represents 1 accepted substitution per 100 amino acids
- the matrix allowed for variations, such as PAM30 (PAM1 multiplied by itself 30 times), which represents 30 accepted substitutions per 100 amino acids
- such a variation would be less specific than PAM1, but useful for measuring more distant relationships
- in the 1990s, biologists derived substitution scores empirically from blocks of conserved, ungapped alignments in protein databases
- they created a matrix called BLOSUM62 (BLOcks SUbstitution Matrix); the 62 means that sequences within a block that were 62% or more identical were clustered together and counted as one
- again variations emerged such as BLOSUM45 (45% identity) and BLOSUM80 (80% identity)
- BLOSUM80 is helpful when the comparison is known to be relatively close; BLOSUM45 is useful for a more divergent comparison
- here are some common matrices
- how do PAM and BLOSUM compare? roughly, BLOSUM45 is comparable to PAM250, BLOSUM62 to PAM160, and BLOSUM80 to PAM120
- all of these matrices are scaled and rounded to derive simpler numbers (integers, not real numbers)
- sometimes we need to do this in reverse; that is, we normalize the integer scores with a scaling constant, lambda, to get something closer to the original real numbers
- it is these normalized scores that are used to calculate the E-value or Expect value: the number of alignments scoring at least this well that we would expect to see by chance in a search of this size (e.g. E = 0.0001 means we expect such a chance hit about once in 10,000 searches)
- but what about context? think about how something in one position constrains the probability of what comes next
- start with the probability of a 'q' showing up in an English word
- now what about the letter after that? instead of being equally likely among 26 possibilities, it's much more likely to be a 'u'
- this kind of conditional probability is what the more sophisticated matching programs take into account
- and it is also one reason why they typically take a long time to run
- for BLAST, we ignore context and call the measure an approximation; and it turns out it still works pretty well
- the initial version of BLAST assumed that the letters of the sequence are independent and identically distributed; this is still the case
- the initial version also made some assumptions about the length of sequences and assumed that the alignments did not contain any gaps; these are no longer assumed
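The step from a raw matrix score to an E-value can be sketched with the Karlin-Altschul formula, E = K * m * n * e^(-lambda * S), where S is the raw score, m is the query length, and n is the database length. This is a minimal sketch: the function names are my own, and the lambda and K values below are illustrative figures commonly quoted for ungapped BLOSUM62 scoring (treat them as assumptions; BLAST reports the values it actually used in its output). The sketch also ignores the length corrections that real BLAST applies.

```python
import math

# Illustrative constants (assumed here; not universal).
LAMBDA = 0.3176  # scaling constant that normalizes the integer matrix scores
K = 0.134        # constant that depends on the scoring system


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Normalize a raw score so it can be compared across scoring systems."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(raw_score, query_len, db_len, lam=LAMBDA, k=K):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


if __name__ == "__main__":
    for s in (40, 60, 80):
        print(s, round(bit_score(s), 1), expect_value(s, 250, 1_000_000))
```

Because the raw score sits in the exponent, a modest increase in score shrinks the E-value dramatically, while doubling the database size only doubles it.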
BLAST in Three Easy Steps (sort of)
- BLAST does its work in phases: seeding, extension, and evaluation
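- seeding finds words (and neighborhoods of similar words) whose score exceeds a seeding threshold (T in the picture)
- a word's score is the sum of its per-position scores from a substitution matrix; newer versions also use a "two-hit" algorithm, which requires two nearby hits on the same diagonal before extending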
- extension adds to both ends of a word/neighborhood
- however, we need a way to know when to stop adding; so track how far away we get from the best score we've seen so far
- then, if that distance passes an extension threshold (X in the picture), stop adding and back up to that highest score to end the extension
- evaluation throws out any alignment whose score is below an alignment threshold
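The three phases can be sketched as a toy program. This is a deliberately simplified illustration, not how BLAST itself is implemented: it assumes DNA-style exact word matches for seeding (no neighborhood words, no matrix, no two-hit rule), ungapped extension, and made-up scores of +1/-1. All function and parameter names (`find_seeds`, `extend_one_way`, `blast_like`, `x_drop`, `min_score`) are hypothetical.

```python
def find_seeds(query, subject, word_size):
    """Seeding: return (query_pos, subject_pos) pairs where a word matches exactly."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_one_way(query, subject, qi, sj, step, x_drop, match, mismatch):
    """Extension in one direction (step is +1 or -1).

    Returns (best_gain, best_len): the highest score reached and how many
    letters were added to reach it. Stops once the running score has fallen
    more than x_drop below the best score seen so far.
    """
    score = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # too far below the best score: back up to best_len
        qi += step
        sj += step
    return best, best_len


def blast_like(query, subject, word_size=4, x_drop=3, min_score=8,
               match=1, mismatch=-1):
    """Return sorted (score, query_start, subject_start, length) tuples."""
    hits = set()  # seeds on the same diagonal often extend to the same alignment
    for qi, sj in find_seeds(query, subject, word_size):
        right, right_len = extend_one_way(
            query, subject, qi + word_size, sj + word_size, +1,
            x_drop, match, mismatch)
        left, left_len = extend_one_way(
            query, subject, qi - 1, sj - 1, -1, x_drop, match, mismatch)
        total = word_size * match + left + right
        # Evaluation: keep only alignments at or above the alignment threshold.
        if total >= min_score:
            hits.add((total, qi - left_len, sj - left_len,
                      left_len + word_size + right_len))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    print(blast_like("GGGGACGTTGCATCAGGGG", "TTTTACGTTGCATCATTTT"))
```

Raising `word_size` or `min_score` makes the search faster and stricter, at the cost of missing weaker similarities, which mirrors the speed-versus-sensitivity trade-off described at the top of these notes.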