Conditional Probability and Bayesian Networks
- a discrete probability distribution gives the probability for each of a collection of possible outcomes, e.g.
P( Grade(easy class) ) = < (A is .80), (B is .12), (C is .05), (D is .02), (F is .01) >
P( Grade(hard class) ) = < (A is .02), (B is .08), (C is .20), (D is .50), (F is .20) >
- the probabilities of all possible outcomes must sum to 1.0 (100%)
- if two events are independent, their distributions can be combined by multiplication to calculate the probability that both occur, e.g.
P( Grade(easy class) = C ^ Grade(hard class) = B) = .05 * .08 = .004
P( Grade(easy class) = A ^ Grade(hard class) = D) = .80 * .50 = .40
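A minimal sketch of the multiplication rule applied to the two grade distributions above. It assumes the two grades are independent of each other, which the notes do not state explicitly; the function name `joint_independent` is made up for illustration.

```python
# Grade distributions copied from the notes above.
easy = {"A": 0.80, "B": 0.12, "C": 0.05, "D": 0.02, "F": 0.01}
hard = {"A": 0.02, "B": 0.08, "C": 0.20, "D": 0.50, "F": 0.20}

def joint_independent(p_first, p_second):
    """Joint distribution of two events, ASSUMING they are independent."""
    return {
        (a, b): pa * pb
        for a, pa in p_first.items()
        for b, pb in p_second.items()
    }

joint = joint_independent(easy, hard)
print(round(joint[("C", "B")], 4))    # 0.004
print(round(joint[("A", "D")], 4))    # 0.4
print(round(sum(joint.values()), 6))  # 1.0, all 25 pairs together
```

Because each input distribution sums to 1.0, the 25 combined probabilities also sum to 1.0.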
- conditional probability establishes distributions given some prior knowledge (sometimes called evidence), e.g.
P(RetirementNestEgg(me)) = < (Lap of Luxury is .05), (Well Off is .25), (Doing OK is .50), (Roughing It is .15), (Cardboard House is .05) >
P(RetirementNestEgg(me) | WinLottery(me)) = < (Lap of Luxury is .20), (Well Off is .45), (Doing OK is .30), (Roughing It is .04), (Cardboard House is .01) >
- the second distribution is different because of the knowledge that I will win the lottery
- a more realistic conditional probability would include a distribution for winning the lottery as well
- in this scenario, we show all possible outcomes of the first distribution paired with all possible outcomes of the second distribution
- this is sometimes called a joint probability distribution, e.g.
| | Lap of Luxury | Well Off | Doing OK | Roughing It | Cardboard House |
|---|---|---|---|---|---|
| Win Lottery | .0001 | .00024 | .00015 | .00002 | .000005 |
| Not Win Lottery | .1999 | .44476 | .29985 | .03998 | .009995 |
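One way to build a joint table like this is the product rule, P(lottery, nest egg) = P(lottery) * P(nest egg | lottery). The sketch below is illustrative only: the lottery probability `p_win` is a hypothetical value, and the not-win row reuses the unconditional nest egg distribution from earlier as a stand-in. Neither assumption comes from the notes, so the numbers will not match the table.

```python
outcomes = ["Lap of Luxury", "Well Off", "Doing OK", "Roughing It", "Cardboard House"]

p_win = 0.0005  # hypothetical chance of winning the lottery (assumption)
nest_given_win = [0.20, 0.45, 0.30, 0.04, 0.01]   # from the notes
nest_given_lose = [0.05, 0.25, 0.50, 0.15, 0.05]  # assumption: reuse the prior

# Product rule: P(lottery, nest) = P(lottery) * P(nest | lottery)
joint = {}
for outcome, pw, pl in zip(outcomes, nest_given_win, nest_given_lose):
    joint[("Win Lottery", outcome)] = p_win * pw
    joint[("Not Win Lottery", outcome)] = (1 - p_win) * pl

# Every cell of a joint distribution taken together must sum to 1.
total = sum(joint.values())

# Summing a row recovers the marginal probability of that lottery outcome.
marginal_win = sum(p for (lottery, _), p in joint.items() if lottery == "Win Lottery")

print(round(total, 6), round(marginal_win, 6))
```

Summing over a row or a column (marginalizing) recovers the distribution of a single variable from the joint table.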
- the classic application of conditional probability is medical diagnosis
- consider an example using a tool called Hugin that models conditional probability with Bayesian networks
- thanks to Bayes' law, we know how to use joint distributions to infer in both directions: given evidence on one side, we can infer the probability on the other
- thus conditional probability is a powerful tool for analyzing cause and effect relationships
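Inference in the reverse direction can be sketched with Bayes' law, P(H | E) = P(E | H) P(H) / P(E). The example asks how likely it is that I won the lottery given that I retired in the lap of luxury. The prior of 0.0005 and the value of P(Lap of Luxury | not win) are hypothetical assumptions, and `bayes_posterior` is a made-up helper name.

```python
def bayes_posterior(prior_h, p_e_given_h, p_e_given_not_h):
    """Return P(H | E) for a yes/no hypothesis H using Bayes' law."""
    # Total probability of the evidence, summed over both cases of H.
    p_e = p_e_given_h * prior_h + p_e_given_not_h * (1 - prior_h)
    return p_e_given_h * prior_h / p_e

posterior = bayes_posterior(
    prior_h=0.0005,        # hypothetical P(win lottery)
    p_e_given_h=0.20,      # P(Lap of Luxury | win), from the notes
    p_e_given_not_h=0.05,  # assumed P(Lap of Luxury | not win)
)
print(posterior)
```

Under these assumed numbers the evidence raises the probability of a lottery win above its prior, but it stays small because winning is so rare to begin with.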
The Markov Property and Hidden Markov Models
- the Markov property is a simplifying assumption about dependencies among multiple conditional probability distributions
- it assumes that the conditional probability distribution of future states of a process depends only upon the present state, not any prior states
- this allows for a simpler (albeit approximate) method for calculating a joint distribution because we ignore some (most?) dependencies
- the basic idea is that we choose which dependencies are the most important ones a priori to create what is called a Markov model
- consider a Markov model for phonemes for the word "tomato"
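A Markov model of this kind can be sketched as a table of transition probabilities between phoneme states. The phoneme labels and the 50/50 split between the two pronunciations are illustrative assumptions, not values from the notes; the numeric suffixes only distinguish repeated phonemes.

```python
# Hypothetical pronunciation model for "tomato".
START, END = "<start>", "<end>"
transitions = {
    START: {"t1": 1.0},
    "t1": {"ow1": 1.0},
    "ow1": {"m": 1.0},
    "m": {"ey": 0.5, "aa": 0.5},  # "tomayto" versus "tomahto" (assumed split)
    "ey": {"t2": 1.0},
    "aa": {"t2": 1.0},
    "t2": {"ow2": 1.0},
    "ow2": {END: 1.0},
}

def path_probability(path):
    """Multiply P(next | current) along the path (the Markov property)."""
    prob = 1.0
    for current, nxt in zip(path, path[1:]):
        # A transition missing from the table has probability zero.
        prob *= transitions[current].get(nxt, 0.0)
    return prob

tomayto = [START, "t1", "ow1", "m", "ey", "t2", "ow2", END]
tomahto = [START, "t1", "ow1", "m", "aa", "t2", "ow2", END]
print(path_probability(tomayto))  # 0.5
print(path_probability(tomahto))  # 0.5
```

Each step looks only at the current state, so the probability of a whole pronunciation is just the product of its individual transitions.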
- to make this model infer words from phonemes, we need to create causal (a la conditional probability) relationships between phonemes and words
- thus word nodes are associated with phoneme nodes
- we observe the phoneme nodes (via audio input) and, using the causal model, we infer what words most likely produced these phonemes
- the word layer is considered hidden inside the model and we determine the sequence of words with the highest probability of "causing" the evidence (the observed phonemes)
- in this diagram, the variables across the top are the words which conditionally cause the phonemes across the bottom

- often more than one word may lead to a given phoneme (think of common phonemes in the English language)
- similarly one word may be associated with multiple phonemes
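- some of the common algorithms used for inference over the hidden states given the observed states include
- the forward-backward algorithm, which computes the probability of each hidden state at each step (the Baum-Welch algorithm builds on it to estimate the model's parameters from data), and
- the Viterbi algorithm, which finds the single most likely sequence of hidden states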
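A minimal sketch of the Viterbi algorithm on a toy HMM. The hidden states ("X", "Y"), the observed symbols ("a", "b", "c"), and all probabilities are hypothetical placeholders chosen only to make the example runnable; a speech model would have words and phonemes in their place.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[t][s] = (probability of the best path ending in s at step t,
    #               the state that preceded s on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][r][0] * trans_p[r][s] * emit_p[s][obs], r)
                for r in states
            )
        best.append(column)
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return prob, path

# Toy model (all numbers are assumptions for illustration).
states = ["X", "Y"]
start_p = {"X": 0.6, "Y": 0.4}
trans_p = {"X": {"X": 0.7, "Y": 0.3}, "Y": {"X": 0.4, "Y": 0.6}}
emit_p = {
    "X": {"a": 0.1, "b": 0.4, "c": 0.5},
    "Y": {"a": 0.6, "b": 0.3, "c": 0.1},
}

prob, path = viterbi(["a", "b", "c"], states, start_p, trans_p, emit_p)
print(path, prob)  # path is ['Y', 'X', 'X'], probability about 0.01344
```

The algorithm keeps only the best path into each state at each step, so its cost grows linearly with the length of the observation sequence instead of exponentially.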
- how would an HMM be used in bioinformatics? turns out these models are quite useful in inferring structure (or other useful attributes) from sequence
- consider how conditional probability captures more than simple position counts
- how then would someone design an HMM?
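- well, it's not magic; in fact, a number of bioinformatics programs generate such models automatically, e.g. HMMER builds profile HMMs from multiple sequence alignments, and PSI-BLAST builds a closely related position-specific scoring matrix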
- some of the first HMMs for English language understanding were derived from counting occurrences of certain sequences from a body of English literature
- consider how we might build an HMM to look for a particular protein domain or perhaps an entire gene
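The counting approach mentioned above can be sketched as follows: tally how often each symbol follows another in a training sequence, then normalize the tallies into transition probabilities. The DNA string is a made-up example, and this estimates only the transitions of a plain (visible) Markov chain; a full HMM would also need emission probabilities for its hidden states.

```python
from collections import Counter, defaultdict

def estimate_transitions(sequence):
    """Estimate P(next | current) by counting adjacent pairs in the sequence."""
    counts = defaultdict(Counter)
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    # Normalize each row of counts so that it sums to 1.
    return {
        current: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for current, followers in counts.items()
    }

model = estimate_transitions("ACGTACGGAC")  # hypothetical training sequence
for current, followers in sorted(model.items()):
    print(current, followers)
```

With more training data the estimates become more reliable; pairs that never occur in the data get no entry at all, which is why practical tools typically add smoothing or pseudocounts.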