Let $c(w, w')$ be the number of occurrences of the word $w$ followed by the word $w'$ in the corpus. The equation for bigram probabilities is as follows:

$$p_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - \delta, 0)}{\sum_{w'} c(w_{i-1}, w')} + \lambda_{w_{i-1}} \, p_{KN}(w_i)$$ [4]
Where the unigram probability $p_{KN}(w_i)$ depends on how likely it is to see the word $w_i$ in an unfamiliar context, which is estimated as the number of distinct words it appears after, divided by the number of distinct pairs of consecutive words in the corpus:

$$p_{KN}(w_i) = \frac{|\{ w' : 0 < c(w', w_i) \}|}{|\{ (w', w'') : 0 < c(w', w'') \}|}$$
Note that $p_{KN}$ is a proper distribution, as the values defined in the above way are non-negative and sum to one.
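To make the continuation probability concrete, it can be computed directly from the set of distinct bigram types. The following Python sketch is illustrative (the function name, toy corpus, and representation of counts are assumptions, not from the source):

```python
from collections import Counter

def continuation_prob(bigram_counts, word):
    """Unigram continuation probability: the number of distinct words
    `word` appears after, divided by the number of distinct bigram types."""
    num_contexts = sum(1 for (w1, w2) in bigram_counts if w2 == word)
    return num_contexts / len(bigram_counts)

# toy corpus with 7 distinct bigram types
tokens = "the cat sat on the mat the cat ran".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
print(continuation_prob(bigram_counts, "cat"))  # 1/7: "cat" only ever follows "the"
```

Note that the numerator counts bigram *types*, not tokens: even though "the cat" occurs twice, "cat" still has only one distinct left context.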
The parameter $\delta$ is a constant which denotes the discount value subtracted from the count of each n-gram, usually between 0 and 1.
The value of the normalizing constant $\lambda_{w_{i-1}}$ is calculated to make the sum of conditional probabilities $p_{KN}(w_i \mid w_{i-1})$ over all $w_i$ equal to one. Observe that (provided $\delta < 1$) for each $w_i$ which occurs at least once in the context of $w_{i-1}$ in the corpus, we discount the probability by exactly the same constant amount $\delta / \left( \sum_{w'} c(w_{i-1}, w') \right)$, so the total discount depends linearly on the number of unique words $w_i$ that can occur after $w_{i-1}$. This total discount is a budget we can spread over all $p_{KN}(w_i \mid w_{i-1})$ proportionally to $p_{KN}(w_i)$. As the values of $p_{KN}(w_i)$ sum to one, we can simply define $\lambda_{w_{i-1}}$ to be equal to this total discount:

$$\lambda_{w_{i-1}} = \frac{\delta}{\sum_{w'} c(w_{i-1}, w')} \, |\{ w' : 0 < c(w_{i-1}, w') \}|$$
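Putting the discounted count, the normalizing constant, and the continuation probability together gives the full bigram model. The sketch below is an illustrative implementation, not the source's code; it assumes the context word has been observed at least once, and uses $\delta = 0.75$, a commonly used default, as its discount:

```python
from collections import Counter

def p_kn_bigram(bigram_counts, prev, word, delta=0.75):
    """Interpolated Kneser-Ney bigram probability (sketch; assumes
    `prev` occurs as a context at least once in the corpus)."""
    # total count of bigrams whose first word is `prev`
    context_total = sum(c for (w1, _), c in bigram_counts.items() if w1 == prev)
    # discounted higher-order term
    higher = max(bigram_counts.get((prev, word), 0) - delta, 0) / context_total
    # lambda: the total discount, linear in the number of distinct followers
    num_followers = sum(1 for (w1, _) in bigram_counts if w1 == prev)
    lam = delta / context_total * num_followers
    # continuation unigram probability
    p_cont = sum(1 for (_, w2) in bigram_counts if w2 == word) / len(bigram_counts)
    return higher + lam * p_cont

tokens = "the cat sat on the mat the cat ran".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
vocab = {w2 for (_, w2) in bigram_counts}
# with lambda defined as the total discount, the values sum to one
print(sum(p_kn_bigram(bigram_counts, "the", w) for w in vocab))
```

The final `print` checks the normalization argument above: the mass removed by discounting in the context "the" is exactly the mass handed to the continuation distribution, so the conditional probabilities sum to one.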
This equation can be extended to n-grams. Let $w_{i-n+1}^{i-1}$ be the $n-1$ words before $w_i$:

$$p_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max(c(w_{i-n+1}^{i-1}, w_i) - \delta, 0)}{\sum_{w'} c(w_{i-n+1}^{i-1}, w')} + \delta \frac{|\{ w' : 0 < c(w_{i-n+1}^{i-1}, w') \}|}{\sum_{w'} c(w_{i-n+1}^{i-1}, w')} \, p_{KN}(w_i \mid w_{i-n+2}^{i-1})$$ [5]
This model uses the concept of absolute-discounting interpolation, which incorporates information from higher- and lower-order language models. The addition of the term for lower-order n-grams adds more weight to the overall probability when the count for the higher-order n-gram is zero.[6] Similarly, the weight of the lower-order model decreases when the count of the n-gram is non-zero.
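The recursive n-gram form can be sketched the same way: each level discounts its counts and interpolates with the model one order lower, dropping the oldest context word, until the recursion bottoms out at the continuation unigram. This Python version is illustrative only (names, data layout, and the simplifying assumption that every queried context occurs in the corpus are mine):

```python
from collections import Counter

def p_kn(counts_by_order, context, word, delta=0.75):
    """Recursive Kneser-Ney probability of `word` given a tuple `context`.
    `counts_by_order[k]` maps k-gram tuples to counts. Sketch only: assumes
    `context` occurs in the corpus; delta=0.75 is an assumed default."""
    if not context:
        # base case: continuation unigram computed from distinct bigram types
        bigrams = counts_by_order[2]
        return sum(1 for k in bigrams if k[1] == word) / len(bigrams)
    counts = counts_by_order[len(context) + 1]
    context_total = sum(c for k, c in counts.items() if k[:-1] == context)
    num_followers = sum(1 for k in counts if k[:-1] == context)
    lam = delta / context_total * num_followers
    higher = max(counts.get(context + (word,), 0) - delta, 0) / context_total
    # interpolate with the next-lower-order model (drop the oldest word)
    return higher + lam * p_kn(counts_by_order, context[1:], word, delta)

tokens = "the cat sat on the mat the cat ran".split()
counts_by_order = {n: Counter(zip(*(tokens[i:] for i in range(n))))
                   for n in (2, 3)}
print(p_kn(counts_by_order, ("the", "cat"), "sat"))
```

When the trigram count of ("the", "cat", word) is zero, the first term vanishes and the probability comes entirely from the weighted bigram model, matching the behavior described above.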