This document describes the theory behind the n-gram models generated by the wordpredictor package. It also provides code examples that describe how to use the package.
The goal of the wordpredictor package is to provide a flexible and easy to use framework for generating n-gram models for word prediction.
The package allows generating n-gram models. It also allows exploring n-gram frequencies using plots. Additionally it provides methods for measuring n-gram model performance using Perplexity and accuracy.
The n-gram model may be customized using several options such as n-gram size, data cleaning options and options for converting text to tokens.
The n-gram model generated by the wordpredictor package uses the Markov model for approximating the language model. It means that the probability of a word depends only on the probability of the n-1 previous words.
Maximum Likelihood Estimation (MLE) is used to calculate the probability of a word. The probability of a word is calculated by regarding the word as the last component of a n-gram. The total number of occurrences of the n-gram is divided by the total number of occurrences of the (n-1)-gram. This gives the probability for the word.
The n-gram model is generated in steps. In the first step, the input data is cleaned. Unwanted symbols and words are removed from the input data.
In the next step, the cleaned data file is read. N-grams are extracted from the file, starting from 1-grams up to the configured n-gram size. The 1-gram, 2-gram, 3-gram etc tokens are saved in separate files along with the frequency. So the 3-gram file contains all extracted 3-grams and their respective frequencies.
The next step is to generate transition probability tables for each n-gram file. For the 1-gram file the transition probability table is simply the list of unique words along with the word frequencies. For the other n-gram files, the transition probability table is a data frame with 3 columns. The hash of n-gram prefixes, the next word id and the next word probability.
The n-gram prefix is the set of n-1 components before the last component. The n-1 components are combined using “_” and converted to a numeric hash value using the digest2Int method of the digest package.
The next word id is the numeric index of the next word in the list of 1-grams. The next word probability is the probability of the next word given the previous n-1 words. It is calculated using Maximum Likelihood Estimation (MLE) as described above.
Instead of storing the n-gram prefix strings, a single number is saved. Also instead of storing the next word, the numeric index of the next word is saved. This saves a lot of memory and allows more data to be stored, which improves the n-gram model’s efficiency.
In R, a number requires a fixed amount of storage, which about 56 bytes. In contrast the memory required to store a string increases with the number of characters in the string.
The data frames that represent each transition probability table are combined into a single data frame. The combined transition probability table is used to make word predictions.
To predict a word, the word along with the n-1 previous words are used as input. The model computes the hash of the previous words and looks up the hash in the combined transition probabilities table. If the hash was found, then the model extracts the top 3 next word ids that have the highest probabilities.
The model looks up the next word text that corresponds to the next word ids. The result is the top 3 most likely next words along with their probabilities.
If the hash was not found, then the hash of the n-2 previous words is calculated and looked up in the combined transition probabilities table.
This process is repeated until there are no previous words. When this happens, the model returns a “word not found” message.
This method of checking the transition probabilities of lower level n-grams is called back-off. An alternate method of predicting a word is to use interpolation. This involves weighing and summing the probabilities for each n-gram size.
The wordpredictor package provides methods for performing intrinsic and extrinsic evaluation of the n-gram model.
The wordpredictor package performs intrinsic evaluation by calculating the mean Perplexity score for all sentences in a validation text file. The Perplexity for a sentence is calculated by taking the N-th root of the inverse of the product of probabilities of all words in a sentence. N is the number of words in the sentence.
The probability of a word is calculated by considering all n-1 words before that word. If the word was not found in the transition probabilities table, then the n-2 words are looked up. This process is repeated until there are no previous words.
If the word was found in the 1-gram list, then the probability of the word is calculated by simply dividing the number of times the word occurs by the total number words.
If the word was not found in the 1-gram list, then the model uses a default probability as the probability of the word. The default probability is calculated using Laplace Smoothing.
Laplace Smoothing involves adding 1 to the frequency count of each word in the vocabulary. Essentially this means that the total number of words in the data set are increased by vc, where vc is the number of words in the vocabulary.
In Laplace Smoothing 1 is added to the word count. Since an unknown word occurs zero times, after Laplace Smoothing it will have a count of 1. So the default probability is calculated as: P(unk) = 1/(N+VC), where N is the total number of words in the data set and VC is the number of words in the vocabulary. This default probability is assigned to unknown words. Alternative methods to Laplace Smoothing are Add-k smoothing, Kneser-Ney smoothing and Good-Turing Smoothing.
The wordpredictor package uses the file /usr/share/dict/cracklib-small as the dictionary file. This file is pre-installed in most Linux distributions.
Extrinsic evaluation involves calculating the accuracy score. The model tries to predict the last word of a sentence. If the actual last word was one of the 3 words predicted by the model, then the prediction is considered to be accurate. The accuracy score is the number of sentences that were correctly predicted.
The ModelGenerator class allows generating the final n-gram model using a single method call. The following example generates a n-gram model using default data cleaning and tokenization options:
# The required files <- c("input.txt") rf # The test environment is setup <- setup_env(rf, ve) ed # The following code generates n-gram model using default options for data # cleaning and tokenization. See the following section on how to customize these # options. Note that input.txt is the name of the input data file. It should be # present in the ed directory. The generated model file is also placed in this # directory. # ModelGenerator class object is created <- ModelGenerator$new( mg name = "def-model", desc = "N-gram model generating using default options", fn = "def-model.RDS", df = "input.txt", n = 4, ssize = 0.1, dir = ed, dc_opts = list(), tg_opts = list(), ve = ve ) # Generates n-gram model. The output is the file # ./data/model/def-model.RDS $generate_model() mg # The test environment is cleaned up clean_up(ve)
The wordpredictor package provides the ModelEvaluator class for evaluating the performance of the generated n-gram model.
The following example performs intrinsic evaluation. It measures the Perplexity score for each sentence in the validation.txt file. It returns the minimum, mean and maximum Perplexity score for each line.
# The required files <- c("def-model.RDS", "validate-clean.txt") rf # The test environment is setup <- setup_env(rf, ve) ed # The model file name <- paste0(ed, "/def-model.RDS") mfn # The path to the cleaned validation file <- paste0(ed, "/validate-clean.txt") vfn # ModelEvaluator class object is created <- ModelEvaluator$new(mf = mfn, ve = ve) me # The intrinsic evaluation is performed on first 20 lines <- me$intrinsic_evaluation(lc = 20, fn = vfn) stats # The test environment is cleaned up clean_up(ve)
The following example performs extrinsic evaluation. It measures the accuracy score for each sentence in validation.txt file. For each sentence the model is used to predict the last word in the sentence given the previous words. If the last word was correctly predicted, then the prediction is considered to be accurate. The extrinsic evaluation returns the number of correct and incorrect predictions.
# The required files <- c("def-model.RDS", "validate-clean.txt") rf # The test environment is setup <- setup_env(rf, ve) ed # The model file name <- paste0(ed, "/def-model.RDS") mfn # The path to the cleaned validation file <- paste0(ed, "/validate-clean.txt") vfn # ModelEvaluator class object is created <- ModelEvaluator$new(mf = mfn, ve = ve) me # The intrinsic evaluation is performed on first 100 lines <- me$extrinsic_evaluation(lc = 100, fn = vfn) stats # The test environment is cleaned up clean_up(ve)
The following example shows how to predict the next word. It returns the 3 possible next words along with their probabilities.
# The required files <- c("def-model.RDS", "validate-clean.txt") rf # The test environment is setup <- setup_env(rf, ve) ed # The model file name <- paste0(ed, "/def-model.RDS") mfn # An object of class ModelPredictor is created. The mf parameter is the name of # the model file that was generated in the previous example. <- ModelPredictor$new(mf = mfn, ve = ve) mp # Given the words: "how are", the next word is predicted. The top 3 most likely # next words are returned along with their respective probabilities. <- mp$predict_word(words = "how are", 3) res # The test environment is cleaned up clean_up(ve)
The wordpredictor package includes a demo called “word-predictor.” The demo is a Shiny application that displays the ten most likely words for a given set of words. To access the demo, run the following command from the R shell:
demo("word-predictor", package = "wordpredictor", ask = F).