Natural Language Processing Text Part 2

The project focuses on a natural language processing (NLP) exercise based on a text dataset, as a continuation of an existing project. The project is split into three parts, and this publication covers part 2.

Part 1 focused on introducing basic NLP concepts:

Part 2 focused on advanced NLP techniques:

Part 3 focused on cross validation and grid search for model tuning:

As a reference, the work is performed with Python and scikit-learn, and the dataset is an open dataset from Kaggle.


The previous work in part 1 of the project covered creating a bag of words using CountVectorizer from scikit-learn, understanding some of its important parameters, and performing a word and text length assessment for the full dataset and for each output class.

The dataset under work gathers around 120k questions from a general knowledge exam used in India for admission to the premier institutes. The dataset only has two columns: one for the question itself and the other for its classification. Therefore, the exercise focuses on natural language processing, with the objective of creating a model able to learn from text data and predict the subject from the exam question text.

This part of the project will focus on:
1. Advanced tokenization
2. TF-IDF rescaling
3. Model coefficients analysis
4. Bigrams and trigrams

1. ADVANCED TOKENIZATION
As seen in part 1 of the project, the bag of words includes all words detected in the text data and does not consider how similar those words are. Therefore, as depicted in the example question below, both "acid" and "acids" would be included in the bag of words.
However, the added value of having "acid" and "acids" as independent features is questionable. It does not look like the feature "acid" can be linked to one subject while "acids" is meaningful for a different one. Moreover, including both "acid" and "acids" makes the model more complex, so the likelihood of overfitting might increase. This behavior can be corrected by combining all words sharing a common word stem. In the following, two advanced tokenization techniques are mentioned, followed by an example of one of them.

STEMMING --> it includes methods based on heuristic rules, for example dropping common suffixes as a general rule. 

LEMMATIZATION --> it uses a human-verified system consisting of a dictionary including the known word forms. 

Let's make an example with LEMMATIZATION. To do that, the NLTK library is used to create a lemma class implementing the lemmatization tokenization technique based on the dictionaries provided by NLTK. For comparison purposes, a regular expression equal to the default one in CountVectorizer is defined. To use the lemmatization technique, an instance of the above class (LemmaTokenizer) is passed to the parameter "tokenizer" when calling the function CountVectorizer --> tokenizer=lemma.LemmaTokenizer()
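
A minimal sketch of such a tokenizer is shown below, assuming NLTK's WordNetLemmatizer as the dictionary-based lemmatizer and the default CountVectorizer token pattern; the class and variable names are illustrative, not the exact original code.

import re
import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("wordnet", quiet=True)  # dictionary used by the NLTK lemmatizer

class LemmaTokenizer:
    def __init__(self):
        # same regular expression used by default in CountVectorizer
        self.token_pattern = re.compile(r"(?u)\b\w\w+\b")
        self.lemmatizer = WordNetLemmatizer()

    def __call__(self, doc):
        # tokenize with the default regex and map each token to its lemma
        return [self.lemmatizer.lemmatize(token)
                for token in self.token_pattern.findall(doc.lower())]

# lemmatization is enabled through the "tokenizer" parameter
vect_lemma = CountVectorizer(tokenizer=LemmaTokenizer())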
Let's compare the result of applying lemmatization to the acid question above. The comparison is depicted below.
As can be observed, lemmatization combined both "acid" and "acids" into a single occurrence in the bag of words, tagged as "acid".

2. TF-IDF RESCALING
After calculating the bag of words, instead of directly applying the machine learning algorithm, the data might be rescaled with techniques that weight the importance of each word. One of the most common techniques to calculate word weights is term frequency–inverse document frequency (TF-IDF). The basis of this method is to give a high weight to words which have many occurrences in a particular sample but are not widely used in the rest of the samples. It makes sense to think that a word which is very common in one question, but not commonly used in the rest of the questions, might be very descriptive of the subject of that question. For example, the word "dna" is expected to appear several times in biology questions and very rarely (if at all) in mathematics. In that case, "dna" would be a potential word with a high TF-IDF score.

The TF-IDF score for a word w in a sample d is calculated as follows:

tfidf(w, d) = tf x (log((N + 1) / (Nw + 1)) + 1)

where:
N is the number of samples
Nw is the number of samples containing the word w
tf is the number of times the word w appears in the sample d

The score is limited to a minimum value of 1, since the worst scenario is N = Nw, and then log 1 = 0. From the equation, the higher tf and the lower Nw, the higher the TF-IDF score, which means a more robust and unique relation between the word and the sample under analysis.
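
As a quick numeric illustration of the formula (an illustrative sketch assuming N = 120000 samples and ignoring the length normalization that TfidfVectorizer applies by default):

import math

def tfidf(tf, Nw, N=120000):
    # tf x (log((N + 1) / (Nw + 1)) + 1), the smoothed TF-IDF score defined above
    return tf * (math.log((N + 1) / (Nw + 1)) + 1)

print(tfidf(tf=3, Nw=10))      # rare word repeated in the sample -> high score (about 31)
print(tfidf(tf=1, Nw=60000))   # word appearing in half of the samples -> low score (about 1.7)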

Let's calculate the words in the dataset with the highest and lowest TF-IDF scores using the code below. The function in scikit-learn is TfidfVectorizer, which calculates the bag of words and then applies the TF-IDF scoring.
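
A minimal sketch of that calculation is shown below, assuming "questions" holds the question texts loaded from the Kaggle dataset and a recent scikit-learn version; variable names are illustrative.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer()
bag_of_tfidf = vect.fit_transform(questions)   # sparse matrix: samples x detected words
print(bag_of_tfidf.shape)

# highest TF-IDF score reached by each word in any sample
max_scores = bag_of_tfidf.max(axis=0).toarray().ravel()
feature_names = np.array(vect.get_feature_names_out())

sorted_idx = np.argsort(max_scores)
print("Words with lowest TF-IDF score:", feature_names[sorted_idx[:20]])
print("Words with highest TF-IDF score:", feature_names[sorted_idx[-20:]])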
The output is a matrix with the same shape as the bag of words, so the output of TfidfVectorizer might be interpreted as a "bag of tfidf". This "bag of tfidf" has as many rows as samples and as many columns as words detected in the dataset.
Let's take a look at the words with the highest and lowest TF-IDF scores in the dataset. The words with the lowest scores are vague words, which have a clear meaning (everybody knows the word) but do not describe much about the potential application of that word. For example, words such as "question", "manner", "orders", "holding", "soon", "including" or "briefly" do not provide much description of the subject. These words have a low TF-IDF score because either they appear in lots of samples, or they appear in a reduced group of samples but without being repeated often. To put some numbers on it, a word which appears in only one sample with a single occurrence and a word which appears in 12k samples but 5 times in the sample under analysis would end up with a similar score. 

In contrast, the words with the highest TF-IDF score look very specific to the application. Most of them look like mathematical variables inside some equation, such as "b_", "t_" or "x_". They might come from questions with four options, each option having an equation and mixing the variables. There are also LaTeX operators such as "frac" or "mathrm". In addition, there are some apparently biology-related words such as "chromosome", "muscle" or "pulmonary", and some apparently chemistry-related words such as "nitrophenol" or "dinitrobenzene".

3. MODEL COEFFICIENTS ANALYSIS
Let's start creating a model for the application to get more information about the dataset. A linear model is used, since it tends to offer very good performance on high-dimensional sparse data. For the particular case of the coefficient analysis, the LOGISTIC REGRESSION algorithm has been selected, but the final model will be chosen by applying a complete grid search with cross validation among several model candidates. 

The first step is to split the input dataset into training and testing sets. An 80-20% split between training and testing sets is applied, selecting the options to shuffle the input dataset prior to the split and to stratify based on the output class. Thus, the full input dataset, the training set and the testing set all have the same output class distribution. As can be observed in the pie plot for each set, the percentage of questions per subject is exactly the same; only the number of questions in each set changes. A sketch of the split is shown below.
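
A minimal sketch of this split, assuming the question texts are in X and the subject labels in y (names and random seed are illustrative):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 80-20% split between training and testing sets
    shuffle=True,      # shuffle the input dataset prior to the split
    stratify=y,        # keep the same output class distribution in every set
    random_state=0)    # seed chosen here only for reproducibility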
Both the CountVectorizer and TfidfVectorizer methods are run, and then the LogisticRegression algorithm is applied with the C parameter equal to 1. After modeling, the coefficients per feature for each subject are assessed, plotting the most and least meaningful coefficients in the plots below. As the plots for both CountVectorizer and TfidfVectorizer and all subjects might be too much information, a summary of the three most and least meaningful features per subject can be found after the modeling sketch below. 
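
A hedged sketch of this modeling and coefficient inspection step (variable names are illustrative and the original code may differ):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

vect = TfidfVectorizer()                 # or CountVectorizer() for the comparison
X_train_vect = vect.fit_transform(X_train)

model = LogisticRegression(C=1, max_iter=1000)
model.fit(X_train_vect, y_train)

# one row of coefficients per subject, one column per feature
feature_names = np.array(vect.get_feature_names_out())
for subject, coefs in zip(model.classes_, model.coef_):
    order = np.argsort(coefs)
    print(subject)
    print("  most meaningful:", feature_names[order[-3:][::-1]])
    print("  least meaningful:", feature_names[order[:3]])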

MATHEMATICS
COUNTVECTORIZER positive coeffs --> solve, polynomial, evaluate
COUNTVECTORIZER negative coeffs --> reaction, significant, potential
TFIDFVECTORIZER positive coeffs --> solve, polynomial, evaluate
TFIDFVECTORIZER negative coeffs --> reaction, significant, potential

BIOLOGY
COUNTVECTORIZER positive coeffs --> tissue, ecosystem, cockroach
COUNTVECTORIZER negative coeffs --> l_, angle, current
TFIDFVECTORIZER positive coeffs --> tissue, ecosystem, cockroach
TFIDFVECTORIZER negative coeffs --> l_, current, angle

CHEMISTRY
COUNTVECTORIZER positive coeffs --> iupac, adsorption, orbitals
COUNTVECTORIZER negative coeffs --> band, polynomial, matrix
TFIDFVECTORIZER positive coeffs --> adsorption, iupac, solubility
TFIDFVECTORIZER negative coeffs --> sin, polynomial, tissue

PHYSICS
COUNTVECTORIZER positive coeffs --> friction, capacitance, gate
COUNTVECTORIZER negative coeffs --> root, probability, complex
TFIDFVECTORIZER positive coeffs --> friction, shm, gate
TFIDFVECTORIZER negative coeffs --> root, complex, probability

Main conclusions after assessing the results:

1. Both TfidfVectorizer and CountVectorizer are quite aligned in terms of giving the higher coefficients to the same words, at least for the most and least meaningful ones.

2. The most meaningful coefficients are linked to words that are very descriptive of the subject under analysis. For example, "friction" in physics, "iupac" (the International Union of Pure and Applied Chemistry) and "adsorption" (it is not a typo, it is a chemistry process) in chemistry, "tissue" in biology and "solve" in mathematics.

3. In the same way, the least meaningful coefficients are linked to words that are very descriptive of a subject different from the one under analysis. For example, "root" in physics, "polynomial" and "sin" in chemistry, "angle" in biology and "reaction" in mathematics.

4. The coefficients are much higher using CountVectorizer than using TfidfVectorizer, which points to a potential overfit with CountVectorizer and, in turn, potentially lower performance when generalizing to new data. The overfit might be caused by the fact that each vectorizer would need its own tuning of the regularization parameter (parameter C in LogisticRegression), while in this example both share the same value.

[Coefficient plots of the most and least meaningful features per subject (Mathematics, Biology, Chemistry, Physics), for both CountVectorizer and TfidfVectorizer]

4. BIGRAMS AND TRIGRAMS
For the moment, only single words have been considered as tokens during tokenization, but this limits potentially meaningful tokens such as "not good". Assessing the words "not" and "good" in a fully independent way might lead to wrong assumptions: "good" would make us think positive and "not" would make us think negative, so the meaning might be misunderstood due to the lack of context. However, "not good" leads to a clear negative thought with no room for different interpretations. 

"not good" would be an example of a bigram, a group of two tokens, but this concept can be generalized to n-gram. However, adding n-grams comes in exchange of a higher memory usage and computational time. For example, in the current exercise around 40k features were detected in the bag of words when working with unigrams, but that number increases to 600k when including bigrams and to 2 millions when including both bigrams and trigrams. Moreover, including high n-grams is progressively less optimal since the computational time sharply increases, but the benefit tends to flat. 

Using bigrams and trigrams is straightforward with scikit-learn. Both CountVectorizer and TfidfVectorizer include the parameter ngram_range, a two-position tuple specifying the range of n-grams to assess. For example, (1, 1) means assessing only unigrams, while (2, 3) means assessing both bigrams and trigrams, but not unigrams, as illustrated below.
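
A short illustration of the ngram_range parameter on a toy sentence (the example text is made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the movie is not good"]

unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(corpus)
print(unigrams.get_feature_names_out())
# ['good' 'is' 'movie' 'not' 'the']

uni_bi_tri = TfidfVectorizer(ngram_range=(1, 3)).fit(corpus)
print(uni_bi_tri.get_feature_names_out())
# adds bigrams such as 'not good' and trigrams such as 'is not good'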

The model coefficients were reassessed considering unigrams, bigrams and trigrams, keeping the C parameter of LogisticRegression equal to 1 and using only TfidfVectorizer this time. Results are depicted next.

As a general comment, the coefficients have been reduced since there are now many more features. The most and least meaningful features are mostly unigrams, but there are also some bigrams and trigrams in the list. 

Let's focus on biology for a deeper assessment. In biology, there are some n-grams among the least meaningful coefficients, especially n-grams built from LaTeX functions. This suggests they are n-grams related to equations, which would make sense since equations are expected more often in the other subjects than in biology. On the other side, there are no n-grams among the most meaningful coefficients, which means the biology unigrams are significant enough on their own. For example, words such as "dna", "blood", "tissue", "biomagnification" or "phytoplankton" look very descriptive of biology on their own.

[Coefficient plots with unigrams, bigrams and trigrams per subject: Biology, Chemistry, Mathematics, Physics]

The next step would be to apply a robust validation with cross validation and grid search across different parameter and technique combinations to find an optimal model tuning, which will be covered in a different publication (see the introduction).
I appreciate your attention and I hope you find this work interesting.

Luis Caballero