
Natural Language Processing Text Part 1

The project focuses on a natural language processing (NLP) exercise based on a text dataset. The project is split into three parts, and this publication covers part 1.

Part 1 focuses on introducing basic NLP concepts.

Part 2 focuses on advanced NLP techniques.

Part 3 focuses on cross validation and grid search for model tuning.

For reference, the work is done with Python and scikit-learn, and the dataset is an open dataset from Kaggle. The code and dataset can be found in the links below.


The dataset gathers around 120k questions from general knowledge admission exams for premier institutes in India. The dataset has only two columns: one for the question itself and the other for the subject classification. Therefore, the exercise focuses on natural language processing, with the objective of creating a model able to learn from the exam question text and predict the subject. Some examples of questions from the dataset are shown below.
The first step is to assess the dataset by checking its shape and the subject class distribution. The classes in this dataset are BIOLOGY, CHEMISTRY, MATHS and PHYSICS. The distribution among chemistry, mathematics and physics is fairly uniform, but there is a lower number of questions for biology. Overall, however, the class distribution is acceptable to select accuracy as the scoring method for this exercise and not to treat the dataset as imbalanced.
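As a reference, a minimal sketch of this assessment could look like the following, assuming the dataset is loaded into a pandas DataFrame; the file name and the 'subject' column name are assumptions, while data['eng'] holds the question text as mentioned later.

import pandas as pd

# Load the dataset (hypothetical file name) and inspect it
data = pd.read_csv('exam_questions.csv')
print(data.shape)                       # number of questions and columns
print(data['subject'].value_counts())   # class distribution per subject (assumed column name)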
When working with text data, a common approach is to represent the data with a bag of words. This process involves three steps:

1. Tokenization: split the text data into words (called tokens)
2. Vocabulary building: collect a vocabulary of all words/tokens in the text data and number them
3. Encoding: count how often each word in the vocabulary appears in the data

The bag of words can be built in scikit-learn using the CountVectorizer class. Note that CountVectorizer defines a word with the regular expression "\b\w\w+\b", meaning it will not capture single-digit or single-letter tokens. It will also not capture contractions such as "can't" as a single word, nor expressions such as "file.doc". The code below builds the bag of words using CountVectorizer and checks the shape of the created vocabulary. Note that data['eng'] refers to the pandas Series with the exam questions.
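A minimal sketch of this step, assuming the questions are stored in the pandas Series data['eng']:

from sklearn.feature_extraction.text import CountVectorizer

# Build the bag of words with default settings (token pattern "\b\w\w+\b")
vect = CountVectorizer()
bag_of_words = vect.fit_transform(data['eng'])

# Rows are samples and columns are vocabulary words
print(bag_of_words.shape)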
The bag of words has as many rows as dataset samples (in this case 122519 questions) and as many columns as words in the vocabulary. In this case, 42450 words were identified. The occurrence of each word can be calculated by summing each column of the bag of words, and the name of each word is obtained with the function get_feature_names_out(), which returns the vocabulary in alphabetical order. For reference, the first 50 features, the last 50 features and 50 features in the middle of the vocabulary are printed; the resulting output is discussed next.
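A possible sketch of this inspection, reusing vect and bag_of_words from the previous block:

import numpy as np

# Total occurrences of each word: sum the counts over all rows (samples)
word_counts = np.asarray(bag_of_words.sum(axis=0)).ravel()

# Vocabulary words in alphabetical order
features = vect.get_feature_names_out()

# Print the first 50, 50 in the middle and the last 50 features
middle = len(features) // 2
print(features[:50])
print(features[middle - 25:middle + 25])
print(features[-50:])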
The initial features correspond to numbers and the last features to strange tokens. These features are unlikely to have many occurrences, but the bag of words includes them even if a single occurrence is detected. If needed, CountVectorizer provides the parameter min_df, with a default value of 1, which defines the minimum number of samples in which a word must appear to be included in the bag of words. Thus, increasing min_df removes words that are tied to a single text sample and do not look very meaningful for the whole dataset. As reference, the code is run again with min_df = 5, generating the output below.
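A sketch of the same run with the vocabulary restricted to words appearing in at least 5 questions:

# Keep only words appearing in at least 5 samples
vect_min5 = CountVectorizer(min_df=5)
bag_of_words_min5 = vect_min5.fit_transform(data['eng'])
print(bag_of_words_min5.shape)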
The vocabulary has been clearly reduced to 13179 words with min_df = 5 (from 42450 with min_df = 1), and the number of strange and numeric tokens has also decreased. The effect of this parameter will be assessed when tuning the model with grid search.

Let's now assess the most common words in the dataset in the plot below.
The total number of words in the bag of words is 42450, as previously mentioned. The first position, with almost 250k occurrences (roughly twice per question), is the word "the". There are also many occurrences of words such as "of", "is", "and", "in", "to", "if", "for", "with" and "by". That is expected, since any English text will probably contain a significant number of occurrences of those words. However, they do not appear to provide any value for subject classification; for example, a heavy use of "the" is not indicative of facing a mathematics problem. Scikit-learn already provides a way to remove words that are very common in English sentence building but lack enough meaning to provide learning power to a machine learning model. These words are known as stop words, and they are removed by setting the parameter stop_words to "english" in CountVectorizer. The new most common words in the bag of words are as follows.
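A possible sketch for ranking and plotting the most common words after removing the English stop words (the plotting details are an assumption, not the exact code behind the figures):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer

# Bag of words without English stop words
vect_sw = CountVectorizer(stop_words='english')
bag_sw = vect_sw.fit_transform(data['eng'])
features = vect_sw.get_feature_names_out()

# Rank the vocabulary by total occurrences and plot the 50 most common words
counts = np.asarray(bag_sw.sum(axis=0)).ravel()
top = counts.argsort()[::-1][:50]

plt.figure(figsize=(14, 4))
plt.bar(features[top], counts[top])
plt.xticks(rotation=90)
plt.ylabel('occurrences')
plt.tight_layout()
plt.show()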
The total number of words has been reduced to 42165, meaning a little fewer than 300 words were removed. Words such as "the", "is" or "and" are no longer in the list, as expected. By assessing the most common words, the next three clusters might be identified.

1. LATEX FUNCTIONS: The most common words now are "boldsymbol", "frac", "cdot", "mathrm", "mathbf", "right" and "left", which correspond to LaTeX functions for text formatting. A heavy use of these functions is expected when writing equations, so it makes sense to see these words more often in mathematics, physics or chemistry questions than in biology.

2. NON SIGNIFICANT WORDS: Words such as "correct", "assertion", "respectively", "following", "options", "given" or "statement". These words are not common in normal spoken English, but they appear often when writing exam questions, for example "given X, what would be Y?" or "Which of the following assertions/options are correct?". They might be removed when optimizing the model scoring, since they are very generic with no apparent learning power.

3. SIGNIFICANT WORDS: Words that are meaningful for a particular subject, such as mathematical operators like "cos" or "sin", mathematical features like "radius" or "area", physics features like "velocity", "mass" or "energy", or chemistry nouns like "acid". Some words are significant but might apply to several subjects; for example, "solution" might refer to the solution of a problem or to a chemistry solution.

To have a more robust understanding of the dataset, let's look at the most common words per subject instead of focusing on the whole dataset. The next plot depicts the 50 most common words per subject.
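One possible way to compute the per-subject ranking behind such a plot, reusing bag_sw and features from the previous sketch (the 'subject' column name remains an assumption):

def top_words(subject, n=50):
    # Select the rows of the bag of words belonging to one subject
    mask = (data['subject'] == subject).to_numpy()
    counts = np.asarray(bag_sw[mask].sum(axis=0)).ravel()
    # Return the n words with the highest total count for that subject
    return features[counts.argsort()[::-1][:n]]

for subject in data['subject'].unique():
    print(subject, top_words(subject))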
Effectively, the LaTeX functions are a meaningful indication that the question under assessment is not focused on biology, but, on the other hand, it would be hard to confirm which of the other subjects would be the correct one. Surprisingly, "radius" was considered more related to mathematics, but it appears as a common word in physics and not in mathematics (at least within the 50 most common words). In the same way, "acid" was thought to be related to chemistry, but it is also a common word in biology. However, this wording assessment does not try to categorize the subjects by the words according to our intuition; it is just a data visualization exercise to get more familiar with the dataset, and only the most interesting and surprising points are highlighted.

Glancing at the most common words per subject, and ignoring the LaTeX functions and the non significant words, there looks to be a considerable group of words with high learning power, such as the ones below.

BIOLOGY --> cell, plant, dna, body, species, heart, human...
CHEMISTRY --> reaction, solution, hydrogen, compound, gas...
MATHS --> sin, cos, angle, tan, theta, triangle, area, line, function, sum...
PHYSICS --> mass, velocity, energy, distance, speed, particle, potential...

However, that view only considers the 50 most common words, so let's check the complete word interdependency among classes in the next plot.
For example, physics has 16.5k words, of which it shares 5k with mathematics, 7.5k with chemistry and 5.5k with biology. The interdependency factor with the rest of the subjects is similar. A significant degree of interdependency is expected, since English has lots of non subject related words, but even so, there is a meaningful share of exclusive words per subject which would help to create a model with high learning power. The number of words present in all subjects was calculated, reaching 2767 words (around 6.5% of the total words). The number of words exclusive to each subject was also calculated; it varies by subject because the vocabulary size per subject is different, but on average around 40-50% of the words in each subject class are exclusive to that subject. This suggests that a model with a high accuracy score might be built from the current dataset.
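A sketch of how the overlap and exclusivity figures could be derived, building the set of words that appear at least once in each subject and comparing the sets:

# Words appearing at least once per subject
subject_words = {}
for subject in data['subject'].unique():
    mask = (data['subject'] == subject).to_numpy()
    counts = np.asarray(bag_sw[mask].sum(axis=0)).ravel()
    subject_words[subject] = set(features[counts > 0])

# Words shared by all subjects
common = set.intersection(*subject_words.values())
print(len(common), 'words present in all subjects')

# Words exclusive to each subject
for subject, words in subject_words.items():
    others = set().union(*(w for s, w in subject_words.items() if s != subject))
    exclusive = words - others
    print(subject, len(exclusive), 'exclusive out of', len(words), 'words')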
Now let's assess the word length of the questions to check if there is some new learning hidden there. For example, it might be the case that the questions for one subject are longer or shorter than the rest, and adding the length as an additional feature of the model might be effective.

The length of each sample can be calculated as the sum of each row of the bag of words. Remember that the rows are the samples in the dataset and the columns the words. The histograms below depict the word length distribution in the whole dataset and per subject.
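A sketch of this computation, reusing the stop-word-free bag of words bag_sw (the plotting details are an assumption):

# Word length per question: sum of each row of the bag of words
lengths = np.asarray(bag_sw.sum(axis=1)).ravel()

# Histogram for the whole dataset and per subject
subjects = data['subject'].unique()
fig, axes = plt.subplots(1, len(subjects) + 1, figsize=(20, 3))
axes[0].hist(lengths, bins=50)
axes[0].set_title('ALL')
for ax, subject in zip(axes[1:], subjects):
    mask = (data['subject'] == subject).to_numpy()
    ax.hist(lengths[mask], bins=50)
    ax.set_title(subject)
plt.tight_layout()
plt.show()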
The plots show that most questions have fewer than 50 words, with no significant difference among subjects. However, there is an outlier question with more than 500 words which does not allow inspecting the data in detail. That outlier is suspicious, since the question with the second highest word count has around 200 words, so let's print the outlier to check if there is something wrong with it.
Effectively, something looks wrong. The question itself is correct, but it is followed by hundreds of occurrences of bar{b}, which looks like an error when generating the dataset. The input dataset is manually corrected by removing all repeated bar{b} instances in that particular sample, and the new word length histograms are depicted next.
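The correction here is applied by hand; a programmatic alternative to locate the outlier and collapse the repeated bar{b} tokens could look like this sketch (the regular expression is an assumption):

import re

# Locate the question with the largest word count and print it
outlier_idx = int(lengths.argmax())
print(data['eng'].iloc[outlier_idx])

# Collapse consecutive repetitions of 'bar{b}' into a single occurrence
fixed = re.sub(r'(?:\s*bar\{b\}){2,}', ' bar{b}', data['eng'].iloc[outlier_idx])
data.iloc[outlier_idx, data.columns.get_loc('eng')] = fixed

# The bag of words and the lengths would then be rebuilt as before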
Now the histograms look much better, enabling a better overview of the word length distribution in the dataset and per subject. In the previous histogram, due to the wrong outlier, there was not enough resolution to interpret the dense area of the data. Looking in more detail at the new graphs, around a third of the samples have fewer than 10 words and another third between 10 and 20 words. That trend looks to hold for all subjects, with no significant distribution change among them.

One last thing is to confirm that the length is being properly calculated, since 10 words looks like a very small number for so many questions. To do that, a reduced set of 4 samples with fewer than 10 words is considered, generating the next bag of words to be assessed manually.
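A sketch of how such a reduced bag of words could be extracted for manual inspection (the choice of the 4 samples here is hypothetical, not necessarily the samples shown below):

# Take 4 questions with fewer than 10 counted words
short_idx = np.where(lengths < 10)[0][:4]
small = bag_sw[short_idx]

# Keep only the vocabulary columns actually used by these 4 questions
used_cols = np.where(np.asarray(small.sum(axis=0)).ravel() > 0)[0]
table = pd.DataFrame(small[:, used_cols].toarray(),
                     columns=features[used_cols],
                     index=data['eng'].iloc[short_idx])
print(table)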
Now let's do a manual count of the words in these 4 samples. The first sample has 9 words, the second 3, the third 5 and the fourth 9. The words counted by CountVectorizer are marked in red in the picture below, and effectively, all red-marked words are in the previous bag of words. However, some expressions such as single-digit equation variables or single-letter abbreviations of chemical substances are excluded due to the default regular expression of CountVectorizer ("\b\w\w+\b"), which ignores single-character tokens. Moreover, there are some multi-letter words which are not counted, such as "to", "the", "give", "two", "what" or "name", even though they match the default regular expression. The reason is that the simulation ran with stop_words = 'english', meaning that all words which are extremely common in the English language but not meaningful enough to provide learning power to the model are removed, and all of these words belong to that list.
To confirm it, ENGLISH_STOP_WORDS is imported from scikit-learn to check that all the multi-letter words previously excluded from the bag of words are effectively in that list.
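A short check, using the public ENGLISH_STOP_WORDS set from scikit-learn:

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# All the multi-letter words missing from the reduced bag of words
# should be part of the built-in English stop word list
for word in ['to', 'the', 'give', 'two', 'what', 'name']:
    print(word, word in ENGLISH_STOP_WORDS)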
Just as a curiosity, running the simulation again with stop_words = None moves the histogram to the right, extending the word count of all questions, since now common English words such as "the" or "is" are counted. With this setting, the number of questions with fewer than 10 words is 11k, whereas it was 32k when using stop_words = 'english'.
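A sketch of that comparison, counting the questions with fewer than 10 words under both settings:

# Rebuild the bag of words keeping the stop words (stop_words=None is the default)
lengths_all = np.asarray(CountVectorizer().fit_transform(data['eng']).sum(axis=1)).ravel()

print((lengths_all < 10).sum())   # around 11k questions with stop words kept
print((lengths < 10).sum())       # around 32k questions with stop_words='english'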
The next steps are to apply advanced natural language processing (NLP) techniques to optimally scrub the text data prior to modeling, and then to apply a robust validation with cross validation and grid search across different parameter combinations to find an optimal model tuning. These steps will be covered in separate publications (see introduction).
I appreciate your attention and I hope you find this work interesting.

Luis Caballero