
Natural Language Processing Text Part 3

The project focuses on a natural language processing (NLP) exercise based on a text dataset, as a continuation of an existing project. The project is split into three parts, and this publication covers part 3.

Part 1 focused on introducing basic NLP concepts:

Part 2 focused on advanced NLP techniques:

Part 3 focuses on cross validation and grid search for model tuning:

For reference, the work is performed with Python and scikit-learn, and the dataset is an open dataset from Kaggle. The code and dataset can be found in the links below.


The previous work in part 1 of the project created a bag of words using CountVectorizer from scikit-learn, explained some of its important parameters, and performed a word count and length assessment for the full dataset and for each output class. In part 2, advanced tokenization and rescaling techniques were assessed, introducing the concept of n-grams and evaluating the model coefficients under different conditions.

The dataset under work gathers around 120k questions from a general knowledge exam used to gain access to the premier institutes in India. The dataset only has two columns: one for the question itself and another for its classification. Therefore, the exercise focuses on natural language processing, with the objective of creating a model able to learn from the text data and predict the subject from the exam question text.

In this part of the project, the focus will be on finding an optimal model with the corresponding parametrization. For that purpose, a pipeline with two steps, preprocessing and classification, will be created using the code below.
Since the pipeline requires the real estimator object, forcing users to write its exact spelling can lead to errors, so a middle-man approach is used: the functions create_preprocess and create_model convert the user input into the real estimator to be loaded in the pipeline. Therefore, if the user introduces 'count', 'Count', 'COUNT' or even 'counttt', the code will pass CountVectorizer() to the pipeline.
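As a reference, this is a minimal sketch of how such a middle-man mapping and pipeline might look; the function names create_preprocess and create_model come from the project, but the matching logic shown here is an assumption for illustration.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

def create_preprocess(name):
    # Tolerant matching: 'count', 'Count', 'COUNT' or even 'counttt' map to CountVectorizer
    if name.strip().lower().startswith('count'):
        return CountVectorizer()
    return TfidfVectorizer()

def create_model(name):
    # Map a loosely spelled user input to the real classifier object
    name = name.strip().lower()
    if name.startswith('log'):
        return LogisticRegression(max_iter=1000)  # max_iter raised as a practical choice
    if name.startswith('svc') or 'svm' in name:
        return LinearSVC()
    return MultinomialNB()

# Two-step pipeline: text preprocessing followed by classification
pipe = Pipeline([
    ('preprocess', create_preprocess('count')),
    ('classifier', create_model('logistic')),
])
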
For PREPROCESS, the two main text data representation methods discussed previously are considered (a small toy example follows this list):
- CountVectorizer to calculate the bag of words, meaning the occurrence of each word in each sample. It simply calculates the number of occurrences of each word, without applying advanced scoring methods.
- TfidfVectorizer to calculate the TFIDF score for each word in the bag of words. The TFIDF score is a measure which increases with the occurrences of a word in a particular sample but decreases if the word is used in many samples, so it measures how meaningful a word is to describe a particular sample.
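As a quick illustration of the difference, the snippet below fits both vectorizers on a small toy corpus (not part of the exam dataset): a word that appears in every sample, such as 'the', keeps its raw count in the bag of words but gets the lowest TFIDF weight within each sample.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "solve the equation",
    "solve the reaction",
    "the ecosystem tissue",
]

# Bag of words: raw occurrence counts per sample
counts = CountVectorizer().fit_transform(docs)

# TFIDF: occurrences are down-weighted for words present in many samples,
# so ubiquitous words score lower than rarer, more descriptive ones
tfidf = TfidfVectorizer().fit_transform(docs)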

For CLASSIFIER, linear models are considered since they tend to offer very robust performance on high dimensional sparse datasets such as the current one. Therefore, both LinearSVC and LogisticRegression will be used. Moreover, Naive Bayes methods will also be used, specifically the MultinomialNB algorithm, since GaussianNB is more focused on continuous data rather than sparse exercises such as text data, and BernoulliNB, in spite of being a good option for text data, is more focused on binary features.

Regarding parametrization, the following model parameters will be considered in the tuning sweep:

STOPWORDS
Two approaches will be assessed:
- None, so that no word is excluded from the representation process.
- English, to exclude a group of 318 words which scikit-learn has identified as very common and non-descriptive. Most of these words are depicted below.
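The list used by scikit-learn when stop_words='english' is available directly, so it can be inspected as below.

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Built-in English stop word list applied when stop_words='english'
print(len(ENGLISH_STOP_WORDS))          # 318 words
print(sorted(ENGLISH_STOP_WORDS)[:10])  # first few words in alphabetical order
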
MAX_DF
This parameter excludes a word from the bag of words if the word is widely used in the dataset. The parameter is controlled as a float: 1.0 includes all words, 0.0 excludes all words, and 0.5 excludes the words which appear in more than 50% of the samples. Thus, a sweep across these values is performed to assess the model score under different values of max_df.

MIN_DF
Similarly to MAX_DF, the parameter MIN_DF will be used, but for the opposite purpose. Therefore, if a word is only used in a limited number of samples, it will be excluded from the bag of words, as it is considered not very descriptive.

NGRAM_RANGE
This parameter corresponds to the n-gram level to consider for the model. By default, the model only considers single words (unigrams), excluding bigrams (pairs of consecutive words), trigrams (triplets of consecutive words) or, generally speaking, n-grams with n > 1. As demonstrated previously, n-grams can capture some information which is lost when assessing each word separately. However, in exchange they come with a very high memory and computational workload, since the number of features grows drastically.
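For reference, these four vectorizer parameters are passed directly to CountVectorizer or TfidfVectorizer; the values below are placeholders for illustration, since the sweeps that follow determine the final settings.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    stop_words='english',   # or None to keep all words
    max_df=0.5,             # drop words appearing in more than 50% of the samples
    min_df=5,               # drop words appearing in fewer than 5 samples
    ngram_range=(1, 2),     # unigrams and bigrams; (1, 1) keeps only single words
)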

MODEL REGULARIZATION
Linear models include the parameter C and MultinomialNB includes the parameter alpha to control the model regularization. C and alpha work in opposite directions:
- Higher C values lead to less demanding regularization, so coefficients are larger, creating a more complex model with a higher likelihood of overfitting.
- Higher alpha values lead to stronger model smoothing, so a more restricted and simpler model, reducing the likelihood of overfitting.

Therefore, in summary, the grid search in the pipeline will cover the following sweeping conditions:
- 3 classifier methods
- 2 preprocess methods
- 5 parameters
          - Stopwords: binary sweep between None and English
          - Max_df: around 10 combinations
          - Min_df: around 10 combinations
          - Ngram_range:  unigrams, bigrams, trigrams and tetragrams
          - Model regularization: around 10 combinations

That is too big a search, with thousands of simulations, to execute all at once, since the memory and computational time would be far beyond expectations. Moreover, even running all combinations, the complete range would not be covered for each parameter with enough resolution. For example, the regularization parameter might be as low as 0.0001 or as high as 1000, so a proper grid search would sweep the values 0.0001, 0.001, 0.01, 0.1, 1, 10, 100 and 1000. That looks like a reasonable split for the range under analysis, but if the best score is at 10, it would still be necessary to know whether, for example, 5 or 50 can provide even better results, and a new grid search would be required.

Instead, a better approach is to run a single-parameter grid search first to anticipate the potential area of higher scores for each parameter, and then define the complete grid search accordingly, concentrating the resolution in the most promising tuning area. For example, as a testing case to anticipate the best parametrization, the settings below are considered:

Classifier --> Logistic Regression
Preprocess --> CountVectorizer
Max_df --> 1.0
Min_df --> 1
Ngram_range --> (1, 1) (only unigrams)
Stopwords --> English
Regularization parameter --> 1

Each parameter will be swept independently, keeping the rest of the parameters fixed, and according to the results, a decision on the sweep range of each parameter in the complete grid search will be taken.
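As an illustration, a minimal sketch of one of these single-parameter sweeps (max_df in this case) is shown below, using the baseline settings listed above; the exact swept values, the cross validation setting and the variable names X_train / y_train (question texts and subject labels from the training split) are assumptions.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Baseline configuration; only max_df is swept, the rest stays fixed
pipe = Pipeline([
    ('preprocess', CountVectorizer(stop_words='english', min_df=1, ngram_range=(1, 1))),
    ('classifier', LogisticRegression(C=1, max_iter=1000)),
])

param_grid = {'preprocess__max_df': [0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 1.0]}

grid = GridSearchCV(pipe, param_grid, scoring='accuracy', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
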
MAX_DF PARAMETER SWEEP
For max_df higher than 0.4, the score does not change, so it seems there are no words applied to more than 40% of the samples. At lower values, the score slightly decreases. However, this simulation runs with stopwords=English, which already excludes the most common words and overlaps with the purpose of max_df, so it makes sense to reassess the max_df performance defining stopwords=None this time, as depicted in the plot below.
As expected, max_df has some impact on the score when all words are left unfiltered. Therefore, in order to capture the benefit of reducing max_df under stopwords=None, the decision is to sweep MAX_DF = 0.25, 0.4 and 1.0 in the complete grid search.

MIN_DF PARAMETER SWEEP
The trend is very clear: the lower min_df, the higher the score, and the score is highly sensitive to this parameter. Note that min_df filters very uncommon words, so there is no need to reassess it with stopwords=None. The decision here is to use MIN_DF = 1 in the complete grid search.

NGRAM_RANGE PARAMETER SWEEP
Using bigrams has a positive effect on the score, but adding further n-grams such as trigrams or tetragrams does not provide additional value; the score is even slightly reduced. Therefore, the clear decision is to define bigrams in the complete grid search.

For the regularization parameter assessment, a simulation of each model sweeping its regularization parameter is required. 

LOGISTIC REGRESSION REGULARIZATION PARAMETER
The optimal value is around 1, so the decision is to sweep this parameter over the values [0.5, 0.75, 1, 1.25, 1.5, 2, 3, 4, 5] to be sure the best performance is captured.

LINEARSVC REGULARIZATION PARAMETER
The optimal value is around 0.1, so the decision is to sweep this parameter over the values [0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75] to be sure the best performance is captured.

MULTINOMIALNB REGULARIZATION PARAMETER
The optimal value is between 0.0001 and 0.001, so the decision is to sweep this parameter over the values [0.00005, 0.0001, 0.0005, 0.001, 0.0015, 0.0025] to be sure the best performance is captured.

Therefore, the complete grid search to run is defined in the picture below. Note the scoring method is set to accuracy since this is not an imbalanced problem. The number of cross validation splits is set to 5.
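A sketch of that complete grid search is shown below, assuming the pipeline step names 'preprocess' and 'classifier' defined earlier; the parameter values are the ones decided in the sweeps above.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([('preprocess', CountVectorizer()), ('classifier', LogisticRegression())])

# Shared preprocessing sweep: 2 vectorizers x 2 stopword options x 3 max_df values
preprocess_grid = {
    'preprocess': [CountVectorizer(), TfidfVectorizer()],
    'preprocess__stop_words': [None, 'english'],
    'preprocess__max_df': [0.25, 0.4, 1.0],
    'preprocess__min_df': [1],
    'preprocess__ngram_range': [(1, 2)],
}

param_grid = [
    # 2 x 2 x 3 x 9 = 108 logistic regression combinations
    {**preprocess_grid,
     'classifier': [LogisticRegression(max_iter=1000)],
     'classifier__C': [0.5, 0.75, 1, 1.25, 1.5, 2, 3, 4, 5]},
    # 2 x 2 x 3 x 7 = 84 linear SVC combinations
    {**preprocess_grid,
     'classifier': [LinearSVC()],
     'classifier__C': [0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75]},
    # 2 x 2 x 3 x 6 = 72 multinomial naive Bayes combinations
    {**preprocess_grid,
     'classifier': [MultinomialNB()],
     'classifier__alpha': [0.00005, 0.0001, 0.0005, 0.001, 0.0015, 0.0025]},
]

grid = GridSearchCV(pipe, param_grid, scoring='accuracy', cv=5)
# grid.fit(X_train, y_train)  # training texts and labels, as in the earlier sketch
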
The number of simulations in the complete grid search is calculated as follows:

Logistic regression simulations --> 108
Linear SVC simulations --> 84
Multinomial NB simulations --> 72
TOTAL simulations --> 264

To give a reference for simulation time, the small sweeps used to locate the potential area of high scores for each parameter took around 5-10 minutes each, and the complete grid search took around 5-6 hours per model. Therefore, skipping the preliminary step of anticipating the most optimal range for each parameter and instead running a single very long search might take several weeks of computational time for a similar granularity. The best cross validation test score for each analyzed model is depicted below.

LOGISTIC REGRESSION OPTIMAL MODEL

CROSS VALIDATION TEST SCORE: 0.9359
Preprocess: CountVectorizer
Stopwords: None
C parameter: 2
Max_df: 0.25
Min_df: 1
Ngram_range: (1,2)

LINEARSVC OPTIMAL MODEL

CROSS VALIDATION TEST SCORE: 0.9388
Preprocess: CountVectorizer
Stopwords: None
C parameter: 0.05
Max_df: 0.25
Min_df: 1
Ngram_range: (1,2)

MULTINOMIALNB OPTIMAL MODEL

CROSS VALIDATION TEST SCORE: 0.9353
Preprocess: TfidfVectorizer
Stopwords: None
Alpha parameter: 0.0025
Max_df: 0.25
Min_df: 1
Ngram_range: (1,2)

Looking at the cross validation results on the training set, the most optimal model corresponds to the Linear SVC algorithm with CountVectorizer preprocessing, stopwords=None, C=0.05, max_df=0.25, min_df=1 and ngram_range=(1, 2).

However, before blessing the model, it needs to be challenged against new data. Therefore, let's use the testing set split off in the initial part of the project to check how the three optimal models classify new data.
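As a minimal sketch of this check for the Linear SVC case, the optimal configuration can be refit on the full training set and scored on the held-out testing set; X_train, y_train, X_test and y_test are the assumed names of the splits from part 1.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Optimal LinearSVC configuration found by the grid search
best_svc = Pipeline([
    ('preprocess', CountVectorizer(stop_words=None, max_df=0.25, min_df=1, ngram_range=(1, 2))),
    ('classifier', LinearSVC(C=0.05)),
])
best_svc.fit(X_train, y_train)
print('Testing set score:', best_svc.score(X_test, y_test))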

LOGISTIC REGRESSION CROSS VALIDATION TEST SCORE: 0.9359
LOGISTIC REGRESSION TESTING SET SCORE: 0.9348

LINEARSVC CROSS VALIDATION TEST SCORE: 0.9388
LINEARSVC TESTING SET SCORE: 0.9377

MULTINOMIAL CROSS VALIDATION TEST SCORE: 0.9353
MULTINOMIAL TESTING SET SCORE: 0.9336

The testing set score is slightly lower than the cross validation test score, but not low enough to consider the model overly complex with a high degree of overfitting. In both the cross validation test score and the testing set score, the highest values correspond to Linear SVC, so it would be the proposed option to model the dataset under analysis.
Before closing the project, two more analyses were performed.

Firstly, the most optimal model was run to assess the coefficient of each feature, plotting the most and least meaningful ones per subject. For the sake of simplicity, only the three most and least meaningful words per subject are summarized below, followed by a sketch of how the coefficients can be extracted. On the one hand, the most meaningful words are very descriptive of their subject. Most of them are unigrams, but there are also a couple of bigrams. In chemistry, both orbitals and orbital are in the list because lemmatization is not used in the tokenization process (otherwise they would be combined). On the other hand, the least meaningful words are also very descriptive, but of a different subject. For example, words such as "solve" and "simplify", which are very meaningful for mathematics, are the least meaningful words for the rest of the subjects.

BIOLOGY most meaningful words: ecosystem, tissue, reproduction
BIOLOGY least meaningful words: solve, simplify, find

CHEMISTRY most meaningful words: orbitals, adsorption, orbital
CHEMISTRY least meaningful words: solve, simplify, sublimation

MATHEMATICS most meaningful words: solve, simplify, evaluate
MATHEMATICS least meaningful words: reaction, significant, differentiate between

PHYSICS most meaningful words: sublimation, friction, circuit
PHYSICS least meaningful words: its latent, unit vector, solve
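The sketch below shows how these per-subject coefficients can be extracted, assuming best_svc is the fitted optimal pipeline from the earlier sketch and a recent scikit-learn version (with get_feature_names_out) is used.

import numpy as np

vectorizer = best_svc.named_steps['preprocess']
classifier = best_svc.named_steps['classifier']
feature_names = np.array(vectorizer.get_feature_names_out())

# LinearSVC exposes one coefficient row per subject in the multiclass case
for subject, coefs in zip(classifier.classes_, classifier.coef_):
    order = np.argsort(coefs)
    print(subject)
    print('  most meaningful: ', ', '.join(feature_names[order[-3:][::-1]]))
    print('  least meaningful:', ', '.join(feature_names[order[:3]]))
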
Secondly, a DECISION TREE algorithm was run under different parametrizations to check its accuracy for the current exercise. The results are as follows:
The effects of stopwords=English and max_df=0.25 are beneficial for the score. Instead, the choice between TFIDF and Count representation does not have a meaningful impact. However, the most sensitive parameter is max_depth. The model scores poorly for low values of max_depth, since the number of features is high and such a shallow tree is too simple a model, but the score increases with max_depth. However, that score increase comes at the expense of higher memory and computational time; for example, moving from max_depth=5 to max_depth=100 increases the computational time roughly tenfold. In any case, even with max_depth=500 the decision tree scores underperform the three previously analyzed algorithms, and there does not seem to be much room for improvement. That is expected, since decision trees perform most efficiently when the dataset has a mix of different features and value types.
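For reference, a minimal sketch of such a decision tree check is shown below; the exact parametrization swept in the project is not reproduced here, so the max_depth values and cross validation setting are assumptions for illustration.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

tree_pipe = Pipeline([
    ('preprocess', CountVectorizer(stop_words='english', max_df=0.25)),
    ('classifier', DecisionTreeClassifier(random_state=0)),
])

# Deeper trees score better here but become increasingly slow to fit
param_grid = {'classifier__max_depth': [5, 25, 100, 500]}
tree_grid = GridSearchCV(tree_pipe, param_grid, scoring='accuracy', cv=5)
# tree_grid.fit(X_train, y_train)  # training texts and labels, as in the earlier sketches
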
I appreciate your attention and I hope you find this work interesting.

Luis Caballero