
Tweet Classification NLP using Linear Models

This project addresses a supervised binary text classification task using a real published dataset from the ProfNER shared task of 2021. Specifically, it uses the data corresponding to task 1, which consists of Spanish tweets labeled as 1 if the tweet mentions a professional occupation, and 0 otherwise. In case of interest, the explanation of the corpus creation is detailed here.

As reference, the work is done in Python with the NLTK, SKLEARN and GENSIM frameworks. The code and dataset can be found below.


The procedure applied in this project consists of the following steps.

STEP 1: DATA ANALYSIS
STEP 2: DATA PREPROCESSING
STEP 3: EMOJIS
STEP 4: VECTORIZATION
STEP 5: MODELING
STEP 6: CONCLUSIONS
STEP 1: DATA ANALYSIS
The first step of the data analysis is to build a dataframe. The exercise does not directly provide postprocessed metadata; instead, each tweet comes in an independent txt file, split into one folder for training and another for testing. The code in GitHub provides a function to extract all tweets and generate a single dataframe with a feature "set" indicating whether each tweet belongs to training or testing. As reference, the next picture depicts the head of the generated dataframe.
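Since that extraction code is embedded as an image, the sketch below only illustrates the idea under assumptions: one txt file per tweet and hypothetical folder names; the class labels from the shared task would be merged afterwards and are not shown here.

```python
import os
import pandas as pd

def build_dataframe(folder, set_name):
    """Read every per-tweet txt file in a folder, one row per tweet."""
    rows = []
    for filename in sorted(os.listdir(folder)):
        if filename.endswith(".txt"):
            with open(os.path.join(folder, filename), encoding="utf-8") as f:
                rows.append({"tweet_id": filename.replace(".txt", ""),
                             "text": f.read().strip(),
                             "set": set_name})
    return pd.DataFrame(rows)

# Hypothetical folder names for the training and testing tweets
df = pd.concat([build_dataframe("train", "train"),
                build_dataframe("test", "test")],
               ignore_index=True)
```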
The dataframe has 8000 tweets, 6000 for training and 2000 for testing, and there are no null or duplicate tweets.
The dataset is unbalanced, with 6130 tweets of class 0 and 1870 of class 1. This might jeopardize the learning capability of the model, but at least it is verified that both training and testing sets were generated following a stratified approach.
The tweet length is calculated with the next code in order to analyze whether it is significant for the class distribution.
The function below creates a per-class histogram of the tweet length, and the results are depicted afterwards. The tweet length appears to have a significant correlation with the target class: longer tweets are more likely to belong to class 1. Note it is important to focus on the normalized histogram, which is especially relevant in unbalanced exercises.
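The actual code is shown as an image, so the snippet below is only a sketch of how the length feature and the normalized per-class histogram could be produced, assuming dataframe columns named text and label.

```python
import matplotlib.pyplot as plt

# New feature: number of characters per tweet
df["tweet_length"] = df["text"].str.len()

def class_histogram(data, feature, bins=40):
    """Per-class histogram of a feature, normalized so each class integrates to 1."""
    plt.figure(figsize=(8, 4))
    for label in sorted(data["label"].unique()):
        plt.hist(data.loc[data["label"] == label, feature],
                 bins=bins, density=True, alpha=0.5, label=f"class {label}")
    plt.xlabel(feature)
    plt.ylabel("normalized frequency")
    plt.legend()
    plt.show()

class_histogram(df, "tweet_length")
```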
As a curiosity, some tweets have been detected that do not generate any valid token; some examples are depicted below. For example, the tweet "ESTO, ESTO, ESTO" does not provide any added value since "ESTO" will most likely be removed as a stop word.
STEP 2: DATA PREPROCESSING
Once the data has been analyzed, the next step is to tokenize each tweet and transform its text into a chain of tokens. This is a very important task, so a modular function has been created for that purpose and is depicted below. The function performs the following steps:

1. To avoid potential confusion, all written accents are removed, as well as the tilde on Ñ.
2. Remove websites.
3. Split hashtag into different words using space as separator.
4. Transform all text to lowercase.
5. Collapse consecutive repeated characters when 4 or more occurrences are detected in a row, leaving only the first occurrence.
6. Tokenization using the NLTK framework.
7. Remove additional spaces in the start and end of each token.
8. Remove all punctuation marks in the tokens.
9. Remove tokens consisting exclusively of digits.
10. Remove stop words.
11. Stemming each token to remove the suffix and avoid having identical multiple tokens. The algorithm to stem is Snowball Stemmer using the implementation in NLTK framework.
12. Remove tokens whose number of characters is lower than mintoken.

Note that not all steps are applied by default; each one requires its corresponding flag to be set. The only steps performed automatically are steps 1 and 12, but step 12 uses a default mintoken value of 1, so it does not really make changes unless explicitly requested.
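The original modular function is shown as an image; the sketch below reproduces the same sequence of steps under assumptions (the hashtag split is interpreted as a simple camel-case split, and the flag names are illustrative).

```python
import re
import string
import unicodedata
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

def strip_accents(text):
    """Step 1: remove written accents and the tilde on Ñ, keeping every other character."""
    return "".join(c for c in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(c))

stemmer = SnowballStemmer("spanish")
spanish_stopwords = {strip_accents(w) for w in stopwords.words("spanish")}

def preprocess(text, remove_urls=False, split_hashtags=False, lowercase=False,
               collapse_repeats=False, remove_punct=False, remove_digits=False,
               remove_stopwords=False, stem=False, mintoken=1):
    text = strip_accents(text)                                   # step 1 (always applied)
    if remove_urls:                                              # step 2
        text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    if split_hashtags:                                           # step 3: "#QuedateEnCasa" -> "Quedate En Casa"
        text = re.sub(r"#(\w+)",
                      lambda m: re.sub(r"(?<!^)(?=[A-Z])", " ", m.group(1)), text)
    if lowercase:                                                # step 4
        text = text.lower()
    if collapse_repeats:                                         # step 5: 4+ repeats -> 1 occurrence
        text = re.sub(r"(.)\1{3,}", r"\1", text)
    tokens = word_tokenize(text, language="spanish")             # step 6
    tokens = [t.strip() for t in tokens]                         # step 7
    if remove_punct:                                             # step 8
        tokens = [t.translate(str.maketrans("", "", string.punctuation)) for t in tokens]
    if remove_digits:                                            # step 9
        tokens = [t for t in tokens if not t.isdigit()]
    if remove_stopwords:                                         # step 10
        tokens = [t for t in tokens if t not in spanish_stopwords]
    if stem:                                                     # step 11
        tokens = [stemmer.stem(t) for t in tokens]
    return [t for t in tokens if len(t) >= mintoken]             # step 12 (mintoken=1 by default)
```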
The code below runs the preprocessing and tokenization and then creates a new feature with the number of tokens in each tweet. With that information, a new class histogram is generated, reinforcing the previous conclusion that longer tweets are more likely to belong to class 1.
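A possible version of that step, reusing the preprocess function and class_histogram helper sketched above:

```python
# Tokenize every tweet with all preprocessing flags enabled (stemming included)
df["tokens"] = df["text"].apply(lambda t: preprocess(
    t, remove_urls=True, split_hashtags=True, lowercase=True, collapse_repeats=True,
    remove_punct=True, remove_digits=True, remove_stopwords=True, stem=True))

# New feature: number of tokens per tweet, then the normalized class histogram
df["n_tokens"] = df["tokens"].apply(len)
class_histogram(df, "n_tokens")
```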
Another token assessment is to create a word cloud with the wordcloud framework using the code below. It generates a cloud of words in which the most common tokens among all tweets are assigned a larger size. Note the words in the cloud are stemmed, so they slightly differ from their normal form. However, given that the most common words are "coronavirus", "virus", "gobierno", "mascarilla", "pandemia", "contagio", "confinamiento"..., it is possible to identify that the tweets are focused on coronavirus.
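A possible version of the word cloud code, assuming the token lists computed above:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Join every token of every tweet into a single string for the cloud
all_tokens = " ".join(token for tokens in df["tokens"] for token in tokens)

cloud = WordCloud(width=800, height=400, background_color="white",
                  collocations=False).generate(all_tokens)
plt.figure(figsize=(10, 5))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```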
STEP 3: EMOJIS
Nowadays emojis are a very important way to share information, so it is key to consider them when assessing text, especially social network text such as tweets. For that purpose, an emoji sentiment dictionary is downloaded from here. The dictionary consists of different emojis with a positive, negative and neutral rating, as depicted below after some postprocessing.
Note that emojis are more focused on expressing sentiment, and the purpose of this exercise is to identify mentions of professional occupations, so they might not be related to the emoji content. However, emojis are considered in the procedure as a generic approach that can serve as a reference in any text mining exercise.
The procedure to manage emojis is as follows:

1. Extract them for each tweet by using the emoji_extractor framework.
2. The emojis were initially considered as tokens, but they are removed from the token list since they receive an independent processing.
3. Calculate the accumulated positive, negative and neutral level of all emojis in each tweet and create three new features with that information.

The main code is depicted below, but the detailed code is available in the corresponding GitHub repository.
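The repository relies on the emoji_extractor framework; since its exact calls are not reproduced here, the sketch below swaps in the emoji package's emoji_list helper and assumes the sentiment dictionary has been postprocessed into a CSV indexed by the emoji character with hypothetical columns positive, negative and neutral.

```python
import emoji
import pandas as pd

# Hypothetical postprocessed emoji sentiment dictionary
emoji_sentiment = pd.read_csv("emoji_sentiment.csv", index_col="emoji")

def emoji_features(text):
    """Extract the emojis of a tweet and accumulate their sentiment ratings."""
    found = [item["emoji"] for item in emoji.emoji_list(text)]
    pos = neg = neu = 0.0
    for e in found:
        if e in emoji_sentiment.index:
            pos += emoji_sentiment.loc[e, "positive"]
            neg += emoji_sentiment.loc[e, "negative"]
            neu += emoji_sentiment.loc[e, "neutral"]
    return pd.Series({"emoji_pos": pos, "emoji_neg": neg, "emoji_neu": neu})

# Three new features with the accumulated sentiment per tweet
df[["emoji_pos", "emoji_neg", "emoji_neu"]] = df["text"].apply(emoji_features)
```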
Note there are other techniques to manage emojis, such as de-emojizing, which means translating the emoji icon into words describing its content and then processing that text as additional text in the document. As reference, the unicode name entry from the above dictionary might be used as the text describing the emoji content.
STEP 4: VECTORIZATION
Once the text has been efficiently tokenized and the emojis have been managed, the next step is to transform text into numbers through the vectorization process, since models require numbers as input. In this publication three vectorization methods are considered:

1. BAG OF WORDS
2. TF IDF
3. EMBEDDING

To avoid excessive computational cost, a 100-position embedding is used. The selected embedding is downloaded from the GENSIM framework and is called glove-twitter-100. The vocabulary size will also be 100 for bag of words and TF IDF to fairly compare all three methods.

BAG OF WORDS and TF IDF
The bag of words and TF IDF representations are generated using the SKLEARN framework with the code below.
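A minimal sketch of that vectorization, joining the token lists back into strings and limiting the vocabulary to the 100 most frequent terms:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

train_texts = df.loc[df["set"] == "train", "tokens"].apply(" ".join)
test_texts = df.loc[df["set"] == "test", "tokens"].apply(" ".join)

# Vocabulary of 100 positions to match the embedding dimension
bow_vectorizer = CountVectorizer(max_features=100)
X_train_bow = bow_vectorizer.fit_transform(train_texts)
X_test_bow = bow_vectorizer.transform(test_texts)

tfidf_vectorizer = TfidfVectorizer(max_features=100)
X_train_tfidf = tfidf_vectorizer.fit_transform(train_texts)
X_test_tfidf = tfidf_vectorizer.transform(test_texts)
```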
Once the bag of words is available, the function below has been created to plot the most common words for both class 0 and class 1.
The code to create the plot using the defined function is captured next.
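A possible implementation of such a function, assuming the bag of words matrix and vectorizer from the previous sketch:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_most_common(X, labels, vectorizer, top=20):
    """Bar plot of the most frequent vocabulary words for each class."""
    words = np.array(vectorizer.get_feature_names_out())
    labels = np.asarray(labels)
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    for ax, cls in zip(axes, [0, 1]):
        counts = np.asarray(X[labels == cls].sum(axis=0)).ravel()
        order = np.argsort(counts)[::-1][:top]
        ax.bar(words[order], counts[order])
        ax.set_title(f"Most common words - class {cls}")
        ax.tick_params(axis="x", rotation=90)
    plt.tight_layout()
    plt.show()

plot_most_common(X_train_bow, df.loc[df["set"] == "train", "label"], bow_vectorizer)
```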
And finally the barplot is depicted below. Most common words are related to coronavirus, but it is possible to identify a certain trend in the most common words for class 1, such as "sanitario", "medico", "policia", "presidente" or "ministro", which refer to professional occupations.
EMBEDDING
As mentioned, the selected embedding is downloaded from the GENSIM framework and is called glove-twitter-100, loaded with the code below.
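A sketch of the loading step with the GENSIM downloader API:

```python
import gensim.downloader as api

# 100-dimension GloVe vectors trained on Twitter data
embedding = api.load("glove-twitter-100")
print(embedding.vector_size)  # 100
```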
The tweets are processed again, this time setting the stem flag to False. The reason not to stem is that the tokens will be converted into vectors using the embedding, so it is important for the tokens to appear in the embedding vocabulary. Any token not included in the embedding vocabulary is ignored. Note that stemming removes the suffix and leaves the token in a non-standard form, so the risk of it not being in the embedding vocabulary is higher.

After that, each token in the tweets is transformed using the embedding. The outcome is a 100-position vector per token; these vectors are then averaged position by position across all tokens of the same tweet.
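A sketch of that averaging step, reusing the preprocess function sketched earlier and the loaded embedding:

```python
import numpy as np

def tweet_to_vector(tokens, embedding):
    """Average the embedding vectors of the tokens present in the vocabulary."""
    vectors = [embedding[token] for token in tokens if token in embedding]
    if not vectors:
        return np.zeros(embedding.vector_size)  # tweets with no known token
    return np.mean(vectors, axis=0)

# Tokens obtained without stemming so that they stay in the embedding vocabulary
df["tokens_nostem"] = df["text"].apply(lambda t: preprocess(
    t, remove_urls=True, split_hashtags=True, lowercase=True, collapse_repeats=True,
    remove_punct=True, remove_digits=True, remove_stopwords=True, stem=False))
X_embed = np.vstack(df["tokens_nostem"]
                    .apply(lambda t: tweet_to_vector(t, embedding))
                    .to_list())
```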
The last step prior to modeling is to include the new features, such as tweet length, number of tokens per tweet and emoji sentiment, in the vectorization matrix of each method, as depicted in the code below.
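A sketch of that concatenation, mixing names from the previous sketches (the engineered feature columns are assumptions); scipy handles the sparse matrices and numpy the dense embedding:

```python
import numpy as np
from scipy.sparse import hstack

extra = df[["tweet_length", "n_tokens", "emoji_pos", "emoji_neg", "emoji_neu"]].to_numpy()
train_mask = (df["set"] == "train").to_numpy()

# Sparse bag of words / TF IDF matrices get the extra columns appended with scipy
X_train_bow_full = hstack([X_train_bow, extra[train_mask]])
X_test_bow_full = hstack([X_test_bow, extra[~train_mask]])

# Dense embedding matrix gets them appended with numpy
X_embed_full = np.hstack([X_embed, extra])
```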
STEP 5: MODELING
The first modeling analysis is to compare the three vectorization techniques assessed in the project (bag of words, TF IDF and embedding). To do that, a fixed estimator is used for the three techniques and the results are compared. The estimator is a standardized Logistic Regression, which normally leads to good results in text mining applications. The code for model evaluation is as follows.
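A sketch of that evaluation for the embedding matrix, assuming the variables from the previous sketches and the default scoring (the exact metric used in the original code is not shown); for the sparse bag of words and TF IDF matrices, StandardScaler(with_mean=False) would be needed instead.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_train, X_test = X_embed_full[train_mask], X_embed_full[~train_mask]
y_train = df.loc[df["set"] == "train", "label"].to_numpy()
y_test = df.loc[df["set"] == "test", "label"].to_numpy()

# Standardized Logistic Regression as the common estimator for the comparison
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv_score = cross_val_score(model, X_train, y_train, cv=5).mean()
test_score = model.fit(X_train, y_train).score(X_test, y_test)
print(f"CROSS VALIDATION SCORE {cv_score:.4f}  TEST SCORE {test_score:.4f}")
```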
The results are as follows:

BAG OF WORDS
CROSS VALIDATION SCORE 0.7925
TEST SCORE 0.7805

TFIDF
CROSS VALIDATION SCORE 0.7929
TEST SCORE 0.7805

EMBEDDING
CROSS VALIDATION SCORE 0.8215
TEST SCORE 0.8135

Therefore, for the same number of vector positions, the best technique is embedding, leading to a clearly higher score. Note there is no big difference between the performance of bag of words and TF IDF.

Once the best technique is identified, a wide grid search, depicted below, with multiple models and sets of hyperparameters is performed to find the best estimator for the current exercise.
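The full grid is shown only as an image; a reduced sketch of the idea, swapping the final estimator inside a pipeline, could look as follows (the actual grid in the repository covers more models and hyperparameter values):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])

param_grid = [
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 0.5, 1, 5, 10]},
    {"clf": [LinearSVC(dual=False)], "clf__C": [0.1, 0.5, 1, 5, 10],
     "clf__penalty": ["l1", "l2"]},
]

grid = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
print("TEST SCORE", grid.score(X_test, y_test))
```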
The best results are slightly better than the previous Logistic Regression modeling, concluding that STANDARD SCALER + LINEAR SVC (C=0.5, dual=False, penalty='l1') is an optimal estimator for the current exercise, reaching the scores below.

CROSS VALIDATION SCORE 0.823
TEST SCORE 0.817
However, the modeling has focused on vectors of 100 positions. Extending the number of vector positions is in general a good practice to increase model performance, but at the same time it increases the computational cost. Therefore, a very good option would be to repeat the simulation with a larger embedding.

As reference, a simulation with a 2000-position bag of words vocabulary is performed, showing an increase in performance.

BAG OF WORDS 2000 FEATURES
CROSS VALIDATION SCORE 0.8352
TEST SCORE 0.8465
With the 2000-position vocabulary, the most and least significant feature coefficients can be calculated with the function below.
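A possible version of that function, assuming the bag of words vectorizer was refit with max_features=2000, the extra engineered features were appended as before, and the winning pipeline from the grid search was retrained on that matrix:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_coefficients(linear_model, feature_names, top=20):
    """Bar plot of the least and most significant coefficients (by absolute value)."""
    coefs = linear_model.coef_.ravel()
    order = np.argsort(np.abs(coefs))
    idx = np.concatenate([order[:top], order[-top:]])  # least and most significant
    colors = ["tab:red" if c < 0 else "tab:blue" for c in coefs[idx]]
    plt.figure(figsize=(14, 4))
    plt.bar(np.array(feature_names)[idx], coefs[idx], color=colors)
    plt.xticks(rotation=90)
    plt.ylabel("coefficient")
    plt.tight_layout()
    plt.show()

# Vocabulary words plus the extra engineered features, in the same column order
feature_names = list(bow_vectorizer.get_feature_names_out()) + \
                ["tweet_length", "n_tokens", "emoji_pos", "emoji_neg", "emoji_neu"]
plot_coefficients(grid.best_estimator_.named_steps["clf"], feature_names)
```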
The bar plot from the above function is depicted below. The results make sense. The most significant words, which lead to class 1, are mainly professional occupations (words such as "presidente", "ministro", "policia", "sanitario", "medico", "alcalde", "director", "diputado", "rey", "portavoz"...), meaning the model has been able to capture the concept of "professional occupation". It is worth highlighting that the feature tweet_length has also been identified as highly significant, as anticipated earlier. In contrast, the least significant words are generic words without a clear class association.
As a last point, a simulation with a 4000-position bag of words vocabulary is performed, showing a decrease in performance. It demonstrates that increasing the vocabulary is generally a good practice, but only up to a certain limit. There is a point at which the benefit is so reduced (or even negligible) that the extra computational cost is not worth it. In this particular case, increasing from 2000 to 4000 positions led to a decrease in performance, while the computational cost abruptly increased.

BAG OF WORDS 4000 FEATURES
CROSS VALIDATION SCORE 0.8285
TEST SCORE 0.836
STEP 6: CONCLUSIONS
- Three vectorization approaches have been analyzed, with EMBEDDING leading to the highest scores.
- The other two methods (bag of words and TF IDF) led to very similar scores, but clearly lower than embedding.
- The number of positions in the vocabulary plays a critical role in the model performance and computational cost. 
- It is generally a good practice to increase the vocabulary, but at the expense of a higher computational cost.
- The performance benefit from increasing the vocabulary becomes progressively smaller, until it eventually turns negative.
- The computational cost of increasing the vocabulary grows progressively higher.
- For the current exercise, a short embedding of 100 positions has been used only for reference purposes. However, the proposal, as a trade-off between performance and execution time, would be to use a longer embedding of 200 or 300 positions. It is expected to increase the performance compared to the 100-position embedding score, and even beyond the bag of words case with a large vocabulary.
I appreciate your attention and I hope you find this work interesting.

Luis Caballero