
Pretrained Language Model Text Classification

This project focuses on a supervised binary text classification task using a real published dataset from the ProfNER shared task of 2021. Specifically, it uses the data for task number 1, which consists of Spanish tweets labeled 1 if the tweet mentions a professional occupation and 0 otherwise. In case of interest, the corpus creation is explained in detail here.

This exercise was solved in a previous publication using the Python, NLTK, SKLEARN and GENSIM frameworks with approaches such as Bag of Words, TF-IDF, embeddings and shallow learning models. For further information on that publication, please refer here. The present publication instead solves the exercise using a pretrained language model. The code and dataset can be found below.

LANGUAGE MODEL INTRODUCTION
A language model is a probabilistic model of natural language that assigns probabilities to sequences of words, based on the text corpora in one or multiple languages it was trained on. Large language models are built on the transformer architecture, which combines attention and feedforward layers, and they have nowadays superseded recurrent neural network-based models.

Language models are useful for a variety of tasks, such as speech recognition, machine translation, natural language generation, handwriting recognition, grammar induction and information retrieval.

As mentioned, modern language models are based on the transformer architecture. GPT, created by OpenAI in 2018, was the first pretrained language model with a transformer architecture. Since then, multiple models such as BERT, GPT-2, XLM, DistilBERT, T5... have been published, trained on ever larger corpora and delivering further performance gains. As a reference, the picture below depicts a timeline with the main language models.
Language models are published by organizations with high computational capacity and resources, since the training process is very expensive. However, once the model has been trained on a huge amount of text, the neural network and transformer weights are parametrized to offer high performance. Therefore, downloading a pretrained language model and performing a fine-tuning process is a great approach for many applications to reach high-performing models without needing to repeat the expensive training process of a language model from scratch. This fine-tuning can be done with the TRANSFORMERS framework created by Hugging Face (https://huggingface.co/docs/transformers/index), which allows selecting a language model from a public hub, preprocessing the application data and performing the fine-tuning process to reach state-of-the-art performance.

This project specifically uses a RoBERTa language model, created in 2019 as an improvement over the BERT language model. RoBERTa architecture information can be found here. The Hugging Face page for this language model is as follows:


The roberta-base-bne is a transformer-based language model for Spanish. It is based on the RoBERTa base model and has been pretrained using the largest Spanish corpus known to date, with a total of 570GB of processed text compiled from the web crawls performed by the National Library of Spain (Biblioteca Nacional de España) from 2009 to 2019. The training task for this language model was fill-mask, in which a word is masked in each sentence and the model must predict the masked word. Therefore, this model is ready to use for masked language modeling. However, for other tasks such as the one in the current publication (text sequence classification), it can be fine-tuned to reach state-of-the-art results.
CODE EXPLANATION
First of all, the TRANSFORMERS framework must be imported, but only the required classes AutoConfig, AutoTokenizer and TFAutoModelForSequenceClassification are imported to avoid unnecessary overhead. Note that in my setup the TRANSFORMERS framework runs on top of the TENSORFLOW framework, so it also needs to be imported.
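A minimal import sketch consistent with this setup (the exact import list in the original notebook may differ slightly):

```python
# Only the required classes are imported from TRANSFORMERS, plus TENSORFLOW,
# since the TF variants of the models are used in this setup.
import tensorflow as tf
from transformers import AutoConfig, AutoTokenizer, TFAutoModelForSequenceClassification
```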
The next step is to generate the exercise dataframe. The data is downloaded from the exercise website and then some data engineering is applied to reach the final dataframe depicted below. If needed, the full code is uploaded on github.
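As a hedged sketch, and assuming the ProfNER task 1 data has already been merged into a single CSV with text and label columns (the file name here is hypothetical), the dataframe could be loaded as follows:

```python
import pandas as pd

# Hypothetical, already-prepared file: one tweet per row with its binary label.
df = pd.read_csv("profner_task1.csv")   # assumed columns: "text", "label"
print(df.shape)
print(df["label"].value_counts())       # quick check of the class balance
```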
The settings for the simulation are defined as follows (a sketch with the corresponding constants follows the list):

- Max_seq_length: the maximum number of tokens in each input sequence fed to the pretrained model. It is set to 48 since it was detected that most tweets have fewer than 30 tokens. In any case, sequences exceeding that length will be truncated.
- Train_batch_size: the batch size for the training set. Based on current literature, a significant degradation in the ability of a model to generalize has been observed when using larger batch sizes. Gradient descent can be viewed as making a local linear approximation to the loss function, so it is more efficient to move with small steps (meaning a small batch size). The value is set to 32, which tends to be a proper trade-off between performance and computational cost.
- Val_batch_size: the batch size for the validation set, also set to 32.
- Test_batch_size: the batch size for the testing set, also set to 32.
- Learning_rate: the learning rate during network training. Too low a value might lead to an expensive computational cost, while too high a value might lead to instabilities. A proper trade-off tends to be around 2e-5.
- Num_epochs: the number of passes over the complete training set during the simulation. Too low a value might lead to underfitting, while too high a value might lead to overfitting and a high computational cost. As the model is already pretrained, it is set to 3, just enough to perform the fine-tuning process.
- Model_name: the language model reference on the Hugging Face public hub; as mentioned, Spanish RoBERTa is used for the current project.
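A sketch of these settings as constants; the variable names are illustrative and the hub id for the Spanish RoBERTa model is assumed to be PlanTL-GOB-ES/roberta-base-bne:

```python
# Simulation settings described above (illustrative names and assumed hub id).
MAX_SEQ_LENGTH   = 48      # most tweets have fewer than 30 tokens; longer ones are truncated
TRAIN_BATCH_SIZE = 32
VAL_BATCH_SIZE   = 32
TEST_BATCH_SIZE  = 32
LEARNING_RATE    = 2e-5
NUM_EPOCHS       = 3
MODEL_NAME       = "PlanTL-GOB-ES/roberta-base-bne"
```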
Using the exercise dataframe, a list of text sequence samples and a list with the corresponding labels are created, and then the whole dataframe is split into training, validation and testing sets. Note the testing set is kept to the 2000 samples at index positions 6000-8000, as originally intended in the exercise. That enables comparing the results with the previous analysis done with shallow learning models.
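A possible split, keeping index positions 6000-8000 as the testing set; the 80/20 train/validation proportion and the random seed are assumptions, not necessarily the values used originally:

```python
from sklearn.model_selection import train_test_split

texts = df["text"].tolist()
labels = df["label"].tolist()

# Testing set fixed to the 2000 samples at index positions 6000-8000.
test_texts, test_labels = texts[6000:8000], labels[6000:8000]
rest_texts = texts[:6000] + texts[8000:]
rest_labels = labels[:6000] + labels[8000:]

# Remaining samples split into training and validation (80/20 assumed).
train_texts, val_texts, train_labels, val_labels = train_test_split(
    rest_texts, rest_labels, test_size=0.2, random_state=0, stratify=rest_labels)
```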
Text sequences cannot be tokenized arbitrarily; the tokenizer corresponding to the language model must be applied to ensure the input is in the proper format to be understood by that model. That is done with the AutoTokenizer class from TRANSFORMERS and its from_pretrained method. Finally, the training, validation and testing datasets are assembled by pairing each text sequence with its label and then transformed into batches to be simulated.
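A sketch of the tokenization and batching step, assuming the constants and splits defined above; padding every tweet to the maximum sequence length is one possible choice:

```python
# The tokenizer must match the pretrained checkpoint so the vocabulary and
# special tokens are consistent with what the model expects.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def make_dataset(texts, labels, batch_size):
    # Truncate/pad every tweet to MAX_SEQ_LENGTH tokens and batch the result.
    enc = tokenizer(texts, truncation=True, padding="max_length",
                    max_length=MAX_SEQ_LENGTH, return_tensors="np")
    return tf.data.Dataset.from_tensor_slices((dict(enc), labels)).batch(batch_size)

train_ds = make_dataset(train_texts, train_labels, TRAIN_BATCH_SIZE)
val_ds   = make_dataset(val_texts,   val_labels,   VAL_BATCH_SIZE)
test_ds  = make_dataset(test_texts,  test_labels,  TEST_BATCH_SIZE)
```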
As mentioned, the RoBERTa language model is originally trained for the fill-mask task, so it needs to be configured for the sequence classification task of the present exercise. Therefore, the number of target labels must be passed to the AutoConfig class, and then the new model is generated using the class for the current application (in this case TFAutoModelForSequenceClassification). This step will download the language model from the Hugging Face public hub if needed, as depicted in the picture.
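A sketch of this model creation step with a two-label classification head:

```python
# Configure a sequence classification head with two target labels (0 / 1)
# on top of the pretrained Spanish RoBERTa body.
config = AutoConfig.from_pretrained(MODEL_NAME, num_labels=2)
model = TFAutoModelForSequenceClassification.from_pretrained(MODEL_NAME, config=config)
# Note: from_pt=True may be needed if the hub only hosts PyTorch weights.
```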
The deep learning model requires defining an optimizer, a loss and a metric (see the sketch after this list).
- As for the optimizer, Adam is selected since it generally leads to good performance.
- As for the loss, SparseCategoricalCrossentropy is selected since the preprocessing has converted the target labels to integers.
- As for the metric, accuracy via SparseCategoricalAccuracy is selected.
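A sketch of the compilation step with these three choices; from_logits=True is assumed because the classification head outputs raw logits:

```python
# Adam optimizer, sparse categorical cross-entropy on integer labels,
# and accuracy as the monitored metric.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy("accuracy")],
)
```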
The simulation can now be triggered to perform the fine-tuning process on the training set and then check the language model performance on the validation set.
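With the datasets and settings above, the fine-tuning run reduces to a standard Keras fit call:

```python
# Fine-tune on the training batches and track accuracy on the validation set
# at the end of each epoch.
history = model.fit(train_ds, validation_data=val_ds, epochs=NUM_EPOCHS)
```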
Validation accuracy in the third epoch reaches 0.9476, which is much higher than the values in the 0.8-0.85 range reached in the previous publication using approaches such as Bag of Words, TF-IDF or small pretrained embeddings followed by shallow learning models. The computational cost has increased significantly and the model interpretability has been lost, but the performance gain is large enough to compensate for these two downsides.

Note that a parameter sweep might be performed to try to optimize the fine-tuning process and reach potentially higher values.

The last step to confirm the new fine-tuned transformer-based language model is to predict the labels of the testing set and verify that the performance is similar to that on the validation set.
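A sketch of this final check, either by evaluating the batched testing set directly or by deriving the predicted labels from the logits:

```python
import numpy as np

# Option 1: direct evaluation on the testing batches.
test_loss, test_acc = model.evaluate(test_ds)

# Option 2: predict logits, take the argmax as the label and compare manually.
logits = model.predict(test_ds)["logits"]
pred_labels = np.argmax(logits, axis=-1)
print("Testing accuracy:", np.mean(pred_labels == np.array(test_labels)))
```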
The testing set accuracy is 0.9425, which is close to the validation set accuracy and much higher than the 0.8-0.85 values reached with shallow learning models. Therefore, it confirms the suitability of the new fine-tuned language model, based on a pretrained RoBERTa architecture, to solve the present text sequence classification exercise while maximizing performance.
I appreciate your attention and I hope you find this work interesting.

Luis Caballero