Welcome to the world of machine learning! With the increasing amount of data available, classifying information accurately is more crucial than ever. Today, we are diving into the process of classifying theses at the National Autonomous University of Mexico (UNAM) using a RoBERTa model fine-tuned for this specific purpose.
What is Unam_tesis_ROBERTA_GOB_finnetuning?
The Unam_tesis_ROBERTA_GOB_finnetuning model builds on RoBERTa, a transformer-based model optimized for natural language tasks. Its base model, roberta-large-bne, was pretrained on a corpus from the National Library of Spain (BNE) and then fine-tuned to classify theses into five distinct academic careers:
- Psicología
- Derecho
- Química Farmacéutico Biológica
- Actuaría
- Economía
Understanding the Training Dataset
The model was trained using a structured dataset containing 1000 documents, each featuring:
- Thesis introduction
- Author’s first name
- Author’s last name
- Thesis title
- Year
- Career
Each of the five possible careers is equally represented with 200 documents each, ensuring a balanced training process.
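The balance described above can be illustrated with a small sketch. The records below are hypothetical stand-ins mirroring the fields listed above, not the actual dataset:

```python
from collections import Counter

# The five careers the model classifies into.
careers = ["Psicología", "Derecho", "Química Farmacéutico Biológica",
           "Actuaría", "Economía"]

# Hypothetical records: in the real dataset each career
# appears 200 times (5 x 200 = 1000 documents).
documents = [{"career": c} for c in careers for _ in range(200)]

counts = Counter(doc["career"] for doc in documents)
print(counts)                 # each career maps to 200
print(sum(counts.values()))   # 1000
```

A balanced class distribution like this means the classifier is not biased toward any single career during training.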
Using the Model: Step-By-Step
To use the unam_tesis_ROBERTA_GOB_finnetuning model, you’ll need to follow these steps:
Step 1: Install the Required Libraries
Make sure you have PyTorch and the Hugging Face Transformers library installed. You can install both with pip:
pip install torch transformers
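Before downloading the model, you can quickly confirm that both libraries are importable. This is a minimal sanity-check sketch:

```python
import importlib.util

# Check that the required packages are importable before
# attempting to download the model.
for pkg in ("torch", "transformers"):
    found = importlib.util.find_spec(pkg) is not None
    status = "installed" if found else "MISSING - run: pip install " + pkg
    print(f"{pkg}: {status}")
```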
Step 2: Download the Model and Tokenizer
Now, you can download the model and tokenizer in your Python environment with the following code:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline

# The tokenizer comes from the base Spanish RoBERTa model;
# the classification head comes from the fine-tuned UNAM thesis model.
tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-large-bne", use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(
    "hackathon-pln-es/unam_tesis_ROBERTA_GOB_finnetuning",
    num_labels=5,
    output_attentions=False,
    output_hidden_states=False,
)
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=True)
Step 3: Classify a Thesis
To classify the content of a thesis, simply use the following code snippet:
classification_result = pipe("El objetivo de esta tesis es elaborar un estudio de las condiciones asociadas al aprendizaje desde casa")
print(classification_result)
This will provide you with scores for each of the five careers based on the input text.
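With return_all_scores=True, the pipeline returns a list of {label, score} dictionaries per input. Here is a minimal sketch of extracting the top career from that output; the scores and the LABEL_i-to-career mapping below are illustrative assumptions, so check model.config.id2label for the actual mapping:

```python
# Illustrative output shape; real scores come from the pipeline call.
classification_result = [[
    {"label": "LABEL_0", "score": 0.82},
    {"label": "LABEL_1", "score": 0.05},
    {"label": "LABEL_2", "score": 0.04},
    {"label": "LABEL_3", "score": 0.03},
    {"label": "LABEL_4", "score": 0.06},
]]

# Hypothetical label-to-career mapping; verify via model.config.id2label.
id2career = {"LABEL_0": "Psicología", "LABEL_1": "Derecho",
             "LABEL_2": "Química Farmacéutico Biológica",
             "LABEL_3": "Actuaría", "LABEL_4": "Economía"}

# Pick the entry with the highest score.
best = max(classification_result[0], key=lambda d: d["score"])
print(id2career[best["label"]], best["score"])  # Psicología 0.82
```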
Analogy for Better Understanding
Think of the Unam_tesis_ROBERTA_GOB_finnetuning model like a skilled librarian who is trained to sort books (theses) into specific categories (academic careers). Just as the librarian uses knowledge and experience to determine the right place for each book, the RoBERTa model uses its training on relevant data to classify thesis texts accordingly. The more books the librarian handles, the better they become at making these judgments, similar to how the model improves accuracy as it processes more documents.
Troubleshooting
If you encounter any issues while using the model, consider the following troubleshooting steps:
- Make sure you have a stable internet connection when downloading the model and tokenizer.
- Check that your Python version and installed libraries are compatible with PyTorch.
- If you get an error related to tokenization, ensure that your input text is correctly formatted.
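As a rough guard against tokenization errors, you can normalize the input before passing it to the pipeline. This is a minimal sketch: the word-count cap is a crude stand-in for the model's subword-token limit, so for precision rely on tokenizer(..., truncation=True) and check tokenizer.model_max_length:

```python
def clean_thesis_text(text, max_words=400):
    """Normalize input for the classification pipeline.

    Ensures the input is a non-empty string, collapses whitespace,
    and crudely caps length by word count (the real limit is measured
    in subword tokens, not words).
    """
    if not isinstance(text, str):
        raise TypeError("pipeline input must be a string")
    words = text.split()
    if not words:
        raise ValueError("pipeline input must not be empty")
    return " ".join(words[:max_words])

print(clean_thesis_text("  El objetivo de   esta tesis\n es importante "))
```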
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
Classifying theses with the Unam_tesis_ROBERTA_GOB_finnetuning model is a streamlined process that leverages cutting-edge machine learning technologies. By following the steps outlined above, you can effectively sort thesis documents with impressive accuracy.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

