Welcome to the world of machine learning! With the increasing amount of data available, classifying information accurately is more crucial than ever. Today, we are diving into the process of classifying theses at the National Autonomous University of Mexico (UNAM) using a RoBERTa model fine-tuned for this specific purpose.
What is Unam_tesis_ROBERTA_GOB_finnetuning?
The Unam_tesis_ROBERTA_GOB_finnetuning model builds on RoBERTa, a transformer-based model optimized for natural language tasks. Its base model, roberta-large-bne, was pretrained on a corpus from the National Library of Spain (BNE) and then fine-tuned to classify theses into five distinct academic careers:
- Psicología
- Derecho
- Química Farmacéutico Biológica
- Actuaría
- Economía
Understanding the Training Dataset
The model was trained using a structured dataset containing 1000 documents, each featuring:
- Thesis introduction
- Author’s first name
- Author’s last name
- Thesis title
- Year
- Career
Each of the five possible careers is equally represented with 200 documents each, ensuring a balanced training process.
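The balance described above can be illustrated with a small sketch. The records below are hypothetical stand-ins mirroring the fields listed above, not the actual dataset:

```python
from collections import Counter

# The five careers the model classifies into.
careers = ["Psicología", "Derecho", "Química Farmacéutico Biológica",
           "Actuaría", "Economía"]

# Hypothetical records: in the real dataset each career
# appears 200 times (5 x 200 = 1000 documents).
documents = [{"career": c} for c in careers for _ in range(200)]

counts = Counter(doc["career"] for doc in documents)
print(counts)                 # each career maps to 200
print(sum(counts.values()))   # 1000
```

A balanced class distribution like this means the classifier is not biased toward any single career during training.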
Using the Model: Step-By-Step
To use the unam_tesis_ROBERTA_GOB_finnetuning model, you’ll need to follow these steps:
Step 1: Install the Required Libraries
Make sure you have PyTorch and the Hugging Face Transformers library installed. You can install both with pip:
pip install torch transformers
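Before downloading the model, you can quickly confirm that both libraries are importable. This is a minimal sanity-check sketch:

```python
import importlib.util

# Check that the required packages are importable before
# attempting to download the model.
for pkg in ("torch", "transformers"):
    found = importlib.util.find_spec(pkg) is not None
    status = "installed" if found else "MISSING - run: pip install " + pkg
    print(f"{pkg}: {status}")
```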
Step 2: Download the Model and Tokenizer
Now, you can download the model and tokenizer in your Python environment with the following code:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline

# The tokenizer comes from the base Spanish RoBERTa model;
# the classification head comes from the fine-tuned UNAM thesis model.
tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-large-bne", use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(
    "hackathon-pln-es/unam_tesis_ROBERTA_GOB_finnetuning",
    num_labels=5,
    output_attentions=False,
    output_hidden_states=False,
)
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=True)
Step 3: Classify a Thesis
To classify the content of a thesis, simply use the following code snippet:
classification_result = pipe("El objetivo de esta tesis es elaborar un estudio de las condiciones asociadas al aprendizaje desde casa")
print(classification_result)
This will provide you with scores for each of the five careers based on the input text.
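With return_all_scores=True, the pipeline returns a list of {label, score} dictionaries per input. Here is a minimal sketch of extracting the top career from that output; the scores and the LABEL_i-to-career mapping below are illustrative assumptions, so check model.config.id2label for the actual mapping:

```python
# Illustrative output shape; real scores come from the pipeline call.
classification_result = [[
    {"label": "LABEL_0", "score": 0.82},
    {"label": "LABEL_1", "score": 0.05},
    {"label": "LABEL_2", "score": 0.04},
    {"label": "LABEL_3", "score": 0.03},
    {"label": "LABEL_4", "score": 0.06},
]]

# Hypothetical label-to-career mapping; verify via model.config.id2label.
id2career = {"LABEL_0": "Psicología", "LABEL_1": "Derecho",
             "LABEL_2": "Química Farmacéutico Biológica",
             "LABEL_3": "Actuaría", "LABEL_4": "Economía"}

# Pick the entry with the highest score.
best = max(classification_result[0], key=lambda d: d["score"])
print(id2career[best["label"]], best["score"])  # Psicología 0.82
```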
Analogy for Better Understanding
Think of the Unam_tesis_ROBERTA_GOB_finnetuning model like a skilled librarian who is trained to sort books (theses) into specific categories (academic careers). Just as the librarian uses knowledge and experience to determine the right place for each book, the RoBERTa model uses its training on relevant data to classify thesis texts accordingly. The more books the librarian handles, the better they become at making these judgments, similar to how the model improves accuracy as it processes more documents.
Troubleshooting
If you encounter any issues while using the model, consider the following troubleshooting steps:
- Make sure you have a stable internet connection when downloading the model and tokenizer.
- Check that your Python version and installed libraries are compatible with PyTorch.
- If you get an error related to tokenization, ensure that your input text is correctly formatted.
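As a rough guard against tokenization errors, you can normalize the input before passing it to the pipeline. This is a minimal sketch: the word-count cap is a crude stand-in for the model's subword-token limit, so for precision rely on tokenizer(..., truncation=True) and check tokenizer.model_max_length:

```python
def clean_thesis_text(text, max_words=400):
    """Normalize input for the classification pipeline.

    Ensures the input is a non-empty string, collapses whitespace,
    and crudely caps length by word count (the real limit is measured
    in subword tokens, not words).
    """
    if not isinstance(text, str):
        raise TypeError("pipeline input must be a string")
    words = text.split()
    if not words:
        raise ValueError("pipeline input must not be empty")
    return " ".join(words[:max_words])

print(clean_thesis_text("  El objetivo de   esta tesis\n es importante "))
```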
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
Classifying theses with the Unam_tesis_ROBERTA_GOB_finnetuning model is a streamlined process that leverages cutting-edge machine learning technologies. By following the steps outlined above, you can effectively sort thesis documents with impressive accuracy.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

