This post looks at a BERT model fine-tuned for sentiment analysis of code-mixed Hinglish text, that is, text that mixes Hindi and English within the same sentence. Code-mixed input is hard for NLP systems because vocabulary and grammar from two languages appear side by side; this model applies the BERT architecture to that problem.
Model Description
This model uses BERT (Bidirectional Encoder Representations from Transformers) to classify the sentiment of code-mixed Hinglish text into one of three classes:
- 0 – Negative
- 1 – Neutral
- 2 – Positive
The model takes code-mixed Hinglish text as input and returns one of the sentiment classes listed above. It starts from the pretrained bert-base-multilingual-cased model from Hugging Face and was fine-tuned on the SAIL 2017 dataset.
Evaluation Results
The model was evaluated with the following metrics:
| Metric | Score |
|---|---|
| Accuracy | 0.55873 |
| F1 Score | 0.558369 |
| Mean of Accuracy and F1 | 0.558549 |
| Precision | 0.558075 |
| Recall | 0.55873 |
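The table does not say how the per-class scores were averaged. Recall being equal to Accuracy is what support-weighted averaging produces, and the combined accuracy-and-F1 row matches the arithmetic mean of the two values. The sketch below assumes weighted averaging (an assumption, not something the source states) and uses invented toy labels; `weighted_metrics` is an illustrative helper name.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1 (pure Python)."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = support[label]
        p_l = tp / predicted if predicted else 0.0
        r_l = tp / actual if actual else 0.0
        f_l = 2 * p_l * r_l / (p_l + r_l) if (p_l + r_l) else 0.0
        weight = actual / n  # each class counts in proportion to its support
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "acc_and_f1": (accuracy + f1) / 2,
    }

# Toy labels, invented for illustration (0 = Negative, 1 = Neutral, 2 = Positive).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
scores = weighted_metrics(y_true, y_pred)
print(scores)

# The combined row in the table is the mean of the reported accuracy and F1.
reported_mean = (0.55873 + 0.558369) / 2
```

With weighted averaging, recall always equals accuracy, which is why the two rows in the table carry the same value under this assumption.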
How to Use the Model
The model can be loaded from either PyTorch or TensorFlow through the transformers library. An analogy helps:
Think of the model as an artist, the tokenizer as the brush, and your text as the canvas. The result depends on how you prepare the input and on how the model's architecture processes it.
Using PyTorch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Model identifier as given in this post; confirm it on the Hugging Face Hub before use
tokenizer = AutoTokenizer.from_pretrained('rohanrajpal/bert-base-en-es-codemix-cased')
model = AutoModelForSequenceClassification.from_pretrained('rohanrajpal/bert-base-en-es-codemix-cased')
text = "Replace me by any text you’d like."
# Tokenize into PyTorch tensors, then run the classifier
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
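The classifier returns raw scores (logits), one per class, rather than a label. A minimal sketch of turning three logits into a sentiment name follows. It assumes the checkpoint's output order matches the class list given earlier (0 Negative, 1 Neutral, 2 Positive); `ID2LABEL` and `logits_to_sentiment` are illustrative names, and the example logits are made up.

```python
import math

# Class order taken from the list in the model description (assumed to match the checkpoint).
ID2LABEL = {0: "Negative", 1: "Neutral", 2: "Positive"}

def logits_to_sentiment(logits):
    """Apply a softmax to the logits and return (label, probabilities)."""
    if len(logits) != len(ID2LABEL):
        raise ValueError(f"expected {len(ID2LABEL)} logits, got {len(logits)}")
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs

label, probs = logits_to_sentiment([-1.2, 0.3, 2.1])  # made-up logits
print(label, [round(p, 3) for p in probs])
```

With the PyTorch snippet above, the logits for the single input text should be available as `output.logits[0].tolist()`.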
Using TensorFlow
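from transformers import BertTokenizer, TFAutoModelForSequenceClassification
tokenizer = BertTokenizer.from_pretrained('rohanrajpal/bert-base-en-es-codemix-cased')
# Load the model with its classification head; TFBertModel alone returns only hidden states.
# If the repository ships only PyTorch weights, add from_pt=True.
model = TFAutoModelForSequenceClassification.from_pretrained('rohanrajpal/bert-base-en-es-codemix-cased')
text = "Replace me by any text you’d like."
# Tokenize into TensorFlow tensors, then run the classifier
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)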
Preprocessing Steps
For best results, preprocess the input with the same steps that were applied to the training data:
- Remove digits
- Remove punctuation
- Remove stopwords
- Remove excess whitespace
Here’s a snippet of how the preprocessing is done:
from pathlib import Path
import pandas as pd
from lingualytics.preprocessing import remove_lessthan, remove_punctuation, remove_stopwords
from lingualytics.stopwords import hi_stopwords, en_stopwords
from texthero.preprocessing import remove_digits, remove_whitespace

root = Path('path-to-data')  # directory holding test.txt, train.txt and validation.txt
for file in ['test', 'train', 'validation']:
    tochange = root / f'{file}.txt'
    # Each file is tab-separated with two columns: text and label
    df = pd.read_csv(tochange, header=None, sep='\t', names=['text', 'label'])
    # Apply the four cleaning steps listed above, in order
    df['text'] = df['text'].pipe(remove_digits)\
        .pipe(remove_punctuation)\
        .pipe(remove_stopwords, stopwords=en_stopwords.union(hi_stopwords))\
        .pipe(remove_whitespace)
    # Overwrite the original file with the cleaned text
    df.to_csv(tochange, index=None, header=None, sep='\t')
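At inference time the same four steps have to be applied to a single string rather than to a file. The sketch below approximates them with the standard library only. It is not the lingualytics or texthero implementation, details such as how digits inside words are handled may differ, and `DEMO_STOPWORDS` is a tiny hypothetical stand-in for the much larger English and Hindi stopword lists used above.

```python
import re
import string

# Tiny illustrative stopword set (hypothetical); the real lists are far larger.
DEMO_STOPWORDS = {"the", "is", "a", "hai", "ka", "ki", "ko"}

# Map every ASCII punctuation character to a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def preprocess(text, stopwords=DEMO_STOPWORDS):
    """Remove digits, punctuation, stopwords and excess whitespace from one string."""
    text = re.sub(r"\d+", " ", text)       # remove digits
    text = text.translate(_PUNCT_TABLE)    # remove ASCII punctuation
    tokens = [t for t in text.split() if t.lower() not in stopwords]  # remove stopwords
    return " ".join(tokens)                # joining on single spaces collapses whitespace

cleaned = preprocess("Yeh movie   2 baar dekhi, it is SO good!!!")
print(cleaned)
```

Only ASCII punctuation is covered here; text containing Devanagari punctuation such as the danda would need extra handling.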
Training Data and Procedure
The dataset and its annotations are not ideal, but SAIL 2017 is the best option currently available. There are plans to obtain a more comprehensive dataset for future versions of the model. Training started from the bert-base-multilingual-cased checkpoint.
Troubleshooting
While using this model, you might encounter some common issues:
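- Compatibility Issues: Make sure transformers and either PyTorch or TensorFlow are installed in mutually compatible versions. The preprocessing snippet also needs pandas, texthero, and lingualytics.
- Data Inconsistencies: Check that your input files are tab-separated with a text column and a label column, and that the preprocessing steps above have been applied.
- Model Performance: If predictions are not as expected, revisit the preprocessing steps or the training data. The reported accuracy is about 0.56, so some misclassifications are to be expected.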
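Data formatting problems can be caught before any preprocessing or training runs. The following sketch checks that each row has the text, tab, label layout read by the preprocessing snippet. It assumes labels are stored as the integers 0, 1 and 2, which the source does not state explicitly; `find_bad_rows` is an illustrative helper, and the sample rows are invented.

```python
VALID_LABELS = {"0", "1", "2"}  # assumed label encoding in the data files

def find_bad_rows(lines):
    """Return (line_number, reason) for rows that are not 'text<TAB>label'."""
    problems = []
    for number, line in enumerate(lines, start=1):
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            problems.append((number, "expected exactly two tab-separated fields"))
        elif not parts[0].strip():
            problems.append((number, "empty text"))
        elif parts[1].strip() not in VALID_LABELS:
            problems.append((number, f"unexpected label {parts[1]!r}"))
    return problems

# Invented sample rows: two valid, one missing its label, one with a bad label.
sample = ["acha laga\t2\n", "bakwas movie\t0\n", "no label here\n", "theek hai\t5\n"]
bad_rows = find_bad_rows(sample)
print(bad_rows)
```

For a real file, the lines could be passed in directly from an open file handle, for example `find_bad_rows(open(path, encoding="utf-8"))`, where `path` is whichever data file you want to check.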

