BERT Code-Mixed Base Model for Hinglish (Cased)

May 21, 2021 | Educational

This post takes a closer look at a BERT model fine-tuned for sentiment analysis on code-mixed Hinglish, that is, text that mixes Hindi and English within the same sentence. Code-mixed text is hard for NLP systems because a single sentence draws vocabulary and grammar from two languages, which also makes it a useful test case for multilingual models.

Model Description

This model uses BERT (Bidirectional Encoder Representations from Transformers) to classify the sentiment of code-mixed Hinglish text into one of three classes:

  • 0 – Negative
  • 1 – Neutral
  • 2 – Positive

The input is any code-mixed Hinglish text, and the output is one of the three sentiment classes above. The model starts from the pretrained bert-base-multilingual-cased checkpoint on Hugging Face and is fine-tuned on the SAIL 2017 dataset.

Evaluation Results

The reported scores, all close to 0.56 on this three-class task, are:

Metric                  Score
Accuracy                0.55873
F1 Score                0.558369
Accuracy & F1 (mean)    0.558549
Precision               0.558075
Recall                  0.55873
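The source does not say how precision, recall, and F1 were averaged across the three classes. Recall equal to accuracy (0.55873 for both) is what support-weighted averaging produces, so the sketch below assumes weighted averaging. The function name and the toy labels are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1 (assumed averaging)."""
    n = len(y_true)
    support = Counter(y_true)  # number of true examples per class
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = support[label]
        p_l = tp / predicted if predicted else 0.0
        r_l = tp / actual if actual else 0.0
        f_l = 2 * p_l * r_l / (p_l + r_l) if (p_l + r_l) else 0.0
        weight = actual / n  # weight each class by its share of true examples
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "acc_and_f1": (accuracy + f1) / 2,  # plain mean of accuracy and F1
    }

# Toy labels (0 = Negative, 1 = Neutral, 2 = Positive), not taken from SAIL 2017.
toy = weighted_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])

# The "Accuracy & F1" row is the mean of the two reported scores.
reported_mean = (0.55873 + 0.558369) / 2
```

With weighted averaging, recall always equals accuracy, which is why the two rows in the table carry the same value.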

How to Use the Model

The model can be loaded with either PyTorch or TensorFlow through the transformers library.

As an analogy, think of the model as an artist, the tokenizer as the brush, and your text as the canvas. The tokenizer turns raw text into the token IDs the model can work with, and the model turns those IDs into a sentiment prediction. How you prepare the input affects the result just as much as the model itself.

Using PyTorch

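from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Model ID as given here; the name says "en-es", so confirm it is the checkpoint you intend to load.
model_name = 'rohanrajpal/bert-base-en-es-codemix-cased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Replace me by any text you'd like."
# Tokenize into PyTorch tensors and run the classifier
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)  # output.logits holds one score per sentiment class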

Using TensorFlow

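from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

model_name = 'rohanrajpal/bert-base-en-es-codemix-cased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the sequence-classification class so the sentiment head is included;
# TFBertModel would return only hidden states, not class scores.
model = TFAutoModelForSequenceClassification.from_pretrained(model_name)

text = "Replace me by any text you'd like."
# Tokenize into TensorFlow tensors and run the classifier
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)  # output.logits holds one score per sentiment class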
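With either framework, a sequence-classification model returns raw scores (logits) rather than a label. The sketch below shows one way to turn a list of three logits into a label from the mapping in the Model Description. The helper name logits_to_sentiment and the example logits are hypothetical, and the code assumes the logits have already been converted to a plain Python list.

```python
import math

# Label mapping from the Model Description section.
ID2LABEL = {0: "Negative", 1: "Neutral", 2: "Positive"}

def logits_to_sentiment(logits):
    """Return (label, probabilities) for one example's list of logits."""
    # Softmax, shifted by the max logit for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pick the class with the highest probability.
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs

# Made-up logits for illustration only.
label, probs = logits_to_sentiment([-1.2, 0.3, 2.1])
```

In PyTorch the list would typically come from output.logits[0].tolist(), and in TensorFlow from output.logits[0].numpy().tolist(), assuming a single input sentence.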

Preprocessing Steps

The data was cleaned with the steps below, so applying the same steps to your own input should give the most consistent results:

  • Remove digits
  • Remove punctuation
  • Remove stopwords
  • Remove excess whitespace

The following snippet applies these steps to the train, validation, and test files:

from pathlib import Path
import pandas as pd
from lingualytics.preprocessing import remove_punctuation, remove_stopwords
from lingualytics.stopwords import hi_stopwords, en_stopwords
from texthero.preprocessing import remove_digits, remove_whitespace

root = Path('path-to-data')

for file in ['test', 'train', 'validation']:
    tochange = root / f'{file}.txt'
    # Each file is tab-separated with a text column and a label column, no header row
    df = pd.read_csv(tochange, header=None, sep='\t', names=['text', 'label'])
    # Apply the cleaning steps in order; stopwords cover both English and Hindi
    df['text'] = df['text'].pipe(remove_digits)\
                             .pipe(remove_punctuation)\
                             .pipe(remove_stopwords, stopwords=en_stopwords.union(hi_stopwords))\
                             .pipe(remove_whitespace)
    # Overwrite the original file in the same tab-separated format
    df.to_csv(tochange, index=False, header=False, sep='\t')
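At inference time you often have a single string rather than a file. The sketch below approximates the four steps for one string using only the standard library. It is not a reimplementation of the lingualytics or texthero functions, whose exact behavior may differ (for example in how digits inside words are handled), and the small stopword set is a hypothetical stand-in for the full English and Hindi lists.

```python
import re
import string

# Tiny illustrative stopword set (assumption); the pipeline above uses the
# much larger en_stopwords and hi_stopwords sets from lingualytics.
STOPWORDS = {"the", "is", "a", "hai", "ka", "ki", "ko", "aur"}

# Translation table that replaces every ASCII punctuation mark with a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def preprocess(text, stopwords=STOPWORDS):
    """Approximate the digit, punctuation, stopword, and whitespace cleanup."""
    text = re.sub(r"\d+", " ", text)      # remove digits
    text = text.translate(_PUNCT_TABLE)   # remove punctuation
    tokens = [t for t in text.split() if t.lower() not in stopwords]  # remove stopwords
    return " ".join(tokens)               # collapse excess whitespace

cleaned = preprocess("Yeh movie 2 good hai!!  Loved   it.")
```

The cleaned string can then be passed to the tokenizer in place of the raw text.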

Training Data and Procedure

The dataset and its annotations are not ideal, but they are the best option currently available. There are plans to obtain a more comprehensive dataset for future versions of the model. The model was fine-tuned from the bert-base-multilingual-cased checkpoint.

Troubleshooting

You might run into a few common issues when using this model:

  • Compatibility Issues: Make sure transformers and your chosen backend (PyTorch or TensorFlow) are installed in versions that work together.
  • Data Inconsistencies: Check that your input is formatted and preprocessed the same way as described in the preprocessing steps above.
  • Model Performance: If predictions look wrong, revisit the preprocessing first, and keep in mind that the reported accuracy is about 0.56.
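For the compatibility point, a quick first check is to list which versions of the relevant packages are installed. The sketch below uses only the standard library; the function name and the default package list are assumptions, so adjust them to your setup.

```python
from importlib import metadata

def installed_versions(packages=("transformers", "torch", "tensorflow", "pandas")):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return versions

if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Including this output when reporting a problem makes version mismatches much easier to spot.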


