How to Use Turkish BERT for Token Classification

Jul 25, 2024 | Educational

Welcome to our guide on using the Turkish BERT model for token classification, specifically focusing on capitalization and punctuation correction. Get ready to transform unstructured Turkish text into well-structured sentences!

Example Usage

This section will introduce you to a simple example of how to implement the Turkish BERT model in your Python script. Don’t worry if it seems complex at first; we’ll break it down further!

from transformers import pipeline, AutoTokenizer, BertForTokenClassification

def preprocess(text):
    # Keep letters, digits, whitespace, apostrophes and hyphens; drop all other punctuation,
    # then turn apostrophes/hyphens into spaces so the tokens match what the models expect.
    new_text = "".join(
        [char for char in text if char.isalnum() or char.isspace() or char in ("'", "-")]
    )
    new_text = new_text.replace("'", " ").replace("-", " ")
    # Turkish-specific lowercasing: ASCII lower() maps "I" to "i",
    # but Turkish orthography needs "I" -> "ı", so replace it first.
    new_text = new_text.replace("I", "ı").lower()
    return new_text

def end2end(sent, capitalization_corr, punc_corr):
    p_sent = preprocess(sent)
    r1 = capitalization_corr(p_sent)
    r2 = punc_corr(p_sent)

    tokenized_sent = tokenizer.tokenize(p_sent)
    final_sent = ''
    i = 0
    while i < len(tokenized_sent):
        token = tokenized_sent[i]
        if r1[i]['entity'] == 'one':
            token = token.capitalize()
        elif r1[i]['entity'] == 'cap':
            token = token.upper()
            # Merge WordPiece continuations, guarding against running past the last token
            while i + 1 < len(tokenized_sent) and tokenized_sent[i + 1].startswith("##"):
                token += tokenized_sent[i + 1][2:].upper()
                i += 1
                
        if r2[i]['entity'] != 'non':
            token += r2[i]['entity']
        if r2[i]['entity'] != "'":
            token += ' '
        final_sent += token
        i += 1

    final_sent = final_sent.replace(' ##', '')
    return final_sent

cap_model = BertForTokenClassification.from_pretrained("ytu-ce-cosmos/turkish-base-bert-capitalization-correction")
punc_model = BertForTokenClassification.from_pretrained("ytu-ce-cosmos/turkish-base-bert-punctuation-correction")

tokenizer = AutoTokenizer.from_pretrained("ytu-ce-cosmos/turkish-base-bert-capitalization-correction")

capitalization_corr = pipeline("ner", model=cap_model, tokenizer=tokenizer)
punc_corr = pipeline("ner", model=punc_model, tokenizer=tokenizer)

sent = """geçen hafta sonu arkadaşlarımla birlikte kısa bir tatile çıktık ...""" 
print(end2end(sent, capitalization_corr, punc_corr))

Breaking Down the Code

Let's consider the code as a machine in a bakery, where each part has a specific responsibility to produce a delicious loaf of bread (i.e., a well-structured sentence). Here's how the components work together:

  • Preprocessing: This is like the baker preparing the ingredients. The preprocess function removes unnecessary characters (similar to sifting flour) and ensures everything is ready for the next steps.
  • Tokenization: The text is divided into smaller parts (tokens), just like cutting the dough into manageable pieces before baking.
  • Correction Processes: Here, we have two bakers: one for capitalization and another for punctuation. They work in tandem, ensuring that each piece (token) is perfectly baked (correctly formatted) before being brought together to form the final loaf (the structured sentence).
  • Final Assembly: The end2end function assembles everything back together into a coherent finished product (the corrected sentence).
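To make the preprocessing and final-assembly steps concrete, here is a small plain-Python sketch (no model required; the token list below is a hypothetical WordPiece output, and `tr_lower` mirrors the `replace("I", "ı").lower()` trick used in `preprocess`):

```python
# Turkish-aware lowercasing: ASCII lower() maps "I" -> "i",
# but Turkish orthography needs "I" -> "ı", hence the replace first.
def tr_lower(text: str) -> str:
    return text.replace("I", "ı").lower()

print(tr_lower("ISPARTA Çok Güzel"))  # -> "ısparta çok güzel"

# Final assembly: WordPiece marks subword continuations with "##".
# end2end joins tokens with spaces and then strips " ##", which is
# equivalent to merging continuations back into whole words:
tokens = ["geçen", "hafta", "so", "##nu"]  # hypothetical tokenizer output
merged = " ".join(tokens).replace(" ##", "")
print(merged)  # -> "geçen hafta sonu"
```

This is exactly why `end2end` can end with a single `final_sent.replace(' ##', '')` call: the space-then-strip trick undoes the subword splitting in one pass.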

Running the Code

To execute the provided code:

  1. Ensure you have Python and the transformers library installed.
  2. Copy and paste the code into a script or a Jupyter notebook.
  3. Run the script, and you'll see a well-structured Turkish sentence printed out!
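If you are starting from a clean environment, the setup for step 1 looks like this (assuming a PyTorch backend; `torch` is not needed if you already have TensorFlow installed):

```shell
# Install the Hugging Face transformers library plus a PyTorch backend
pip install transformers torch
```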

Troubleshooting

If you encounter issues while running the code, consider the following:

  • Missing Libraries: Ensure all necessary libraries, particularly transformers, are installed. You can do this using pip install transformers.
  • Model Download Failure: Sometimes, downloading the models may fail due to connectivity issues. Try rerunning the script or checking your internet connection.
  • Memory Issues: The models may require substantial memory. If your system runs out of memory, consider running the code in a different environment or optimizing memory usage.
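For the model-download failures mentioned above, a simple retry wrapper often helps (a minimal sketch; the `attempts`/`delay` values are arbitrary, and catching `OSError` is an assumption based on `from_pretrained` typically surfacing download problems as `OSError`):

```python
import time

def retry(fn, attempts=3, delay=2.0):
    """Call fn(), retrying a few times on OSError (e.g. a failed model download)."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except OSError:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            time.sleep(delay)

# Usage (hypothetical):
# cap_model = retry(lambda: BertForTokenClassification.from_pretrained(
#     "ytu-ce-cosmos/turkish-base-bert-capitalization-correction"))
```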

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

With just a few lines of code, you can transform Turkish text into a polished narrative that flows beautifully. In today's world, automated text correction is a key component in making AI more accessible and understandable. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

Stay Informed with the Newest F(x) Insights and Blogs

Tech News and Blog Highlights, Straight to Your Inbox