BERT for Chinese Named Entity Recognition (bert4ner) Model

Feb 23, 2024 | Educational

In the realm of Natural Language Processing (NLP), the identification of named entities within text is crucial. Imagine reading a book where the characters and locations are highlighted, making the story easier to follow. This blog post will explore how to implement the BERT model for Chinese Named Entity Recognition (NER), aptly named bert4ner.

What is BERT for NER?

BERT, or Bidirectional Encoder Representations from Transformers, is a Transformer-based model that learns the meaning of a word from both its left and right context. The bert4ner model applies BERT to named entity recognition in Chinese text, distinguishing people, locations, organizations, and time expressions within large corpora of Chinese language data.
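To make this concrete, Chinese NER models like bert4ner typically predict one BIO tag per character: B-X begins an entity of type X, I-X continues it, and O marks everything else. Here is a minimal, model-free sketch of decoding such a tag sequence into entities (the tag sequence is hypothetical, chosen only to illustrate the scheme):

```python
def decode_bio(chars, tags):
    """Collect (text, type) entity spans from per-character BIO tags."""
    entities, current, ctype = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            # A B- tag closes any open entity and starts a new one
            if current:
                entities.append(("".join(current), ctype))
            current, ctype = [ch], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == ctype:
            # An I- tag of the same type extends the open entity
            current.append(ch)
        else:
            # O (or a mismatched I-) closes the open entity
            if current:
                entities.append(("".join(current), ctype))
            current, ctype = [], None
    if current:
        entities.append(("".join(current), ctype))
    return entities

chars = list("王宏伟来自北京")
tags = ["B-PER", "I-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
print(decode_bio(chars, tags))  # [('王宏伟', 'PER'), ('北京', 'LOC')]
```

This is the same decoding that the seqeval library performs for us in the full example below.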

Getting Started

To utilize this powerful model in your projects, you will need to set everything up correctly. Here’s a simple guide to get you on the right track:

Installation

Before you can use bert4ner, install the required packages (nerpy for the first method; torch, transformers, and seqeval for the second):

pip install nerpy torch transformers seqeval

Model Usage

To get started with using bert4ner, here’s how you can make it work using two different methods:

Using the nerpy Library

With the nerpy library, it’s as simple as the following:

from nerpy import NERModel

model = NERModel("bert", "shibing624/bert4ner-base-chinese")
predictions, raw_outputs, entities = model.predict(
    ["常建良,男,1963年出生,工科学士,高级工程师"], split_on_space=False
)

Using HuggingFace Transformers

If you prefer not to use nerpy, here’s how to utilize the model through HuggingFace Transformers:

import os
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from seqeval.metrics.sequence_labeling import get_entities

tokenizer = AutoTokenizer.from_pretrained('shibing624/bert4ner-base-chinese')
model = AutoModelForTokenClassification.from_pretrained('shibing624/bert4ner-base-chinese')
label_list = ['I-ORG', 'B-LOC', 'O', 'B-ORG', 'I-LOC', 'I-PER', 'B-TIME', 'I-TIME', 'B-PER']

sentence = '王宏伟来自北京,是个警察,喜欢去王府井游玩儿。'
def get_entity(sentence):
    tokens = tokenizer.tokenize(sentence)
    inputs = tokenizer.encode(sentence, return_tensors='pt')
    with torch.no_grad():
        outputs = model(inputs).logits
    predictions = torch.argmax(outputs, dim=2)
    # Drop the predictions for the [CLS] and [SEP] positions so each label
    # lines up with its token
    char_tags = [(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].numpy()[1:-1])]
    
    pred_labels = [i[1] for i in char_tags]
    entities = []
    # get_entities returns (type, start, end) spans with inclusive indices
    line_entities = get_entities(pred_labels)
    for entity_type, start, end in line_entities:
        word = sentence[start: end + 1]
        entities.append((word, entity_type))
    
    print('Sentence entity:')
    print(entities)

get_entity(sentence)

Understanding the Code

Think of the model as a highly sophisticated library assistant who, upon receiving a request, quickly fetches information from various sections of the library and organizes it neatly. In the HuggingFace example above, the assistant's process looks like this:

  • The assistant types out the request (sentence) and organizes it into a format that the library can use.
  • They consult the catalog (tokenizer) to break down the request into manageable parts (tokens).
  • After getting the necessary information (outputs) using the library model, they assign a label (tag) to each part based on pre-defined categories.
  • Finally, they summarize the information in a neat format showing what was asked and how it was categorized.

Training Data Sets

For effective results, using quality datasets is essential. Here are two important datasets you may want to refer to:

  • CNER dataset (CNER模板数据集): a Chinese NER dataset, available on the CNER GitHub
  • PEOPLE dataset (PEOPLE中文实体识别数据集): a Chinese NER dataset built from People's Daily news, available on the PEOPLE GitHub

Troubleshooting

As with any project, issues may arise. Here are a few tips to troubleshoot common problems:

  • Model not loading: Ensure that the model path is correct and accessible. Double-check network issues if loading from the hub.
  • Inconsistent results: Verify your input formats and labels to ensure accuracy in the tokenizer and model prediction stages.
  • Installation errors: Make sure all dependencies are installed correctly. Try reinstalling them if you encounter issues.
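On the "inconsistent results" point: a frequent cause of shifted entity spans is that BERT tokenizers can merge runs of digits or Latin letters into a single token, so per-token labels no longer line up with per-character positions in the raw sentence. A small illustrative check (the token list here is hypothetical tokenizer output, not produced by a real model call):

```python
# Illustrative only: subword tokenizers may merge '1963' into one token,
# so per-token labels drift relative to per-character positions.
sentence = "常建良1963年出生"
tokens = ["常", "建", "良", "1963", "年", "出", "生"]  # hypothetical tokenizer output

def label_positions_align(sentence, tokens):
    """True only when every token covers exactly one character."""
    return len(tokens) == len(sentence) and all(len(t) == 1 for t in tokens)

print(label_positions_align(sentence, tokens))  # False: '1963' covers 4 chars
```

When this check fails, slicing the raw sentence by label index will return the wrong characters, and the offsets need to be mapped through the tokenizer instead.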

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
