A BERT Model for Japanese Named Entity Recognition

Sep 29, 2022 | Educational

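This article explains how to use the Japanese named entity recognition model published by Jurabi Inc., which is loaded through the BertForTokenClassification class in transformers. Named entity recognition is a core task in natural language processing and is used in a wide range of business and research applications.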

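Named Entity Types

The model extracts the following eight types of named entities.

Person name (人名)
Corporation name (法人名): corporations and corporation-like organizations
Political organization name (政治的組織名)
Other organization name (その他の組織名)
Location name (地名)
Facility name (施設名)
Product name (製品名): product, TV program, film, book, song, and brand names, among others
Event name (イベント名)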

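Usage

To use the model, first install the libraries listed below, for example with pip. BertForTokenClassification is the PyTorch model class in transformers, so PyTorch (torch) must also be available.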

transformers
unidic_lite
fugashi

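Once all the libraries are installed, run the following code.

from transformers import BertJapaneseTokenizer, BertForTokenClassification
from transformers import pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model = BertForTokenClassification.from_pretrained('jurabi/bert-ner-japanese')
tokenizer = BertJapaneseTokenizer.from_pretrained('jurabi/bert-ner-japanese')

# Build a token-classification (NER) pipeline and run it on a sample sentence:
# "Jurabi Inc. is an IT company headquartered in Taito, Tokyo."
ner_pipeline = pipeline('ner', model=model, tokenizer=tokenizer)
print(ner_pipeline('株式会社Jurabiは、東京都台東区に本社を置くIT企業である。'))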
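Without further options, the pipeline call above returns one dictionary per token rather than one per entity. The sketch below shows one way those per-token results could be merged into whole entities. It assumes that each result has an "entity" label and a "word" string, that labels follow a B-/I- prefix scheme, and that subword pieces are marked with "##"; none of this is confirmed by the article, and the sample tokens are illustrative rather than actual model output. The transformers pipeline also accepts an aggregation_strategy argument that performs a similar grouping internally.

```python
from typing import Dict, List


def group_entities(tokens: List[Dict]) -> List[Dict]:
    """Merge consecutive B-/I- tagged tokens into whole entities.

    Assumes each token dict has "entity" (e.g. "B-地名") and "word" keys.
    """
    entities: List[Dict] = []
    current = None
    for tok in tokens:
        label = tok["entity"]
        if label == "O":
            # Outside any entity: close the one in progress.
            current = None
            continue
        prefix, _, ent_type = label.partition("-")
        if not ent_type:
            # Label without a B-/I- prefix: treat it as a possible continuation.
            prefix, ent_type = "I", label
        word = tok["word"]
        if word.startswith("##"):
            # Strip the WordPiece continuation marker.
            word = word[2:]
        continues = (
            current is not None
            and current["type"] == ent_type
            and prefix != "B"
        )
        if continues:
            current["text"] += word
        else:
            current = {"type": ent_type, "text": word}
            entities.append(current)
    return entities


# Illustrative, hand-written tokens (not actual model output).
sample_tokens = [
    {"entity": "B-法人名", "word": "株式会社"},
    {"entity": "I-法人名", "word": "Jura"},
    {"entity": "I-法人名", "word": "##bi"},
    {"entity": "B-地名", "word": "東京"},
    {"entity": "I-地名", "word": "都"},
]
grouped = group_entities(sample_tokens)
for ent in grouped:
    print(ent["type"], ent["text"])
```

Japanese text has no spaces between words, so the pieces are joined directly; a language with whitespace-separated words would need a different joining rule.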

Pretrained Model

The model is built on the Japanese BERT model published by the Inui Laboratory at Tohoku University (cl-tohoku/bert-base-japanese-v2).

Training Data

The training data is based on the Wikipedia-derived Japanese named entity recognition dataset published by Stockmark Inc. (stockmarkteam/ner-wikipedia-dataset).
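To give a rough idea of how span-annotated data of this kind can be turned into labels for token classification, here is a minimal sketch. The record layout (a "text" field plus an "entities" list whose items carry "name", "span", and "type", with an end-exclusive span) is an assumption for illustration, and the sample record is hand-written. The sketch produces character-level BIO tags only; the actual fine-tuning program aligns labels with tokenizer output, which is not shown here.

```python
from typing import Dict, List


def spans_to_char_tags(record: Dict) -> List[str]:
    """Convert span annotations into one BIO tag per character.

    Assumes record["entities"] items have "span" ([start, end), in
    characters) and "type" keys. This layout is an assumption.
    """
    text = record["text"]
    tags = ["O"] * len(text)
    for ent in record["entities"]:
        start, end = ent["span"]
        # First character of the entity gets B-, the rest get I-.
        tags[start] = "B-" + ent["type"]
        for i in range(start + 1, end):
            tags[i] = "I-" + ent["type"]
    return tags


# Hand-written example record (hypothetical, not taken from the dataset).
example = {
    "text": "Jurabiは東京にある。",
    "entities": [
        {"name": "Jurabi", "span": [0, 6], "type": "法人名"},
        {"name": "東京", "span": [7, 9], "type": "地名"},
    ],
}
char_tags = spans_to_char_tags(example)
for ch, tag in zip(example["text"], char_tags):
    print(ch, tag)
```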

Source Code

The program used for fine-tuning is published at jurabiinc/bert-ner-japanese.

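Troubleshooting

If something goes wrong, check the following points.

Are the required libraries installed correctly?
Are the model and tokenizer names specified correctly?
Is the input text free of errors?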
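The first check can be automated. The following is a small sketch that reports which of the three libraries named in this article cannot be found in the current Python environment. The function name missing_packages is made up for this example, and the check only confirms that a package is discoverable, not that its version is compatible.

```python
import importlib.util
from typing import Iterable, List

# Import names of the libraries listed in this article.
REQUIRED = ("transformers", "unidic_lite", "fugashi")


def missing_packages(names: Iterable[str] = REQUIRED) -> List[str]:
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```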

Summary

Japanese named entity recognition with BERT is a powerful technique. Projects that use this model can organize and analyze information efficiently and effectively.
