Harnessing SEC-BERT for Financial Text Analysis

May 2, 2022 | Educational

In the world of FinTech and financial Natural Language Processing (NLP), specialized language models are essential. Meet SEC-BERT, a family of BERT models pre-trained on financial text and designed to simplify and enhance the analysis of financial documents, particularly U.S. Securities and Exchange Commission (SEC) filings.

What is SEC-BERT?

SEC-BERT is an innovative collection of models that applies the BERT architecture specifically to financial documents. Its purpose is to facilitate research and applications in the financial sector. The SEC-BERT family comprises different variants to handle numerical expressions and other specific tokenization needs:

  • SEC-BERT-BASE: The foundational model, with the same architecture as BERT-BASE but trained on financial text.
  • SEC-BERT-NUM: Replaces every numeric token with a single [NUM] pseudo-token, so all numbers are handled uniformly.
  • SEC-BERT-SHAPE: Replaces numbers with pseudo-tokens that encode their shape (e.g., "5.4" becomes [X.X]), preventing numeric expressions from being fragmented into subwords.
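The difference between the variants is easiest to see on a single number. Here is a minimal sketch of how each variant would represent the token "5.4" (the bracketed pseudo-tokens follow the convention described above; the helper function is illustrative, not part of the library):

```python
import re

def shape_token(number: str) -> str:
    """Replace each digit with 'X' to form a shape pseudo-token, e.g. '5.4' -> '[X.X]'."""
    return "[" + re.sub(r"\d", "X", number) + "]"

# How each variant represents the token "5.4":
print("5.4")               # SEC-BERT-BASE keeps the literal number
print("[NUM]")             # SEC-BERT-NUM collapses every number to one pseudo-token
print(shape_token("5.4"))  # SEC-BERT-SHAPE keeps the numeric shape: [X.X]
```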

How Does SEC-BERT Work?

To understand the mechanics of SEC-BERT, consider it as a finely-tuned orchestra, where each instrument (in this case, a model variant) plays a unique role in creating a harmonious performance (an analysis of financial texts). Each model variant is designed to handle data differently, depending on its composition, ensuring that our financial texts sing in tune without missing a note.

Loading the Pretrained Model

To leverage SEC-BERT-SHAPE in your applications, follow the steps below:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-shape")
model = AutoModel.from_pretrained("nlpaueb/sec-bert-shape")

Pre-processing Text for SEC-BERT

Before feeding your data into the model, you must pre-process the text to replace numerical tokens with their respective shape pseudo-tokens. The function below exemplifies this preprocessing:

import re
import spacy
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-shape")
spacy_tokenizer = spacy.load("en_core_web_sm")

def sec_bert_shape_preprocess(text):
    """Replace numeric tokens with shape pseudo-tokens the tokenizer knows, else [NUM]."""
    tokens = [t.text for t in spacy_tokenizer(text)]
    processed_text = []
    for token in tokens:
        # Match integers and decimals like "2", "5.4", "1,234.56", as well as ".5"
        if re.fullmatch(r'(\d+[\d,.]*)|([,.]\d+)', token):
            shape = '[' + re.sub(r'\d', 'X', token) + ']'
            # Emit the shape pseudo-token only if the tokenizer knows it; otherwise fall back to [NUM]
            processed_text.append(shape if shape in tokenizer.additional_special_tokens else '[NUM]')
        else:
            processed_text.append(token)
    return ' '.join(processed_text)

sentence = "Total net sales decreased 2% or $5.4 billion during 2019 compared to 2018."
tokenized_sentence = tokenizer.tokenize(sec_bert_shape_preprocess(sentence))
print(tokenized_sentence)
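The fallback behaviour in the function above — emit the shape pseudo-token only if the tokenizer knows it, otherwise fall back to [NUM] — can be seen without loading the model. In this sketch, the set of known shapes is a small hand-written stand-in, not the real SEC-BERT-SHAPE vocabulary:

```python
import re

# Hypothetical stand-in for tokenizer.additional_special_tokens
known_shapes = {"[X]", "[XX]", "[X.X]", "[XXXX]"}

def shape_or_num(token: str) -> str:
    """Map a numeric token to its shape pseudo-token if known, else to [NUM]."""
    shape = "[" + re.sub(r"\d", "X", token) + "]"
    return shape if shape in known_shapes else "[NUM]"

print(shape_or_num("2"))         # [X]    -- shape is in the stand-in vocabulary
print(shape_or_num("2019"))      # [XXXX] -- shape is in the stand-in vocabulary
print(shape_or_num("1,234.56"))  # [NUM]  -- unseen shape, falls back
```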

Using SEC-BERT Variants

After preprocessing, you can utilize SEC-BERT in a variety of ways, such as predicting masked tokens within financial texts. Let’s delve into an example:

When the verb is masked — “Total net sales [MASK] 2% or $5.4 billion during 2019 compared to 2018” — each variant ranks plausible verbs:

  • SEC-BERT-BASE: increased (0.221), decreased (0.282)
  • SEC-BERT-NUM: increased (0.753)
  • SEC-BERT-SHAPE: increased (0.747)

When the numeric token is masked instead (“Total net sales decreased [MASK]% or $5.4 billion during 2019 compared to 2018”), the numeric variants predict their own pseudo-tokens: SEC-BERT-NUM yields [NUM] (1.000), since every number maps to that single token, while SEC-BERT-SHAPE yields shape tokens such as [XX] (0.316).

Troubleshooting Tips

Sometimes, things may go awry while utilizing SEC-BERT. Here are some troubleshooting ideas:

  • If you encounter issues during model loading, ensure the correct model name is specified.
  • For any inconsistencies in tokenization, verify that you have preprocessed your text accurately.
  • Check to see if you have installed all required libraries, like transformers and spacy.
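The last tip can be scripted with a quick environment check (importlib is standard library; the package names are the ones used in the snippets above):

```python
import importlib.util

def check_installed(*packages):
    """Report which of the given packages are importable in the current environment."""
    return {pkg: importlib.util.find_spec(pkg) is not None for pkg in packages}

status = check_installed("transformers", "spacy")
for pkg, ok in status.items():
    print(f"{pkg}: {'installed' if ok else 'MISSING -- try: pip install ' + pkg}")
```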

For additional insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

By harnessing SEC-BERT, you can elevate your analysis of financial documents and enhance your understanding of key metrics within the industry. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
