How to Use the NuExtract-large Model for Information Extraction

Welcome to the exciting world of information extraction! Today, we’ll dive into how to leverage the NuExtract-large model developed by NuMind—an impressive tool for smoothly extracting information from text. Buckle up, as we guide you through the process step by step!

What is NuExtract-large?

NuExtract-large is a fine-tuned variant of phi-3-small, designed specifically for extracting vital pieces of information from textual data using a pre-defined JSON template. Imagine this model as a skilled librarian who knows exactly where to find the specific information you need within a massive sea of texts.

Getting Started

To utilize this model, you must provide an input text (less than 2000 tokens) and a JSON template that describes what you want to extract. The model then returns the information precisely as it appeared in the original text, allowing for highly accurate extractions.
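As a quick illustration (a made-up example, not taken from the model card), a template is simply a JSON skeleton whose keys name the fields you want and whose values are empty strings or lists; the model fills it in with spans copied verbatim from the text. Given the text "Ada Lovelace was born in 1815 in London.", the template

{
    "Name": "",
    "Birth year": "",
    "City": ""
}

would be expected to yield:

{
    "Name": "Ada Lovelace",
    "Birth year": "1815",
    "City": "London"
}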

Using NuExtract-large: A Step-by-Step Guide

  • Step 1: Install the necessary libraries. You’ll need the Hugging Face Transformers library and PyTorch installed in your Python environment before running the model.
  • Step 2: Prepare your text and schema. The schema must define the structure of the information you’re trying to extract. Think of it as a blueprint for a house—the clearer it is, the better the extraction.
  • Step 3: Run the prediction function with your model and tokenizer. Below is the relevant code block:
import json
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

def predict_NuExtract(model, tokenizer, text, schema, example=["", "", ""]):
    # Re-serialize the template so the model sees consistently indented JSON.
    schema = json.dumps(json.loads(schema), indent=4)
    input_llm = "<|input|>\n### Template:\n" + schema + "\n"
    # Append any non-empty few-shot examples.
    for i in example:
        if i != "":
            input_llm += "### Example:\n" + json.dumps(json.loads(i), indent=4) + "\n"
    # The text to extract from comes after the template and all examples.
    input_llm += "### Text:\n" + text + "\n<|output|>\n"
    input_ids = tokenizer(input_llm, return_tensors="pt", truncation=True, max_length=4000).to("cuda")
    output = tokenizer.decode(model.generate(**input_ids)[0], skip_special_tokens=True)
    # Keep only the generated JSON between the output markers.
    return output.split("<|output|>")[1].split("<|end-output|>")[0]

model = AutoModelForCausalLM.from_pretrained("numind/NuExtract-large", trust_remote_code=True, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("numind/NuExtract-large", trust_remote_code=True)
model.to("cuda")
model.eval()

text = """We introduce Mistral 7B, a 7-billion-parameter language model engineered for superior performance and efficiency..."""

schema = """{
    "Model": {
        "Name": "",
        "Number of parameters": "",
        "Number of tokens": "",
        "Architecture": []
    },
    "Usage": {
        "Use case": [],
        "Licence": ""
    }
}"""

prediction = predict_NuExtract(model, tokenizer, text, schema, example=["", "", ""])
print(prediction)
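
If everything is set up correctly, print(prediction) emits a filled-in copy of the template. For the truncated Mistral 7B snippet above, the output would look roughly like the following (illustrative rather than a captured run; fields the text does not mention stay empty):

{
    "Model": {
        "Name": "Mistral 7B",
        "Number of parameters": "7-billion",
        "Number of tokens": "",
        "Architecture": []
    },
    "Usage": {
        "Use case": [],
        "Licence": ""
    }
}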

An Analogy for Understanding This Code

Think of the code above as a chef preparing a gourmet dish. The chef (model) needs specific ingredients (input text, JSON schema) to create a masterpiece (extracted information). Each step in the code is vital: gathering ingredients, following the recipe (defining structure), and finally cooking (invoking the model for predictions). If one ingredient is missing or a step skipped, the dish may not turn out as expected!

Troubleshooting Tips

If you encounter issues while using the model, consider the following troubleshooting ideas:

  • Check Token Limit: Ensure your input text does not exceed 2000 tokens. If needed, shorten it and try again; the snippet after this list shows a quick way to count tokens.
  • Dependencies: Verify that all required libraries are installed and up to date. Run pip install transformers torch if necessary.
  • JSON Format: Ensure your JSON schema is well-formed. Even a small syntax error can impede functionality.
  • CUDA Errors: Make sure your machine has a working CUDA setup for GPU processing. If you aren’t using a GPU, adjust the code to run on CPU, as sketched below.
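
The sketch below combines the token-limit check and the CPU fallback, assuming the same model repository as in the example above; the token count comes from the tokenizer itself, and the device is chosen based on CUDA availability:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "numind/NuExtract-large"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

text = "We introduce Mistral 7B, a 7-billion-parameter language model..."

# Count tokens up front to stay under the model's 2000-token input limit.
n_tokens = len(tokenizer(text)["input_ids"])
if n_tokens > 2000:
    print(f"Text is {n_tokens} tokens long; consider shortening it.")

# Pick a device based on availability; use float32 on CPU, since bfloat16
# support on CPU varies across PyTorch builds.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32,
).to(device)

If you run on CPU, also replace the hard-coded "cuda" inside predict_NuExtract with this device variable so the inputs land on the same device as the model.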

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

Using the NuExtract-large model can simplify the process of information extraction, making it a powerful tool for any data-driven project. By following the guidelines mentioned above, you’ll be well on your way to mastering the extraction process!

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
