How to Use the Structure Extraction Model by NuMind: NuExtract

Jun 26, 2024 | Educational

Welcome to the world of information extraction! Today, we’ll dive into how you can harness the power of NuMind’s NuExtract model, a fine-tuned version of phi-3-mini designed to extract structured data from text effortlessly.

What is NuExtract?

NuExtract takes a text input and a template describing the fields you want, and returns the matching information from the text. The model is purely extractive: every value in the output is copied as-is from the input text rather than paraphrased or generated.

How to Use the NuExtract Model

Follow these steps to effectively use NuExtract:

  • Step 1: Install the required libraries. The sample code needs both transformers (to load the pre-trained model and tokenizer) and torch (PyTorch).
  • Step 2: Import the necessary libraries in your Python environment.
  • Step 3: Define your input text and a JSON schema (template) listing the fields you want to extract. The schema is passed as a JSON string.
  • Step 4: Call the predict_NuExtract function with the model, tokenizer, input text, schema, and, if needed, a list of example outputs given as JSON strings.
  • Step 5: Retrieve and interpret the output based on your schema.
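
As a rough sketch of Step 3, the input text and template might be defined as shown here. The sample sentence and the field names ("Product", "Name", and so on) are made up for illustration, and using empty strings as placeholder values is an assumed convention. The template is serialized to a JSON string because predict_NuExtract parses its schema argument with json.loads.

```python
import json

# Hypothetical input text (made up for illustration).
text = "Acme Robotics released the Hopper drone in 2021. It weighs 2 kg."

# Template as a Python dict: keys name the fields to extract,
# empty strings mark the values to be filled in (assumed convention).
template = {
    "Product": {
        "Name": "",
        "Manufacturer": "",
        "Release year": "",
        "Weight": "",
    }
}

# predict_NuExtract expects the template as a JSON string, so serialize it.
schema = json.dumps(template)

# Sanity check: the string must survive a round trip through the JSON parser.
assert json.loads(schema) == template

# With a loaded model and tokenizer (Step 4), the call would look like:
# prediction = predict_NuExtract(model, tokenizer, text, schema)
```

Building the template as a dict and serializing it avoids hand-written JSON mistakes such as trailing commas or single quotes.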

Understanding the Code: An Analogy

Imagine you are a detective trying to extract vital information from a large book filled with stories. The predict_NuExtract function acts as your trusty assistant who filters through the pages to extract only the relevant data you’ve asked for. Here’s how that corresponds to each part of the code:

  • Setup: You gather your tools (libraries and model) just like a detective prepares their magnifying glass and notebook before diving into the case.
  • Schema as an Outline: Think of the schema as the framework for what your investigation needs. It's like having a checklist of what to look for, such as the name, parameter count, and architecture of a model described in the text.
  • Input and Output: Your input text is the book, and your assistant (the function) reads it, using the checklist to pull out only the information you asked for.

Sample Code

Here’s a simplified version of how the code looks:

import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def predict_NuExtract(model, tokenizer, text, schema, example=("", "", "")):
    # Normalize the template by round-tripping it through the JSON parser.
    schema = json.dumps(json.loads(schema), indent=4)
    input_llm = "<|input|>\n### Template:\n" + schema + "\n"
    # Append each non-empty example (a JSON string) to the prompt.
    for i in example:
        if i != "":
            input_llm += "### Example:\n" + json.dumps(json.loads(i), indent=4) + "\n"
    input_llm += "### Text:\n" + text + "\n<|output|>\n"
    # Truncate the full prompt to 4000 tokens and move it to the model's device.
    input_ids = tokenizer(input_llm, return_tensors="pt", truncation=True, max_length=4000).to(model.device)
    output = tokenizer.decode(model.generate(**input_ids)[0], skip_special_tokens=True)
    # Keep only the text between the output markers.
    return output.split("<|output|>")[1].split("<|end-output|>")[0]

# Load model and tokenizer; use the GPU when one is available, otherwise the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained("numind/NuExtract", torch_dtype=torch.bfloat16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("numind/NuExtract", trust_remote_code=True)
model.to(device)
model.eval()
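
To see what the model actually receives, the prompt-building part of predict_NuExtract can be reproduced without loading any weights. This is only a sketch: the helper names build_prompt and parse_prediction are hypothetical, the sample text and template are made up, and parse_prediction assumes the returned string is JSON that follows the template, which may not always hold.

```python
import json

def build_prompt(text, schema, examples=()):
    """Rebuild the prompt string that predict_NuExtract sends to the model."""
    prompt = "<|input|>\n### Template:\n" + json.dumps(json.loads(schema), indent=4) + "\n"
    for ex in examples:
        if ex != "":
            prompt += "### Example:\n" + json.dumps(json.loads(ex), indent=4) + "\n"
    prompt += "### Text:\n" + text + "\n<|output|>\n"
    return prompt

def parse_prediction(raw):
    """Parse the returned string as JSON; return None if it is not valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None

# Made-up text and template, used only to preview the prompt layout.
prompt = build_prompt("Hopper weighs 2 kg.", '{"Name": "", "Weight": ""}')
print(prompt)
```

Printing the prompt is a quick way to confirm that the template and any examples were formatted as intended before spending time on generation. Passing the model's return value through parse_prediction turns it into a dict when it is valid JSON and signals a malformed result with None.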

Troubleshooting Tips

If you encounter any issues while using NuExtract, consider the following:

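  • Model Loading Issues: Ensure that your environment supports CUDA before moving the model and inputs to "cuda". If it does not, run everything on the CPU ("cpu") instead, which works but is slower.
  • Input Size: Keep the input text under 2000 tokens. The sample code truncates the whole prompt (template, examples, and text together) at 4000 tokens, so anything past that point is silently cut off. If your text is too long, shorten it or split it into smaller pieces.
  • Schema Alignment: Double-check your JSON schema format. The schema must be a valid JSON string, because predict_NuExtract parses it with json.loads; malformed JSON raises an error before the model runs.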
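
The checks above can be sketched as three small helpers. The function names are hypothetical, and count_tokens assumes a Hugging Face-style tokenizer that returns an "input_ids" list when called on a string; the count may include special tokens the tokenizer adds.

```python
import json

def pick_device():
    """Return "cuda" when PyTorch sees a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

def count_tokens(text, tokenizer):
    """Count the tokens the given tokenizer produces for the text."""
    return len(tokenizer(text)["input_ids"])

def schema_error(schema):
    """Return None if the schema string is valid JSON, else a short error message."""
    try:
        json.loads(schema)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None
```

A typical use would be to call schema_error(schema) before prediction and to compare count_tokens(text, tokenizer) against 2000, shortening the text when it is over the limit.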

Final Thoughts

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

Now go ahead, give NuExtract a whirl, and start extracting the gems hidden within your text!
