How to Use NuExtract-v1.5 for Structured Information Extraction

Oct 28, 2024 | Educational

Welcome to our guide to NuExtract-v1.5 by NuMind, a model for extracting structured information from long documents in multiple languages, including English, French, Spanish, German, Portuguese, and Italian. It is well suited to businesses and developers who need to pull specific data out of extensive texts.

Getting Started

To harness the capabilities of NuExtract-v1.5, you will need to do the following:

  • Install the Required Libraries: The model requires the transformers library from Hugging Face, plus torch (for example, pip install transformers torch).
  • Set Up Your Environment: You will need a Python environment with access to a GPU for optimal performance; a quick sanity check follows below.
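
Before loading the model, you can confirm that the libraries are importable and a GPU is visible. This is a minimal sanity check of our own, not part of NuExtract itself:

python
import torch
import transformers

# Confirm library versions and GPU visibility before loading the model
print(f"transformers {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")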

Using the NuExtract Model

Here’s a step-by-step guide to using the model:

Imagine you are a librarian tasked with sorting out various pieces of information from a stack of books. The books represent your input documents, and you need to pull out specific information like titles, authors, and summaries. NuExtract performs this task efficiently for you.

Code Overview

The model utilizes various functions to extract data based on a provided template. Here’s how the main functions work:

python
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def predict_NuExtract(model, tokenizer, texts, template, batch_size=1, max_length=10_000, max_new_tokens=4_000):
    # Normalize the template to consistently indented JSON
    template = json.dumps(json.loads(template), indent=4)
    # NuExtract expects prompts wrapped in <|input|> ... <|output|> markers
    prompts = [f"<|input|>\n### Template:\n{template}\n### Text:\n{text}\n\n<|output|>" for text in texts]
    outputs = []
    with torch.no_grad():
        for i in range(0, len(prompts), batch_size):
            batch_prompts = prompts[i:i+batch_size]
            batch_encodings = tokenizer(batch_prompts, return_tensors='pt', truncation=True, padding=True, max_length=max_length).to(model.device)
            pred_ids = model.generate(**batch_encodings, max_new_tokens=max_new_tokens)
            outputs += tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    # Keep only the text generated after the <|output|> marker
    return [output.split("<|output|>")[1] for output in outputs]

In our analogy, the predict_NuExtract function is the librarian: it reads through your prompts (the template plus each input text) and returns the information requested by the template.

Example Implementation

Let’s put this into practice:

python
model_name = 'numind/NuExtract-v1.5'
device = 'cuda'
# Load the model in bfloat16 on the GPU and switch to inference mode
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, trust_remote_code=True).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
text = "We introduce Mistral 7B, a 7-billion-parameter language model..."

template = '''
{
    "Model": {
        "Name": "",
        "Number of parameters": "",
        "Number of max tokens": "",
        "Architecture": []
    },
    "Usage": {
        "Use case": [],
        "Licence": ""
    }
}
'''

prediction = predict_NuExtract(model, tokenizer, [text], template)[0]
print(prediction)

This example initializes the model, sets up your variables, and then calls predict_NuExtract to extract the desired information from the text according to the template. The result is a JSON string that mirrors the template.
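
Because the input text above is truncated, exact values will vary, but the output should follow the template, leaving fields empty where the text gives no answer. An illustrative (not verbatim) result:

json
{
    "Model": {
        "Name": "Mistral 7B",
        "Number of parameters": "7-billion",
        "Number of max tokens": "",
        "Architecture": []
    },
    "Usage": {
        "Use case": [],
        "Licence": ""
    }
}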

Troubleshooting Tips

In case you encounter issues while using NuExtract-v1.5, here are some troubleshooting ideas:

  • Model Not Loading: Ensure that the model name is spelled correctly and the required libraries are installed properly.
  • Memory Issues: Ensure your GPU has enough memory; consider reducing batch_size or max_length in predict_NuExtract.
  • Broken JSON Output: The model may occasionally return invalid JSON. You can handle this by keeping track of the last valid output and falling back to it when a new output fails to parse. This is similar to asking the librarian to refer back to the last book she retrieved; a minimal sketch follows below.
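
Here is a minimal sketch of that fallback (the safe_parse helper is our own illustration, not part of NuExtract or transformers):

python
import json

def safe_parse(prediction, last_valid=None):
    # Return the parsed JSON if valid; otherwise fall back to the last valid result
    try:
        return json.loads(prediction)
    except json.JSONDecodeError:
        return last_valid

In a loop over documents, you would call result = safe_parse(prediction, last_valid) and update last_valid whenever the parse succeeds.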

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

With NuExtract-v1.5, you have a powerful ally for extracting structured information from lengthy documents. Remember: like a diligent librarian, the model does its best work when you guide it with the right prompts and templates.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
