Welcome to our guide on using NuExtract-v1.5 by NuMind, a powerful tool for extracting structured information from long documents across multiple languages, including English, French, Spanish, German, Portuguese, and Italian. This model is well suited for businesses and developers seeking a reliable way to pull specific data out of extensive texts.
Getting Started
To harness the capabilities of NuExtract-v1.5, you will need to do the following:
- Install the Required Libraries: Ensure you have the necessary libraries installed. The model requires the `transformers` library from Hugging Face, along with PyTorch.
- Set Up Your Environment: You will need a Python environment with access to a GPU for optimal performance.
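As a quick setup sketch (assuming the standard `transformers` and `torch` packages from PyPI cover your needs), the installation can be done with pip:

```shell
# Install Hugging Face Transformers and PyTorch
# (a GPU build of torch may require the install command from pytorch.org
#  matching your CUDA version)
pip install transformers torch
```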
Using the NuExtract Model
Here’s a step-by-step guide to use the model:
Imagine you are a librarian tasked with sorting out various pieces of information from a stack of books. The books represent your input documents, and you need to pull out specific information like titles, authors, and summaries. NuExtract performs this task efficiently for you.
Code Overview
The model utilizes various functions to extract data based on a provided template. Here’s how the main functions work:
```python
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def predict_NuExtract(model, tokenizer, texts, template, batch_size=1, max_length=10_000, max_new_tokens=4_000):
    # Normalize the template so the model always sees consistent formatting
    template = json.dumps(json.loads(template), indent=4)
    prompts = [f"<|input|>\n### Template:\n{template}\n### Text:\n{text}\n\n<|output|>" for text in texts]

    outputs = []
    with torch.no_grad():
        for i in range(0, len(prompts), batch_size):
            batch_prompts = prompts[i:i+batch_size]
            batch_encodings = tokenizer(batch_prompts, return_tensors="pt", truncation=True, padding=True, max_length=max_length).to(model.device)
            pred_ids = model.generate(**batch_encodings, max_new_tokens=max_new_tokens)
            outputs += tokenizer.batch_decode(pred_ids, skip_special_tokens=True)

    # Keep only the text generated after the <|output|> marker
    return [output.split("<|output|>")[1] for output in outputs]
```
In our analogy, the `predict_NuExtract` function is like an efficient librarian who reads through your prompts (or inquiries) and fetches the information specified by the template you provide.
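As a minimal sketch of what that librarian actually receives, here is the prompt string the function assembles (the `<|input|>`/`<|output|>` markers follow the prompt format NuExtract expects; the template and document here are placeholders):

```python
import json

# A toy template and document standing in for real inputs
template = '{"Name": "", "Authors": []}'
document = "Attention Is All You Need, by Vaswani et al."

# Same normalization and framing as predict_NuExtract performs
formatted = json.dumps(json.loads(template), indent=4)
prompt = f"<|input|>\n### Template:\n{formatted}\n### Text:\n{document}\n\n<|output|>"
print(prompt)
```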
Example Implementation
Let’s put this into practice:
```python
model_name = 'numind/NuExtract-v1.5'
device = 'cuda'
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, trust_remote_code=True).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

text = "We introduce Mistral 7B, a 7-billion-parameter language model..."

template = '''{
    "Model": {
        "Name": "",
        "Number of parameters": "",
        "Number of max tokens": "",
        "Architecture": []
    },
    "Usage": {
        "Use case": [],
        "Licence": ""
    }
}'''

prediction = predict_NuExtract(model, tokenizer, [text], template)[0]
print(prediction)
```
This example initializes the model, sets up your variables, and then calls the `predict_NuExtract` function to extract the desired information from the provided text and template.
Troubleshooting Tips
In case you encounter issues while using NuExtract-v1.5, here are some troubleshooting ideas:
- Model Not Loading: Ensure that the model name is correctly specified and the required libraries are installed properly.
- Memory Issues: Ensure your GPU has enough memory; consider reducing the batch size or the maximum token length.
- Broken JSON Output: The model may occasionally return malformed JSON. You can handle this by keeping the previous valid output and falling back to it when the new output fails to parse, much like asking the librarian to refer back to the last book they retrieved.
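That fallback can be sketched as a small wrapper around `json.loads` (the function name `safe_parse` and the `last_valid` cache are illustrative, not part of the NuExtract API):

```python
import json

def safe_parse(output, last_valid=None):
    """Return the parsed JSON prediction, or the previous valid one.

    `last_valid` holds the last successfully parsed result; it is a
    hypothetical cache used here for illustration.
    """
    try:
        return json.loads(output)
    except json.JSONDecodeError:
        return last_valid

# Usage: keep the librarian's last good book on hand
previous = {"Model": {"Name": "Mistral 7B"}}
result = safe_parse('{"Model": {"Name":', last_valid=previous)
print(result)  # falls back to the previous valid prediction
```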
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
With NuExtract-v1.5, you have a powerful ally for extracting structured information from lengthy documents effortlessly. Just remember: like a diligent librarian, the model needs the right prompts and templates to guide it.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.