How to Use the Structure Extraction Model by NuMind

Jun 26, 2024 | Educational

Welcome to the world of AI-driven data extraction with NuMind’s NuExtract model! In this article, we will guide you through the steps of utilizing this powerful tool to extract structured information from your text. But first, let’s set the scene.

Understanding NuExtract: Your Data Extraction Ally

Imagine you are an archaeologist trying to unearth precious artifacts from layers of sediment. Each piece of information in a massive document resembles a hidden treasure, and NuExtract is your trusty excavation tool. Fine-tuned on a high-quality synthetic dataset, NuExtract helps you sift through the chaos of unstructured text to uncover specific nuggets of information you are looking for.

Getting Started

To use NuExtract, you will need to follow these straightforward steps:

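1. Install Dependencies: Ensure you have the necessary libraries installed. You will need `transformers` for handling the model and `torch`, which the loading code in the next step uses for the model's data type and GPU support.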

2. Load the Model and Tokenizer:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("numind/NuExtract", torch_dtype=torch.bfloat16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("numind/NuExtract", trust_remote_code=True)
model.to("cuda")  # Requires a CUDA-capable GPU
model.eval()  # Set the model to evaluation mode
```

3. Prepare Your Input: Craft the text you wish to analyze and design a JSON template that describes the information you need to extract.

4. Run the Prediction: Use the `predict_NuExtract` function to obtain your results.

Here is a sample implementation:


```python
import json

def predict_NuExtract(model, tokenizer, text, schema, example=["", "", ""]):
    # Re-serialize the template so it is valid, consistently indented JSON
    schema = json.dumps(json.loads(schema), indent=4)
    input_llm = "<|input|>\n### Template:\n" + schema + "\n"
    # Append each non-empty few-shot example, formatted like the template
    for i in example:
        if i != "":
            input_llm += "### Example:\n" + json.dumps(json.loads(i), indent=4) + "\n"
    input_llm += "### Text:\n" + text + "\n<|output|>\n"

    # Tokenize the full prompt; anything beyond 4000 tokens is truncated
    input_ids = tokenizer(input_llm, return_tensors="pt", truncation=True, max_length=4000).to("cuda")
    # Unpack the encoding so generate() receives input_ids and attention_mask
    output = tokenizer.decode(model.generate(**input_ids)[0], skip_special_tokens=True)

    # Keep only the text between the output markers
    return output.split("<|output|>")[1].split("<|end-output|>")[0]

# Example usage
text = """Your complex text goes here."""
schema = """{"Model": {"Name": "", "Number of parameters": "", "Number of max token": "", "Architecture": []}, "Usage": {"Use case": [], "License": ""}}"""

prediction = predict_NuExtract(model, tokenizer, text, schema)
print(prediction)
```

Replace `text` and `schema` with your specific values.
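
The template and any few-shot examples are passed in as JSON strings, and the function returns raw text. The following is a minimal sketch, not part of the original walkthrough, of building the template from a Python dict and parsing the returned text back into a dict. The helper names `make_template` and `parse_prediction` are hypothetical, the example values are made up, and the commented-out call assumes the `model`, `tokenizer`, and `predict_NuExtract` defined above.

```python
import json

def make_template(fields):
    """Serialize a nested dict of empty fields into a JSON template string."""
    return json.dumps(fields)

def parse_prediction(raw):
    """Parse the returned text as JSON; return None if it is not valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None

schema = make_template({
    "Model": {"Name": "", "Number of parameters": ""},
    "Usage": {"Use case": [], "License": ""},
})

# One filled-in example with the same shape as the template (hypothetical values).
example = json.dumps({
    "Model": {"Name": "ExampleNet", "Number of parameters": "1B"},
    "Usage": {"Use case": ["classification"], "License": "MIT"},
})

# With a loaded model, the call would look like this:
# prediction = predict_NuExtract(model, tokenizer, text, schema, example=[example, "", ""])
# result = parse_prediction(prediction)

# Offline check using a hand-written string (not real model output).
result = parse_prediction('{"Model": {"Name": "ExampleNet"}}')
print(result["Model"]["Name"])
```

Building the template with `json.dumps` avoids hand-written quoting mistakes, and checking for `None` gives you a clear signal when the output is not valid JSON.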

Troubleshooting Tips

If you encounter any issues while using NuExtract, here are some common problems and solutions:

– Model Not Loading Properly: Ensure that you have a stable internet connection and that you are using the correct model identifiers.

– Output Mismatches: If the output does not match your expectations, check your JSON schema and text input. Make sure the data types and structures are in the correct format.

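– Token Limit Exceeded: Remember that your input text must be less than 2000 tokens. The sample function also truncates the full prompt (template, examples, and text) at 4000 tokens, so anything beyond that is cut off without warning and may be missing from the results.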

– CUDA Errors: If you get errors related to CUDA, ensure that your environment supports GPU and that the PyTorch version is compatible.
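
Two of the checks above can be sketched in code. This is a rough illustration under stated assumptions rather than part of NuExtract itself: `pick_device`, `count_tokens`, and `within_limit` are hypothetical helper names, and `count_tokens` assumes a Hugging Face style tokenizer whose call returns a mapping with an `input_ids` list. The `WhitespaceTokenizer` class is a stand-in so the snippet runs without downloading the model; its counts will differ from the real tokenizer's.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

def count_tokens(tokenizer, text):
    """Number of token ids the tokenizer produces for the text."""
    return len(tokenizer(text)["input_ids"])

def within_limit(tokenizer, text, limit=2000):
    """True if the text is shorter than the token limit named above."""
    return count_tokens(tokenizer, text) < limit

class WhitespaceTokenizer:
    """Stand-in tokenizer: one id per whitespace-separated word."""
    def __call__(self, text):
        return {"input_ids": list(range(len(text.split())))}

device = pick_device()
print(device)
print(within_limit(WhitespaceTokenizer(), "a short input text"))
```

With the real setup, you would pass the `tokenizer` loaded earlier to `within_limit`, and use `device` in place of the hard-coded `"cuda"` when moving the model and the tokenized inputs.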

For further troubleshooting questions or issues, contact the fxis.ai data science team.

Conclusion

With NuExtract, extracting critical information from vast volumes of text has never been easier! By following the steps outlined above, you can be well on your way to mastering structure extraction models. Just remember, much like a skilled archaeologist, patience and precision are your best friends in this journey. Happy extracting!
