How to Get Started with OneKE: A Bilingual Large Language Model for Knowledge Extraction

Apr 13, 2024 | Educational

In the world of Natural Language Processing (NLP), knowledge extraction is an essential task that helps to turn unstructured data into structured information. OneKE, developed by Zhejiang University and Ant Group, stands out as a robust bilingual model specifically designed for this task. In this article, we will explore how to set up OneKE, dive into its functionalities, and troubleshoot potential issues you may encounter.

What is OneKE?

OneKE is a cutting-edge bilingual (Chinese and English) knowledge extraction model that uses schema-based polling instruction construction. Its primary goal is to improve the extraction of structured information from data in varied formats. Launched in 2024, it addresses the challenges faced by existing models through advanced training techniques, leading to improved generalization and performance.

How is OneKE Trained?

The training of OneKE focuses on schema-generalizable information extraction. To overcome common issues such as non-standard formats and noisy data, OneKE employs normalization, cleaning techniques, and careful instruction crafting. For deeper insights into this model’s design and efficacy, refer to the paper IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus.

Getting Started with OneKE

Model Download
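
OneKE's weights are published on the Hugging Face Hub under the repository ID zjunlp/OneKE, the same path used in the quick-start code below. If you prefer to fetch the files ahead of time, here is a minimal sketch using huggingface_hub (the local_dir path is an illustrative choice, not a required location):

```python
from huggingface_hub import snapshot_download

# Download all model files to a local directory (illustrative path)
snapshot_download(repo_id="zjunlp/OneKE", local_dir="./OneKE")
```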

Quick Start

Before you begin, ensure your system has at least **20GB of VRAM** for optimal training and inference.
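
If you are unsure how much memory your GPU offers, here is a quick check with plain PyTorch (no assumptions beyond CUDA device 0):

```python
import torch

# Report the name and total VRAM of the first CUDA device
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB of VRAM")
else:
    print("No CUDA device detected")
```

With enough memory available, here's how to set up the model: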

```python
import json
import torch
from transformers import (
    AutoConfig,
    AutoTokenizer,
    AutoModelForCausalLM,
    GenerationConfig,
    BitsAndBytesConfig,
)

model_path = "zjunlp/OneKE"
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# 4-bit quantization configuration (substantially reduces the VRAM footprint)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4',
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    config=config,
    device_map="auto",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model.eval()

# Build the instruction as a JSON string (see "OneKE Instruction Format" below),
# then wrap it in the [INST] ... [/INST] template the model expects.
system_prompt = "<<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\n"
instruction = {
    "instruction": "You are an expert in named entity recognition. Please extract entities.",
    "schema": ["person", "organization", "location"],
    "input": "284 Robert Allenby (Australia) 69...",
}
s_instruct = "[INST] " + system_prompt + json.dumps(instruction, ensure_ascii=False) + " [/INST]"

# Move the tokenized prompt onto the same device as the model
input_ids = tokenizer.encode(s_instruct, return_tensors='pt').to(model.device)
generation_output = model.generate(
    input_ids=input_ids,
    generation_config=GenerationConfig(max_length=1024),
)
# Decode only the newly generated tokens, skipping the echoed prompt
output = tokenizer.decode(generation_output[0][input_ids.shape[1]:], skip_special_tokens=True)

print(output)
```

In the code above, we load OneKE with 4-bit quantization to fit within the VRAM budget, express the extraction task as a JSON instruction, and wrap it in the `[INST] ... [/INST]` template the model expects. The schema list acts as a contract: it tells the model exactly which entity types to pull out of the input text.

Advanced Use of OneKE

OneKE Instruction Format

Instructions in OneKE are JSON strings composed of three main fields (a complete example follows the list):

  • instruction: Describes the task in natural language.
  • schema: A dynamic list of labels that indicates what information to extract.
  • input: The actual text from which the data will be extracted.
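
For example, the named-entity request from the quick-start snippet is expressed as:

```json
{
    "instruction": "You are an expert in named entity recognition. Please extract entities.",
    "schema": ["person", "organization", "location"],
    "input": "284 Robert Allenby (Australia) 69..."
}
```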

Customized Schema Description Instructions

Customized schemas help define specific extraction targets based on the context. For example:

```json
{
    "instruction": "You are an expert specializing in entity extraction.",
    "schema": {
        "Position": "Defines the occupation",
        "Company": "Represents business organizations"
    },
    "input": "Mr. John Doe works as a software engineer at TechCorp."
}
```
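
To pass a customized instruction like this to the model, serialize it to a JSON string before wrapping it in the chat template, just as in the quick-start snippet. A minimal sketch (the ensure_ascii=False flag keeps Chinese text readable, which matters for a bilingual model):

```python
import json

instruction = {
    "instruction": "You are an expert specializing in entity extraction.",
    "schema": {
        "Position": "Defines the occupation",
        "Company": "Represents business organizations"
    },
    "input": "Mr. John Doe works as a software engineer at TechCorp."
}

# ensure_ascii=False preserves non-ASCII (e.g. Chinese) characters in the prompt
s_instruct = "[INST] " + json.dumps(instruction, ensure_ascii=False) + " [/INST]"
```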

Troubleshooting Common Issues

  • Insufficient VRAM: If you hit out-of-memory errors, lower max_length, keep the 4-bit quantization shown above, or offload part of the model to CPU RAM (see the sketch below).
  • Model Loading Errors: Verify that the model path ("zjunlp/OneKE") is correct and that you can reach the Hugging Face Hub, or point to locally downloaded weights.
  • Output Interpretation Problems: If the responses seem incorrect, double-check that your instruction is valid JSON and that the schema lists exactly the labels you want extracted.
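
If the model still does not fit after quantization, Transformers can offload part of it to CPU RAM via device_map. A minimal sketch, with illustrative (not measured) memory limits:

```python
import torch
from transformers import AutoModelForCausalLM

# Cap GPU 0 at 18 GiB and spill the remaining layers to CPU RAM
# (the limits below are illustrative; tune them to your hardware)
model = AutoModelForCausalLM.from_pretrained(
    "zjunlp/OneKE",
    device_map="auto",
    max_memory={0: "18GiB", "cpu": "48GiB"},
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
```

Note that offloaded layers run on the CPU, so expect noticeably slower generation.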

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
