Transforming Your Datasets with Doccano Transformer

Apr 20, 2021 | Data Science

Doccano Transformer is a tool that converts datasets exported from Doccano into formats compatible with common machine learning libraries. Whether you’re working on Named Entity Recognition or simply need to change a dataset’s format, this guide walks through installation, basic usage, and troubleshooting.

What is Doccano Transformer?

Doccano Transformer is a Python package that converts datasets exported from Doccano into other formats, such as CoNLL 2003 and spaCy. It’s like a universal adapter for your annotated datasets, letting you move between formats without rewriting the data by hand.

Supported Formats

Doccano Transformer supports the following formats:

  • CoNLL 2003
  • spaCy

Installing Doccano Transformer

Getting started with Doccano Transformer is incredibly straightforward. You can install it using pip by running the following command:

pip install doccano-transformer
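
If you want to confirm the install worked before moving on, a quick check like the one below can help. This is only a sketch: the helper name installed_version is made up for illustration, and it relies on the standard library's importlib.metadata (Python 3.8 or newer) rather than anything in Doccano Transformer itself.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing.

    Hypothetical helper for illustration; not part of Doccano Transformer.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("doccano-transformer")
    print(version or "doccano-transformer is not installed")
```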

Examples of Use

Let’s dive into an example to understand how to use Doccano Transformer for Named Entity Recognition. Think of it like a chef preparing ingredients for a dish. You’ll need to gather your dataset, then transform it to suit your recipe (model). Here’s how you can do that:

from doccano_transformer.datasets import NERDataset
from doccano_transformer.utils import read_jsonl

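# Load the Doccano JSONL export as an NER dataset
dataset = read_jsonl(filepath='example.jsonl', dataset=NERDataset, encoding='utf-8')

# Transform to CoNLL 2003 format; str.split tokenizes the text on whitespace
conll_data = dataset.to_conll2003(tokenizer=str.split)

# Transform to spaCy format using the same whitespace tokenizer
spacy_data = dataset.to_spacy(tokenizer=str.split)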

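In the code above, you first import the dataset class and the reader function, then load your exported JSONL file. It’s similar to gathering your ingredients in cooking. Once the dataset is loaded, you convert it into the desired format by calling the matching method. The tokenizer argument decides how the text is split into tokens: str.split breaks it on whitespace, and you can swap in another callable that turns a string into a list of tokens.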
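
To get a feel for what a conversion to a CoNLL-style layout involves, here is a small standalone sketch that turns character-offset entity spans into one BIO tag per whitespace token. It is not Doccano Transformer's implementation: the function name spans_to_bio, the (start, end, label) span layout, and the sample sentence are all assumptions chosen for illustration.

```python
import re
from typing import List, Tuple

# Assumed span layout for this sketch: (start offset, end offset, label)
Span = Tuple[int, int, str]


def spans_to_bio(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Pair each whitespace-separated token with a BIO tag."""
    tagged = []
    previous_span = None
    for match in re.finditer(r"\S+", text):
        # Find the first span that overlaps this token, if any
        current_span = None
        for span in spans:
            start, end, _ = span
            if match.start() < end and match.end() > start:
                current_span = span
                break
        if current_span is None:
            tag = "O"  # token is outside every entity
        elif current_span == previous_span:
            tag = "I-" + current_span[2]  # continues the same entity
        else:
            tag = "B-" + current_span[2]  # first token of an entity
        tagged.append((match.group(), tag))
        previous_span = current_span
    return tagged


if __name__ == "__main__":
    sentence = "Barack Obama visited Paris"
    entities = [(0, 12, "PER"), (21, 26, "LOC")]
    for token, tag in spans_to_bio(sentence, entities):
        print(token, tag)
```

The sketch tokenizes on whitespace, like str.split does, but keeps character offsets so each token can be compared against the annotated spans.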

Troubleshooting Ideas

If you encounter issues while using Doccano Transformer, here are some common troubleshooting ideas:

  • Ensure that your input JSONL file is properly formatted.
  • Check for spelling mistakes in method names such as to_conll2003 and to_spacy, and in dataset class names such as NERDataset.
  • Make sure your installed Python version is compatible with Doccano Transformer.
  • If you experience performance issues, consider optimizing your dataset or breaking it into smaller chunks.
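
For the first item on that list, a short script can point to the exact lines that break a JSONL file. The sketch below uses only the standard json module. The function name find_jsonl_problems is hypothetical, and the assumption that every record needs a "text" key may not match your export, so adjust required_keys to the fields your file actually contains.

```python
import json
from typing import Iterable, List, Tuple


def find_jsonl_problems(
    lines: Iterable[str],
    required_keys: Tuple[str, ...] = ("text",),  # assumed field name; adjust as needed
) -> List[Tuple[int, str]]:
    """Return (line number, description) for every line that looks wrong."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append((line_number, f"invalid JSON: {error.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((line_number, "expected a JSON object"))
            continue
        missing = [key for key in required_keys if key not in record]
        if missing:
            problems.append((line_number, "missing keys: " + ", ".join(missing)))
    return problems


if __name__ == "__main__":
    sample = ['{"text": "Hello world", "labels": []}', '{"text": ']
    for number, description in find_jsonl_problems(sample):
        print(f"line {number}: {description}")
```

To check a real file, open it with open('example.jsonl', encoding='utf-8') and pass the file object as lines. Because the function reads one line at a time, it also copes with large exports.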


Contributing to Doccano Transformer

Your contributions are always welcome! If you have ideas or improvements for Doccano Transformer, see the project’s contributing guide for how to proceed.

License

Doccano Transformer is licensed under the MIT license.
