How to Create a Massive Monolingual Dataset with Lazynlp

Jun 16, 2021 | Data Science

Creating a massive dataset to fuel your machine learning projects doesn’t have to be a Herculean task. With the Lazynlp Library, you can crawl, clean up, and deduplicate webpages effortlessly. This article will guide you step by step through the process of setting up Lazynlp and utilizing it to create datasets larger than OpenAI’s original dataset for GPT-2. Ready to dive in? Let’s start!

Setup

Before getting started, ensure you have Python 3 installed. Follow these simple steps:

  1. Clone the Lazynlp repository and navigate into the folder:
       git clone https://github.com/chiphuyen/lazynlp.git
       cd lazynlp
  2. Install required dependencies:
       pip3 install -r requirements.txt
  3. Install the library:
       pip3 install .
  4. To uninstall the library, simply run:
       pip3 uninstall lazynlp
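
After installing, a quick check from Python can confirm that the package is visible to your interpreter. The snippet below is a minimal sketch that uses only the standard library; the helper name is_installed is made up for this example and is not part of Lazynlp.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the import system can locate the top-level package."""
    # find_spec returns None when no such top-level package exists.
    return importlib.util.find_spec(package_name) is not None


# Prints True once "pip3 install ." has succeeded in this environment.
print("lazynlp installed:", is_installed("lazynlp"))
```

If this prints False, the install most likely went to a different Python environment than the one you are running.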

How to Create a Massive Dataset Using Lazynlp

Step 1: Obtain URLs of the Webpages You Want to Crawl

URL collection is key! Here are some vast resources:

  • Reddit URLs: Access the Reddit submissions dump. Expect large files (100MB – 1GB). A neat trick: you can also download a deduplicated list of links provided by @jcpeterson (https://github.com/jcpeterson).
  • Gutenberg: Download URLs of US and Australian books from here or utilize Lazynlp to fetch them.
  • Wikipedia: Download dumps from Wikipedia dumps.
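
If you start from a raw Reddit submissions dump, you first need to pull the links out of it. The sketch below assumes the dump is newline-delimited JSON in which each record has a "url" field and an integer "score" field; check this against the file you actually download. The function name extract_urls and the min_score filter are illustrative and not part of Lazynlp.

```python
import json
from typing import Iterable, Iterator


def extract_urls(lines: Iterable[str], min_score: int = 0) -> Iterator[str]:
    """Yield the "url" of each JSON record whose "score" is at least min_score.

    Assumes one JSON object per line with "url" and "score" keys
    (an assumption about the dump format, not something Lazynlp defines).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the whole run
        if not isinstance(record, dict):
            continue
        url = record.get("url")
        score = record.get("score", 0)
        if (
            isinstance(url, str)
            and url.startswith(("http://", "https://"))
            and isinstance(score, int)
            and score >= min_score
        ):
            yield url


sample = [
    '{"url": "https://example.com/a", "score": 5}',
    '{"url": "https://example.com/b", "score": 1}',
]
for url in extract_urls(sample, min_score=3):
    print(url)
```

Because the function takes any iterable of lines, you can pass it an open file handle and stream a large dump without loading it into memory.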

Step 2: Deduplicate URLs

To avoid downloading duplicates, use these handy functions:

lazynlp.dedup_lines(files, outfold)
lazynlp.dedup_lines_from_new_file(original_files, new_file, outfile)

These functions help you ensure that you have a clean set of URLs to work with.
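
To make the idea concrete, here is a minimal standard-library sketch of line-level URL deduplication. It illustrates the technique only and is not Lazynlp's implementation; the names dedup_urls and dedup_against are made up for this example.

```python
from typing import Iterable, List


def dedup_urls(urls: Iterable[str]) -> List[str]:
    """Return the URLs with exact duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


def dedup_against(original: Iterable[str], new: Iterable[str]) -> List[str]:
    """Return URLs from `new` that are unique and do not appear in `original`."""
    known = {url.strip() for url in original}
    return [url for url in dedup_urls(new) if url not in known]


already_crawled = ["https://example.com/a", "https://example.com/b"]
fresh = ["https://example.com/b", "https://example.com/c", "https://example.com/c"]
print(dedup_against(already_crawled, fresh))
```

Note that this only catches exact string matches. Two URLs that differ by a trailing slash or a tracking parameter still count as different lines.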

Step 3: Download the URLs

You can download pages individually or in bulk:

lazynlp.download_page(link, context=None, timeout=None)
lazynlp.download_pages(link_file, folder, timeout=30, default_skip=True, extensions=[], domains=[])

You can use the default_skip, extensions, and domains parameters to skip unwanted URLs, and timeout to limit how long each request may take. Each downloaded webpage will be indexed and saved in the folder you specify, ensuring systematic storage of your content.
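
The sketch below shows one way extension and domain filtering can work before a page is fetched. It assumes that extensions and domains are lists of things to skip; the helper should_skip is hypothetical and is not the logic Lazynlp uses internally.

```python
from typing import Iterable
from urllib.parse import urlparse


def should_skip(url: str, extensions: Iterable[str] = (), domains: Iterable[str] = ()) -> bool:
    """Return True if the URL ends in a skipped extension or lives on a skipped domain."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    path = parsed.path.lower()

    # Extension check looks at the path only, so query strings do not interfere.
    for ext in extensions:
        if path.endswith("." + ext.lower().lstrip(".")):
            return True

    # Domain check matches the domain itself and any of its subdomains.
    for domain in domains:
        domain = domain.lower()
        if host == domain or host.endswith("." + domain):
            return True

    return False


links = [
    "https://example.com/report.pdf",
    "https://www.youtube.com/watch?v=1",
    "https://example.com/article.html",
]
to_download = [u for u in links if not should_skip(u, ["pdf"], ["youtube.com"])]
print(to_download)
```

Filtering up front like this saves bandwidth, since binary files and video sites rarely contribute usable text.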

Step 4: Clean the Webpages

To purify your data, clean it using:

lazynlp.clean_page(page)

This function strips HTML tags, normalizes whitespace, and removes other unwanted characters from the page text.
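
For a sense of what this kind of cleaning involves, here is a rough standard-library sketch. It is not Lazynlp's clean_page; the function simple_clean and the class _TextExtractor are illustrative names, and real pages usually need more rules than this.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything inside <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; into & for us
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Joining with a space keeps words in adjacent tags from running together.
        return " ".join(self._chunks)


def simple_clean(page: str) -> str:
    """Strip tags, drop non-printable characters, and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(page)
    parser.close()
    text = parser.text()
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    return re.sub(r"\s+", " ", text).strip()


print(simple_clean("<h1>Title</h1><p>Hello &amp; welcome</p>"))
```

One known limitation: because text chunks are joined with spaces, a word split by inline tags (for example a bold first letter) will come out with a space inside it.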

Step 5: Remove Duplicated Webpages

Finally, fine-tune your dataset by eliminating duplicates:

lazynlp.estimate_overlap(source_files, target_files, gran='word', n=8)
lazynlp.filter_files(files, threshold=0.5, gran='word', n=8)

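The first function estimates how much the target files overlap with the source files, measured in n-grams: gran sets the token granularity and n sets the n-gram size. The second filters a list of files by comparing their overlap against threshold. Together they help you remove near-duplicate pages and keep your dataset diverse.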
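
The following sketch illustrates n-gram overlap filtering on in-memory strings. It uses a plain Python set for clarity, which would not scale to a massive corpus (a probabilistic structure such as a Bloom filter is the usual choice there). The names word_ngrams, overlap_ratio, and filter_texts are made up for this example and are not Lazynlp functions.

```python
from typing import List, Set, Tuple

NGram = Tuple[str, ...]


def word_ngrams(text: str, n: int = 8) -> Set[NGram]:
    """Return the set of word-level n-grams in the text."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(target: str, seen: Set[NGram], n: int = 8) -> float:
    """Fraction of the target's n-grams that already appear in `seen`."""
    grams = word_ngrams(target, n)
    if not grams:
        return 0.0  # text shorter than n words has no n-grams to compare
    return len(grams & seen) / len(grams)


def filter_texts(texts: List[str], threshold: float = 0.5, n: int = 8) -> List[int]:
    """Return indices of texts whose overlap with earlier kept texts is <= threshold."""
    seen: Set[NGram] = set()
    kept = []
    for idx, text in enumerate(texts):
        if overlap_ratio(text, seen, n) <= threshold:
            kept.append(idx)
            seen |= word_ngrams(text, n)
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
    "a completely different sentence about something else entirely",
]
print(filter_texts(docs, threshold=0.5, n=3))
```

A smaller n catches more partial matches but also flags more common phrases, so the n-gram size and threshold are worth tuning on a sample of your own data.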

Troubleshooting

If you encounter any hiccups along the way, consider these troubleshooting tips:

  • Install Errors: Make sure you are running Python 3 and that pip3 points to the same interpreter.
  • URL Not Found: Double-check the URLs you are trying to crawl for accuracy.
  • Timeout Issues: Increase the timeout parameter in the download function.
  • Encoding Problems: Check whether the source content contains text that is not UTF-8 encoded.
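
For encoding problems in particular, one common workaround is to decode raw bytes with a short list of fallback encodings before cleaning. This is a generic sketch, not something Lazynlp provides; the function decode_bytes and the choice of cp1252 as a fallback are assumptions for illustration.

```python
from typing import Sequence


def decode_bytes(raw: bytes, fallbacks: Sequence[str] = ("utf-8", "cp1252")) -> str:
    """Try each encoding in order; as a last resort, replace undecodable bytes."""
    for encoding in fallbacks:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Undecodable bytes become U+FFFD so the rest of the text is preserved.
    return raw.decode("utf-8", errors="replace")


print(decode_bytes("café".encode("cp1252")))
```

Replacing bad bytes loses a little information, but it keeps one malformed page from stopping an entire crawl.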


Conclusion

Creating a large dataset using Lazynlp is almost like baking a massive cake. You gather all your ingredients (URLs), mix them (download and clean), and ensure there are no duplicates (deduplication) before serving it up to your machine learning algorithms! With this library, you can create datasets larger than the one OpenAI originally used for GPT-2.

