The NSFW Data Scraper is a tool for collecting tens of thousands of images across five categories: porn, hentai, sexy, neutral, and drawings. The images can then be used to train an image classifier. Given the nature of the dataset, approach it with caution. In this article, we’ll walk you through how to set up and run the scraper.
Prerequisites
- Ensure Docker is installed on your machine.
Setup and Running the Data Scraper
The following steps will guide you through the process of collecting the dataset:
Step 1: Build the Docker Image
To start, you need to build the Docker image for the scraper:
docker build . -t docker_nsfw_data_scraper
This command sends the build context to the Docker daemon and installs all necessary packages.
Step 2: Run the Scraper
After building the Docker image, run the following command:
docker run -v $(pwd):/root/nsfw_data_scraper docker_nsfw_data_scraper scripts/runall.sh
This command will execute the scraping scripts and fetch the images for the specified classes. Depending on your system’s performance, this step might take several hours, so it’s best to leave it running overnight.
Understanding the Scripts
Imagine the scripts as a set of specialized workers in a large library, each tasked with a particular job in order to gather the necessary materials:
- 1_get_urls_.sh: This worker sifts through various subreddits, collecting URLs of images based on categories. Think of it as finding the right books in a library.
- 2_download_from_urls_.sh: Once the URLs have been identified, this worker fetches the books (images) from the identified places.
- 3_optional_download_drawings_.sh: An optional worker who retrieves safe-for-work anime images from a specific well-known database, akin to navigating a separate section of a library for comics.
- 4_optional_download_neutral_.sh: Similar to the previous worker, but this one collects neutral images from another dataset, so the collection also contains ordinary, non-explicit examples.
- 5_create_train_.sh: After gathering all resources, this worker organizes the images into the training section, discarding any corrupted image files along the way.
- 6_create_test_.sh: Finally, this worker takes a random selection of images to create a test section for evaluation.
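The last two steps (discarding corrupted files, then setting aside a random test sample) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the scraper's actual shell scripts: the helper names `looks_like_jpeg` and `split_train_test`, the magic-byte check, and the fixed random seed are all invented for this example.

```python
import random


def looks_like_jpeg(data: bytes) -> bool:
    """Cheap sanity check: a JPEG starts with FF D8 and ends with FF D9.

    This is only a heuristic assumed for the sketch; it catches truncated
    downloads and HTML error pages, not every form of corruption.
    """
    return len(data) >= 4 and data[:2] == b"\xff\xd8" and data[-2:] == b"\xff\xd9"


def split_train_test(filenames, test_size, seed=0):
    """Return (train, test) lists; test is a random sample of test_size names."""
    ordered = sorted(filenames)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    test = set(rng.sample(ordered, test_size))
    train = [name for name in ordered if name not in test]
    return train, sorted(test)


if __name__ == "__main__":
    # Hypothetical in-memory "files" (name -> raw bytes) standing in for downloads.
    files = {
        "a.jpg": b"\xff\xd8payload\xff\xd9",
        "b.jpg": b"\xff\xd8payload\xff\xd9",
        "c.jpg": b"truncated download",
        "d.jpg": b"\xff\xd8payload\xff\xd9",
    }
    valid = [name for name, data in files.items() if looks_like_jpeg(data)]
    train, test = split_train_test(valid, test_size=1)
    print("train:", train)
    print("test:", test)
```

Sampling the test set only after corrupted files are removed keeps unreadable images out of both sections.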
How to Train a CNN Model
Once you have collected the images, it’s time to train your Convolutional Neural Network (CNN) model:
- Install fastai:
conda install -c pytorch -c fastai fastai
- Run train_model.ipynb from top to bottom to start your training process.
Results
I trained a CNN classifier to an accuracy of 91%. The confusion matrix showed confusion between the ‘drawings’ and ‘hentai’ categories, and likewise between the ‘porn’ and ‘sexy’ categories.
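To make the idea of a confusion matrix concrete, here is a small standard-library sketch that counts (true, predicted) label pairs and lists the most frequent mistakes. The function names and the toy labels are assumptions made for illustration; they are not the notebook's code and not the results reported here.

```python
from collections import Counter

CLASSES = ["drawings", "hentai", "neutral", "porn", "sexy"]


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def most_confused(counts):
    """Off-diagonal (true, predicted) pairs, most frequent first."""
    errors = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
    return sorted(errors.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Toy labels invented for this example only.
    y_true = ["drawings", "hentai", "hentai", "porn", "sexy", "neutral"]
    y_pred = ["hentai", "hentai", "drawings", "sexy", "sexy", "neutral"]
    counts = confusion_counts(y_true, y_pred)
    # Rows are true classes, columns are predicted classes.
    for true_label in CLASSES:
        row = [counts[(true_label, pred_label)] for pred_label in CLASSES]
        print(f"{true_label:>9}", row)
    print("accuracy:", accuracy(y_true, y_pred))
    print("most confused:", most_confused(counts))
```

Large off-diagonal counts point to the class pairs the model has trouble telling apart.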
Troubleshooting
If you encounter issues during any step of the process, consider the following troubleshooting tips:
- Check your Docker installation; ensure it is correctly set up.
- Ensure that you have a stable internet connection since the scripts retrieve data from various online sources.
- If scripts fail, verify the permissions of your directories and ensure you’re running commands from the correct location.
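Some of the troubleshooting checks above can be automated before starting a long run. The sketch below is a hypothetical helper (the name `preflight` and its messages are assumptions, not part of the scraper) that looks for the docker executable, a Dockerfile in the current directory (which `docker build .` expects by default), and write access to the directory that gets mounted into the container.

```python
import os
import shutil


def preflight(project_dir="."):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    # Is the docker CLI installed and reachable?
    if shutil.which("docker") is None:
        problems.append("docker executable not found on PATH")
    # `docker build .` looks for a Dockerfile in the build context by default.
    if not os.path.isfile(os.path.join(project_dir, "Dockerfile")):
        problems.append("no Dockerfile found; run the commands from the repository root")
    # The directory is mounted as a volume, so the scripts need to write to it.
    if not os.access(project_dir, os.W_OK):
        problems.append("directory is not writable; check its permissions")
    return problems


if __name__ == "__main__":
    issues = preflight()
    if issues:
        for issue in issues:
            print("problem:", issue)
    else:
        print("all checks passed")
```

This only checks the local setup; it does not verify that the Docker daemon is running or that the network connection is stable.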
Conclusion
This NSFW Data Scraper opens up new avenues for exploring image classification models, particularly in the realm of adult content and safe-for-work images. With the right setup and execution, you can gather a significant dataset that can enhance your AI projects.
