How to Use AutoCrawler: Your Multiprocess Image Downloader

Jun 3, 2022 | Data Science

Welcome to the ultimate guide on using AutoCrawler, a high-speed, customizable multiprocess image crawler that can download images from Google and Naver with ease. Whether you’re looking to gather images for a project or simply want to explore, this tool will meet your needs!

Getting Started

To start using AutoCrawler, you’ll need to follow these simple steps:

  1. Install Chrome to run the necessary scripts.
  2. Open your terminal and execute the command:
    • pip install -r requirements.txt
  3. Write your search keywords in a file named keywords.txt.
  4. Run the Python script by executing:
    • python3 main.py
  5. Your files will be downloaded to the designated download directory.
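The steps above boil down to a short shell session. The keywords `cat` and `dog` below are placeholders — replace them with your own search terms:

```shell
# 1) Install dependencies (run inside the AutoCrawler checkout):
#    pip install -r requirements.txt

# 2) Write search keywords, one per line ('cat' and 'dog' are placeholders):
printf 'cat\ndog\n' > keywords.txt

# 3) Launch the crawler; images are saved under the download directory:
#    python3 main.py
```

Each line of keywords.txt becomes one crawl keyword, and each keyword gets its own subdirectory of downloaded images.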

Understanding the Arguments

The main.py script allows for several arguments to customize your crawling experience:

  • --skip true: Skips downloading a keyword if images already exist in the directory.
  • --threads 4: Defines the number of threads for downloading.
  • --google true: Enables downloads from google.com.
  • --naver true: Enables downloads from naver.com.
  • --full false: Downloads thumbnails by default; set to true for full-resolution images (be aware, this is slower).
  • --face false: Set to true to activate face search mode.
  • --no_gui auto: Runs the crawler without a GUI. This is especially useful for headless environments.
  • --limit 0: Maximum number of images to download per site; 0 means no limit.
  • --proxy-list: A comma-separated list of proxies. Each thread will randomly choose a proxy from this list to ensure anonymity.
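Putting several of these flags together, a typical invocation might look like this (the flag values here are purely illustrative):

```shell
python3 main.py --skip true --threads 8 --google true --naver false --full true --limit 100
```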

Full Resolution Mode

To download images in full resolution (JPG, GIF, or PNG), specify --full true in your command when running the script.

Data Imbalance Detection

The AutoCrawler doesn’t just download images; it also checks that your data is evenly spread. After crawling completes, it identifies directories holding fewer than 50% of the average file count. It’s good practice to delete those under-populated directories and re-run the crawl for the affected keywords.
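That imbalance check can be sketched in a few lines of shell. The `download/` layout and the sample keyword directories below are assumptions for illustration only:

```shell
# Build a toy download/ tree: 10 'cat' images, 9 'dog', but only 1 'bird'
mkdir -p download/cat download/dog download/bird
for i in $(seq 1 10); do touch "download/cat/$i.jpg"; done
for i in $(seq 1 9);  do touch "download/dog/$i.jpg"; done
touch download/bird/1.jpg

# Compute the average file count across keyword directories
total=0; dirs=0
for d in download/*/; do
  n=$(ls "$d" | wc -l)
  total=$((total + n)); dirs=$((dirs + 1))
done
avg=$((total / dirs))

# Flag directories holding fewer than 50% of the average
for d in download/*/; do
  n=$(ls "$d" | wc -l)
  if [ $((n * 2)) -lt "$avg" ]; then
    echo "imbalanced: $d ($n files, average $avg)"
  fi
done
```

Any directory this flags is a candidate for deletion and a fresh crawl of that keyword.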

Remote Crawling

If you’re looking to run your crawler remotely, follow these steps:

  1. Install a virtual display:
    • sudo apt-get install xvfb
  2. Install Screen:
    • sudo apt-get install screen
  3. Start a Screen session, then run the crawler under the virtual display:
    • screen -S s1
    • Xvfb :99 -ac & DISPLAY=:99 python3 main.py
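As an alternative to wiring up Xvfb by hand, the xvfb-run helper that ships with the xvfb package allocates a display for you; combined with a detached Screen session, the whole setup becomes one line (the session name s1 is arbitrary):

```shell
screen -dmS s1 xvfb-run -a python3 main.py
```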

Customizing Your Crawler

Do you have ideas for a unique crawler that might better suit your needs? You can customize AutoCrawler by tweaking the collect_links.py file. Dive deep into the code and adjust it to enhance your crawling experience!

Troubleshooting Your Crawler

Installation and usage can sometimes hit a snag, especially because Google’s layout changes frequently. Here are some steps to troubleshoot:

  1. Visit Google Images in your Chrome browser.
  2. Open Developer Tools by pressing CTRL+SHIFT+I (or CMD+OPTION+I on Mac).
  3. Use the element picker to select an image and inspect its markup.
  4. Take note of the image selection logic and modify collect_links.py accordingly.
  5. Refer to the W3Schools XPath documentation for syntax help.
  6. Use CTRL+F in Developer Tools to test your XPath queries against the page.

If issues persist or you’re in need of real-time support, feel free to reach out! For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
