How to Use Octopii for Detecting Personally Identifiable Information (PII)

Jul 26, 2023 | Data Science

Octopii is a tool developed by RedHunt Labs that automatically discovers and extracts leaked Personally Identifiable Information (PII) from images, PDFs, and documents. In this guide, we’ll walk you through installing the tool, running it, and understanding how it works.

Why Octopii?

PII leaks often fly under the radar in the cybersecurity landscape. Octopii aims to shed light on them by automating their discovery, and it shows how easily sensitive information can be exposed when systems are not correctly configured.

Getting Started: Installation

To start using Octopii, follow these steps for installation:

  • Install the Python dependencies by executing: pip install -r requirements.txt
  • For Linux users, install the Tesseract OCR tool:
    • For Ubuntu: sudo apt install tesseract-ocr -y
    • For Arch Linux: sudo pacman -Syu tesseract
  • Download the spaCy language model by running: python -m spacy download en_core_web_sm

Once these dependencies are installed, you’re all set to start scanning!
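
If you want to confirm the setup before your first scan, a short check along these lines can help. This is a minimal sketch and not part of Octopii: the function name check_dependencies is made up for illustration, and it assumes the Tesseract binary is on your PATH as tesseract and that the spaCy model is importable as en_core_web_sm.

```python
import importlib.util
import shutil


def check_dependencies(modules=("spacy", "en_core_web_sm"), binaries=("tesseract",)):
    """Map each dependency name to True if it can be found, else False.

    Hypothetical helper: the default names are assumptions based on the
    install steps in this guide, not something Octopii itself exposes.
    """
    status = {}
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        status[name] = importlib.util.find_spec(name) is not None
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH.
        status[name] = shutil.which(name) is not None
    return status


for name, found in check_dependencies().items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```

Anything reported as MISSING points back to the matching installation step above.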

Running Octopii

To run Octopii, type the following command:

python3 octopii.py location_to_scan

Here, location_to_scan can be a file or directory. Octopii supports scanning from the local filesystem, S3 URLs, and Apache open directories. You can also provide individual image URLs or files as arguments.
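
If you need to scan several locations in one go, you could wrap the command above in a small script. The sketch below is only one possible approach and not an Octopii feature: build_command and scan are hypothetical names, repo_dir is assumed to be the folder that contains octopii.py, and sys.executable stands in for python3 so the scan uses the same interpreter as the script.

```python
import subprocess
import sys


def build_command(location, script="octopii.py"):
    """Return the argument list for one scan of `location`."""
    return [sys.executable, script, location]


def scan(location, repo_dir="."):
    """Run one scan from inside `repo_dir` and return the process exit code."""
    completed = subprocess.run(build_command(location), cwd=repo_dir, check=False)
    return completed.returncode


if __name__ == "__main__":
    # Every command-line argument is treated as a separate location to scan.
    for target in sys.argv[1:]:
        print(target, "-> exit code", scan(target))
```

Passing the arguments as a list rather than a single shell string means that paths with spaces need no extra quoting.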

Example Usage

To test Octopii, you can use the provided sample PII folder:

python3 octopii.py dummy-pii

The output reflects the types of PII found along with their details:

dummy-drivers-license-nebraska-us.jpg
file_path: dummy-pii/dummy-drivers-license-nebraska-us.jpg,
pii_class: Nebraska Drivers License,
country_of_origin: United States,
faces: 1,
identifiers: [],
emails: [],
phone_numbers: [4000002170],
addresses: [Nebraska]

dummy-PAN-India.jpg
file_path: dummy-pii/dummy-PAN-India.jpg,
pii_class: Permanent Account Number,
country_of_origin: India,
faces: 0,
identifiers: [],
emails: [],
phone_numbers: [],
addresses: [INDIA]

An output.txt file will be generated, logging the findings in real time.

Understanding the Mechanism: An Analogy

Think of Octopii as a highly skilled detective working in a library filled with books (your files). The detective uses a combination of tools:

  • Magnifying Glass (Optical Character Recognition): Reads the text printed on scanned pages and images.
  • Dictionary (Natural Language Processing): Helps interpret the meaning of the words that were read.
  • Fingerprint Scanner (Face Detection): Counts the faces that appear on a document, which is reported in the faces field of the output.

The detective cleans up the pages to ensure clarity, finds defined keywords that represent sensitive information, and organizes everything in a neat report. That’s Octopii, sweeping through digital clutter to ensure your safety!
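
To make the clean-up, keyword matching, and reporting steps more concrete, here is a toy sketch that works on text that has already been extracted. It is not Octopii’s actual code: the DEFINITIONS table, its keywords, the email pattern, and the function names are all invented for illustration, and the OCR and face detection steps are left out entirely.

```python
import re

# Hypothetical keyword definitions, invented for this example only.
DEFINITIONS = {
    "Nebraska Drivers License": ["nebraska", "driver", "license"],
    "Permanent Account Number": ["income tax department", "permanent account number"],
}

# A deliberately simple email pattern; real-world matching is more involved.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def clean(text):
    """Collapse runs of whitespace and lowercase the text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def classify(text, definitions=DEFINITIONS):
    """Return the class with the most matching keywords, or None if none match."""
    cleaned = clean(text)
    scores = {
        label: sum(keyword in cleaned for keyword in keywords)
        for label, keywords in definitions.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def report(file_path, text):
    """Organize the findings for one file into a small dictionary."""
    return {
        "file_path": file_path,
        "pii_class": classify(text),
        "emails": EMAIL_RE.findall(text),
    }
```

In this sketch the class whose keywords match most often wins, which is one simple way to turn a list of defined keywords into a single label.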

Troubleshooting Common Issues

If you encounter issues while running Octopii, consider the following troubleshooting steps:

  • Ensure all dependencies are correctly installed and compatible with your Python version.
  • Check that you have appropriate permissions to access files or URLs you are attempting to scan.
  • If Octopii is not recognizing certain file types, confirm that the files are in a supported format as mentioned in the README.
  • Review fxis.ai for any updates or patches applicable to the tool.
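
For the permissions check on local files, a quick pre-flight script can list anything the current user cannot read. This is a rough sketch under stated assumptions: unreadable_paths is a hypothetical helper, it only covers local paths (not S3 or other URLs), and a missing location is reported as unreadable.

```python
import os
from pathlib import Path


def unreadable_paths(location):
    """Return the files under `location` that the current user cannot read."""
    root = Path(location)
    if not root.exists():
        # A missing location is reported as a single unreadable entry.
        return [root]
    if root.is_file():
        candidates = [root]
    else:
        # Walk the directory tree and keep regular files only.
        candidates = [p for p in root.rglob("*") if p.is_file()]
    return [p for p in candidates if not os.access(p, os.R_OK)]
```

An empty list means every file found under the location is readable, so a failed scan is more likely caused by one of the other items above.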

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

With Octopii, RedHunt Labs is paving the way to a more secure digital environment. By following the steps outlined in this article, you can now scan your files for PII and manage the related risks.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
