Tired of sifting through endless 100+ page financial reports, struggling to extract meaningful insights? Let me guide you through the world of EDGAR-CRAWLER, an open-source toolkit that retrieves key information from financial reports with ease. In this guide, we’ll explore how to use this powerful tool to crawl any report found in the SEC EDGAR database, the online repository of filings from all publicly traded companies in the USA.
Why Use EDGAR-CRAWLER?
EDGAR-CRAWLER is not just another downloader. While it can retrieve EDGAR filings like other tools, its real power lies in its ability to preprocess lengthy, unstructured documents and convert them into clean, easy-to-use JSON files. That means you can focus on the data that matters most without getting lost in the minutiae.
Core Modules of EDGAR-CRAWLER
- Business Documents Crawling: Use the `edgar_crawler.py` module to crawl and download financial reports for publicly traded companies over specified years.
- Item Extraction: Use the `extract_items.py` module to extract and clean specific sections, such as Risk Factors, directly from 10-K documents (annual reports).
Who Can Benefit from EDGAR-CRAWLER?
EDGAR-CRAWLER caters to a broad audience:
- Academics: Enhance NLP research in economics and finance through efficient data access and analysis.
- Professionals: Strengthen decision-making and strategic planning with comprehensive and easy-to-interpret financial reports.
- Developers: Seamlessly integrate financial data into your models, applications, and experiments using this open-source toolkit.
How to Set Up and Use EDGAR-CRAWLER
Installation
Before getting started, it’s recommended to create a new virtual environment using Python 3.8. For an easier setup, consider installing and using Anaconda.
Once that’s done, you’ll want to install the necessary dependencies via:
pip install -r requirements.txt
Configuration
To tailor the behavior of the two modules, edit the `config.json` file before running either script. Here’s a breakdown of what you can configure:
- `edgar_crawler.py`:
  - `start_year`: First year of reports to download (default: 2021).
  - `end_year`: Last year of reports to download (default: 2021).
  - `quarters`: Quarters to download filings from (default: [1, 2, 3, 4]).
  - `filing_types`: Types of filings to download (default: [10-K, 10-K405, 10-KT]).
  - `cik_tickers`: List, or path to a file, with CIKs or tickers.
  - `user_agent`: User agent declared to SEC EDGAR.
  - `raw_filings_folder`: Folder for downloaded filings (default: RAW_FILINGS).
  - `filings_metadata_file`: CSV filename for report metadata.
  - `skip_present_indices`: Skip already-downloaded indices (default: True).
- `extract_items.py`:
  - `raw_filings_folder`: Folder containing the downloaded documents (default: RAW_FILINGS).
  - `extracted_filings_folder`: Folder for the extracted documents (default: EXTRACTED_FILINGS).
  - `items_to_extract`: List of item sections to extract.
  - `remove_tables`: Remove numerical tables, which are often not useful for NLP.
  - `skip_extracted_filings`: Skip already-extracted filings (default: True).
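Putting these settings together, a `config.json` for downloading 2021 annual reports might look roughly like the sketch below. Treat this as illustrative only: the exact key nesting can differ between versions of the toolkit, and the ticker list and user-agent string are placeholders you must replace with your own values (SEC EDGAR expects a genuine contact in the user agent).

```json
{
  "edgar_crawler": {
    "start_year": 2021,
    "end_year": 2021,
    "quarters": [1, 2, 3, 4],
    "filing_types": ["10-K", "10-K405", "10-KT"],
    "cik_tickers": ["AAPL", "MSFT"],
    "user_agent": "Your Name (your.email@example.com)",
    "raw_filings_folder": "RAW_FILINGS",
    "filings_metadata_file": "FILINGS_METADATA.csv",
    "skip_present_indices": true
  },
  "extract_items": {
    "raw_filings_folder": "RAW_FILINGS",
    "extracted_filings_folder": "EXTRACTED_FILINGS",
    "items_to_extract": ["1A", "7"],
    "remove_tables": true,
    "skip_extracted_filings": true
  }
}
```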
Running the Scripts
- To download financial reports from EDGAR, run `python edgar_crawler.py`.
- To clean and extract specific sections from already-downloaded 10-K documents, run `python extract_items.py`.
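Once `extract_items.py` has run, each filing becomes a JSON file you can load programmatically. Below is a minimal sketch of reading such files; the field names (`company`, `item_1A`) are assumptions based on the toolkit’s item-numbered output, so inspect a real file in your EXTRACTED_FILINGS folder before relying on them, and the sample filing written here is purely synthetic.

```python
import json
from pathlib import Path

# A miniature synthetic stand-in for one extracted 10-K. Field names such as
# "item_1A" are an assumption about EDGAR-CRAWLER's output schema; check a
# real extracted file and adjust accordingly.
sample_filing = {
    "cik": "0000000000",
    "company": "EXAMPLE CORP",
    "filing_type": "10-K",
    "item_1A": "Risk Factors. Our business is subject to numerous risks ...",
    "item_7": "Management's Discussion and Analysis ...",
}

# Write the sample into a demo folder standing in for EXTRACTED_FILINGS.
folder = Path("extracted_demo")
folder.mkdir(exist_ok=True)
(folder / "example_10K.json").write_text(json.dumps(sample_filing))

def load_risk_factors(folder: str) -> dict:
    """Map each company name to the text of its Risk Factors section."""
    sections = {}
    for f in Path(folder).glob("*.json"):
        filing = json.loads(f.read_text())
        sections[filing["company"]] = filing.get("item_1A", "")
    return sections

risks = load_risk_factors("extracted_demo")
print(risks["EXAMPLE CORP"][:40])
```

From here, the per-item texts drop straight into an NLP pipeline (tokenization, classification, embedding) without any HTML cleanup.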
Understanding the Code with an Analogy
Think of EDGAR-CRAWLER as a librarian in a gigantic library filled with financial reports. This librarian not only fetches books (financial reports) for you but also organizes them (turns unstructured data into JSON files) and highlights the most important chapters (extracts specific items like Risk Factors and Management Discussion) for your quick reading.
Troubleshooting
If you encounter issues during installation or execution, here are some troubleshooting ideas:
- Ensure your Python version is 3.8, as other versions may cause compatibility issues.
- If the toolkit is not downloading expected reports, double-check your `config.json` settings for correctness.
- For any further issues, open an issue on GitHub instead of emailing the maintainers directly. This way, all potential users can benefit from the troubleshooting.
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.