How to Mine Source Code Repositories at Scale Using Tree-Hugger

Oct 16, 2023 | Data Science

If you want to mine source code and need a tool that is efficient and easy to use, Tree-Hugger is worth a look. It is a library for mining Git repositories, or individual code files in any language it supports. In this guide, we’ll walk through installing and using Tree-Hugger, the functions it offers, and how to troubleshoot common issues.

What is Tree-Hugger?

Tree-Hugger is a lightweight, high-level library with a Pythonic API for exploring and analyzing source code in several programming languages, including Python, PHP, Java, JavaScript, and C++. It is built on top of tree-sitter, which handles the parsing.

Installation

To get started, you need to install Tree-Hugger. Here’s how you can do it:

  • From pip: Run the command pip install -U tree-hugger PyYAML
  • From Source:
    • Clone the repository: git clone https://github.com/autosoft-dev/tree-hugger.git
    • Change directory: cd tree-hugger
    • Install: pip install -e .

The installation process has been validated on macOS Mojave. You may also need libgit2 installed: on macOS, run brew install libgit2; on Linux, install it through your distribution’s package manager.

Setup

After installing, you might need to prepare your environment:

  • If applicable, set the TS_LIB_PATH environment variable for the tree-sitter library path.
  • Tree-Hugger needs compiled tree-sitter language libraries (.so files). If they are not picked up by default, download or build ones that match your platform.
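
A quick way to confirm the setup is to check from Python whether TS_LIB_PATH points at something that exists on disk. The sketch below uses only the standard library. The helper name find_ts_lib is hypothetical and not part of Tree-Hugger, and the check only tests that the path exists, not that the library inside it is usable.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_ts_lib(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return the path stored in TS_LIB_PATH if it exists on disk, else None."""
    raw = env.get("TS_LIB_PATH", "").strip()
    if not raw:
        return None  # variable unset or empty
    path = Path(raw).expanduser()
    return path if path.exists() else None


if __name__ == "__main__":
    lib = find_ts_lib()
    if lib is None:
        print("TS_LIB_PATH is unset or does not point to an existing path")
    else:
        print(f"TS_LIB_PATH resolves to {lib}")
```

Passing the environment as an argument keeps the helper easy to exercise with a plain dictionary instead of the real process environment.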

Hello World Example

Now, let’s get our hands dirty! Here’s a simple analogy to understand how Tree-Hugger parses different languages: Think of each programming language as a book in a library. Tree-Hugger acts like a librarian, helping you extract specific information from those books depending on the section you need to refer to.

Here’s how you can parse files in various programming languages:

  • Python:
    from tree_hugger.core import PythonParser
    pp = PythonParser()
    pp.parse_file('tests/assets/file_with_different_functions.py')
    pp.get_all_function_names()
  • PHP:
    from tree_hugger.core import PHPParser
    phpp = PHPParser()
    phpp.parse_file('tests/assets/file_with_different_functions.php')
    phpp.get_all_function_names()
  • Java:
    from tree_hugger.core import JavaParser
    jp = JavaParser()
    jp.parse_file('tests/assets/file_with_different_methods.java')
    jp.get_all_class_names()
  • JavaScript:
    from tree_hugger.core import JavascriptParser
    jsp = JavascriptParser()
    jsp.parse_file('tests/assets/file_with_different_functions.js')
    jsp.get_all_function_names()
  • C++:
    from tree_hugger.core import CPPParser
    cp = CPPParser()
    cp.parse_file('tests/assets/file_with_different_functions.cpp')
    cp.get_all_function_names()
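
To move from single files to a whole checkout, the same two calls can be applied to every matching file in a directory tree. This is a minimal sketch rather than a Tree-Hugger feature: collect_function_names is a hypothetical helper, and it assumes only that the parser object offers parse_file and get_all_function_names as in the examples above.

```python
from pathlib import Path
from typing import Dict, List


def collect_function_names(root: str, parser, suffix: str = ".py") -> Dict[str, List[str]]:
    """Map each file under `root` ending in `suffix` to its function names.

    `parser` is any object with parse_file(path) and get_all_function_names(),
    for example the PythonParser instance from the examples above.
    """
    results: Dict[str, List[str]] = {}
    for path in sorted(Path(root).rglob(f"*{suffix}")):
        if not path.is_file():
            continue
        try:
            parser.parse_file(str(path))
            results[str(path)] = list(parser.get_all_function_names())
        except Exception as exc:
            # One unreadable or unparsable file should not stop the whole run.
            print(f"skipped {path}: {exc}")
    return results
```

With Tree-Hugger installed, the intended use would be along the lines of collect_function_names('tests/assets', PythonParser()). For other languages, pass the matching parser and file suffix.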

API Reference

Tree-Hugger provides various methods to extract useful information about code. Below are selected functions available for each language:

Language      Functions
Python        get_all_function_names, get_all_class_names, etc.
PHP           get_all_function_names, get_all_class_names, etc.
Java          get_all_class_method_names, get_all_class_names, etc.
JavaScript    get_all_function_names, get_all_class_names, etc.
C++           get_all_function_names, get_all_class_names, etc.
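
Because the table lists only a selection, it can help to ask a parser object which extraction methods it actually exposes. The sketch below is plain Python introspection, not a Tree-Hugger feature. The name list_extractors is hypothetical, and the get_all_ prefix is an assumption based on the method names shown in the table.

```python
from typing import List


def list_extractors(parser, prefix: str = "get_all_") -> List[str]:
    """Return the sorted names of callable attributes on `parser` starting with `prefix`."""
    return sorted(
        name
        for name in dir(parser)
        if name.startswith(prefix) and callable(getattr(parser, name, None))
    )
```

Calling list_extractors on any of the parser instances from the Hello World examples should then show what that language’s parser supports in the installed version.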

Extending Tree-Hugger

Extending Tree-Hugger to support additional languages or functionalities is straightforward:

Adding Languages

  • Create a new parser class inheriting from BaseParser.
  • Implement necessary methods like loading the .so file and preparing the queries.

Adding Queries

Queries are tree-sitter s-expressions stored in a YAML file, one named entry per query. Each @name in the expression labels a capture, which is how the query picks out specific pieces of the code. For example:

all_function_docstrings:
    "
    (
      function_definition
      name: (identifier) @function.def
      body: (block(expression_statement(string))) @function.docstring
    )
    "
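
For Python sources, the result this query aims for (each function’s name paired with its docstring) can be cross-checked without tree-sitter. The sketch below deliberately swaps in the standard library’s ast module. It is not how Tree-Hugger runs queries, and function_docstrings is a hypothetical helper name.

```python
import ast
from typing import Dict


def function_docstrings(source: str) -> Dict[str, str]:
    """Map function names to docstrings, skipping functions that have none."""
    found: Dict[str, str] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc is not None:
                found[node.name] = doc
    return found


sample = '''
def documented():
    """Say hello."""
    return "hello"

def bare():
    return 1
'''
print(function_docstrings(sample))  # {'documented': 'Say hello.'}
```

The tree-sitter pattern matches a string expression statement inside the function body, so the two approaches may disagree in edge cases. Treat this as a sanity check, not an equivalent implementation.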

Troubleshooting

If you encounter issues during installation or usage, consider the following:

  • Ensure all dependencies are correctly installed, especially libgit2.
  • Double-check that your environment variables are set properly.
  • If you downloaded .so files from the web, confirm they’re compatible with your system.
  • Review the Tree-Hugger GitHub page for issue tracking and community support.
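
The first item in the checklist can be partly automated. The following sketch, using only the standard library, reports whether Python can locate the packages installed earlier. check_modules is a hypothetical helper, and it checks only Python packages; it cannot tell whether the native libgit2 library is present.

```python
import importlib.util
from typing import Dict, Iterable


def check_modules(names: Iterable[str]) -> Dict[str, bool]:
    """Report, for each top-level module name, whether Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # tree_hugger and yaml are the import names of the tree-hugger and PyYAML packages.
    for name, ok in check_modules(["tree_hugger", "yaml"]).items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```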

Conclusion

Tree-Hugger simplifies mining source code repositories and provides extraction functions for several programming languages, with room to add your own languages and queries.
