GitHub Language Statistics

Feb 10, 2022 | Data Science

homemayankDocumentsarticle-generation-using-llmresized_images_gitstatisticsreadme_madnight_githut

Explore the dynamic world of programming languages used across GitHub repositories!

Getting Started

Have you ever wondered which programming languages dominate GitHub? In this blog, we will explore how to generate and analyze statistics about programming languages, licenses, and pull requests from GitHub’s dataset using SQL queries in Google BigQuery. This information can be instrumental for developers, researchers, and businesses alike.

Data Generation

To kick-start our analysis, we’ll dive into a couple of SQL queries that allow us to generate language statistics and license counts. Let’s break them down.

Languages

To fetch the top programming languages used on GitHub, we can use the following SQL query:

SELECT language.name, COUNT(language.name) AS count 
FROM [bigquery-public-data:github_repos.languages] 
GROUP BY language.name 
ORDER BY count DESC

In this analogy, think of a library filled with diverse genres of books. Each book represents a programming language. By counting how many books there are of each genre, we get a clearer picture of what genres (or languages) are the most popular.

Licenses

Similarly, to get a list of licenses used across GitHub:

SELECT license, COUNT(license) AS count 
FROM [bigquery-public-data:github_repos.licenses] 
GROUP BY license 
ORDER BY count DESC

Imagine a park where people have different permissions to use the space, akin to licenses for software. Some areas might allow picnics (like MIT License), while others might have strict rules (like GPL). Counting how often each permission type is granted helps us understand the park’s usage better.

Pull Requests

Lastly, to analyze the number of pull requests per day, month, or year, you can use this SQL statement:

SELECT language AS name, year, quarter, count 
FROM ( 
    SELECT * 
    FROM ( 
        SELECT lang AS language, y AS year, q AS quarter, type, COUNT(*) AS count 
        FROM ( 
            SELECT a.type, b.lang, a.y, a.q 
            FROM ( 
                SELECT type, actor.login, YEAR(created_at) AS y, QUARTER(created_at) AS q, 
                STRING(REGEXP_REPLACE(repo.url, r'https://github.com', 'https://api.github.com/repos', )) AS name 
                FROM [githubarchive:month.201901] 
                WHERE NOT LOWER(actor.login) LIKE '%bot%' 
            ) a 
            JOIN ( 
                SELECT repo_name AS name, lang 
                FROM ( 
                    SELECT *, ROW_NUMBER() OVER (PARTITION BY repo_name ORDER BY lang) AS num 
                    FROM ( 
                        SELECT repo_name, FIRST_VALUE(language.name) OVER (PARTITION BY repo_name ORDER BY language.bytes DESC) AS lang 
                        FROM [bigquery-public-data:github_repos.languages]
                    ) 
                WHERE num = 1 ORDER BY repo_name 
                ) 
            WHERE lang IS NOT NULL 
            ) b ON a.name = b.name 
            GROUP BY type, language, year, quarter 
            ORDER BY year, quarter, count DESC 
        ) 
        WHERE count = 100 
    ) 
    WHERE type = 'PullRequestEvent'

In the world of collaboration, pull requests are like submitting proposals for building a new section in our park. Some areas see more proposals than others, revealing where the community is most active and involved.

Manual Method

Google BigQuery is a powerful tool that allows access to public datasets like GitHub. It’s free for users with a limit of 1000 GB query volume each month. One of the above queries usually takes around 50-200 MB of query volume to execute.

You can explore the public dataset for GitHub here.

Troubleshooting

If you encounter issues while retrieving data or running queries, here are a few tips to resolve them:

Ensure that your SQL syntax is correct. Even a small typo can lead to errors.
Check your permissions on Google BigQuery to confirm that you have access to the datasets.
If the queries take too long to execute, consider refining them to request smaller data sets.
You can also refer to the documentation for more details on how to optimize your queries and use the available resources efficiently.

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

Understanding GitHub language statistics helps you stay updated with the tech landscape and make informed decisions in your projects. Whether you are seeking to learn new skills or gauge community involvement, the data available can guide your path. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

Stay Informed with the Newest F(x) Insights and Blogs

Tech News and Blog Highlights, Straight to Your Inbox