Explore the dynamic world of programming languages used across GitHub repositories!
Getting Started
Have you ever wondered which programming languages dominate GitHub? In this blog, we will explore how to generate and analyze statistics about programming languages, licenses, and pull requests from GitHub’s dataset using SQL queries in Google BigQuery. This information can be instrumental for developers, researchers, and businesses alike.
Data Generation
To kick-start our analysis, we’ll dive into a couple of SQL queries that allow us to generate language statistics and license counts. Let’s break them down.
Languages
- To fetch the top programming languages used on GitHub, we can use the following SQL query:
SELECT language.name, COUNT(language.name) AS count
FROM [bigquery-public-data:github_repos.languages]
GROUP BY language.name
ORDER BY count DESC
In this analogy, think of a library filled with diverse genres of books. Each book represents a programming language. By counting how many books there are of each genre, we get a clearer picture of what genres (or languages) are the most popular.
Licenses
- Similarly, to get a list of licenses used across GitHub:
SELECT license, COUNT(license) AS count
FROM [bigquery-public-data:github_repos.licenses]
GROUP BY license
ORDER BY count DESC
Imagine a park where people have different permissions to use the space, akin to licenses for software. Some areas might allow picnics (like MIT License), while others might have strict rules (like GPL). Counting how often each permission type is granted helps us understand the park’s usage better.
Pull Requests
- Lastly, to analyze the number of pull requests per day, month, or year, you can use this SQL statement:
SELECT language AS name, year, quarter, count
FROM (
SELECT *
FROM (
SELECT lang AS language, y AS year, q AS quarter, type, COUNT(*) AS count
FROM (
SELECT a.type, b.lang, a.y, a.q
FROM (
SELECT type, actor.login, YEAR(created_at) AS y, QUARTER(created_at) AS q,
STRING(REGEXP_REPLACE(repo.url, r'https://github.com', 'https://api.github.com/repos', )) AS name
FROM [githubarchive:month.201901]
WHERE NOT LOWER(actor.login) LIKE '%bot%'
) a
JOIN (
SELECT repo_name AS name, lang
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY repo_name ORDER BY lang) AS num
FROM (
SELECT repo_name, FIRST_VALUE(language.name) OVER (PARTITION BY repo_name ORDER BY language.bytes DESC) AS lang
FROM [bigquery-public-data:github_repos.languages]
)
WHERE num = 1 ORDER BY repo_name
)
WHERE lang IS NOT NULL
) b ON a.name = b.name
GROUP BY type, language, year, quarter
ORDER BY year, quarter, count DESC
)
WHERE count = 100
)
WHERE type = 'PullRequestEvent'
In the world of collaboration, pull requests are like submitting proposals for building a new section in our park. Some areas see more proposals than others, revealing where the community is most active and involved.
Manual Method
Google BigQuery is a powerful tool that allows access to public datasets like GitHub. It’s free for users with a limit of 1000 GB query volume each month. One of the above queries usually takes around 50-200 MB of query volume to execute.
You can explore the public dataset for GitHub here.
Troubleshooting
If you encounter issues while retrieving data or running queries, here are a few tips to resolve them:
- Ensure that your SQL syntax is correct. Even a small typo can lead to errors.
- Check your permissions on Google BigQuery to confirm that you have access to the datasets.
- If the queries take too long to execute, consider refining them to request smaller data sets.
- You can also refer to the documentation for more details on how to optimize your queries and use the available resources efficiently.
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
Understanding GitHub language statistics helps you stay updated with the tech landscape and make informed decisions in your projects. Whether you are seeking to learn new skills or gauge community involvement, the data available can guide your path. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.