DATA STORIES | REPORTING | KNIME ANALYTICS PLATFORM
Automate your GitHub Stats Reporting with KNIME
Connect to your GitHub account via API and create a consolidated report that summarizes the overall performance of all your repositories
The Motivation
Recently, I wanted to check how my repositories on GitHub are doing. I hadn’t published anything for a while and wanted an updated overview of my repositories: how much traffic they get in terms of views, unique visitors and so on.
My goal is to have a complete overview of the traffic of all repositories at a glance on one page.
However, when I looked into how to achieve this, I noticed that I had to open each repository individually to get its “Insights” metrics. In addition, you must be logged in to your GitHub account to access the corresponding pages.
Web scraping these pages is therefore impractical: the login requirement complicates automation, and extracting the data from the charts would be too complex.
The GitHub REST API
The efficient way to extract the required information from GitHub automatically is the REST API.
Basic repository information can be accessed directly via the following endpoint pattern:
https://api.github.com/repos/{owner}/{repo}
So if I call one of my repositories on GitHub with the following endpoint:
https://api.github.com/repos/deganza/Install-TensorFlow-on-Mac-M1-GPU
…I get the following output:
{
"id": 531901113,
"node_id": "R_kgDOH7QquQ",
"name": "Install-TensorFlow-on-Mac-M1-GPU",
"full_name": "deganza/Install-TensorFlow-on-Mac-M1-GPU",
"private": false,
"owner": {
"login": "deganza",
"id": 42083662,
"node_id": "MDQ6VXNlcjQyMDgzNjYy",
"avatar_url": "https://avatars.githubusercontent.com/u/42083662?v=4",
"gravatar_id": "",
"url": "https://api.github.com/users/deganza",
"html_url": "https://github.com/deganza",
"followers_url": "https://api.github.com/users/deganza/followers",
"following_url": "https://api.github.com/users/deganza/following{/other_user}",
"gists_url": "https://api.github.com/users/deganza/gists{/gist_id}",
"starred_url": "https://api.github.com/users/deganza/starred{/owner}{/repo}",
"subscriptions_url": "https://api.github.com/users/deganza/subscriptions",
"organizations_url": "https://api.github.com/users/deganza/orgs",
"repos_url": "https://api.github.com/users/deganza/repos",
"events_url": "https://api.github.com/users/deganza/events{/privacy}",
"received_events_url": "https://api.github.com/users/deganza/received_events",
"type": "User",
"site_admin": false
},
"html_url": "https://github.com/deganza/Install-TensorFlow-on-Mac-M1-GPU",
"description": "Install TensorFlow in a few steps on Mac M1/M2 with GPU support and benefit from the native performance of the new Mac Silicon ARM64 architecture.",
"fork": false,
....
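Outside of KNIME, the same unauthenticated call can be sketched in a few lines of Python using only the standard library (the helper names below are my own, not part of any GitHub SDK):

```python
import json
import urllib.request

def repo_endpoint(owner: str, repo: str) -> str:
    # Build the REST endpoint for a single repository.
    return f"https://api.github.com/repos/{owner}/{repo}"

def fetch_repo(owner: str, repo: str) -> dict:
    # Unauthenticated GET: public repository metadata needs no token.
    with urllib.request.urlopen(repo_endpoint(owner, repo)) as resp:
        return json.load(resp)

if __name__ == "__main__":
    data = fetch_repo("deganza", "Install-TensorFlow-on-Mac-M1-GPU")
    print(data["full_name"], data["stargazers_count"], data["forks_count"])
```

The JSON fields used here (`full_name`, `stargazers_count`, `forks_count`) are part of the repository object shown above.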
To call the views on the traffic page of the same repository, I use the following endpoint:
https://api.github.com/repos/deganza/Install-TensorFlow-on-Mac-M1-GPU/traffic/views
But this time I get the following error:
{
"message": "Must have push access to repository",
"documentation_url": "https://docs.github.com/rest/metrics/traffic#get-page-views"
}
The reason is that accessing traffic data requires authentication with a personal access token.
To generate one, follow these steps:
- Step 1: Log in to the GitHub Account.
- Step 2: Go to Settings >> Developer settings >> Personal access tokens.
- Step 3: Then, click on generate a new token.
- Step 4: Confirm the user password to continue.
- Step 5: Add a description to the token.
- Step 6: Under the select scopes option, check the required boxes (for this project, the “repo” scope is sufficient).
- Step 7: Finally, click on generate a new token.
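With the token in hand, the authenticated call can be sketched like this in Python (the helper name is illustrative, and `ghp_YOUR_TOKEN` is a placeholder for your real token):

```python
import json
import urllib.request

def traffic_request(owner: str, repo: str, token: str) -> urllib.request.Request:
    # Traffic endpoints require push access, so the personal access token
    # goes into the Authorization header ("token <PAT>" is accepted by the API).
    url = f"https://api.github.com/repos/{owner}/{repo}/traffic/views"
    return urllib.request.Request(url, headers={"Authorization": f"token {token}"})

if __name__ == "__main__":
    req = traffic_request("deganza", "Install-TensorFlow-on-Mac-M1-GPU", "ghp_YOUR_TOKEN")
    with urllib.request.urlopen(req) as resp:
        views = json.load(resp)
    # "count" and "uniques" are the 30-day totals; "views" holds the daily breakdown.
    print(views["count"], views["uniques"])
```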
All information and statistics from my GitHub repositories can now be directly accessed via the API.
The GitHub REST API documentation (see references below) describes all available endpoints.
For this project, however, I will only need the following two endpoints.
To call all my repositories:
(replace {owner} with your account name, without the braces)
https://api.github.com/users/{owner}/repos
And to get the traffic on my repositories:
(replace {owner} and {repo} accordingly, without the braces)
https://api.github.com/repos/{owner}/{repo}/traffic/views
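Combining these two endpoints in plain Python looks roughly like this (the helper names are my own; the fork filter mirrors the “forks=false” Row Filter used later in the KNIME workflow):

```python
import json
import urllib.request

def list_repos(owner: str) -> list:
    # First endpoint: all repositories of the account (no token needed).
    with urllib.request.urlopen(f"https://api.github.com/users/{owner}/repos") as resp:
        return json.load(resp)

def own_repos(repos: list) -> list:
    # Keep only repositories created by the owner, dropping forks of
    # other projects (the "fork" flag comes straight from the API).
    return [r for r in repos if not r["fork"]]

if __name__ == "__main__":
    for repo in own_repos(list_repos("deganza")):
        print(repo["name"], repo["stargazers_count"], repo["forks_count"])
```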
Loading and parsing the data with KNIME
To get everything from one source, I will create both the automated calls via the API and the visualization in a report in KNIME.
If you don’t know KNIME yet, you can find a good “Getting Started Guide” here, where you can also download the open-source software for free.
The visual programming language of KNIME is self-explanatory and therefore easy to learn.
The KNIME workflows with all the examples can be found on my
KNIME Community Hub space.
The following part of the KNIME workflow retrieves all my repositories via the GitHub API and lists them in a table with their names, the number of forks and the number of stars.
For this part of the workflow I don’t even need my generated GitHub token. The “GET Request” node calls the API, the “JSON to XML” node converts the JSON output into XML, and finally we can select the desired information via XPath.
To get only the repositories that I have created myself and not forked from other projects, I have set the filter “forks=false” with the “Row Filter” node.
In addition to the names and links of the repositories, we can now also extract the stars and forks from the generated XML. To do this, we need the XPath node, which once again proves to be very helpful.
We have already seen in various articles how easily it can be used for web scraping (see references below).
The following example is even simpler: just click on the corresponding element of the XML and an XPath query is suggested, which can be used immediately to create a column.
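For readers who prefer code, the same JSON-to-XML-to-XPath extraction can be sketched with Python’s standard library (the XML below is a tiny illustrative stand-in, not KNIME’s exact “JSON to XML” output):

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for the XML produced from the repository JSON;
# element names follow the API field names, not KNIME's exact schema.
xml_doc = """
<repos>
  <repo>
    <name>demo-repo</name>
    <stargazers_count>42</stargazers_count>
    <forks_count>7</forks_count>
  </repo>
</repos>
"""

root = ET.fromstring(xml_doc)
# Path queries of the same kind the XPath node suggests when you click an element:
names = [e.text for e in root.findall("./repo/name")]
stars = [int(e.text) for e in root.findall("./repo/stargazers_count")]
forks = [int(e.text) for e in root.findall("./repo/forks_count")]
```

Each query yields one column of the result table, just as in the KNIME workflow.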
Now I have all api.github.com URLs of my repositories. The next step is to retrieve the traffic statistics for each repository.
To accomplish this, I need the “GET Request” node again. This time, however, I have to pass the URL paths directly and adapt the request header so that I can pass the generated token.
Now I can extract the views and unique visitors with the XPath node again and read the statistics for the last 30 days.
The API does not go back further than the last 30 days. This workflow could therefore be run every few weeks or once a month to historize the statistics in a file or database.
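A minimal sketch of such a historization step in Python, assuming the daily traffic rows have already been collected (the function name, file name and column layout are my own choices; overlapping 30-day windows would still need deduplication on repo and timestamp):

```python
import csv
import os

def historize(rows: list, path: str = "traffic_history.csv") -> None:
    # Append the latest 30-day window to a CSV file; running this once a
    # month accumulates history beyond GitHub's 30-day retention.
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            # Write the header only when the file is created.
            writer.writerow(["repo", "timestamp", "count", "uniques"])
        writer.writerows(rows)
```

In KNIME the same effect is achieved with a file or database writer node at the end of the workflow.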
I have described in detail how to do this in KNIME in the following article:
Reporting with KNIME
With the new “KNIME Reporting” extension, we can quickly and easily build our report. The new nodes from this extension allow you to export the results of analyses from component composite views efficiently and generate static PDF or HTML reports.
The best thing about this new feature is that you don’t have to learn anything new: you work with the component views you are already familiar with and simply enable report output in the layout settings of a component with a view.
After the data preprocessing is done, we are ready to continue with the next step: building the report.
Building a report in KNIME follows three steps:
1. Adding views to your workflow
Below you can find some compatible view nodes.
— > Hint: Add comments to the view nodes to identify them more easily later. Choose from various ways to present your data: from textual descriptions and tables via bar charts to more advanced visualizations such as heatmaps.
2. Wrap the view nodes in a component and design the layout of your report
Wrap the view nodes to be part of your report in a component: Select the nodes, then click “Create component” in the toolbar at the top. To design your report layout, right-click on the component and select “Open layout editor”. Check the “Enable Reporting” checkbox at the bottom of the editor to output a report.
Once you have done that and closed the layout editor, a petrol-colored input and output port will appear. It is essential to connect the “Report Template Creator” node to the input port to define the size and orientation of the report.
3. Write to PDF or HTML file
Connect a “Report PDF Writer” or “Report HTML Writer” node to the petrol-colored output port to create and export PDF or HTML reports. Merge smaller reports into one for convenience.
In our case, we will just build a simple report. In the upper left part we want to print the date of the last update. We want a table with the total views and unique visitors of the last 30 days, and a line plot showing the evolution of these two metrics.
We also want a table and a bar chart with the total forks and stars of the repositories.
And that’s it, our GitHub Stats Report is ready. We can now save it as a PDF or HTML file.
You can even add a “Send Email” node to automatically send the report to your boss.
Conclusion
Once again, KNIME proves to be THE Swiss army knife for every data engineering and data science task.
The new KNIME Reporting Extension has great potential and enables many reporting use cases to be implemented just with the open source version of KNIME.
For me, there is no doubt about it: The major players in the reporting industry will soon have to brace themselves!
Material for this project:
- KNIME workflow: KNIME Community Hub
References:
- Scraping NFL Data with KNIME — Part 1 (Dennis Ganzaroli)
- Getting Started with the REST API (GitHub Docs)
- Say hi to KNIME reporting (KNIME blog)
- KNIME Reporting: Transforming Report Communication (Ángel Molina Laguna)
Thanks for reading and may the Data Force be with you! Please feel free to share your thoughts or reading tips in the comments.
Follow me on Medium, LinkedIn or Twitter and follow my Facebook Group “Data Science with Yodime”.