DATA STORIES | REPORTING | KNIME ANALYTICS PLATFORM

Automate your GitHub Stats Reporting with KNIME

Connect to your GitHub account via API and create a consolidated report that summarizes the overall performance of all your repositories

Dennis Ganzaroli
Low Code for Data Science
8 min read · Dec 15, 2023


Fig 1: GitHub Stats Report with KNIME 5.2 (image by author).

The Motivation

Recently, I wanted to take a look at how my repositories on GitHub are doing. I haven't published anything for a while and wanted an up-to-date overview of my repositories: how much traffic they get, how many views and unique visitors they attract, and so on.

My goal is to have a complete overview of the traffic of all repositories at a glance on one page.

However, while working out how to achieve this, I noticed that I had to select each repository individually to get to the “Insights” metrics. In addition, you must be logged in to your GitHub account to access the corresponding pages.

Fig 2: Traffic on a repository on my GitHub (image by author).

This led me to rule out web scraping: the pages require a login, and extracting the data from the rendered charts would be far too complex.

The GitHub REST API

The efficient way to extract the required information from GitHub automatically is the REST API.

Some pages and pieces of information in the repositories can be accessed directly via the following endpoint pattern:

https://api.github.com/repos/{owner}/{repo}

So if I call one of my repositories on GitHub with the following endpoint:

https://api.github.com/repos/deganza/Install-TensorFlow-on-Mac-M1-GPU

…I get the following output:

{
  "id": 531901113,
  "node_id": "R_kgDOH7QquQ",
  "name": "Install-TensorFlow-on-Mac-M1-GPU",
  "full_name": "deganza/Install-TensorFlow-on-Mac-M1-GPU",
  "private": false,
  "owner": {
    "login": "deganza",
    "id": 42083662,
    "node_id": "MDQ6VXNlcjQyMDgzNjYy",
    "avatar_url": "https://avatars.githubusercontent.com/u/42083662?v=4",
    "gravatar_id": "",
    "url": "https://api.github.com/users/deganza",
    "html_url": "https://github.com/deganza",
    "followers_url": "https://api.github.com/users/deganza/followers",
    "following_url": "https://api.github.com/users/deganza/following{/other_user}",
    "gists_url": "https://api.github.com/users/deganza/gists{/gist_id}",
    "starred_url": "https://api.github.com/users/deganza/starred{/owner}{/repo}",
    "subscriptions_url": "https://api.github.com/users/deganza/subscriptions",
    "organizations_url": "https://api.github.com/users/deganza/orgs",
    "repos_url": "https://api.github.com/users/deganza/repos",
    "events_url": "https://api.github.com/users/deganza/events{/privacy}",
    "received_events_url": "https://api.github.com/users/deganza/received_events",
    "type": "User",
    "site_admin": false
  },
  "html_url": "https://github.com/deganza/Install-TensorFlow-on-Mac-M1-GPU",
  "description": "Install TensorFlow in a few steps on Mac M1/M2 with GPU support and benefit from the native performance of the new Mac Silicon ARM64 architecture.",
  "fork": false,
  ...
}
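By the way, the same call also works outside of KNIME. Here is a minimal Python sketch using the requests library; stargazers_count and forks_count are standard fields of the repository object that fall into the truncated part of the output above:

import requests

# Public repository metadata requires no authentication
url = "https://api.github.com/repos/deganza/Install-TensorFlow-on-Mac-M1-GPU"
response = requests.get(url, timeout=30)
response.raise_for_status()

repo = response.json()
print(repo["full_name"], repo["stargazers_count"], repo["forks_count"])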

To call the views on the traffic page of the same repository, I use the following endpoint:

https://api.github.com/repos/deganza/Install-TensorFlow-on-Mac-M1-GPU/traffic/views

But this time I get the following error:

{
  "message": "Must have push access to repository",
  "documentation_url": "https://docs.github.com/rest/metrics/traffic#get-page-views"
}

The reason is that this information requires authentication: to access it, I first need to generate a personal access token.

To do this follow these steps:

  • Step 1: Log in to your GitHub account.
  • Step 2: Go to Settings >> Developer settings >> Personal access tokens.
  • Step 3: Click on “Generate new token”.
  • Step 4: Confirm your password to continue.
  • Step 5: Add a description for the token.
  • Step 6: Under the “Select scopes” option, check all the boxes.
  • Step 7: Finally, click on “Generate token”.

Fig 3: Generate a new token for authentication of GitHub REST APIs (image by author).

All information and statistics from my GitHub repositories can now be directly accessed via the API.
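If you want to verify the new token outside of KNIME first, a minimal Python sketch looks like this (GITHUB_TOKEN is a placeholder environment variable of my own choosing; the token travels in the Authorization header):

import os
import requests

# The traffic endpoints require push access, so the request must be authenticated
token = os.environ["GITHUB_TOKEN"]  # placeholder: your generated personal access token
headers = {"Authorization": f"token {token}"}

url = "https://api.github.com/repos/deganza/Install-TensorFlow-on-Mac-M1-GPU/traffic/views"
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()
print(response.json())  # contains "count", "uniques" and a daily "views" list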

The official documentation (https://docs.github.com/en/rest) describes how to use the GitHub REST API.

For this project, however, I will only need the following two endpoints.

To call all my repositories:
(replace {owner} with your account name, without the curly brackets)

https://api.github.com/users/{owner}/repos

And to get the traffic on my repositories:
(replace {repo} with the repository name, without the curly brackets)

https://api.github.com/repos/{owner}/{repo}/traffic/views
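Combined, these two endpoints contain all the logic we need. As a rough orientation before rebuilding it in KNIME, here is what the same data collection could look like in plain Python (a sketch under my own assumptions, not the KNIME implementation that follows; GITHUB_TOKEN is again my placeholder, and the fork filter mirrors the Row Filter step described below):

import os
import requests

OWNER = "deganza"  # replace with your account
HEADERS = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}  # placeholder env var

# 1. List all repositories of the account
repos = requests.get(f"https://api.github.com/users/{OWNER}/repos", headers=HEADERS, timeout=30).json()

for repo in repos:
    if repo["fork"]:  # keep only repositories I created myself
        continue
    # 2. Fetch the traffic views for each remaining repository
    url = f"https://api.github.com/repos/{OWNER}/{repo['name']}/traffic/views"
    views = requests.get(url, headers=HEADERS, timeout=30).json()
    print(repo["name"], repo["stargazers_count"], repo["forks_count"], views["count"], views["uniques"])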

Loading and parsing the data with KNIME

To get everything from a single source, I will build both the automated API calls and the report visualization in KNIME.

If you don't know KNIME yet, you can find a good “Getting Started Guide” here, where you can also download the open-source software for free.

The visual programming language of KNIME is self-explanatory and therefore easy to learn.

Fig 4: Getting Set Up with KNIME Analytics Platform (image from KNIME).

The KNIME workflows with all the examples can be found in my KNIME Community Hub space.

The following part of the KNIME workflow retrieves all my repositories via the GitHub API and lists them in a table with their names, their number of forks and their number of stars.

For this part of the workflow I don't even need my generated GitHub token. The “GET Request” node calls the API, the “JSON to XML” node converts the JSON output into XML, and finally we can select the desired information via XPath.

Fig 5: KNIME Workflow to call my GitHub Repos (image from author).

To get only the repositories that I have created myself and not forked from other projects, I filter on fork = false with the “Row Filter” node.

In addition to the names and links of the repositories, we can now also extract the stars and forks from the generated XML. For this we need the XPath node, which once again proves to be very helpful: we have already seen in earlier articles how easily it can be used for web scraping (see the references).

Our case is even simpler: just click on the corresponding section of the XML preview, and an XPath expression is suggested that can be used immediately to create a new column.

Fig 6: Generating a column with XPath from an XML (image from author).
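Outside of KNIME, the same idea can be reproduced with Python's lxml library. The element names below are purely illustrative, since the actual names depend on how the “JSON to XML” node wraps the response:

from lxml import etree

# Illustrative XML, roughly as a JSON-to-XML conversion might produce it (hypothetical element names)
xml = b"""<root>
  <item><name>repo-a</name><stargazers_count>42</stargazers_count></item>
  <item><name>repo-b</name><stargazers_count>5</stargazers_count></item>
</root>"""

tree = etree.fromstring(xml)
# One XPath expression per target column, just like in the KNIME XPath node
names = tree.xpath("/root/item/name/text()")
stars = tree.xpath("/root/item/stargazers_count/text()")
print(list(zip(names, stars)))  # [('repo-a', '42'), ('repo-b', '5')]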

Now I have the api.github.com URLs of all my repos. The next step is to call up the traffic statistics for each repository.

To accomplish this, I need the “GET Request” node again. This time, however, I have to pass the URL paths directly and adapt the request header so that it carries the generated token.

Fig 7: Calling the GitHub API with generated token (image from author).

Now I can extract the views and unique visitors with the XPath node again and read the daily statistics, which the API provides for the last 14 days.

Fig 8: Output of Traffic Stats on my GitHub (image from author).

It is not possible to go back further than the last 14 days. The workflow could therefore be run every week or two in order to historicize the statistical data in a file or database.
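A minimal Python sketch of such a historization step could look like this (the file name and column layout are my own choice; each entry of the "views" list carries a timestamp, a view count and the unique visitors):

import csv
import os
import requests

token = os.environ["GITHUB_TOKEN"]  # placeholder env var for the generated token
url = "https://api.github.com/repos/deganza/Install-TensorFlow-on-Mac-M1-GPU/traffic/views"
views = requests.get(url, headers={"Authorization": f"token {token}"}, timeout=30).json()

# Append the daily numbers to a local CSV so they survive the 14-day window
with open("traffic_history.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for day in views["views"]:
        writer.writerow([day["timestamp"], day["count"], day["uniques"]])

Since consecutive runs overlap, in practice you would deduplicate on the timestamp before appending.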

I have described in detail how to do this in KNIME in the following article:

Reporting with KNIME

With the new “KNIME Reporting” extension, we can quickly and easily build our report. The new nodes from this extension allow you to efficiently export the results of analyses from component composite views and to generate static PDF or HTML reports.

Fig 9: The new “KNIME Reporting” extension (image from author).

The best thing about this new feature is that there is hardly anything new to learn: you use the view nodes you are already familiar with and simply enable report output in the layout settings of a component with a view.

After the data preprocessing is done, we are ready to continue with the next step: building the report.

Building a report in KNIME follows three steps:

1. Adding views to your workflow

Below you can find some compatible view nodes.
-> Hint: Add comments to the view nodes to identify them more easily later. Choose from various ways to present your data: from textual descriptions and tables, via bar charts, to more advanced visualizations such as heatmaps.

Fig 10: View Nodes to build a report in KNIME (image from KNIME).

2. Wrap the view nodes in a component and design the layout of your report

Wrap the view nodes to be part of your report in a component: Select the nodes, then click “Create component” in the toolbar at the top. To design your report layout, right-click on the component and select “Open layout editor”. Check the “Enable Reporting” checkbox at the bottom of the editor to output a report.

Once you have done that and closed the layout editor, a petrol-colored input and output port will appear. It is essential to connect the “Report Template Creator” node to the input port to define the size and orientation of the report.

Fig 11: The Report Template Creator (image from KNIME).

3. Write to PDF or HTML file

Connect a “Report PDF Writer” or “Report HTML Writer” node to the petrol-colored output port to create and export PDF or HTML reports. Merge smaller reports into one for convenience.

Fig 12: Nodes for Reporting Output (image from KNIME).

In our case, we will build just a simple report. In the upper left part we want to print the date of the last update. We want a table with the total views and unique visitors of the last 14 days and a line plot with the evolution of these two metrics.

Fig 13: Building the Reporting Component (image from author).

We also want a table and a bar chart with the total forks and stars of the repositories.

And that's it: our GitHub Stats Report is ready. We can now save it as a PDF or HTML file.

You can even add a “Send Email” node to automatically send the report to your boss.

Fig 14: The final GitHub Stat Report (image from author).

Conclusion

Once again, KNIME proves to be THE Swiss Army knife for every data engineering and data science task.

The new KNIME Reporting Extension has great potential and enables many reporting use cases to be implemented just with the open source version of KNIME.

For me, there is no doubt about it: The major players in the reporting industry will soon have to brace themselves!

Material for this project:

References:

Thanks for reading and may the Data Force be with you! Please feel free to share your thoughts or reading tips in the comments.

Follow me on Medium, LinkedIn or Twitter and follow my Facebook Group “Data Science with Yodime”.


Dennis Ganzaroli
Data Scientist with over 20 years of experience. Degree in Psychology and Computer Science. KNIME COTM 2021 and Winner of KNIME Best blog post 2020.