Repository Data Collection
Overview
Dependabot and Secret Scanning alerts are collected from their respective GitHub API Endpoints.
- Secret Scanning:
GET /orgs/{org}/secret-scanning/alerts
- Dependabot:
GET /orgs/{org}/dependabot/alerts
These endpoints list all alerts for the organisation. Each alert includes the name of the repository it belongs to but, unfortunately, not the repository's visibility (public/private/internal) or whether the repository is archived.
This information is important because it allows users to filter alerts appropriately. For example, a Secret Scanning alert in a public repository poses a much greater risk than one in a private repository, and we want users to be able to highlight these sorts of issues.
We must, therefore, collect the repository information from the GitHub API separately to allow us to provide this functionality.
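As an illustrative sketch (not the dashboard's actual code), the alert endpoints above can be paged through like this; the helper names and token handling are assumptions, while the URLs follow the endpoints listed above:

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"

def alerts_endpoint(org: str, alert_type: str) -> str:
    """Build the org-level alerts URL ("secret-scanning" or "dependabot")."""
    return f"{API_ROOT}/orgs/{org}/{alert_type}/alerts"

def list_org_alerts(org: str, alert_type: str, token: str) -> list[dict]:
    """Fetch every alert of the given type, following GitHub's pagination."""
    alerts: list[dict] = []
    url = alerts_endpoint(org, alert_type) + "?per_page=100"
    while url:
        request = urllib.request.Request(
            url,
            headers={
                "Authorization": f"Bearer {token}",
                "Accept": "application/vnd.github+json",
            },
        )
        with urllib.request.urlopen(request) as response:
            alerts.extend(json.loads(response.read()))
            # GitHub signals the next page via the Link response header.
            link = response.headers.get("Link", "")
        url = next(
            (part.split(";")[0].strip(" <>")
             for part in link.split(",") if 'rel="next"' in part),
            None,
        )
    return alerts
```

Each alert in the returned list carries the repository name, which is what the mapping described below keys on.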
How is the data collected?
The additional repository information is collected by the dashboard at runtime. Although this is not ideal for performance, the data must be collected in the frontend because the Data Logger already runs close to the GitHub API rate limits.
To make this process as efficient as possible, the function changes how it collects the data depending on which alerts it is collecting for.
The function is available within `./src/utilities` as `get_github_repository_information()`. See below for the function's docstring.
Retrieves additional information about repositories in a GitHub organization (Repository Type and Archived Status).

Parameters:

Name | Type | Description | Default
---|---|---|---
`ql` | `github_graphql_interface` | The GraphQL interface for GitHub API. | *required*
`org` | `str` | The GitHub organization name. | *required*
`repository_list` | `list` | A list of specific repositories to check. If None, all repositories in the organization are checked. | `None`

Returns:

Type | Description
---|---
`Tuple[dict, dict]` | A tuple containing two dictionaries: `repo_types`, mapping repository names to their types (Public, Internal, Private), and `archived_status`, mapping repository names to their archived status (Archived, Not Archived).
Source code in `src/utilities.py`, lines 97-175.
The function collects the repository information either for all repositories in the organisation or only for those passed to it in `repository_list`. This option is provided because, in some cases, it takes fewer API calls to collect all repositories in the organisation than to collect just those that have alerts.
For Secret Scanning alerts, there are likely to be fewer than 30 repositories with alerts, so it is more efficient to collect only those repositories. For Dependabot alerts, there are likely to be more than 30 repositories with alerts, so it is more efficient to collect all repositories in the organisation.
Collecting all repositories in the organisation can be cheaper on the API because the list endpoint returns up to 100 repositories per call. At the time of writing, the organisation has around 3,000 repositories, so collecting them all takes roughly 30 API calls. Fetching repositories individually costs one call each, so once more than 30 repositories need checking, we may as well collect all repositories in the organisation.
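The trade-off above reduces to a simple break-even rule. A minimal sketch (the function name and return values are illustrative, not the real implementation):

```python
PER_PAGE = 100  # repositories returned per "list repositories" API call

def api_call_strategy(repos_with_alerts: int, total_org_repos: int) -> str:
    """Choose the cheaper collection strategy in terms of API calls.

    Fetching repositories one by one costs one call per repository; listing
    every repository in the organisation costs ceil(total / PER_PAGE) calls.
    """
    calls_to_list_all = -(-total_org_repos // PER_PAGE)  # ceiling division
    return "individual" if repos_with_alerts < calls_to_list_all else "all"

# With ~3,000 repositories, listing everything takes 30 calls, so 30 is the
# break-even point described above.
```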
The endpoints used to collect the repository information are:
- List Repositories for an Organisation:
GET /orgs/{org}/repos
- Get a Repository:
GET /repos/{owner}/{repo}
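For illustration, the fields the dashboard needs can be pulled out of a repository response body like this. This is a sketch: the helper name is made up, while the `visibility` and `archived` fields follow the GitHub REST API schema:

```python
def repo_metadata(repo: dict) -> tuple[str, str]:
    """Map a GET /repos/{owner}/{repo} response body to the labels used by
    the dashboard (Repository Type and Archived Status)."""
    repo_type = repo["visibility"].capitalize()  # e.g. "internal" -> "Internal"
    archived = "Archived" if repo["archived"] else "Not Archived"
    return repo_type, archived
```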
How is the data used?
Once we have the repository information, we can map it onto new columns within the DataFrame containing the respective alerts, providing the extra data we need.
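As a minimal sketch of that mapping step (the column names and sample data are assumptions, shaped like the function's two returned dictionaries):

```python
import pandas as pd

# Hypothetical alerts and lookups shaped like get_github_repository_information()'s output.
alerts = pd.DataFrame({
    "repository": ["repo-a", "repo-b", "repo-a"],
    "alert_number": [1, 2, 3],
})
repo_types = {"repo-a": "Public", "repo-b": "Private"}
archived_status = {"repo-a": "Not Archived", "repo-b": "Archived"}

# Map each alert's repository name onto the extra metadata columns.
alerts["repository_type"] = alerts["repository"].map(repo_types)
alerts["archived"] = alerts["repository"].map(archived_status)
```

Alerts in the same repository share the same metadata, so a dictionary lookup keyed on repository name is all the join requires.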