Repository Data Collection

Overview

Dependabot and Secret Scanning alerts are collected from their respective GitHub API endpoints.

These endpoints list all alerts for the organisation. Each alert contains the name of the repository it belongs to but, unfortunately, not the repository's visibility (public/private/internal) or whether the repository is archived.

This information is important because it allows users to filter alerts appropriately; for example, a public Secret Scanning alert poses a much greater risk than a private one, and we want users to be able to highlight these sorts of issues.

We must, therefore, collect the repository information from the GitHub API separately in order to provide this functionality.

How is the data collected?

The additional repository information is collected by the dashboard at runtime. Although this is not ideal for performance, it must be collected in the frontend because the GitHub API rate limits used by the Data Logger are already stretched. To soften the runtime cost, the result is cached for an hour via st.cache_data (see the decorator on the function below).

To make this process as efficient as possible, the function changes how it collects the data based on which alert type it is collecting for.

The function is available within ./src/utilities.py as get_github_repository_information(). See below for the function's docstring.


Retrieves additional information about repositories in a GitHub organization (Repository Type and Archived Status).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| _rest | github_api_toolkit.github_interface | The REST interface for the GitHub API. | required |
| org | str | The GitHub organization name. | required |
| repository_list | list | A list of specific repositories to check. If None, all repositories in the organization are checked. | None |

Returns:

| Type | Description |
| --- | --- |
| Tuple[dict, dict] | A tuple containing two dictionaries: repo_types, mapping repository names to their types (Public, Internal, Private), and archived_status, mapping repository names to their archived status (Archived, Not Archived). |
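
As an illustration, the two dictionaries in the returned tuple take the following shape (the repository names are made up for the example):

repo_types = {
    "repo-a": "Public",
    "repo-b": "Internal",
    "repo-c": "Private",
}
archived_status = {
    "repo-a": "Not Archived",
    "repo-b": "Archived",
    "repo-c": "Not Archived",
}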

Source code in src/utilities.py
# Imports needed by this function, shown for completeness:
from datetime import timedelta
from typing import Tuple

import streamlit as st
from requests import Response

import github_api_toolkit


@st.cache_data(ttl=timedelta(hours=1))
def get_github_repository_information(
    _rest: github_api_toolkit.github_interface, 
    org: str, 
    repository_list: list = None
) -> Tuple[dict, dict]:
    """Retrieves additional information about repositories in a GitHub organization (Repository Type and Archived Status).

    Args:
        _rest (github_api_toolkit.github_interface): The REST interface for the GitHub API.
        org (str): The GitHub organization name.
        repository_list (list, optional): A list of specific repositories to check. If None, all repositories in the organization are checked.

    Returns:
        Tuple[dict, dict]: A tuple containing two dictionaries:
            - repo_types: A dictionary mapping repository names to their types (Public, Internal, Private).
            - archived_status: A dictionary mapping repository names to their archived status (Archived, Not Archived).
    """

    if repository_list:
        # If a specific list of repositories is provided, retrieve their types
        # This is useful since Secret Scanning will only return a handful of repositories

        repo_types = {}
        archived_status = {}

        for repo in repository_list:
            response = _rest.get(f"/repos/{org}/{repo}")

            if type(response) is not Response:
                print(f"Error retrieving repository {repo}: {response}")
                repo_types[repo] = "Unknown"
                archived_status[repo] = "Unknown"
            else:
                repository = response.json()
                repository_type = repository.get("visibility", "Unknown").title()
                repo_types[repo] = repository_type

                archived_status[repo] = "Archived" if repository.get("archived", False) else "Not Archived"

    else:
        # If no specific list is provided, retrieve all repositories in the organization
        # This is useful for Dependabot Alerts where there are many repositories
        # There will be fewer API calls fetching 100 repositories at a time than fetching each repository individually

        repo_types = {}
        archived_status = {}
        repository_list = []

        response = _rest.get(f"/orgs/{org}/repos", params={"per_page": 100})

        if type(response) is not Response:
            print(f"Error retrieving repositories: {response}")
            return repo_types, archived_status
        else:
            try:
                last_page = int(response.links["last"]["url"].split("=")[-1])
            except KeyError:
                last_page = 1

        for page in range(1, last_page + 1):
            response = _rest.get(f"/orgs/{org}/repos", params={"per_page": 100, "page": page})

            if type(response) is not Response:
                print(f"Error retrieving repositories on page {page}: {response}")
                continue

            repositories = response.json()

            repository_list = repository_list + repositories

        for repo in repository_list:
            repository_name = repo.get("name")
            repository_type = repo.get("visibility", "Unknown").title()
            repo_types[repository_name] = repository_type

            archived_status[repository_name] = "Archived" if repo.get("archived", False) else "Not Archived"

    return repo_types, archived_status
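
A minimal usage sketch follows. How the github_interface instance is constructed is an assumption (the toolkit's actual constructor may differ), as are the token source and the organisation name "my-org":

import os

import github_api_toolkit

# Assumed constructor signature; adjust to the toolkit's actual API.
rest = github_api_toolkit.github_interface(os.environ["GITHUB_TOKEN"])

# Targeted lookup, e.g. for Secret Scanning alerts (few affected repositories):
repo_types, archived_status = get_github_repository_information(rest, "my-org", ["repo-a", "repo-b"])

# Whole-organisation lookup, e.g. for Dependabot alerts:
repo_types, archived_status = get_github_repository_information(rest, "my-org")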

The function works by collecting the repository information for either all repositories in the organisation or only those passed to the function in repository_list. The option is provided because, in some cases, it requires fewer API calls to collect all repositories in the organisation than to collect just those that have alerts.

For Secret Scanning alerts, there are likely to be fewer than 30 repositories with alerts, so it is more efficient to collect only those repositories. For Dependabot alerts, there are likely to be more than 30 repositories with alerts, so it is more efficient to collect all repositories in the organisation.

Collecting all repositories works out cheaper on the API because the list endpoint returns up to 100 repositories per call. At the time of writing, the organisation has around 3,000 repositories, so collecting them all takes roughly 30 API calls. Fetching repositories individually costs one call each, so once more than about 30 repositories need checking, we may as well collect the whole organisation. The arithmetic is sketched below.
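
A short sketch of that break-even arithmetic (the repository count is the approximate figure quoted above, not a live value):

import math

total_repos = 3000  # approximate organisation size at the time of writing
per_page = 100      # maximum repositories returned per paginated API call

calls_to_fetch_all = math.ceil(total_repos / per_page)
print(calls_to_fetch_all)  # 30

# Fetching n repositories individually costs n API calls (one each), so
# fetching the whole organisation wins once n exceeds calls_to_fetch_all.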

The endpoints used to collect the repository information are:

- Get a repository: GET /repos/{org}/{repo} (used when a repository_list is provided)
- List organization repositories: GET /orgs/{org}/repos (used when collecting the whole organisation)

How is the data used?

Once we have the repository information, we can map it onto new columns within the DataFrame containing the respective alerts, providing the extra data we need. A sketch of this mapping is shown below.
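
A minimal sketch of that mapping, assuming the alerts sit in a pandas DataFrame with a "repository" column (the DataFrame and column names are assumptions for illustration):

import pandas as pd

alerts = pd.DataFrame({"repository": ["repo-a", "repo-b"]})  # placeholder alert data

# As returned by get_github_repository_information():
repo_types = {"repo-a": "Public", "repo-b": "Private"}
archived_status = {"repo-a": "Not Archived", "repo-b": "Archived"}

# Map the collected dictionaries onto new columns, defaulting to "Unknown"
# for any repository the lookup did not cover.
alerts["repository_type"] = alerts["repository"].map(repo_types).fillna("Unknown")
alerts["archived_status"] = alerts["repository"].map(archived_status).fillna("Unknown")

# Users can then filter on the new columns, e.g. public repositories only:
public_alerts = alerts[alerts["repository_type"] == "Public"]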