INFO1903

Data

The data used was the public repositories on Github and Bitbucket, and was obtained by querying their respective APIs.

Both APIs produced a JSON-formatted result, which in some cases directly included the required data, or included a link to another API query which held related data. In some form or another, both had access to all of the following:

Name
Description
Owner
Size
Language
Branches
Commits
Issues
Forks
Subscribers

Of this, only these were used and saved into the database:

Name
Description
Size
Number of forks
Primary language

As well as an added host key for which host the data came from (i.e. GH for Github or BB for Bitbucket).

Bitbucket Content

The main access point for Bitbucket was this URL

https://api.bitbucket.org/2.0/repositories

which produces a paginated list of repositories according the value of the pagelen parameter. The first URL used to access the API is

https://api.bitbucket.org/2.0/repositories?pagelen=100

which grabs the maximum number of entries. From that, the URL for the link to the next page can be extracted.

Every Bitbucket response contains this data as a wrapper around the actual values, which are contained within the values field.

name	description	example
pagelen	Number of entries on this page	10
next	Link to next page of entries	https://api.bitbucket.org/2.0/repositories?pagelen=100&after= …
values	List of entries for this page	[{“description”: “…”, …}, …]

Each entry’s full spec can be viewed at https://confluence.atlassian.com/bitbucket/repositories-endpoint-423626330.html. For brevity, these are only the relevant entries for an example of the data returned on a repository from this query:

{
    "pagelen": 10,
    "next": "https://api.bitbucket.org/2.0/repositories?pagelen=10&after=2008-07-12T07%3A44%3A01.476818%2B00%3A00",
    "values": [{
        "description": "Mercurial (hg) extension to allow commenting on commit messages.  Mainly written for practice reading & working with mercurial internals.\r\n",
        "links": {
            "commits": {"href": "https://api.bitbucket.org/2.0/repositories/phlogistonjohn/tweakmsg/commits"},
            "forks": {"href": "https://api.bitbucket.org/2.0/repositories/phlogistonjohn/tweakmsg/forks"},
        },
        "name": "tweakmsg",
        "language": "",
        "created_on": "2008-06-25T00:53:00.273366+00:00",
        "full_name": "phlogistonjohn/tweakmsg",
        "updated_on": "2012-06-24T17:32:27.458855+00:00",
        "size": 7085,
    },

name	description	example
name	Name of the repository	tweakmsg
description	Repository description	Mercurial (hg) extension to allow commenting on commit messages
links	Entries of links to more data	{“commits”: {“href”: “…”}, …}
size	Size of a repository in bytes	7085
language	Main language used in the respository	Ruby
`full_name`	The name and owner of the repository	plogistonjohn/tweakmsg
`created_on`	The ISO8601-formatted creation date	2008-06-25T00:53:00.273366+00:00
`updated_on`	The ISO8601-formatted creation date	2012-06-24T17:32:27.458855+00:00

The links for obtaining counts of forks were pulled from these responses. The same could have been done for commits, but due to issues in obtaining commits from the Github API, that data was excluded.

The next entry contains the link to the next page of repositories, which was used to get enough pages of results to reach 500 total repositories.

Rights

Bitbucket’s website details the terms of usage at https://www.atlassian.com/legal/customer-agreement, which gives the right to use the software in other applications such as this one.

Github Content

The main access point for Github is

https://api.github.com/repositories

Pagination works similarily to Bitbucket, but Github’s API has the added complication of a more strict rate limit if used without authentication, resulting in occasional 403 errors as access is denied.

Unlike Bitbucket, Github’s repositories endpoint doesn’t honor any way of controlling the number of repositories returned per page. The full spec of what can be returned can be found at . For brevity, only the relevant parts for an example of the returned data is included below:

Status: 200 OK
Link: <https://api.github.com/repositories?since=364>; rel="next"
X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4999
[{
    "name": "calendar_builder",
    "full_name": "collectiveidea/calendar_builder",
    "description": "",
    "url": "https://api.github.com/repos/collectiveidea/calendar_builder",
}]

name	description	example
name	Name of the repository	`calendar_builder`
`full_name`	Name and owner of the repository	`collectiveidea/calendar_builder`
description	Repository description
url	Link to the specific API call for the repo	https://api.github.com/repos/collectiveidea/calendar_builder

The Link Header contains the link to the next page of responses. This is used to retrieve 300 responses from Github; only 250 are needed to meet 500 total for both repositories, but Github’s repositories resource does not honor any attempts to limit the number of produced entries.

To access the majority of the data, the url link has to be followed. The commits_url link is similar and could have been followed to collect commits, but attempting this resulted in hitting the rate limit before the program could finish as it meant following an extra link for every repository.

The url entry contains data of which the spec can be viewed at [[ . For brevity, only the relevant results have been included below:

Status: 200 OK
Link: <https://api.github.com/resource?page=2>; rel="next",
      <https://api.github.com/resource?page=5>; rel="last"
X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4999
[
  {
      "name": "Hello-World",
      "full_name": "octocat/Hello-World",
      "description": "This your first repo!",
      "language": null,
      "forks_count": 9,
      "size": 108,
      "created_at": "2011-01-26T19:01:12Z",
      "updated_at": "2011-01-26T19:14:43Z",
  }
]

name	description	example
name	Name of the repository	Hello-World
`full_name`	Owner and name of repository	octocat/Hello-World
language	Main language used in repository	null
`forks_count`	Number of forks for the repository	9
size	Size of the repository in KB	108
`created_at`	Creation timestamp in ISO81 format	2011-01-26T19:01:12Z
`updated_at`	Update timestamp in ISO81 format	2011-01-26T19:01:12Z

Rights

Github’s website details the terms of usage at https://help.github.com/articles/github-terms-of-service/, which gives the right to use the software in other applications such as this one.

Interest

Collecting repository data can allow for a number of interesting comparisons, such as determining which host produces repositories that are more actively maintained, or using descriptions to determine what kind of content is more common on Github than Bitbucket.

Cleaning

Produced data

name	description	example
name

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

a2.org

a2.org

INFO1903

Data

Bitbucket Content

Rights

Github Content

Rights

Interest

Cleaning

Produced data

Analysis

Files

a2.org

Latest commit

History

a2.org

File metadata and controls

INFO1903

Data

Bitbucket Content

Rights

Github Content

Rights

Interest

Cleaning

Produced data

Analysis