Skip to content

Latest commit

 

History

History
164 lines (147 loc) · 9.04 KB

a2.org

File metadata and controls

164 lines (147 loc) · 9.04 KB

INFO1903

Data

The data used was the public repositories on Github and Bitbucket, and was obtained by querying their respective APIs.

Both APIs produced a JSON-formatted result, which in some cases directly included the required data, or included a link to another API query which held related data. In some form or another, both had access to all of the following:

  • Name
  • Description
  • Owner
  • Size
  • Language
  • Branches
  • Commits
  • Issues
  • Forks
  • Subscribers

Of this, only these were used and saved into the database:

  • Name
  • Description
  • Size
  • Number of forks
  • Primary language

As well as an added host key for which host the data came from (i.e. GH for Github or BB for Bitbucket).

Bitbucket Content

The main access point for Bitbucket was this URL

https://api.bitbucket.org/2.0/repositories

which produces a paginated list of repositories according the value of the pagelen parameter. The first URL used to access the API is

https://api.bitbucket.org/2.0/repositories?pagelen=100

which grabs the maximum number of entries. From that, the URL for the link to the next page can be extracted.

Every Bitbucket response contains this data as a wrapper around the actual values, which are contained within the values field.

namedescriptionexample
pagelenNumber of entries on this page10
nextLink to next page of entrieshttps://api.bitbucket.org/2.0/repositories?pagelen=100&after=
valuesList of entries for this page[{“description”: “…”, …}, …]

Each entry’s full spec can be viewed at https://confluence.atlassian.com/bitbucket/repositories-endpoint-423626330.html. For brevity, these are only the relevant entries for an example of the data returned on a repository from this query:

{
    "pagelen": 10,
    "next": "https://api.bitbucket.org/2.0/repositories?pagelen=10&after=2008-07-12T07%3A44%3A01.476818%2B00%3A00",
    "values": [{
        "description": "Mercurial (hg) extension to allow commenting on commit messages.  Mainly written for practice reading & working with mercurial internals.\r\n",
        "links": {
            "commits": {"href": "https://api.bitbucket.org/2.0/repositories/phlogistonjohn/tweakmsg/commits"},
            "forks": {"href": "https://api.bitbucket.org/2.0/repositories/phlogistonjohn/tweakmsg/forks"},
        },
        "name": "tweakmsg",
        "language": "",
        "created_on": "2008-06-25T00:53:00.273366+00:00",
        "full_name": "phlogistonjohn/tweakmsg",
        "updated_on": "2012-06-24T17:32:27.458855+00:00",
        "size": 7085,
    },
namedescriptionexample
nameName of the repositorytweakmsg
descriptionRepository descriptionMercurial (hg) extension to allow commenting on commit messages
linksEntries of links to more data{“commits”: {“href”: “…”}, …}
sizeSize of a repository in bytes7085
languageMain language used in the respositoryRuby
full_nameThe name and owner of the repositoryplogistonjohn/tweakmsg
created_onThe ISO8601-formatted creation date2008-06-25T00:53:00.273366+00:00
updated_onThe ISO8601-formatted creation date2012-06-24T17:32:27.458855+00:00

The links for obtaining counts of forks were pulled from these responses. The same could have been done for commits, but due to issues in obtaining commits from the Github API, that data was excluded.

The next entry contains the link to the next page of repositories, which was used to get enough pages of results to reach 500 total repositories.

Rights

Bitbucket’s website details the terms of usage at https://www.atlassian.com/legal/customer-agreement, which gives the right to use the software in other applications such as this one.

Github Content

The main access point for Github is

https://api.github.com/repositories

Pagination works similarily to Bitbucket, but Github’s API has the added complication of a more strict rate limit if used without authentication, resulting in occasional 403 errors as access is denied.

Unlike Bitbucket, Github’s repositories endpoint doesn’t honor any way of controlling the number of repositories returned per page. The full spec of what can be returned can be found at . For brevity, only the relevant parts for an example of the returned data is included below:

Status: 200 OK
Link: <https://api.github.com/repositories?since=364>; rel="next"
X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4999
[{
    "name": "calendar_builder",
    "full_name": "collectiveidea/calendar_builder",
    "description": "",
    "url": "https://api.github.com/repos/collectiveidea/calendar_builder",
}]
namedescriptionexample
nameName of the repositorycalendar_builder
full_nameName and owner of the repositorycollectiveidea/calendar_builder
descriptionRepository description
urlLink to the specific API call for the repohttps://api.github.com/repos/collectiveidea/calendar_builder

The Link Header contains the link to the next page of responses. This is used to retrieve 300 responses from Github; only 250 are needed to meet 500 total for both repositories, but Github’s repositories resource does not honor any attempts to limit the number of produced entries.

To access the majority of the data, the url link has to be followed. The commits_url link is similar and could have been followed to collect commits, but attempting this resulted in hitting the rate limit before the program could finish as it meant following an extra link for every repository.

The url entry contains data of which the spec can be viewed at [[ . For brevity, only the relevant results have been included below:

Status: 200 OK
Link: <https://api.github.com/resource?page=2>; rel="next",
      <https://api.github.com/resource?page=5>; rel="last"
X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4999
[
  {
      "name": "Hello-World",
      "full_name": "octocat/Hello-World",
      "description": "This your first repo!",
      "language": null,
      "forks_count": 9,
      "size": 108,
      "created_at": "2011-01-26T19:01:12Z",
      "updated_at": "2011-01-26T19:14:43Z",
  }
]
namedescriptionexample
nameName of the repositoryHello-World
full_nameOwner and name of repositoryoctocat/Hello-World
languageMain language used in repositorynull
forks_countNumber of forks for the repository9
sizeSize of the repository in KB108
created_atCreation timestamp in ISO81 format2011-01-26T19:01:12Z
updated_atUpdate timestamp in ISO81 format2011-01-26T19:01:12Z

Rights

Github’s website details the terms of usage at https://help.github.com/articles/github-terms-of-service/, which gives the right to use the software in other applications such as this one.

Interest

Collecting repository data can allow for a number of interesting comparisons, such as determining which host produces repositories that are more actively maintained, or using descriptions to determine what kind of content is more common on Github than Bitbucket.

Cleaning

Produced data

namedescriptionexample
name

Analysis