
Apply the heuristics for the NLnet projects hosted on github.com #3

Open
julianharty opened this issue Feb 29, 2024 · 7 comments · Fixed by #6
Labels
enhancement New feature or request

Comments

@julianharty
Member

Context

Approximately 60% of the projects sponsored by NLnet are currently hosted on github.com. GitHub also provides mature query mechanisms, so it is likely to be a useful early iteration for providing insight and feedback on our proposed objective: assessing the testing and automated tests performed by the project teams for these projects.

Further info

The wiki on this repo provides heuristics and some notes on querying GitHub using URL query parameters: https://github.com/commercetest/nlnet/wiki

@julianharty julianharty added the enhancement New feature or request label Feb 29, 2024
@tnzmnjm
Collaborator

tnzmnjm commented Mar 1, 2024

As per our discussion, I will start working with the GitHub REST API to:

  1. check that the GitHub repositories exist and are valid
  2. create a column in the dataframe indicating the number of "test" files in each repository

I will be limiting the domain to github.com.
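A minimal sketch of step 1, using `requests` against the `GET /repos/{owner}/{repo}` endpoint. It assumes the PAT is stored in a `GITHUB_TOKEN` environment variable; the function and variable names are illustrative, not the actual script:

```python
import os
import requests  # the HTTP library installed later in this thread

def build_headers(token):
    """Headers for the GitHub REST API; token is a PAT (e.g. public_repo scope)."""
    headers = {
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers

def repo_exists(repo_path, session=requests):
    """True if GET /repos/{owner}/{repo} answers 200 (repo exists and is visible)."""
    url = f"https://api.github.com/repos/{repo_path}"
    response = session.get(url, headers=build_headers(os.environ.get("GITHUB_TOKEN")))
    return response.status_code == 200
```

Passing the session in as a parameter keeps the function easy to test without network access.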

@tnzmnjm
Collaborator

tnzmnjm commented Mar 1, 2024

  • Learnt about the GitHub REST API (what it is and some of its use cases)
  • Generated a PAT for authentication with the GitHub API (expiration: 90 days → the token will expire on Thu, May 30 2024; scopes → public_repo)
  • Chose the endpoints:
    Repositories endpoint: to get information about repositories.
    Contents endpoint: to access the file structure of a repository.
    Search Code endpoint: to search within repositories for specific words or phrases.
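For reference, the three chosen endpoints as URL templates (the `{owner}`/`{repo}`/`{path}` placeholders and constant names are mine, not the script's):

```python
# The three REST endpoints chosen above, as URL templates:
REPOS_URL = "https://api.github.com/repos/{owner}/{repo}"                     # repository metadata
CONTENTS_URL = "https://api.github.com/repos/{owner}/{repo}/contents/{path}"  # file structure
SEARCH_CODE_URL = "https://api.github.com/search/code"                        # code search (q= parameter)
```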

@tnzmnjm
Collaborator

tnzmnjm commented Mar 4, 2024

Progress Update

  • Saved the df as csv
  • Created a repo_request.py - script for finding the number of files that include "test" in their names within a GitHub repository.
  • Loaded the df and did some data preprocessing (there are no Null values - there were 17 duplicate rows, I kept only the first occurrence)
  • Filtered the df to only contain the github.com domain. Called the data frame: github_df
  • Installed the package ‘requests’, which is an HTTP library for Python
  • Saved the PAT into an environment variable
  • Successfully made an authorised request to the first repo (status_code == 200)
  • Printed response.text, but it was not easy to follow (I spent a bit of time converting it to JSON and printing it, but was getting an error. As I don’t need to read the info manually, I’m not going to spend more time on fixing this)
  • When I ran the search query I was getting an error (status_code == 401, response reason: unauthorised)
    As I was able to successfully make an authorised request to the first repo, my PAT is correctly set, and I can print the PAT, so my script can access it. Looked into using PyGithub: it makes it easy to access the repo object, but for searching I would need to list repository contents and filter them manually, so I won't use this package.
  • I was able to resolve it by adding the line below to the header:
'X-GitHub-Api-Version': '2022-11-28'
  • the repo I was querying is this.
  • My search query was only returning items where the word 'test' was mentioned in the body of the README files, but not in the path
  • The amendments below fixed the search problem:
search_query = f'test in:path repo:{repo_path}'
payload = {'q': search_query, 'type': 'code'}
search_url = 'https://api.github.com/search/code'
response = requests.get(search_url, headers=headers, params=payload)

The output is:

Number of "test" files found: 2
File Name: test_unused_attr.c, Path: cmake/tests/test_unused_attr.c
File Name: test_format_attr.c, Path: cmake/tests/test_format_attr.c

Considerations:

  • The REST API has a custom rate limit for searching (10 requests per minute for code search). I will need to address this.
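One hedged way to address the search rate limit is a retry wrapper that honours GitHub's documented `Retry-After` header on 403/429 responses and otherwise backs off exponentially. A sketch, not the script's actual implementation; `headers` is assumed to carry the PAT:

```python
import time
import requests

def rate_limited_search(search_url, headers, params, max_retries=3):
    """Retry a code-search request, honouring Retry-After on 403/429.
    Falls back to an exponentially growing wait sized for 10 requests/min."""
    for attempt in range(max_retries):
        response = requests.get(search_url, headers=headers, params=params)
        if response.status_code == 200:
            return response
        if response.status_code in (403, 429):
            retry_after = response.headers.get("Retry-After")
            wait = int(retry_after) if retry_after else 60 * (2 ** attempt)
            time.sleep(wait)  # wait before the next attempt
        else:
            response.raise_for_status()  # unrelated error: surface it
    raise RuntimeError(f"search failed after {max_retries} retries")
```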

tnzmnjm added a commit that referenced this issue Mar 6, 2024
There are edge cases that are not covered.
See #3 for details.
I'm only checking a single repo at this time. Once I have the logic
working I will apply it on all github repos in the dataframe.
@tnzmnjm
Collaborator

tnzmnjm commented Mar 6, 2024

Progress Update

  • Improved my script repo_request.py
  • Created a branch feature-count-test-files and pushed the changes to the remote repo
  • Cross-checked 20 repos, comparing my script's results with what I found on the repo pages. I found some edge cases and some formats that might need to be excluded
  • I excluded the .txt and .md files from the result (search_query = f'test in:path -filename:.txt -filename:.md repo:{repo_path}')

What needs to happen next

  • My script counts the test files in only one repo at this time. I will need to apply it to all of the GitHub dataframe

  • My script only returns 30 results due to a limitation of the call to the GitHub API. I will need to fix this so that the script can handle multiple pages of the response result

  • A decision needs to be made regarding the edge cases below:

    1. There are some repourl links that have only the username but no repo name. Example: https://github.com/nixcloud → script result is 0. One solution might be to go through these cases and find the repo related to the NLnet project (for which we have the links). I also checked whether removing the ‘ / ’ might have caused this problem, but it’s not the case.

    2. Username change → some usernames have changed (warner/magic-wormhole changed to magic-wormhole/magic-wormhole) → script result is 0. A solution could be to use the redirect message received in the response and, if required, change the username.

    3. Excluding specific file extensions from the result:

      • .h → C Header
      • .good → specific to their project?
      • .dsc → descriptor Files
      • .queries → Data file related to specific queries, used for testing or configuration
      • .tpkg → This appears to be related to a package or testing framework
      • .xml
      • .yml
      • .m → text files for MATLAB
      • .plist → a settings file, also known as a property list file, used by macOS
      • .csv → Path: pythonProject/dataset/ml-latest-small/tags.csv --> will need to exclude them
      • .swift
      • .xctestplan → a test plan document detailing the objectives, resources, and processes for a specific test session for a software or hardware product; typically contains a detailed understanding of the eventual workflow
      • .xcscheme → iOS development related files
      • .monal/TestPlan.xctestplan
      • .ts → TypeScript source files, similar to JavaScript (.js) files but containing TypeScript code
      • .lua → source code / script files whose code determines the actions taken when running programs and applications; the scripts often support automated processes
      • .html → src/testStaticScreen.html
      • .pbxproj → the project.pbxproj file contains metadata such as settings, file references, configuration, and targeted platforms, which Xcode uses to build a project; it is modified whenever a file or folder is added, moved, deleted, or renamed in Xcode
      • .svg → Scalable Vector Graphics, an XML-based vector image format for two-dimensional graphics, with support for interactivity and animation
      • .mm → contains C / C++ code
      • .sh.in? (.sh is a shell script format on Unix/Linux; .in stands for "input", as in input files, commonly found in the context of autotools: autoconf, automake, libtool.) Is it for automation?
      • .tex
      • .dpl
      • .mag
      • .ngspice
      • .nix
      • .go
      • config.toml
      • .ndjson
      • .snap
    4. Couldn’t find a reason why certain files were in the script result:

      • File Name: README, Path: src/test/README but there's no .md in the end - Repo. Link: Here
      • Not sure why the script result is 0, as I can see the files when I check manually. Example
    5. User used some tools for testing:

      • Example
        • file name: milestones/M3/M3.md - Ran the LDP test Suite produced by W3C WG
        • I started looking at testing with the Solid Test suites too.
        • seems like the user had written some test suites with another username: https://github.com/co-operating-systems/Reactive-SoLiD
        • I tested the Lenses idea with some simple Scala code ([as per web-cats issue 28](https://gitlab.com/web-cats/CG/-/issues/28)), but to be able to understand the implications of codata I would need to use something like Agda.
      • In some repos I found several functions that are testing various things but their path does not match our criteria
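The candidate exclusions from item 3 above can be collected for client-side filtering of search results. A sketch; the list is provisional (pending the decision mentioned above), and the names are illustrative:

```python
# Candidate exclusions from the edge-case list, as a set for filtering paths:
EXCLUDED_EXTENSIONS = {
    ".h", ".good", ".dsc", ".queries", ".tpkg", ".xml", ".yml", ".m",
    ".plist", ".csv", ".swift", ".xctestplan", ".xcscheme", ".ts", ".lua",
    ".html", ".pbxproj", ".svg", ".mm", ".sh.in", ".tex", ".dpl", ".mag",
    ".ngspice", ".nix", ".go", ".ndjson", ".snap",
}
EXCLUDED_FILENAMES = {"config.toml"}  # a whole filename rather than an extension

def is_excluded(path):
    """True if a result path should be dropped from the test-file count."""
    name = path.rsplit("/", 1)[-1]
    return name in EXCLUDED_FILENAMES or any(
        name.endswith(ext) for ext in EXCLUDED_EXTENSIONS)
```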

Other considerations :

  • .lua files might be related to automated tasks
  • found some lines related to gitlab: vulnerabilities/tests/test_data/gitlab/gem.yaml
  • .sh.in ? (.sh is a shell format in Unix/Linux) and .in (just stands for input, as in input files. A common place to find them in the context of autotools - autoconf, automake, libtool). Is it for automation?
  • My script gave me a 0 result, and when I checked the repo and searched for the word "test" manually, I found this message: "This repository's code is being indexed right now. Try again in a few minutes." for repo:turkmanovic/OpenEPT. I’m not sure how long the indexing took, but when I checked after a couple of hours and ran the script again, both the script and the manual search found 0 results. We might have to put measures in place to handle these scenarios, e.g. re-running the script for the repos that return a 0 result.
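Two of the 0-result causes discussed in this thread (renamed users/repos, and search indexing lag) can be probed via the repos endpoint rather than search, since `requests` follows GitHub's 301 redirect for renames automatically and the response reports the canonical name. A hypothetical helper, assuming network access and that `headers` carries the PAT:

```python
import requests

def resolve_repo_path(repo_path, headers=None):
    """Return the canonical 'owner/name' for a repo, following the redirect
    GitHub issues for renames (e.g. warner/magic-wormhole ->
    magic-wormhole/magic-wormhole)."""
    response = requests.get(f"https://api.github.com/repos/{repo_path}",
                            headers=headers)
    response.raise_for_status()
    return response.json()["full_name"]  # reflects the post-rename location
```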

@tnzmnjm tnzmnjm linked a pull request Mar 6, 2024 that will close this issue
@tnzmnjm
Collaborator

tnzmnjm commented Mar 9, 2024

Progress Update

  • Created the function get_test_file_count to resolve the pagination problem → after applying the function to the 569 rows, I can see the returned total number of test files is 240, but I’m getting a message: Failed to search, status code: 403 Response text: {"message":"API rate limit exceeded for user ID ..}
  • Came across the Requests-Ratelimiter module to resolve this problem
  • Replaced the print commands with loguru
  • Created new directories : src and data
  • Moved the script repo_request.py to the src directory and the csv files in the data
  • Realised that X-RateLimit-Limit: was not showing any results; modified the logger.info parameters to be one string
  • the rate limits applied to our problem are:
    • Primary rate limit for authenticated users → 5000 requests per hour (≈83 per minute)
    • secondary rate limits:
      • 100 concurrent requests are allowed
      • No more than 900 points per minute are allowed for REST API endpoints (15 points per second)
      • No more than 90 seconds of CPU time per 60 seconds of real time is allowed (You can roughly estimate the CPU time by measuring the total response time for your API requests)
      • Create too much content on GitHub in a short amount of time. In general, no more than 80 content-generating requests per minute and no more than 500 content-generating requests per hour are allowed. Some endpoints have lower content creation limits. Content creation limits include actions taken on the GitHub web interface as well as via the REST API
  • If you exceed a secondary rate limit, you will receive a 403 or 429 response and an error message that indicates that you exceeded a secondary rate limit. If the retry-after response header is present, you should not retry your request until after that many seconds has elapsed. If the x-ratelimit-remaining header is 0, you should not retry your request until after the time, in UTC epoch seconds, specified by the x-ratelimit-reset header. Otherwise, wait for at least one minute before retrying. If your request continues to fail due to a secondary rate limit, wait for an exponentially increasing amount of time between retries, and throw an error after a specific number of retries.
  • Calculating points for the secondary rate limit --> Most REST API GET, HEAD, and OPTIONS requests --> point: 1
  • installed Requests-Ratelimiter package
  • after applying this to my GET request (LimiterSession(per_second=5)) I still got 30 results
  • considering the limitation, the limit is session = LimiterSession(per_second=0.2)
  • I excluded .json, .html and .xml files
  • search_url = f"https://api.github.com/search/code?q=test+in:path+-filename:.txt+-filename:.md+-filename:.html+-filename:.xml+-filename:.json+repo:{repo_path}&page={page}"
  • I ran the code on 3 rows, saved the github_df will now continue applying it on the whole df
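The pagination fix described above can be sketched as a loop over the `page` parameter, stopping on a short page. This is an illustration, not the actual get_test_file_count; the `time.sleep(6)` is a stdlib stand-in for the thread's `LimiterSession(per_minute=10)`:

```python
import time
import requests

SEARCH_URL = "https://api.github.com/search/code"

def get_test_file_count(repo_path, headers):
    """Count files with 'test' in the path across all result pages."""
    query = f"test in:path -filename:.txt -filename:.md repo:{repo_path}"
    total = 0
    page = 1
    while True:
        response = requests.get(
            SEARCH_URL,
            headers=headers,
            params={"q": query, "per_page": 100, "page": page},
        )
        if response.status_code != 200:
            break  # rate-limited or error; a real run should back off and retry
        items = response.json().get("items", [])
        total += len(items)
        if len(items) < 100:
            break  # short page: no further pages
        page += 1
        time.sleep(6)  # stay under the 10 requests/minute search limit
    return total
```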

@tnzmnjm
Collaborator

tnzmnjm commented Mar 9, 2024

Progress Update

  • Running the script on the github_df, I saw it work on 4 rows, then get a 403 error; after seeing this error for several rows, it started working again → reduced the limiter session to LimiterSession(per_second=0.1)
  • separated the make_github_request(search_url=search_url, session=session, headers=headers) from the get_test_file_count(repo_path, headers)
  • found some new file formats: .tex .dpl .mag .ngspice .nix .go config.toml .ndjson .snap --> added them to the list of formats I have in my previous comments
  • Set LimiterSession(per_minute=10)
  • Not all repourls are correct: some point to just the user and others to issues, e.g. https://nlnet.nl/project/AccessibleSecurity,https://github.com/osresearch/heads/issues/540,-1
  • Had to change how I'm extracting the username and repo: index_of_github = parts.index('github.com') → repo_path = '/'.join(parts[index_of_github + 1: index_of_github + 3])
  • In the above cases I will not burn any requests; I will skip them
  • Ran the script on 10, then 50, rows; will now commit my code and run it on the whole GitHub dataframe
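The owner/repo extraction above can be made robust with `urllib.parse`, returning `None` for the user-only links so no requests are burned on them. A sketch of the approach (the helper name is mine):

```python
from urllib.parse import urlparse

def extract_repo_path(repourl):
    """Return 'owner/repo' from a GitHub URL, or None for user-only links."""
    parts = [p for p in urlparse(repourl).path.split("/") if p]
    if len(parts) < 2:
        return None  # e.g. https://github.com/nixcloud -> no repo name
    return "/".join(parts[:2])  # drops trailing segments such as /issues/540
```

Using the parsed path rather than `parts.index('github.com')` also handles hosts like `www.github.com`.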

@tnzmnjm tnzmnjm reopened this Mar 10, 2024
@tnzmnjm
Collaborator

tnzmnjm commented Mar 10, 2024

Progress Update

  • The script stopped when reaching: Analysing repo http://www.github.com/asicsforthemasses extract_reponame_and_username:45 - ['http:', '', 'www.github.com', 'asicsforthemasses']
  • Skipped it for now (manually changed the test count to 0 and ran the script again) → I might need to change the username and repo extraction to use a regex: /(?<=github.com/)([a-z-_.0-9]+)/?([a-z-_]+)
  • After the script finished, I saw some extra columns in the dataframe: ['Unnamed: 0.3', 'Unnamed: 0.2', 'Unnamed: 0.1', 'Unnamed: 0', 'projectref', 'nlnetpage', 'repourl', 'testfilecount']. This happened because my script stopped and I restarted it when I faced problems, and each time pandas added an index column. I removed them from the df and saved the dataframe to '../data/github_df_test_count.csv'
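The 'Unnamed: *' columns accumulate because each CSV round trip writes the index as an unnamed column. A sketch of the cleanup (helper name is mine); writing with `index=False` prevents the columns from reappearing:

```python
import pandas as pd

def drop_unnamed(df):
    """Remove the 'Unnamed: *' index columns accumulated across CSV round trips."""
    return df.loc[:, ~df.columns.str.startswith("Unnamed")]

# When saving, suppress the index so the problem does not recur:
# df.to_csv('../data/github_df_test_count.csv', index=False)
```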
