feat: check affiliation utils #10

Merged
merged 6 commits into from
Jul 26, 2024

Conversation

chesha1
Contributor

@chesha1 chesha1 commented Jul 9, 2024

check affiliation utils

A repo-level affiliation check function. It accepts two types of input:

  • a list of potential companies: calculates the percentage of commits made by each company. This shows the relative relation between the listed companies, and the outputs sum to 1
  • a single company: calculates the absolute percentage of commits made by that company
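
A minimal sketch of the difference between the two modes, assuming commits have already been grouped per contributor as `(company, commit_count)` pairs. The function names and data here are hypothetical and are not the ones in utils.py:

```python
from collections import defaultdict


def relative_shares(commits, companies):
    """Share of commits per listed company; shares of the listed companies sum to 1."""
    totals = defaultdict(int)
    for company, count in commits:
        if company in companies:
            totals[company] += count
    listed_total = sum(totals.values())
    if listed_total == 0:
        return {c: 0.0 for c in companies}
    return {c: totals[c] / listed_total for c in companies}


def absolute_share(commits, company):
    """Share of ALL commits (including unaffiliated ones) made by one company."""
    total = sum(count for _, count in commits)
    if total == 0:
        return 0.0
    return sum(count for c, count in commits if c == company) / total


# Hypothetical data: None stands for a contributor with no company on the profile.
commits = [("GOOGLE", 60), ("RED HAT", 20), (None, 120)]
shares = relative_shares(commits, ["GOOGLE", "RED HAT"])
google_absolute = absolute_share(commits, "GOOGLE")
```

With this data the relative mode ignores the 120 unaffiliated commits, while the absolute mode counts them in the denominator.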

@LeslieLeung
Owner

Consider using the pygithub package in the project, which wraps the GitHub API. It can handle pagination and retry functionalities, saving you from writing the requests manually.

@chesha1
Contributor Author

chesha1 commented Jul 10, 2024

Fixed. Please review it again

Owner

@LeslieLeung LeslieLeung left a comment

The overall code looks good except for some styling issues.

BTW, have you run this code on some of the major repos? I wonder what the results are.

@chesha1
Contributor Author

chesha1 commented Jul 11, 2024

Fixed. Please review it again.

I left an example in utils.py that calculates contributions from companies in kubernetes/kubernetes, considering the first 100 contributors.

The result is below:

Contributions by company: {'GOOGLE': 62.47533868209917, 'MICROSOFT': 8.299355517558858, 'RED HAT': 16.085755622780482, 'DATABRICKS': 4.218729448901749, 'INTEL': 3.4328554517953442, 'AMAZON WEB SERVICES': 2.7620676048928057, 'GOLDMAN SACHS': 2.72589767197159}

Contributions by single company GOOGLE: 28.60196600882145 (this is because there is a robot with no company in the repo contributing most of the commits)
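
As a rough cross-check of the two figures above, and assuming both were computed over the same set of commits (the thread does not confirm this), dividing the absolute share by the relative share gives the fraction of commits attributed to any of the seven listed companies:

```python
# Figures copied from the results above (both in percent).
relative_google = 62.47533868209917  # share among the seven listed companies
absolute_google = 28.60196600882145  # share of all counted commits

# absolute = G / total and relative = G / listed, so listed / total = absolute / relative.
listed_fraction = absolute_google / relative_google
unlisted_fraction = 1 - listed_fraction

print(f"listed companies: {listed_fraction:.1%}, everyone else: {unlisted_fraction:.1%}")
```

Under that assumption, a little under half of the counted commits come from the listed companies, and the rest come from bots, unaffiliated users, and other organizations.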

@LeslieLeung
Owner

The bot you mentioned must be the CI bot. This is tricky: we can exclude merge commits or bot accounts, but there is no accurate way to do this.

Also, while looking into that bot, I found some edge cases that your code might not cover.

  • A person who is affiliated with a company but did not specify it on their profile
  • A person who once worked for company A and then moved on to B
  • Different names used for the same company, e.g. Red Hat and Red Hat, Inc

Also, I found an API from ossinsight that can list the organizations of PR creators, you can check that out. They have a chart that shows the percentage of contributors from different orgs.

And they also mentioned this:

In the overall data, about 5.62% of GitHub users has valid organization information.

@chesha1
Contributor Author

chesha1 commented Jul 12, 2024

Finding CI robots is challenging, so I recommend using calculate_contributions_by_company rather than calculate_single_company_contributions. The results from calculate_contributions_by_company are guided by a list of input companies, which helps exclude potential CI bots.

Although some users do not show their companies, we can assume the distribution of such users is similar across all companies. Each company has a certain number of developers who do not show their affiliations. Thus, using calculate_contributions_by_company will not be affected as it shows the relative percentage of contributions from the input companies.
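
The argument above can be written out as follows, under the assumption (not verified in this thread) that developers at every company disclose their affiliation at the same rate $r$. Here $c_i$ is the true number of commits from company $i$ and $o_i$ is the number observed from profiles that show the company; both symbols are introduced only for this illustration:

```latex
\[
o_i = r\,c_i
\quad\Longrightarrow\quad
\frac{o_i}{\sum_j o_j}
  = \frac{r\,c_i}{\sum_j r\,c_j}
  = \frac{c_i}{\sum_j c_j}
\]
```

The common factor $r$ cancels, so the relative shares are unchanged. If the disclosure rates differ per company ($r_i$ instead of $r$), the factor no longer cancels and the result is biased toward companies whose developers disclose more often.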

Handling situations where a person once worked for Company A and then moved to Company B is difficult. This is indeed a problem, and I do not have a good idea for it right now.

To address cases where different names are used for the same company (e.g., Red Hat and Red Hat, Inc), we can use the shortest input company name to match all variations:

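# Normalize the company string taken from the contributor's profile.
company = company.upper().replace('@', '').strip()

# Map the profile value onto a target company by substring match.
count = False
for target_company in companies:
    if target_company in company:
        company = target_company
        count = True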

The existing code already tries to match all possible company names against the shortest input.

In the above example, the input companies are ['GOOGLE', 'MICROSOFT', 'INTEL', 'Red Hat', 'Databricks', 'Amazon Web Services', 'Goldman Sachs']. As expected, it matches Intel GmbH: https://github.com/pohly
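
A self-contained sketch of the same substring idea, assuming the target names are upper-cased the same way as the profile value (the snippet above does not show that step). The function name `match_company` is hypothetical, and unlike the loop above it returns the first matching target rather than the last:

```python
def match_company(profile_company, targets):
    """Return the first target whose name occurs in the profile value, else None."""
    cleaned = profile_company.upper().replace("@", "").strip()
    for target in targets:
        # Compare in upper case so 'Red Hat' matches 'RED HAT, INC'.
        if target.upper() in cleaned:
            return target
    return None


targets = ["GOOGLE", "INTEL", "Red Hat"]
intel = match_company("@Intel GmbH", targets)
red_hat = match_company("Red Hat, Inc", targets)
nobody = match_company("Independent", targets)
```

Plain substring matching can over-match: a profile value such as "Intelligent Systems" would also be attributed to INTEL, so short target names need care.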

@LeslieLeung LeslieLeung mentioned this pull request Jul 14, 2024
@LeslieLeung
Owner

I think I didn't make my point clear. The key here is that we may never get an accurate rate from this method, and releasing inaccurate data would be irresponsible.

An alternative could be using data from the OSSInsight API I mentioned earlier. We should clarify that we used their data and that it has not been proven to be accurate.

@chesha1
Contributor Author

chesha1 commented Jul 15, 2024

Understood. I couldn't find a detailed introduction to this API, so I am wondering if the company of the PR creator is stored by OSS Insight each time. If so, we can assume their data is valid as it excludes the impact of resignations and new hires.

Do you know if there is a Python library to access OSS Insight? It seems that I have to use requests to access the OSS Insight API.

@chesha1
Contributor Author

chesha1 commented Jul 15, 2024

Another big problem is the variations of a single company name. For example, OSS Insight data include red hat, inc, red hat, inc., redhatofficial, and red hat as separate entities, calculating their contributions independently.

Do you have any good ideas to handle this problem? One straightforward solution is to provide a list of company names as input, remove special characters from the company names in OSS Insight, and try to match them. If a match is found, we can merge the variants into the given company name. However, this approach may have many corner cases that require manual adjustments.
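
The straightforward solution described above could look roughly like this. It is a sketch under stated assumptions: rows are `(org_name, percentage)` pairs, and the helper names `normalize` and `merge_variants` are made up for illustration:

```python
import re
from collections import defaultdict


def normalize(name):
    """Lower-case a name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def merge_variants(rows, canonical_names):
    """Sum percentages of organization names that contain a canonical name."""
    keys = {normalize(name): name for name in canonical_names}
    merged = defaultdict(float)
    for org_name, percentage in rows:
        norm = normalize(org_name)
        # Fall back to the original name when no canonical name matches.
        label = next((c for k, c in keys.items() if k and k in norm), org_name)
        merged[label] += percentage
    return dict(merged)


# Hypothetical percentages; only the organization names come from the comment above.
rows = [
    ("red hat, inc", 0.05),
    ("red hat, inc.", 0.03),
    ("redhatofficial", 0.01),
    ("red hat", 0.10),
    ("google", 0.20),
]
merged = merge_variants(rows, ["Red Hat"])
```

Here all four Red Hat variants collapse into one entry. Abbreviations or renamed companies would still need manual mapping, which is the corner-case cost mentioned above.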

@LeslieLeung
Owner

Sorry for the delay.

Do you think we can agree that:
a. It's impossible to precisely track each company's contribution
b. OSS Insight provides reasonably acceptable results

If so, we can use its data (it's fine using requests) and include a reference to it.

@chesha1
Contributor Author

chesha1 commented Jul 24, 2024

Two new functions from the OSS Insight API:

  • list_organizations_of_pr_creators returns the raw data with percentages and organization names.
  • is_company_dominant_in_repo checks if a company has made a significant number of contributions to a repository. It takes a threshold number as input, and if the result is false, the repo can be safely attributed to foundations.
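
A minimal sketch of the threshold check, not the merged implementation. It assumes the raw data is a list of dicts with `org_name` and `percentage` keys (the percentage being a fraction between 0 and 1, possibly as a string), and that "dominant" means the share is at least the threshold; the real field names, signature, and comparison may differ:

```python
def is_dominant_sketch(rows, company, threshold):
    """Return True if the company's share of PR creators reaches the threshold."""
    wanted = company.strip().lower()
    share = sum(
        float(row["percentage"])
        for row in rows
        if row["org_name"].strip().lower() == wanted
    )
    return share >= threshold


# Hypothetical rows in the assumed shape.
rows = [
    {"org_name": "google", "percentage": "0.30"},
    {"org_name": "red hat", "percentage": "0.10"},
]
google_dominant = is_dominant_sketch(rows, "Google", 0.25)
red_hat_dominant = is_dominant_sketch(rows, "Red Hat", 0.25)
```

Exact-name matching is used here for simplicity, so name variants such as "red hat, inc" would have to be merged before the check.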

@LeslieLeung LeslieLeung merged commit f4c23ce into LeslieLeung:main Jul 26, 2024
1 check passed
2 participants