feat: check affiliation utils #10
Conversation
Consider using the |
Fixed. Please review it again |
The overall code looks good except for some styling issues.
BTW, have you run this code on some of the major repos? I wonder what the results are.
Fixed. Please review it again. I left an example in The result is below:
Contributions by company: {'GOOGLE': 62.47533868209917, 'MICROSOFT': 8.299355517558858, 'RED HAT': 16.085755622780482, 'DATABRICKS': 4.218729448901749, 'INTEL': 3.4328554517953442, 'AMAZON WEB SERVICES': 2.7620676048928057, 'GOLDMAN SACHS': 2.72589767197159}
Contributions by single company GOOGLE: 28.60196600882145
(This is because a bot with no company listed contributes most of the commits in the repo.) |
The bot you mentioned must be the CI bot. This is kind of tricky: we can exclude merge commits or bot accounts, but there is no accurate way to do this. And after looking into that bot, I found some edge cases that your code might not cover.
Also, I found an API from ossinsight that can list the organizations of PR creators, you can check that out. They have a chart that shows the percentage of contributors from different orgs. And they also mentioned this:
|
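To make the bot discussion concrete, here is a minimal sketch of a per-company share calculation that skips merge commits and accounts that look like bots. The record fields (`login`, `type`, `company`, `is_merge`) and the function name `company_shares` are hypothetical and are not taken from this PR. The only bot signals used are a `type` of `"Bot"` and a login ending in `[bot]`, so a CI bot running under a regular user account still slips through, which is exactly the accuracy problem raised above.

```python
from collections import Counter


def company_shares(commits, skip_bots=True, skip_merges=True):
    """Return {company: percent of counted commits} for authors that list a company.

    `commits` is an iterable of dicts with the hypothetical keys
    "login", "type", "company" and "is_merge".
    """
    counts = Counter()
    for commit in commits:
        login = commit.get("login") or ""
        # Heuristic only: GitHub App accounts use logins ending in "[bot]".
        is_bot = commit.get("type") == "Bot" or login.endswith("[bot]")
        if skip_bots and is_bot:
            continue
        if skip_merges and commit.get("is_merge"):
            continue
        # Same normalization as the snippet discussed in this thread.
        company = (commit.get("company") or "").upper().replace("@", "").strip()
        if company:
            counts[company] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: 100.0 * n / total for name, n in counts.items()}


sample = [
    {"login": "alice", "type": "User", "company": "@Google", "is_merge": False},
    {"login": "bob", "type": "User", "company": "Red Hat", "is_merge": False},
    {"login": "bob", "type": "User", "company": "Red Hat", "is_merge": True},
    {"login": "dependabot[bot]", "type": "Bot", "company": "GitHub"},
    {"login": "ci-robot", "type": "User", "company": None},  # undetected bot
]
shares = company_shares(sample)
```

Note that commits from authors without a company are dropped from the denominator here, so the percentages describe only the contributors who disclose an affiliation.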
Finding CI robots is challenging, so I recommend using
Although some users do not show their companies, we can assume the distribution of such users is similar across all companies. Each company has a certain number of developers who do not show their affiliations. Thus, using
Handling situations where a person once worked for Company A and then moved to Company B is difficult. This is indeed a problem, and I do not have a good idea for it now.
To address cases where different names are used for the same company (e.g., Red Hat and Red Hat, Inc), we can use the shortest input company name to match all variations:
    # Normalize the raw profile value, then map it to the first target it contains.
    company = company.upper().replace('@', '').strip()
    count = False
    for target_company in companies:
        if company.find(target_company) != -1:
            company = target_company
            count = True
The code already tries to match all possible company names against the shortest input. In the above example, the input companies is |
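One edge case of plain substring matching is that a short target such as `INTEL` also matches inside an unrelated longer word. The sketch below is a variant of the snippet above, not the code in this PR: it wraps the same normalization in a hypothetical helper, `match_company`, and swaps `find` for a regular expression that only matches on word boundaries. It assumes the target names are already upper case.

```python
import re


def match_company(raw, targets):
    """Map a free-text company string to one of `targets`, or None if none match."""
    cleaned = raw.upper().replace("@", "").strip()
    # Try longer targets first so a more specific name wins over a shorter one.
    for target in sorted(targets, key=len, reverse=True):
        # The lookarounds stop "INTEL" from matching inside "INTELLIGENT".
        pattern = r"(?<!\w)" + re.escape(target) + r"(?!\w)"
        if re.search(pattern, cleaned):
            return target
    return None


targets = ["RED HAT", "INTEL"]
a = match_company("@Red Hat, Inc.", targets)
b = match_company("Intelligent Systems", targets)
c = match_company("  intel corporation ", targets)
```

Returning `None` for unmatched names keeps the "was this counted" decision explicit instead of relying on a separate flag.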
I think I didn't make my point clear. The key here is that we may never get an accurate rate from this method, and releasing inaccurate data would be irresponsible. An alternative could be using data from the OSSInsight API I mentioned earlier. We should clarify that we used their data and that it has not been proven to be accurate. |
Understood. I couldn't find a detailed introduction to this API, so I am wondering if the company of the PR creator is stored by OSS Insight each time. If so, we can assume their data is valid as it excludes the impact of resignations and new hires. Do you know if there is a Python library to access OSS Insight? It seems that I have to use |
Another big problem is the variations of a single company name. For example, OSS Insight data include
Do you have any good ideas to handle this problem? One straightforward solution is to provide a list of company names as input, remove special characters from the company names in OSS Insight, and try to match them. If a match is found, we can merge the variants into the given company name. However, this approach may have many corner cases that require manual adjustments. |
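The "strip special characters, then merge" idea could be sketched roughly as follows. The input shape (a dict of organization name to count) and the names `merge_variants` and `_key` are assumptions for illustration; the real OSS Insight response may look different. Matching is a simple prefix check on the stripped names, so it inherits the corner cases mentioned above, and anything unmatched is returned separately for manual review.

```python
import re
from collections import defaultdict


def _key(name):
    # Upper-case the name and drop everything except letters and digits.
    return re.sub(r"[^A-Z0-9]", "", name.upper())


def merge_variants(org_counts, companies):
    """Fold name variants in `org_counts` into the given `companies`.

    Returns (merged, unmatched): merged maps each given company name to the
    summed count of its variants; unmatched keeps the remaining names as-is.
    """
    keys = {company: _key(company) for company in companies}
    merged = defaultdict(int)
    unmatched = {}
    for org, count in org_counts.items():
        org_key = _key(org)
        hit = next(
            (c for c, k in keys.items() if k and org_key.startswith(k)), None
        )
        if hit is None:
            unmatched[org] = count
        else:
            merged[hit] += count
    return dict(merged), unmatched


# Hypothetical counts, not real OSS Insight data.
raw_counts = {"Red Hat": 10, "Red Hat, Inc.": 5, "@redhat": 3, "Acme": 2}
merged, unmatched = merge_variants(raw_counts, ["RED HAT", "INTEL"])
```

Because stripping punctuation also removes word boundaries, a name like "Intelligent Systems" would be folded into `INTEL` here; such cases need a manual exclusion list.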
Sorry for the delay. Do you think we can agree that: If so, we can use its data (it's fine using requests) and include a reference to it. |
Two new functions from the OSS Insight API:
|
check affiliation utils
repo-level affiliation check function, it can receive two types of inputs: