To provide an answer to the question "How many projects use more than one license?" #58

Open
fernandocastor opened this issue Jun 12, 2013 · 23 comments

@fernandocastor
Member

We need to implement and test the features required to use Groundhog to answer the question in the title of the issue. We then have to use it to actually answer the question.

@rodrigoalvesvieira
Contributor

Hmmm,

it brings another vital question to the table: how can we recognize the license(s) of a project? GitHub provides no metadata about it; in fact, GitHub does not even know which license a given project uses.

We have to design some heuristic/technique to grasp this information. The possible approaches that come to my mind are to look for a LICENSE file in the root of the project directory or look for a "License" section in the README.

Then we scrape the text and look for patterns that will allow us to determine which license it is.
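
A minimal sketch of the two lookups described above (a LICENSE file in the project root, or a "License" section in the README). It is written in Python for brevity and is not Groundhog code; the file-name stems, the function names, and the assumption that the README is markdown with `#` headings are all hypothetical choices made for illustration.

```python
import re
from pathlib import Path

# Hypothetical file-name stems; not a list taken from Groundhog.
LICENSE_STEMS = {"license", "licence", "copying", "unlicense"}
README_STEMS = {"readme"}

# A markdown heading whose text starts with "License" or "Licence".
LICENSE_HEADING = re.compile(
    r"^(#{1,6})[ \t]*licen[sc]e\b.*$", re.IGNORECASE | re.MULTILINE
)


def find_root_files(root, stems):
    """Return files directly under root whose name (without extension) is in stems."""
    return sorted(
        p for p in Path(root).iterdir()
        if p.is_file() and p.stem.lower() in stems
    )


def readme_license_section(text):
    """Return the body under the first 'License' heading of a markdown README, or None."""
    match = LICENSE_HEADING.search(text)
    if match is None:
        return None
    level = len(match.group(1))
    rest = text[match.end():]
    # The section ends at the next heading of the same or a higher level.
    nxt = re.search(r"^#{1,%d}\s" % level, rest, re.MULTILINE)
    body = rest[:nxt.start()] if nxt else rest
    return body.strip()
```

This only finds where the license text probably lives; deciding which license it is remains a separate step.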

@fernandocastor
Member Author

I agree. How hard would it be to implement one such heuristic and how comprehensive would the search be?

@dnr2
Member

dnr2 commented Jul 4, 2013

Hi @gustavopinto, how do you plan to extract and discover the license of the project files? Are you planning to use something like FOSSology? I've started studying it to see how we could use their system with Groundhog (maybe you already know that).

@gustavopinto
Member

I don't know yet how FOSSology works. But I think a good starting point could be analyzing all text files in the root directory. We read all of them and perform a string search for words like "license", "copy(right|left)", or similar ones. But I still don't know how to discover which exact license the text describes. :P
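
The keyword search described above could look roughly like this. It is a sketch in Python, not Groundhog code: the regular expression, the set of extensions treated as "text files", and the function names are assumptions made for illustration.

```python
import re
from pathlib import Path

# Keywords from the comment above: "license" (either spelling), "copyright", "copyleft".
LICENSE_WORDS = re.compile(r"\b(licen[sc]e[sd]?|copy(?:right|left))\b", re.IGNORECASE)

# Assumed notion of "text file": a few common extensions, or no extension at all.
TEXT_SUFFIXES = {"", ".txt", ".md", ".markdown", ".rst"}


def mentions_license(text):
    """True if the text contains at least one license-related keyword."""
    return LICENSE_WORDS.search(text) is not None


def candidate_files(root):
    """Yield root-level text files that mention a license-related keyword."""
    for path in sorted(Path(root).iterdir()):
        if not path.is_file() or path.suffix.lower() not in TEXT_SUFFIXES:
            continue
        # Undecodable bytes are replaced so that odd encodings do not abort the scan.
        text = path.read_text(encoding="utf-8", errors="replace")
        if mentions_license(text):
            yield path
```

As noted in the comment, this only tells us that a file talks about licensing, not which license it names.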

@gustavopinto
Member

Maybe we can catalog all well-known license names and then try to find these names in the text files (only in files whose text contains a 'license-or-similar-word'). We would then need two more license categories: unlicensed (a project in which no text file with a license-or-similar-word was found) and not-understandable-license (a project in which we found a text with a license word but were not able to discover which license it is). What do you say, @dnr2?
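
To make the proposal concrete, here is a rough Python sketch of the catalog lookup with the two extra categories, plus a counter for the question in the issue title. The catalog entries and phrases are illustrative assumptions, not a vetted list, and naive substring matching like this will produce false positives (for example, "mit license" also matches inside "submit license").

```python
import re

# Illustrative catalog only: canonical name -> phrases assumed to identify it.
CATALOG = {
    "MIT": ["mit license"],
    "Apache-2.0": ["apache license, version 2.0", "apache license 2.0"],
    "GPL": ["gnu general public license"],
    "BSD": ["bsd license"],
}

# The 'license-or-similar-word' filter from the comment above.
KEYWORD = re.compile(r"\b(licen[sc]e|copy(?:right|left))", re.IGNORECASE)

UNLICENSED = "unlicensed"
NOT_UNDERSTANDABLE = "not-understandable-license"


def classify(texts):
    """Map the contents of a project's text files to a set of license labels."""
    relevant = [t for t in texts if KEYWORD.search(t)]
    if not relevant:
        return {UNLICENSED}
    found = set()
    for text in relevant:
        # Lowercase and collapse whitespace so phrases match across line breaks.
        normalized = " ".join(text.lower().split())
        for name, phrases in CATALOG.items():
            if any(phrase in normalized for phrase in phrases):
                found.add(name)
    return found or {NOT_UNDERSTANDABLE}


def count_multi_license(projects):
    """Count projects (name -> list of file texts) with more than one catalog license."""
    return sum(1 for texts in projects.values() if len(classify(texts)) > 1)
```

The two fallback categories are always returned alone, so a result with more than one label means more than one catalog license was detected.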

@fernandocastor
Member Author

I think using FOSSology, or at least trying to use it, would be a better idea. Why duplicate the effort that someone else has already spent? Someone has actually used it to answer precisely that question in the recent past: http://www.theregister.co.uk/2013/04/18/github_licensing_study/

@gustavopinto
Member

Yes, you are right. Let's give FOSSology a chance.

@gustavopinto
Member

Here I am again. I think it will be very hard to integrate FOSSology into Groundhog, mainly because FOSSology is written in PHP 👎

Do any of you know other alternatives?

@fernandocastor
Member Author

But is it a library or a Web page/service?

@dnr2
Member

dnr2 commented Jul 12, 2013

Both!

I've tested their demo system at [1], but it seems to be only a web-based service (I couldn't find any API). So I believe that if we want to use FOSSology we have only two options: either we run our own FOSSology server or we rely on their web-based demo system.

IMO we should follow @gustavopinto's first idea and develop some kind of algorithm to discover the project's license, unless we find another FOSSology server.

[1] - https://fossology.ist.unomaha.edu/
[2] - http://www.fossology.org/projects/fossology/wiki/Live_Demo

@dnr2
Member

dnr2 commented Jul 12, 2013

You can test their system in: https://fossology.ist.unomaha.edu/?mod=agent_nomos_once

Just create a license text file and upload it; it really works!

@gustavopinto
Member

yeah, @dnr2 is right.

@rodrigoalvesvieira
Contributor

We are then set to discuss how we can develop this algorithm for identifying licenses. To begin the discussion, IMO there are three places in a GitHub project to search for the license text:

  1. LICENSE file. Many projects have a LICENSE text or markdown file in the source code root containing the license text. I think this is the optimal situation and, although not universal, it's quite popular, so a lot of the projects we search will have this characteristic.
  2. License text/declaration in the README file. README files are widely adopted by projects on GitHub (as elsewhere). I have observed that in many projects the LICENSE file is missing but the README contains the project's legal information.
  3. License text within the source code. This is a fairly common practice (especially among popular projects): the license is written down in some (or many) files in the source code. This is certainly the worst case because we could end up having to look at too many files. Fortunately, this is done locally, so it cannot consume any of our API request quota.
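
For the third case, a local scan could be limited to the top of each source file. The Python sketch below is only an illustration under assumptions: the list of source extensions, the 30-line header size, and the function names are arbitrary and not part of Groundhog.

```python
import re
from pathlib import Path

# Assumed source extensions and header size; both are arbitrary choices for this sketch.
SOURCE_SUFFIXES = {".java", ".py", ".c", ".js", ".rb"}
HEADER_LINES = 30

LICENSE_HINT = re.compile(r"\b(licen[sc]e[sd]?|copyright)\b", re.IGNORECASE)


def header_of(path, max_lines=HEADER_LINES):
    """Read only the first max_lines lines of a file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return "".join(line for _, line in zip(range(max_lines), handle))


def files_with_license_header(root):
    """Return source files under root (recursively) whose header mentions a license."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in SOURCE_SUFFIXES:
            if LICENSE_HINT.search(header_of(path)):
                hits.append(path)
    return hits
```

Reading only the header keeps the cost per file small, at the price of missing notices placed further down.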

BTW, this is another limitation of GitHub. The service does not formally collect licensing information for its hosted projects.

This is based on my experience using GitHub, which is of course subject to my personal biases. But I think it's a good starting point anyway. Before figuring out how we will identify/classify the licensing data, we must know how we will obtain it.

Bring your ideas onto the table!

-- rodrigo


@fernandocastor
Member Author

Sometimes the license is part of the Github webpage of a project, e.g.: https://github.com/johannilsson/android-pulltorefresh and https://github.com/openaphid/android-flip

@gustavopinto
Member

The project webpage is built from the README.md file. So this case is covered by item 2 of the list above:

  2. License text/declaration in the README file.

@gustavopinto
Member

Have any of you seen projects with two or more licenses?

@rodrigoalvesvieira
Contributor

I've never seen one.

-- rodrigo


@gustavopinto
Member

great news! https://github.com/blog/1530-choosing-an-open-source-license

We can observe whether projects use this feature. If they do, our problem will be solved!

@gustavopinto
Member

Guys, I have created this class to organize the well-known license names. If any of you knows another license, please add it to this file.

@rodrigoalvesvieira
Contributor

ok! great

@dnr2
Member

dnr2 commented Jul 18, 2013

@rodrigoalvesvieira and I were wondering whether we could use FOSSology's "One-Shot License Analysis" [1] together with some kind of browser automation API like Watij [2] to scan the files and get the related licenses...

=> Pros:
1- No need to reimplement an algorithm that already exists.
2- Their analysis will probably be more reliable and complete than ours.
=> Cons:
1- We will have to trust that they won't change their site structure soon.
2- Their site may not always be available.

So what do you guys think?

(Sorry for going back to the FOSSology discussion again, but I believe that this system really works, so we could save time by using it.)

[1] - https://fossology.ist.unomaha.edu/?mod=agent_nomos_once
[2] - http://watij.com/webspec-api/

@dnr2
Member

dnr2 commented Jul 18, 2013

By the way, here are some interesting links:

FOSSology algorithm: http://www.fossology.org/projects/fossology/wiki/Symbolic_Alignment_Matrix
List of software licenses: http://www.cmu.edu/computing/software/all/

@fernandocastor
Member Author

Hi Danilo.

Could we follow your plan without spending a whole lot of effort? If so, I think it's worth a shot. I think that, in our next meeting, we should discuss how to make Groundhog easy to extend. Rodrigo and I have talked a little bit about that, but we need to settle it.
