A Repository of Data for Improving Code Search Tools
If we have learned anything in the past ten years of code search (and text analysis) research, it is that analyzing source code in terms of natural language is challenging. Each researcher (re)discovers that synonyms are different in software, that splitting identifiers is harder than it sounds, and that creating gold sets for evaluation is a research problem in itself! To make a lasting impact on the state of the art and the state of the practice, we cannot continue to rediscover these issues and redevelop these datasets. Thus, I propose using this site as a place where we can upload relevant datasets and link to relevant libraries.
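To give a feel for why splitting identifiers is harder than it sounds, here is a minimal sketch of a naive splitter that relies only on underscores, digits, and case changes. The function name `split_identifier` and the sample identifiers are hypothetical, chosen for illustration; they are not taken from any dataset or library hosted here.

```python
import re


def split_identifier(identifier):
    """Naively split an identifier using underscores, digits, and case changes.

    Hypothetical helper for illustration only; it uses no dictionary or
    corpus statistics, so it cannot split same-case identifiers.
    """
    words = []
    # First break on underscores and any other non-word characters.
    for part in re.split(r"[_\W]+", identifier):
        # Then pick out, in order of preference:
        #   - an uppercase run not followed by a lowercase letter (acronyms),
        #   - an optionally capitalized lowercase word,
        #   - a run of digits.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [word.lower() for word in words]


if __name__ == "__main__":
    print(split_identifier("getFileName"))      # ['get', 'file', 'name']
    print(split_identifier("XMLParser"))        # ['xml', 'parser']
    print(split_identifier("max_line_count2"))  # ['max', 'line', 'count', '2']
    print(split_identifier("filename"))         # ['filename']
```

The first three cases work because the identifier carries an explicit boundary cue. The last one does not: `filename` comes back as a single token, so a search for "file" would miss it. Handling such cases typically requires a dictionary or frequency data, which is exactly the kind of resource worth sharing rather than rebuilding.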