Hello! Thank you for your interest in my project.
In this repository, I am building a searchable database of OLS projects to make them easier to navigate for past, current, and future OLS participants. Open Life Science (OLS) is a mentorship programme for learning about Open Science and Open Leadership, in which mentees carry out individual or group projects. With over 200 projects, the list is now very rich but also difficult to navigate.
With this database, it will be possible to browse all projects from one page, filter them by keywords and/or project members, and track the projects' progress and outputs.
Contribution guidelines will follow soon. In the meantime, if you have ideas or comments, please open an issue or reach out to Angelica Maineri.
The prototype of the database can be seen at https://angelicamaineri.github.io/OLS-project-database/
This project is carried out in the framework of OLS-7.
In this section, I describe the steps undertaken to get from the input data to the prototype.
Prepare your project, formulate your vision, and write down or draw what you want to achieve.
In my case, I had developed a data model which was never fully implemented, but it helped me to expand my thoughts. I also drew a fictional example of what I would like the database (or even better, the catalogue) to look like at the end (see here). Hopefully this can inspire somebody else to pick up and continue this work!
The first step is to create (or get) the input spreadsheet in any format that can be imported into R (e.g. .csv, .xls, .dta, etc.).
In my case, I wanted a list of the OLS projects that are already listed on the OLS website (by cohort). After talking with an expert (thanks Berenice!) I found out that such a spreadsheet already existed in the OLS paper repository in the form of a project spreadsheet, together with an overview of the people and their GitHub handles. Hence I forked the repository and then loaded the data into R via the scripts from Phase 2.
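As a rough illustration of the import step, here is a minimal sketch in base R. The file contents and column names (`name`, `cohort`, `keywords`) are made up for the example and are not the actual OLS spreadsheet; the example writes a tiny temporary file so that it runs anywhere.

```r
# Write a tiny, hypothetical spreadsheet to a temporary file.
# In practice you would point read.csv() at your own file instead.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "name,cohort,keywords",
    "Project A,1,\"genomics, open data\"",
    "Project B,2,community"
  ),
  csv_path
)

# Import the spreadsheet as a data frame; keep text columns as character.
projects <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect the structure to check that the columns were read as expected.
str(projects)

# Other formats need extra packages (assumed to be installed), e.g.:
#   readxl::read_excel("projects.xlsx")   # Excel files
#   haven::read_dta("projects.dta")       # Stata files
```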
This step is not mandatory, but often needed. From the initial spreadsheet, you might need to clean your data, rearrange columns, add information, you name it. I prefer to do this as much as possible via scripts to keep track of the changes.
In my case, I wanted to enrich the initial dataset in several ways:
- Add the domain. I decided to use some very coarse categories based on the way the Dutch research council classifies domains: LSH = Life Sciences and Health; NES = Natural and Engineering Sciences; SSH = Social Sciences and Humanities. I didn't want to tag each project manually, so I tried to map keywords to domains (the mapping is in ./Data-enrichment/Data/domains.csv) and then tag the projects automatically based on their keywords. The script can be found in ./Data-enrichment/Scripts/1_tag_domain.R. Unfortunately, not many projects have a matching domain based on the keywords. See the section "Next steps" for some ideas on how to improve this.
- Add the links to the GitHub profiles of all the people mentioned (participants and mentors). The steps to do it in practice are included in ./Data-enrichment/Scripts/2_tag_people.R.
At the end of this phase, I exported a processed and enriched .csv file.
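The two enrichments can be sketched roughly as follows. This is not the code from 1_tag_domain.R or 2_tag_people.R: the mapping, the project rows, the column names, and the helper functions `tag_domain()` and `github_url()` are all hypothetical, and the sketch assumes that keywords are stored as a comma-separated string and that a profile link is simply `https://github.com/` followed by the handle.

```r
# Hypothetical keyword-to-domain mapping (the real one lives in domains.csv).
domains <- data.frame(
  keyword = c("genomics", "bioinformatics", "physics", "sociology"),
  domain  = c("LSH", "LSH", "NES", "SSH"),
  stringsAsFactors = FALSE
)

# Hypothetical projects with comma-separated keywords.
projects <- data.frame(
  name     = c("Project A", "Project B", "Project C"),
  keywords = c("Genomics, open data", "sociology, physics", "community"),
  stringsAsFactors = FALSE
)

# Return the domains matched by one keyword string, or NA if nothing matches.
tag_domain <- function(keywords, mapping) {
  # Split on commas, trim whitespace, and compare case-insensitively.
  kw <- tolower(trimws(strsplit(keywords, ",", fixed = TRUE)[[1]]))
  hits <- unique(mapping$domain[match(kw, tolower(mapping$keyword))])
  hits <- hits[!is.na(hits)]
  if (length(hits) == 0) NA_character_ else paste(sort(hits), collapse = "; ")
}

projects$domain <- vapply(
  projects$keywords, tag_domain, character(1),
  mapping = domains, USE.NAMES = FALSE
)

# Build a profile URL from a GitHub handle; missing or empty handles stay NA.
github_url <- function(handle) {
  handle <- sub("^@", "", trimws(handle))
  ifelse(
    is.na(handle) | handle == "",
    NA_character_,
    paste0("https://github.com/", handle)
  )
}

profile_links <- github_url(c("@octocat", " someone ", NA, ""))
```

Exact matching on keywords is strict, which is consistent with the low number of matches reported above: a project tagged only with "community" gets no domain at all in this sketch.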
As a side output, since I was working with the keywords, I also made a wordcloud of keywords! See [./Data-enrichment/wordcloud_1.pdf](./Data-enrichment/wordcloud_1.pdf).
At this point, you should have a spreadsheet (I had a .csv) which is actually the table you want to display. In this phase, you create an R Markdown file, which will eventually become the webpage you display via GitHub Pages. That means you can write text and explanations as you see fit, using Markdown.
In my R Markdown file, I first load two packages (tidyverse and DT), then load the .csv file and display it using the datatable function, and finally 'knit' all this into an HTML file named index.html, which I save in a new folder called docs. You can check the datatable documentation to change the display and functionality of the table (e.g. adding a search bar for each column, choosing how many rows are displayed by default, excluding some columns, etc.).
At this point, you can just go to the settings of your (public) GitHub repository, go to Pages, and deploy from main/docs. At the top of the page, the URL to your Page will be available. Congrats!
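For illustration, the R chunk inside such an R Markdown file could look roughly like the sketch below. It is not the content of project-database.Rmd: the data frame is a made-up stand-in for the enriched .csv, and the chosen options (column filters at the top, 25 rows per page, no row names) are just examples of what the datatable function allows. The sketch assumes the DT package is installed.

```r
library(DT)

# Hypothetical stand-in for the enriched table; in the real file this
# would be loaded from the exported .csv instead.
projects <- data.frame(
  name     = c("Project A", "Project B", "Project C"),
  cohort   = c("1", "2", "2"),
  keywords = c("genomics, open data", "sociology", "community"),
  stringsAsFactors = FALSE
)

# Build the interactive table:
# - filter = "top" adds a search box above each column
# - pageLength sets how many rows are shown by default
# - rownames = FALSE hides the row index column
table_widget <- datatable(
  projects,
  filter   = "top",
  rownames = FALSE,
  options  = list(pageLength = 25)
)

# Printing the widget in an R Markdown chunk embeds it in the knitted page.
table_widget

# Knitting can also be done from the console rather than the Knit button,
# e.g. (file names as used in this README, run from the folder with the Rmd):
#   rmarkdown::render("project-database.Rmd",
#                     output_file = "index.html", output_dir = "docs")
```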
- Make sure you have access to an updated version of projects.csv and people.csv from https://github.com/open-life-science/ols-program-paper/data
- Open ./Data-enrichment/Data-enrichment.Rproj
- Run 1_tag_domain.R
- Run 2_tag_people.R
- Save df2 and export ./Data-enrichment/Data/projects_domain_people.csv
- Run project-database.Rmd
- (check that you indeed saved index.html in the docs folder)
- Commit changes and push to origin
- Admire your work on https://angelicamaineri.github.io/OLS-project-database/
Here are some ideas and suggestions for next steps. They might be helpful to someone who wants to pick this up as their own OLS project in a future cohort!
- Enrich data. The dataset could be enriched with much more information that would help with finding and sorting projects, and also with gaining insights. For instance, links to related materials (e.g. slides on Zenodo, links to repositories).
- Automate the enrichments. Somebody who is more familiar with NLP techniques may be able to extract information and annotations from the projects’ descriptions.
- Embed prototype in the OLS website. It would be nice if the database became part of the projects’ page on the OLS website, and in the process the interface could be improved and rendered in a more aesthetically pleasing way.