diff --git a/_episodes/01-introduction.md b/_episodes/01-introduction.md index 9dcfd3e..7c1eda8 100644 --- a/_episodes/01-introduction.md +++ b/_episodes/01-introduction.md @@ -1,15 +1,48 @@ --- title: "Introduction" -teaching: 0 +teaching: 10 exercises: 0 questions: -- "Key question" +- "Why are best practices necessary in research software?" +- "How can Open Source help improve the quality of software?" objectives: -- "First objective." +- "Basics of Open Science in research software" +- "Introduction to the FAIR principles" keypoints: -- "First key point." +- "Best practices in research software are tied to the FAIR principles" +- "The discussed best practices are not tailored to software developers, but rather to a wider audience" --- +Scientific research nowadays relies heavily on the computational aspects provided by computer software, yet software is not always developed following practices that ensure its quality and sustainability. A recent publication ([Four simple recommendations to encourage best practices in research software](https://f1000research.com/articles/6-876/v1)) provided a simple yet robust framework of recommendations that encourage the adoption of existing best practices in developing research software. These recommendations are designed around Open Science values, and provide practical suggestions that contribute to making research software and its source code more discoverable, reusable and transparent. + +Based on these recommendations, this lesson provides both the underlying context and some practical exercises that establish their usefulness in the long term. The subsequent episodes of this lesson are structured in the form of one episode per recommendation: +1. Make source code publicly accessible from day one +2. Make software easy to discover by providing software metadata via a popular community registry +3. Adopt a license and comply with the license of third-party dependencies +4. 
Define clear and transparent contribution, governance and communication processes + +## Open Science + +"_When all researchers are aware of Open Science, and are trained, supported and guided at all career stages to practice Open Science, the potential is there to fundamentally change the way research is performed and disseminated, fostering a scientific ecosystem in which research gains increased visibility, is shared more efficiently, and is performed with enhanced research integrity._" [Open Science Skills Working Group Report (2017)](https://ec.europa.eu/research/openscience/pdf/os_skills_wgreport_final.pdf#view=fit&pagemode=none) + +Discussing best practices in developing research software, one is bound to touch on the subject of Open Science. Modern research relies on software, and building upon or reproducing that research requires access to the full source code behind that software ([ref](https://open-science-training-handbook.gitbook.io)). Sharing software used for research (whether computational in nature, or that relies on any software-based analysis/interpretation) is a necessary, though not sufficient, condition for reproducibility. In addition to reproducibility, sharing software openly allows developers to receive career credit for their efforts, either through direct citation or via published software articles. We discuss all these aspects in the rest of this lesson. + +## FAIR principles + +Though not all the recommendations from the FAIR data principles directly apply to software, there is good alignment between the discussed best practices and the FAIR data principles. The FAIR principles are a set of community-developed guidelines to ensure that data or any digital object are Findable, Accessible, Interoperable and Reusable. The FAIR principles specifically emphasize enhancing the ability of machines to automatically find and use data or any digital object, and support its reuse by individuals. 
Standards for the description, interoperability, citation etc. are at the core of these principles ([ref](https://www.incf.org/activities/standards-and-best-practices/what-is-fair)). + +The FAIR Guiding Principles, as described in [Scientific Data by Wilkinson et al](https://www.nature.com/articles/sdata201618): +- To be **Findable** +- To be **Accessible** +- To be **Interoperable** +- To be **Reusable** + +## Why best practices in research software + +There are many best practices currently in place that are aimed at and tailored for software developers. These include aspects such as test-first programming and test coverage ([ref](https://github.com/r-lib/covr)), code quality ([ref](https://qaas.cyclopt.com/)), continuous integration ([ref](https://travis-ci.org)), etc. Unlike many software development best practices, this lesson aims to target a wider audience, particularly research funders, research institutions, journals, group leaders, and managers of projects producing research software. The adoption of these recommendations offers a simple mechanism for these stakeholders to promote the development of better software and an opportunity for developers to improve and showcase their software development skills. + +## Starting with a challenge + > ## Challenge: Create a project on GitHub > - make sure that you have a GitHub account > - click "+ -> new repository" @@ -20,4 +53,4 @@ keypoints: > {: .challenge} -- congratulations! you've created your first repository :) +Congratulations, you've created your first repository! This is the first step towards making your source code publicly accessible from day one. 
diff --git a/_episodes/03-use-registry.md b/_episodes/03-use-registry.md index f0c6756..f7ee1ac 100644 --- a/_episodes/03-use-registry.md +++ b/_episodes/03-use-registry.md @@ -1,13 +1,12 @@ --- title: "Make software easy to discover by providing software metadata via a popular community registry" teaching: 90 -exercises: 0 +exercises: 45 questions: - "Why are metadata important in research software?" - "What are good metadata?" - "Which are the most commonly used platforms for registering research software data." objectives: - - "Understand the importance of metadata" - "Understand why metadata are necessary for software discoverability" - "Have a clear concept of what good metadata entail" @@ -26,24 +25,28 @@ You are already using metadata, but you might not be fully aware of it. **Definition** -Metadata (for data) can be defined as ["a set of data that describes and gives information about other data"](https://en.wikipedia.org/wiki/Metadata) or ["Meta is a prefix that in most information technology usages means 'an underlying definition or description'"](https://whatis.techtarget.com/definition/metadata). For some more information and examples, [follow this link](https://web.archive.org/web/20160306145239/http://www.theguardian.com/technology/interactive/2013/jun/12/what-is-metadata-nsa-surveillance#meta=0000000). +Metadata (for data) can be defined as: +- ["A set of data that describes and gives information about other data"](https://en.oxforddictionaries.com/definition/metadata) +- ["Meta is a prefix that in most information technology usages means 'an underlying definition or description'"](https://whatis.techtarget.com/definition/metadata) + +For some more information and examples, [follow this link](https://web.archive.org/web/20160306145239/http://www.theguardian.com/technology/interactive/2013/jun/12/what-is-metadata-nsa-surveillance#meta=0000000). 
-For the software case, we have defined metadata as "a set of data that describes and gives information about software with the purpose of make it findable/discoverable". +For the software case, we have defined metadata as "_a set of data that describes and gives information about software with the purpose of making it findable/discoverable_". > ## Exercise: Think about metadata > #### Time 5 minutes > > Let's think about why metadata is useful to describe a publication. We have for instance a title and authors. What other metadata can you think of? Why would you say those are metadata? -> +> > By the end of this exercise, you should be able to better understand the difference between data and metadata. > > > ## Solution > > > > Some other common metadata for publications are starting page, ending page, journal where it was published, volume and item. -> > -> > They are considered metadata because they give you information about the publication but they are not the publication. -> > +> > +> > They are considered metadata because they give you information about the publication but they are not the publication. +> > > {: .solution} {: .discussion} @@ -51,8 +54,9 @@ For the software case, we have defined metadata as "a set of data that describes **Definition** From Wikipedia, ["Software documentation is written text or illustration that accompanies computer software or is embedded in the source code. It either explains how it operates or how to use it, and may mean different things to people in different roles."](https://en.wikipedia.org/wiki/Software_documentation) +All software documentation can be divided into two main categories ([ref](https://www.altexsoft.com/blog/business/software-documentation-types-and-best-practices/)), _Product_ documentation and _Process_ documentation, with the former further broken down into _System_ and _User_ documentation. However, in the majority of cases in research software, documentation refers to _User documentation_, i.e. 
information in the form of manuals that are mainly prepared for end-users of the product and system administrators. As such, (user) documentation includes tutorials, user guides, troubleshooting manuals, installation, and reference manuals. -That is, metadata helps describe the software in a standardised way, so it can be findable/discoverable, by both machines and humans. +In contrast to documentation, metadata helps describe the software in a standardized way, so it can be findable/discoverable, by both machines and humans. > ## Software metadata vs documentation > @@ -61,6 +65,25 @@ That is, metadata helps describe the software in a standardised way, so it can b {: .callout} +> ## Exercise: Highlighting the importance of metadata +> #### Time 20 minutes +> +> **Part 1** +> Split into groups of 3-4; each group decides on "keywords" / placeholders for describing a movie (do **not** include the title!). Such keywords might include attributes such as _Director_, _Year_, _Actor(s)_, _Genre_, _Duration_, _Setting_, etc. As soon as the placeholders are defined, every person in the group thinks of a movie and tries to describe it using those keywords. +> Do people identify the movie? If you put the keywords in Google, does it give you back the correct movie? +> +> **Part 2** +> Do the same thing but for a research tool (it can be of the same scientific discipline as the people comprising the group, or a general purpose tool). +> +> By the end of this exercise, you should be able to internalize what metadata are and what **good** metadata are. +> +> > ## Solution +> > +> > TODO +> > +> {: .solution} +{: .discussion} + ## What are the existing commonly used standards descriptions for software metadata @@ -78,15 +101,14 @@ A standard can be defined as "a structure agreed and adopted by a community" or > {: .callout} -TODO: difference between control vocabulary and ontology. - +**Controlled vocabularies** provide a way to organize knowledge for subsequent retrieval. 
A controlled vocabulary is usually a carefully selected list of words and phrases, which are used to tag units of information (document or work) so that they may be more easily retrieved by a search. The fundamental difference between an **ontology** and other **controlled vocabularies**, e.g., thesauri, is the level of abstraction and relationships among concepts. A formal ontology is a controlled vocabulary expressed in an ontology representation language. ([ref](https://semwebtec.wordpress.com/2010/11/23/contolled-vocabulary-vs-ontology/)) Examples - The [Gene Ontology](https://http://www.geneontology.org/) (GO) is the framework for the model of biology. The GO defines concepts/classes used to describe gene function, and relationships between these concepts. It classifies functions along three aspects: molecular function, cellular component and biological process - The [Climate and Forecast (CF) metadata](https://http://cfconventions.org/) are designed to promote the processing and sharing of files created with the NetCDF API. The CF conventions are increasingly gaining acceptance and have been adopted by a number of projects and groups as a primary standard. The conventions define metadata that provide a definitive description of what the data in each variable represents, and the spatial and temporal properties of the data. - The [myGrid Ontology](https://http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.4077&rep=rep1&type=pdf) is designed to support web service discovery and composition in the bioinformatics domain. myGrid supports in silico experiments in the life sciences, enabling the design and enactment of workflows as well as providing components to assist service discovery, data and metadata management. The myGrid ontology is one component in a larger semantic discovery framework for the identification of the highly distributed and heterogeneous bioinformatics services in the public domain. 
-- The [Sequence Ontology (SO)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1175956/) is a structured controlled vocabulary for genomic annotation. SO provides a common set of terms and definitions that will facilitate the exchange, analysis and management of genomic data. -- The [Darwin Core](https://http://rs.tdwg.org/dwc/) standard is body of standards. It includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing reference definitions, examples, and commentaries. +- The [Sequence Ontology (SO)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1175956/) is a structured controlled vocabulary for genomic annotation. SO provides a common set of terms and definitions that will facilitate the exchange, analysis and management of genomic data. +- The [Darwin Core](https://http://rs.tdwg.org/dwc/) standard is body of standards. It includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing reference definitions, examples, and commentaries. ## Standards for software metadata @@ -94,9 +116,6 @@ Examples - How can I make sure that I am using a good standard? - How to choose a standard? -The process of choosing a "good standard" is to see what is being (extensively) used by the community. It should ideally be well known and highlighted by the existence of a certain number of citations. - - The process in choosing a "good standard" is to see what is being (extensively) accepted in the scientific community. It should ideally be well known and highlighted by the existence of a certain number of citations. 
Example of standards: @@ -113,7 +132,6 @@ EDAM is a comprehensive ontology of well-established, familiar concepts that are - [BioSchema.org](http://bioschemas.org/specifications/Tool/specification/) Bioschemas is a community project built upon [schema.org](schema.org). It provides customisations, a.k.a. profiles, on top of schemas.org types and properties. Profiles include examples together with guidelines regarding cardinality, marginality and reuse of well-known vocabularies in Life Sciences. - The following table shows examples of metadata you would find on a software package in the SoftwareX journal. It gives a short description and a possible value. |Description|Value| @@ -128,7 +146,8 @@ The following table shows examples of metadata you would find on a software pack | Support email for questions | aperson@gmail.com | -> ## Exercise: Using a registry, e.g., bio.tools +> ## Exercise: Using a standard for creating metadata +> #### Time 10 minutes > > Ask the participants to go to the https://schema.org/SoftwareApplication and try to map the "metadata" they prepared for the previous movie/Software exercise. How many have you mapped? What are the top x metadata entries that you have missed and you think are necessary? What are your unmapped "metadata" that do not have a respective entry in schema - is there a closely related entry, or a composite one? > @@ -145,23 +164,20 @@ The following table shows examples of metadata you would find on a software pack {: .callout} - ## Existing platforms and tools for software metadata - The [schema.org](https://schema.org/SoftwareApplication) is the raw form of the possible metadata fields. It is very detailed but it is not readily useful for writing down the metadata of your software. Bioschemas has made an effort to narrow down and customise schema.org. types relevant for Life Sciences, one of them is the SoftwareApplication type. 
More information is available on [this link](http://bioschemas.org/specifications/Tool/specification/). -Adding metadata describing your software is commonly done via available platforms, that internally use those standards. The following list indicates some of the most prevelant ones: +Adding metadata describing your software is commonly done via available platforms, that internally use those standards. The following list indicates some of the most prevalent ones: 1. [bio.tools](https://bio.tools) This is a portal to bioinformatics resources worldwide, aimed to help bioinformaticians and scientists find, understand, compare and select resources, as well as use and connect them in workflows. As a platform, it makes use of the [EDAM ontology](http://EDAMontology.org), and therefore provides a standardized vocabulary for providing metadata. Moreover, it includes aspects such as `language` and `platform`. However, it does not support filters by `language` nor does it assign a doi to the software (which is to be expected, as it serves as a registry and not a repository). - 2. [OMICTools](https://omictools.com/) -This is a commercial service providing a registry of tools relevant in life sciences, containing sufficient metadata for connecting different tools in a single pipeline. However, it is not an open registry, i.e. the authors need to contact the development team in order for a tool to be included. +This is a **commercial** service providing a registry of tools relevant in life sciences, containing sufficient metadata for connecting different tools in a single pipeline. However, it is not an open registry, i.e. the authors need to contact the development team in order for a tool to be included. 3. 
[Astrophysics Source Code Library](http://ascl.net/) The Astrophysics Source Code Library (ASCL) is a free online registry for source codes of interest to astronomers and astrophysicists and lists codes that have been used in research that has appeared in, or been submitted to, peer-reviewed publications. It is fairly simple compared to other registries, but it focused on a particular domain (astrophysics). @@ -179,13 +195,14 @@ Zenodo is a general-purpose open access repository. > ## Exercise: Using a registry, e.g., bio.tools +> #### Time 10 minutes > -> Connect to the test instance of [bio.tools](link to test instance) and create a new entry on a software tool / github repo that you own or any of your favourite tools. You could find useful having a look to their [documentation on adding a tool](http://biotools.readthedocs.io/en/latest/user_guide.html#add-content). +> Connect to the test instance of [bio.tools](https://dev.bio.tools/) and create a new entry for a software tool / GitHub repo that you own or for any of your favourite tools. You may find it useful to look at their [documentation on adding a tool](http://biotools.readthedocs.io/en/latest/user_guide.html#add-content). > -> TODO: bio.tools image +> ![bio.tools main page](https://raw.githubusercontent.com/SoftDev4Research/4OSS-lesson/gh-pages/fig/bio-tools-main-ui.png) > > Once you are done, ask any of your colleagues what their tool is about. Use the search box to find it. -> +> > > ## Solution > > > > We have a tool to visualize protein sequence annotations developed in JavaScript and hosted in GitHub. A publication indexed in PubMed is already available. @@ -197,17 +214,20 @@ Zenodo is a general-purpose open access repository. > > If you go to the search box and look for "protein visualisation", you will see an entry like: > > TODO : add image > > -> > That's it! You have published your tool in the development version or bio.tools. You are ready to go live and published yout tool for real! 
Remember bio.tools focuses on life sciences. +> > That's it! You have published your tool in the development version of bio.tools. You are ready to go live and publish your tool for real! Remember bio.tools focuses on life sciences. > {: .solution} {: .challenge} ## Wrap up -We are increasing visibility, because we are supporting findability by adding the correct/good metadata -> Connect this with FAIR principles. - - -**Instructor Notes / Setup** -- [Local Installation of Zenodo](https://github.com/zenodo/zenodo/blob/master/INSTALL.rst) -It may be interesting to have a local installation of zenodo to play around. The instructions using Docker are available on the link above. +By adding good enough metadata to our research software, we are directly supporting its findability, thus increasing the overall visibility of the software. This is tied to the **findable** aspect of the FAIR principles mentioned in the introductory episode of this lesson. The connection can be further enhanced through the rest of the best practices. For example, metadata can also support accessibility if you include a license there, or interoperability if you include input/output data types or formats. Some metadata might support reusability as well. -- [Bio-Linux](http://environmentalomics.org/bio-linux-software-list/) -It is a final OS containing tools that have been already published, connected metadata, etc +> ## Optional Challenge: Mapping your metadata to the FAIR principles +> +> Using the metadata you identified for your tool earlier, try to map each one to the four FAIR principles. You can see an example in Table 1 of the [F1000 paper](https://f1000research.com/articles/6-876/v1). 
+> +> +> > ## Solution +> > +> > +> {: .solution} +{: .challenge} diff --git a/_extras/guide.md b/_extras/guide.md index 50d9d0b..6e8f3b3 100644 --- a/_extras/guide.md +++ b/_extras/guide.md @@ -3,3 +3,12 @@ layout: page title: "Instructor Notes" --- FIXME + +- Notes from the metadata episode + - [Local Installation of Zenodo](https://github.com/zenodo/zenodo/blob/master/INSTALL.rst) + + It may be interesting to have a local installation of Zenodo to experiment with. The instructions using Docker are available at the link above. + + - [Bio-Linux](http://environmentalomics.org/bio-linux-software-list/) + + It is a complete Operating System (OS) containing tools that have already been published, connected metadata, etc. diff --git a/fig/bio-tools-main-ui.png b/fig/bio-tools-main-ui.png new file mode 100644 index 0000000..311e0d0 Binary files /dev/null and b/fig/bio-tools-main-ui.png differ diff --git a/setup.md b/setup.md index 12b3c0a..89ebe7d 100644 --- a/setup.md +++ b/setup.md @@ -3,4 +3,15 @@ layout: page title: Setup root: . --- -FIXME + +In order to be prepared for the lesson, you need to have accounts on the following (free) services: + +1. GitHub + +If you don't already have a [GitHub](https://github.com/) account, please follow the guide [here](https://services.github.com/on-demand/intro-to-github/create-github-account) in order to create one. + +2. BioTools + +[bio.tools](https://bio.tools/) is a portal to bioinformatics resources worldwide, aimed to help bioinformaticians and scientists find, understand, compare and select resources as well as use and connect them in workflows. + +For the purposes of this lesson, we will be using the [**developer** instance of bio.tools](https://dev.bio.tools/) so that we can add test content (it is removed periodically).