Authors
- Arthur Newton (SURF)
- Claudio Cacciari (SURF)
- Maithili Kalamkar Stam (SURF)
License Copyright (c) 2020, SURFsara. All rights reserved. This project is licensed under the GPLv3 license. The full license text can be found in LICENSE.
Learning about the CLI tool icommands of iRODS and do basic data handling of data stored in an iRODS instance:
- uploading/downloading data
- adding metadata to data objects/collections
- querying based on metadata
- an account on Lisa or Snellius system at SURF
In this course we will use the Lisa login node as our user interface. If you don't have access to Lisa or Cartesius system and you want to only use the basic data handling of iRODS without the SLURM data processing, you have to install the icommands on your local machine.
Login to the Lisa compute cluster with your appropriate credentials:
ssh lcur#@lisa.surfsara.nl
Since 2020, the icommands are installed on Lisa (and Spider) system wide.
If you have not done it yet, clone this repository to your home folder:
git clone https://github.com/ccacciari/iRODS-RDM-HPC-course.git
cd iRODS-RDM-HPC-course
Otherwise skipt this step.
Please if you have loaded the irods module in a previous tutorial/exercise, unload it:
module unload iRODS-iCommands/4.3.0
If you now use the command ienv
, you should see somethiong like this:
ienv
irods_version - 4.2.8
...
To connect to iRODS you need to create the environment file within your homefolder of Lisa. The environment file in the git repository can be used:
mkdir -p ~/.irods
cp irods_environment.json ~/.irods/irods_environment.json
Use your favourite text editor (nano or vi) to change your username.
"irods_user_name": "YOUR DEMO ACCOUNT NAME",
Now initialize the connection:
iinit
Enter your password and you should be logged into iRODS.
A scrambled password file, ~/.irods/.irodsA
, will be created which will be used for subsequent connections.
To test your connection issue a simple command:
ils
You should see the listing of the collection defined as the home folder in your environment file, irods_home
.
You can find help for the icommands with:
ihelp
which will show you all the possible commands.
Additionally, the following command gives you more information about the iRODS environment (typically not needed):
ienv
Note that in iRODS files are called data objects, and folders are called collections.
The icommands have basic file handling functionality that have their (almost) equivalent in Linux terminal programs but with an i
in front of the name, e.g. ls
and ils
, cp
and icp
, mv
and imv
, pwd
and ipwd
, mkdir
and imkdir
, cd
and icd
, rm
and irm
.
If you want to know more about the available icommands use for example ils -h
.
However, please do realize iRODS is not a filesystem like you have on your personal computer.
Data handling is done differently.
Use ils
to see what collections are available to you and navigate to that collection with icd
:
ils
icd <collection name>
Note that you can't use bash tab completion with paths in iRODS.
You can create a collection with imkdir
.
For this course it is convenient to create a collection for yourself:
imkdir <favorite color>_<username>
icd <favorite color>_<username>
If you get an error that you are not allowed to create a collection, be sure to have navigated to a collection where you do have the permission to write. You can use:
ils -A
to check the permissions (-A
stands for Access Control Lists) of data objects and collections.
Food for brain:
- If you run a command 'ils /' would you see all the collections?
- Can different collections be at different physical locations/storage backends?
- Within a collection can you have objects stored at different physical locations/storage backends?
- How do you know what permissions you have on a collection?
In order to put a file in iRODS:
iput source_file destination_dataobject
when no destination data object is given, the local filename will be used.
iRODS will issue a warning if the file already exists (which you can force to do with the -f
option).
In order to put a folder in iRODS use the recursive option:
iput -r source_folder destination_collection
You need to add the -r
option for recursively putting data in iRODS.
Depending on the policy of the iRODS server, you can also immediately calculate the checksum while uploading a file:
iput -K source_file
Note that his can also be done afterwards with the ichksum
command.
There are many more options for iput
which could (or could not) optimize data transfer, e.g. -b
for bulk upload to overcome network overhead and -N
for number of threads.
However, the default behaviour is already high performant in most cases if the iRODS server is configured correctly.
You could check with ils
what has happened.
What does ils -l
or ils -L
show?
- Upload
alice.txt
to your home directory in iRODS - Create a new collection
aliceInWonderland
within your home directory - Move
alice.txt
into this new collection.
Food for brain:
- Should/can you use iput and icp commands interchangeably, when, why and why not?
To download a data object, you can use the iget
command:
iget source_dataobject destination_file
Again, if the local destination is not specified, the name of the data object will be used.
If the filename already exists, iget
will result in an error.
To download a collection you have to again specify the -r
option:
iget -r source_collection destination_folder
- download the data object
alice.txt
asaliceRestore.txt
- download the collection
aliceInWonderland
Food for brain:
- With the iget command, can you put data to another remote location instead of the Lisa login node?
You can remove files using irm
:
irm dataobject
It depends on the configured policy of the iRODS instance whether there is a trashbin.
Note that removing a file is just a rename.
If you really want to delete a file either use irmtrash
after removing or irm -f
upon removing a data object.
- how would you be able find the removed file back?
Food for brain:
- If you run the command 'ils /surfZone1', what do you see?
- Can you find your deleted files here?
- Can you see another user's deleted data as well?
In iRODS a data object is not only the bitstream and the filename, but user defined metadata is part of the data object. Metadata is essential to give a data object more context. You can add all kinds of metadata to a data object: administrative metadata (about the management, processing of the data), descriptive metadata (about the content, author, etc) or provenance metadata (about the history of the data object). This makes it very powerful, as metadata and data can not be out of sync. You can manually add metadata, which can also be scripted or let an iRODS rule add metadata.
In iRODS, per metadata item you can store three strings: key, value, unit. You can dismiss the unit string, but the key/value pair needs to be unique.
Manual metadata handling can be done via the imeta
command.
For each command, -d, -C, -R, or -u is used to specify which type of object to work with: data objects, collections, resources, or users, respectively.
To add metadata to a data object or a collection:
imeta add -d dataobject Key Val Unit
imeta add -C collection Key Val Unit
Again, it is possible to leave out the unit, and you can set multiple key value pairs with the same key. There is no logical limit to the number of metadata items per data object or collection.
You can remove metadata as such:
imeta rm -d dataobject Key Val Unit
imeta rm -C collection Key Val Unit
where you do have to explicitly specify the whole set of {key, value, unit} to remove it.
You can use wildcards (%
) with imeta rmw
.
You can also modify, copy, show metadata by the imeta mod
, imeta cp
, imeta ls
commands respectively:
imeta mod -d dataobject key1 val1 unit1 n:key2 v:diffval2 u:unit2
imeta cp -C -C collection1 collection2
imeta ls -C collection
You can also immediately add metadata to the data object or collection upon upload:
iput source_file --metadata "key1;val1;unit1;key2;val2;unit2"
iput source_file --metadata "key1;val1;;key2;val2;unit2"
- add metadata to alice.txt
- add metadata to the
aliceInWonderland
collection - create a new file locally (
echo "testing metadata" > lorem.txt
), upload file and add metadata in one go
Metadata attached to data objects and collections becomes very useful when you want to query for data.
The iquest
command is used to query data objects/collections.
It uses a SQL like syntax.
Do note that not all SQL options are available by default, e.g. a join is not possible.
You can look up the available keys you can query for via:
iquest attrs
Note that there are a lot.
General iquest
queries look like:
iquest "select COLL_NAME, DATA_NAME, META_DATA_ATTR_VALUE where META_DATA_ATTR_NAME like 'key1'"
In above command we wanted to look for everything which has the metadata attribute name (metadata key) that resembles 'key1' and retrieve the collection name, data object name and metadata attribute value (metadata value). iRODS will respond something like:
COLL_NAME = /surfZone1/home/irods-user1
DATA_NAME = testfile
META_DATA_ATTR_VALUE = val1
------------------------------------------------------------
You can change the format of the iRODS response by using a C like format string after iquest
like so:
iquest "User %-6.6s has %-5.5s access to file %s" "SELECT USER_NAME, DATA_ACCESS_NAME, DATA_NAME WHERE COLL_NAME = '/surfZone1/home/irods-user1'"
where after iquest
there is the format string and the second string is again the SQL like query.
This can be convenient for making reports but also for retrieving absolute filepaths:
iquest "%s/%s" "select COLL_NAME, DATA_NAME where META_DATA_ATTR_NAME like 'TEMPERATURE' and META_DATA_ATTR_VALUE > '300'"
This could be useful in concatenating commands in HPC data staging which we will be using in the following section:
iquest "%s/%s" "select COLL_NAME, DATA_NAME where META_DATA_ATTR_NAME like 'key1' and META_DATA_ATTR_VALUE = 'val1'" | xargs iget
- try to find the files you have added above by searching for the associated metadata you added
- search for files with
META_DATA_ATTR_NAME
isauthor
andMETA_DATA_ATTR_VALUE
isLewis Carroll
. Do you know these files?
Now that you know the basic data handling in iRODS, you can follow the next section which is about setting up a data processing pipeline while still retaining data provenance with the data handling tool discussed in this section iRODS in HPC.