A basic Bash tutorial by James A. Fellows Yates (@jfy133) and Thiseas C. Lamnidis (@TCLamnidis).
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
- (Bespoke)BareBonesBash(BroughtByBeardedBioinformaticians)
- Introduction
- tl;dr
- Navigating the maze
- Playing with files, one bit at a time
- Asking the computer for help (it loves helping people)
- The Lord of the Pipes: One command to do them all.
- Now you're thinking with portals! Symlinks and their usefulness.
- Laziness 101: Minimising our work by maximising the work of the computer
- The road so far
- OPTIONAL: The Cleanup Crew
- For next time!
The aim of this tutorial is to make you familiar with using bash everyday... for the rest of your life 😈. More specifically, we want to do this in the context of bioinformatics. We will start with how to navigate [Thiseas insisted on that fancy word...] around a filesystem in the terminal, download sequencing files, and then manipulate them. Within these sections we will also show you simple tips and tricks to make your life generally easier. In fact, some of these commands we only just learnt last week [Thanks Aida!] and we've been using the terminal for more than 2 years.
This tutorial is designed to be self-sufficient, using public data. Thus you can do it anywhere, on any machine with a UNIX terminal (no warranty provided).
[Images: You BEFORE this tutorial / You AFTER this tutorial]
If you want to know what you will learn in this tutorial, or are already too scared to read the rest of this tutorial, you can look at this table as a quick reference. To understand actually what each command does, carry on reading below!
command | description | example | common flags or arguments |
---|---|---|---|
pwd | print working directory | pwd | |
ls | list contents of directory | ls | -l (long info) |
mkdir | make directory | mkdir pen | |
cd | change directory | cd ~/pen | ~ (home dir), - (previous dir) |
ssh | log into a remote server | ssh <user>@<server> | -Y (allows graphical windows) |
mv | move something to a new location (& rename if needed) | mv pen pineapple | |
rmdir | remove a directory | rmdir pineapple | |
wget | download something from a URL | wget www.pineapple.com/pen.txt | -i (use input file) |
cat | print contents of a file to screen | cat pen.txt | |
gzip | a tool for dealing with gzip files | gzip pen.txt | -l (show info) |
zcat | print contents of a gzipped file to screen | zcat pen.txt.gz | |
whatis | get a short description of a program | whatis zcat | |
man | print the man(ual) page of a command | man zcat | |
head | print first X number of lines of a file to screen | head -n 20 pineapple.txt | -n (number of lines to show) |
\| | pipe, a way to pass output of one command to another | cat pineapple.txt \| head | |
tail | print last X number of lines of a file to screen | tail -n 20 pineapple.txt | -n (number of lines to show) |
less | print file to screen, but allow scrolling | less pineapple.txt | |
wc | tool to count words, lines or bytes of a file | wc -l pineapple.txt | -l (number of lines, not words) |
grep | print to screen lines in a file matching a pattern | grep pen pineapple.txt | |
ln | make a (sym)link between a file and a new location | ln -s pineapple.txt pineapple_pen.txt | -s (make symbolic link) |
nano | user-friendly terminal-based text editor | nano pineapple_pen.txt | |
rm | more general 'remove' command, including files | rm pineapple_pen.txt | -r (to remove directories) |
$VAR | dollar sign + name "unpacks" a variable (set one with VAR=value, no dollar sign) | PPAP=Pen | |
echo | prints string to screen | echo $PPAP | |
for | begins 'for' loop; requires 'in', 'do' and 'done' | for p in apple pineapple; do echo "$p$PPAP"; done | (prints applePen, pineapplePen) |
A terminal is simply a fancy window that allows you to access the command-line interface of a computer or server.
The command-line itself is how you can work on the computer with just text.
bash (Bourne Again SHell) is one of the most widely used languages in the terminal.
After opening the terminal what you will normally see is a blank screen with a 'command prompt'. This typically consists of your username, the device name, a colon, a directory path and ends with a dollar symbol. Like so:
<username>@<device_name>:~$
Throughout this tutorial, we will indicate the prompt of each command just with
a single $
, without the rest of the prompt. Make sure NOT to copy the $
as the command won't work! This symbol is important for reasons you will see
later.
Furthermore, the symbols <> are used to show things that will/should be
replaced by another value. For example, in Thiseas' command prompt <username>
will be replaced by lamnidis, as that is his username.
Keep an eye out for both of these throughout the tutorial.
Note that prompts are customisable, so they will not always be displayed as above [look at Thiseas' magical prompt as an example. James keeps his vanilla as he is a barbarian].
The prompt is never involved in any command, it is just there to help you know who and where you are. Therefore you must always make sure when copying a command (see later) that you do NOT include the prompt.
Here, the directory ~ stands for your home directory. This shorthand can be
seen and used both on your machine and on the cluster. Note that this shorthand
will point to a different place, depending on the machine and the user.
If you want to know what the shorthand means, (here comes your first command!)
you can type in pwd
, which stands for "print working directory".
The working directory stands for whichever directory you are currently in.
$ pwd
This prints the entire "filepath" of the directory i.e. the route from the "root" (a specific directory on the machine), through every subdirectory, leading to your particular folder.
There are two types of filepaths:
- An absolute path starts from the root directory of the machine, shown as a leading /. Paths starting with ~ are also absolute paths, since ~ translates to the absolute path of your specific home directory. That is the kind of path you see in the output of the pwd command you just ran.
- Alternatively, a relative path always begins from your working directory (i.e. your current directory). Often this type of path will begin with one (./) or two (../) dots followed by a forward slash, but not always. In the syntax of relative paths, . means "the current directory" and .. means "the parent directory" (or the 'one above').
As a real life example, imagine you are walking down the street when a car stops to ask for the way to the Brandenburger Tor. You could tell them how to get to the Tor from Ethiopia (since that is the presumed root where all humans started their journey) [haha, human history joke], or you could say "Take a left here, straight for 3 blocks, and you're there.". The latter set of directions is relative to their current position, while the first one is not.
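If you want to convince yourself of the difference, here is a minimal sketch you can replay once you have met the cd and mkdir commands introduced further below (the /tmp/path_demo location is just a made-up scratch directory for the demo):

```shell
# Build a small throwaway directory tree to walk around in
mkdir -p /tmp/path_demo/outer/inner
cd /tmp/path_demo/outer/inner

abs_here=$(pwd)                  # the absolute path of where we stand
cd ..                            # a relative move: '..' is the parent directory
parent=$(pwd)
cd /tmp/path_demo/outer/inner    # an absolute move: works from anywhere
abs_again=$(pwd)

echo "$abs_here"
echo "$parent"
echo "$abs_again"
```

Both routes end up in the same place; the relative one only made sense because of where we started.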
Now let's look around at our current location and see what we can find within
our home directory. We can use the command ls
, shorthand for "list", which
will (surprise surprise) list the directory contents.
$ ls
Your home directory should come equipped with multiple subdirectories like "Documents", "Pictures", etc.
It is now time to start moving [navigating] towards "the Brandenburger Tor"
from the example above. We can navigate through directories using the command
cd
which stands for "change directory".
$ cd Documents/
This command will move you from your home directory to its "Documents"
subdirectory. Note that Documents/
above is indeed a relative path, since it
starts from the home directory (the initial ./
is implied). To find the
absolute path of the "Documents" directory we will once again use pwd
.
BONUS TIP TIME! You can use the ↑ and ↓ arrow keys to scroll through your last 1000 used commands.
$ pwd
Now we can move up one directory, back to your home using the relative path
../
.
$ cd ../
We can also change directories using absolute paths. Let's do this using the
absolute path we printed using pwd in the previous step. Type cd, but don't
press enter yet!
Copy and paste the output of the previous pwd
command
(which you can see in your terminal does not have the command prompt), after
the cd
, then press enter. NOTE: Putty users have to highlight the text
and it copies automatically, then use right click or shift + Insert
to paste.
For example:
$ cd /home/fellows
BONUS TIP TIME! Now for one last move, here is a lesser-known trick. When
using cd
you can use a dash (-
) to indicate 'my previous location'. This is
useful since you can traverse [ha! I have my own fancy words now! - James]
multiple directories with one cd
command. So, now, to return to our home
directory from the documents directory we can type:
$ cd -
$ pwd
And voilà! We are back in our home directory. Now try running this:
## Remember in "A land before time" when the dinosaur's mother died?
While reading that command, you
might have been reminded of one of the most emotionally devastating moments of
any person's life. However, the computer would show no signs of emotional
struggle. Sure, computers don't have feelings and all, but ALSO the
computer never even read that sad reminder. A computer will NOT read
anything that comes after a comment character, which in bash is a hash (#
)
[NOT a hashtag!]. You can use comments to explain what a certain bit of code
does, without affecting how that code runs. This can also be a useful lifehack
in certain situations, an example of which will be given later.
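A tiny sketch of that behaviour, safe to paste anywhere:

```shell
# bash stops reading a line at the '#': only 'echo Hello' actually runs
greeting=$(echo Hello)   # this text is for humans; the computer never sees it
# echo "this whole line is a comment, so it never runs at all"
echo "$greeting"
```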
However, often when working in bioinformatics we will be working remotely on a
server. The most typical way is to log in via "secure shell", known as
ssh
. Note that you can normally only log into an institute's server while on
the institute's network or connected via its VPN, so make sure you are on one
of those.
If you're working just on your personal computer, skip to the next section!
A typical ssh
command consists of the ssh
, with a user, '@' symbol and then
the address of the server. For example:
$ ssh <user>@<server>
You can find out your user and server from your IT department.
Now, looking at your terminal you will see ~ again. However, on the server
~ points to a different home directory, as you are on a different machine.
All of the commands you've learned so far will still work the same 😉. You can
double check both of these by typing
$ pwd
It is important to keep your corner of the servers well organised, and the
trick to doing that is making your own directories. Often a lot of them.
You can make a new empty directory using the command mkdir
, shorthand for
"make directory".
$ mkdir ~/BareBonesBosh
$ ls ~
You can now see your new and devoid-of-content directory. But don't celebrate yet! The directory has the wrong name! [Who could have seen this coming?] If you saw the typo and fixed it already, NO BROWNIES FOR YOU!
But don't lose hope, as we can rename things with the mv
command,
shorthand for "move".
In fact move, as the name suggests, will move a file/folder into a new location,
also renaming it in the process if necessary. It works by typing mv
, the old
location and a target location.
$ mv ~/BareBonesBosh ~/BearBonesBash
The command above will now move the directory into the same location, but
as the target location is spelt differently, the directory will now have a
different name. Thus, essentially renaming it to BearBonesBash
.
But oh no! Not again! This is not a bash tutorial for ancient bear genomics!
Let's just delete that empty directory and start over, using the rmdir
command, short for "remove directory".
$ rmdir ~/BearBonesBash
$ ls ~
And now we can create our directory, properly named this time.
$ mkdir ~/BareBonesBash
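The whole typo saga, condensed into a sketch you can replay in a throwaway /tmp directory (the paths here are made up for the demo):

```shell
mkdir -p /tmp/bbb_demo && cd /tmp/bbb_demo

mkdir BareBonesBosh                      # oops, a typo
mv BareBonesBosh BareBonesBash           # mv 'renames': same parent, new name
renamed=$([ -d BareBonesBash ] && echo yes || echo no)

rmdir BareBonesBash                      # rmdir only removes EMPTY directories
removed=$([ -d BareBonesBash ] && echo no || echo yes)

echo "renamed: $renamed, removed: $removed"
```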
So we have places to organise our files... buuut we don't have any files yet! Let's change that.
We ain't playing with bears today - that's dangerous (as we saw above) - instead, let's play with some Mammoths!
We're going to use wget
to download a FASTQ file from the ENA (European Nucleotide Archive). So while in
our ~/BareBonesBash
directory, we will give wget
the link to the file, and
we should see a loading bar. Once downloaded (it should be pretty quick),
we can use ls
to check the contents.
$ wget ftp.sra.ebi.ac.uk/vol1/fastq/ERR202/001/ERR2020601/ERR2020601.fastq.gz
## if you don't have wget, you can instead use 'curl' with the command below.
# curl -O ftp.sra.ebi.ac.uk/vol1/fastq/ERR202/001/ERR2020601/ERR2020601.fastq.gz
## Then to check if the file is now in our working directory
$ ls
Great, the file is there! Now it's time for organising. We should move our
fastq file into our newly created directory. This time, we can use mv
the
way it was meant to be used.
$ mv ~/ERR2020601.fastq.gz ~/BareBonesBash
$ ls
You should now see that the fastq file is no longer in your home folder. Let's go find it!
$ cd ~/BareBonesBash
$ ls
Good, the file arrived in ~/BareBonesBash
safely! Maybe now would be a good
time to check if we downloaded the right thing. In bash, you can normally use
cat
with text files, short for concatenate, which is used to print the
contents of a file to the screen. Let's try this with our newly downloaded file.
BONUS TIP TIME! If you're anything like Thiseas, who gets triggered at slow computer things, and prefer to have the computer do the work for you - try typing a couple of characters then press the "TAB" key on your keyboard.
$ cat ERR2020601.fastq.gz
Yay for auto-complete! But you probably had a bunch of junk printed to screen.
[Looks like curiosity killed the cat!]
That's because the FASTQ file, as with almost all FASTQs, is compressed (as
indicated by the .gz). To view the human readable contents of the file, we
can instead use zcat
. Don't forget your auto-complete!
$ zcat ERR2020601.fastq.gz
That looks much better, we can now actually see some DNA sequences! But you may have also noticed that a lot of stuff zipped past without you being able to see it. You could try scrolling but likely you'll not be able to go back far enough to see your previous commands.
BONUS TIP TIME! try pressing ctrl+l
, which will clear your terminal of
all the junk that was printed to your screen. This does NOT delete those lines,
it simply scrolls down for you. You can still find all your previous work if
you scroll up.
Now it's time for the inevitable tangent when your tutor thinks of a very (un)funny metaphor to explain something! [James added that and I didn't notice till it was too late >.<]
As we just learned, the FASTQ file we've been playing with is compressed, Zipped, if you will. We constantly compress files in multiple different ways, but why? As the name suggests, compression saves disk space, so we can have more files stored on our system.
An everyday example of the benefits of compression comes from music. To keep the calculations smaller we'll take a time machine back to 2001, when having one of these things made you instantly popular and better geared than James Bond [tech-savvy Pierce Brosnan, not the trigger-happy Daniel Craig]:
That amazing piece of technology came with a storage space of 5GB, while an uncompressed music album takes up 640MB of space. THAT IS 7.8125 ALBUMS! At 20 songs per album, that makes less than 160 songs total! "But my iPod had 800 songs in it, and still had space!" I hear you thinking. That's mp3 compression for you. Compression programmes you might be familiar with are, for example, WinZip or WinRar.
Is there some way we could work out how much space we are saving by compressing
this FASTQ file? Let's ask the computer to help us find a way! The first command
to use here is whatis
, which will give a one line explanation of what a certain
command does. The second command we need is man
. Using whatis
we can find out
what man
does.
$ whatis man
This will inform us that man is "an interface to the on-line reference
manuals". Cool! Now try to get information on zcat using man.
$ man zcat
This will open the manual page for zcat
in a separate screen, which you can
exit by pressing "q
" on your keyboard. You can scroll up or down with the
arrow keys. At the top of the screen you will see the command the manual is
for, in this case it should read gzip
. That is because zcat
is part of the
programme gzip
. You will see a long description of the programme, followed by
(scroll down) a section with all the options available.
Isn't this great! The option -l
lists the size of a file in both compressed
and uncompressed form, as well as the compression ratio (how effective the
compression was). Most programmes you will use DO have a man
page, making
this command extremely useful. Now that we've learned about the -l option of
option of
gzip
, let's use it to see how efficient the compression of this FASTQ file is.
[Say it with us: "man
is love. man
is life."]
$ gzip -l ERR2020601.fastq.gz
A compression factor of 74.9% is pretty good. It means our compressed FASTQ
file is only 25.1% of the size the uncompressed file would be.
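You can see the same effect on a file you make yourself. This toy file is extremely repetitive, so it compresses far better than real sequencing data would (the file name and /tmp location are made up for the demo):

```shell
cd /tmp
# 1000 identical 13-byte lines -> 13000 bytes uncompressed
yes "ACGTACGTACGT" | head -n 1000 > toy.txt
uncompressed=$(wc -c < toy.txt)

gzip -f toy.txt          # replaces toy.txt with toy.txt.gz (-f: overwrite old runs)
compressed=$(wc -c < toy.txt.gz)

gzip -l toy.txt.gz       # the same numbers, plus the compression ratio
echo "compressed: $compressed  uncompressed: $uncompressed"
```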
After that tangent, let's get back to our regularly scheduled program(ming)!
We will now try out three semi-related commands to make viewing the contents
of a file easier, and begin to familiarise with the most important
functionality of bash: the concept of |
(a.k.a. the "pipe").
A pipe passes the output of one command and gives it as input to the next. It
allows us to string commands together, one after the other, which means you can
do more complicated (and beautiful) things to your files "on the fly". The command
head
allows you to view the first 10 lines of a file, while tail
will show
the last 10 lines.
We will now print the file to screen with zcat
, but rather than just let
the whole thing print, we will "pipe" the output into head
, which will
allow us to see just the first 10 lines.
$ zcat ERR2020601.fastq.gz | head
We can also display more lines with the -n
flag (short for "number of
lines"). To see the first 20 lines you would use
$ zcat ERR2020601.fastq.gz | head -n 20
The same option exists for tail. Note, however, that options are not universal! Not every programme will use the same options!
$ zcat ERR2020601.fastq.gz | tail -n 4
And you can also chain them altogether [not unlike a human centipede... No gif here so we don't get fired]
$ zcat ERR2020601.fastq.gz | head -n 20 | tail -n 4
The above command will print the whole file, but capture only the first 20 lines, before printing out the last 4 lines of these 20.
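Here is the same trick on numbers instead of reads, so you can verify it by eye: seq 100 prints the numbers 1 to 100, one per line, and the pipes slice out lines 17 to 20.

```shell
# head keeps lines 1-20, tail then keeps the last 4 of those: lines 17-20
slice=$(seq 100 | head -n 20 | tail -n 4 | tr '\n' ' ')
echo "$slice"
```

Back to the zcat chain above, which did the same thing: it printed lines 17 to 20 of the FASTQ file.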
In practice, what was just printed on your screen is the record of a single read, which spans 4 lines of the FASTQ file.
- The record begins with the read ID, preceded by an @.
- The next line contains the sequence of the read.
- The third line is a separator line ('+').
- Finally, the fourth line of this record contains the base quality score for each position on the read, encoded a certain way. We won't go into the specific encoding of base quality scores here, but it is easy enough to find information about it online, if you want to know more.
But what if you wanted to view the whole file "at your own leisurely pace"?
We can use the tool less
, which prints the file to screen, but allows you
to move up and down the output with your arrow keys
. You can also move down
a full screen with the spacebar
.
$ zcat ERR2020601.fastq.gz | less
You can quit by pressing "q" on your keyboard.
Now we've had a look inside and checked that the file is a pretty normal FASTQ file, lets start asking more informative bioinformatic questions about it. A pretty standard question would be, how many reads are in this FASTQ file? We know now that each read record in a FASTQ file has four components, and thus takes up 4 lines. So if we count the number of lines in a file, then divide by four, we can work out how many reads are in our file.
For this we can use 'wc', which stands for "word count". However, we
don't want to count words, we want to count the number of lines. We can
therefore use the flag -l
(try using what we learnt about man
above to find lists of
these flags!). But remember we first have to decompress the lines we are
reading from the file with zcat
.
$ zcat ERR2020601.fastq.gz | wc -l
This should give us 18880, which divided by four (since there are four lines per read), is 4720 reads.
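Bash can even do the division for us, using $(( )) (called "arithmetic expansion"; we won't cover it further here). A sketch on a fabricated two-read FASTQ, so the numbers are easy to check; for the real file you would swap cat mini.fastq for zcat ERR2020601.fastq.gz:

```shell
cd /tmp
# A fake FASTQ with exactly two records (4 lines each)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGG\n+\nIIII\n' > mini.fastq

lines=$(cat mini.fastq | wc -l)
reads=$(( lines / 4 ))          # four lines per read record
echo "$lines lines = $reads reads"
```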
Next, maybe we want to know what the name of each read is. When we used
less
above, we saw each read header began with "@". Maybe we can use this
to our advantage!
The command grep
will only print lines in a file that match a certain
pattern. So for example, we want to search for every line in our FASTQ file
that contains a '@'. Lets try it out again in combination with zcat
and
our pipes.
$ zcat ERR2020601.fastq.gz | grep @
Unfortunately we seem to have picked up some other stuff because of the @ characters in the base quality lines.
We can make our "pattern", in this case "@", more specific by adding
"ERR" to it. But let's also avoid flooding your screen with 4720 lines of
stuff, and pipe that output into less
, so we can look at it more carefully.
$ zcat ERR2020601.fastq.gz | grep @ERR | less
Remember to press "q" to exit.
And for one final recap, we previously calculated there to be 4720 reads in our
file. If we have just extracted the unique read headers for every read, then
in principle we can also just count these with wc
!
$ zcat ERR2020601.fastq.gz | grep @ERR | wc -l
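The same gotcha in miniature: a fabricated record whose quality line happens to contain an '@'. grep's -c flag counts matching lines directly, i.e. it is a shorthand for the grep … | wc -l above:

```shell
cd /tmp
# The first quality string ('II@I') contains the troublesome '@'
printf '@ERR1\nACGT\n+\nII@I\n@ERR2\nTTGG\n+\nIIII\n' > gotcha.fastq

naive=$(grep -c '@' gotcha.fastq)      # 3 hits: two headers plus a quality line
better=$(grep -c '@ERR' gotcha.fastq)  # 2 hits: the headers only
echo "naive: $naive, better: $better"
```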
The FASTQ we have been working with so far was downloaded from the ENA. It is important to keep the file name intact, so we can easily identify this specific FASTQ file in the ENA database in the future, if need be.
In order to retain the original file, but also to play around with the contents, we can use a symbolic link (symlink). You have doubtless seen these many times right on your desktop, in the form of desktop shortcuts! They are small portals that let you go to a remote location really fast, and take something from there.
Imagine if you could reach the TV remote from the sofa, although for some strange reason you left it in the freezer when picking up the (now half-melted) ice cream. [No, of course Thiseas has never done that!]
So let us make a new subdirectory to store our symlink to the FASTQ file we already downloaded, and move to that directory.
$ mkdir ~/BareBonesBash/FastQ.Portals
$ cd ~/BareBonesBash/FastQ.Portals
It is now time to make the symlink. We do this with the ln
command (short for
"link"), by providing the -s
option, which specifies we want to create
a symbolic link (i.e. a shortcut).
Note: You should give absolute paths to the file your symlinks point to, or
things will break down. (Note that a path that starts with ~
is technically
an absolute path, since it is also not relative to your current position.)
$ ln -s ~/BareBonesBash/ERR2020601.fastq.gz .
Make sure you included that . in the command above. As discussed in the
"Relative Paths" section, that dot points to your current working directory,
thus telling the ln programme that it should create the link in the current
directory. You should now see the symlink in the directory.
To see where the link points to we can use ls -l
, which provides extended
information on the files shown with ls
. (For more information you can look
at the man
page for ls
).
$ ls -l
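A self-contained version of the symlink steps, in a made-up /tmp directory, using readlink (another small utility, not covered above) to print exactly where a link points:

```shell
mkdir -p /tmp/ln_demo/data /tmp/ln_demo/portal
echo "mammoth" > /tmp/ln_demo/data/sample.txt

cd /tmp/ln_demo/portal
ln -sf /tmp/ln_demo/data/sample.txt .   # absolute target; '.' = put the link HERE

content=$(cat sample.txt)               # reading the link reads the original file
target=$(readlink sample.txt)           # where does the link point?
echo "$content ($target)"
```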
We can now look at the original FASTQ file by using our symlink. Note
that while the command looks the same as in the section above, we are in a
different directory, so the ERR2020601.fastq.gz here is technically
different to the original. It is now a shortcut to the original file, which
happens to have the same name. So, repeating above:
$ zcat ERR2020601.fastq.gz | head -n 20 | tail -n 4
Which should print out the same read as it did on the original FASTQ file.
Now for a bit of honesty. A single sample will not get you a nature publication. [ok, maybe sometimes]. We will need more data if we're gonna make it to the most prestigious journals. So let's download another 7 samples from James' Mammoth project to get us on our way to a nature cover page. (See here for the ENA page of the project) [Yay! Free Publicity!].
It is a lot of work to run wget
7 times while changing the command every time.
Bonus tip time! One way would be to press the 'up' arrow on your keyboard, which will allow you to scroll through all your previous commands. Thus you could pull up the previous command, then just change a couple of characters. This can be useful in certain cases, but doing that 8 times is still too much work.
Good thing we're here to learn how to be lazy! We can download multiple files
from an ftp server by giving wget
a file that contains the ftp links for each
file we want downloaded.
But how can we make this file? There are multiple options for text editing in
the terminal. If you're absolutely insane you may look up vim
[Thiseas' poison], or we can use nano
which is much more user friendly.
Editing the contents of a file in nano
is mostly as you would with your
standard TextMate
or gedit
on your local machine. However, the main
difference is how you save, and close the program which you perform using
keyboard combinations (like you would use ctrl + c
to copy a line in your
typical 'Microsoft Office' suite).
So open up the program with
$ nano
And you will now see a blank window, with a section at the bottom showing
a variety of commands (where ^ corresponds to the ctrl or cmd
key on your keyboard). You can try typing and deleting text as you
normally would on your offline text editor, moving around the page with your
arrow keys.
To save the contents of the file, we want to begin by initiating our exit
with ctrl+x
. At the bottom you will be prompted to "Save modified buffer",
press y
on your keyboard to agree. Now you will be asked what you want
the file to be called. Type ~/BareBonesBash/Ftp.Link.txt
to give both
the directory and the file name (Ftp.Link.txt
), and then press enter
.
We can check that the file was successfully generated by navigating into
the directory and doing ls
.
$ cd ~/BareBonesBash/
$ ls
Great! There is a file there! But wait! OH NO! There is another typo! We will have multiple Links, not a single Link!
[Dear lord, how much nerdier can we get here -.- ...]
Let's remove that file, and start again.
So far, we learnt rmdir
to remove a directory. To remove a file, we can
instead use rm
for – you guessed it! – remove.
$ rm Ftp.Link.txt
$ ls
And it's gone!
You can also use rm
to remove directories using it with the flag -r
,
but this is 'less' safe - it will not warn you if a directory has stuff
already inside it.
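To see the difference in behaviour without risking any real data, here is a sketch in a throwaway directory: rmdir refuses a non-empty directory, while rm -r silently takes everything with it.

```shell
mkdir -p /tmp/rm_demo/full
touch /tmp/rm_demo/full/file.txt        # the directory is now NOT empty

# rmdir refuses (2>/dev/null just hides its error message for the demo)
rmdir /tmp/rm_demo/full 2>/dev/null && rmdir_result=removed || rmdir_result=refused

rm -r /tmp/rm_demo/full                 # removes it, contents and all, no questions asked
[ -d /tmp/rm_demo/full ] && rm_result=present || rm_result=gone

echo "rmdir: $rmdir_result, rm -r: $rm_result"
```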
Anyway, let's start again, but this time get ready to download our extra
Mammoth files, using our relative paths to go back to our home directory,
and opening up nano
again
$ cd ~
$ nano
Copy the text below into the blank window, as you would normally when at
your terminal command prompt (cmd+v
on OSX or ctrl+shift+v
on Linux).
ftp.sra.ebi.ac.uk/vol1/fastq/ERR202/009/ERR2020609/ERR2020609.fastq.gz
ftp.sra.ebi.ac.uk/vol1/fastq/ERR202/001/ERR2020611/ERR2020611.fastq.gz
ftp.sra.ebi.ac.uk/vol1/fastq/ERR202/007/ERR2020567/ERR2020567.fastq.gz
ftp.sra.ebi.ac.uk/vol1/fastq/ERR202/005/ERR2020565/ERR2020565.fastq.gz
#ftp.sra.ebi.ac.uk/vol1/fastq/ERR202/001/ERR2020601/ERR2020601.fastq.gz
ftp.sra.ebi.ac.uk/vol1/fastq/ERR202/003/ERR2020613/ERR2020613.fastq.gz
ftp.sra.ebi.ac.uk/vol1/fastq/ERR202/008/ERR2020618/ERR2020618.fastq.gz
ftp.sra.ebi.ac.uk/vol1/fastq/ERR202/007/ERR2020617/ERR2020617.fastq.gz
(Note the #
, this is commented out as we've already downloaded this file!)
So again, to recap, to exit and save a file in nano we do the following dance:
- To initiate exit: ctrl+x
- Press y to say you want to "Save modified buffer"
- Type the file name ~/BareBonesBash/Ftp.Links.txt and then press enter.
To verify that it worked correctly, we can either use the command that we
learnt above to print to screen the contents of the file (which is...?), or
we can use nano
again, but with the file as an argument to open the
file and see the contents.
$ nano ~/BareBonesBash/Ftp.Links.txt
This time when you exit with ctrl+x
you'll immediately return to your
command prompt, as you made no changes to the file.
Woop! Now let's utilise the file we just created, by downloading all the files stored at those URLs. IN ONE GO!
You can provide wget with a file of URLs (like the one you just made) using
the flag -i, for "input".
$ cd ~/BareBonesBash
$ wget -i ~/BareBonesBash/Ftp.Links.txt
## curl cannot handle links from a file, so if you are using curl, you should run the command below to download all the files.
# curl -O ftp.sra.ebi.ac.uk/vol1/fastq/ERR202/009/ERR2020609/ERR2020609.fastq.gz -O ftp.sra.ebi.ac.uk/vol1/fastq/ERR202/001/ERR2020611/ERR2020611.fastq.gz -O ftp.sra.ebi.ac.uk/vol1/fastq/ERR202/007/ERR2020567/ERR2020567.fastq.gz -O ftp.sra.ebi.ac.uk/vol1/fastq/ERR202/005/ERR2020565/ERR2020565.fastq.gz -O ftp.sra.ebi.ac.uk/vol1/fastq/ERR202/003/ERR2020613/ERR2020613.fastq.gz -O ftp.sra.ebi.ac.uk/vol1/fastq/ERR202/008/ERR2020618/ERR2020618.fastq.gz -O ftp.sra.ebi.ac.uk/vol1/fastq/ERR202/007/ERR2020617/ERR2020617.fastq.gz
Look at that! One command instead of 7! You're becoming a bash pro already!
Time for another tangent! You will now learn the echo
command.
In bash echo
just prints things. The name refers to the fact that since the
computer "says" what you just told it to say, it behaves like an echo of yourself.
Tradition dictates that the first thing you have the computer say in programming
tutorials is "Hello World!", so here goes:
$ echo Hello World!
Great! Now back to the question at hand. What is a variable anyway? That is a good
question! A variable is something that changes! But what does that mean, exactly? A
variable can be "set" (i.e. telling the computer what that variable means) to a
variety of things. Some variables are set for you, the moment you open your terminal
or log into a server. By convention, such variables have names in ALL CAPS. An example
of such a variable is HOME
, which stores the location of your home directory. Therefore, when you use the shorthand ~
, the
computer looks into that variable to see what that means. However, the computer
cannot always tell what is a variable and what is just text. It relies on you to
tell it what should and should not be "unpacked". "Unpacking" means telling the
computer to look at what is inside a variable. We signal the computer that we
wish to look inside a variable by using the $ character in front of the
variable name.
Try this:
$ echo HOME #This will print the word HOME.
$ echo $HOME #This will print the contents of the variable HOME.
Variables like the one above are called environment variables and should generally NOT be changed on a whim [even though the temptation might be a whim away. A whim away... A-whim-away...].
But you can also set your own variables, which is extremely handy. Any variable can be easily
overwritten, which is one reason why they are so useful. Therefore, as long as you don't
give your variables names in ALL CAPS, you won't run the risk of overwriting environment
variables, and everyone is happy. One way to assign variables is by using an =
. In the
example below, we will set and overwrite the variable GreekFood
, and then "unpack" it in
several sentences [which also happen to be objectively true].
$ GreekFood=4 #Here, 'GreekFood' is a number.
$ echo "Greek food is $GreekFood people who want to know what heaven tastes like."
#
$ GreekFood=delicious #Now we overwrite that number with a word (or a "string" of characters).
$ echo "Everyone says that Greek food is $GreekFood."
#
$ GreekFood="Greek wine" #We can overwrite 'GreekFood' again, but when there is a space in our string, we need quotations.
$ echo "The only thing better than Greek food is $GreekFood!"
#
$ GreekFood=7 #And, of course, we can overwrite with a number again too.
$ echo "I have been to Greece $GreekFood times already this year, for the food and wine!"
#
We will talk about quotes another time, so just forget you used them for the moment 😉.
Now you have a basic understanding of Greek food. I mean variables in bash! Let's see how we can use this knowledge.
Now to minimise our workload in making the symlinks for all the FASTQ files we downloaded previously! We can do this using a for loop, one of the basic methods of all programming.
Imagine you have to order pizzas for a varying number of scientists every week [just a random example]. For every person you will need an extra pizza. This is a sort of "for loop": you go through the list of names of hungry scientists, and you add one more pizza to the order for every name. Note that the specific names of the scientists don't really matter here, only the number of names. So in pseudocode (code-like writing that is human readable, but that a computer will not understand), the above loop would look like this:
## Don't copy and paste this, it will not work.
for every scientist:
Order another pizza
done
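In real bash, the pizza loop might look something like this. This is just a sketch to show the idea, and the scientists' names below are made up purely for illustration:

```shell
# A made-up list of hungry scientists (the names don't matter, only the count).
scientists="Alice Bob Charlie"

pizzas=0
for scientist in $scientists; do
    pizzas=$((pizzas + 1))  # order another pizza for each name
done

echo "We need $pizzas pizzas this week."
```

Running this prints `We need 3 pizzas this week.` - one pizza per name, just like the pseudocode described.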
Let's stop daydreaming of pizza now and return to the task at hand. For each FASTQ file we want to make a symlink to that file.
Let's first check this will work as we expect by converting our pseudocode to
real code, but getting the computer to TELL us what we have told it
to do (WITHOUT actually doing it), by putting the command in an echo
!
$ for fastq in ERR2020609.fastq.gz ERR2020611.fastq.gz ERR2020567.fastq.gz ERR2020565.fastq.gz ERR2020613.fastq.gz ERR2020618.fastq.gz ERR2020617.fastq.gz; do
> echo "ln -s ~/BareBonesBash/$fastq ~/BareBonesBash/FastQ.Portals"
> done
Look! You can see all the hard work of typing you DON'T have to do! Woohoo!
So try removing the echo and see what happens.
$ for fastq in ERR2020609.fastq.gz ERR2020611.fastq.gz ERR2020567.fastq.gz ERR2020565.fastq.gz ERR2020613.fastq.gz ERR2020618.fastq.gz ERR2020617.fastq.gz; do
> ln -s ~/BareBonesBash/$fastq ~/BareBonesBash/FastQ.Portals
> done
After writing the first line and pressing enter, you may have noticed how
the prompt changes from $
to >
. This means that your command still expects
more information before it can execute! In this case, it is completed when
done
is encountered. (If you accidentally get stuck there and can't leave,
press ctrl + c
on your keyboard).
Once you've written done, let's see if the command worked correctly!
Bonus tip time! ls
will also accept a particular path to print to screen.
i.e. if you're in one directory but want to see the contents of a
different one, you can follow the example here, where we are in ~/BareBonesBash
but want to check the contents of FastQ.Portals/
$ ls -l FastQ.Portals/
Going back to our loop - in the above example fastq
(case-sensitive) is
the variable. In this case it is first set to a string of characters,
corresponding to the name of the first FASTQ file (ERR2020609.fastq.gz
).
At that point the command given within the loop (in this case ln -s
) is
executed. After it has completed, the next file in the
list (ERR2020611.fastq.gz
) is picked up, and the loop is repeated.
Described in more pseudocode:
for every_object in a_list; do
<this_command> on <this_object>
done
It is important that you separate out your 'loop' from the command itself using
; do
, and finish the loop with done
, otherwise bash will keep waiting for
some other input.
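As a tiny, safe illustration of that structure, here is a loop you can run anywhere. It only echoes text, so nothing on disk is touched:

```shell
# The list after 'in' can be any space-separated set of items.
for object in apple banana cherry; do
    echo "Now processing: $object"
done
```

This prints one "Now processing:" line per item, three lines in total, showing how the loop variable takes each value in turn.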
In the loop we just performed, we want to use what is in the fastq
variable
to tell the computer what to perform ln
on. To tell the computer to use what is stored
in fastq
, we prefix $
to the variable name.
This means that when reading ~/BareBonesBash/$fastq
, the computer knows that
$fastq
means "use whatever is stored in the variable fastq
", thus seeing
~/BareBonesBash/ERR2020609.fastq.gz
.
In the second part of the command (~/BareBonesBash/FastQ.Portals
), there is
no $
in front of the sequence of letters FastQ
. In this case the computer
reads it as the letters themselves and not the contents of a variable
(which is what we wanted to happen). The word is also written in a different
case, so it would NOT be read as the variable even with the $
character.
See the example below for more info:
$ echo -e "$FastQ <---- Not a set variable"
$ echo -e "$fastq <---- The last FastQ file in the list of files in the loop."
However, this is not the only way to write a loop. In the loop we ran above, we
still had to do a lot of writing: writing out the name of every file. But worry
not, this is Lazyness 101, and here we like to NOT write a lot! It is our
right not to type more than we need to! It is therefore our right - nay,
our responsibility - to use wildcards. A wildcard "refers to a character that can be
substituted for zero or more characters in a string". In bash, the wildcard
character is the asterisk (*
)
[Not to be confused with Asterix, James. AGAIN, REALLY!?].
For example, we could remove (using rm
as we learnt above) any object with any
combination of characters in its name, with the following command. But we won't
do that.
# rm ~/BareBonesBash/FastQ.Portals/*
In the context of a loop, we can use the wildcard to tell bash the loop
should be performed on ALL items in a directory that match the criterion given.
If we want to create a symlink (with ln -s
) for every item within the
~/BareBonesBash
directory, and place that symlink within the
~/BareBonesBash/FastQ.Portals
directory, we could use:
$ for fastq in ~/BareBonesBash/*; do
> ln -s $fastq ~/BareBonesBash/FastQ.Portals
> done
If you need to be more specific with your loop, you can also use the wildcard
with some other characters. For example, ERR*
would mean perform a command on
every file that begins with ERR, regardless of what comes after ERR. Finally,
we can use characters AFTER the wildcard as well, to only pick up files that have
a certain suffix as well as prefix (e.g. ERR*.gz
will find all files that begin
with ERR
and end with .gz
, regardless of what (if anything) comes between the
two).
For example, let's try out one of our old commands in a loop. Let's use
gzip -l
on every file starting with ERR
and ending with .gz
in our
new directory.
$ for fastq in ~/BareBonesBash/ERR*.gz; do
> gzip -l $fastq
> done
Therefore, loops and wildcards allow us to do repetitive tasks, and reap the rewards thereof, without having to do all the repetitive work!
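One thing worth noticing: when your wildcard pattern includes a path (as ours did above), the loop variable holds the FULL path, not just the file name. The sketch below demonstrates this with a throwaway directory in /tmp (the directory and file names here are invented purely for this demo), using the basename command - one we haven't covered yet - to strip away the directory part:

```shell
# A throwaway directory and files, made up purely for this demonstration.
mkdir -p /tmp/bbb_wildcard_demo
touch /tmp/bbb_wildcard_demo/ERR01.fastq.gz /tmp/bbb_wildcard_demo/ERR02.fastq.gz

# Because the wildcard pattern includes a path, $fastq holds the FULL path.
# basename strips the directory part, leaving only the file name.
for fastq in /tmp/bbb_wildcard_demo/ERR*.gz; do
    echo "$fastq is called $(basename $fastq)"
done
```

This prints one line per matching file, each showing the full path followed by the bare file name. Knowing which form is in the variable matters when a command (like ln -s) needs the full path to find the file.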
As a final practice, have a look inside your ~/BareBonesBash/FastQ.Portals
directory. You
might notice that you also have symlinks to FastQ.Portals
and Ftp.Links.txt
as well as to our FASTQ files. These came in our previous ln -s
loop, as we
used the wildcard for everything in ~/BareBonesBash
.
Try writing your own loop using for
, *
, and rm
to remove ONLY those
two files.
Thank you for joining us in this small tutorial, and for putting up with our terrible pop-culture and video game references. All in all, you should now know how to move around using the Terminal, as well as the basic commands you need to create, view and manipulate files and directories.
We are planning a second part of this tutorial series, with slightly more advanced tricks, to ensure using bash doesn't make you feel... BASHED!
Please let us know if you have feedback or if there are any questions, don't be... BASHFUL!
[I'll see myself out...]
It is extremely important to ALWAYS keep your directories clean from random clutter. This not only
lowers the chances you will get lost in your directories, but also ensures you can stay lazy, since TAB
completion will not keep suggesting similarly named files. So let's clean up your home directory by
removing all the clutter we downloaded and worked with today. The command below will remove the
~/BareBonesBash
directory as well as all of its contents.
$ cd ~ # We shouldn't delete a directory while we are still in it. (It is possible though).
$ rm -r ~/BareBonesBash
We will learn about:
- advanced echo
- double and single quotes (or in grep and loops)
- rev
- cut
- find
- awk
- sed
- parallel
- while loops
- if statements
- bash arithmetic "$((8*8))"