In this exercise we will focus primarily on quality analysis and quality control of Illumina sequencing data, since that is the type of NGS data you are currently most likely to encounter in new datasets. You can view older versions of this exercise for tips on how to handle Ion Torrent or 454 data if you encounter that in your work.
What sequence data looks like — the FASTQ file
NGS read files tend to be distributed in *.fastq format. To check if a file is in fastq format from your UNIX command line, you can type:
The head command will show you the first 10 lines of the file by default. It should look approximately like this:
@QLH88:00038:00079 TTGACTCATAAAATAATCGCCAAACAAGAATTTGGTTCCTTTTACGTACTTGATATCGATAAATCTTGCGGAATCTAGAAAATTCATTTTCGGCCAATTTAAACCCTTCTTCGAAA + *),,,/688444/9><>=AB699-4424424>:A::::ACDH:A<??<11-1<>>>999488084/,/1-*-+,,,/8888*4244444,44249:244-44-4959:/,*,,,,* @QLH88:00038:00096 ATTTTGGGATTTTTAGAGTTTGAAAACGAGAACTCCTTTCCTTATTTGGTGTACCTACTTGAGCCGGATGAAAGGAAACTTTCACGTCCGATTTTGAAGGGGGGAGATCCTATAGAATCCTATCCCAAATTTTTTCTTTTGCTAGGCCCATAACTAAAAAGCCCACTTTCTTACGATTACGC + 99>>/99148444+44999954499-4999B<99954:6:@D8::B>@>BB@?B999448=@AABB>99444-4:;;-499-444444244/770/1,/8888(////42489994289<A99>D199-44444(488828;799@@;57331333333+39963333,36633337734.. @QLH88:00038:00098 GTGTGCAATAACACACGAAAATCATCAAAAATGAGGCGTATGCTCGCTCCGGGGCTCGTTTGACCTTCCAAACGGCCCAGAAAACCCGTGATGGCCAACCGTATGCATAGACAACGTCTTGACGGACGTCCACGAACAAATTGGCATTTTGACGTCG
Information about each sequence read comes in The content of each of those lines is:
- Line 1 (starts with @) — description of the read and its location on the lane
- Line 2 — the sequence of the read
- Line 3 (starts with +) — additional description can be given here
- Line 4 — quality scores encoded as ASCII characters
Even if you change the name of your file when you make a copy, remember that it should have a *.fastq or *.fq extension so that the file type is identifiable to yourself and others.
Most of the data we’ll look at today is paired-end data, except for one dataset where the target sequence was short, noncoding RNAs. We’ll talk about why that makes sense during class.
When a paired end data set is produced, it’s usually distributed as a pair of FASTQ files. These paired files with have exactly the same number of sequences in them and the order of the sequences is preserved in each file, so the first read in each file of the pair comes from the same spot on the sequencing lane. It is critical that these FASTQ files be processed together with a data cleaning program that is paired end aware. If a member of a read pair is cleaned from one file but not from the other, then the pairing of the reads in the two files will be lost, and programs like sequence assemblers that consider paired end reads in specific ways to help improve assembly with interpret the data incorrectly. In this lab, we’ll use a program called Trimmomatic that correctly handles paired end reads.
Getting FASTQ files from NCBI
NCBI maintains a centralized repository of sequence data sets, the Sequence Read Archive (SRA). You can browse the SRA online, but to get files from it you’ll first need to install the SRA Toolkit on your computer. The read cleaning programs don’t really need to be run on a cluster if you’re not processing a large number of samples, so you can install sratoolkit on your own laptop along with Trimmomatic and FastQC.
On OSX, you can install and use version 2.10 of sratoolkit via homebrew; as of 1/15/2020 this install worked, and I was able to download the files for this lab using the commands below.
Now that you have SRAToolkit what do you do with it? Go to the SRA main page and search for “tomato chloroplast”. This will give you an idea what kind of results come up. Scroll down the list a bit and click through one of the links for “BINF 6350 student-generated data”. Under “Study” on the results page, click one of the links that says “all runs”. This will get you a table with a list of the run identifiers for each of the sequencing runs in the collection.
To get one of these files over onto your own computer, you use a program from the SRA Toolkit called “fastq-dump”. On a Mac, if you installed sratoolkit using homebrew as shown above, you can just type:
If you have a working executable, that will cause it to display all of the program options.
The run identifier is the code associated with a single run in the SRA. It starts with SRR. You can test and make sure you can get the file using a command like:
fastq-dump -X 5 -Z SRR1763780
That will dump the first five reads in the file to standard output in fastq format. The sequence identifier here is for one of the Ion Torrent data sets from our BINF 6350 class.
To get a complete file from the tomato chloroplast set, you would type:
You can access any run file in the SRA once you know its run number. You should see a file named SRR1763780.fastq in your current working directory after running this command.
Files you’ll use for lab
For an example of what happens when you get a file of paired end sequencing reads from the SRA, you can use an accession number from our comparative RNA-Seq study of Vibrio vulnificus strains.
fastq-dump --split-files SRR1391072
You’ll notice that this download will take much longer to complete. The data will be downloaded into two files with _1.fastq and _2.fastq extensions. If you have a very slow internet connection you can also download pre-sampled data here. The pre-sampled data was reduced by selecting 25% of the original reads at random using a program called seqtk.
Visualizing the quality of your sequence reads with FastQC
Before we start cleaning sequence reads we have to have a way to visualize and understand the quality of the whole data set. You can download FastQC from the link in the title of this section. Use it to open the FASTQ files that you just downloaded.
There’s a video tutorial about the use of FastQC. Check that out to review the modules that we looked at in class if you need to.
When we trim reads, what are we trying to get rid of?
- Specific adapter sequences — generic sequences that are used to connect fragments to beads or surfaces. Often the sequencing instrument has a built in pipeline that will attempt to remove these internally, but it’s always a good idea to filter anyway, as external filtering programs like Trimmomatic seem to do a better job.
- Low quality regions — if you look at the read quality plot in FastQC and you see that the mean per-base score falls into an unacceptable range at some point in the read, you can trim the entire data set to a length that removes a particular range, or to where the quality score falls below a threshold.
- Low-quality reads — Even if the mean per-base score stays in the acceptable range, you may want to filter for specific reads that are of low average quality across the whole read. We may also want to use a windowed trim to target reads that have internal low quality regions.
- Sequences that start out too short or become too short after trimming.
Cleaning reads with Trimmomatic
Trimmomatic is really designed to work with Illumina sequence reads, but we can apply it to other short-read NGS data sets as well.
If we wanted to work with the same data as in the figure above, remove low-scoring sequence from the trailing end of the read (where we would expect quality to drop off at some point) and then get rid of all reads that are less than 50 nucleotides after trimming, we could use:
java -jar /path/to/trimmomatic SE SRR1763770.crop.fastq SRR1763770.trail.fastq TRAILING:20 MINLEN:50
The complete list of Trimmomatic options is:
- ILLUMINACLIP: Cut adapter and other illumina-specific sequences from the read.
- SLIDINGWINDOW: Perform a sliding window trimming, cutting once the average quality within the window falls below a threshold.
- LEADING: Cut bases off the start of a read, if below a threshold quality
- TRAILING: Cut bases off the end of a read, if below a threshold quality
- CROP: Cut the read to a specified length
- HEADCROP: Cut the specified number of bases from the start of the read
- MINLEN: Drop the read if it is below a specified length
- TOPHRED33: Convert quality scores to Phred-33
- TOPHRED64: Convert quality scores to Phred-64
The program will apply as many of these as you want in the order you name them in a command line, and you can apply the same filter more than once and with different parameters. You can try them out one by one to see the effect specific filters have, and load your output file back into FastQC to compare side by side.
For paired end sequence reads, the options are the same as for single reads, but the format of the command line has to be a little different. This is because there are two input files and four files will be written, a file containing the reads from the 1.fastq (forward) file which kept their mate, the reads from the 1.fastq file where the mate was dropped, the reads from the 2.fastq (reverse) file which kept their mate, and the reads from 2.fastq where the mate was dropped.
java -jar /path/to/trimmomatic PE SRR1391072_1.fastq SRR1391072_2.fastq SRR1391072_1_pair_out.fastq SRR1391072_1_unpair_out.fastq SRR1391072_2_pair_out.fastq SRR1391072_2_unpair_out.fastq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 (etc...)
Here’s a help thread on SeqAnswers that gives an idea what Trimmomatic commands will look like when running them at the Windows command line.
I have a few data sets for you to look at. Job 1 is to try to understand why they are the way they are. Job 2 is to decide how to use Trimmomatic commands to make them better.
- A paired end E. coli genomic sequencing run that we will assemble next week. Find the paired files here.
- The V. vulnificus transcriptome data linked above. Think about what a transcriptome sequencing run means before you look at the duplications here and freak out — if a given transcript is present in a large number of copies then there will be duplication!
- Some old paired end whole genome shotgun data from a bacterial strain that the authors were attempting to assemble — this is the dataset that I showed in class that has a problem with certain tiles. R1 R2
- An ncRNA dataset, ERR3650066 — get it from the SRA! Short RNA sequences where the insert size might be shorter than the sequence read are most likely to have adapter read-through. Try to figure out what’s wrong with this one and if you can fix it with Trimmomatic.
For your report
The report you prepare today can be straightforward.
- Get your reads from the SRA. Be on a good connection when you do this!
- For each data set, use Trimmomatic to trim the reads.
- Visualize the data before and after trimming using FastQC — you’ll have to do this twice when trimming paired reads.
- Test and compare 2 combinations of parameters for each dataset. For instance, use sliding window trimming, hold all other parameters constant, and compare to the results without sliding window trimming. Or, use a different quality threshhold, holding all other parameters constant.
- EXPLAIN WHY YOU CHOSE THE COMBINATIONS OF PARAMETERS AND WHAT YOU WERE TRYING TO GET RID OF WITH EACH ONE.
- Report on the outcome — you can create a table that shows the number of reads retained for each set of input conditions and compare the overall quality via your QA plots from FastQC.