Sequence Read QC
DNA Sequencing is a continually evolving technology, with new platforms becoming available each year, all designed with the aim of reducing the cost and increasing the speed of large sequencing projects. As you can see from the figure below, certain types of sequencing dominate the current market. The top three instruments shown are established high and moderate capacity Illumina sequencing instruments, the next two are Ion Torrent instruments, and the two below that are newer Illumina instruments. Beyond that there is a small and diverse group of cutting edge sequencing instruments that are just beginning to be more widely adopted.
In this exercise we will focus primarily on quality analysis and quality control of Illumina and Ion Torrent sequencing data, since those are the types you are currently most likely to encounter.
What sequence data looks like — the FASTQ file
NGS read files tend to be distributed in *.fastq format. To check if a file is in fastq format from your UNIX command line, you can type:
head filename
The head command will show you the first 10 lines of the file by default. It should look approximately like this:
@QLH88:00038:00079 TTGACTCATAAAATAATCGCCAAACAAGAATTTGGTTCCTTTTACGTACTTGATATCGATAAATCTTGCGGAATCTAGAAAATTCATTTTCGGCCAATTTAAACCCTTCTTCGAAA + *),,,/688444/9><>=AB699-4424424>:A::::ACDH:A<??<11-1<>>>999488084/,/1-*-+,,,/8888*4244444,44249:244-44-4959:/,*,,,,* @QLH88:00038:00096 ATTTTGGGATTTTTAGAGTTTGAAAACGAGAACTCCTTTCCTTATTTGGTGTACCTACTTGAGCCGGATGAAAGGAAACTTTCACGTCCGATTTTGAAGGGGGGAGATCCTATAGAATCCTATCCCAAATTTTTTCTTTTGCTAGGCCCATAACTAAAAAGCCCACTTTCTTACGATTACGC + 99>>/99148444+44999954499-4999B<99954:6:@D8::B>@>BB@?B999448=@AABB>99444-4:;;-499-444444244/770/1,/8888(////42489994289<A99>D199-44444(488828;799@@;57331333333+39963333,36633337734.. @QLH88:00038:00098 GTGTGCAATAACACACGAAAATCATCAAAAATGAGGCGTATGCTCGCTCCGGGGCTCGTTTGACCTTCCAAACGGCCCAGAAAACCCGTGATGGCCAACCGTATGCATAGACAACGTCTTGACGGACGTCCACGAACAAATTGGCATTTTGACGTCG
Information about each sequence read comes in The content of each of those lines is:
- Line 1 (starts with @) — description of the read and its location on the lane
- Line 2 — the sequence of the read
- Line 3 (starts with +) — additional description can be given here
- Line 4 — quality scores encoded as ASCII characters
Even if you change the name of your file when you make a copy, remember that it should have a *.fastq or *.fq extension so that the file type is identifiable to yourself and others.
In class, we’ll talk about how paired end reads (two reads sequenced from the ends of the same shotgun DNA fragment) can provide information for scaffolding a genome, using a strategy of creating a mix of shorter and longer fragments. For nearly every application, paired end reads make sense, because they carry information about both ends of the fragment and about its approximate length. That’s useful for assembly, for mapping, and for checking the integrity of results.
When a paired end data set is produced, it’s usually distributed as a pair of FASTQ files. These paired files with have exactly the same number of sequences in them and the order of the sequences is preserved in each file, so the first read in each file of the pair comes from the same spot on the sequencing lane. It is critical that these FASTQ files be processed together with a data cleaning program that is paired end aware. If a member of a read pair is cleaned from one file but not from the other, then the pairing of the reads in the two files will be lost, and programs like sequence assemblers that consider paired end reads in specific ways to help improve assembly with interpret the data incorrectly. In this lab, we’ll use a program called Trimmomatic that correctly handles paired end reads.
Getting FASTQ files from NCBI
NCBI maintains a centralized repository of sequence data sets, the Sequence Read Archive (SRA). You can browse the SRA online, but to get files from it you’ll first need to install the SRA Toolkit on your computer. We’re actually going to install stuff on your computer this week, because the read cleaning programs don’t really need to be run on a cluster if you’re not processing a large number of samples, and also just to give you some more experience in doing what tends to be a frequent task in a researcher’s life.
On OSX, you can install and use version 2.9.3 of sratoolkit via homebrew; as of 1/15/2019 this install worked, and I was able to download the files for this lab using the commands below.
Now that you have SRAToolkit what do you do with it? Go to the SRA main page and search for “tomato chloroplast”. This will give you an idea what kind of results come up. Scroll down the list a bit and click through one of the links for “BINF 6350 student-generated data”. These sequences were generated in Dr. Weller’s laboratory class a couple of years ago. Under “Study” on the results page, click one of the links that says “all runs”. This will get you a table with a list of the run identifiers for each of the sequencing runs in the collection.
To get one of these files over onto your own computer, you use a program from the SRA Toolkit called “fastq-dump”. On a Mac, if you installed sratoolkit using homebrew as shown above, you can just type:
fastq-dump --help
If you have a working executable, that will cause it to display all of the program options.
The run identifier is the code associated with a single run in the SRA. It starts with SRR. You can test and make sure you can get the file using a command like:
fastq-dump -X 5 -Z SRR1763780
That will dump the first five reads in the file to standard output in fastq format. The sequence identifier here is for one of the Ion Torrent data sets from our BINF 6350 class.
To get a complete file from the tomato chloroplast set, you can type:
fastq-dump -A SRR1763780
You can access any run file in the SRA once you know its run number. You should see a file named SRR1763780.fastq in your current working directory after running this command.
For an example of what happens when you get a file of paired end sequencing reads from the SRA, you can use an accession number from our comparative RNA-Seq study of Vibrio vulnificus strains.
fastq-dump --split-files SRR1391072
You’ll notice that this download will take much longer to complete. The data will be downloaded into two files with _1.fastq and _2.fastq extensions. If you have a very slow internet connection you can also download pre-sampled data here. The data was reduced by selecting 25% of the original reads at random using a program called seqtk.
Visualizing the quality of your sequence reads with FastQC
Before we start cleaning sequence reads we have to have a way to visualize and understand the quality of the whole data set. You can download FastQC from the link in the title of this section. Use it to open the FASTQ files that you just downloaded.
There’s a video tutorial about the use of FastQC. Check that out to review the modules that we looked at in class if you need to.
When we trim reads, what are we trying to get rid of?
- Specific adapter sequences — generic sequences that are used to connect fragments to beads or surfaces. Often the sequencing instrument has a built in pipeline that will attempt to remove these internally, but it’s always a good idea to filter anyway, as external filtering programs like Trimmomatic seem to do a better job.
- Low quality regions — if you look at the read quality plot in FastQC and you see that the mean per-base score falls into an unacceptable range at some point in the read, you can trim the entire data set to a length that removes a particular range, or to where the quality score falls below a threshold.
- Low-quality reads — Even if the mean per-base score stays in the acceptable range, you may want to filter for specific reads that are of low average quality across the whole read. We may also want to use a windowed trim to target reads that have internal low quality regions.
- Sequences that start out too short or become too short after trimming.
Cleaning Ion Torrent sequence reads with FASTX
The chloroplast reads (the single file) that you downloaded above were generated on the department’s Ion Torrent instrument. Calibration of Ion Torrent quality scores, especially on older instruments like the one used to gather this data, is rather different from calibration for Illumina reads. An Ion Torrent data set that is of reasonable, usable quality may have an average quality score in the 20s rather than in the mid-30s as is typical for Illumina data.
FASTX is a flexible toolkit that was designed for single reads of any length. It’s no longer maintained by the developers, but still works. If you have not been able to get it to work, you can use Trimmomatic in single-end mode on this data set. A new second gen read cleaning pipeline that has appeared more recently is Atropos, so you could try that one as well.
It is not pair aware and will not maintain the paired-ness of two corresponding read files from the same paired end data set. But we can use it for this single end Ion Torrent data and many other data sets as well.
fastx_clipper is the FASTX Toolkit program for removing adapters. Here, we first get rid of the Ion Torrent A1 adapter, and in a subsequent step, the TrP1 adapter.
fastx_clipper -v -n -a CCATATCATCCCTGCGTGTCTCCCACTCAG -i in.fastq -o out1.fastq fastx_clipper -v -n -a CCTCTCTATGGGCAGTCGGTGAT -i out1.fastq -o out2.fastq
fastx_trimmer is the FASTX Toolkit command that removes sequence by position in the sequence. Here I have used it to remove the first 9 characters of every sequence, and to trim off any sequence that goes past 300, which is the very maximum length we could expect from the kit that was used.
fastx_trimmer -v -f 9 -l 300 -i out2.fastq -o out3.fastq
Finally, the quality filter step retains any remaining sequences which have a quality of at least 12 over 90 percent of their length. We could use a different combination of parameters here and get a quite different result. So you could test and see what would happen if you required a quality of at least 20 over 50% of the length, for instance.
fastq_quality_filter -v -q 12 -p 90 -i out3.fastq -o out4.fastq
Cleaning reads with Trimmomatic
Trimmomatic is really designed to work with Illumina sequence reads, but we can apply it to our 454 data set to see how the commands are analogous to FASTX.
If we wanted to work with the same data as above, remove low-scoring sequence from the trailing end of the read (where we would expect quality to drop off at some point) and then get rid of all reads that are less than 50 nucleotides after trimming, we could use:
java -jar /path/to/trimmomatic SE SRR1763770.crop.fastq SRR1763770.trail.fastq TRAILING:20 MINLEN:50
The complete list of Trimmomatic options is:
- ILLUMINACLIP: Cut adapter and other illumina-specific sequences from the read.
- SLIDINGWINDOW: Perform a sliding window trimming, cutting once the average quality within the window falls below a threshold.
- LEADING: Cut bases off the start of a read, if below a threshold quality
- TRAILING: Cut bases off the end of a read, if below a threshold quality
- CROP: Cut the read to a specified length
- HEADCROP: Cut the specified number of bases from the start of the read
- MINLEN: Drop the read if it is below a specified length
- TOPHRED33: Convert quality scores to Phred-33
- TOPHRED64: Convert quality scores to Phred-64
The program will apply as many of these as you want in the order you name them in a command line, and can apply the same filter more than once and with different parameters. You can try them out one by one to see the effect specific filters have, and load your output file back into FastQC to compare side by side.
For paired end sequence reads, the options are the same as for single reads, but the format of the command line has to be a little different. This is because there are two input files and four files will be written, a file containing the reads from the 1.fastq (forward) file which kept their mate, the reads from the 1.fastq file where the mate was dropped, the reads from the 2.fastq (reverse) file which kept their mate, and the reads from 2.fastq where the mate was dropped.
java -jar /path/to/trimmomatic PE SRR1391072_1.fastq SRR1391072_2.fastq SRR1391072_1_pair_out.fastq SRR1391072_1_unpair_out.fastq SRR1391072_2_pair_out.fastq SRR1391072_2_unpair_out.fastq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 (etc...)
Here’s a help thread on SeqAnswers that gives an idea what Trimmomatic commands will look like when running them at the Windows command line.
For your report
The report you prepare today can be rather simple.
- Get your reads from the SRA. Be on a good connection when you do this!
- For each data set, use Trimmomatic to trim the reads.
- Visualize the data before and after trimming using FastQC — you’ll have to do this twice when trimming paired reads.
- Test and compare 3 combinations of parameters for each dataset. For instance, use sliding window trimming, hold all other parameters constant, and compare to the results without sliding window trimming. Or, use a different quality threshhold, holding all other parameters constant.
- Report on the outcome — you can create a table that shows the number of reads retained for each set of input conditions and compare the overall quality via your QA plots from FastQC.