BINF 6203: NGS Data QC
Today, you’ll use trimming software to clean up several NGS data sets. We’ll take a look at the following:
- Chloroplast genome sequences generated by Spring 2014 BINF 6350 students. (Ion Torrent, single end)
- Some very old and crappy 454 sequence data for Vibrio vulnificus E64MW (single end)
- Some high-quality RNASeq data for Vibrio vulnificus C-strains (paired end Illumina)
If you do not have space to download the full SRA files, you can get pre-sampled data here.
There are four software packages on the lab computers that will be useful in this exercise. First, the SRA Toolkit is what you use to download the data from NCBI. FastQC is a visualization tool that will let you see the quality of the sequences. The FASTX-Toolkit is useful for trimming single-end reads, while Trimmomatic is for paired-end Illumina data sets. All of these packages are relatively easy to install, so if the version on the lab computers is outdated or does not seem to work, you can install them in your own home directory.
You don’t have to analyze all of the samples for each data set. Choose one chloroplast data set to work on, choose one sample’s worth of paired reads from the RNASeq data set, etc.
For each of the sequence data sets, you should report on the initial quality of the data and where you observed problems. The “Methods” section in your report should describe what you did to clean each data set, in such a way that someone else could repeat what you did. Finally, in the “Results” section, you should report on the quality of your final data set. Feel free to use figures from FastQC to illustrate changes. This report does not have to be super-long to be clear and correct.
As you clean your data, you need to balance the desire to keep only the highest scoring reads and regions, against the desire to keep as much of your data as possible. When you consider your results, you should consider how much sequence you discarded and how you may have changed the average length of your reads, along with considering the quality scores.
Get a file with fastq-dump
fastq-dump is a tool in the SRA Toolkit that can do several things. It can convert a file in the SRA format to fastq. It can also fetch a file directly from the Short Read Archive and convert it to fastq on the fly. The basic form of the command is:
fastq-dump -Z runidentifier > runname.fastq
The run identifier is the code associated with a single run in the SRA. It starts with SRR. You can test and make sure you can get the file using a command like:
fastq-dump -X 5 -Z SRR1763780
That will dump the first five reads in the file to standard output in fastq format. The sequence identifier here is for one of the Ion Torrent data sets from our BINF 6350 class. Then get the full file with a command like this:
fastq-dump -Z SRR1763780 > BC_30.fastq
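Once the download finishes, a quick sanity check is to count the reads: a well-formed fastq file has exactly four lines per record, so the read count is the line count divided by four. The sketch below demonstrates the arithmetic on a made-up two-read file (`toy.fastq` is hypothetical; substitute your downloaded file's name):

```shell
# Build a tiny two-read fastq file to demonstrate; each record is 4 lines.
printf '@r1\nACGT\n+\nIIII\n@r2\nTGCA\n+\nIIII\n' > toy.fastq
# Reads = total lines / 4.
echo "$(( $(wc -l < toy.fastq) / 4 )) reads"   # prints: 2 reads
```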
Assess sequence quality with FastQC
FastQC is a Java application for visualization of quality score information on a per base and per read basis. It should be installed in the Applications folder on the lab computers.
To load a file into FastQC, simply click File > Open. Files need to be in *.fastq format. Each file in a pair of paired-end files has to be opened separately.
To check if a file is in fastq format from your UNIX command line, you can type:
head filename
The head command will show you the first 10 lines of the file by default. It should look approximately like this:
@QLH88:00038:00079
TTGACTCATAAAATAATCGCCAAACAAGAATTTGGTTCCTTTTACGTACTTGATATCGATAAATCTTGCGGAATCTAGAAAATTCATTTTCGGCCAATTTAAACCCTTCTTCGAAA
+
*),,,/688444/9><>=AB699-4424424>:A::::ACDH:A<??<11-1<>>>999488084/,/1-*-+,,,/8888*4244444,44249:244-44-4959:/,*,,,,*
@QLH88:00038:00096
ATTTTGGGATTTTTAGAGTTTGAAAACGAGAACTCCTTTCCTTATTTGGTGTACCTACTTGAGCCGGATGAAAGGAAACTTTCACGTCCGATTTTGAAGGGGGGAGATCCTATAGAATCCTATCCCAAATTTTTTCTTTTGCTAGGCCCATAACTAAAAAGCCCACTTTCTTACGATTACGC
+
99>>/99148444+44999954499-4999B<99954:6:@D8::B>@>BB@?B999448=@AABB>99444-4:;;-499-444444244/770/1,/8888(////42489994289<A99>D199-44444(488828;799@@;57331333333+39963333,36633337734..
@QLH88:00038:00098
GTGTGCAATAACACACGAAAATCATCAAAAATGAGGCGTATGCTCGCTCCGGGGCTCGTTTGACCTTCCAAACGGCCCAGAAAACCCGTGATGGCCAACCGTATGCATAGACAACGTCTTGACGGACGTCCACGAACAAATTGGCATTTTGACGTCG
Even if you change the name of your file when you make a copy, remember that it should have a *.fastq or *.fq extension.
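If you want to check more than the extension, a short awk command can verify the record structure: in each four-line record, line 1 must start with `@` and line 3 with `+`. This is just a sketch; the inline `sample.fastq` is a made-up file, so replace its name with your own file's when you try it:

```shell
# Make a one-read demo file (replace sample.fastq with your own file).
printf '@r1\nACGT\n+\nIIII\n' > sample.fastq
# Check the positional structure of every 4-line record.
awk 'NR % 4 == 1 && $0 !~ /^@/  { bad = 1 }
     NR % 4 == 3 && $0 !~ /^\+/ { bad = 1 }
     END { print (bad ? "malformed" : "looks like fastq") }' sample.fastq
```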
What am I seeing in FastQC?
FastQC has several views on the data. Here’s a video from the developers that explains each of the views in detail. Much of it repeats what I went over in class, but there is some additional detail as well:
FastQC for Illumina walkthrough (2013)
Start with the chloroplast data set whose identifier is given in the fastq-dump example, above.
In order to evaluate your data, you should know the following things about the protocol that generated it. The sequence in this file is single-end sequence. It has been pre-filtered and de-multiplexed. The sequence was generated on the Ion Torrent instrument using a 300-nucleotide kit. That means the number of flow cycles (one flow of each nucleotide across the chip is a cycle) is greater than 300, by an unspecified amount, so the average read length in your data set should be about 300. You should expect a distribution of lengths around 300, but really, any data beyond about 350 nucleotides is suspect, because there simply were not that many flow cycles run.
The Ion Torrent platform uses two adapters:
>A1_adapter
CCATATCATCCCTGCGTGTCTCCCACTCAG
>TrP1_adapter
CCTCTCTATGGGCAGTCGGTGAT
The instrument’s software attempts to trim these adapters for you, and it also discards any reads shorter than the adapter length plus 8 (reasoning that such short reads cannot contain useful information).
The Ion Torrent uses the Q+33 encoding to encode quality scores, BUT (and this is a big BUT) the calibration of the quality scores is different than Illumina’s. Thus, you should not be expecting a distribution of quality scores up around 35, like we saw in the Illumina example data in class. The calibration tables for Ion Torrent’s Q-scores result in a somewhat lower distribution. I’ve posted a technical report about this on the Moodle site.
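To see what the Q+33 encoding means concretely: each character in the quality line encodes (ASCII code − 33) as the Phred score. The sketch below decodes a made-up quality string (`II5?*` is hypothetical, not from the data set), using `od` to dump the ASCII codes and awk to subtract the offset:

```shell
# Decode a Phred+33 quality string: ASCII code minus 33 = quality score.
printf '%s' 'II5?*' | od -An -tu1 |
  awk '{ for (i = 1; i <= NF; i++) printf "%s%d", (i > 1 ? " " : ""), $i - 33
         print "" }'
# prints: 40 40 20 30 9
```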
Clean data with FASTX-Toolkit
The FASTX-Toolkit works best on single-end sequence data. You can use it with paired-end data, but then you’ll need some special manipulations afterward to restore the pairings and filter out reads that lost their mates. The FASTX-Toolkit has been around for a long time, and there are definitely other tools out there that do the same thing (such as fastq-mcf), but for now we’ll learn the basics this way.
You need to do the following steps:
- clip residual sequencing adapters
- trim off parts of sequences that are clearly too long and possibly a few bases at the beginning of the sequence if they look anomalous
- filter the sequence that’s left to make sure none of it is too low-quality
Step 1: Clip adapters
The program you’ll use for this is fastx_clipper. To see all of its command options, open a UNIX terminal window and type:
fastx_clipper -h
Read over the fastx_clipper options. Figure out which ones you need to use to accomplish the following:
- get rid of the A1 and TrP1 adapters (you’ll need to clip twice)
- don’t discard any sequences; keep them all, whether or not the adapter was found and clipped
- don’t clip the adapter if the adapter alignment length is less than 7 nucleotides
- run the program in verbose mode so that you can see the sequence count
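Putting those requirements together, a two-pass run might look like the sketch below. Treat this as a starting point, not the answer key: confirm each flag against `fastx_clipper -h` on your installed version, and note that the input/output file names are just examples.

```shell
# Pass 1: clip the A1 adapter.
#   -Q33 : reads use Phred+33 quality encoding
#   -M 7 : don't clip if the adapter alignment is shorter than 7 nt
#   -l 0 : don't discard reads for being short after clipping
#   -v   : verbose, so the sequence counts are printed
# (No -c or -C flag, so both clipped and unclipped reads are kept.)
fastx_clipper -v -Q33 -M 7 -l 0 -a CCATATCATCCCTGCGTGTCTCCCACTCAG \
    -i BC_30.fastq -o BC_30_clip1.fastq

# Pass 2: clip the TrP1 adapter from the pass-1 output.
fastx_clipper -v -Q33 -M 7 -l 0 -a CCTCTCTATGGGCAGTCGGTGAT \
    -i BC_30_clip1.fastq -o BC_30_clipped.fastq
```

Compare the verbose counts from the two passes; they tell you how often each adapter was actually found.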
Step 2: Trim for length
At this step you’ll take the output of the previous command and get rid of any sequences that are longer than they should be. Anything that has more than 350 nucleotides definitely needs to get its tail end trimmed off. You may want to trim a little off the front end of the sequence as well, and if it looks like the tail end of the sequences is always bad, you may want to trim to 300 instead of 350, for instance.
fastx_trimmer is the program that you’ll use to do this. First run it with just the -h flag to see the program help. Then figure out how to cut off any sequence greater than 350. Check the position-specific quality values for your sequences in FastQC. Do you think you should cut off anything at the 5′ end of the read? If so, try it and tell me about it in your report.
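A sketch of that trim is below (again, file names are examples, and you should confirm the flags against `fastx_trimmer -h` on your version):

```shell
# -l 350 : last base to keep, so everything past position 350 is cut
# -f 1   : first base to keep; raise this (e.g., -f 5) if FastQC makes
#          the 5' end look anomalous
fastx_trimmer -Q33 -f 1 -l 350 -i BC_30_clipped.fastq -o BC_30_trimmed.fastq
```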
Step 3: Filter for quality
Ion Torrent quality scores are calibrated a little bit low. You don’t want to be too strict with your trimming or you won’t have any sequences left. The combination of factors here is
- what your quality score threshold is
- how much of the sequence has to meet that threshold
For instance, you could say that your quality score threshold is 20, but only 50% of the sequence has to be that good. Or you could set a lower Q-score threshold but demand that more of the sequence meet it. If you have time to test different combinations and see what they do to the size of your data set, do it.
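The FASTX-Toolkit program for this step is fastq_quality_filter. A sketch of the Q20/50% combination just described is below; as before, the file names are examples, and you should check `fastq_quality_filter -h` to confirm the flags on your version:

```shell
# -q 20 : minimum quality score to count a base as passing
# -p 50 : at least 50% of bases in a read must pass, or the read is dropped
# -v    : verbose, so you can see how many reads were discarded
fastq_quality_filter -v -Q33 -q 20 -p 50 \
    -i BC_30_trimmed.fastq -o BC_30_filtered.fastq
```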
Clean data with Trimmomatic
Trimmomatic is a Java program that is optimized for handling paired-end data intelligently. By default, the “1” and “2” files of a paired-end data set are designed so that the pairs of reads in the files correspond to each other. If you trim the files and reads are lost, the paired files need to be re-matched and a single-end file created out of the unpaired reads. You can do this with a program that doesn’t natively handle paired ends, but why would you do that to yourself?
To run Trimmomatic on paired-end data, you need a command of the form:
java -jar <path to trimmomatic.jar> PE [-threads <threads>] [-phred33 | -phred64] [-trimlog <logFile>] <input 1> <input 2> <paired output 1> <unpaired output 1> <paired output 2> <unpaired output 2> <step 1> ...
This is an example of a Trimmomatic command from one of the scripts I used to process the Vibrio vulnificus RNASeq data:
java -jar ~/Applications/Trimmomatic-0.32/trimmomatic-0.32.jar PE UNCC_Gibas_121012_Run_Sample_4_JY1305_1HS_GATCAG_L003_R1_001.fastq.gz UNCC_Gibas_121012_Run_Sample_4_JY1305_1HS_GATCAG_L003_R2_001.fastq.gz JY1305_1HS_fwd_paired.fastq JY1305_1HS_fwd_unpaired.fastq JY1305_1HS_rev_paired.fastq JY1305_1HS_rev_unpaired.fastq ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10
In this example, I am ONLY removing the adapters. The file TruSeq3-PE-2.fa is distributed with the Trimmomatic package, but I can’t include it in this post. You’ll find that file in the Trimmomatic directory in your applications folder.
The authors of Trimmomatic have a good rundown of the various options on their website, and we will go over them in class. You should be aiming to do the same things that you did with the FASTX-Toolkit: remove adapters, trim low-quality regions, and get rid of reads that are too short once trimmed.
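For example, extending the adapter-only command above with quality and length steps might look like the sketch below. The input and output names are placeholders, and the step parameters are the illustrative values from the Trimmomatic documentation’s examples, not values tuned for this data set:

```shell
# ILLUMINACLIP  : remove TruSeq adapters (seed mismatches:palindrome:simple)
# LEADING:3     : trim leading bases below Q3
# TRAILING:3    : trim trailing bases below Q3
# SLIDINGWINDOW : cut once a 4-base window's average quality drops below 15
# MINLEN:36     : drop reads shorter than 36 nt after trimming
java -jar ~/Applications/Trimmomatic-0.32/trimmomatic-0.32.jar PE \
    sample_R1.fastq.gz sample_R2.fastq.gz \
    sample_fwd_paired.fastq sample_fwd_unpaired.fastq \
    sample_rev_paired.fastq sample_rev_unpaired.fastq \
    ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 \
    LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
```

Trimmomatic applies the steps in the order you list them, which is why adapter clipping comes before the quality steps here.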
Some tips:
1. At first, do one thing at a time so you can see what effect each particular trim is having.
2. If all your sequence disappears and you don’t know why, see tip 1.
3. There is no single right way; different data sets need different things.