BINF 6203: Using FastQC and fastq-mcf
In this exercise we’ll learn some basic data cleansing steps for NGS data.
Part I: Use FastQC
FastQC is a Java application for visualizing quality score information on a per-base and per-read basis. To download FastQC, get the *.dmg file from the FastQC download page and install the application in your Applications folder. If you’re not on a Mac, you’ll need to choose one of the other downloads that’s appropriate for your system.
To load a file into FastQC, simply click File > Open. Files need to be in *.fastq format. To check if a file is in fastq format, run the head command on it from your UNIX command line, giving the file name as the argument. The head command will show you the first 10 lines of the file by default. It should look approximately like this:
@QLH88:00038:00079
TTGACTCATAAAATAATCGCCAAACAAGAATTTGGTTCCTTTTACGTACTTGATATCGATAAATCTTGCGGAATCTAGAAAATTCATTTTCGGCCAATTTAAACCCTTCTTCGAAA
+
*),,,/688444/9><>=AB699-4424424>:A::::ACDH:A<??<11-1<>>>999488084/,/1-*-+,,,/8888*4244444,44249:244-44-4959:/,*,,,,*
@QLH88:00038:00096
ATTTTGGGATTTTTAGAGTTTGAAAACGAGAACTCCTTTCCTTATTTGGTGTACCTACTTGAGCCGGATGAAAGGAAACTTTCACGTCCGATTTTGAAGGGGGGAGATCCTATAGAATCCTATCCCAAATTTTTTCTTTTGCTAGGCCCATAACTAAAAAGCCCACTTTCTTACGATTACGC
+
99>>/99148444+44999954499-4999B<99954:6:@D8::B>@>BB@?B999448=@AABB>99444-4:;;-499-444444244/770/1,/8888(////42489994289<A99>D199-44444(488828;799@@;57331333333+39963333,36633337734..
@QLH88:00038:00098
GTGTGCAATAACACACGAAAATCATCAAAAATGAGGCGTATGCTCGCTCCGGGGCTCGTTTGACCTTCCAAACGGCCCAGAAAACCCGTGATGGCCAACCGTATGCATAGACAACGTCTTGACGGACGTCCACGAACAAATTGGCATTTTGACGTCG
Even if you change the name of your file when you make a copy, remember that it should have a *.fastq or *.fq extension.
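If you’d rather check the whole file than eyeball the first 10 lines, here’s a minimal sketch of a format check in awk. It assumes the common layout of exactly four lines per record (header starting with @, sequence, a line starting with +, quality string). The name check_fastq is one I made up for illustration, and the two toy reads are invented so the example runs on its own.

```shell
# check_fastq is a hypothetical helper name, not part of FastQC or ea-utils.
# Assumption: every record is exactly four lines (no wrapped sequences).
check_fastq() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { msg = "line " NR ": header does not start with @"; exit }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 3 && substr($0, 1, 1) != "+" { msg = "line " NR ": separator does not start with +"; exit }
        NR % 4 == 0 && length($0) != seqlen { msg = "line " NR ": quality and sequence lengths differ"; exit }
        END {
            if (msg == "" && NR % 4 != 0) msg = "file ends in the middle of a record"
            if (msg != "") { print "not FASTQ: " msg; exit 1 }
            print "looks like FASTQ: " NR / 4 " records"
        }
    ' "$1"
}

# Toy two-record file (made-up reads) so the example is self-contained.
demo=$(mktemp)
printf '%s\n' '@read1' 'ACGT' '+' 'IIII' '@read2' 'GGCCA' '+' '99>>/' > "$demo"
check_fastq "$demo"    # prints: looks like FASTQ: 2 records
rm -f "$demo"
```

To try it on your own data, call check_fastq with the name of your fastq file in place of the toy file.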
What am I seeing in FastQC?
FastQC has several views on the data. The developers have a video that explains each of the views in detail. It mostly repeats what I went over in class, but it adds some more detail.
Choose one of the chloroplast fastq files from the course dropbox to open.
To evaluate your data, you should know the following things about the protocol that generated it. The sequence in this file is single-end sequence. It has been pre-filtered and de-multiplexed (or so Ion Torrent claims; there is at least one file where de-multiplexing did not happen correctly). The sequence was generated on the Ion Torrent instrument using a 300-nucleotide kit. That means the number of flow cycles (one flow of each nucleotide across the chip is a cycle) is greater than 300, by an unspecified amount that should result in an average read length of about 300 in your data set. You should expect a distribution of lengths around 300, but any data beyond about 350 nucleotides is suspect, because there simply were not that many flow cycles run.
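FastQC will show you the length distribution as a plot, but a quick command-line cross-check can be handy. The sketch below summarizes read lengths and counts reads longer than a cutoff. The helper name length_summary is made up, the 350 default is just the rough figure from the paragraph above, and the toy file is invented so the example runs by itself.

```shell
# length_summary is a hypothetical helper name.
# Usage: length_summary FILE [CUTOFF]   (CUTOFF defaults to 350, the rough
# limit suggested in the text, not a value reported by the instrument).
# Assumption: four lines per record, so sequences sit on lines 2, 6, 10, ...
length_summary() {
    awk -v cutoff="${2:-350}" '
        NR % 4 == 2 {
            len = length($0); n++; total += len
            if (n == 1 || len < shortest) shortest = len
            if (len > longest) longest = len
            if (len > cutoff) over++
        }
        END {
            if (n == 0) { print "no reads found"; exit }
            printf "reads=%d min=%d max=%d mean=%.1f longer_than_%d=%d\n", n, shortest, longest, total / n, cutoff, over
        }
    ' "$1"
}

# Toy file with reads of length 4 and 5, checked against a cutoff of 4.
demo=$(mktemp)
printf '%s\n' '@read1' 'ACGT' '+' 'IIII' '@read2' 'GGCCA' '+' '99>>/' > "$demo"
length_summary "$demo" 4    # prints: reads=2 min=4 max=5 mean=4.5 longer_than_4=1
rm -f "$demo"
```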
The Ion Torrent platform uses two adapters:
> A1 adapter
CCATATCATCCCTGCGTGTCTCCCACTCAG
> TrP1 adapter
CCTCTCTATGGGCAGTCGGTGAT
The instrument’s software attempts to trim these adapters for you, and actually gets rid of any reads that are shorter than the length of the adapter plus 8 (reasoning that there is no possible way such short reads will contain valuable information). In the test set that I used, fastq-mcf did not find any leftover adapters that needed trimming, but Dr. Weller tells me that there’s at least one of these fastq files where the instrument failed to recognize its own adapters. In the class dropbox, there is an “adapters.fq” file that contains these sequences, just in case you want to give it as input when trimming and see if it makes a difference.
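Before trimming, you can get a rough idea of whether a file still contains adapter sequence by counting reads with an exact copy of each adapter. This is only a sketch: count_adapter_hits is a made-up helper, it looks for exact, full-length, forward-strand matches only, and the toy reads are invented. A count of zero here does not prove the file is adapter-free, since partial or mismatched copies would be missed.

```shell
# count_adapter_hits is a hypothetical helper name.
# Usage: count_adapter_hits FILE ADAPTER_SEQUENCE
# Counts reads whose sequence line contains the adapter as an exact substring.
count_adapter_hits() {
    awk -v adapter="$2" 'NR % 4 == 2 && index($0, adapter) > 0 { hits++ } END { print hits + 0 }' "$1"
}

# Adapter sequences as listed in the text.
A1=CCATATCATCCCTGCGTGTCTCCCACTCAG
TRP1=CCTCTCTATGGGCAGTCGGTGAT

# Toy file: read1 carries a full TrP1 copy, read2 carries no adapter.
demo=$(mktemp)
read1="ACGT${TRP1}"
qual1=$(printf '%s' "$read1" | tr 'ACGT' 'IIII')   # quality string of matching length
printf '%s\n' '@read1' "$read1" '+' "$qual1" '@read2' 'GGCCA' '+' '99>>/' > "$demo"

count_adapter_hits "$demo" "$TRP1"   # prints: 1
count_adapter_hits "$demo" "$A1"     # prints: 0
rm -f "$demo"
```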
The Ion Torrent uses the Q+33 encoding to encode quality scores, BUT (and this is a big BUT) the calibration of the quality scores is different than Illumina’s. Thus, you should not be expecting a distribution of quality scores up around 35, like we saw in the Illumina example data in class. The calibration tables for Ion Torrent’s Q-scores result in a somewhat lower distribution. Dr. Weller has provided lecture notes containing some details about this which you’ll find on the Moodle site.
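To make the Q+33 encoding concrete: each character in the quality line stands for one score, and the score is the character’s ASCII code minus 33. Here’s a small sketch that decodes a quality string; phred33_scores is a made-up helper name, and the input is the first 12 characters of the quality line from the head output earlier in this exercise.

```shell
# phred33_scores is a hypothetical helper name.
# Decodes a Q+33 quality string into space-separated scores:
# score = ASCII code of the character minus 33.
phred33_scores() {
    printf '%s' "$1" | od -An -tu1 | awk '
        { for (i = 1; i <= NF; i++) { printf "%s%d", sep, $i - 33; sep = " " } }
        END { print "" }
    '
}

phred33_scores '*),,,/688444'    # prints: 9 8 11 11 11 14 21 23 23 19 19 19
```

Note that this only undoes the encoding. What a given score means in terms of error rate depends on the platform’s calibration, which is the point made above.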
In your laboratory write up, consider the following questions:
- Which part of the sequence (numerical range) should be trimmed in the file you chose based on your expectation of results? On average, which part should be trimmed based on quality?
- Is there variation in the sequence length in the file you chose? What is the range?
- Is the G/C distribution normal?
- Are there any overrepresented sequences in your file? If so, what are they?
- Are there any uncalled bases in the sequence?
- Do you see evidence of other biases in your data? What are they?
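FastQC is the main tool for answering these questions, but it can be reassuring to confirm a couple of the numbers from the command line. The sketch below reports the overall G/C percentage and the number of uncalled (N) bases in a file. It’s a rough cross-check only: base_summary is a made-up helper, it gives one overall G/C figure rather than the per-read distribution FastQC plots, and the toy reads are invented.

```shell
# base_summary is a hypothetical helper name.
# Reports total bases, overall G/C percentage, and count of uncalled (N) bases.
# Assumption: four lines per record, so sequences sit on lines 2, 6, 10, ...
base_summary() {
    awk '
        NR % 4 == 2 {
            seq = $0
            bases += length(seq)
            gc += gsub(/[GCgc]/, "", seq)   # gsub returns the number of matches
            n  += gsub(/[Nn]/, "", seq)
        }
        END {
            if (bases == 0) { print "no bases found"; exit }
            printf "bases=%d gc_percent=%.1f uncalled=%d\n", bases, 100 * gc / bases, n
        }
    ' "$1"
}

# Toy file: 9 bases in total, 6 of them G or C, 1 uncalled.
demo=$(mktemp)
printf '%s\n' '@read1' 'ACGT' '+' 'IIII' '@read2' 'GGCCN' '+' '99>>/' > "$demo"
base_summary "$demo"    # prints: bases=9 gc_percent=66.7 uncalled=1
rm -f "$demo"
```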
Part II: Use fastq-mcf
fastq-mcf is part of the ea-utils package and can be used for trimming sequence reads based on quality score and presence of adapter sequence. On a Mac or Linux workstation or virtual machine, expand the archive and then type “make” to compile. Be sure you expand the archive and compile in a directory that you have read-write access to. This install worked on my Mac laptop.
If you’re a UNIX whiz, you can set up your paths so that you don’t have to type the full path to the fastq-mcf program every time. I made a copy of one of the chloroplast fastq files in my home directory. Then I ran the fastq-mcf program using this command:
/Applications/ea-utils.1.1.2-537/fastq-mcf adapters.fq mcf-test.fastq -o mcf-test-out.fastq
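If you want to skip typing the full path each time, here’s a minimal sketch of the path setup, assuming a bash-like shell and that ea-utils was compiled in the same directory used in the command above. The variable name EA_UTILS_DIR is my own; adjust the directory to wherever you ran make.

```shell
# Assumption: ea-utils was unpacked and compiled in this directory
# (the one used in the command above). Change it to match your install.
EA_UTILS_DIR=/Applications/ea-utils.1.1.2-537

# Append the directory to the search path for this shell session.
export PATH="$PATH:$EA_UTILS_DIR"

# Show where the shell now finds fastq-mcf, or say so if it cannot.
command -v fastq-mcf || echo "fastq-mcf not found in $EA_UTILS_DIR or elsewhere on PATH"
```

This only lasts for the current terminal session. To make it stick, you would typically put the export line in your shell’s startup file, which on many bash setups is ~/.bash_profile or ~/.bashrc.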
This runs the program using the default options for trimming and filtering, which are pretty generous. Some option flags that you might wish to explore include the -q flag (the quality score threshold below which bases are trimmed) and the -L and -l (maximum and minimum length after trim) parameters. A complete list of other parameters is available on fastq-mcf’s Google Code page.
The cool thing about working with FastQC and fastq-mcf together is that you can try different combinations of options one after the other and open the output files right next to each other in FastQC to compare them. So you can try different Q-score thresholds, different length cutoffs (both minimum and maximum, since based on the protocol you don’t expect reads longer than 350 or so), or try the trim with and without the adapters.
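One way to organize that kind of comparison is a small loop that produces one output file per Q-score threshold, each with a name that records the setting. The sketch below is a dry run by default: it prints the fastq-mcf commands instead of running them, so you can check them first. The helper name sweep_quality, the DRY_RUN switch, and the -l 50 minimum length are my own illustrative choices; the -L 350 maximum comes from the protocol discussion above, and the file names are the ones from my test run.

```shell
# DRY_RUN=1 prints the commands; set DRY_RUN=0 to actually run them
# (fastq-mcf must then be on your PATH).
DRY_RUN=${DRY_RUN:-1}

# sweep_quality is a hypothetical helper name.
# Usage: sweep_quality ADAPTERS READS Q1 [Q2 ...]
# Assumptions: -l 50 is an arbitrary illustrative minimum length;
# -L 350 reflects the "nothing much beyond 350" expectation in the text.
sweep_quality() {
    adapters=$1
    reads=$2
    shift 2
    for q in "$@"; do
        out="${reads%.fastq}-q${q}.fastq"
        if [ "$DRY_RUN" = 1 ]; then
            echo fastq-mcf -q "$q" -l 50 -L 350 -o "$out" "$adapters" "$reads"
        else
            fastq-mcf -q "$q" -l 50 -L 350 -o "$out" "$adapters" "$reads"
        fi
    done
}

sweep_quality adapters.fq mcf-test.fastq 10 15 20
```

Each output file can then be opened in FastQC next to the untrimmed file. Changing only one option per sweep makes it easier to say which setting caused which change.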
In your laboratory write up, consider the following:
- What is the final combination of fastq-mcf parameters that you settled on after testing?
- Are there remaining problems with your data that fastq-mcf did not correct?
- Choose one other parameter not mentioned above. Research that parameter, figure out what it’s supposed to do, and test the impact on your data set if any.
Present a summary of your before and after trim results. Side-by-side screenshots from FastQC are helpful if there has been a change. Don’t just put them in there without comment, though: in your results section you need to explain what your figures mean and why it was important to present them.
Part III: Compare to CLC
Now go through and follow CLC’s data loading tutorial. This is the point in the CLC process where you’ll trim and filter data. A complete description of the CLC read trimming program is available in the CLC documentation. Discuss the differences between the CLC pre-filtering process and fastq-mcf filtering. Could you replicate your choices in fastq-mcf exactly in CLC? How different are the results?