WGBS Analysis Part 12

Reviewing trimgalore output with multiqc

Yesterday I started running [fastqc](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) so I could evaluate trimgalore output. After my script finished running, I transferred the files to relevant subdirectories in this gannet folder, and moved the HTML reports to my repository and class repository. Then, I looked at the multiqc reports after the first, second, and third trims.

multiqc output

The main thing I wanted to check was the overrepresented sequences remaining in the analysis files, so I started by checking the summary module in each report:

Figures 1-3. MultiQC status checks after the first, second, and third trims.

All samples passed the overrepresented sequences check! When I dug into the reports further, I found that some files still had adapter sequences after the second trim, but they were gone after the third trim:

Figures 4-6. MultiQC overrepresented sequences for sample 7, read 1.

I looked at the rest of the MultiQC modules from the third trim to see if there were any other inconsistencies between samples:

Figures 7-8. MultiQC modules from the third trim with inconsistencies between samples.

Two files (sample 2, reads 1 and 2) did not pass the per sequence GC content check. There were no spikes that indicated a poly-G tail, and the distributions weren’t completely different from the other samples, so I’m not concerned. Ten files (samples 1-4 and 6, reads 1 and 2) didn’t pass the sequence duplication levels check. Again, the distributions didn’t look too different from the other samples. The one thing that does concern me is that all of these samples are from the same treatment: 3N and high pH. The only other sample in that treatment, sample 5, passed the sequence duplication check. When looking at sample methylation levels in a PCA, I’ll need to check whether all six samples cluster together, or whether the sequence duplication levels affect that clustering.
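The module check above could also be scripted instead of read off the report by eye. This is only a minimal sketch in Python, not what I actually ran: the file names (`sample2_R1`, etc.), the status values, and both helper functions are hypothetical, and the toy table only stands in for per-file module statuses copied out of a report.

```python
# Toy per-file module statuses. File names and status values are invented
# for illustration; they are not copied from the real reports.
statuses = {
    "sample2_R1": {"per_sequence_gc_content": "warn", "sequence_duplication_levels": "fail"},
    "sample2_R2": {"per_sequence_gc_content": "warn", "sequence_duplication_levels": "fail"},
    "sample5_R1": {"per_sequence_gc_content": "pass", "sequence_duplication_levels": "pass"},
    "sample5_R2": {"per_sequence_gc_content": "pass", "sequence_duplication_levels": "pass"},
}


def files_not_passing(statuses, module):
    """Return sorted file names whose status for `module` is warn or fail."""
    return sorted(
        name for name, mods in statuses.items()
        if mods.get(module) in {"warn", "fail"}
    )


def samples_not_passing(statuses, module):
    """Collapse file names like 'sample2_R1' to a sorted list of sample IDs."""
    flagged = files_not_passing(statuses, module)
    return sorted({name.rsplit("_", 1)[0] for name in flagged})


if __name__ == "__main__":
    for module in ("per_sequence_gc_content", "sequence_duplication_levels"):
        print(module, samples_not_passing(statuses, module))
```

Collapsing files to sample IDs makes it easier to line the flagged samples up against a treatment table, which is the comparison I care about for the PCA.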

Percentage reads lost from trimming

Similar to the Hawaii data, I trimmed poly-G tails out of these samples, and wanted to check how much data was lost in the third round of trimming. To do this, I downloaded the sequence count data from the multiqc reports for trims 2 and 3. I calculated the number of total reads, percent unique reads, and percent duplicate reads for each round of trimming, and saved the output for trim 2 here and trim 3 here. For both rounds of trimming, about 25-30% of reads are duplicates. This is higher than what I saw for the Hawaii data, but is consistent with the fact that DNA from histology may not be of the highest quality, leading to a higher likelihood of repeat reads. I combined the information from both trims in this file, and also calculated the difference in total reads between trims. Roughly 1-2% of reads were lost between rounds 2 and 3, which is consistent with the Hawaii data. It’ll be interesting to see how the higher percentage of duplicate reads will affect [bismark](https://github.com/FelixKrueger/Bismark) mapping.
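The arithmetic behind these percentages is simple enough to sketch in Python. This is an illustration under stated assumptions, not the script I used: it assumes each file contributes a unique read count and a duplicate read count, the function names are hypothetical, and the numbers in the usage section are invented purely to show the calculation.

```python
def summarize_counts(unique_reads, duplicate_reads):
    """Return total reads plus percent unique and percent duplicate."""
    total = unique_reads + duplicate_reads
    if total == 0:
        raise ValueError("no reads to summarize")
    return {
        "total": total,
        "pct_unique": 100 * unique_reads / total,
        "pct_duplicate": 100 * duplicate_reads / total,
    }


def percent_reads_lost(total_before, total_after):
    """Percent of reads removed between two rounds of trimming."""
    if total_before == 0:
        raise ValueError("total_before must be nonzero")
    return 100 * (total_before - total_after) / total_before


if __name__ == "__main__":
    # Invented counts for one file after trim 2 and trim 3 (not real data).
    trim2 = summarize_counts(unique_reads=7_200_000, duplicate_reads=2_800_000)
    trim3 = summarize_counts(unique_reads=7_100_000, duplicate_reads=2_750_000)
    print(trim2["pct_duplicate"])
    print(percent_reads_lost(trim2["total"], trim3["total"]))
```

The loss is expressed relative to the trim 2 total, so it answers "what fraction of the reads that survived trim 2 were removed by trim 3".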

Going forward

  1. Start bismark
  2. Update the repository README files
  3. Write methods
  4. Write results
  5. Identify DML
  6. Determine if RNA should be extracted
  7. Determine if larval DNA/RNA should be extracted
Written on February 7, 2021