SEQanswers

10-23-2013, 02:25 PM   #1
Skiaphrene
Member
 
Location: Lausanne CH

Join Date: Aug 2013
Posts: 18
Problem with FASTQC on Trinity Mouse DC reads example dataset

Dear SEQanswers,


My name is Alex SMITH and I've recently started an RNA-seq bioinformatics post-doc at the Malaghan Institute of Medical Research. In order to practice with the tools I'll be using, I decided to try and map the RNA-seq reads from the Mouse Dendritic Cell dataset (GSE29209 ; GSM722533), which was used as an example in the original Trinity paper "Full-length transcriptome assembly from RNA-Seq data without a reference genome" (2011), to the latest mouse genome. I downloaded the paired-end read file (52.6 M reads) and split it into 2 separate end files. However, before attempting the mapping, given that the reads were generated using Illumina technology, I decided to run them through FASTQC to get a feel for them. I was very surprised when FASTQC reported very high levels of read duplication: for each end file, the 4 most duplicated reads accounted for almost 3% of all reads (each representing more than 35k reads), and the total sequence duplication level reported is >=70% in both cases.

I realise that FASTQC is not the best software for getting an idea of sequence duplication, given that it does not take paired ends into account and limits itself to unique sequences from the first 50 nucleotides of the first 200 000 reads, and that it is known to give such results, but as I am not very experienced with RNA-seq these results worried me. I tried to find out what these 4 most duplicated reads corresponded to by blasting them against the Mouse genome, the whole of the nr database (temporary report link expires on 10-25 05:00 am), the 92 common ERCC RNA-seq spike-in control sequences, and against whatever Illumina adaptors, primers, barcodes etc. that I could find. However, I have come up completely blank! Looking at the sequences of these heavily-replicated 50nt read parts, I also noticed that there were very few "double nucleotides" (the same base twice in a row), which one would expect to see regularly in any given sequence. I've attached the two tables of over-represented sequences, and copied these sequences below (they are different, and not mirrors, for ends 1 and 2):

Ends1:
TCTAGAGTACAGTGACGAGTGACGATACACGCATACGACTGACGCCGTAC
CACGTCACGTGTACGTAGTACGTACGCATACACGCATGTACGTATATAGT
AGATCTCATATCGTCGCTCGTCATGCGTGTATGCGTCTGCATACGGCGCA
GTGCAGTGCGCACATATCACATGCTATGCGTGTATGACAGTCGTATACTG

Ends2:
TGTCGATTATCGCACTGGTGCGAATGGATACGCGACATCTATCTGATGAC
CACTATAGCGATAGACAAGCATGCGCTGCGTCGACTCAGATGAGTGCACG
GCGATCGCTCTATCTGCTCATCTGCACTGCATATGAGCACGCTACTGCTA
ATAGCGCAGAGCGTGATCATGACTATACATGATCTGTGTGCAGCACATGT
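
The "double nucleotide" observation above can be quantified with a small sketch like the following. It counts positions where a base equals the one just before it and compares that with a rough expectation. The function names are made up for illustration, and the expectation assumes a random sequence with equal base frequencies, which real transcripts only approximate.

```python
def count_repeated_neighbours(seq: str) -> int:
    """Count positions where a base is identical to the base immediately before it."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a == b)


def expected_repeated_neighbours(seq: str) -> float:
    """Rough expectation for a random sequence of the same length.

    Assumption: the four bases are equally likely and independent,
    so each adjacent pair matches with probability 1/4.
    """
    return (len(seq) - 1) / 4


# The four over-represented sequences reported for Ends1 in the post.
ends1 = [
    "TCTAGAGTACAGTGACGAGTGACGATACACGCATACGACTGACGCCGTAC",
    "CACGTCACGTGTACGTAGTACGTACGCATACACGCATGTACGTATATAGT",
    "AGATCTCATATCGTCGCTCGTCATGCGTGTATGCGTCTGCATACGGCGCA",
    "GTGCAGTGCGCACATATCACATGCTATGCGTGTATGACAGTCGTATACTG",
]

for s in ends1:
    print(s, count_repeated_neighbours(s), expected_repeated_neighbours(s))
```

A count far below the expectation for every sequence would support the impression that these reads are not ordinary biological sequence.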


I have attached the FASTQC duplicate sequence graphs and the per-base sequence quality box plots (for end 1) as well. Please note that the first 4 over-represented sequences did not seem to correspond to any particular quality distribution (i.e. were not all low-quality). Obviously the 5th for each was!
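
For anyone wanting to reproduce this kind of estimate outside FASTQC, here is a minimal sketch of the approach as described in this post: truncate each read to its first 50 nucleotides, look only at the first 200 000 reads, and count repeats. This is not FASTQC's actual implementation; the function names and the plain 4-line FASTQ layout are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def read_fastq_sequences(path: str) -> Iterator[str]:
    """Yield the sequence line of each record, assuming plain 4-line FASTQ records."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:
                yield line.strip()


def duplication_summary(
    reads: Iterable[str], prefix_len: int = 50, max_reads: int = 200_000
) -> Tuple[float, List[Tuple[str, int]]]:
    """Return (fraction of reads that repeat an earlier prefix, 4 most common prefixes)."""
    counts: Counter = Counter()
    total = 0
    for read in reads:
        if total >= max_reads:
            break
        counts[read[:prefix_len]] += 1
        total += 1
    if total == 0:
        return 0.0, []
    duplicated = total - len(counts)  # every read beyond the first copy of its prefix
    return duplicated / total, counts.most_common(4)
```

Calling `duplication_summary(read_fastq_sequences("end1.fastq"))` (the file name is hypothetical) gives a single-end estimate only; like FASTQC, it ignores read pairing.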


Searching on SEQanswers, I found these interesting threads, but was not able to find a consensus interpretation:
http://seqanswers.com/forums/showthread.php?t=24094
http://seqanswers.com/forums/showthread.php?t=28607
http://seqanswers.com/forums/showthread.php?t=30397
http://seqanswers.com/forums/showthread.php?t=24040


A blog post on FASTQC duplicate sequences (pointed to by one of the threads) was interesting as well:
http://proteo.me.uk/2011/05/interpre...lot-in-fastqc/


I have no idea how to interpret the strong presence of these peculiar sequences other than as some problem in the library preparation, which I would find surprising given that this dataset was used as an example in a paper. Bottom line, I don't know what the best way to deal with them would be: keep them (as the mapping results should not be impacted), or remove them (only losing about 3% of total reads)? Should I just go ahead and do the mapping, then use Picard tools to look at library diversity (but then these reads shouldn't map anyway)? Maybe in practice it doesn't change anything but I would like to be sure I understand what I'm doing (or not doing)!


Thank you in advance for any help or enlightenment you can bring and, as always, thanks for reading!


-- Alex
Attached Images
File Type: png ORSeqs-End1.png (43.5 KB, 9 views)
File Type: png duplication_levels_End1.png (30.8 KB, 9 views)
File Type: png per_base_quality_End1.png (18.5 KB, 4 views)
File Type: png ORSeqs-End2.png (50.9 KB, 6 views)
11-17-2013, 01:21 PM   #2
Skiaphrene
Member
 
Location: Lausanne CH

Join Date: Aug 2013
Posts: 18

Doesn't anybody have any ideas? I'm sorry for the long post, but I wanted to make sure I had "done my homework" before posting for help... The question boils down to:

"What could these Illumina reads with very high FASTQC duplication levels be, after eliminating all the most obvious answers?"

Thanks,

-- Alex
12-18-2013, 12:02 AM   #3
choishingwan
Member
 
Location: Hong Kong

Join Date: Feb 2012
Posts: 21

In my experience, RNA-seq reads do have a relatively high duplication rate by their nature. Most of the time I don't read much into the duplication rate from the FastQC report and focus mainly on the sequence quality scores.
12-18-2013, 02:25 PM   #4
Skiaphrene
Member
 
Location: Lausanne CH

Join Date: Aug 2013
Posts: 18

Hi choishingwan,

Thank you for your answer! I guess maybe I'm looking too much into this... Sometimes you just get problems in the data, in the tools, or both, and you just have to work with them anyway. In this case I would like to point out that some of these replicates had very high read quality scores, while others didn't, so I didn't find any pattern there.

Either way, I'm going to find another practice dataset!

Best regards,

-- Alex
12-18-2013, 05:21 PM   #5
choishingwan
Member
 
Location: Hong Kong

Join Date: Feb 2012
Posts: 21

Try to see whether those reads all come from the same lane, or whether they are the second read of the read pair. Usually a whole lane fails together, and in general the second read of a pair has a relatively lower quality score. If I remember correctly, you should be aiming for Q30 > 80%; you can check Illumina for the specification. Another thing to look for is a high amount of over-represented sequence at the beginning of your reads: these might be adapters that require trimming, though I haven't yet had data that required it.
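
The Q30 check mentioned above can be sketched roughly as follows: compute the fraction of base calls with a Phred score of at least 30 from the quality lines of a FASTQ file. The function name is hypothetical, and the default offset of 33 is an assumption; older Illumina pipelines (1.3 to 1.7) encoded qualities with an offset of 64, so the offset should be checked against the data.

```python
from typing import Iterable


def fraction_q30(quality_lines: Iterable[str], offset: int = 33, threshold: int = 30) -> float:
    """Fraction of base calls whose Phred score is >= threshold.

    quality_lines: the 4th line of each FASTQ record (one character per base).
    offset: ASCII offset of the quality encoding (assumed 33 here).
    """
    total = 0
    passing = 0
    for qual in quality_lines:
        for ch in qual:
            total += 1
            if ord(ch) - offset >= threshold:
                passing += 1
    return passing / total if total else 0.0
```

Running this separately on each end file, and on each lane if lane information is available in the read names, would show whether low quality is concentrated in read 2 or in one lane.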
Tags
duplication, fastqc, mouse, rna-seq, trinity

All times are GMT -8.