SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
50 bp paired end reads vs. 100 bp single end reads efoss Bioinformatics 12 01-15-2014 08:05 PM
Can Cuffdiff treat paired-end and single-end reads at the same time? zun RNA Sequencing 3 06-12-2012 05:37 PM
illumina single-end reads run cufflink louis7781x Bioinformatics 3 04-23-2011 06:05 AM
How to estimate error rate for short-reads and base-calling duplicate? zchou Illumina/Solexa 10 01-20-2010 08:13 AM
indels using single end short reads! bioinfosm Bioinformatics 10 08-01-2008 12:57 PM

Reply
 
Thread Tools
Old 05-21-2012, 03:52 AM   #1
inbarpl
Junior Member
 
Location: Beer Sheava

Join Date: Jul 2011
Posts: 4
Default duplicate reads in Illumina short, single end reads of RNAseq data

Dear all,
I am performing QC to fastq files of Illumina 76bp length single end reads of RNAseq data. I keep getting indications that there are many PCR duplicates; Fastqc report indicates 74.1% sequence duplication level, but no overrepresented sequences list is given. When SAMtools Duplicates removal (rmdup) option is performed, 74.9% of the sequences were found to be duplicates. When comparing the mapping results before and after the duplicates removal I see that the highly expressed genes, has the highest fraction of duplicates (which were removed). This is not the first illumina single end short reads RNA seq experiment that I see this phenomena.
I keep wondering whether this is an experimental artifact (then, should we repeat on the experiment?) or just a possible valid result (in this scenario I must than believe that few different inserts which were generated from different copies of the same kind of RNA transcripts were cleaved at the same base, leading to identical 5 end of an insert which are then sequenced).
I would be glad to know your opinion in this matter
Many thanks
Inbar
inbarpl is offline   Reply With Quote
Old 05-21-2012, 01:39 PM   #2
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

With 76-mer single reads, even for a perfectly diverse library, the theortical depth limit at any point is 152 if you use rmdup. So any gene that has more coverage than that ceiling is going to be whacked down to 152x. So you won't be able to quantify expression of those highly expressed genes.

That library sounds awfully non-diverse, but if your sample is dominated by a couple of genes at super high levels, maybe it's accurate. I guess you could examine the highly represented reads. Do they cover whole genes as if the sample had a huge amount of that RNA? Or is there just one position that has 100K reads, and adjacent positions have much less?
swbarnes2 is offline   Reply With Quote
Old 05-22-2012, 12:15 AM   #3
arvid
Senior Member
 
Location: Berlin

Join Date: Jul 2011
Posts: 156
Default

Exactly, I'd have a look at the shape of the read alignments before de-duplication to see whether it looks like PCR or simply very high coverage. 74 % isn't exceptionally high, I usually see 60-80 % for libraries which look OK.
In any case, de-duplication on reads for downstream quantification is a delicate matter, as it is difficult to discern PCR copies from valid, high-coverage, reads as swbarnes2 pointed out.
arvid is offline   Reply With Quote
Old 05-22-2012, 12:28 AM   #4
inbarpl
Junior Member
 
Location: Beer Sheava

Join Date: Jul 2011
Posts: 4
Default

swbarnes2, Thanks a lot for your answer,
I guess this is exactly the case in my data set, the samples are from Arabidopsis so I guess that Rubisco gene is the dominant in the library. I will check what you've recommended using IGV. Sorry for my ignorance but could you please explain the definition of "theoretical depth limit" and the calculation you did to extract it for my parameters ?
many thanks
Inbar
inbarpl is offline   Reply With Quote
Old 05-22-2012, 08:36 AM   #5
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Quote:
Originally Posted by inbarpl View Post
swbarnes2, Thanks a lot for your answer,
I guess this is exactly the case in my data set, the samples are from Arabidopsis so I guess that Rubisco gene is the dominant in the library. I will check what you've recommended using IGV. Sorry for my ignorance but could you please explain the definition of "theoretical depth limit" and the calculation you did to extract it for my parameters ?
many thanks
Inbar
If you filter single end data for uniqueness, you will have exactly two reads beginning at every point; one in the forward direction, one in the reverse.

So with 76-mers, the base at position 100 will be covered by 152 reads, 76 in the forward direction, starting at bases 35-100, and 76 in the reverse direction, starting from 100-175. You can't have three reads all running forward, starting at position 75, becuae your rmdup will get rid of two of them.

With paired end, you can have three reads which run in the forward direction starting at base 75, if their mates all start at different sites, because if their mates are at different sites, they must have come from different fragments. So there's a ceiling there too, depending on how variant your insert sizes are, but it's far higher than the ceiling for single read runs.
swbarnes2 is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 11:19 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO