#1
Junior Member
Location: Beer Sheava Join Date: Jul 2011
Posts: 4
Dear all,
I am running QC on FASTQ files of Illumina 76 bp single-end RNA-seq reads, and I keep getting indications that there are many PCR duplicates. The FastQC report shows a 74.1% sequence duplication level, but no overrepresented sequences are listed. When I ran SAMtools duplicate removal (rmdup), 74.9% of the sequences were found to be duplicates. Comparing the mapping results before and after duplicate removal, I see that the highly expressed genes have the highest fraction of duplicates (which were removed).

This is not the first Illumina single-end short-read RNA-seq experiment in which I have seen this phenomenon. I keep wondering whether it is an experimental artifact (in which case, should we repeat the experiment?) or a possibly valid result (in which case I must believe that different inserts, generated from different copies of the same RNA transcript, were cleaved at the same base, giving identical 5' ends that were then sequenced).

I would be glad to hear your opinion on this matter.
Many thanks,
Inbar
#2
Senior Member
Location: San Diego Join Date: May 2008
Posts: 912
With 76-mer single reads, even for a perfectly diverse library, the theoretical depth limit at any point is 152 if you use rmdup. So any gene that has more coverage than that ceiling is going to be whacked down to 152x, and you won't be able to quantify expression of those highly expressed genes.
That library sounds awfully non-diverse, but if your sample is dominated by a couple of genes at super high levels, maybe it's accurate. I guess you could examine the highly represented reads. Do they cover whole genes, as if the sample had a huge amount of that RNA? Or is there just one position that has 100K reads, while adjacent positions have much less?
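As a rough illustration of that check, here is a minimal Python sketch that summarizes how concentrated read start sites are within one gene. It assumes you have already pulled a list of `(strand, start)` tuples for the gene from your alignments; the function name `start_site_summary` and the toy data are hypothetical, and extracting the starts from a BAM file is not shown.

```python
from collections import Counter


def start_site_summary(starts):
    """Summarize how concentrated read start sites are.

    `starts` is a list of (strand, start_position) tuples, one per read.
    """
    counts = Counter(starts)
    total = sum(counts.values())
    top_start, top_count = counts.most_common(1)[0]
    return {
        "distinct_starts": len(counts),       # number of unique (strand, start) sites
        "top_start": top_start,               # most heavily used start site
        "top_fraction": top_count / total,    # share of reads at that single site
    }


# Toy data: one start site holding 90 of 100 reads (a pileup at one position).
spike = [("+", 500)] * 90 + [("+", p) for p in range(400, 410)]
# Toy data: 100 reads spread over 100 different start sites.
spread = [("+", p) for p in range(400, 500)]

print(start_site_summary(spike))
print(start_site_summary(spread))
```

A `top_fraction` close to 1 points to a pileup at a single position, while many distinct starts with a small `top_fraction` looks more like genuinely high coverage across the gene.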
#3
Senior Member
Location: Berlin Join Date: Jul 2011
Posts: 156
Exactly, I'd have a look at the shape of the read alignments before de-duplication to see whether it looks like PCR or simply very high coverage. 74% isn't exceptionally high; I usually see 60-80% for libraries which look OK.
In any case, de-duplicating reads for downstream quantification is a delicate matter, as it is difficult to tell PCR copies from valid high-coverage reads, as swbarnes2 pointed out.
#4
Junior Member
Location: Beer Sheava Join Date: Jul 2011
Posts: 4
swbarnes2, thanks a lot for your answer.
I guess this is exactly the case in my data set: the samples are from Arabidopsis, so I guess that the Rubisco gene is dominant in the library. I will check what you've recommended using IGV. Sorry for my ignorance, but could you please explain the definition of "theoretical depth limit" and the calculation you did to derive it for my parameters?
Many thanks,
Inbar
#5
Senior Member
Location: San Diego Join Date: May 2008
Posts: 912
So with 76-mers, the base at position 100 can be covered by at most 152 distinct reads: 76 in the forward direction, starting at bases 25-100, and 76 in the reverse direction, starting from 100-175. You can't have three reads all running forward, starting at position 75, because rmdup will get rid of two of them. With paired-end, you can have three reads which run in the forward direction starting at base 75 if their mates all start at different sites, because then they must have come from different fragments. So there's a ceiling there too, depending on how variable your insert sizes are, but it's far higher than the ceiling for single-read runs.
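The arithmetic can be sketched in a few lines of Python. This is only an illustration of the counting argument, not of how rmdup is implemented: it assumes duplicates are identified purely by strand and start position (plus the mate's start for paired-end), and all function names and the mate positions in the toy data are made up for the example.

```python
def covering_starts(position, read_length):
    """Start coordinates of reads that can cover `position`, per strand.

    Reverse reads are indexed by their rightmost base, where they start.
    """
    forward = range(position - read_length + 1, position + 1)
    reverse = range(position, position + read_length)
    return forward, reverse


def depth_ceiling_single_end(read_length):
    """Maximum depth at one base if only one read per (strand, start) survives."""
    return 2 * read_length


def count_after_dedup(reads, paired):
    """Count reads left after collapsing duplicates.

    `reads` holds (strand, start, mate_start) tuples; the mate start is
    only part of the duplicate key when `paired` is True.
    """
    keys = set()
    for strand, start, mate_start in reads:
        key = (strand, start, mate_start) if paired else (strand, start)
        keys.add(key)
    return len(keys)


forward, reverse = covering_starts(100, 76)
print(forward[0], forward[-1], reverse[0], reverse[-1])  # 25 100 100 175
print(depth_ceiling_single_end(76))                      # 152

# Three forward reads starting at base 75 whose mates start at different sites.
reads = [("+", 75, 300), ("+", 75, 310), ("+", 75, 325)]
print(count_after_dedup(reads, paired=False))  # 1
print(count_after_dedup(reads, paired=True))   # 3
```

The last two lines show the difference: treated as single reads, the three fragments collapse to one, but with the mate position in the key all three are kept.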