SEQanswers > Applications Forums > RNA Sequencing

11-18-2009, 10:11 AM   #1
hingamp

Should identical reads be reduced to a single count?

Hi,

We've been looking at SOLiD RNA-seq mappings in detail, and we've noticed two read families: reads that align "randomly", and reads that form (sometimes huge) perfectly identical pileups... In the same way that identical reads are suspected PCR artefacts and counted only once in ChIP-seq, should 100% identical reads (including base calls that conflict with the reference sequence) be counted only once in RNA-seq?

This approach was used, for instance, in PNAS December 23, 2008, vol. 105, no. 51, 20179-20184:

"Genomic Mapping of the Sequence Tags.

Position-specific base compositions were made by compiling all uniquely aligned reads. The first base of every sequence tag was discarded because of nearly random utilization at the beginning of all sequences. To eliminate redundancies created by PCR amplification, all tags with identical sequences were considered single reads. After removal of adaptor sequences from the reads, the reads were compressed to a nonredundant list of unique sequence tags, which were then mapped to the human genome (hg17) with MosaikAligner (29), using a maximum of 2 mismatches over 95% alignment of the tag (34 nt) and a hash size of 15."

Last edited by hingamp; 11-23-2009 at 12:08 AM.
11-23-2009, 12:33 AM   #2
hingamp

In the hope of stimulating a little discussion on this data-cleaning issue, here is a picture of SOLiD reads mapped to a reference genome (whole transcriptome protocol, total RNA). Over and above the 'apparently random' reads near the top that stagger regularly across the sequence, notice a few 'columns' of identical reads. Are these reads PCR amplification artefacts, in which case multiple identical reads should be counted only once when quantifying gene expression? Please ignore the sequence mismatches that accumulate near the read ends (the colored bases in the reads differ from the reference); I'll start a distinct thread on that subject.
[Attached image: suspect_pileup.jpg]
11-23-2009, 05:54 AM   #3
pmiguel

Quote:
Originally Posted by hingamp
Hi,

We've been looking at SOLiD RNA-seq mappings in detail, and we've noticed two read families: reads that align "randomly", and reads that form (sometimes huge) perfectly identical pileups... In the same way that identical reads are suspected PCR artefacts and counted only once in ChIP-seq, should 100% identical reads (including base calls that conflict with the reference sequence) be counted only once in RNA-seq?
Note that one of the main purposes of RNA-seq on a SOLiD is to count the number of transcripts in an RNA sample. For this purpose one would not want to discard all reads with identical starts; they are probably indicative of the representation of that RNA in the sample.

They might also represent PCR biases, but there is a more likely culprit: the RNase III used to fragment the RNA prior to adaptor ligation. RNase III is a double-strand-specific RNase. At the most recent SOLiD User's Conference, Ambion reps said that they had found conditions under which it will also cleave single-stranded RNA, the implication being that these conditions are used for RNA fragmentation in the SOLiD Whole Transcriptome kit. Nevertheless, this enzyme is a likely source of start bias, cleaving double-stranded RNA with abandon and then begrudgingly slicing up a little single-stranded RNA as well.

Note that, while not ideal, start-bias is not the same as a count-bias. For a count, you just need to unambiguously identify a given read as belonging to a transcript from a certain gene. Where it starts, for this purpose, is not critical.

Nevertheless, if you wanted to, you could use another method of RNA fragmentation that would be less likely to have start biases. A presenter at the meeting had tried heat and found it to be less biased. But unless you want to delve deep into the issue, it might be better to avoid tinkering. I think that heat probably results in RNA fragments with 5'-hydroxyl and 2'-3'-cyclic-phosphate termini. These would likely not be good substrates for the adaptor ligation step. To convert these fragments into the 5'-phosphate, 3'-hydroxyl ended fragments that the ligase will want, you would need to do end repair. T4 PNK, by itself, might do the trick. Or not.

--
Phillip
02-18-2010, 10:25 AM   #4
javijevi

Is there any tool to remove exact duplicate reads? It should remove all the reads with the same exact sequence, keeping only the one with the best quality values, shouldn't it?
02-22-2010, 02:26 AM   #5
hingamp

Quote:
Originally Posted by javijevi
Is there any tool to remove exact duplicate reads? It should remove all the reads with the same exact sequence, keeping only the one with the best quality values, shouldn't it?
What I can say is that samtools rmdup does not fit the bill. Maybe Picard (Java) does, but we've not tried it (we've basically uncompressed the BAM files and located duplicates in the SAM file using brute-force, RAM-greedy scripts).
02-22-2010, 05:04 AM   #6
javijevi

Quote:
Originally Posted by hingamp
What I can say is that samtools rmdup does not fit the bill. Maybe Picard (Java) does, but we've not tried it (we've basically uncompressed the BAM files and located duplicates in the SAM file using brute-force, RAM-greedy scripts).
Is it possible to get these scripts? I've written one myself, but I'm a biologist and not very experienced at optimizing memory use in my code, so my script takes days to finish and, more importantly, does not select the duplicate with the best quality values.
02-22-2010, 06:01 AM   #7
hingamp

I've just asked the colleague who used that approach, but it seems the script was not kept. From what I remember, it just parsed the SAM file and built an in-memory hash (Perl) keyed on chromosome coordinates. The SAM file has quality data, so you can choose to keep the best read. It ran pretty fast but used huge RAM (we're now equipped with ~64 GB machines to run just that kind of silly script with no optimization whatsoever).
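
In outline it did something like the minimal Python sketch below (reconstructed from memory, not the original Perl; "input.sam" and "dedup.sam" are placeholder names, and it assumes single-end reads):

Code:
# Keep, per mapping position and strand, the read with the highest mean base
# quality. Everything is held in memory, like the original Perl script.

def mean_quality(qual):
    # SAM QUAL is Phred+33 ASCII; a higher mean suggests a better read.
    return sum(ord(c) - 33 for c in qual) / float(len(qual))

best = {}  # (chrom, pos, strand) -> (mean_qual, sam_line)

with open("input.sam") as sam:
    for line in sam:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        flag, chrom, pos, qual = int(fields[1]), fields[2], fields[3], fields[10]
        if flag & 4:
            continue  # unmapped read
        strand = "-" if flag & 16 else "+"
        key = (chrom, pos, strand)
        mq = mean_quality(qual)
        if key not in best or mq > best[key][0]:
            best[key] = (mq, line)

with open("dedup.sam", "w") as out:
    for mq, kept_line in best.values():
        out.write(kept_line)

Note it treats reads with the same start position and strand as duplicates, which is what coordinate-based duplicate marking does anyway; keying on the full read sequence instead is a one-line change.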
02-22-2010, 07:04 AM   #8
javijevi

Quote:
Originally Posted by hingamp
I've just asked the colleague who used that approach, but it seems the script was not kept. From what I remember, it just parsed the SAM file and built an in-memory hash (Perl) keyed on chromosome coordinates. The SAM file has quality data, so you can choose to keep the best read. It ran pretty fast but used huge RAM (we're now equipped with ~64 GB machines to run just that kind of silly script with no optimization whatsoever).
I see. My script does the same, but I think it is impractical for huge datasets. For example, with a 43 GB SAM file (corresponding to the mapping of a 33 GB FASTQ file), I guess that I need at least all 32 GB of RAM on my machine, and it even has to use the swap partition.

Any ideas from the more IT-skilled people?
02-22-2010, 07:58 AM   #9
drio

I would suggest changing the data structure you guys use. Change it to a prefix trie (http://en.wikipedia.org/wiki/Trie). That will reduce the memory footprint while keeping the semantics of a hash table. Also, I would code it in C to stay closer to the hardware.
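
To illustrate the idea, here is a toy Python version (Python just for readability; as said, a real implementation would be C):

Code:
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}  # base -> TrieNode
        self.count = 0      # number of reads ending exactly here

def insert(root, seq):
    # Walk/extend the trie one base at a time; shared prefixes are stored once.
    node = root
    for base in seq:
        node = node.children.setdefault(base, TrieNode())
    node.count += 1
    return node.count  # 1 on first occurrence, >1 for duplicates

root = TrieNode()
for seq in ("TACGT", "TACGT", "TACGA"):
    status = "kept" if insert(root, seq) == 1 else "duplicate"
    print(seq, status)

Since identical reads share their entire path, millions of exact duplicates cost a single path plus a counter.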

What is the biological reason for not being able to use samtools/picard mark dups?
__________________
-drd
02-23-2010, 02:01 AM   #10
javijevi

Quote:
Originally Posted by drio
I would suggest changing the data structure you guys use. Change it to a prefix trie (http://en.wikipedia.org/wiki/Trie). That will reduce the memory footprint while keeping the semantics of a hash table.
Thanks a lot. I'll give it a try... if I manage to understand it...

Quote:
Originally Posted by drio
Also, I would code it in C to stay closer to the hardware.
Yes. Learning C is my first TODO.

Quote:
Originally Posted by drio
What is the biological reason for not being able to use samtools/picard mark dups?
None in my case. I just wanted to get rid of duplicated reads from the very beginning, so that downstream calculations and files would be smaller. I will check it.
02-23-2010, 02:28 AM   #11
hingamp

Quote:
Originally Posted by drio
I would suggest changing the data structure you guys use. Change it to a prefix trie (http://en.wikipedia.org/wiki/Trie). That will reduce the memory footprint while keeping the semantics of a hash table. Also, I would code it in C to stay closer to the hardware.
+1 regarding coding in C; Perl and huge RAM works but clearly feels "quick & dirty"
Quote:
Originally Posted by drio
What is the biological reason for not being able to use samtools/picard mark dups?
Again, please check carefully as this dates back a while, but I seem to recall samtools rmdup did just what its name suggests: remove all duplicates (not keeping one read per duplicate cluster). I've not tried it, but Picard is supposed to treat duplicates better?
02-23-2010, 09:20 AM   #12
nilshomer

Quote:
Originally Posted by hingamp
+1 regarding coding in C; Perl and huge RAM works but clearly feels "quick & dirty"

Again, please check carefully as this dates back a while, but I seem to recall samtools rmdup did just what its name suggests: remove all duplicates (not keeping one read per duplicate cluster). I've not tried it, but Picard is supposed to treat duplicates better?
As a warning to users: do not use "samtools rmdup"; its author has explicitly said as much. Use Picard instead.
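
For reference, a typical invocation looks something like this (double-check the option names against the Picard documentation for your version):

Code:
java -jar MarkDuplicates.jar \
    INPUT=alignments.sorted.bam \
    OUTPUT=alignments.dedup.bam \
    METRICS_FILE=duplication_metrics.txt \
    REMOVE_DUPLICATES=true

The input should be coordinate-sorted. With REMOVE_DUPLICATES=false (the default) Picard only flags duplicates with the 0x400 SAM flag rather than removing them, which lets downstream tools decide what to do with them.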
02-23-2010, 11:52 AM   #13
duffman

Quote:
Originally Posted by hingamp
I've just asked the colleague who used that approach, but it seems the script was not kept. From what I remember, it just parsed the SAM file and built an in-memory hash (Perl) keyed on chromosome coordinates. The SAM file has quality data, so you can choose to keep the best read. It ran pretty fast but used huge RAM (we're now equipped with ~64 GB machines to run just that kind of silly script with no optimization whatsoever).
If the SAM file is sorted by reference position, then you can make a single pass through the file, detecting duplicates by keeping the previous line in memory and writing non-redundant reads to another file. This uses almost no RAM. (You can sort an unsorted SAM file from the Unix command line via sort -k 3,3 -k 4,4n alignments.sam > alignments_sorted.sam.)
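
Something like this Python sketch (my illustration, assuming a position-sorted single-end SAM on stdin; it keeps the highest-mean-quality read per position group, per the quality discussion above):

Code:
import sys

def mean_quality(qual):
    # SAM QUAL is Phred+33 ASCII.
    return sum(ord(c) - 33 for c in qual) / float(len(qual))

def position_key(fields):
    flag = int(fields[1])
    strand = "-" if flag & 16 else "+"
    return (fields[2], fields[3], strand)  # (chrom, pos, strand)

prev_key, best_line, best_q = None, None, -1.0
for line in sys.stdin:
    if line.startswith("@"):
        sys.stdout.write(line)  # pass any header lines through
        continue
    fields = line.rstrip("\n").split("\t")
    key = position_key(fields)
    q = mean_quality(fields[10])
    if key != prev_key:
        if best_line is not None:
            sys.stdout.write(best_line)  # flush the previous group's best read
        prev_key, best_line, best_q = key, line, q
    elif q > best_q:
        best_line, best_q = line, q  # better duplicate within the same group
if best_line is not None:
    sys.stdout.write(best_line)

Because only one position group is held at a time, memory use is constant regardless of file size. Usage (dedup_stream.py being whatever you name the script):
sort -k 3,3 -k 4,4n alignments.sam | python dedup_stream.py > alignments_dedup.sam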
03-18-2010, 10:57 PM   #14
dukevn

I wonder if this should be applied to Illumina-produced data as well. Should I filter out identical (repeated) reads before doing any analysis?

Thanks,

D.
05-21-2010, 12:13 AM   #15
jwfoley

I see this all the time with mammalian genomes regardless of protocol and regardless of platform. It's probably caused by sequencing errors that turn common non-uniquely-mapping reads into uniquely-mapping reads.

Don't remove duplicate reads because they're duplicates; that defeats the purpose of quantification. Filter them out by the posterior probability that their best alignment is correct. That's what the MAPQ field in SAM format is supposed to contain, although many mappers ignore the spec so you might have to calculate it yourself. If you do that, the "stacks" go away on their own.
05-26-2010, 10:18 PM   #16
thinkRNA

Quote:
Originally Posted by jwfoley
Don't remove duplicate reads because they're duplicates; that defeats the purpose of quantification. Filter them out by the posterior probability that their best alignment is correct. That's what the MAPQ field in SAM format is supposed to contain, although many mappers ignore the spec so you might have to calculate it yourself. If you do that, the "stacks" go away on their own.
Hi jwfoley, can you please explain how to calculate the posterior probability that an alignment is correct? What threshold should one use, and moreover, how does one interpret the MAPQ field in the SAM format? I would really appreciate your reply!

Thanks in advance.
05-27-2010, 01:23 PM   #17
Michael.James.Clark

Quote:
Originally Posted by dukevn
I wonder if this should be applied to Illumina-produced data as well. Should I filter out identical (repeated) reads before doing any analysis?

Thanks,

D.
It can definitely happen in Illumina data. PCR duplicates are an issue in both Illumina and SOLiD platforms.

That said, I think you have to consider your experiment before deciding if duplicate removal is for you.

Keep in mind that duplicates are generally thought to be due to PCR rather than independent observations of the same position. That assumption holds when coverage is much lower than the read length, i.e., when there is on average well under one read per possible start position. Consider the chances that your particular experiment will have duplicate reads by chance.

Two examples:

A 25-base single-end genomic library. Once you reach 25x coverage you average one read per possible start position, so you're in a range where it's very likely to see duplicates that are independent observations, not due to PCR. For this library I might apply a statistical model to estimate how many chance duplicates to expect, and only filter if the observed number of duplicates falls significantly outside that distribution (see the sketch at the end of this post).

A 50-base mate-pair genomic library. Each read pair here contains 100 sequenced bases and spans 1200-1500 bases between the two ends. In this library you can exceed 100x coverage and still be fairly confident that duplicates are generally not independent observations, but due to PCR. At 25x coverage I feel confident that consolidating duplicate reads loses no significant amount of relevant data: the vast majority of duplicates in this library are PCR duplicates.

As for the specific case of RNA-seq here, I think pmiguel contributed some valuable information on how to avoid RNase III-related biases that's worth looking at (sorry, I generally do whole-genome work, so my advice here is more general).

As for algorithms for removing duplicates, Picard MarkDuplicates is adequate and does the job thoroughly and correctly. It is pretty much "standard operating procedure" for our genomic analysis team.

Edit: Let me clarify something in case it's unclear. For the example 25-base single-end genomic library, I would not do blind duplicate removal. Even at, say, 10x or 15x coverage there's a very good chance duplicates are independent observations in such a library, so removing all duplicates would discard a lot of real, useful data.
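
To put rough numbers on those two examples, here's a back-of-the-envelope occupancy-model calculation (my own illustration; it assumes uniform, independent start sites, which real libraries violate, so treat the outputs as order-of-magnitude only):

Code:
import math

def chance_duplicate_fraction(n_reads, n_sites):
    # Expected distinct sites among n_reads uniform draws from n_sites is
    # approximately n_sites * (1 - exp(-n_reads / n_sites)).
    distinct = n_sites * (1.0 - math.exp(-float(n_reads) / n_sites))
    return 1.0 - distinct / n_reads

genome = 3.0e9  # haploid human genome, for scale

# 25-base single-end library at 25x coverage: one read per site on average.
n_se = 25 * genome / 25  # coverage * genome / read_length
print("25-bp SE at 25x: %.0f%% chance duplicates"
      % (100 * chance_duplicate_fraction(n_se, genome)))

# 50-base mate-pair library at 100x: the duplicate key is a coordinate PAIR,
# so the effective site space is roughly genome * insert-size spread (~300 bp).
n_mp = 100 * genome / 100  # 100 sequenced bases per pair
print("50-bp MP at 100x: %.2f%% chance duplicates"
      % (100 * chance_duplicate_fraction(n_mp, genome * 300)))

The single-end library lands around 37% coincidental duplicates at 25x, while the mate-pair library stays well under 1% even at 100x, which is exactly why blind removal is dangerous in the first case and reasonably safe in the second.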
__________________
Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
Projects: U87MG whole genome sequence [Website] [Paper]

Last edited by Michael.James.Clark; 05-27-2010 at 01:50 PM. Reason: Clarification.
05-27-2010, 01:27 PM   #18
nilshomer

Duplicate removal always reminds me of the Birthday problem. lh3 has a good white-paper on the theory behind duplicate removal.
05-28-2010, 01:50 PM   #19
jwfoley

Quote:
Originally Posted by thinkRNA
Hi jwfoley, can you please explain how to calculate the posterior probability that an alignment is correct? What threshold should one use, and moreover, how does one interpret the MAPQ field in the SAM format? I would really appreciate your reply!

Thanks in advance.
Say a given tag aligns to a single site with zero mismatches, but aligns to five sites with one mismatch each. It fits the definition of a unique best hit, but there's actually a reasonable chance it came from any of those other five sites and had a single base-calling error. So the posterior probability that the best alignment is actually correct is P[zero base errors] / (P[zero base errors] + 5 * P[one base error]), if we ignore other possible alignments beyond one mismatch (and probabilities of multiple base errors become vanishingly small, so this is safe, though I usually carry it out to two mismatches since mappers tend to provide those).

The simplest way to model base errors is binomial: each base in your read has an equal, independent probability of being correct or incorrect. Then if your reads are 25 bases long and the error rate is 1%, the probability of a read with zero errors is 0.78, and the probability of a read with one error is 0.20, so for the example above, the posterior probability that the best alignment is correct would be 0.44, which is low.
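
In code, the worked example is just a few lines (toy binomial model only; with real data you'd substitute per-base qualities):

Code:
read_len, error_rate = 25, 0.01

# Binomial model: equal, independent error probability at every base.
p0 = (1 - error_rate) ** read_len                                # ~0.78
p1 = read_len * error_rate * (1 - error_rate) ** (read_len - 1)  # ~0.20

# One zero-mismatch alignment competing with five one-mismatch alignments:
posterior = p0 / (p0 + 5 * p1)
print("P[best alignment correct] = %.2f" % posterior)            # ~0.44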

Of course it would make even more sense to use base quality scores instead of treating all errors as equally likely, if you have that information available. I haven't tried that because I typically deal with Eland and Corona Lite output, which doesn't include the base qualities like SAM does. Likewise, it would be better to consider insertion and deletion errors (which BFAST provides), not just mismatches. But even using the simple binomial model, I find that a threshold of 0.9 removes the vast majority of these "stacks," at least in samples that aren't PCR-bottlenecked, because most of these are actually artifacts of mapping rather than PCR.

The SAM spec includes a "MAPQ" field that's supposed to contain a Phred-scale posterior probability score for each alignment, though it doesn't say how to calculate the score (nor should it). Assuming your mapper puts something useful in here, you can just parse it out, convert out of Phred scale, and filter on that (or convert your threshold into Phred scale). Of course, some mappers (e.g. Bowtie) essentially ignore the spec, so in that case you'll have to count multiple alignments of each read and calculate the probability yourself.
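
Concretely, the conversion and filter look like this (assuming your mapper writes a spec-compliant MAPQ; note that a 0.9 threshold corresponds to MAPQ >= 10):

Code:
def mapq_to_p_correct(mapq):
    # MAPQ = -10 * log10(P[mapping is wrong]), so invert the Phred scale.
    return 1.0 - 10.0 ** (-mapq / 10.0)

for mapq in (3, 10, 30):
    p = mapq_to_p_correct(mapq)
    print("MAPQ %2d -> P[correct] = %.3f -> %s"
          % (mapq, p, "keep" if p >= 0.9 else "discard"))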

Last edited by jwfoley; 05-28-2010 at 01:55 PM.
10-29-2010, 08:36 AM   #20
JohnK

Why doesn't samtools rmdup fit the bill? Is this an RNA-seq-specific problem? Also, I didn't see any posts regarding this: you can have one tag mapping to multiple splice junctions. Will any duplicate-removing program pick this up, and/or would you keep it and consider it a possible isoform (event) if it maps to the same gene? What if it maps to different genes? Thanks.