Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Should identical reads be reduced to a single count?

    Hi,

    We've been looking at SOLiD RNA-seq mappings in detail, and we've noticed two read families: reads that align "randomly", and reads that form (sometimes huge) perfectly identical pileups... In the same way that identical reads are suspected PCR artefacts and counted only once in ChIP-seq, should 100% identical reads (including base calls in conflict with reference sequence be counted only once in RNA-seq?

    This was used for instance in PNAS December 23, 2008 vol. 105 no. 51 20179-20184:

    "Genomic Mapping of the Sequence Tags.

    Position-specific base compositions were made by compiling all uniquely aligned reads. The first base of every sequence tag was discarded because of nearly random utilization at the beginning of all sequences. To eliminate redundancies created by PCR amplification, all tags with identical sequences were considered single reads. After removal of adaptor sequences from the reads, the reads were compressed to a nonredundant list of unique sequence tags, which were then mapped to the human genome (hg17) with MosaikAligner (29), using a maximum of 2 mismatches over 95% alignment of the tag (34 nt) and a hash size of 15."
    Last edited by hingamp; 11-23-2009, 12:08 AM.

  • #2
    In the hope of stimulating a little discussion on this data cleaning issue, here is a picture of SOLiD reads mapped to a reference genome (Whole transcriptome protocol, total RNA). Over and above the 'apparently random' reads near the top that stagger regularly across the sequence, notice a few 'columns' of identical reads. Are these reads PCR amplification artefacts, in which case multiple identical reads should only be counted once for quantification of gene expressions? Please ignore the sequence mismatches that accumulate near the ends (the colored bases in the reads differ form the reference), I'll start a distinct thread on this subject.
    Attached Files

    Comment


    • #3
      Originally posted by hingamp View Post
      Hi,

      We've been looking at SOLiD RNA-seq mappings in detail, and we've noticed two read families: reads that align "randomly", and reads that form (sometimes huge) perfectly identical pileups... In the same way that identical reads are suspected PCR artefacts and counted only once in ChIP-seq, should 100% identical reads (including base calls in conflict with reference sequence be counted only once in RNA-seq?
      Note that one of the main purposes of RNA-seq on a SOLiD will be to count of the number of transcripts in an RNA sample. For this purpose one would not want to discard all reads with identical starts. They are probably indicative of the representation of that RNA in the sample.

      They might also represent PCR biases, but there is a more likely culprit: The RNAseIII used to fragment the RNA prior to adaptor ligation. RNAseIII is a double-strand-specific RNAse. At the most recent SOLiD User's Conference, Ambion reps said that they had found conditions under which it will also cleave single-stranded RNA. The implication being that these conditions are used for RNA fragmentation in the SOLiD Whole Transcriptome kit. Nevertheless this enzyme is a likely source of start-bias, cleaving double-stranded RNA with abandon then begrudgingly slicing up a little bit of single stranded RNA as well.

      Note that, while not ideal, start-bias is not the same as a count-bias. For a count, you just need to unambiguously identify a given read as belonging to a transcript from a certain gene. Where it starts, for this purpose, is not critical.

      Nevertheless, if you wanted to, you could use another method of RNA fragmentation that would be less likely to have start biases. A presenter at the meeting had tried heat, and found it to be less biased. But unless you want to delve deep into the issue, it might be better to avoid tinkering. I think that heat probably results in 5'-hydroxyl, 2'-3'-cyclic phosphate RNA fragment terminii. These would likely not be good substrate for the adaptor ligation step. To convert these fragments into the 5'-phosphate,3'-hydroxyl ended fragments that the ligase will want, you would need to do end repair. T4-PNK, by itself, might do the trick. Or not.

      --
      Phillip

      Comment


      • #4
        Is there any tool to remove exact duplicate reads? It should remove all the reads with the same exact sequence, leaving alone the one with better quality values, shouldn't?

        Comment


        • #5
          Originally posted by javijevi View Post
          Is there any tool to remove exact duplicate reads? It should remove all the reads with the same exact sequence, leaving alone the one with better quality values, shouldn't?
          What I can say is that the samtools rmdup does not fit the bill. Maybe Picard (Java) does, but we've not tried it (we've basically uncompressed the BAM files & located duplicates in the SAM file using brute force RAM-greedy scripts).

          Comment


          • #6
            Originally posted by hingamp View Post
            What I can say is that the samtools rmdup does not fit the bill. Maybe Picard (Java) does, but we've not tried it (we've basically uncompressed the BAM files & located duplicates in the SAM file using brute force RAM-greedy scripts).
            Is it possible to get these scripts? I've also written one of them, but I'm a biologist and not very experienced in optimizing memory resources in my code, so that my script takes days to finish and, what is more important, does not select the duplicate with the best quality values.

            Comment


            • #7
              I've just asked my colleague who used that approach but it seems the script was not kept. From what I remember it just parsed the SAM file and built a memory hash (perl) using chrom coordinates as key. The SAM file has quality data, so you can choose to keep the best read. It ran pretty fast, but used huge RAM (we're now equiped with ~64Gb machines to do just that kind of silly scripts with no optimization whatsoever).

              Comment


              • #8
                Originally posted by hingamp View Post
                I've just asked my colleague who used that approach but it seems the script was not kept. From what I remember it just parsed the SAM file and built a memory hash (perl) using chrom coordinates as key. The SAM file has quality data, so you can choose to keep the best read. It ran pretty fast, but used huge RAM (we're now equiped with ~64Gb machines to do just that kind of silly scripts with no optimization whatsoever).
                I see. My script does the same, but I think it is impractical for hugue datasets. For example, with a 43 GB SAM file (corresponding to the mapping of a 33 GB fastq file), I guess that I need at least all the 32 GB RAM of my machine, and even it has to use the swap partition.

                Any idea from the more IT-skilled people?

                Comment


                • #9
                  I would suggest changing the data structure you guys use. Change it to
                  a prefix trie (http://en.wikipedia.org/wiki/Trie). That will reduce the memory
                  footprint but keeping the semantics of a hash table. Also, I would code it in C to
                  stay closer to the hardware.

                  What is the biological reason for not being able to use samtools/picard mark dups?
                  -drd

                  Comment


                  • #10
                    Originally posted by drio View Post
                    I would suggest changing the data structure you guys use. Change it to
                    a prefix trie (http://en.wikipedia.org/wiki/Trie). That will reduce the memory
                    footprint but keeping the semantics of a hash table.
                    Thanks a lot. I'll give a try... if I manage to understand it...

                    Originally posted by drio View Post
                    Also, I would code it in C to
                    stay closer to the hardware.
                    Yes. Learning C is my first TODO.

                    Originally posted by drio View Post
                    What is the biological reason for not being able to use samtools/picard mark dups?
                    None in my case. I just wanted to get rid of duplicated reads from the very beginning, so that calculations and files will be less hugue. I will check it.

                    Comment


                    • #11
                      Originally posted by drio View Post
                      I would suggest changing the data structure you guys use. Change it to
                      a prefix trie (http://en.wikipedia.org/wiki/Trie). That will reduce the memory
                      footprint but keeping the semantics of a hash table. Also, I would code it in C to
                      stay closer to the hardware.
                      +1 regarding coding in C; Perl and huge RAM works but clearly feels "quick & dirty"
                      Originally posted by drio View Post
                      What is the biological reason for not being able to use samtools/picard mark dups?
                      Again please check carefully as this dates back a while, but I seem to recall samtools did just what rmdups suggests: remove all duplicates (not keeping one read for each duplicate cluster). I've not tried it, but Picard is supposed to treat duplicates better?

                      Comment


                      • #12
                        Originally posted by hingamp View Post
                        +1 regarding coding in C; Perl and huge RAM works but clearly feels "quick & dirty"

                        Again please check carefully as this dates back a while, but I seem to recall samtools did just what rmdups suggests: remove all duplicates (not keeping one read for each duplicate cluster). I've not tried it, but Picard is supposed to treat duplicates better?
                        As a warning to users, do not use "samtools rmdup" as it is explicitly stated by its author.. Use picard instead.

                        Comment


                        • #13
                          Originally posted by hingamp View Post
                          I've just asked my colleague who used that approach but it seems the script was not kept. From what I remember it just parsed the SAM file and built a memory hash (perl) using chrom coordinates as key. The SAM file has quality data, so you can choose to keep the best read. It ran pretty fast, but used huge RAM (we're now equiped with ~64Gb machines to do just that kind of silly scripts with no optimization whatsoever).
                          If the SAM file is sorted by reference position, then you should be able to simply pass through the file, detecting duplicates by maintaining the previous line in memory, writing non-redundant reads to another file. This uses almost no RAM. (You can sort an unsorted SAM file from the Unix command line via sort -k 3,3 -k 4,4n alignments.sam > alignments_sorted.sam)

                          Comment


                          • #14
                            I wonder if this should be applied for Illumina-produced data or not? Should I filter out the identical (repeated) reads before doing any analysis?

                            Thanks,

                            D.

                            Comment


                            • #15
                              I see this all the time with mammalian genomes regardless of protocol and regardless of platform. It's probably caused by sequencing errors that turn common non-uniquely-mapping reads into uniquely-mapping reads.

                              Don't remove duplicate reads because they're duplicates; that defeats the purpose of quantification. Filter them out by the posterior probability that their best alignment is correct. That's what the MAPQ field in SAM format is supposed to contain, although many mappers ignore the spec so you might have to calculate it yourself. If you do that, the "stacks" go away on their own.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Essential Discoveries and Tools in Epitranscriptomics
                                by seqadmin




                                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                Yesterday, 07:01 AM
                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              58 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              54 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              45 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              55 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X