  • why retain unmapped reads?

    I have heard that it is important for downstream analyses to retain unmapped reads. I am interested to know the reason for this recommendation.

    Specifically, I am using BWA + GATK to call SNPs from Illumina data. It is not clear to me if the GATK SNP calling pipeline ever utilizes unmapped reads. We expect a large proportion of unmapped reads, so we could save a lot of disk space by getting rid of them.

  • #2
    You don't *have* to retain unmapped reads if you are only calling SNPs. Especially if you are archiving the original FASTQ files, you could remove the unmapped reads from the BAMs...
    Last edited by adaptivegenome; 01-12-2012, 09:07 PM. Reason: typo
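
As a rough illustration of what "removing unmapped reads" means at the record level: in the SAM/BAM format, a read is marked unmapped by bit 0x4 of its FLAG field. The sketch below filters SAM text lines on that bit; the function names and the toy records are made up for illustration. In practice the same filter is usually applied with `samtools view -b -F 4`.

```python
UNMAPPED = 0x4  # SAM FLAG bit: the read itself is unmapped


def is_unmapped(sam_line: str) -> bool:
    """Return True if a SAM alignment line has the 'unmapped' bit set."""
    flag = int(sam_line.split("\t")[1])  # FLAG is the second column
    return bool(flag & UNMAPPED)


def drop_unmapped(sam_lines):
    """Yield header lines and mapped records; skip unmapped records."""
    for line in sam_lines:
        if line.startswith("@") or not is_unmapped(line):
            yield line


# Toy input: read names and coordinates are hypothetical.
example = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTT\tIIIII",
]
kept = list(drop_unmapped(example))
```

Here `kept` holds the header line and `read1` only. Whether discarding `read2` is safe depends on still having the original FASTQ files, as noted above.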


    • #3
      If you want to call structural variants at some point, you will probably want to keep the unmapped reads as they could cover breakpoints that prevent them from aligning.

      However, if you only want to call SNPs and you are guaranteed not to care about calling anything else, then I agree with genericforms.


      • #4
        So that others can re-run the analysis: the complete read set tells others what the real source data is. There is also other information in the unmapped reads, often viral or bacterial sequences that may be of interest (e.g. the sample carries a herpesvirus).

        A classic example is paired-end RNA-seq. One read of a pair may not map, but you still need it for paired-end processing; aligners require both mates to be present. Improvements in alignment software for something as tricky as RNA alignment are likely (someday). Another case might be a very wacky indel: trying to align all the reads to a small region or an alternate genome build, using different software, might provide insight.
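
To make the paired-end point concrete, here is a minimal sketch that classifies a record by its SAM FLAG bits: 0x1 (read is paired), 0x4 (this read is unmapped) and 0x8 (its mate is unmapped). The function name and category labels are invented for illustration. Records in the "one_mate_unmapped" category are the ones that break pairing if unmapped reads are stripped blindly.

```python
PAIRED = 0x1         # template has multiple segments (paired-end)
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def pair_status(flag: int) -> str:
    """Classify a SAM FLAG value by the mapping state of the read pair."""
    if not flag & PAIRED:
        return "single"
    read_unmapped = bool(flag & UNMAPPED)
    mate_unmapped = bool(flag & MATE_UNMAPPED)
    if read_unmapped and mate_unmapped:
        return "both_unmapped"
    if read_unmapped or mate_unmapped:
        return "one_mate_unmapped"
    return "both_mapped"


# FLAG 73 = paired + mate unmapped + first in pair: this read mapped, its mate did not.
status = pair_status(73)
```

A filter that keeps every record whose status is not "both_unmapped" would preserve complete pairs while still dropping pairs in which neither mate aligned.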


        • #5
          Depends what you want

          If there are unmapped reads, either the mapper has made a mistake, the reference has gaps, or the sample is different from the reference in some way that the mapper cannot compensate for. The differences may be structural variants, repeats, paralogues of genes, duplications of regions, etc.

          If you want a set of conservative SNPs and you don't care about accessing all variation, then that's fine: you don't care about those problematic parts of the genome.

          If you have some phenotype you are investigating, or you want a complete/sensitive set of variants, then you may be concerned about missing SNPs or more complex variants. In that case you want to keep the unmapped reads to do stuff with them (count them, or assemble the unmapped reads, or assemble ALL the reads, or use paired-ends to detect structural variants, etc.).


          • #6
            You can also use the reads to look for potential contamination. Throw them into an assembler and BLAST the bigger contigs you get out. If you see decent-size contigs for a viral or bacterial species, you may want to add a contamination filter step to the beginning of the mapping pipeline and see how that changes your results.
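
Getting the unmapped reads into an assembler usually means writing them back out as FASTQ first. A minimal sketch of that conversion for SAM text records follows; the function name is hypothetical, and the `/1` and `/2` name suffixes are an assumed convention that your assembler may or may not expect. Tools such as `samtools fastq` do this job in practice.

```python
# Translation table for complementing DNA bases (case preserved).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def unmapped_to_fastq(sam_line: str):
    """Return a FASTQ record for an unmapped SAM line, or None if the read is mapped."""
    fields = sam_line.rstrip("\n").split("\t")
    name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
    if not flag & 0x4:   # 0x4: read unmapped
        return None
    if flag & 0x10:      # 0x10: stored reverse-complemented, so restore the original strand
        seq = seq.translate(_COMPLEMENT)[::-1]
        qual = qual[::-1]
    if flag & 0x40:      # 0x40: first read of the pair
        name += "/1"
    elif flag & 0x80:    # 0x80: second read of the pair
        name += "/2"
    return f"@{name}\n{seq}\n+\n{qual}\n"


# Toy record (hypothetical read): paired, both mates unmapped, first in pair.
record = unmapped_to_fastq("r1\t77\t*\t0\t0\t*\t*\t0\t0\tACGGT\tABCDE")
```

The resulting FASTQ can then be assembled and the larger contigs searched with BLAST, as described above.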
