
  • how to eliminate reads that map to multiple places in genome

    I am trying to compare expression of histone genes in an RNA-seq dataset using IGV. Many of the reads map to multiple places in the genome, which is not surprising given the homology in this protein family, but I need accurate expression data. How do I fix the alignments to get rid of these reads?

  • #2
    Typically, ambiguously-mapped reads are either given a single random mapping location, or mapped to all locations via secondary alignments. Expression quantification tools understand this.

    You can get rid of multimapping reads by filtering out everything with a MAPQ of 3 or less, but that will cause underrepresentation of the expression of homologous genes.



    • #3
      Thanks for the response.

      I tried setting the IGV threshold for mapping quality to 50 and I am still getting reads that map to multiple places, based on the "blat read sequence" readout table.

      Some reads start in one histone gene and end in another one 90 kb away...?

      How do I tell whether my reads were given a single random location or mapped to all locations via secondary alignments?
      Thank you.



      • #4
        It depends on which aligner you used and whether it followed the convention of giving a low MAPQ value to multi-mapped reads. Brian was referring to the default behavior of his BBMap aligner. You can either re-map the data (if you are not sure what your original aligner did) or use one of the suggestions in this Biostars thread to filter your BAM files.
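
One way to see which behavior your aligner used is to count records by FLAG bits, since the SAM specification marks secondary alignments with 0x100 and supplementary ones with 0x800. The following is only a rough sketch: `count_alignment_classes` is a made-up helper name (not part of samtools), and it reads SAM text on stdin. If the secondary count is zero, the aligner most likely reported a single location per read.

```shell
# Hypothetical helper: classify SAM records read from stdin by FLAG bits.
# SAM spec bits used: 0x4 = unmapped, 0x100 = secondary, 0x800 = supplementary.
# Header lines (starting with @) are skipped. POSIX awk has no bitwise AND,
# so each bit is tested with integer division and modulo.
count_alignment_classes() {
    awk -F '\t' '
        /^@/ { next }
        {
            flag = $2
            if (int(flag / 4) % 2 == 1)         unmapped++
            else if (int(flag / 256) % 2 == 1)  secondary++
            else if (int(flag / 2048) % 2 == 1) supplementary++
            else                                primary++
        }
        END {
            printf "primary\t%d\nsecondary\t%d\nsupplementary\t%d\nunmapped\t%d\n",
                   primary, secondary, supplementary, unmapped
        }'
}

# Usage, assuming samtools is installed and aln.bam is your file:
#   samtools view aln.bam | count_alignment_classes
```

Records are counted, not reads, so a read with one primary and two secondary alignments contributes three records.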



        • #5
          You can also look at the NH tag:

          Code:
          NH i Number of reported alignments that contain the query in the current record
          That's an optional field, but it should tell you how many alignments there are for a read in the SAM file. There's also

          Code:
          H0 i Number of perfect hits
          ...but that is much less common.
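
As a rough sketch of how the NH tag could be used, the two helpers below tally records by their NH value and keep only records tagged NH:i:1. Both names (`nh_histogram`, `keep_unique_nh`) are invented for illustration, and both read SAM text on stdin. Since NH is optional, check the histogram first: if everything lands in "absent", your aligner did not write the tag and the filter would discard all records.

```shell
# Hypothetical helper: tally SAM records on stdin by their NH:i: value.
# Records without an NH tag are counted under "absent".
nh_histogram() {
    awk -F '\t' '
        /^@/ { next }
        {
            nh = "absent"
            # Optional tags start at column 12 in SAM.
            for (i = 12; i <= NF; i++)
                if ($i ~ /^NH:i:/) { nh = substr($i, 6); break }
            count[nh]++
        }
        END { for (k in count) printf "%s\t%d\n", k, count[k] }' | LC_ALL=C sort
}

# Hypothetical helper: pass through header lines and records with NH:i:1 only.
keep_unique_nh() {
    awk -F '\t' '
        /^@/ { print; next }
        {
            for (i = 12; i <= NF; i++)
                if ($i == "NH:i:1") { print; next }
        }'
}

# Usage, assuming samtools is installed and aln.bam is your file:
#   samtools view aln.bam | nh_histogram
#   samtools view -h aln.bam | keep_unique_nh | samtools view -b -o aln.unique.bam -
```

Keep in mind the earlier caveat in this thread: dropping multi-mapped reads will underrepresent expression of homologous genes.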



          • #6
            Check out figure 12 and the area around it in this paper.

            Sequence read alignment to a reference genome is a fundamental step in many genomics studies. Accuracy in this fundamental step is crucial for correct interpretation of biological data. In cases where two or more closely related bacterial strains are being studied, a common approach is to simply map reads from all strains to a common reference genome, whether because there is no closed reference for some strains or for ease of comparison. The assumption is that the differences between bacterial strains are insignificant enough that the results of differential expression analysis will not be influenced by choice of reference. Genes that are common among the strains under study are used for differential expression analysis, while the remaining genes, which may fail to express in one sample or the other because they are simply absent, are analyzed separately. In this study, we investigate the practice of using a common reference in transcriptomic analysis. We analyze two multi-strain transcriptomic data sets that were initially presented in the literature as comparisons based on a common reference, but which have available closed genomic sequence for all strains, allowing a detailed examination of the impact of reference choice. We provide a method for identifying regions that are most affected by non-native alignments, leading to false positives in differential expression analysis, and perform an in depth analysis identifying the extent of expression loss. We also simulate several data sets to identify best practices for non-native reference use.


            It might not be such a good idea to remove them.



            • #7
              Some command-line examples for mapping quality filtering.

              The MAPQ score distribution depends on the mapper program and version used.

              First check the actual MAPQ score histogram for your BAM file (first 1M reads; $1 is the path to the BAM file):

              samtools view "$1" | head -n 1000000 | cut -f5 | sort -n | uniq -c

              Then filter with the desired threshold using samtools and the -q parameter (assuming the input is test.bam):

              #!/bin/sh
              # Keep reads with MAPQ >= MAP_QUAL; reads that fail go to the -U file.
              MAP_QUAL=30
              STRAIN=test
              samtools view -b -q "$MAP_QUAL" -o "${STRAIN}.Q${MAP_QUAL}.bam" -U "${STRAIN}.Q${MAP_QUAL}U.bam" "${STRAIN}.bam"


              Reads passing the MAPQ filter will end up in test.Q30.bam,

              and reads failing it will end up in test.Q30U.bam.



              • #8
                Be very careful when dealing with MAPQ scores. It seems every aligner has a different way to determine them and they aren't standardized at all. See here:
