Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • qesecir
    Junior Member
    • Feb 2010
    • 2

    [BWA] Calculate % of reads not aligned to reference

    Hi,

    I am using BWA to align single reads to reference genome using BWA. Because the sequenced genome is quite diverged from the reference (genetic distance is about 40 to 50%, but still belong to the same species), I am curious how many reads are not aligned by BWA, how to find out this information?

    Thanks
  • maubp
    Peter (Biopython etc)
    • Jul 2009
    • 1544

    #2
    samtools idxstats?

    Comment

    • cascoamarillo
      Senior Member
      • Oct 2010
      • 164

      #3
      Hi,

      When I use this command, I get this:

      ref 12398 1478064 0
      * 0 0 8139515

      with the reference sequence name, sequence length, # mapped reads and # unmapped reads. What is the meaning of the "0" and the "*"? (if they have any).

      best

      Comment

      • maubp
        Peter (Biopython etc)
        • Jul 2009
        • 1544

        #4
        idxstats samtools idxstats <aln.bam>
        Retrieve and print stats in the index file. The output is TAB delimited with each line consisting of reference sequence name, sequence length, # mapped reads and # unmapped reads.
        The final line is the unmapped reads (* instead of a reference sequence name, zero length).

        Comment

        • cascoamarillo
          Senior Member
          • Oct 2010
          • 164

          #5
          Thanks, now I get it. But one more question; what if I want to know the total number of reads mapped in a multi-sequence (several contigs) reference?
          i.e.:

          samtools idxstats <aln.bam>

          contig00001 22002 147 0
          contig00002 19783 23 0
          contig00003 19528 25 0
          contig00004 17742 192 0
          contig00005 16684 35 0
          .
          .
          .
          contig61681 100 0 0
          contig61682 100 0 0
          contig61684 100 0 0
          contig61685 100 0 0
          contig61686 100 0 0
          * 0 0 2005333

          Is there a way to see the sum of all the mapped reads.

          thanks in advance.

          Comment

          • maubp
            Peter (Biopython etc)
            • Jul 2009
            • 1544

            #6
            Add up all the values in column 3 (it doesn't hurt to include the final row as it has zero there) for the number of mapped reads.

            That's trivial in Perl / Python / etc. You can do it with a Unix one liner too, this is one way using the cut command to select just column 3, and awk for counting:

            Code:
            samtools idxstats example.bam | cut -f3 | awk 'BEGIN {total=0} {total += $1} END {print total}'
            There may well be a neater Unix solution...

            Comment

            • maubp
              Peter (Biopython etc)
              • Jul 2009
              • 1544

              #7
              Also you can use 'samtools flagstat' as pointed out by av_d on this thread:
              Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

              Comment

              • ruchira
                Junior Member
                • Jan 2012
                • 2

                #8
                With the human genome as reference, when I run samtools idxstat I get a count of unmapped reads for each chromosome. But if the read was unmapped, how was it assigned to a particular chromosome?

                Comment

                • swbarnes2
                  Senior Member
                  • May 2008
                  • 910

                  #9
                  Originally posted by ruchira View Post
                  With the human genome as reference, when I run samtools idxstat I get a count of unmapped reads for each chromosome. But if the read was unmapped, how was it assigned to a particular chromosome?
                  Most likely, its mate mapped to that chromosome. Sam specs call for unmapped reads to be given the mapping coordaintes of their mates, when their mates mapped. So the read has both the unampped flag set, and a mapping position.

                  Comment

                  • ruchira
                    Junior Member
                    • Jan 2012
                    • 2

                    #10
                    Thanks very much for your answer, swbarnes2. I had used bwa align and then bwa sampe to generate the sam files. So I guess perhaps bwa sampe included the chromosome information but still listed the reads as unmapped.

                    I expected some reads to be completely unmapped (because they were not human DNA). Where would these appear in the idxstats output?

                    Comment

                    • nchernia
                      Junior Member
                      • Jan 2012
                      • 2

                      #11
                      I wanted the same information recently and thought I'd post my steps. My output from BWA is in .sam format, so first it needs to be converted to bam, then sorted, then indexed. Not everyone will need to sort.

                      Code:
                      samtools view -Sb filename.sam > filename.bam
                      samtools sort filename.bam filename_sorted
                      samtools index filename_sorted.bam
                      Then I run this awk command. The "cut" of the previous is unnecessary since you can just read the appropriate field via awk.

                      Code:
                      samtools idxstats filename_sorted.bam | awk 'BEGIN {a=0;b=0} {a += $3; b+=$4 } END{print a " mapped " b " unmapped "}'

                      Comment

                      • wmyashar
                        Junior Member
                        • Jun 2012
                        • 4

                        #12
                        Is there a way to utilize samtools idxstats so i can read the amount of mapped/unmapped reads to particular genes rather than the entire chromosome itself? I get an output like this:
                        chr1 197195432 2022423 0
                        chr2 181748087 1915486 0
                        chr3 159599783 1418344 0
                        chr4 155630120 1352178 0
                        chr5 152537259 1526530 0
                        .....
                        thank you in advance

                        Comment

                        • swbarnes2
                          Senior Member
                          • May 2008
                          • 910

                          #13
                          Originally posted by wmyashar View Post
                          Is there a way to utilize samtools idxstats so i can read the amount of mapped/unmapped reads to particular genes rather than the entire chromosome itself? I get an output like this:
                          chr1 197195432 2022423 0
                          chr2 181748087 1915486 0
                          chr3 159599783 1418344 0
                          chr4 155630120 1352178 0
                          chr5 152537259 1526530 0
                          .....
                          thank you in advance
                          You want to filter your .bam based on those coordiantes, then count. I think samtools view can take a .bed file and do that, so can BEDTools.

                          Comment

                          • clsppb
                            Junior Member
                            • Oct 2011
                            • 6

                            #14
                            &quot;Mapping&quot; unmapped reads

                            Originally posted by swbarnes2 View Post
                            Most likely, its mate mapped to that chromosome. Sam specs call for unmapped reads to be given the mapping coordaintes of their mates, when their mates mapped. So the read has both the unampped flag set, and a mapping position.
                            By "its mate" are you referring to paired-end protocols?

                            Comment

                            • Gonza
                              Member
                              • Mar 2013
                              • 78

                              #15
                              Hi all,

                              after sorting, indexing and counting reads that align to chromosomes in a BAM file i get this (example for one library only):

                              1 | 30,427,671 | 5,913,901 | 0
                              2 | 19,698,289 | 3,386,635 | 0
                              3 | 23,459,830 | 4,784,837 | 0
                              4 | 18,585,056 | 3,292,873 | 0
                              5 | 26,975,502 | 5,032,188 | 0
                              Mt | 366,924 | 37,747 | 0
                              Pt | 154,478 | 9,107 | 0
                              * 0 0 0


                              First column Arabidopsis chromosomes, 2nd chromosomes length (in bp), 3rd mapped reads and 4th unmapped. Correct?

                              Now, is it possible, for example, that in chr 1 i only mapped 5 million reads and I have 0 unmpped for all? Looks odd to me.
                              Any comments.
                              Thanks.
                              Last edited by Gonza; 10-29-2014, 05:39 AM.

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                                by SEQadmin2


                                I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.


                                Here are nine questions we think about, in roughly the order they matter, before...
                                06-18-2026, 07:11 AM
                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                06-02-2026, 10:05 AM
                              • SEQadmin2
                                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                by SEQadmin2


                                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                Introduction

                                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                05-22-2026, 06:42 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, 06-17-2026, 06:09 AM
                              0 responses
                              21 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-09-2026, 11:58 AM
                              0 responses
                              40 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-05-2026, 10:09 AM
                              0 responses
                              46 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-04-2026, 08:59 AM
                              0 responses
                              49 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...