Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • [BWA] Calculate % of reads not aligned to reference

    Hi,

    I am using BWA to align single reads to reference genome using BWA. Because the sequenced genome is quite diverged from the reference (genetic distance is about 40 to 50%, but still belong to the same species), I am curious how many reads are not aligned by BWA, how to find out this information?

    Thanks

  • #2
    samtools idxstats?

    Comment


    • #3
      Hi,

      When I use this command, I get this:

      ref 12398 1478064 0
      * 0 0 8139515

      with the reference sequence name, sequence length, # mapped reads and # unmapped reads. What is the meaning of the "0" and the "*"? (if they have any).

      best

      Comment


      • #4
        idxstats samtools idxstats <aln.bam>
        Retrieve and print stats in the index file. The output is TAB delimited with each line consisting of reference sequence name, sequence length, # mapped reads and # unmapped reads.
        The final line is the unmapped reads (* instead of a reference sequence name, zero length).

        Comment


        • #5
          Thanks, now I get it. But one more question; what if I want to know the total number of reads mapped in a multi-sequence (several contigs) reference?
          i.e.:

          samtools idxstats <aln.bam>

          contig00001 22002 147 0
          contig00002 19783 23 0
          contig00003 19528 25 0
          contig00004 17742 192 0
          contig00005 16684 35 0
          .
          .
          .
          contig61681 100 0 0
          contig61682 100 0 0
          contig61684 100 0 0
          contig61685 100 0 0
          contig61686 100 0 0
          * 0 0 2005333

          Is there a way to see the sum of all the mapped reads.

          thanks in advance.

          Comment


          • #6
            Add up all the values in column 3 (it doesn't hurt to include the final row as it has zero there) for the number of mapped reads.

            That's trivial in Perl / Python / etc. You can do it with a Unix one liner too, this is one way using the cut command to select just column 3, and awk for counting:

            Code:
            samtools idxstats example.bam | cut -f3 | awk 'BEGIN {total=0} {total += $1} END {print total}'
            There may well be a neater Unix solution...

            Comment


            • #7
              Also you can use 'samtools flagstat' as pointed out by av_d on this thread:
              Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

              Comment


              • #8
                With the human genome as reference, when I run samtools idxstat I get a count of unmapped reads for each chromosome. But if the read was unmapped, how was it assigned to a particular chromosome?

                Comment


                • #9
                  Originally posted by ruchira View Post
                  With the human genome as reference, when I run samtools idxstat I get a count of unmapped reads for each chromosome. But if the read was unmapped, how was it assigned to a particular chromosome?
                  Most likely, its mate mapped to that chromosome. Sam specs call for unmapped reads to be given the mapping coordaintes of their mates, when their mates mapped. So the read has both the unampped flag set, and a mapping position.

                  Comment


                  • #10
                    Thanks very much for your answer, swbarnes2. I had used bwa align and then bwa sampe to generate the sam files. So I guess perhaps bwa sampe included the chromosome information but still listed the reads as unmapped.

                    I expected some reads to be completely unmapped (because they were not human DNA). Where would these appear in the idxstats output?

                    Comment


                    • #11
                      I wanted the same information recently and thought I'd post my steps. My output from BWA is in .sam format, so first it needs to be converted to bam, then sorted, then indexed. Not everyone will need to sort.

                      Code:
                      samtools view -Sb filename.sam > filename.bam
                      samtools sort filename.bam filename_sorted
                      samtools index filename_sorted.bam
                      Then I run this awk command. The "cut" of the previous is unnecessary since you can just read the appropriate field via awk.

                      Code:
                      samtools idxstats filename_sorted.bam | awk 'BEGIN {a=0;b=0} {a += $3; b+=$4 } END{print a " mapped " b " unmapped "}'

                      Comment


                      • #12
                        Is there a way to utilize samtools idxstats so i can read the amount of mapped/unmapped reads to particular genes rather than the entire chromosome itself? I get an output like this:
                        chr1 197195432 2022423 0
                        chr2 181748087 1915486 0
                        chr3 159599783 1418344 0
                        chr4 155630120 1352178 0
                        chr5 152537259 1526530 0
                        .....
                        thank you in advance

                        Comment


                        • #13
                          Originally posted by wmyashar View Post
                          Is there a way to utilize samtools idxstats so i can read the amount of mapped/unmapped reads to particular genes rather than the entire chromosome itself? I get an output like this:
                          chr1 197195432 2022423 0
                          chr2 181748087 1915486 0
                          chr3 159599783 1418344 0
                          chr4 155630120 1352178 0
                          chr5 152537259 1526530 0
                          .....
                          thank you in advance
                          You want to filter your .bam based on those coordiantes, then count. I think samtools view can take a .bed file and do that, so can BEDTools.

                          Comment


                          • #14
                            &quot;Mapping&quot; unmapped reads

                            Originally posted by swbarnes2 View Post
                            Most likely, its mate mapped to that chromosome. Sam specs call for unmapped reads to be given the mapping coordaintes of their mates, when their mates mapped. So the read has both the unampped flag set, and a mapping position.
                            By "its mate" are you referring to paired-end protocols?

                            Comment


                            • #15
                              Hi all,

                              after sorting, indexing and counting reads that align to chromosomes in a BAM file i get this (example for one library only):

                              1 | 30,427,671 | 5,913,901 | 0
                              2 | 19,698,289 | 3,386,635 | 0
                              3 | 23,459,830 | 4,784,837 | 0
                              4 | 18,585,056 | 3,292,873 | 0
                              5 | 26,975,502 | 5,032,188 | 0
                              Mt | 366,924 | 37,747 | 0
                              Pt | 154,478 | 9,107 | 0
                              * 0 0 0


                              First column Arabidopsis chromosomes, 2nd chromosomes length (in bp), 3rd mapped reads and 4th unmapped. Correct?

                              Now, is it possible, for example, that in chr 1 i only mapped 5 million reads and I have 0 unmpped for all? Looks odd to me.
                              Any comments.
                              Thanks.
                              Last edited by Gonza; 10-29-2014, 05:39 AM.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Essential Discoveries and Tools in Epitranscriptomics
                                by seqadmin




                                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                04-22-2024, 07:01 AM
                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Today, 08:47 AM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              60 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              57 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              53 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X