Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • vaibhavvsk
    Member
    • Sep 2011
    • 14

    How to extract mapped and unmapped raw reads from bwa's sam file ?

    I have aligned paired end with reference sequence using BWA and got sam output file.I tried picard tools samToFastq tool to convert sam into paired end fastq files but it's for just mapped reads.
    I want to know how to extract unmapped raw reads in separate raw fastq files from sam file.
    Vaibhav Kulkarni
  • marcowanger
    Senior Member
    • Dec 2008
    • 273

    #2
    first, identify the unmapped reads from the BAM file using the FLAG field (0x4). Record the name of the read Format file see: http://samtools.sourceforge.net/SAM1.pdf

    Then, read your fastq file, extract the reads matched the names previously recorded
    Marco

    Comment

    • ulz_peter
      Senior Member
      • Feb 2010
      • 219

      #3
      If you're familiar with scripting, that would be a relatively easy task:
      Unmapped reads have an asterisk (*) in the third column, so you filter out the other ones and then extract the read name (first column) and write it into the output file after a @ sign. Then take the sequence (column 10) write it into the next line; then write the name of the read again after a + sign (it might be enough to just write a + sign). And write the qualities (column 11)in the last row.

      Have fun programming!!

      Comment

      • vaibhavvsk
        Member
        • Sep 2011
        • 14

        #4
        Thanks marcowanger & ulz_peter for your guidelines .It can be easily done by shell,Perl or Java.
        Vaibhav Kulkarni

        Comment

        • maubp
          Peter (Biopython etc)
          • Jul 2009
          • 1544

          #5
          Originally posted by ulz_peter View Post
          If you're familiar with scripting, that would be a relatively easy task:
          Unmapped reads have an asterisk (*) in the third column, so you filter out the other ones and then extract the read name (first column) and write it into the output file after a @ sign. Then take the sequence (column 10) write it into the next line; then write the name of the read again after a + sign (it might be enough to just write a + sign). And write the qualities (column 11)in the last row.

          Have fun programming!!
          That may not do exactly what is required. Some unmapped reads don't have an asterisk in the third column (reference name), specifically for paired ends where only one read maps, the other read it given dummy positions to match (for sorting purposes), but the FLAG says it is unmapped.

          i.e. You can only trust the FLAG. One easy way to do this is with the -f and -F arguments to samtools view.

          Comment

          • Alex Renwick
            Member
            • Jul 2011
            • 44

            #6
            Originally posted by maubp View Post
            ...

            i.e. You can only trust the FLAG. One easy way to do this is with the -f and -F arguments to samtools view.
            Yep, that's the way to do it.

            Position 4 indicates an unmapped read. Hence,
            Code:
            samtools view -F4 sample.bam > sample.mapped.sam
            samtools view -f4 sample.bam > sample.unmapped.sam

            Comment

            • lukas1848
              Member
              • Jun 2011
              • 54

              #7
              you could also use a simple bash one liner to do that:

              grep ^...............4 aln.sam -v > mapped.aln.sam

              its not pretty, but it works.

              Comment

              • Alex Renwick
                Member
                • Jul 2011
                • 44

                #8
                Originally posted by lukas1848 View Post
                you could also use a simple bash one liner to do that:

                grep ^...............4 aln.sam -v > mapped.aln.sam

                its not pretty, but it works.
                Actually, that won't really work. The flag is a bit field, so the values are additive. If the flag value is 4, then the "unmapped" flag is set and the other ten flags are unset. If a read is unmapped and paired, for example, the flag value would be 4 + 1 = 5.

                Also, there's the less subtle point that your grep expression requires the flag to be exactly 16 characters from the beginning of the line. That won't always be the case.

                Comment

                • Bgansw
                  Member
                  • Nov 2011
                  • 24

                  #9
                  extract unmapped reads

                  Originally posted by Alex Renwick View Post
                  Yep, that's the way to do it.

                  Position 4 indicates an unmapped read. Hence,
                  Code:
                  samtools view -F4 sample.bam > sample.mapped.sam
                  samtools view -f4 sample.bam > sample.unmapped.sam
                  Hello Alex,

                  I wanted to confirm if I'm interpreting my results correctly.I'm trying to align my sample illumina bacterial genome to an already published reference bacterial genome. The aim is to find out what genes are present in my sample genome, but missing in the reference. The plan of action was to run bwasw, and then pull out the sequence in the sample genome that do not align to the reference, then look at what these sequences are.

                  I aligned my query genome to the ref using bwasw.

                  The output was in sam format. I converted that to bam format using

                  samtools view -bS aln.sam > aln.bam

                  Then I ran the command lines quoted above.

                  When I headed my mapped.sam file, it appears to have worked, however, my unmapped.sam file is empty,the size is zero.
                  This is consistent for two of my sample genomes. It is highly likely that the two sample genomes have no extra genes, when compared to the reference, thus there are no unmapped reads, compared to the reference.

                  However, is there any other way to check this?

                  I'd be very grateful for any help/advice on this.

                  Thanx

                  bgansw

                  Comment

                  • vaibhavvsk
                    Member
                    • Sep 2011
                    • 14

                    #10
                    Hello Alex,

                    I have also same kind of experience as with Bgansw

                    samtools view -F4 sample.bam > sample.mapped.sam

                    samtools view -f4 sample.bam > sample.unmapped.sam

                    unmapped file is empty.
                    Vaibhav Kulkarni

                    Comment

                    • swbarnes2
                      Senior Member
                      • May 2008
                      • 910

                      #11
                      Poking around, I found this thread:

                      Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


                      Where someone claims that bwasw doesn't return unmapped reads. You can probably confirm this, by counting reads in your fastq, and reads in the .bam

                      So, you can process the .sam file and the fastq file with a script, to get the read that are in the fastq, but not in the .sam. Or use an aligner that will output the unaligned reads. Bowtie and bwa aln of course for Illumina reads, but you must have longer reads since you aren't using bwa aln, so I'm not sure what's appropriate for those.

                      Comment

                      • cfrias
                        Junior Member
                        • Jan 2011
                        • 5

                        #12
                        How to extract multi-mapped reads from bwa's sam file?

                        Hi, I want to extract multi-mapped reads from sam file generated from bwa (bwsw).
                        I am working with 454 data and I have used bwa 0.5.9.

                        The problem is that the flag to indicate that the read is a multiple hit not appear
                        (256 or NH:i: ),
                        So, I can not extract the reads that has multiple hits

                        so I have two question?
                        ## I'ts fine is all flags I can see in the sam file are 0, 4, 16?
                        ## How can I extract the reads that have multiple Hits?

                        e.g: I have two reads mapping in the same contig

                        @SQ SN:Contig1276 LN:500
                        HKU61D301B85SS_nem_043 0 Contig1276 172 27 8S90M1I8M2I18M428S * 0 0 ATCTGTGGAAAAATCATCACCACTGGGGAGTGGAGGAAAGGTAAGATTGCTGAATGGAAGACGAGGTCTCTGCCATTGTCCGATGGCCACATCACCTGCGAAGAATATCAGTTTGAATGTAAAATAGTTTATTGTAAACCTCTGGTTCTCGGATTTTCTTGGTTTCCCTACCATATTAAATGATGGGAAACCAAGAAAATACCAAGAAACGAAGGTTTACAACAAACTGCTTGCAAAAGACGATTTCAGGGTTTTCTGTTTTGCCTTGCAATCACCCCTGTAAGTTCTCATTTATATTTGAATGCTTCACATATTCTTTGTGCATTTTTGTTATATTTTGTTAAATGAGCATGTTTTTTTCCCCCATGCAGAATATGACCAGCAATAACGTTATATGGAATGAGAGTAAGGAGAAAGCAGAAGGAAATCAGTTATCGTCATGGAATGTGGACATTTTTGTGCAAGTTGTGCGAGAAAATGGTGAGATTTTTCACATTATTATTCAGCTTATTTATTAGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGTGGT IIIIIIDD62222DFIIEA??E@A10//42511200115;5400005:;;;::78444457=;<@;888<<<=ACCECDBDEFDDFF@>55<<@330001426?@????79799B@77333990/3----3.44335577:?;;??@<<7999@;;;?@@@>>>:11114/77@@@@A@666??AAB>A3333389>8886/------,1181359?BDDDFFEFIIFFEDDDDFB;;;;;EEEEFB@@@@@?<33325535///58;<;<551225558;;:9;788/- AS:i:76 XS:i:0 XF:i:3 XE:i:1 XN:i:0
                        HKU61D301B85SS_nem_043 0 Contig1276 400 19 204S30M1D46M275S * 0 0 ATCTGTGGAAAAATCATCACCACTGGGGAGTGGAGGAAAGGTAAGATTGCTGAATGGAAGACGAGGTCTCTGCCATTGTCCGATGGCCACATCACCTGCGAAGAATATCAGTTTGAATGTAAAATAGTTTATTGTAAACCTCTGGTTCTCGGATTTTCTTGGTTTCCCTACCATATTAAATGATGGGAAACCAAGAAAATACCAAGAAACGAAGGTTTACAACAAACTGCTTGCAAAAGACGATTTCAGGGTTTTCTGTTTTGCCTTGCAATCACCCCTGTAAGTTCTCATTTATATTTGAATGCTTCACATATTCTTTGTGCATTTTTGTTATATTTTGTTAAATGAGCATGTTTTTTTCCCCCATGCAGAATATGACCAGCAATAACGTTATATGGAATGAGAGTAAGGAGAAAGCAGAAGGAAATCAGTTATCGTCATGGAATGTGGACATTTTTGTGCAAGTTGTGCGAGAAAATGGTGAGATTTTTCACATTATTATTCAGCTTATTTATTAGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGTGGT IIIIIIDD62222DFIIEA??E@A10//42511200115;5400005:;;;::78444457=;<@;888<<<=ACCECDBDEFDDFF@>55<<@330001426?@????79799B@77333990/3----3.44335577:?;;??@<<7999@;;;?@@@>>>:11114/77@@@@A@666??AAB>A3333389>8886/------,1181359?BDDDFFEFIIFFEDDDDFB;;;;;EEEEFB@@@@@?<33325535///58;<;<551225558;;:9;788/- AS:i:57 XS:i:0 XF:i:3 XE:i:1 XN:i:0

                        I think that I can filt it using a Perl script that checking the MAPQ(5 column)?
                        (In that case the higher is 27)
                        It is correct?

                        Thanks in advance,

                        Cris

                        Comment

                        Latest Articles

                        Collapse

                        • SEQadmin2
                          From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                          by SEQadmin2


                          Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                          The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                          ...
                          06-02-2026, 10:05 AM
                        • SEQadmin2
                          Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                          by SEQadmin2


                          With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                          Introduction

                          Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                          05-22-2026, 06:42 AM
                        • SEQadmin2
                          Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                          by SEQadmin2

                          Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                          Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                          05-06-2026, 09:04 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by SEQadmin2, Today, 08:59 AM
                        0 responses
                        9 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 06-02-2026, 12:03 PM
                        0 responses
                        21 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 06-02-2026, 11:40 AM
                        0 responses
                        17 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 05-28-2026, 11:40 AM
                        0 responses
                        30 views
                        0 reactions
                        Last Post SEQadmin2  
                        Working...