SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
How to extract multi-mapped reads by samtools? mavishou RNA Sequencing 5 12-05-2016 05:27 AM
how to extract raw unaligned reads? joseph Bioinformatics 2 12-20-2011 05:24 PM
BWA output bitwise flag for mapped/unmapped reads wenhuang Bioinformatics 1 08-29-2011 03:54 PM
Extract perfectly mapped reads from SAM/BAM file Graham Etherington Bioinformatics 2 07-21-2011 07:27 AM
Maq problem: 0 raw reads mapped pparg Bioinformatics 8 11-17-2009 12:16 AM

Reply
 
Thread Tools
Old 01-10-2012, 08:41 PM   #1
vaibhavvsk
Member
 
Location: Pune

Join Date: Sep 2011
Posts: 14
Default How to extract mapped and unmapped raw reads from bwa's sam file ?

I have aligned paired end with reference sequence using BWA and got sam output file.I tried picard tools samToFastq tool to convert sam into paired end fastq files but it's for just mapped reads.
I want to know how to extract unmapped raw reads in separate raw fastq files from sam file.
__________________
Vaibhav Kulkarni
vaibhavvsk is offline   Reply With Quote
Old 01-10-2012, 09:20 PM   #2
marcowanger
Senior Member
 
Location: Hong Kong

Join Date: Dec 2008
Posts: 350
Default

first, identify the unmapped reads from the BAM file using the FLAG field (0x4). Record the name of the read Format file see: http://samtools.sourceforge.net/SAM1.pdf

Then, read your fastq file, extract the reads matched the names previously recorded
__________________
Marco
marcowanger is offline   Reply With Quote
Old 01-10-2012, 09:24 PM   #3
ulz_peter
Senior Member
 
Location: Graz, Austria

Join Date: Feb 2010
Posts: 219
Default

If you're familiar with scripting, that would be a relatively easy task:
Unmapped reads have an asterisk (*) in the third column, so you filter out the other ones and then extract the read name (first column) and write it into the output file after a @ sign. Then take the sequence (column 10) write it into the next line; then write the name of the read again after a + sign (it might be enough to just write a + sign). And write the qualities (column 11)in the last row.

Have fun programming!!
ulz_peter is offline   Reply With Quote
Old 01-10-2012, 10:05 PM   #4
vaibhavvsk
Member
 
Location: Pune

Join Date: Sep 2011
Posts: 14
Default

Thanks marcowanger & ulz_peter for your guidelines .It can be easily done by shell,Perl or Java.
vaibhavvsk is offline   Reply With Quote
Old 01-11-2012, 04:56 AM   #5
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Quote:
Originally Posted by ulz_peter View Post
If you're familiar with scripting, that would be a relatively easy task:
Unmapped reads have an asterisk (*) in the third column, so you filter out the other ones and then extract the read name (first column) and write it into the output file after a @ sign. Then take the sequence (column 10) write it into the next line; then write the name of the read again after a + sign (it might be enough to just write a + sign). And write the qualities (column 11)in the last row.

Have fun programming!!
That may not do exactly what is required. Some unmapped reads don't have an asterisk in the third column (reference name), specifically for paired ends where only one read maps, the other read it given dummy positions to match (for sorting purposes), but the FLAG says it is unmapped.

i.e. You can only trust the FLAG. One easy way to do this is with the -f and -F arguments to samtools view.
maubp is offline   Reply With Quote
Old 01-11-2012, 10:22 AM   #6
Alex Renwick
Member
 
Location: Houston, Texas

Join Date: Jul 2011
Posts: 44
Default

Quote:
Originally Posted by maubp View Post
...

i.e. You can only trust the FLAG. One easy way to do this is with the -f and -F arguments to samtools view.
Yep, that's the way to do it.

Position 4 indicates an unmapped read. Hence,
Code:
samtools view -F4 sample.bam > sample.mapped.sam
samtools view -f4 sample.bam > sample.unmapped.sam
Alex Renwick is offline   Reply With Quote
Old 01-12-2012, 01:12 AM   #7
lukas1848
Member
 
Location: Germany

Join Date: Jun 2011
Posts: 54
Default

you could also use a simple bash one liner to do that:

grep ^...............4 aln.sam -v > mapped.aln.sam

its not pretty, but it works.
lukas1848 is offline   Reply With Quote
Old 01-12-2012, 10:00 AM   #8
Alex Renwick
Member
 
Location: Houston, Texas

Join Date: Jul 2011
Posts: 44
Default

Quote:
Originally Posted by lukas1848 View Post
you could also use a simple bash one liner to do that:

grep ^...............4 aln.sam -v > mapped.aln.sam

its not pretty, but it works.
Actually, that won't really work. The flag is a bit field, so the values are additive. If the flag value is 4, then the "unmapped" flag is set and the other ten flags are unset. If a read is unmapped and paired, for example, the flag value would be 4 + 1 = 5.

Also, there's the less subtle point that your grep expression requires the flag to be exactly 16 characters from the beginning of the line. That won't always be the case.
Alex Renwick is offline   Reply With Quote
Old 02-20-2012, 03:57 PM   #9
Bgansw
Member
 
Location: Sydney, Australia

Join Date: Nov 2011
Posts: 24
Default extract unmapped reads

Quote:
Originally Posted by Alex Renwick View Post
Yep, that's the way to do it.

Position 4 indicates an unmapped read. Hence,
Code:
samtools view -F4 sample.bam > sample.mapped.sam
samtools view -f4 sample.bam > sample.unmapped.sam
Hello Alex,

I wanted to confirm if I'm interpreting my results correctly.I'm trying to align my sample illumina bacterial genome to an already published reference bacterial genome. The aim is to find out what genes are present in my sample genome, but missing in the reference. The plan of action was to run bwasw, and then pull out the sequence in the sample genome that do not align to the reference, then look at what these sequences are.

I aligned my query genome to the ref using bwasw.

The output was in sam format. I converted that to bam format using

samtools view -bS aln.sam > aln.bam

Then I ran the command lines quoted above.

When I headed my mapped.sam file, it appears to have worked, however, my unmapped.sam file is empty,the size is zero.
This is consistent for two of my sample genomes. It is highly likely that the two sample genomes have no extra genes, when compared to the reference, thus there are no unmapped reads, compared to the reference.

However, is there any other way to check this?

I'd be very grateful for any help/advice on this.

Thanx

bgansw
Bgansw is offline   Reply With Quote
Old 02-20-2012, 08:46 PM   #10
vaibhavvsk
Member
 
Location: Pune

Join Date: Sep 2011
Posts: 14
Default

Hello Alex,

I have also same kind of experience as with Bgansw

samtools view -F4 sample.bam > sample.mapped.sam

samtools view -f4 sample.bam > sample.unmapped.sam

unmapped file is empty.
__________________
Vaibhav Kulkarni
vaibhavvsk is offline   Reply With Quote
Old 02-21-2012, 09:51 AM   #11
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Poking around, I found this thread:

http://seqanswers.com/forums/showthread.php?t=5558

Where someone claims that bwasw doesn't return unmapped reads. You can probably confirm this, by counting reads in your fastq, and reads in the .bam

So, you can process the .sam file and the fastq file with a script, to get the read that are in the fastq, but not in the .sam. Or use an aligner that will output the unaligned reads. Bowtie and bwa aln of course for Illumina reads, but you must have longer reads since you aren't using bwa aln, so I'm not sure what's appropriate for those.
swbarnes2 is offline   Reply With Quote
Old 02-07-2013, 09:01 AM   #12
cfrias
Junior Member
 
Location: barcelona

Join Date: Jan 2011
Posts: 5
Unhappy How to extract multi-mapped reads from bwa's sam file?

Hi, I want to extract multi-mapped reads from sam file generated from bwa (bwsw).
I am working with 454 data and I have used bwa 0.5.9.

The problem is that the flag to indicate that the read is a multiple hit not appear
(256 or NH:i: ),
So, I can not extract the reads that has multiple hits

so I have two question?
## I'ts fine is all flags I can see in the sam file are 0, 4, 16?
## How can I extract the reads that have multiple Hits?

e.g: I have two reads mapping in the same contig

@SQ SN:Contig1276 LN:500
HKU61D301B85SS_nem_043 0 Contig1276 172 27 8S90M1I8M2I18M428S * 0 0 ATCTGTGGAAAAATCATCACCACTGGGGAGTGGAGGAAAGGTAAGATTGCTGAATGGAAGACGAGGTCTCTGCCATTGTCCGATGGCCACATCACCTGCGAAGAATATCAGTTTGAATGTAAAATAGTTTATTGTAAACCTCTGGTTCTCGGATTTTCTTGGTTTCCCTACCATATTAAATGATGGGAAACCAAGAAAATACCAAGAAACGAAGGTTTACAACAAACTGCTTGCAAAAGACGATTTCAGGGTTTTCTGTTTTGCCTTGCAATCACCCCTGTAAGTTCTCATTTATATTTGAATGCTTCACATATTCTTTGTGCATTTTTGTTATATTTTGTTAAATGAGCATGTTTTTTTCCCCCATGCAGAATATGACCAGCAATAACGTTATATGGAATGAGAGTAAGGAGAAAGCAGAAGGAAATCAGTTATCGTCATGGAATGTGGACATTTTTGTGCAAGTTGTGCGAGAAAATGGTGAGATTTTTCACATTATTATTCAGCTTATTTATTAGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGTGGT IIIIIIDD62222DFIIEA??E@A10//42511200115;5400005:;;;::78444457=;<@;888<<<=ACCECDBDEFDDFF@>55<<@330001426?@????79799B@77333990/3----3.44335577:?;;??@<<7999@;;;?@@@>>>:11114/77@@@@A@666??AAB>A3333389>8886/------,1181359?BDDDFFEFIIFFEDDDDFB;;;;;EEEEFB@@@@@?<33325535///58;<;<551225558;;:9;788/- AS:i:76 XS:i:0 XF:i:3 XE:i:1 XN:i:0
HKU61D301B85SS_nem_043 0 Contig1276 400 19 204S30M1D46M275S * 0 0 ATCTGTGGAAAAATCATCACCACTGGGGAGTGGAGGAAAGGTAAGATTGCTGAATGGAAGACGAGGTCTCTGCCATTGTCCGATGGCCACATCACCTGCGAAGAATATCAGTTTGAATGTAAAATAGTTTATTGTAAACCTCTGGTTCTCGGATTTTCTTGGTTTCCCTACCATATTAAATGATGGGAAACCAAGAAAATACCAAGAAACGAAGGTTTACAACAAACTGCTTGCAAAAGACGATTTCAGGGTTTTCTGTTTTGCCTTGCAATCACCCCTGTAAGTTCTCATTTATATTTGAATGCTTCACATATTCTTTGTGCATTTTTGTTATATTTTGTTAAATGAGCATGTTTTTTTCCCCCATGCAGAATATGACCAGCAATAACGTTATATGGAATGAGAGTAAGGAGAAAGCAGAAGGAAATCAGTTATCGTCATGGAATGTGGACATTTTTGTGCAAGTTGTGCGAGAAAATGGTGAGATTTTTCACATTATTATTCAGCTTATTTATTAGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGTGGT IIIIIIDD62222DFIIEA??E@A10//42511200115;5400005:;;;::78444457=;<@;888<<<=ACCECDBDEFDDFF@>55<<@330001426?@????79799B@77333990/3----3.44335577:?;;??@<<7999@;;;?@@@>>>:11114/77@@@@A@666??AAB>A3333389>8886/------,1181359?BDDDFFEFIIFFEDDDDFB;;;;;EEEEFB@@@@@?<33325535///58;<;<551225558;;:9;788/- AS:i:57 XS:i:0 XF:i:3 XE:i:1 XN:i:0

I think that I can filt it using a Perl script that checking the MAPQ(5 column)?
(In that case the higher is 27)
It is correct?

Thanks in advance,

Cris
cfrias is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 03:22 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO