SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Samtools "is recognized as '*'" "truncated file" error axiom7 Bioinformatics 3 11-26-2014 02:53 AM
The position file formats ".clocs" and "_pos.txt"? Ist there any difference? elgor Illumina/Solexa 0 06-27-2011 07:55 AM
Haplotype and "random" chromosomes popto Bioinformatics 7 02-24-2011 11:55 AM
"Systems biology and administration" & "Genome generation: no engineering allowed" seb567 Bioinformatics 0 05-25-2010 12:19 PM
Epi Review: "Epigenetics meets next-generation sequencing" (OPEN ACCESS!) ECO Epigenetics 0 12-23-2008 06:35 PM

Reply
 
Thread Tools
Old 12-20-2011, 10:53 AM   #1
cycomatto
Junior Member
 
Location: RI

Join Date: Dec 2011
Posts: 7
Default samtools: "sort by name" and fast random access

I understand that sorting a BAM file, e.g.
Code:
samtools sort my_file.bam my_file.sorted
will allow for faster random access. This makes sense, as viewing reads aligned to a certain region of the genome is much simpler if the reads are ordered by genome coordinates.

But if you sort a .bam file by read name, e.g.
Code:
samtools sort -no my_file.bam my_file.sorted_by_name
then how does this allow faster random access?

I ask because there is no samtools functionality for observing reads by name - the "view" family of commands only accepts genome regions (coordinates).

How would you random access a .bam file by read name?

Last edited by cycomatto; 12-21-2011 at 06:44 AM.
cycomatto is offline   Reply With Quote
Old 12-20-2011, 02:06 PM   #2
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

SAM/BAM files can be sorted by mapping position (normal), which allows efficient access to regions (e.g. chr1 from 1000 to 2000) via the BAI index, or sorted by read name (not commonly used).

If you want random access to reads by name, the BAM index file (BAI) is no help at all. You would need a separate index mapping read names to offsets. Sorting by read name would be helpful if you have multi-fragment reads (e.g. paired end reads), since then all the fragments would be together on disk so you'd only need one seek. If the reads were not sorted, you'd need to have a lookup table from read name to multiple offsets (one for each read fragment).

For BAM files you could use the same "virtual offset" approach as BAI uses (essentially a double offset, for a BGZF/GZIP block start, and the within block offset). You might find this blog post interesting about how indexing BGZF files (such as BAM files) can be done:
http://blastedbio.blogspot.com/2011/...tter-gzip.html
and http://seqanswers.com/forums/showthread.php?t=15347
maubp is offline   Reply With Quote
Old 12-21-2011, 06:47 AM   #3
cycomatto
Junior Member
 
Location: RI

Join Date: Dec 2011
Posts: 7
Default

Quote:
Originally Posted by maubp View Post
Sorting by read name would be helpful if you have multi-fragment reads (e.g. paired end reads), since then all the fragments would be together on disk so you'd only need one seek.
This is precisely my situation (given one side of a paired-end read, find its mate).


I am unclear why the following won't work: FIRST sorting by name, THEN index this name-sorted .bam.

Doesn't the "index" operation perform this offset for you? So if our .bam is already sorted by name, the the "index" operation will create the virtual offsets you wrote about?
cycomatto is offline   Reply With Quote
Old 12-21-2011, 07:09 AM   #4
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Quote:
Originally Posted by cycomatto View Post
Doesn't the "index" operation perform this offset for you? So if our .bam is already sorted by name, the the "index" operation will create the virtual offsets you wrote about?
The indexing provided by samtools/picard using the *.bai file ONLY works on BAM files sorted by mapping position.

I was talking about how as a programmer you can use your own alternative indexing, where you map read names to BAM/BGZF 'virtual offsets'.
maubp is offline   Reply With Quote
Old 12-22-2011, 04:23 AM   #5
cycomatto
Junior Member
 
Location: RI

Join Date: Dec 2011
Posts: 7
Default

So then, what is the point of sorting a .bam file by name? Why even include the option in samtools?
cycomatto is offline   Reply With Quote
Old 12-22-2011, 05:37 AM   #6
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,177
Default

Quote:
Originally Posted by cycomatto View Post
So then, what is the point of sorting a .bam file by name? Why even include the option in samtools?
If you want to process the entire file sequentially and you need to operate on read pairs as a single unit having reads sorted by name (and hence read pairs will be adjacent in the file) greatly reduces complexity. An example of a program which works this way is htseq-count.
kmcarr is offline   Reply With Quote
Old 01-21-2013, 05:17 AM   #7
ajay87
Junior Member
 
Location: India

Join Date: Dec 2012
Posts: 4
Default

htseq-count also requires the file to be sorted
ajay87 is offline   Reply With Quote
Old 01-22-2013, 03:42 AM   #8
xied75
Senior Member
 
Location: Oxford

Join Date: Feb 2012
Posts: 129
Default

Quote:
Originally Posted by cycomatto View Post
So then, what is the point of sorting a .bam file by name? Why even include the option in samtools?
samtools fixmate
xied75 is offline   Reply With Quote
Reply

Tags
random access, read name, samtools, sort

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 10:59 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO