SEQanswers

Old 05-02-2013, 10:54 AM   #1
carmeyeii
Senior Member
 
Location: Mexico

Join Date: Mar 2011
Posts: 137
Default Splitting a BAM file by every x number of reads, not lines.

Hi everyone!

I am struggling to annotate a very big .bam file that was mapped with TopHat. The run produced a large number of reads: ~200M. The problem is that when I try to annotate each read against a GFF file (with BEDTools intersectBed), the resulting BED file is huge: over 1.7 TB! I have tried running it on a very large server at the institution, but it still runs out of disk space. The IT dept increased the $TMPDIR local disk space to 1.5 TB so I could run everything on $TMPDIR, but it is still not enough.

What I think I should do is split this .BAM file into several files, maybe 15, so that each set of reads gets annotated separately on a different node. That way, I would not run out of disk space. And when all the files are annotated, I can execute groupBy on each, and then simply sum the number of reads that each feature in the GFF received across all the files.

However, there is a slight complication: after the annotation with intersectBed, my script counts the number of times each read mapped (all the different features it mapped to) and divides each read by that count. I.e., if a read mapped to 2 regions, each instance of the read is worth 1/2, so it contributes only 1/2 a read to each of the features it mapped to.
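For what it's worth, the weighting step described here can be sketched as a two-pass awk script. The tab-separated read/feature file and its two-column layout are made up for illustration; real intersectBed output has many more columns.

```shell
# Hypothetical input: one "read_id<TAB>feature" line per annotated alignment.
printf 'read1\tgeneA\nread1\tgeneB\nread2\tgeneA\n' > hits.txt

# Pass 1 counts alignments per read; pass 2 emits weight = 1/count.
awk -F'\t' '
    NR == FNR { n[$1]++; next }                       # first pass over the file
    { printf "%s\t%s\t%.4f\n", $1, $2, 1 / n[$1] }    # second pass: weight each hit
' hits.txt hits.txt > weighted.txt

cat weighted.txt
# read1  geneA  0.5000
# read1  geneB  0.5000
# read2  geneA  1.0000
```

Summing column 3 per feature afterwards (e.g. with groupBy) would then give the fractional counts described above.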

Because of this, I need all the alignments from the .BAM file that belong to a given read to be contained in one single file. That is to say, I can't simply split the BAM file into 15 files, because with bad luck I could end up with two BAM files that each hold some of a single read's alignments, which would make the division incorrect.

How can I instruct UNIX to count a certain number of unique reads in the BAM file, output all of their alignments to a new file, and continue with the rest of the BAM file, so that every read has all of its n alignments contained in one single file (shared with other reads)?

Thank you!
Old 05-02-2013, 12:01 PM   #2
carmeyeii
Senior Member
 
Location: Mexico

Join Date: Mar 2011
Posts: 137
Default

I'll try to make myself more clear.

I need to split the huge BAM file into, say, 10 files; the only constraint is that if one read's alignments are already in file A, all of its other alignments have to be in that file as well.

I don't discard multiple alignments, because I work with transposons and repeat sequences, so it's important to have all of a read's alignments in the same sub-file, so I can appropriately count the number of annotated alignments and divide each one by the total number for that read.

I run the mapped .bam file through intersectBed, which assigns a feature from a GFF file to each mapped read, i.e., which gene (based on that gene's coordinates) each read corresponds to, given the coordinates assigned by TopHat.
Old 05-02-2013, 12:30 PM   #3
jparsons
Member
 
Location: SF Bay Area

Join Date: Feb 2012
Posts: 62
Default

I'd probably do a presumably inefficient thing and convert back to SAM after sorting by read name, then just split every N/10 lines using head/tail, and finally do a manual check in the 10 cases to re-join any reads which I may have accidentally split. Then I could convert back to BAM.

Like all of my solutions, it's lacking in elegance and might take some time, but it would work unless the .sam is also too large for your disk.
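Sketched in shell on synthetic name-sorted data, the split-then-repair idea above might look like the following (on the real file the first step would be something along the lines of `samtools sort -n in.bam | samtools view -`, which is assumed rather than run here):

```shell
# Synthetic name-sorted SAM body: read r2 will straddle the naive split point.
printf 'r1\ta\nr1\tb\nr2\ta\nr2\tb\nr3\ta\n' > body.sam

split -l 3 body.sam part_        # naive fixed-size split: part_aa, part_ab

# Repair the boundary: while the first read of part_ab matches the last read
# of part_aa, move that line back so the read's alignments stay together.
while [ -s part_ab ] &&
      [ "$(tail -n 1 part_aa | cut -f1)" = "$(head -n 1 part_ab | cut -f1)" ]; do
    head -n 1 part_ab >> part_aa
    tail -n +2 part_ab > tmp && mv tmp part_ab
done
```

After the loop, part_aa holds all four r1/r2 lines and part_ab holds only r3, so no read is split across files.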
Old 05-03-2013, 09:04 AM   #4
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Going along with jparsons, the initial step is to do a 'samtools sort -n' in order to get the file sorted by read name. After that ... hum ... I don't know of a built-in program to split a file by its first column, nor can I think of a combination of existing tools. I suspect you are going to have to go with a custom script; Perl, Python, sed, or awk should all work. It would probably take 3 minutes to write and 10 minutes to debug.
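A minimal sketch of such a custom script in awk, run here on made-up name-sorted data (the quota and file names are illustrative): it only rotates to a new part file when the line quota is reached and the read name changes, so no read is ever split across parts.

```shell
# Made-up name-sorted SAM body (column 1 is the read name).
printf 'r1\tx\nr1\ty\nr2\tx\nr3\tx\nr3\ty\n' > sorted.sam

awk -v quota=2 '
    BEGIN { part = 0 }
    # Start a new part only at a read-name boundary once the quota is met.
    $1 != prev && count >= quota { part++; count = 0 }
    { print > ("part_" part ".sam"); count++; prev = $1 }
' sorted.sam
# part_0.sam holds both r1 lines; part_1.sam holds r2 and both r3 lines.
```

On a real file, quota would be roughly (total lines / desired number of parts).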
Old 05-03-2013, 12:28 PM   #5
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Split the .bam by tile using awk or whatever. That way, each read will only be in one .bam, with all its hits.
Old 05-03-2013, 12:46 PM   #6
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

@swbarnes. An interesting idea (splitting by tile) and quick to implement; however, that method won't yield exactly 10 files. That may not be important, and one could always combine files to get to the required number with just a little bit of extra work. I like it!
Old 05-03-2013, 09:09 PM   #7
carmeyeii
Senior Member
 
Location: Mexico

Join Date: Mar 2011
Posts: 137
Default

Thanks, everyone for your suggestions.

I like the idea of splitting by tile, too! So clever.

So, if the alignments begin with something like this, I should find a way to use awk to split the [converted] .SAM file by the tile ID, which is the 5th colon-delimited part of the read name.

Code:
HWI-ST975:104:C0W47ACXX:8:1101:8269:91631
However, to use awk, is there a way to get around the fact that I want to separate by the 5th subfield of the read name, using ":" as the delimiter, when the rest of the alignment is delimited by tabs?

Or is it mandatory to first cut out the whole read ID field, and obtain a list of the tile numbers?

Thanks again,
Carmen
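One hedged sketch of an answer to the delimiter question above: awk does not need the whole file re-delimited. It can keep tabs as the field separator and additionally break field 1 on ":" with split(), so no pre-cutting of the read ID is needed. The read-name layout is taken from the example in the post; the extra columns are synthetic.

```shell
# One synthetic alignment line; field 1 is the read name from the post.
printf 'HWI-ST975:104:C0W47ACXX:8:1101:8269:91631\t99\tchr1\n' > aln.sam

awk -F'\t' '{
    split($1, id, ":")                 # id[5] is the tile number (1101 here)
    print > ("tile_" id[5] ".sam")     # one output file per tile
}' aln.sam
```

This writes the example line to tile_1101.sam without ever isolating the read-ID column in a separate step.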
Old 05-06-2013, 08:11 AM   #8
carmeyeii
Senior Member
 
Location: Mexico

Join Date: Mar 2011
Posts: 137
Default

I think I may have a Perl solution to this, but I don't know exactly how to phrase the output. Can anybody help me out?

I have made a hash of hashes, where all the lines of a file are sorted into a key of the "master" hash depending on the value of their 5th field.

`%Tiles` has n keys, where each key is a different `$Tile_Number`.

Each `$Tile_Number` points to an inner hash that contains all lines whose tile number matches that key. The value of each of these inner keys (lines) is just `1`.

`$Tiles{$Tile_Number}{$Line} = 1`, where `$Tiles{$Tile_Number}` has many `$Line => 1` entries.

I want to print each `$Tiles{$Tile_Number}` hash to a separate file, preferably creating the file upon the creation of the `$Tile_Number` key, and printing as each new `$Tiles{$Tile_Number}{$Line} = 1` is added, to save memory. The best would be to not print the final value (1), but I can do away with this, I guess..

How can I tell perl to open a new file for each key in the "master" hash and print all of its keys?

Thank you,
Carmen
Old 05-06-2013, 09:26 AM   #9
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Quote:
Originally Posted by carmeyeii View Post

I have made a hash of hashes, where all the lines of a file are sorted into a key of the "master" hash depending on the value of their 5th field.
You are storing all the lines of the gigantic .bam in memory? That doesn't seem wise. I think you should print them out as you process them.
Old 05-06-2013, 10:08 AM   #10
carmeyeii
Senior Member
 
Location: Mexico

Join Date: Mar 2011
Posts: 137
Default

Quote:
Originally Posted by swbarnes2 View Post
You are storing all the lines of the gigantic .bam in memory? That doesn't seem wise. I think you should print them out as you process them.
No, as I suggested above,

" I want to print each `$Tiles{$Tile_Number}` hash in a separate file, preferably, creating the file upon the creation of the `$Tile_Number` key, and printing as each new `$Tiles{$Tile_Number}{$Line}=1` is added, to save memory. "

Carmen
Old 05-06-2013, 12:02 PM   #11
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

As swbarnes said, if you are indeed doing
Quote:
I have made a hash of hashes, where all the lines of a file are sorted into a key of the "master" hash depending on the value of their 5th field.
Then you are indeed holding the entire input BAM file in memory. If you post your Perl code, we can see exactly what you are doing and suggest specific improvements. In the meantime, as swbarnes suggested, reading the BAM file one line at a time and printing each line to the current output file would be best. Depending on how your BAM file is sorted you may need to keep a hash of file pointers, but that should be a very small hash.
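The "hash of file pointers" idea can be sketched in awk rather than Perl (synthetic three-part read IDs here; real Illumina names would use the 5th subfield): awk keeps one handle open per output file name, and each line is written out as it is read, so only the small handle table, not the data, lives in memory.

```shell
# Synthetic input with colon-delimited IDs; the tile is the 3rd part here.
printf 'A:1:t1\tx\nA:1:t2\ty\nA:1:t1\tz\n' > in.sam

awk -F'\t' '{
    split($1, id, ":")
    print > ("tile_" id[3] ".sam")   # awk reuses the already-open handle
}' in.sam
# tile_t1.sam gets two lines, tile_t2.sam gets one.
```

A Perl version would do the same thing with a hash mapping each tile number to an open filehandle, printing each line immediately instead of accumulating a hash of hashes.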