  • Splitting a BAM file by every x number of reads, not lines.

    Hi everyone!

    I am struggling with annotating a very big .bam file that was mapped with TopHat. The run produced a large number of reads: ~200M. The problem is that when I try to annotate each read against a GFF file (with BEDTools intersectBed), the BED file that is produced is huge: over 1.7 TB! I have tried running it on a very large server at the institution, but it still runs out of disk space. The IT dept increased the local $TMPDIR disk space to 1.5 TB so I could run everything there, but it is still not enough.

    What I think I should do is split this .bam file into several files, maybe 15, so that each set of reads gets annotated separately on a different node. That way, I would not run out of disk space. And when all the files are annotated, I can execute groupBy on each, and then simply sum the number of reads that each feature in the GFF got across all the files.

    However, there is a slight complication to this: after the annotation with intersectBed, my script counts the number of times a read mapped (all the different features it mapped to) and divides each read by the number of times it mapped. I.e., if a read mapped to 2 regions, each instance of the read is worth 1/2, such that it contributes only 1/2 a read to each of the features it mapped to.
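    For concreteness, here is a toy sketch of that weighting step in awk (the input format and file names are made up for illustration; one "read_id feature" pair per intersectBed hit):

```shell
# Toy input: read1 hit two features, read2 hit one.
cat > hits.txt <<'EOF'
read1 geneA
read1 geneB
read2 geneA
EOF

# Pass 1 counts alignments per read; pass 2 credits each feature 1/count.
awk '
    NR == FNR { n[$1]++; next }      # pass 1: alignments per read
    { weight[$2] += 1 / n[$1] }      # pass 2: fractional credit
    END { for (f in weight) printf "%s %g\n", f, weight[f] }
' hits.txt hits.txt | sort
```

    Here read1 contributes 0.5 to each of geneA and geneB, and read2 contributes a full count to geneA, so geneA ends at 1.5 and geneB at 0.5. Note the two-pass trick only works if all of a read's hits are in the same file, which is exactly the constraint below.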

    Because of this, I need all the alignments from the .bam file that belong to each read to be contained in one single file. That is to say, I can't simply split the BAM file into 15 files blindly, because with bad luck I could end up with two BAM files that have the alignments of a single read split between them, making the division incorrect.

    How can I instruct UNIX to count a certain number of unique reads in the BAM file, output all of their alignments to a new file, and continue with the rest of the BAM file, such that every read has all n of its alignments contained in one single file (shared with other reads)?

    Thank you!

  • #2
    I'll try to make myself more clear.

    I need to split the huge BAM file into say 10 files, but the only constraint is that if one read's alignment is already in file a, all of its other alignments have to be in that file as well.

    I don't discard multiple alignments, because I work with transposons and repeat sequences, so it's important to have all of a read's alignments in the same sub-file, so I can appropriately count the number of annotated alignments and divide each one by the total number for that read.

    I run the mapped .bam file through intersectBed, which assigns a feature from a GFF file to each mapped read, i.e., which gene (based on that gene's coordinates) the read corresponds to, given the coordinates reported by TopHat.



    • #3
      I'd probably do a presumably inefficient thing and convert back to SAM after sorting by read name, then just split every N/10 lines using head/tail, and finally do a manual check in the 10 cases to re-join any reads which I may have accidentally split. Then I could convert back to BAM.

      Like all of my solutions, it's lacking in elegance and might take some time, but it would work unless the .sam is also too large for your disk.
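      A rough sketch of automating that, so the manual re-join step disappears: after a name sort, an awk pass can rotate to a new output file every N lines, but delay the switch until the read name (QNAME, column 1) actually changes. Chunk size, file names, and the toy input are made up; in the real pipeline the input would come from `samtools sort -n` piped through `samtools view` (not run here).

```shell
# Toy name-sorted SAM body (no header); col 1 is QNAME.
cat > sorted.sam <<'EOF'
readA	0	chr1	100	...
readA	256	chr2	200	...
readB	0	chr1	300	...
readC	0	chr3	400	...
readC	256	chr4	500	...
EOF

# Rotate to a new chunk every N lines, but only once QNAME changes,
# so a read's alignments never straddle two chunks.
awk -v N=2 '
    BEGIN { chunk = 1 }
    $1 != prev && count >= N { chunk++; count = 0 }   # safe to rotate here
    { print > ("chunk" chunk ".sam"); count++; prev = $1 }
' sorted.sam

wc -l chunk*.sam
```

      With N=2 on this input, chunk1.sam gets both readA lines plus readB, and both readC alignments land together in chunk2.sam rather than being split at the naive 2-line boundary.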



      • #4
        Going along with jparsons, the initial step is to do a 'samtools sort -n' in order to get the file sorted by read name. After that ... hmm ... I don't know of a built-in program to split a file by the first column, nor can I think of a combination of standard tools that would do it. I suspect you are going to have to go with a custom program; Perl, Python, sed, or awk should all work. It would probably take 3 minutes to write and 10 minutes to debug.



        • #5
          Split the .bam by tile using awk or whatever. That way, each read will only be in one .bam, with all its hits.



          • #6
            @swbarnes: An interesting idea (splitting by tile) and quick to implement; however, that method won't yield exactly 10 files. That may not be important -- and one could always combine files to get to the required number with just a little bit of extra work. I like it!



            • #7
              Thanks, everyone for your suggestions.

              I like the idea of splitting by tile, too! So clever.

              So, if the alignments begin with something like this, I should find a way to use awk to split the [converted] .SAM file by the tile ID (1101 here, i.e. the fifth colon-separated field of the read name).

              Code:
              HWI-ST975:104:C0W47ACXX:8:1101:8269:91631
              However, to use awk, is there a way to get around the fact that I want to separate by the 5th colon-delimited subfield of the read name, when the fields of the alignment itself are delimited by tabs?

              Or is it mandatory to first cut out the whole read ID field, and obtain a list of the tile numbers?

              Thanks again,
              Carmen
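              For reference, a minimal sketch of one way around the delimiter mismatch (toy data, invented file names): keep tab as the record separator and use awk's split() to re-parse just the first field on ":", so nothing else has to be cut out first.

```shell
# Toy SAM body; QNAME is col 1, tile is the 5th colon-separated
# piece of the QNAME (e.g. 1101 below).
cat > body.sam <<'EOF'
HWI-ST975:104:C0W47ACXX:8:1101:8269:91631	0	chr1	100
HWI-ST975:104:C0W47ACXX:8:1101:9000:12345	0	chr2	200
HWI-ST975:104:C0W47ACXX:8:2203:1111:22222	0	chr1	300
EOF

# split() re-parses only $1 on ":"; the line itself stays tab-delimited.
awk -F'\t' '
    { split($1, id, ":"); print > ("tile_" id[5] ".sam") }
' body.sam

ls tile_*.sam
```

              This writes the first two lines to tile_1101.sam and the third to tile_2203.sam, one output file per tile, created on first use.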



              • #8
                I think I may have a Perl solution to this, but I don't know the exact way to phrase the output. Can anybody help me out?

                I have made a hash of hashes, where all the lines of a file are sorted into a key of the "master" hash depending on the value of their 5th field.

                `%Tiles` has n keys, where each key is a different `$Tile_Number`.

                Each `$Tile_Number` key opens a new hash that contains all lines whose tile number matches that key. The value of each of these inner keys (the lines) is just `1`.

                `$Tiles{$Tile_Number}{$Line} = 1`, where `$Tiles{$Tile_Number}` has many `$Line => 1` entries.

                I want to print each `$Tiles{$Tile_Number}` hash to a separate file, preferably creating the file upon the creation of the `$Tile_Number` key, and printing as each new `$Tiles{$Tile_Number}{$Line}=1` is added, to save memory. The best would be to not print the final value (1), but I can do away with that, I guess.

                How can I tell perl to open a new file for each key in the "master" hash and print all of its keys?

                Thank you,
                Carmen



                • #9
                  Originally posted by carmeyeii View Post

                  I have made a hash of hashes, where all the lines of a file are sorted into a key of the "master" hash depending on the value of their 5th field.
                  You are storing all the lines of the gigantic .bam in memory? That doesn't seem wise. I think you should print them out as you process them.



                  • #10
                    Originally posted by swbarnes2 View Post
                    You are storing all the lines of the gigantic .bam in memory? That doesn't seem wise. I think you should print them out as you process them.
                    No, as I suggested above,

                    " I want to print each `$Tiles{$Tile_Number}` hash in a separate file, preferably, creating the file upon the creation of the `$Tile_Number` key, and printing as each new `$Tiles{$Tile_Number}{$Line}=1` is added, to save memory. "

                    Carmen



                    • #11
                      As swbarnes said, if you are indeed doing
                      I have made a hash of hashes, where all the lines of a file are sorted into a key of the "master" hash depending on the value of their 5th field.
                      Then you are indeed saving the entire input BAM file in memory. If you post your Perl code, we can see exactly what you are doing and suggest specific improvements. In the meantime, as swbarnes suggested, reading the BAM file one line at a time and printing each line to the current output file would be best. Depending on how your BAM file is sorted, you may need to keep a hash of file pointers, but that should be a very small hash.
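                      In awk (rather than Perl) that streaming pattern is nearly free, since `print >> file` keeps one handle per output file; the only wrinkle at scale is the per-process open-file limit, which close() works around. A sketch with invented names, holding nothing in memory beyond the current line:

```shell
rm -f by_tile_*.sam   # start clean, since >> appends

# Toy input, deliberately not tile-sorted.
cat > reads.sam <<'EOF'
HWI-ST975:104:C0W47ACXX:8:1101:8269:91631	...
HWI-ST975:104:C0W47ACXX:8:2203:1111:22222	...
HWI-ST975:104:C0W47ACXX:8:1101:9000:12345	...
EOF

awk '
    {
        split($1, id, ":")
        out = "by_tile_" id[5] ".sam"
        print >> out                 # appended as lines stream past
        close(out)                   # stay under the open-file limit
    }
' reads.sam
```

                      Closing after every line is overkill; since the input isn't tile-sorted, append mode is what keeps earlier lines intact when a file is reopened. With few enough tiles one could skip close() entirely and let awk hold all the handles open.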
