  • Splitting a BAM file by every x number of reads, not lines.

    Hi everyone!

    I am struggling with annotating a very big .bam file that was mapped with TopHat. The run produced a large number of reads: ~200M. The problem is that when I try to annotate each read against a GFF file (with BEDTools intersectBed), the BED file that is produced is huge: over 1.7 TB! I have tried running it on a very large server at the institution, but it still runs out of disk space. The IT dept increased the local $TMPDIR disk space to 1.5 TB so I could run everything there, but it is still not enough.

    What I think I should do is split this .bam file into several files, maybe 15, so that each set of reads gets annotated separately on a different node. That way, I would not run out of disk space. And when all the files are annotated, I can execute groupBy on each, and then simply sum the number of reads that each feature in the GFF got across all the files.

    However, there is a slight complication to this: after the annotation with intersectBed, my script counts the number of times a read mapped (all the different features it mapped to) and divides each read by the number of times it mapped. I.e., if a read mapped to 2 regions, each instance of the read is worth 1/2, such that it contributes only 1/2 a read to each of the features it mapped to.
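    For concreteness, here is a toy sketch of that weighting step in awk (the input format and file names are made up for illustration; one "read_id feature" pair per intersectBed hit):

```shell
# Toy input: read1 hit two features, read2 hit one.
cat > hits.txt <<'EOF'
read1 geneA
read1 geneB
read2 geneA
EOF

# Pass 1 counts alignments per read; pass 2 credits each feature 1/count.
awk '
    NR == FNR { n[$1]++; next }      # pass 1: alignments per read
    { weight[$2] += 1 / n[$1] }      # pass 2: fractional credit
    END { for (f in weight) printf "%s %g\n", f, weight[f] }
' hits.txt hits.txt | sort
```

    Here read1 contributes 0.5 to each of geneA and geneB, and read2 contributes a full count to geneA, so geneA ends at 1.5 and geneB at 0.5. Note the two-pass trick only works if all of a read's hits are in the same file, which is exactly the constraint below.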

    Because of this, I need all the alignments from the .bam file that belong to each read to be contained in one single file. That is to say, I can't simply split the BAM file into 15 files blindly, because with bad luck I could end up with two BAM files that have the alignments of a single read split between them, making the division incorrect.

    How can I instruct UNIX to count a certain number of unique reads in the BAM file, output all of their alignments to a new file, and continue with the rest of the BAM file, such that every read has all n of its alignments contained in one single file (shared with other reads)?

    Thank you!

  • #2
    I'll try to make myself more clear.

    I need to split the huge BAM file into say 10 files, but the only constraint is that if one read's alignment is already in file a, all of its other alignments have to be in that file as well.

    I don't discard multiple alignments, because I work with transposons and repeat sequences, so it's important to have all of a read's alignments in the same sub-file, so I can appropriately count the number of annotated alignments and divide each one by the total number for that read.

    I run the mapped .bam file through intersectBed, which assigns a feature from a GFF file to each mapped read, i.e., which gene (based on that gene's coordinates) the read corresponds to, given the coordinates reported by TopHat.



    • #3
      I'd probably do a presumably inefficient thing and convert back to SAM after sorting by read name, then just split every N/10 lines using head/tail, and finally do a manual check in the 10 cases to re-join any reads which I may have accidentally split. Then I could convert back to BAM.

      Like all of my solutions, it's lacking in elegance and might take some time, but it would work unless the .sam is also too large for your disk.
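      A rough sketch of automating that, so the manual re-join step disappears: after a name sort, an awk pass can rotate to a new output file every N lines, but delay the switch until the read name (QNAME, column 1) actually changes. Chunk size, file names, and the toy input are made up; in the real pipeline the input would come from `samtools sort -n` piped through `samtools view` (not run here).

```shell
# Toy name-sorted SAM body (no header); col 1 is QNAME.
cat > sorted.sam <<'EOF'
readA	0	chr1	100	...
readA	256	chr2	200	...
readB	0	chr1	300	...
readC	0	chr3	400	...
readC	256	chr4	500	...
EOF

# Rotate to a new chunk every N lines, but only once QNAME changes,
# so a read's alignments never straddle two chunks.
awk -v N=2 '
    BEGIN { chunk = 1 }
    $1 != prev && count >= N { chunk++; count = 0 }   # safe to rotate here
    { print > ("chunk" chunk ".sam"); count++; prev = $1 }
' sorted.sam

wc -l chunk*.sam
```

      With N=2 on this input, chunk1.sam gets both readA lines plus readB, and both readC alignments land together in chunk2.sam rather than being split at the naive 2-line boundary.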



      • #4
        Going along with jparsons, the initial step is to do a 'samtools sort -n' in order to get the file sorted by read name. After that ... hmm ... I don't know of a built-in program to split a file by the first column, nor can I think of a combination of standard tools that would do it. I suspect you are going to have to go with a custom program; Perl, Python, sed, or awk should all work. It would probably take 3 minutes to write and 10 minutes to debug.



        • #5
          Split the .bam by tile using awk or whatever. That way, each read will only be in one .bam, with all its hits.



          • #6
            @swbarnes: An interesting idea (splitting by tile) and quick to implement; however, that method won't yield exactly 10 files. That may not be important -- and one could always combine files to get to the required number with just a little bit of extra work. I like it!



            • #7
              Thanks, everyone for your suggestions.

              I like the idea of splitting by tile, too! So clever.

              So, if the alignments begin with something like this, I should find a way to use awk to split the [converted] .SAM file by the tile ID (1101 here, i.e. the fifth colon-separated field of the read name).

              Code:
              HWI-ST975:104:C0W47ACXX:8:1101:8269:91631
              However, to use awk, is there a way to get around the fact that I want to separate by the 5th colon-delimited subfield of the read name, when the fields of the alignment itself are delimited by tabs?

              Or is it mandatory to first cut out the whole read ID field, and obtain a list of the tile numbers?

              Thanks again,
              Carmen
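              For reference, a minimal sketch of one way around the delimiter mismatch (toy data, invented file names): keep tab as the record separator and use awk's split() to re-parse just the first field on ":", so nothing else has to be cut out first.

```shell
# Toy SAM body; QNAME is col 1, tile is the 5th colon-separated
# piece of the QNAME (e.g. 1101 below).
cat > body.sam <<'EOF'
HWI-ST975:104:C0W47ACXX:8:1101:8269:91631	0	chr1	100
HWI-ST975:104:C0W47ACXX:8:1101:9000:12345	0	chr2	200
HWI-ST975:104:C0W47ACXX:8:2203:1111:22222	0	chr1	300
EOF

# split() re-parses only $1 on ":"; the line itself stays tab-delimited.
awk -F'\t' '
    { split($1, id, ":"); print > ("tile_" id[5] ".sam") }
' body.sam

ls tile_*.sam
```

              This writes the first two lines to tile_1101.sam and the third to tile_2203.sam, one output file per tile, created on first use.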



              • #8
                I think I may have a Perl solution to this, but I don't know the exact way to phrase the output. Can anybody help me out?

                I have made a hash of hashes, where all the lines of a file are sorted into a key of the "master" hash depending on the value of their 5th field.

                `%Tiles` has n keys, where each key is a different `$Tile_Number`.

                Each `$Tile_Number` key opens a new hash that contains all lines whose tile number matches that key. The value of each of these inner keys (the lines) is just `1`.

                `$Tiles{$Tile_Number}{$Line} = 1`, where `$Tiles{$Tile_Number}` has many `$Line => 1` entries.

                I want to print each `$Tiles{$Tile_Number}` hash to a separate file, preferably creating the file upon the creation of the `$Tile_Number` key, and printing as each new `$Tiles{$Tile_Number}{$Line}=1` is added, to save memory. The best would be to not print the final value (1), but I can do away with that, I guess.

                How can I tell perl to open a new file for each key in the "master" hash and print all of its keys?

                Thank you,
                Carmen



                • #9
                  Originally posted by carmeyeii View Post

                  I have made a hash of hashes, where all the lines of a file are sorted into a key of the "master" hash depending on the value of their 5th field.
                  You are storing all the lines of the gigantic .bam in memory? That doesn't seem wise. I think you should print them out as you process them.



                  • #10
                    Originally posted by swbarnes2 View Post
                    You are storing all the lines of the gigantic .bam in memory? That doesn't seem wise. I think you should print them out as you process them.
                    No, as I suggested above,

                    " I want to print each `$Tiles{$Tile_Number}` hash in a separate file, preferably, creating the file upon the creation of the `$Tile_Number` key, and printing as each new `$Tiles{$Tile_Number}{$Line}=1` is added, to save memory. "

                    Carmen



                    • #11
                      As swbarnes said, if you are indeed doing
                      I have made a hash of hashes, where all the lines of a file are sorted into a key of the "master" hash depending on the value of their 5th field.
                      Then you are indeed saving the entire input BAM file in memory. If you post your Perl code, we can see exactly what you are doing and suggest specific improvements. In the meantime, as swbarnes suggested, reading the BAM file one line at a time and printing each line to the current output file would be best. Depending on how your BAM file is sorted, you may need to keep a hash of file pointers, but that should be a very small hash.
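                      In awk (rather than Perl) that streaming pattern is nearly free, since `print >> file` keeps one handle per output file; the only wrinkle at scale is the per-process open-file limit, which close() works around. A sketch with invented names, holding nothing in memory beyond the current line:

```shell
rm -f by_tile_*.sam   # start clean, since >> appends

# Toy input, deliberately not tile-sorted.
cat > reads.sam <<'EOF'
HWI-ST975:104:C0W47ACXX:8:1101:8269:91631	...
HWI-ST975:104:C0W47ACXX:8:2203:1111:22222	...
HWI-ST975:104:C0W47ACXX:8:1101:9000:12345	...
EOF

awk '
    {
        split($1, id, ":")
        out = "by_tile_" id[5] ".sam"
        print >> out                 # appended as lines stream past
        close(out)                   # stay under the open-file limit
    }
' reads.sam
```

                      Closing after every line is overkill; since the input isn't tile-sorted, append mode is what keeps earlier lines intact when a file is reopened. With few enough tiles one could skip close() entirely and let awk hold all the handles open.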
