Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Split SAM/BAM files into thousands of contigs

    Hi,

    I am new to bioinformatics and programming, so this might be too easy to tackle for some of you. I am working on a newly sequenced genome with thousands of contigs. I have Illumina sequence data that I have mapped using bowtie and BWA. When I try to import that file into my genome viewer, Geneious, it crashes. So, I want to split my BAM files into many, so the program can handle them easily. My question is if there is an easy way to split bam files by each of the contigs. So, in effect, it should split one BAM file into thousands of mini-files, each corresponding to a particular contig. I am familiar with samtools view (e.g. samtools view in.bam chr1 > chr1.bam) and have used it to make it for individual cases. Is there a better way to automate that?
    Any help much appreciated.

  • #2
    No; I'd just write my own script that will read the line of the .bam, and then write it to the proper file for its contig.

    Comment


    • #3
      A unix oneliner should work right?

      Code:
      for i in {1..22};do samtools view -bh input.bam chr$i > chr$i.bam;done
      Last edited by vivek_; 06-05-2013, 01:16 PM.

      Comment


      • #4
        Originally posted by vivek_ View Post
        A unix oneliner should work right?

        Code:
        for i in {1..22};do samtools view -bh input.bam chr$i > chr$i.bam;done
        Depends, I've seen examples where the contigs weren't sequentially numbered (presumably due to some contigs becoming merged as latter data came in)

        Also, for a file with a large number of contigs (have a look at some of the mouse lines from the Sanger Institute), looping over the whole file many many times will get super slow. You could probably write a script to process the whole thing in one go in a fraction of the time.

        Comment


        • #5
          That's why you have the BAM index right, so you are not reading the entire file to export each coordinate?

          for the sequentiality issue, you can extract the contig names from the bam header into a file and loop over them:

          Code:
          samtools view -H input.bam | awk '{print $2}' | awk '{gsub(/SN\:/,""); print}'  > contigs.txt

          Comment


          • #6
            Originally posted by vivek_ View Post
            That's why you have the BAM index right, so you are not reading the entire file to export each coordinate?

            for the sequentiality issue, you can extract the contig names from the bam header into a file and loop over them:

            Code:
            samtools view -H input.bam | awk '{print $2}' | awk '{gsub(/SN\:/,""); print}'  > contigs.txt
            Watch out, you don't want the first ("@HD ...") nor the last ("@PG ...") line of the header.

            Try this instead:
            Code:
            samtools view -H all.bam | sed '1d;s/.*SN:\(.*\)\t.*/\1/;$d' > contigs.list
            Or, if you prefer awk:
            Code:
            samtools view -H all.bam | awk '/^@SQ/{gsub(/SN\:/,"");print $2}' > contigs.list
            or even (just for fun):
            Code:
            samtools idxstats all.bam | cut -f1 > contigs.list
            All those should give you the same list of contigs.

            Then,
            Code:
            for c in `cat contigs.list` ; do
            echo processing $c
            samtools view -bh all.bam $c > $c.bam
            done
            But I agree it might take a while...

            Comment


            • #7
              Thanks everyone. I will give these suggestions a try and let you know how it went.

              Comment


              • #8
                This is good method to split bam file. But I got a question.
                The following is idxstats of a bam file:
                chrM 16571 2073252 32042
                chr1 249250621 115733016 1937746
                chr2 243199373 104133908 2244387
                chr3 198022430 96577573 1501432
                chr4 191154276 89582368 1825761
                chr5 180915260 94818025 1486923
                chr6 171115067 84533173 1273600
                chr7 159138663 71186849 1531851
                chr8 146364022 65630236 1315785
                chr9 141213431 59368028 1184324
                chr10 135534747 63503839 2018103
                chr11 135006516 59963670 1373030
                chr12 133851895 63898721 1180836
                chr13 115169878 41939790 616704
                chr14 107349540 43647215 758336
                chr15 102531392 39227879 624791
                chr16 90354753 42298502 991456
                chr17 81195210 49043800 916042
                chr18 78077248 75701725 1614444
                chr19 59128983 26119207 755016
                chr20 63025520 32668117 644231
                chr21 48129895 19226969 547857
                chr22 51304566 15797809 277926
                chrX 155270560 74715396 1365662
                chrY 59373566 2021162 642042
                * 0 0 42640654

                How could I get the "*" 42640654 reads, those were not mapped to any contigs ?

                Comment


                • #9
                  Originally posted by jjlaisnoopy View Post
                  This is good method to split bam file. But I got a question.

                  * 0 0 42640654

                  How could I get the "*" 42640654 reads, those were not mapped to any contigs ?

                  Comment


                  • #10
                    Originally posted by GenoMax View Post
                    I tried the parameter: -f 4
                    And then index the result bam file
                    Here is the idxstats from it:

                    chrM 16571 0 32042
                    chr1 249250621 0 1937746
                    chr2 243199373 0 2244387
                    chr3 198022430 0 1501432
                    chr4 191154276 0 1825761
                    chr5 180915260 0 1486923
                    chr6 171115067 0 1273600
                    chr7 159138663 0 1531851
                    chr8 146364022 0 1315785
                    chr9 141213431 0 1184324
                    chr10 135534747 0 2018103
                    chr11 135006516 0 1373030
                    chr12 133851895 0 1180836
                    chr13 115169878 0 616704
                    chr14 107349540 0 758336
                    chr15 102531392 0 624791
                    chr16 90354753 0 991456
                    chr17 81195210 0 916042
                    chr18 78077248 0 1614444
                    chr19 59128983 0 755016
                    chr20 63025520 0 644231
                    chr21 48129895 0 547857
                    chr22 51304566 0 277926
                    chrX 155270560 0 1365662
                    chrY 59373566 0 642042
                    * 0 0 42640654

                    Any suggestions ?

                    Comment


                    • #11
                      Are you asking about why the number is not adding up to 42640654? See possible explanation here: https://www.biostars.org/p/18949/
                      Last edited by GenoMax; 02-10-2015, 07:05 PM.

                      Comment


                      • #12
                        All I want is the Unmapped reads with no mate or an unmapped mate are assigned to chrom "*" , not include unmapped mate reads which assigned to chr1, chr2, ...
                        just reads assigned to
                        "*" 0 0 42640654

                        Comment


                        • #13
                          Just write a quick little script in python with pysam to do this. There isn't always a premade program to do everything.

                          Comment


                          • #14
                            A command line solution. See if this works:

                            Code:
                            $ samtools view -h file.bam | awk -F'\t' '{OFS = "\n"; ORS = "\n";}{ if ($3 == "*" ) print "@"$1,$10,"+",$11}' > outfile.fastq
                            Last edited by GenoMax; 02-11-2015, 06:38 AM. Reason: correction

                            Comment


                            • #15
                              Originally posted by GenoMax View Post
                              A command line solution. See if this works:

                              Code:
                              $ samtools view -h file.bam | awk -F'\t' '{OFS = "\n"; ORS = "\n";}{ if ($3 == "*" ) print "@"$1,$10,"+",$11}' > outfile.bam
                              GenoMax, you probably mean to call the output file "outfile.fastq", right?
                              Code:
                              $ samtools view -h file.bam | awk -F'\t' '{OFS = "\n"; ORS = "\n";}{ if ($3 == "*" ) print "@"$1,$10,"+",$11}' > outfile.[B]fastq[/B]

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-27-2024, 06:37 PM
                              0 responses
                              12 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-27-2024, 06:07 PM
                              0 responses
                              11 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              53 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              69 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X