Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Query bam file and assemble reads

    Hi,

    I am quite a newbie in Bioinformatic and maybe asking something stupid.

    I am trying to generate a four-column file for assembling overlapped sequencing reads into longer contigs from a sorted bam file. The file needs to contain the following information in this format:

    chr#: start position of alignment - stop position of alignment: strand (+/-)

    Currently I am trying to awk the bam files to obtain the information, my code is here:

    samtools mpileup input.sorted.bam |\
    cut -d '\t' -f 1,2 |\
    awk -F '\t' 'BEGIN {chr="";start=-1;end=-1} {if(chr!=$1 || int($2)!=end+1) { if(chr!="") {printf("%s:%d-%d\n",chr,start,end);} chr=$1;start=int($2);end=int($2);} else { end=end+1;}} END {if(chr!="") {printf("%s:%d-%d\n",chr,start,end); } }'>output.txt

    It can work except with strand info. Actually, it ignores the strand info. If a read from + strand overlapped with a read from - strand, it will form a contig and that's not what I want. I want to assemble contigs in the same strand.

    How can I improve my code to take in strand info and make the assembly according to strand?

    Please help. Thank you very much.

    Dadi Gao

  • #2
    I think Bio:B::Sam is the way to go, if you know a bit perl programming. You can use its methods to access sam/bam for any information, including map positions and strands.

    Comment


    • #3
      Originally posted by macrowave View Post
      I think Bio:B::Sam is the way to go, if you know a bit perl programming. You can use its methods to access sam/bam for any information, including map positions and strands.
      Unfortunately I know nothing about perl
      Is there another way to do it directly from linux shell?

      Comment


      • #4
        Originally posted by dustar1986 View Post
        Unfortunately I know nothing about perl
        Is there another way to do it directly from linux shell?
        This would be a great time to learn perl/python/a-scripting-language.

        Comment


        • #5
          Maybe you can dig something out from the sam bitwise flags, I'm sure the strand information would be encoded there. For details, you can see the sam specifications here:
          http://samtools.sourceforge.net/SAM1.pdf. And there is an easy tool to explain the flags here: http://picard.sourceforge.net/explain-flags.html

          Comment


          • #6
            You might also try the bedtools suite - bamtobed and then mergebed. Not sure if that will deal with the strands,you might be able to filter the two strands at the bam output level.

            Comment


            • #7
              Originally posted by nilshomer View Post
              This would be a great time to learn perl/python/a-scripting-language.
              I can use python actually...

              Comment


              • #8
                Originally posted by macrowave View Post
                Maybe you can dig something out from the sam bitwise flags, I'm sure the strand information would be encoded there. For details, you can see the sam specifications here:
                http://samtools.sourceforge.net/SAM1.pdf. And there is an easy tool to explain the flags here: http://picard.sourceforge.net/explain-flags.html
                Thanks. But the flag only appear in samtools -view not in samtools -mpileup.
                I want to use the latter one coz it's easier for assembly.

                Comment


                • #9
                  Originally posted by cbeck View Post
                  You might also try the bedtools suite - bamtobed and then mergebed. Not sure if that will deal with the strands,you might be able to filter the two strands at the bam output level.
                  Thanks, I'll have a look at that.

                  Comment


                  • #10
                    If you install bedtools, you could use the "mergeBed" program to merge overlapping reads on the same strand as follows:
                    Code:
                    bamToBed -i aln.bam | mergeBed -i stdin -s > merged.intervals.bed
                    Alternatively, you could use the "genomeCoverageBed" tool to create strand-specific BEDGRAPH files, one for each strand. You could then combine them into a single file.
                    Code:
                    bamToBed -i aln.bam | \
                              awk '$6=="+" | \
                              genomeCoverageBed -i stdin -g chrom.sizes -bg \
                              > merged.intervals.fwd.bedg
                    
                    bamToBed -i aln.bam | \
                              awk '$6=="-" | \
                              genomeCoverageBed -i stdin -g chrom.sizes -bg \
                              > merged.intervals.rev.bedg
                    
                    cat merged.intervals.fwd.bedg merged.intervals.rev.bedg > merged.intervals.both.bedg

                    Comment


                    • #11
                      BAM file is compressed, you need do "samtools view -u -bh aln.bam > my.bam" first.

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Strategies for Sequencing Challenging Samples
                        by seqadmin


                        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                        03-22-2024, 06:39 AM
                      • seqadmin
                        Techniques and Challenges in Conservation Genomics
                        by seqadmin



                        The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                        Avian Conservation
                        Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                        03-08-2024, 10:41 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, Yesterday, 06:37 PM
                      0 responses
                      10 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, Yesterday, 06:07 PM
                      0 responses
                      9 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 03-22-2024, 10:03 AM
                      0 responses
                      51 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 03-21-2024, 07:32 AM
                      0 responses
                      67 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X