Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Query bam file and assemble reads

    Hi,

    I am quite a newbie in Bioinformatic and maybe asking something stupid.

    I am trying to generate a four-column file for assembling overlapped sequencing reads into longer contigs from a sorted bam file. The file needs to contain the following information in this format:

    chr#: start position of alignment - stop position of alignment: strand (+/-)

    Currently I am trying to awk the bam files to obtain the information, my code is here:

    samtools mpileup input.sorted.bam |\
    cut -d '\t' -f 1,2 |\
    awk -F '\t' 'BEGIN {chr="";start=-1;end=-1} {if(chr!=$1 || int($2)!=end+1) { if(chr!="") {printf("%s:%d-%d\n",chr,start,end);} chr=$1;start=int($2);end=int($2);} else { end=end+1;}} END {if(chr!="") {printf("%s:%d-%d\n",chr,start,end); } }'>output.txt

    It can work except with strand info. Actually, it ignores the strand info. If a read from + strand overlapped with a read from - strand, it will form a contig and that's not what I want. I want to assemble contigs in the same strand.

    How can I improve my code to take in strand info and make the assembly according to strand?

    Please help. Thank you very much.

    Dadi Gao

  • #2
    I think Bio:B::Sam is the way to go, if you know a bit perl programming. You can use its methods to access sam/bam for any information, including map positions and strands.

    Comment


    • #3
      Originally posted by macrowave View Post
      I think Bio:B::Sam is the way to go, if you know a bit perl programming. You can use its methods to access sam/bam for any information, including map positions and strands.
      Unfortunately I know nothing about perl
      Is there another way to do it directly from linux shell?

      Comment


      • #4
        Originally posted by dustar1986 View Post
        Unfortunately I know nothing about perl
        Is there another way to do it directly from linux shell?
        This would be a great time to learn perl/python/a-scripting-language.

        Comment


        • #5
          Maybe you can dig something out from the sam bitwise flags, I'm sure the strand information would be encoded there. For details, you can see the sam specifications here:
          http://samtools.sourceforge.net/SAM1.pdf. And there is an easy tool to explain the flags here: http://picard.sourceforge.net/explain-flags.html

          Comment


          • #6
            You might also try the bedtools suite - bamtobed and then mergebed. Not sure if that will deal with the strands,you might be able to filter the two strands at the bam output level.

            Comment


            • #7
              Originally posted by nilshomer View Post
              This would be a great time to learn perl/python/a-scripting-language.
              I can use python actually...

              Comment


              • #8
                Originally posted by macrowave View Post
                Maybe you can dig something out from the sam bitwise flags, I'm sure the strand information would be encoded there. For details, you can see the sam specifications here:
                http://samtools.sourceforge.net/SAM1.pdf. And there is an easy tool to explain the flags here: http://picard.sourceforge.net/explain-flags.html
                Thanks. But the flag only appear in samtools -view not in samtools -mpileup.
                I want to use the latter one coz it's easier for assembly.

                Comment


                • #9
                  Originally posted by cbeck View Post
                  You might also try the bedtools suite - bamtobed and then mergebed. Not sure if that will deal with the strands,you might be able to filter the two strands at the bam output level.
                  Thanks, I'll have a look at that.

                  Comment


                  • #10
                    If you install bedtools, you could use the "mergeBed" program to merge overlapping reads on the same strand as follows:
                    Code:
                    bamToBed -i aln.bam | mergeBed -i stdin -s > merged.intervals.bed
                    Alternatively, you could use the "genomeCoverageBed" tool to create strand-specific BEDGRAPH files, one for each strand. You could then combine them into a single file.
                    Code:
                    bamToBed -i aln.bam | \
                              awk '$6=="+" | \
                              genomeCoverageBed -i stdin -g chrom.sizes -bg \
                              > merged.intervals.fwd.bedg
                    
                    bamToBed -i aln.bam | \
                              awk '$6=="-" | \
                              genomeCoverageBed -i stdin -g chrom.sizes -bg \
                              > merged.intervals.rev.bedg
                    
                    cat merged.intervals.fwd.bedg merged.intervals.rev.bedg > merged.intervals.both.bedg

                    Comment


                    • #11
                      BAM file is compressed, you need do "samtools view -u -bh aln.bam > my.bam" first.

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Essential Discoveries and Tools in Epitranscriptomics
                        by seqadmin




                        The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                        04-22-2024, 07:01 AM
                      • seqadmin
                        Current Approaches to Protein Sequencing
                        by seqadmin


                        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                        04-04-2024, 04:25 PM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, 04-11-2024, 12:08 PM
                      0 responses
                      59 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 10:19 PM
                      0 responses
                      57 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 09:21 AM
                      0 responses
                      53 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-04-2024, 09:00 AM
                      0 responses
                      56 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X