Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to get a summary for a special chromosome position in BAM file

    I want to get a summary of a special chromosome position in a BAM file, for example:
    I have chr1:100,000, how many times I got T, and how many times I got G...
    Is there any code for this question? Or any software I can use? Thanks!

  • #2
    Caveat: I hope somebody comes up with something better!

    Parsing the output of samtools mpileup may be the answer.
    Here's a script to do it, just edit for samtools path ($SAMT) and pass arguments 1) chromsome 2) location 3) bamfile.
    Problems may be : "one off" error. Didn't check (sorry) and may not really need UPSTREAM.

    Here's the bash script ...
    #####----------------------- START CODE

    #set SAMT to your samtools on next line !!!
    export SAMT=/h1/finneyr/samtools-0.1.18/samtools

    # this file is called job.mptest and is in current directory ...
    # example call ./job.mptest chr1 1011282 /tcga_next_gen/NG_buckets/bucket22/bam/C484.TCGA-06-0119-10A-01D-1490-08.5.bam

    export CHR=$1
    export PLACE=$2
    export UPSTREAM=$(($PLACE-1))
    export BAMFILE=$3

    RANGE=${CHR}:${UPSTREAM}-${PLACE}

    $SAMT mpileup -r $RANGE $BAMFILE 2> /dev/null | tail -1 | cut -f5 | sed -e "s/.\{1\}/&\n/g" | tr "[:lower:]" "[:upper:]" | grep "[A,C,G,T]" | sort | uniq -c
    # sed command adds newline every 1 char, tr command converts lower to upper

    ####------------------- end code

    example usage and output :
    ./job.mptest chr1 1011278 /tcga_next_gen/NG_buckets/bucket22/bam/C484.TCGA-06-0119-10A-01D-1490-08.5.bam
    15 A
    9 G
    Last edited by Richard Finney; 10-26-2012, 03:11 PM.

    Comment


    • #3
      samtools pileup

      Comment


      • #4
        Originally posted by Richard Finney View Post
        Caveat: I hope somebody comes up with something better!

        Parsing the output of samtools mpileup may be the answer.
        Here's a script to do it, just edit for samtools path ($SAMT) and pass arguments 1) chromsome 2) location 3) bamfile.
        Problems may be : "one off" error. Didn't check (sorry) and may not really need UPSTREAM.

        Here's the bash script ...
        #####----------------------- START CODE

        #set SAMT to your samtools on next line !!!
        export SAMT=/h1/finneyr/samtools-0.1.18/samtools

        # this file is called job.mptest and is in current directory ...
        # example call ./job.mptest chr1 1011282 /tcga_next_gen/NG_buckets/bucket22/bam/C484.TCGA-06-0119-10A-01D-1490-08.5.bam

        export CHR=$1
        export PLACE=$2
        export UPSTREAM=$(($PLACE-1))
        export BAMFILE=$3

        RANGE=${CHR}:${UPSTREAM}-${PLACE}

        $SAMT mpileup -r $RANGE $BAMFILE 2> /dev/null | tail -1 | cut -f5 | sed -e "s/.\{1\}/&\n/g" | tr "[:lower:]" "[:upper:]" | grep "[A,C,G,T]" | sort | uniq -c
        # sed command adds newline every 1 char, tr command converts lower to upper

        ####------------------- end code

        example usage and output :
        ./job.mptest chr1 1011278 /tcga_next_gen/NG_buckets/bucket22/bam/C484.TCGA-06-0119-10A-01D-1490-08.5.bam
        15 A
        9 G
        Thanks Richard! However, when I ran the code, I did not get the output.
        What I did is: I created a job.mptest file, and copy-paste your code in that. And change the samtools path. Then copy-paste this file into the bam file folder, and run the code you listed there. But there is no output. Is there anything wrong with my operation? Thanks so much!

        Comment


        • #5
          remove the " 2> /dev/null " to see error messages from samtools.

          get rid of the RED colored text as is done here ...
          [FONT="Courier New"]
          $SAMT mpileup -r $RANGE $BAMFILE 2> /dev/null | tail -1 | cut -f5 | sed -e "s/.\{1\}/&\n/g" | tr "[:lower:]" "[:upper:]" | grep "[A,C,G,T]" | sort | uniq -c

          Comment


          • #6
            Richard, Thank you very much!!!
            I finally found that my index file is not appropriate. If a bam file is aaa.bam, my index file is aaa.bai. However, after I changed to aaa.bam.bai, it works very well. Thanks!

            Comment


            • #7
              Hi, Richard,
              I then found another question with your code. Here is my code on one chromosome position:

              samtools mpileup -r chr1:205131282-205131283 /home/twotwo/1234.bam
              [mpileup] 1 samples in 1 input files
              <mpileup> Set max per-file depth to 8000
              file idx 34926928 chr1 205131282 N 42 TTTTTTTTtTtTTTTTTTTTTTTttTTtTTttTtTTTtTTTt @BDB<BDAG?I@B@HCIGJIJJIEG>JAIG?DJBHF?5FFFB
              chr1 205131283 N 43 TTTTTTTT$tTtTTTTTTTTTTTTttTTtTTttTtTTTtTTTt^]t CADDA>DCICICECEDIGJJIGJ?GDJEJGBBJ?GHF<HFFAB

              The following is the result. For the position of chr1:205131283, after I ran all of your command(with the shell script after the bam file), I got 43 T. However, if I used the igvtools to view it, I got 43T and 3A. It seemed the bottom line should also be counted because it has 3 A. However, if the bottom line was counted, there are some C and G, which is not showing in the IGVtools. How can I modify it to get 3A??? Thank you very much for your help!





              Originally posted by Richard Finney View Post
              remove the " 2> /dev/null " to see error messages from samtools.

              get rid of the RED colored text as is done here ...
              [FONT="Courier New"]
              $SAMT mpileup -r $RANGE $BAMFILE 2> /dev/null | tail -1 | cut -f5 | sed -e "s/.\{1\}/&\n/g" | tr "[:lower:]" "[:upper:]" | grep "[A,C,G,T]" | sort | uniq -c

              Comment


              • #8
                I'm looking at this. The parsing of the 5th field is naive and does not properly count the $,^,<,and> characters. It appears to be inadequate for all possible data for this field. "mpileup" is discussed at http://www.biostars.org/show/tag/mpileup/

                Others have problems using this approach : http://blog.nextgenetics.net/?e=56

                Dan Koboldt at Massgenomics blog talks about mpileup : ( from http://massgenomics.org/2012/03/5-th...s-mpileup.html )

                The 6th field is quality, not sequence.

                I have no idea how three As are showing up on IGV. Do you think IGV is right in this case?

                Comment


                • #9
                  Try adding the -A flag to mpileup (so "samtools mpileup -A . . . "). I've found that if you want to see exactly what you see in IGV, then you should include anomalous reads. Also, mpileup (as well as pileup and depth) have a ceiling cutoff for maximum depth at a position, so it is also helpful to specify a large ceiling if you think you'll hit the maximum (which I believe is 8000/number of samples).

                  Edit: Another thing worth checking - make sure you aren't off by 1 due to 0-offset or 1-offset. I believe mpileup and IGV are both 1-offset, so that should be fine but just something to keep in mind.

                  Justin
                  Last edited by BAMseek; 01-25-2013, 02:26 PM.

                  Comment


                  • #10
                    Originally posted by Richard Finney View Post
                    I'm looking at this. The parsing of the 5th field is naive and does not properly count the $,^,<,and> characters. It appears to be inadequate for all possible data for this field. "mpileup" is discussed at http://www.biostars.org/show/tag/mpileup/

                    Others have problems using this approach : http://blog.nextgenetics.net/?e=56

                    Dan Koboldt at Massgenomics blog talks about mpileup : ( from http://massgenomics.org/2012/03/5-th...s-mpileup.html )

                    The 6th field is quality, not sequence.

                    I have no idea how three As are showing up on IGV. Do you think IGV is right in this case?
                    Hi, Richard,
                    These 3A are also considered mutation with other software, like the varscan... So I think the IGV is right.... However, I really cannot see it in the samtools. very confusing.

                    Comment


                    • #11
                      Originally posted by BAMseek View Post
                      Try adding the -A flag to mpileup (so "samtools mpileup -A . . . "). I've found that if you want to see exactly what you see in IGV, then you should include anomalous reads. Also, mpileup (as well as pileup and depth) have a ceiling cutoff for maximum depth at a position, so it is also helpful to specify a large ceiling if you think you'll hit the maximum (which I believe is 8000/number of samples).

                      Edit: Another thing worth checking - make sure you aren't off by 1 due to 0-offset or 1-offset. I believe mpileup and IGV are both 1-offset, so that should be fine but just something to keep in mind.

                      Justin
                      Thanks Justin. I put -A in the code. However, it still gave me 43T (no 3A). How can I check the 0-offset or 1-offset?

                      Comment


                      • #12
                        Originally posted by twotwo View Post
                        Thanks Justin. I put -A in the code. However, it still gave me 43T (no 3A). How can I check the 0-offset or 1-offset?
                        Looks like from your mpileup output that the calls are

                        chr1 205131282 T
                        chr1 205131283 T

                        This lines up with IGV, so looks like that is fine. The only other thing I can think to check would be if those As are at the end of the read and are actually soft-clipped. If you have "show soft clipping" enabled in IGV, then it would show those bases, but they aren't counted in the total base counts. Maybe you could post a picture of what it looks like in IGV.

                        Comment


                        • #13
                          Originally posted by BAMseek View Post
                          Looks like from your mpileup output that the calls are

                          chr1 205131282 T
                          chr1 205131283 T

                          This lines up with IGV, so looks like that is fine. The only other thing I can think to check would be if those As are at the end of the read and are actually soft-clipped. If you have "show soft clipping" enabled in IGV, then it would show those bases, but they aren't counted in the total base counts. Maybe you could post a picture of what it looks like in IGV.
                          Please see the attached. All of the As are in some orange background. However, the summary window showed it is 43T and 3 A.
                          Attached Files

                          Comment


                          • #14
                            I think that the coloring means that the paired read was aligned to a different chromosome. If you hadn't already put in the -A flag for anomalous reads, I would have guessed that to be the cause of the difference.

                            Comment


                            • #15
                              Originally posted by BAMseek View Post
                              I think that the coloring means that the paired read was aligned to a different chromosome. If you hadn't already put in the -A flag for anomalous reads, I would have guessed that to be the cause of the difference.
                              If I put the -A, the command is:
                              samtools mpileup -A -r chr1:205131282-205131283 /home/twotwo/1234.cleaned.bam 2> /dev/null | tail -1 | cut -f5 | sed -e "s/.\{1\}/&\n/g" | tr "[:lower:]" "[:upper:]" | grep "T" | uniq -c

                              However, I still get 43T only, no 3A. Is there any problem with my command? Thanks!

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              17 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              22 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              16 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              46 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X