Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Best way to determine distribution of read start sites covering a base?

    Hey all,

    Basically I want to be able to for any given base, determine how many uniqe reads cover it, how many reads appearing in duplicate cover it, how many reads appearing in triplicate cover it, etc.

    IE, for a base at say chr1, position 1000, if there are 10 100bp reads covering it starting at:

    1. chr1:920
    2. chr1:935
    3. chr1:950
    4. chr1:950
    5. chr1:950
    6. chr1:978
    7. chr1:980
    8. chr1:989
    9. chr1:989
    10. chr1:996

    I want to know that the base 1000 is covered by reads starting at 5 sites uniquely (1,2,6,7,10 above), 1 site in duplicate (8,9 above), and 1 site in triplicate (3,4,5 above). I only want to consider one strand at a time. Thus, the output could be like:

    Chr,Base,Start_sites_covering_1x,Start_sites_covering_2x,Start_sites_covering_3x,etc
    1,1000,5,1,1

    Does anybody have any insight on an efficient way to do this? Are there any tools designed to do this, and if not, any thoughts on the easiest tools to use to write a program to pull this off?

    I would extremely appreciate any help.

  • #2
    Originally posted by Heisman View Post
    Hey all,

    Basically I want to be able to for any given base, determine how many uniqe reads cover it, how many reads appearing in duplicate cover it, how many reads appearing in triplicate cover it, etc.

    IE, for a base at say chr1, position 1000, if there are 10 100bp reads covering it starting at:

    1. chr1:920
    2. chr1:935
    3. chr1:950
    4. chr1:950
    5. chr1:950
    6. chr1:978
    7. chr1:980
    8. chr1:989
    9. chr1:989
    10. chr1:996

    I want to know that the base 1000 is covered by reads starting at 5 sites uniquely (1,2,6,7,10 above), 1 site in duplicate (8,9 above), and 1 site in triplicate (3,4,5 above). I only want to consider one strand at a time. Thus, the output could be like:

    Chr,Base,Start_sites_covering_1x,Start_sites_covering_2x,Start_sites_covering_3x,etc
    1,1000,5,1,1

    Does anybody have any insight on an efficient way to do this? Are there any tools designed to do this, and if not, any thoughts on the easiest tools to use to write a program to pull this off?

    I would extremely appreciate any help.
    Hi-

    Just a thought... If produce the pileup http://samtools.sourceforge.net/pileup.shtml of your alignment, then at each position you can count how many reads start there by counting the occurrences of the ^ sign in column 5. (^ marks the start of a read and it is followed by the quality score).

    Then you could summarize the counts by region in the way you show by using bedtools groupby

    You would have to stitch the whole thing together with some python/perl/awk/whatever but it should be quite quick and easy.

    Does it get you any closer to what you need?

    Dario

    Comment


    • #3
      Originally posted by dariober View Post
      Hi-

      Just a thought... If produce the pileup http://samtools.sourceforge.net/pileup.shtml of your alignment, then at each position you can count how many reads start there by counting the occurrences of the ^ sign in column 5. (^ marks the start of a read and it is followed by the quality score).

      Then you could summarize the counts by region in the way you show by using bedtools groupby

      You would have to stitch the whole thing together with some python/perl/awk/whatever but it should be quite quick and easy.

      Does it get you any closer to what you need?

      Dario
      This is very promising! I didn't realize I could utilize a pileup for this purpose. I can basically create a pileup from a .bam file, create a new column with the counts of '^', then extract regions I'm interested for any given base (ex: base n-99...n for 100bp reads), and then just actually use uniq -c based on the count column and get the output that I need. I don't think I'll need to use bedtools groupby but I wasn't even aware of that until this post and it looks quite useful for other purposes, so thanks for that as well.

      One question, can you tell from the pileup with '^' if the read is on the forward or reverse strand? If not, I can easily just separate forward/reverse strand reads initially from the .bam file and then do the analysis.

      Comment


      • #4
        Originally posted by Heisman View Post
        This is very promising! I didn't realize I could utilize a pileup for this purpose. I can basically create a pileup from a .bam file, create a new column with the counts of '^', then extract regions I'm interested for any given base (ex: base n-99...n for 100bp reads)
        Just take care that the ^ is followed by the mapping quality which might be (not sure) a ^ sign itself. Cases where you have ^^ should then be counted as one.

        Also note that you can pass a bed file of regions of interest to mpileup via the -l option. This way you extract your regions while creating the pileup instead of afterwards.

        One question, can you tell from the pileup with '^' if the read is on the forward or reverse strand? If not, I can easily just separate forward/reverse strand reads initially from the .bam file and then do the analysis.
        I guess you could use the information in the pileup to tell which strand the read comes from. But I suspect it's going to be easier to just separate them
        at the beginning from the bam file.

        Comment


        • #5
          Originally posted by dariober View Post
          Just take care that the ^ is followed by the mapping quality which might be (not sure) a ^ sign itself. Cases where you have ^^ should then be counted as one.

          Also note that you can pass a bed file of regions of interest to mpileup via the -l option. This way you extract your regions while creating the pileup instead of afterwards.



          I guess you could use the information in the pileup to tell which strand the read comes from. But I suspect it's going to be easier to just separate them
          at the beginning from the bam file.
          Excellent, thank you.

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Current Approaches to Protein Sequencing
            by seqadmin


            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
            04-04-2024, 04:25 PM
          • seqadmin
            Strategies for Sequencing Challenging Samples
            by seqadmin


            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
            03-22-2024, 06:39 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, 04-11-2024, 12:08 PM
          0 responses
          29 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 10:19 PM
          0 responses
          31 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 09:21 AM
          0 responses
          28 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-04-2024, 09:00 AM
          0 responses
          52 views
          0 likes
          Last Post seqadmin  
          Working...
          X