Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Plot of NGS paired end distance at each genome position

    Hi,

    could anybody help in the following: standard Illumina paired end 300 bp NGS library is mapped on human genome. I would like FOR EACH GENOMIC POSITION to plot the average size of fragments that map across that position. The reason is, we see a huge decrease of fragments from the regular 300 down to ~100 in difficult GC rich areas.

    Thank you,
    Daniel

  • #2
    I don't know of any premade packages to do that. However, you could use pysam or Rsamtools to iterate through alignments crossing each position and average the TLEN values. I don't recall if the underlying bam_fetch function will return both mates in a pair, but it so make sure to include only one (based on sign of TLEN), and if not you'll have to handle things accordingly.

    You could also do this in C or java if you prefer, though it seems most people tend to prefer python or R.

    Comment


    • #3
      To save a bit of coding time trying to get around the intricacies of pysam, I would use samtools view and awk to generate the numbers:
      Code:
      samtools view input.bam | \
        awk -v 'OFS=,' 'BEGIN{print "RName,RPos,TLen"} {if($9>0){print $3,$4,$9}}' | \
        gzip > template_lengths.csv.gz
      Then R to generate the graph:
      Code:
      data.df <- read.csv("template_lengths.csv.gz");
      pdf("MyTemplateLengthPlots.pdf",width=20,height=8);
      for(rName in levels(data.df$RName)){
        sub.df <- subset(data.df, RName == rName);
        plot(sub.df$Rpos, sub.df$TLen, main = rName);
      }
      graphics.off()
      Last edited by gringer; 04-02-2014, 02:18 AM.

      Comment


      • #4
        You want to plot 3.2 billion data points? That is going to be a problem. You should look into some other way of representing that data.

        If you do want to plot with R then I reccomend the ggbio package from bioconductor.
        Last edited by Jeremy; 04-02-2014, 07:44 PM.

        Comment


        • #5
          ALE (Assembly Likelihood Estimator) may be useful here. It can detect areas where the apparent insert size differs from usual, and generates a histogram that you can view, as well as probabilities that the reference represents the reads. It accepts an assembly (fasta) file and set of constituent mapped reads (sam), to generate output.

          Comment


          • #6
            Additional information

            Thank you for the suggestions. Let me give you some more detail, maybe somebody encountered similar problem. We are trying to fill in some gaps in the chicken genome sequence. Those gaps do contain genes but are extremely GC rich. EVEN WITH ~1000x COVERAGE (!) OF THE CHICKEN GENOME we cannot extend our sequence into the gaps by using the paired reads. What happens that as we get closer to the boundary of the difficult sequence, the paired distances diminish and at the border the paired reads basically overlap each other, so there is not chance to walk into the gap. Just by inspecting few genomic regions, I can see this happening repeatedly. So I wanted to visualize this better by plotting the paired distances (not on the entire genome, just on selected genomic regions, on the order of several kbp).

            Comment


            • #7
              You may simply need longer "reads". For example, PacBio reads, or MiSeq 300bpx2 with an average insert of around 500bp merged with bbmerge. Since you already have 300bp reads, it may be useful to merge them into consensi before using them, assuming your insert size is appropriate. Of course, reads with low-complexity sequence will not merge very well. Fortunately... low-complexity sequence (like ACACACAC...ACAC) is not very important for most purposes.

              Have you tried ALE? If it doesn't work I can write something that will plot the pairing distance as a function of position, but ALE should already do this extremely well.

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Strategies for Sequencing Challenging Samples
                by seqadmin


                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                03-22-2024, 06:39 AM
              • seqadmin
                Techniques and Challenges in Conservation Genomics
                by seqadmin



                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                Avian Conservation
                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                03-08-2024, 10:41 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, Yesterday, 06:37 PM
              0 responses
              10 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, Yesterday, 06:07 PM
              0 responses
              9 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-22-2024, 10:03 AM
              0 responses
              50 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-21-2024, 07:32 AM
              0 responses
              67 views
              0 likes
              Last Post seqadmin  
              Working...
              X