Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Plot of NGS paired end distance at each genome position

    Hi,

    could anybody help in the following: standard Illumina paired end 300 bp NGS library is mapped on human genome. I would like FOR EACH GENOMIC POSITION to plot the average size of fragments that map across that position. The reason is, we see a huge decrease of fragments from the regular 300 down to ~100 in difficult GC rich areas.

    Thank you,
    Daniel

  • #2
    I don't know of any premade packages to do that. However, you could use pysam or Rsamtools to iterate through alignments crossing each position and average the TLEN values. I don't recall if the underlying bam_fetch function will return both mates in a pair, but it so make sure to include only one (based on sign of TLEN), and if not you'll have to handle things accordingly.

    You could also do this in C or java if you prefer, though it seems most people tend to prefer python or R.

    Comment


    • #3
      To save a bit of coding time trying to get around the intricacies of pysam, I would use samtools view and awk to generate the numbers:
      Code:
      samtools view input.bam | \
        awk -v 'OFS=,' 'BEGIN{print "RName,RPos,TLen"} {if($9>0){print $3,$4,$9}}' | \
        gzip > template_lengths.csv.gz
      Then R to generate the graph:
      Code:
      data.df <- read.csv("template_lengths.csv.gz");
      pdf("MyTemplateLengthPlots.pdf",width=20,height=8);
      for(rName in levels(data.df$RName)){
        sub.df <- subset(data.df, RName == rName);
        plot(sub.df$Rpos, sub.df$TLen, main = rName);
      }
      graphics.off()
      Last edited by gringer; 04-02-2014, 02:18 AM.

      Comment


      • #4
        You want to plot 3.2 billion data points? That is going to be a problem. You should look into some other way of representing that data.

        If you do want to plot with R then I reccomend the ggbio package from bioconductor.
        Last edited by Jeremy; 04-02-2014, 07:44 PM.

        Comment


        • #5
          ALE (Assembly Likelihood Estimator) may be useful here. It can detect areas where the apparent insert size differs from usual, and generates a histogram that you can view, as well as probabilities that the reference represents the reads. It accepts an assembly (fasta) file and set of constituent mapped reads (sam), to generate output.

          Comment


          • #6
            Additional information

            Thank you for the suggestions. Let me give you some more detail, maybe somebody encountered similar problem. We are trying to fill in some gaps in the chicken genome sequence. Those gaps do contain genes but are extremely GC rich. EVEN WITH ~1000x COVERAGE (!) OF THE CHICKEN GENOME we cannot extend our sequence into the gaps by using the paired reads. What happens that as we get closer to the boundary of the difficult sequence, the paired distances diminish and at the border the paired reads basically overlap each other, so there is not chance to walk into the gap. Just by inspecting few genomic regions, I can see this happening repeatedly. So I wanted to visualize this better by plotting the paired distances (not on the entire genome, just on selected genomic regions, on the order of several kbp).

            Comment


            • #7
              You may simply need longer "reads". For example, PacBio reads, or MiSeq 300bpx2 with an average insert of around 500bp merged with bbmerge. Since you already have 300bp reads, it may be useful to merge them into consensi before using them, assuming your insert size is appropriate. Of course, reads with low-complexity sequence will not merge very well. Fortunately... low-complexity sequence (like ACACACAC...ACAC) is not very important for most purposes.

              Have you tried ALE? If it doesn't work I can write something that will plot the pairing distance as a function of position, but ALE should already do this extremely well.

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Advancing Precision Medicine for Rare Diseases in Children
                by seqadmin




                Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                12-16-2024, 07:57 AM
              • seqadmin
                Recent Advances in Sequencing Technologies
                by seqadmin



                Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                Long-Read Sequencing
                Long-read sequencing has seen remarkable advancements,...
                12-02-2024, 01:49 PM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 12-17-2024, 10:28 AM
              0 responses
              22 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 12-13-2024, 08:24 AM
              0 responses
              42 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 12-12-2024, 07:41 AM
              0 responses
              28 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 12-11-2024, 07:45 AM
              0 responses
              42 views
              0 likes
              Last Post seqadmin  
              Working...
              X