SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Wierd overlapping paired ends with bimodal paired end distance distance distribution Mouth_Breather Illumina/Solexa 12 08-24-2012 12:06 PM
Sorting Paired-end files, and automating Mean Inner Distance between Mate Pairs prussiap RNA Sequencing 0 06-14-2012 11:35 AM
paired-end reads mapped to genome.. gene with only one direction of paired-end reads? danwiththeplan Bioinformatics 2 09-22-2011 03:06 AM
Compute paired-end distance distribution? krobison Bioinformatics 11 11-12-2009 09:30 AM
NGS of paired-end tags (PET) for transcriptome and genome analyses Melissa Literature Watch 0 04-24-2009 06:46 AM

Reply
 
Thread Tools
Old 04-02-2014, 02:34 AM   #1
Retro
Member
 
Location: Pennsylvania

Join Date: Apr 2011
Posts: 27
Default Plot of NGS paired end distance at each genome position

Hi,

could anybody help in the following: standard Illumina paired end 300 bp NGS library is mapped on human genome. I would like FOR EACH GENOMIC POSITION to plot the average size of fragments that map across that position. The reason is, we see a huge decrease of fragments from the regular 300 down to ~100 in difficult GC rich areas.

Thank you,
Daniel
Retro is offline   Reply With Quote
Old 04-02-2014, 02:50 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

I don't know of any premade packages to do that. However, you could use pysam or Rsamtools to iterate through alignments crossing each position and average the TLEN values. I don't recall if the underlying bam_fetch function will return both mates in a pair, but it so make sure to include only one (based on sign of TLEN), and if not you'll have to handle things accordingly.

You could also do this in C or java if you prefer, though it seems most people tend to prefer python or R.
dpryan is offline   Reply With Quote
Old 04-02-2014, 03:15 AM   #3
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 836
Default

To save a bit of coding time trying to get around the intricacies of pysam, I would use samtools view and awk to generate the numbers:
Code:
samtools view input.bam | \
  awk -v 'OFS=,' 'BEGIN{print "RName,RPos,TLen"} {if($9>0){print $3,$4,$9}}' | \
  gzip > template_lengths.csv.gz
Then R to generate the graph:
Code:
data.df <- read.csv("template_lengths.csv.gz");
pdf("MyTemplateLengthPlots.pdf",width=20,height=8);
for(rName in levels(data.df$RName)){
  sub.df <- subset(data.df, RName == rName);
  plot(sub.df$Rpos, sub.df$TLen, main = rName);
}
graphics.off()

Last edited by gringer; 04-02-2014 at 03:18 AM.
gringer is offline   Reply With Quote
Old 04-02-2014, 08:40 PM   #4
Jeremy
Senior Member
 
Location: Pathum Thani, Thailand

Join Date: Nov 2009
Posts: 190
Default

You want to plot 3.2 billion data points? That is going to be a problem. You should look into some other way of representing that data.

If you do want to plot with R then I reccomend the ggbio package from bioconductor.

Last edited by Jeremy; 04-02-2014 at 08:44 PM.
Jeremy is offline   Reply With Quote
Old 04-02-2014, 09:34 PM   #5
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

ALE (Assembly Likelihood Estimator) may be useful here. It can detect areas where the apparent insert size differs from usual, and generates a histogram that you can view, as well as probabilities that the reference represents the reads. It accepts an assembly (fasta) file and set of constituent mapped reads (sam), to generate output.
Brian Bushnell is offline   Reply With Quote
Old 04-03-2014, 02:53 AM   #6
Retro
Member
 
Location: Pennsylvania

Join Date: Apr 2011
Posts: 27
Default Additional information

Thank you for the suggestions. Let me give you some more detail, maybe somebody encountered similar problem. We are trying to fill in some gaps in the chicken genome sequence. Those gaps do contain genes but are extremely GC rich. EVEN WITH ~1000x COVERAGE (!) OF THE CHICKEN GENOME we cannot extend our sequence into the gaps by using the paired reads. What happens that as we get closer to the boundary of the difficult sequence, the paired distances diminish and at the border the paired reads basically overlap each other, so there is not chance to walk into the gap. Just by inspecting few genomic regions, I can see this happening repeatedly. So I wanted to visualize this better by plotting the paired distances (not on the entire genome, just on selected genomic regions, on the order of several kbp).
Retro is offline   Reply With Quote
Old 04-08-2014, 07:41 PM   #7
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

You may simply need longer "reads". For example, PacBio reads, or MiSeq 300bpx2 with an average insert of around 500bp merged with bbmerge. Since you already have 300bp reads, it may be useful to merge them into consensi before using them, assuming your insert size is appropriate. Of course, reads with low-complexity sequence will not merge very well. Fortunately... low-complexity sequence (like ACACACAC...ACAC) is not very important for most purposes.

Have you tried ALE? If it doesn't work I can write something that will plot the pairing distance as a function of position, but ALE should already do this extremely well.
Brian Bushnell is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 02:37 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO