SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Cost calculator for NGS experiments - looking for beta testers FlexGen_Joris General 3 07-05-2012 06:06 AM
Velvet Assembler: expected coverage versus estimated coverage versus effective covera DMCH Bioinformatics 1 11-30-2011 04:21 AM
"nucleotide coverage" to genome feature coverage sheremey Bioinformatics 3 11-02-2010 11:24 AM
low 454 coverage combined with high solexa coverage strob Bioinformatics 7 10-07-2010 10:14 AM
coverage neoanderson 454 Pyrosequencing 3 04-05-2010 05:16 AM

Reply
 
Thread Tools
Old 10-01-2009, 03:23 AM   #1
dnusol
Senior Member
 
Location: Spain

Join Date: Jul 2009
Posts: 133
Question Coverage calculator and visualizer

Hi users,

I am trying to get an idea of the coverage of my Illumina GA2 experiment on a 3Mb fragment, and I thought of counting reads aligned at each position of my reference sequence and then plotting to get a visual graph. Does this actually make any sense? Is there any tool that does this already or do I have to get my poor scripting abilities in use here? For example maqview shows alignment but it is difficult to see coverage and actually it is difficult to see the coverage for the whole fragment. And I usually see "reduced" figures i narticles showing general coverage.

I have read of people setting windows of a certain width for calculating coverage but I think counting at each position might be more informative

Any hints or pointers would be great!

Cheers,

Dave
dnusol is offline   Reply With Quote
Old 10-01-2009, 06:03 AM   #2
AlexB
Member
 
Location: Netherlands

Join Date: Sep 2009
Posts: 18
Default

Hi Dave,
what I did for our bacterial genomes is similar to your suggestion. Just counted the number of contrib sequences to each base in the consensus. I then read this list into Artemis for viewing as a user plot. Works great (at least for our bacterial genomes. Might become slow on larger genomes). If you pm me I can share the very simple script I used to extract it from a 454 read assembly.
Alex
Attached Images
File Type: jpg Coverage_artemis1.jpg (16.5 KB, 179 views)

Last edited by AlexB; 10-01-2009 at 06:17 AM.
AlexB is offline   Reply With Quote
Old 10-04-2009, 06:20 PM   #3
Torst
Senior Member
 
Location: The University of Melbourne, AUSTRALIA

Join Date: Apr 2008
Posts: 275
Default

Quote:
Originally Posted by dnusol View Post
Hi users,
I am trying to get an idea of the coverage of my Illumina GA2 experiment on a 3Mb fragment, and I thought of counting reads aligned at each position of my reference sequence and then plotting to get a visual graph.
Yes, we construct a .userplot file (text file with one number per line) to overlay our genome using Artemis software: http://www.sanger.ac.uk/Software/Artemis/

The .userplot files themselves are generated using the Nesoni package we have developed: http://www.bioinformatics.net.au/software.nesoni.shtml (main author is my colleague Paul Harrison).
Torst is offline   Reply With Quote
Old 10-14-2009, 06:10 AM   #4
colindaven
Senior Member
 
Location: Germany

Join Date: Oct 2008
Posts: 415
Default

I agree with the other posters about Artemis. You get all the features, it s zoomable and very informative.

I did find some limitations with the input of coverage.
In my case it was just a single number of the coverage at each bp of the reference sequence. This worked perfectly for small genomes, up to about 2 million bp. However Artemis wouldn t read in the 6m coverage I needed for my 6m bp genome, as a parse error occurred.
Increasing memory to 1800mb didn t help ..

Does anyone have any tips on how to manage memory ??
Alternatively, can the user input file be hacked to contain position information
so as only to show positions of interest, as opposed to values for the whole genome ?

We have now moved onto JBrowse for mapping Illumina reads.

However in the past I have used R scripts as well. Simple histograms work very well, albeit without annotation. There are some packages to read features in but I haven t done this yet.

Colin
colindaven is offline   Reply With Quote
Old 10-14-2009, 11:12 PM   #5
dnusol
Senior Member
 
Location: Spain

Join Date: Jul 2009
Posts: 133
Default

I have managed to load 5Mb of sequence on a 8Gb RAM linux machine and it didnīt seem to complain. Another question is differences in coverage between different positions. I have places where I have more than 1000 reads aligning and places with no alignment, so the plot created is not very informative. I also tried log-transforming data, and that seems to aleviate differences.

Regarding memory usage, you can increase memory available for Artemis. Just check the FAQs

Colin, do you think R could work for plotting coverage for a 5Mb sequence? Yoou probably have to create bins of a certain length, in order to plot a histogram of that size. How would you select bin size? Do you have any suggestions? I would also like to try that. I am thinking of results presentations and how to show in a slide a summary of coverage of my 5Mb experiment

Dave
dnusol is offline   Reply With Quote
Old 10-14-2009, 11:47 PM   #6
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

For the tasks people are talking about here it might be worth looking at SeqMonk. I didn't suggest this before as it's been more focussed on complete eukaryotic genome analysis, and the original question asked about a short(ish) fragment, but you can certainly work with bacterial genomes and they're pretty easy to set up if you have an annotated sequence file.

The program will cope with ~50million reads on a basic desktop, and if you have a machine with plenty of RAM you should be OK going to well over 100million reads. It has facilities for quantitating data in a number of ways, but can certainly cope with windowed counts (raw or log transformed) over any size and interval of window. It can also produce histograms, boxwhisker plots and all other manner of eye-candy.

If you're currently using Artemis then it probably wouldn't be too hard to switch over to SeqMonk, and get something with much of the same functionality with regards to zooming and sequence annotation, but with more specialised functionality when it comes to adding in next gen sequencing data.
simonandrews is offline   Reply With Quote
Old 10-15-2009, 04:42 AM   #7
colindaven
Senior Member
 
Location: Germany

Join Date: Oct 2008
Posts: 415
Default

Dave : R is very easy for this. I'd recommend it for a first look and nice genome wide histograms, but as I say not for more detailed analysis.

If you're interested in an example drop me a mail at
colindaven (AT) hotmail (dot) com

and I'll send you an example script.

R takes care of setting bin sizes etc by itself or you can change things very easily with the nclass parameter in hist(). Its best just to try different values yourself, its very simple.

Simon - thanks, I ll have a look at Seqmonk.

Colin
colindaven is offline   Reply With Quote
Old 10-15-2009, 05:04 AM   #8
snownebula
Junior Member
 
Location: Boston, MA

Join Date: Oct 2009
Posts: 9
Default

I just thought that I would mention that in the MOSAIK suite, there is a program that not only creates a coverage data file that you can use anywhere but it will also use gnuplot and automatically create the graphs for you and save them as pdf files.

The program, MosaikCoverage, is insanely quick too.

Cheers,

// Michael
snownebula is offline   Reply With Quote
Old 10-26-2009, 08:45 AM   #9
dnusol
Senior Member
 
Location: Spain

Join Date: Jul 2009
Posts: 133
Default

Hi all, I tried Artemis and R to do some plots of my data. I will give MOSAIK a go also. Meanwhile I found a weird point of high coverage in the dataset I was given I wish you can help me interpret. This is an Illumina single-read resequencing experiment using sequence capture from agilent for a 5Mb region. The total lenght captured is about 2.5Mb. I find a fragment with huge coverage in all the samples run, but I am not sure how to interpret it. At the beginnig I thought of it as an artifact of sequence capture, but the position is rather wide (20kb) with coverage maximum ranging from x15k to x35k coverage depending on the sample. I have checked for known CNVs at this point in databases but have not found any that match the position. Can anyone suggest any reasoning for this peak? I enclose the plot for one of the samples. Note the median coverage is about x70 for all the sequence, but here we can see very hig coverage. (X axis is position, Y is coverage)

Thanks for your help
Attached Images
File Type: jpg Peak.jpg (14.0 KB, 89 views)
dnusol is offline   Reply With Quote
Old 10-26-2009, 09:07 AM   #10
quinlana
Senior Member
 
Location: Charlottesville

Join Date: Sep 2008
Posts: 119
Default

I suspect it's a segmental duplication or a region full of Alus or LINE elements. Can you give the actual coordinates? This would help resolve the issue.
quinlana is offline   Reply With Quote
Old 10-27-2009, 12:42 AM   #11
dnusol
Senior Member
 
Location: Spain

Join Date: Jul 2009
Posts: 133
Default

That is a very interesting suggestion. Since the experiment was designed as to not capture repetitive regions, ideally there shouldnīt be any ALUs or LINES/SINES, thus my initial idea of an error in the design of the capture baits. The coordinate of the highest peak is 128,080,364 in chr7. It would be great if you could check that position.
dnusol is offline   Reply With Quote
Old 10-27-2009, 05:24 AM   #12
quinlana
Senior Member
 
Location: Charlottesville

Join Date: Sep 2008
Posts: 119
Default

Quote:
Originally Posted by dnusol View Post
That is a very interesting suggestion. Since the experiment was designed as to not capture repetitive regions, ideally there shouldnīt be any ALUs or LINES/SINES, thus my initial idea of an error in the design of the capture baits. The coordinate of the highest peak is 128,080,364 in chr7. It would be great if you could check that position.
Hi,
Assuming you are referring to human and genome build hg18, it is clear from looking at the browser that your spike in coverage is caused by segmental duplications. In other words, you region has many other highly-similar regions in the genome. Thus, any probes designed to capture DNA in this region will also capture DNA from all of the other homologous regions in the genome. Consequently, you observe a massive spike in coverage. If unfamiliar, segmental duplications (also known as low-copy repeats) are typically defined as multi-copy regions (other than transposons) that are at least 1kb in size and have at least 90% sequence identity to one or more other sequences in the genome.

Attached is a browser shot from the UCSC browser that shows the coordinates of the other homologous regions in the genome (gray and yellow blocks, color indicates similarity).
Attached Images
File Type: jpg chr7-128,069,114-128,091,613.jpg (19.6 KB, 79 views)
quinlana is offline   Reply With Quote
Old 10-27-2009, 07:10 AM   #13
dnusol
Senior Member
 
Location: Spain

Join Date: Jul 2009
Posts: 133
Default

Dear Quinlana,

you are totally right! I hadnīt checked that. That is a problem indeed.

Thanks for your help

Dave
dnusol is offline   Reply With Quote
Old 11-08-2013, 01:25 AM   #14
nageena
Junior Member
 
Location: germany

Join Date: Nov 2013
Posts: 1
Default Artemis RNA seq analysis

Hello

I have to analyze my RNA seq data by using artemis software. I have index file of my reference bacterial genome and BAM files of my tested samples. I need help how to interpret it. I opened the files and for one gene the coverage is not uniform. What does it mean? What is on Y- axis on coverage plot.?? I have send one image as an example. Can any one guide ma in detail....
Attached Images
File Type: png artemis.png (16.7 KB, 8 views)
nageena is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:50 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO