Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Coverage calculator and visualizer

    Hi users,

    I am trying to get an idea of the coverage of my Illumina GA2 experiment on a 3Mb fragment, and I thought of counting reads aligned at each position of my reference sequence and then plotting to get a visual graph. Does this actually make any sense? Is there any tool that does this already or do I have to get my poor scripting abilities in use here? For example maqview shows alignment but it is difficult to see coverage and actually it is difficult to see the coverage for the whole fragment. And I usually see "reduced" figures i narticles showing general coverage.

    I have read of people setting windows of a certain width for calculating coverage but I think counting at each position might be more informative

    Any hints or pointers would be great!

    Cheers,

    Dave

  • #2
    Hi Dave,
    what I did for our bacterial genomes is similar to your suggestion. Just counted the number of contrib sequences to each base in the consensus. I then read this list into Artemis for viewing as a user plot. Works great (at least for our bacterial genomes. Might become slow on larger genomes). If you pm me I can share the very simple script I used to extract it from a 454 read assembly.
    Alex
    Attached Files
    Last edited by AlexB; 10-01-2009, 06:17 AM.

    Comment


    • #3
      Originally posted by dnusol View Post
      Hi users,
      I am trying to get an idea of the coverage of my Illumina GA2 experiment on a 3Mb fragment, and I thought of counting reads aligned at each position of my reference sequence and then plotting to get a visual graph.
      Yes, we construct a .userplot file (text file with one number per line) to overlay our genome using Artemis software: http://www.sanger.ac.uk/Software/Artemis/

      The .userplot files themselves are generated using the Nesoni package we have developed: http://www.bioinformatics.net.au/software.nesoni.shtml (main author is my colleague Paul Harrison).

      Comment


      • #4
        I agree with the other posters about Artemis. You get all the features, it s zoomable and very informative.

        I did find some limitations with the input of coverage.
        In my case it was just a single number of the coverage at each bp of the reference sequence. This worked perfectly for small genomes, up to about 2 million bp. However Artemis wouldn t read in the 6m coverage I needed for my 6m bp genome, as a parse error occurred.
        Increasing memory to 1800mb didn t help ..

        Does anyone have any tips on how to manage memory ??
        Alternatively, can the user input file be hacked to contain position information
        so as only to show positions of interest, as opposed to values for the whole genome ?

        We have now moved onto JBrowse for mapping Illumina reads.

        However in the past I have used R scripts as well. Simple histograms work very well, albeit without annotation. There are some packages to read features in but I haven t done this yet.

        Colin

        Comment


        • #5
          I have managed to load 5Mb of sequence on a 8Gb RAM linux machine and it didn´t seem to complain. Another question is differences in coverage between different positions. I have places where I have more than 1000 reads aligning and places with no alignment, so the plot created is not very informative. I also tried log-transforming data, and that seems to aleviate differences.

          Regarding memory usage, you can increase memory available for Artemis. Just check the FAQs

          Colin, do you think R could work for plotting coverage for a 5Mb sequence? Yoou probably have to create bins of a certain length, in order to plot a histogram of that size. How would you select bin size? Do you have any suggestions? I would also like to try that. I am thinking of results presentations and how to show in a slide a summary of coverage of my 5Mb experiment

          Dave

          Comment


          • #6
            For the tasks people are talking about here it might be worth looking at SeqMonk. I didn't suggest this before as it's been more focussed on complete eukaryotic genome analysis, and the original question asked about a short(ish) fragment, but you can certainly work with bacterial genomes and they're pretty easy to set up if you have an annotated sequence file.

            The program will cope with ~50million reads on a basic desktop, and if you have a machine with plenty of RAM you should be OK going to well over 100million reads. It has facilities for quantitating data in a number of ways, but can certainly cope with windowed counts (raw or log transformed) over any size and interval of window. It can also produce histograms, boxwhisker plots and all other manner of eye-candy.

            If you're currently using Artemis then it probably wouldn't be too hard to switch over to SeqMonk, and get something with much of the same functionality with regards to zooming and sequence annotation, but with more specialised functionality when it comes to adding in next gen sequencing data.

            Comment


            • #7
              Dave : R is very easy for this. I'd recommend it for a first look and nice genome wide histograms, but as I say not for more detailed analysis.

              If you're interested in an example drop me a mail at
              colindaven (AT) hotmail (dot) com

              and I'll send you an example script.

              R takes care of setting bin sizes etc by itself or you can change things very easily with the nclass parameter in hist(). Its best just to try different values yourself, its very simple.

              Simon - thanks, I ll have a look at Seqmonk.

              Colin

              Comment


              • #8
                I just thought that I would mention that in the MOSAIK suite, there is a program that not only creates a coverage data file that you can use anywhere but it will also use gnuplot and automatically create the graphs for you and save them as pdf files.

                The program, MosaikCoverage, is insanely quick too.

                Cheers,

                // Michael

                Comment


                • #9
                  Hi all, I tried Artemis and R to do some plots of my data. I will give MOSAIK a go also. Meanwhile I found a weird point of high coverage in the dataset I was given I wish you can help me interpret. This is an Illumina single-read resequencing experiment using sequence capture from agilent for a 5Mb region. The total lenght captured is about 2.5Mb. I find a fragment with huge coverage in all the samples run, but I am not sure how to interpret it. At the beginnig I thought of it as an artifact of sequence capture, but the position is rather wide (20kb) with coverage maximum ranging from x15k to x35k coverage depending on the sample. I have checked for known CNVs at this point in databases but have not found any that match the position. Can anyone suggest any reasoning for this peak? I enclose the plot for one of the samples. Note the median coverage is about x70 for all the sequence, but here we can see very hig coverage. (X axis is position, Y is coverage)

                  Thanks for your help
                  Attached Files

                  Comment


                  • #10
                    I suspect it's a segmental duplication or a region full of Alus or LINE elements. Can you give the actual coordinates? This would help resolve the issue.

                    Comment


                    • #11
                      That is a very interesting suggestion. Since the experiment was designed as to not capture repetitive regions, ideally there shouldn´t be any ALUs or LINES/SINES, thus my initial idea of an error in the design of the capture baits. The coordinate of the highest peak is 128,080,364 in chr7. It would be great if you could check that position.

                      Comment


                      • #12
                        Originally posted by dnusol View Post
                        That is a very interesting suggestion. Since the experiment was designed as to not capture repetitive regions, ideally there shouldn´t be any ALUs or LINES/SINES, thus my initial idea of an error in the design of the capture baits. The coordinate of the highest peak is 128,080,364 in chr7. It would be great if you could check that position.
                        Hi,
                        Assuming you are referring to human and genome build hg18, it is clear from looking at the browser that your spike in coverage is caused by segmental duplications. In other words, you region has many other highly-similar regions in the genome. Thus, any probes designed to capture DNA in this region will also capture DNA from all of the other homologous regions in the genome. Consequently, you observe a massive spike in coverage. If unfamiliar, segmental duplications (also known as low-copy repeats) are typically defined as multi-copy regions (other than transposons) that are at least 1kb in size and have at least 90% sequence identity to one or more other sequences in the genome.

                        Attached is a browser shot from the UCSC browser that shows the coordinates of the other homologous regions in the genome (gray and yellow blocks, color indicates similarity).
                        Attached Files

                        Comment


                        • #13
                          Dear Quinlana,

                          you are totally right! I hadn´t checked that. That is a problem indeed.

                          Thanks for your help

                          Dave

                          Comment


                          • #14
                            Artemis RNA seq analysis

                            Hello

                            I have to analyze my RNA seq data by using artemis software. I have index file of my reference bacterial genome and BAM files of my tested samples. I need help how to interpret it. I opened the files and for one gene the coverage is not uniform. What does it mean? What is on Y- axis on coverage plot.?? I have send one image as an example. Can any one guide ma in detail....
                            Attached Files

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Strategies for Sequencing Challenging Samples
                              by seqadmin


                              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                              03-22-2024, 06:39 AM
                            • seqadmin
                              Techniques and Challenges in Conservation Genomics
                              by seqadmin



                              The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                              Avian Conservation
                              Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                              03-08-2024, 10:41 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, Yesterday, 06:37 PM
                            0 responses
                            12 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, Yesterday, 06:07 PM
                            0 responses
                            10 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-22-2024, 10:03 AM
                            0 responses
                            51 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-21-2024, 07:32 AM
                            0 responses
                            68 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X