Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    I would like to suggest the ShortRead and IRange packages in R/Bioconductor. If you have enough memory to load a lane in memory 8GB should be enough, they provide excellent functions to compute per base coverage and many other things.

    for example for a solexa export file:
    Code:
    require(ShortRead)
    aln<-readAligned("/path/to/file","filename_export.txt",type="SolexaExport")
    # Filtering of reads e.g.:
    aln <- aln[alignData(aln)$filtering=="Y" & !is.na(strand(aln)) ]
    #Remove duplicated reads
    aln<-aln[!srduplicated(aln)]
    #Coverage
    cvg<-coverage(aln,extend=60L) #in this case reads are extended 60 bp 3'
    One can then use the package rtracklayer to export it as a wig
    Code:
    require(rtracklayer)
    export.ucsc(as(cvg,"RangedData")),"test.wig",subformat="wig")
    you might need to change the chromosome names afterwards if your original names already contained chr.

    Comment


    • #17
      Originally posted by westerman View Post
      From my understanding yes they are different and what you are calculating is the 'X' coverage. I.e., given the number of raw bases sequenced how many times (or X) does the sequencing potentially cover the genome.

      % coverage is how well the genome is actually covered after all mapping and assembly is done.

      As an example let's say we have 300M reads of 50 bases or 1.5 Gbase total. Our genome is 150M bases. After mapping (or assembly) we have a bunch of non-overlapping contigs that have 100M bases total.

      So our 'X coverage' is 10X (1.5 Gbases / 150 Mbases)
      Our '% coverage' is 66.6% (100 Mbases / 150 Mbases)


      One way to think about this is that percentages generally range from 0% to 100% and so having a percentage greater that 100 can be confusing.


      I use the haploid genome size or more specifically the C-value times 965Mbases/pg.
      Maybe I misunderstood what coverage is about, but...

      Using coverage to determine how many times you COULD cover a total genome is not really usefull, is it? (Unless you want to brag during coffee breaks, of course) You would like to know how many times the sequenced part of your genome actually IS covered, right? Giving you an idea of how many reads confirm a found base at a position. And coverage gives you an estimate of just that; how many reads do you have covering the parts you actually sequenced.

      Then shouldn't the result be 15X coverage? Because actual covering of the genome was 66.6% which contains 100Mb. So coverage would be 1.5Gb/100Mb which makes it 15X?
      So on average your sequenced part of the genome has 15X coverage, while in theory you could cover the entire genome 10X.

      Adding to that, I think that using the average coverage is not really adequate in describing the trustworthiness of your data. Using coverage distribution seems to make more sense to me. What if you have 50% covered only 5 times, and 50% covered 55 times? Still gives you 30X coverage, but it's not a good representation.

      Sorry for being such a pain...
      Cheers, dePhi

      Comment


      • #18
        Hi, there,

        I have a related question about coverage calculation for RNASeq data. The average exon coverage may be computed as the sum of # reads mapped to each exonic position divided by exon length. However, as dePhi said, "using the average coverage is not really adequate in describing the trustworthiness of your data." Drawing a distribution graph along the transcript length is helpful. It is easy to get the distribution graph for an individual transcript. But to get a summary graph for all transcripts, coverage values have to be scaled to transcript length and summed over all transcripts. I'm having trouble to do that. Could someone help me on that, probably some simple R code would do it?

        thanks a lot

        Xiang

        Comment


        • #19
          I wrote an app called "ReadCoverage" which I think does exactly what you want for defined regions (e.g. exons, repeatmasked regions). Also generates track data. It's part of http://useq.sourceforge.net . Here's the command line menu. There's also a GUI if you prefer that approach.

          **************************************************************************************
          ** Read Coverage: June 2009 **
          **************************************************************************************
          Generates read coverage stair-step xxx.bar graph files for visualization in IGB. Will
          also calculate per base coverage stats for a given file of interrogated regions.

          Options:
          -p Point Data directories, full path, comma delimited. Should contain chromosome
          specific xxx.bar.zip or xxx_-_.bar files.
          -s Save directory, full path.
          -i (Optional) Full path file name for a tab delimited bed file (chr start stop ...)
          containing interrogated regions to use in calculating a per base coverage
          statistics. Interbase coordinates assumed.
          -b Just calculate stats, skip coverage graph generation.

          Example: java -Xmx1500M -jar pathTo/USeq/Apps/ReadCoverage -p
          /Data/Ets1Rep1/,/Data/Ets1Rep2/ -s /Data/MergedHitTrckEts1 -i
          /CapSeqDesign/interrogatedExonsChrX.bed

          **************************************************************************************

          Comment


          • #20
            Thanks. Could you please let me know how you scale the base coverage for transcripts with different length?

            Comment


            • #21
              It calculates a per base read coverage over the regions you provide. There is no scaling. It also generates a histogram of the fraction of interrogated bases with 1,2,3,4,.... or more reads. Good for assessing what % of your array capture and sequencing hit 10 or more reads. It also gives an estimate of what your coverage would be with additional sequencing.

              Comment


              • #22
                Does readAligned support Paired End export.txt files?

                Hello,
                I figured I'd ask, since this might be faster than me figuring it out - does readAligned from the ShortRead package allow data from PE export files to be plotted together?

                Thanks a lot!


                Originally posted by arendon View Post
                I would like to suggest the ShortRead and IRange packages in R/Bioconductor. If you have enough memory to load a lane in memory 8GB should be enough, they provide excellent functions to compute per base coverage and many other things.

                for example for a solexa export file:
                Code:
                require(ShortRead)
                aln<-readAligned("/path/to/file","filename_export.txt",type="SolexaExport")
                # Filtering of reads e.g.:
                aln <- aln[alignData(aln)$filtering=="Y" & !is.na(strand(aln)) ]
                #Remove duplicated reads
                aln<-aln[!srduplicated(aln)]
                #Coverage
                cvg<-coverage(aln,extend=60L) #in this case reads are extended 60 bp 3'
                One can then use the package rtracklayer to export it as a wig
                Code:
                require(rtracklayer)
                export.ucsc(as(cvg,"RangedData")),"test.wig",subformat="wig")
                you might need to change the chromosome names afterwards if your original names already contained chr.

                Comment


                • #23
                  Originally posted by arendon View Post
                  I would like to suggest the ShortRead and IRange packages in R/Bioconductor. If you have enough memory to load a lane in memory 8GB should be enough, they provide excellent functions to compute per base coverage and many other things.
                  And a question related to the one above -- does readAligned also support (or plans to support) the SAM format?

                  Comment


                  • #24
                    Hey ohofmann,

                    At the moment readAligned does not support SAM format. The developers are planning to include it for the next release. Take a look at https://stat.ethz.ch/pipermail/bioc-...ly/000474.html

                    Also, it is very worthwhile to have a look at http://htseq.ucr.edu They have a nice tutorial on all things R and seq.

                    Comment


                    • #25
                      Somehow missed Martin's post, strange. Thanks for the pointer! And yes, the UCR pages are a great starting point.

                      -- Oliver

                      Comment


                      • #26
                        Originally posted by westerman View Post
                        That is what I get for posting early Monday morning.

                        'X' coverage is raw bases divided by genome size. So the above should be 1.5 Gbases / 150 Mbases or 10X. I will correct my original post.



                        I believe that all published C-values for for haploid genomes. Although if someone knows for certain then please chime in.
                        thank you for your answer

                        I'm new in this field but I was searching about x covarage to calculate it for my reads and found this qoute. I would like to ask you that in your example 300M* 50 is 15G, so what is the point for this caculation?

                        Comment


                        • #27
                          Using Tablet, a genome viewer program, you can export a file in which the number of reads at each base is printed out. However, the base number is not printed out, just the number of reads.

                          Is there a graphing program that could take this txt file with the number of reads at each base and graph it as a line graph ? Or, if I added the appropriate base number along with the number of reads at each base, then is there a graphics program which could print out a line graph from this txt file ?

                          Comment


                          • #28
                            Hi all,
                            I would like to know the coverage of a 454 tissue specific transcriptome of a non-model specie, so I have not a reference sequence, neither EST nor genome sequence. I understood coverage as number of reads covering a given region (as dePhi said a year ago), that is, reads / bp. If I have a mean of 6 reads per contig and a mean of 209 bp per read, it is correct to say that I have a ~0.03X (6/209) coverage??
                            Thanks a lot!

                            Comment


                            • #29
                              jordi:

                              6 reads per contig with an average read length of 209 will tell you only that you have 1254 bases of data in a contig. What you really need is know is how many base *should* be covered. For example if your contig is suppose to be 1000 bases then your coverage is 1254/1000 or about 1.25 coverage. If your contig is suppose to be 300 bases then you have 1254/300 or about 4x coverage.

                              Unfortunately it appears that you have no idea of how many bases you are suppose to be covering. So you will have to take a wild guess. Fortunately inter-species transcriptomes tend to be a bit more uniform in size than inter-species genomes. I suggest you take the transcriptome size of a related species and use that. It won't be exact but it probably will not be off by more than a factor of 2.

                              To be clear what you want to calculate is:

                              transcriptome_size divided by (#_of_reads times avg_length_of_reads)

                              Comment


                              • #30
                                Originally posted by jamal View Post

                                I'm new in this field but I was searching about x covarage to calculate it for my reads and found this qoute. I would like to ask you that in your example 300M* 50 is 15G, so what is the point for this caculation?
                                And finally catching up with Jamal's question. Hum. It looks like I made another Monday morning calculation mistake. 300M reads times 50 would be indeed be 15 Gbases or 15000 Mbases. Then the 'X' coverage in my example is 15000/150 or 100x coverage.

                                Oh well, numbers being wrong or not, my original post was about 'X coverage' (where you divide how many bases you sequenced by the size of the genome/transcriptome/whatever you wish to sequence) versus 'percentage coverage' (which is how many bases you end up after mapping/assembly by the size of the genome/transcriptome/whatever).

                                This doesn't seem to be complex concept to me. Of course determining the size of what you wish to sequence can be complex. And even after you get the average coverage -- what does it really mean? The average does not help if your region of interest does not have good coverage in that region. Variance of coverage can be a helpful metric.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                25 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                28 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                24 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                52 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X