SEQanswers

#41 | gringer (David Eccles) | Wellington, New Zealand | 11-24-2011, 03:36 AM
You're going to get a huge variation in expression for all transcripts, such that calculating a 'global deviation' doesn't really make sense. If you do want to do it globally, I'd recommend log-transforming the coverage values before working out mean coverage and deviation, because that transformation seems to better fit a normal curve.

On a per-gene (or per-isoform) basis, I've been calculating coverage using CV (i.e. SD / mean) as an indication of variation. With about 20 genes that I did this for, the CV ranged from about 40% to 400%, but anything above about 140% appeared to be a mis-mapped transcript.

It might be useful to only do this for expressed regions (e.g. coverage > 15) to attempt to control for these bad mappings.
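For what it's worth, the per-gene CV calculation described above can be sketched in a few lines of Python; the depth values and the >15 "expressed" threshold here are purely illustrative:

```python
import statistics

def coverage_cv(depths, min_depth=15):
    """Coefficient of variation (SD / mean) of per-base coverage,
    restricted to 'expressed' positions (depth above min_depth)."""
    expressed = [d for d in depths if d > min_depth]
    if len(expressed) < 2:
        return None  # not enough expressed bases to estimate variation
    mean = statistics.mean(expressed)
    sd = statistics.stdev(expressed)
    return sd / mean

# Fairly even coverage gives a low CV; spiky coverage gives a high one,
# which (per the post above) may indicate a mis-mapped transcript.
even = [20, 22, 19, 21, 20, 23]
spiky = [16, 400, 18, 350, 17, 20]
print(f"even CV: {coverage_cv(even):.0%}")
print(f"spiky CV: {coverage_cv(spiky):.0%}")
```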

#42 | qqtwee | Beijing | 11-24-2011, 10:04 PM
cufflinks Error

Hello, all.
When I run Cufflinks without a GTF file on bacterial RNA-seq data, the result file contains nothing but the header line:

tracking_id class_code nearest_ref_id gene_id gene_short_name tss_id locus length coverage FPKM FPKM_conf_lo FPKM_conf_hi FPKM_status

There are no value rows at all. I don't know why; can anyone help me?
Thanks, best wishes.
Attached image: HDCM7G]UUWNN4NFQU4VKGGX.jpg (18.6 KB)
#43 | SEQond | Italy | 12-13-2011, 04:49 AM

Quote: Originally posted by fhb
Hi Dr. Quinlan,

Thanks for the bedtools. I am using the genomeCoverageBed, but I would like to use the coverageBed.
I am not sure if you have posted this somewhere else, but I would appreciate if you could provide this code that you offered that creates a BED file with intervals of a given size.
Thanks in advance,
Best
Fernando
If possible I would like that too.

Thanks
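The original script from Dr. Quinlan isn't reproduced in the thread, but generating a BED file of fixed-size intervals is straightforward; here is a minimal Python sketch (the chromosome names and sizes are made up). Note that current BEDtools releases also ship a `makewindows` command (e.g. `bedtools makewindows -g mm8.fa.fai -w 1000`) that does this directly:

```python
def make_windows(chrom_sizes, window):
    """Tile each chromosome with non-overlapping fixed-size windows,
    yielding (chrom, start, end) in BED half-open coordinates."""
    for chrom, size in chrom_sizes.items():
        for start in range(0, size, window):
            yield chrom, start, min(start + window, size)

# Toy genome: write 1 kb windows to a BED file
sizes = {"chr1": 2500, "chr2": 1200}
with open("windows.bed", "w") as out:
    for chrom, start, end in make_windows(sizes, 1000):
        out.write(f"{chrom}\t{start}\t{end}\n")
```

The resulting BED file can then be given to coverageBed along with your reads to get per-window coverage.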
#44 | SEQond | Italy | 12-14-2011, 03:03 AM

By using BEDtools and the UCSC Genome Browser I am trying to get a picture like this (a histogram of the bedGraph, in wiggle form).

So far I have run genomeCoverageBed on a SORTED BAM with the -bg and -ibam options, using -g mm8.fa.fai as the index.

The resulting bedGraph was uploaded to the UCSC Genome Browser and produced a "stripe" track with the numbers 1, 2, 3 and so on before each stripe (where do I find the meaning of these numbers?).

Then I added a track line at the top of the bedGraph file:
track type=bedGraph name="set11bamSorted" description="BedGraph format" visibility=full color=200,100,0 altColor=0,100,200 priority=20

and uploaded the new file. This time I am getting an error:
"Error File 'set11_fixed.bedgraph' - Error line 1224705 of custom track: chromEnd larger than chrom chr13 size (114176901 > 114142980)"

Why should I get this error this time, when the contents of the bedGraph file are exactly the same as before? Is this the correct way to get a wiggle histogram like the one in the first picture?

Below is the command I used to generate the original bedGraph file:

genomeCoverageBed -bg -ibam ~/set11/tophatOut11/set11TopHataccepted_hits.Sorted.bam -g /opt/bowtie/indexes/mm8.fa.fai > ~/set11/set11.bedgraph &
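A chromEnd error like that usually means the chromosome sizes in the .fai used to build the bedGraph disagree with the assembly selected in the browser on upload (for instance, an mm8-derived track uploaded against a different mouse assembly). A quick way to locate such lines before uploading, sketched in Python (file names and sizes here are toy values, not from the post):

```python
def oversized_lines(bedgraph_path, chrom_sizes):
    """Find bedGraph lines whose chromEnd exceeds the chromosome size,
    which usually means the track was built against a different
    assembly than the one selected in the browser."""
    bad = []
    with open(bedgraph_path) as fh:
        for lineno, line in enumerate(fh, 1):
            if line.startswith(("track", "#")) or not line.strip():
                continue  # skip the track line, comments, and blanks
            chrom, start, end = line.split()[:3]
            size = chrom_sizes.get(chrom)
            if size is not None and int(end) > size:
                bad.append((lineno, chrom, int(end), size))
    return bad

# Toy example with one deliberately oversized interval
with open("demo.bedgraph", "w") as out:
    out.write("track type=bedGraph name=demo\n")
    out.write("chr13\t100\t200\t3\n")
    out.write("chr13\t114176801\t114176901\t1\n")
print(oversized_lines("demo.bedgraph", {"chr13": 114142980}))
```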
#45 | mrood | Madison, WI | 06-04-2013, 09:45 AM

I realize this is quite an old thread, but I am currently trying to calculate the per-bp coverage of my NGS data. I am new to NGS and bioinformatics, so my apologies if this does not make sense...

For quality control I would like to determine sensible minimum and maximum coverage cutoffs for excluding regions from my downstream analyses; however, I cannot find a "gold standard" for this. It seems to me that, when mapping to a reference genome, a region with more than twice the average coverage is probably the result of a duplication (or something similar) in the genome of the sequenced organism. Likewise, a region with very poor coverage is probably absent from that organism's genome, with the few reads there the result of improper mapping. I would like to exclude such regions before performing genome-wide population genetic analyses. Does anyone have suggestions on what cutoffs to use and how to apply them?

Thanks in advance!
#46 | gringer (David Eccles) | Wellington, New Zealand | 06-04-2013, 02:46 PM

If duplications / deletions are rare enough, then median coverage should be fine. A median statistic will typically deal with the spikes and troughs that are an issue for using mean as a descriptive statistic.
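A median-based cutoff scheme can be sketched in Python as follows; the 0.5x/2x multipliers are an illustrative choice, not a standard:

```python
import statistics

def median_cutoffs(depths, low_mult=0.5, high_mult=2.0):
    """Derive min/max coverage cutoffs from the median depth.
    Positions outside [low_mult * median, high_mult * median]
    would be excluded from downstream analyses."""
    med = statistics.median(depths)
    return low_mult * med, high_mult * med

# Toy per-base depths with one spike (250) and one trough (2):
# the median ignores both, unlike the mean.
depths = [28, 30, 31, 29, 250, 30, 2, 31]
lo, hi = median_cutoffs(depths)
kept = [d for d in depths if lo <= d <= hi]
print(lo, hi, kept)
```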
#47 | swbarnes2 | San Diego | 06-04-2013, 06:25 PM

Quote: Originally posted by mrood
It seems to me that if you are mapping to a reference genome and there are regions that have more than twice the average coverage that it is probably the result of a duplication or something in the genome of the sequenced organism.
Natural variation in sequencing coverage can easily produce 2-fold differences in coverage, or more.

Quote:
Likewise, if it has very poor coverage the organism likely does not have that region in its genome and it is likely the result of improper mapping.
Or, the region is there, but so divergent from your reference that reads are mapping poorly, or the region could be GC rich or something, causing few reads to be generated there.
#48 | recombinationhotspot | London | 08-29-2013, 05:35 AM

I am trying to calculate the average coverage for a given region (e.g. 200 bp) where my reads are aligned. Is there any software that can do this without my having to write any commands? Please note that I have no bioinformatics background and no access to a Linux or similar operating system. The best solution I have found so far is to use the Savant genome browser and convert the .bam files into .bam.cov.tdf files, which show me the maximum coverage.
#49 | gringer (David Eccles) | Wellington, New Zealand | 08-29-2013, 05:52 PM

Quote:
Is there any software that can do that without actually having to write any commands
Er, you want a program to run that means you don't have to run a program? That's a difficult request.

I suppose you could try using Galaxy, which hides all that pesky "running commands" stuff from you. It has a feature-coverage tool, though it requires input files to be in BED format, and presumably there are other tools closer to what you want. From this email:

Quote:
To calculate coverage, please see the tool "Regional Variation ->
Feature coverage". Query and target must both be in Interval/BED format.
Query data in Interval/BED format is possible in most of the dataflow
paths through the tools and from external sources. The reference genome
file will likely need to be imported and formatted.
#50 | rathankar | Chennai, Bangalore | 09-08-2014, 02:28 AM
calculating coverage depth

Quote: Originally posted by westerman
From my understanding, yes, they are different, and what you are calculating is the 'X' coverage. I.e., given the number of raw bases sequenced, how many times (or X) does the sequencing potentially cover the genome.

% coverage is how well the genome is actually covered after all mapping and assembly is done.

As an example, let's say we have 300M reads of 50 bases, or 1.5 Gbase total. Our genome is 150M bases. After mapping (or assembly) we have a bunch of non-overlapping contigs totalling 100M bases.

So our 'X coverage' is 10X (1.5 Gbases / 150 Mbases).
Our '% coverage' is 66.6% (100 Mbases / 150 Mbases).

One way to think about this is that percentages generally range from 0% to 100%, so a percentage greater than 100 can be confusing.

I use the haploid genome size, or more specifically the C-value times 965 Mbases/pg.
Hi,

I went through this post and understood how coverage depth is expressed, but I need a small clarification.

1. Does this coverage depth account for mutations in the reads (I mean positions that do not match the reference sequence), given that it only uses the number of bases in the sample and the number of bases in the reference sequence?

2. If a read matches at more than one location, won't the coverage depth be inflated? Is there a way to reduce that error?
__________________
Sr. Application Scientist, Apsara Innovations, Bangalore
E-Mail: rathankar@gmail.com
#51 | westerman (Rick Westerman) | Purdue University, Indiana, USA | 09-08-2014, 07:52 AM

1. SNPs/indels are usually not a big part of the genome; I doubt they would throw the calculations off by even a percent.

2. Count the read only once; i.e., choose the best match, or if there are multiple equally good matches, pick one of them at random.

Really, unless you are working with a well-characterized organism (e.g., human), the numbers are going to be 'squishy' in any case. They are mainly there to give you an idea of how good your sequencing is. In other words, if you calculate that you had 50x coverage (which is a nice de-novo assembly target) but only get 10% coverage against a closely related organism, that tells you something.
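Point 2 is what aligners mark with the SAM 'secondary' (0x100) and 'supplementary' (0x800) flag bits; dropping those records before computing depth counts each read once (this is what `samtools view -F 0x900` does). A small Python sketch of the flag test, using made-up flag values for illustration:

```python
SECONDARY = 0x100      # secondary alignment: the read also maps elsewhere
SUPPLEMENTARY = 0x800  # supplementary (chimeric) alignment record

def is_primary(flag):
    """True if this SAM record is the single primary alignment
    for its read, so depth calculations count the read only once."""
    return not flag & (SECONDARY | SUPPLEMENTARY)

# Toy SAM flags: 0 = forward primary, 16 = reverse primary,
# 256 = secondary, 2048 = supplementary
flags = [0, 16, 256, 2048, 16]
print([f for f in flags if is_primary(f)])
```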
#52 | criscruz | Peru | 08-19-2015, 07:35 AM

Hi everyone;

I have just read through this thread, but I still have a doubt in mind.

I'm working with an Ion PGM to generate whole-genome sequences of some RNA viruses. I then want to build a phylogenetic tree from the consensus sequence of each virus I can identify. So the question is: how much coverage (reads per base) do I need in order to make a consensus sequence good enough for my phylogenetic analysis?
I don't want to look at variants or quasispecies, so I just need the minimum necessary.

thanks for your time
my best

Cris
#53 | westerman (Rick Westerman) | Purdue University, Indiana, USA | 08-19-2015, 07:39 AM

50-100x coverage per genome is good. Much more than that and you will start getting misassemblies.

For viruses I suggest using Mira. It is a good small-genome assembler that can handle a lot of potential misassemblies.
#54 | criscruz | Peru | 08-20-2015, 07:23 AM

thanks westerman