SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Newbler and read coverage aloliveira Bioinformatics 11 02-19-2014 06:49 AM
Average Read Coverage for 454 paired end read data lisa1102 Core Facilities 8 10-18-2011 08:40 AM
Will single-read be enough coverage? Turnerac0987 Bioinformatics 6 10-07-2011 05:58 AM
About the read depth of coverage El Mariachi Illumina/Solexa 2 12-30-2010 12:22 AM
1/2 read and coverage? Triticum 454 Pyrosequencing 3 09-08-2009 07:13 AM

Reply
 
Thread Tools
Old 06-04-2013, 03:02 PM   #1
migs54
Junior Member
 
Location: LA

Join Date: Nov 2012
Posts: 7
Default Read coverage for each gene

Hi,

I'm looking for a method to give the read coverage for each gene. Our current method for making an RPKM cutoff is to manually look at our bedgraphs on UCSC Genome Browser and visually assess coverage. <----- Very tedious and subject to a number of flaws.

Ideally I would like to get a value for each gene that indicates how many bases are covered. i.e. 100% of all bases covered or 0%. Is there already something like this in BEDtools or an R package?

I use tophat to map and Seqmonk to count reads then manually calculate RPKM.

Thanks

Miguel
migs54 is offline   Reply With Quote
Old 06-04-2013, 03:45 PM   #2
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838
Default

Bedtools genomecov should do this.

http://bedtools.readthedocs.org/en/l...genomecov.html

The default options produce the following histogram output / fields, which should be easily modifiable for your purposes:
Quote:
1. chromosome (or entire genome)
2. depth of coverage from features in input file
3. number of bases on chromosome (or genome) with depth equal to column 2.
4. size of chromosome (or entire genome) in base pairs
5. fraction of bases on chromosome (or entire genome) with depth equal to column 2.
If you set the maximum coverage to 1 (-max 1), then you can easily get the coverage fraction for all bases covered at any depth. If you set the max coverage to 20 (or something more suitable for your needs), then you get the covered fraction for well-covered bases.
gringer is offline   Reply With Quote
Old 06-04-2013, 04:59 PM   #3
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

I've used 'bedtools coverage' for this in the past but it's a little tricky if you're working with an alternatively spliced transcriptome. For example you can compare your alignments to a GTF with this command and for every feature you get 4 values:

Quote:
1) The number of features in A that overlapped the B interval.
2) The number of bases in B that had non-zero coverage.
3) The length of the entry in B.
4) The fraction of bases in B that had non-zero coverage.
Where A is your alignments file and B is the GTF file. The thing is you have to parse this output fetching all of the rows corresponding to 'exon' features from the GTF and then group them based on the 'transcript_id' GTF field (which is included in the output) and finally using the values they provide with each exon feature you can compute the total transcript coverage ratio. If you can write perl or python you can do this. Even if you get that far it's still a little deceptive because you don't now which isoform(s) are actually expressed so therein lies the real issue. It does, however, produce a number telling you how well covered the isoforms are.

There's another kinda wacky idea you can try. The above method isn't much different than looking at the coverage of alignments to a transcriptome (rather than a genome) allowing all possible alignments to be reported since the above case will take a single aligned read to the genome and count it towards all features it overlaps. Alternatively you may build an aligner index for the transcriptome reference, align with bowtie using the -a option, make a small bed file that describes each transcript feature in the transcriptome reference and then use 'bedtools coverage' to access the coverage of all features that the reads may align to. This is getting a little messy but for rough estimates it should be OK especially if you can group the output into genes by joining in some gene names and sorting on that column.

Here's kinda how it would work provided you have a genome fasta (genome.fa) file, a GTF gene annotation (genes.gtf) and some reads in a fastq file (reads.fq). You also need to have Tophat installed on your computer (for its gffread utility), bowtie1 and bedtools.

Code:
# make the transcriptome sequence file
gffread -w transcriptome.fa -g genome.fa genes.gtf

# index with samtools
samtools faidx transcriptome.fa

# build bowtie index for the transcriptome
bowtie-build transcriptome_ref transcriptome.fa

# align your reads with bowtie
bowtie -a --best -S transcriptome_ref reads.fq | samtools view -bS -o aligned.bam - 

# parse the fasta index to make a bed file describing the transcripts
cut -f1,2 transcriptome.fa.fai | perl -slane 'print join("\t", $F[0], 0, $F[1]);' > transcriptome_fa.bed

# use bedtools to compute coverage of features
bedtools coverage -abam aligned.bam -b transcriptome_fa.bed > feature_coverages.bed
as long as its understood that this process in no way whatsoever is producing correct counts at the isoform level but you can use this to get an idea of how well the isoforms are potentially covered by your reads. call me crazy but it does work.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 06-04-2013, 09:39 PM   #4
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

while I'm at it you could even incorporate the results of eXpress or RSEM into this transcript coverage pipeline since those seek to unambiguously assign each read. For eXpress you could use the same bowtie index built from my previous example and then run these commands:

Code:
# express needs alignments with all possible (and reasonable) 
# mapping locations
bowtie -aS -n 2 -e 999 transcriptome_ref reads.fq | samtools view -bS -o aligned.bam - 

# run express making a BAM file with all of the uniquely mapped reads
express --output-align-samp -B 1 transcriptome.fa aligned.bam
the '--output-align-samp' option instructs eXpress to produce a BAM file named 'hits.1.samp.bam' which contains only the single alignments for each read selected by their EM algorithm as the alignment with the highest probability of being correct. in other words the alignments in this file should produce counts from the 'bedtools coverage' command that match up with the 'est_counts' column in the 'results.xprs' file produced by eXpress. You can use this BED file in the same 'bedtools coverage' command I put into my previous post. Now you'll have coverages without any redundant alignments.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 06-05-2013, 04:47 PM   #5
migs54
Junior Member
 
Location: LA

Join Date: Nov 2012
Posts: 7
Default

Thanks for the suggestions. I ended up using the following.

coverageBed -abam CCE.bam -s -b mm9.refSeq.Wholegene.bed > CCEcoverage.txt

The final column gave percentage coverage and it matches well with the bedgraphs I have on UCSC.

Alternative splicing is not a primary concern at this point so this should be fine for now. If that changes I'll be sure to have a headache trying to figure out! I'll come back to this post for sure.

Thanks!
migs54 is offline   Reply With Quote
Old 04-21-2015, 03:37 PM   #6
pengchy
Senior Member
 
Location: China

Join Date: Feb 2009
Posts: 116
Default

Quote:
Originally Posted by migs54 View Post
Thanks for the suggestions. I ended up using the following.

coverageBed -abam CCE.bam -s -b mm9.refSeq.Wholegene.bed > CCEcoverage.txt

The final column gave percentage coverage and it matches well with the bedgraphs I have on UCSC.

Alternative splicing is not a primary concern at this point so this should be fine for now. If that changes I'll be sure to have a headache trying to figure out! I'll come back to this post for sure.

Thanks!
Just to remind that this method will not give the coverage of the exons of one gene.

coverageBed -abam CCE.bam -s -b mm9.refSeq.Wholegene.bed > CCEcoverage.txt

The percentage coverage given by this command is the gene body, including introns, coverage. And further, if you donn't use "-split" parameter, the region in "-abam" will use all the span that include "Ns". The detailed explanation can be found at section 1.3.19.
pengchy is offline   Reply With Quote
Old 04-21-2015, 03:44 PM   #7
migs54
Junior Member
 
Location: LA

Join Date: Nov 2012
Posts: 7
Default

Hi Pengchy,

Yes you're right. This experiment was chromatin RNA seq, so included introns when calculating coverage.
migs54 is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:29 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO