SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Raw read counts for RNAseq biofreak Introductions 13 01-16-2013 05:28 AM
How can one get raw read counts from RPKM values gen2prot General 6 06-24-2011 11:08 AM
Raw read counts for RNAseq biofreak RNA Sequencing 2 06-15-2011 05:56 AM
miRNA read counts regarding DE vebaev RNA Sequencing 16 05-19-2011 11:52 AM
DESeq: Read counts vs. BP counts burkard Bioinformatics 0 08-05-2010 11:52 PM

Reply
 
Thread Tools
Old 04-08-2010, 08:36 AM   #1
chrisbala
Member
 
Location: North Carolina

Join Date: Jan 2010
Posts: 82
Default read counts per transcript, edgeR, tophat

Hi everyone,

I'm using Illumina RNAseq to test for differential expression between two conditions. I have successfully mapped reads with tophat/bowtie, and used cufflinks/cuffcompare to test for differential expression.

A number of methods (edgeR, DEGseq, DESseq), however, ask for raw read counts rather FPKM.

Is there an easy way to go from tophat output (say the bedgraph or .sam) to raw read counts for each transcript (say using a gtf that I would supply)?

Thanks!
chrisbala is offline   Reply With Quote
Old 04-08-2010, 01:26 PM   #2
quinlana
Senior Member
 
Location: Charlottesville

Join Date: Sep 2008
Posts: 119
Default

If I understand your question correctly, BEDTools has a utility called coverageBed that will compute the raw counts of reads in a BAM file among features in a BED or GTF file.

For example:
Code:
$ coverageBed -abam alignedReads.bam -b transcripts.gtf > transcript.cov.txt
There is also a histogram (-hist) mode that will give you a histogram of coverage for each feature in the -b file. I have not yet updated the manual , but the help (-h) for this tools describes the output format.
quinlana is offline   Reply With Quote
Old 04-08-2010, 10:55 PM   #3
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Here is the script that I made for this purpose: http://www-huber.embl.de/users/ander...doc/count.html

Simon
Simon Anders is offline   Reply With Quote
Old 04-09-2010, 07:24 AM   #4
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482
Default

Yeah, I have used bedtools for similar analysis and it is quite intuitive

Quote:
Originally Posted by quinlana View Post
If I understand your question correctly, BEDTools has a utility called coverageBed that will compute the raw counts of reads in a BAM file among features in a BED or GTF file.

For example:
Code:
$ coverageBed -abam alignedReads.bam -b transcripts.gtf > transcript.cov.txt
There is also a histogram (-hist) mode that will give you a histogram of coverage for each feature in the -b file. I have not yet updated the manual , but the help (-h) for this tools describes the output format.
bioinfosm is offline   Reply With Quote
Old 04-09-2010, 10:54 AM   #5
Auction
Member
 
Location: california

Join Date: Jul 2009
Posts: 24
Default

Quote:
Originally Posted by Simon Anders View Post
Here is the script that I made for this purpose: http://www-huber.embl.de/users/ander...doc/count.html

Simon
Simon, are there any specific reason to use Python2.5, because the cluster I can use only have Python2.4. And I trid to build the source codes using Python 2.4. When I ran "python setup.py build" I got the warning message which doesn't matter.

/usr/lib64/python2.4/distutils/dist.py:236: UserWarning: Unknown distribution option: 'requires'
warnings.warn(msg)

However, when I ran "python setup.py install --prefix=...", I got the error message

byte-compiling $HOME/librarys/lib64/python2.4/site-packages/HTSeq/StepVector.py to StepVector.pyc
File $HOME/librarys/lib64/python2.4/site-packages/HTSeq/StepVector.py", line 390
start = index.start if index.start is not None else self.start_index()
^
SyntaxError: invalid syntax

The reason seems to that "Conditional expressions" is not supported in Python2.4, how can I change the codes without upgrading to Python 2.5

Thanks

Ying

Last edited by Auction; 04-09-2010 at 12:46 PM.
Auction is offline   Reply With Quote
Old 04-10-2010, 02:11 AM   #6
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

When I started developing HTSeq, I had to make a decision on which Python version to require. I refrained from using any Python 2.6 features as many people still work with 2.5, but I didn't expect many 2.4 installations to be still around.

Python 2.5 brought a few nifty feature but I think I didn't use too many of them, and most are only syntactic sugar anyway. I guess that not too many changes would be needed to make it run on 2.4

But wouldn't it be much easier if you convinced your cluster admin to update the system? (Or install it yourself in your home directory: This takes only a couple of minute and Python typically builds from source out of the box.)

Python 2.5 was released 4 years ago, and by now, the release version is 2.6.5, and 2.7 is in alpha. You shouldn't need to waste your time to get software to run in such an out-of-date system.
Simon Anders is offline   Reply With Quote
Old 04-10-2010, 06:24 AM   #7
Siva
Member
 
Location: Ames, IA

Join Date: Nov 2008
Posts: 57
Default

Quote:
Originally Posted by Simon Anders View Post

Python 2.5 was released 4 years ago, and by now, the release version is 2.6.5, and 2.7 is in alpha. You shouldn't need to waste your time to get software to run in such an out-of-date system.
Simon:
You would be surprised to know that one of the clusters I use needs Python 2.3!! Hope fully they update it soon. Fortunately we have another cluster but some one else has to help me load HTseq on it as I do not have access to it.

Siva
Siva is offline   Reply With Quote
Old 04-14-2010, 10:37 AM   #8
Auction
Member
 
Location: california

Join Date: Jul 2009
Posts: 24
Default

Quote:
Originally Posted by Simon Anders View Post
When I started developing HTSeq, I had to make a decision on which Python version to require. I refrained from using any Python 2.6 features as many people still work with 2.5, but I didn't expect many 2.4 installations to be still around.

Python 2.5 brought a few nifty feature but I think I didn't use too many of them, and most are only syntactic sugar anyway. I guess that not too many changes would be needed to make it run on 2.4

But wouldn't it be much easier if you convinced your cluster admin to update the system? (Or install it yourself in your home directory: This takes only a couple of minute and Python typically builds from source out of the box.)

Python 2.5 was released 4 years ago, and by now, the release version is 2.6.5, and 2.7 is in alpha. You shouldn't need to waste your time to get software to run in such an out-of-date system.
The reason I stick on Python 2.4 is that several clusters I can use all use Rock 5.2+Python2.4. By the way, I upgraded my workstation to Python2.6 and installed HTSeq, but it doesn't indicate support for paired-ends. Are there specific options for paired-ends? Thanks
Auction is offline   Reply With Quote
Old 04-14-2010, 10:42 AM   #9
Auction
Member
 
Location: california

Join Date: Jul 2009
Posts: 24
Default

Quote:
Originally Posted by quinlana View Post
If I understand your question correctly, BEDTools has a utility called coverageBed that will compute the raw counts of reads in a BAM file among features in a BED or GTF file.

For example:
Code:
$ coverageBed -abam alignedReads.bam -b transcripts.gtf > transcript.cov.txt
There is also a histogram (-hist) mode that will give you a histogram of coverage for each feature in the -b file. I have not yet updated the manual , but the help (-h) for this tools describes the output format.
I tried to run coverageBed for my BAM file from pair-end RNA-seq. When I compared the result in a certain gene to my calculation using Bio:B::Sam. It seems coverageBed don't consider the mate properly paired flag in the BAM file. Are there any options control for it? Thanks
Auction is offline   Reply With Quote
Old 04-14-2010, 12:28 PM   #10
quinlana
Senior Member
 
Location: Charlottesville

Join Date: Sep 2008
Posts: 119
Default

Quote:
Originally Posted by Auction View Post
I tried to run coverageBed for my BAM file from pair-end RNA-seq. When I compared the result in a certain gene to my calculation using Bio:B::Sam. It seems coverageBed don't consider the mate properly paired flag in the BAM file. Are there any options control for it? Thanks
Hi,
You are correct, coverageBed processes every BAM entry in the BAM file provided. The rationale here is that one can use samtools to filter your data before passing to coverageBed. For example, to only count proper pairs, do the following (passing the output of samtools to coverageBed via a pipe, hence the "stdin" for the -abam parameter:

Code:
$ samtools -f 0x2 <BAM> | coverageBed -abam stdin -b transcripts.bed > transcripts.proper.cov
Does this help?
quinlana is offline   Reply With Quote
Old 04-14-2010, 01:50 PM   #11
Auction
Member
 
Location: california

Join Date: Jul 2009
Posts: 24
Default

Quote:
Originally Posted by quinlana View Post
Hi,
You are correct, coverageBed processes every BAM entry in the BAM file provided. The rationale here is that one can use samtools to filter your data before passing to coverageBed. For example, to only count proper pairs, do the following (passing the output of samtools to coverageBed via a pipe, hence the "stdin" for the -abam parameter:

Code:
$ samtools -f 0x2 <BAM> | coverageBed -abam stdin -b transcripts.bed > transcripts.proper.cov
Does this help?
I see, thanks.
Auction is offline   Reply With Quote
Old 04-15-2010, 06:57 AM   #12
chrisbala
Member
 
Location: North Carolina

Join Date: Jan 2010
Posts: 82
Default HT-Seq

Hi Simon,

Sorry to trouble you with with what is surely a silly question. I've just installed HT-Seq from source on my Mac. I've just tried to run a couple of basic commands to see if things were working:

python -m HTSeq.scripts.qa s_3_sequence.txt

but I'm getting an error message:

Code:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/runpy.py", line 85, in run_module
    loader = get_loader(mod_name)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pkgutil.py", line 456, in get_loader
    return find_loader(fullname)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pkgutil.py", line 466, in find_loader
    for importer in iter_importers(fullname):
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pkgutil.py", line 422, in iter_importers
    __import__(pkg)
ImportError: No module named HTSeq.scripts
I'm running the script from within the HT-Seq directory. I placed the HT-Seq folder within my Applications directory.

Any idea what I've done wrong?
chrisbala is offline   Reply With Quote
Old 04-15-2010, 07:33 AM   #13
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by chrisbala View Post
I've just installed HT-Seq from source on my Mac. I've just tried to run a couple of basic commands to see if things were working:

python -m HTSeq.scripts.qa s_3_sequence.txt

but I'm getting an error message:
It seems that Python does not know were you put the HTSeq modules.

Once you have unpacked the HTSeq tarball, you should run the commands

Code:
   python setup.py build
   python setup.py install
The second command puts it somewhere where Python should find it.

If this does not work, use only the first command. It makes a subdirectory 'build', and in there something like 'lib.macos' (or called similarly), and in there, 'HTSeq'. You can just tell Python to search for HTSeq there by writing

Code:
export PYTHONPATH=$PYTHONPATH:/<the_path_to_your_directory/build/lib.macos/
and Python will find it there.

Do not run the script in the directory that you have just unpacked, go somewhere else, as Python otherwise tends to get confused.
Simon Anders is offline   Reply With Quote
Old 04-15-2010, 08:45 AM   #14
chrisbala
Member
 
Location: North Carolina

Join Date: Jan 2010
Posts: 82
Default almost there

Thanks Simon,

THIngs seem to be working...

Will be try DESeq as soon as I get these counts finalized.

One question: for the counting script, you ask for a gff, and elsewhere you use gtf. I've noticed in tophat as well that is some cases gffs are requested (or output) and sometimes gtf.

Is there are particular aspect of gff that makes it more suitable for this counting algorithm?

just curious...
chrisbala is offline   Reply With Quote
Old 04-15-2010, 09:10 AM   #15
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Hi Chris,

GTF is just a tightening of the GFF specification, i.e., every GTF file is also a GFF file. The tightening is mainly with respect to the "feature type" column: in GTF, it has to be one of "exon", "CDS", "stop_codon", etc., while in GFF, it is not clearly said which words should be used.

In the htseq-count script, I give the user the option to tell the script how the features are named, and hence, it works with GFF as well.

See here:
GFF: http://www.sanger.ac.uk/resources/so.../gff/spec.html
GTF: http://mblab.wustl.edu/GTF22.html

Simon

Last edited by Simon Anders; 04-15-2010 at 09:11 AM. Reason: add links
Simon Anders is offline   Reply With Quote
Old 04-17-2010, 05:10 PM   #16
wenhuang
Member
 
Location: Raleigh, NC

Join Date: Feb 2010
Posts: 30
Default

Hi Simon,

Can I ask you a question that if I have paired end reads who are overlapping with each other, for a overlapped region, does HTSeq count it as single coverages if reads are from the same fragment (but different ends) or count it as covered twice?

Thanks!

Quote:
Originally Posted by Simon Anders View Post
Here is the script that I made for this purpose: http://www-huber.embl.de/users/ander...doc/count.html

Simon
wenhuang is offline   Reply With Quote
Old 04-18-2010, 12:16 AM   #17
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by wenhuang View Post
Can I ask you a question that if I have paired end reads who are overlapping with each other, for a overlapped region, does HTSeq count it as single coverages if reads are from the same fragment (but different ends) or count it as covered twice?
If you use the full Python framework (as opposed to the the simple htseq-count script) you can adjust such stuff as you need them.

At the moment, however, HTSeq does not support paired-end reads at all. I'll add this functionality very soon, hopefully next week.

Your data would be very useful as test data. Could you maybe send me your SAM file (or parts of it)?

By the way, last time I tried, Bowtie was unable to deal with overlapping ends from the same fragment and did not align these at all. Have you used another software, or did they improve that?

Cheers
Simon
Simon Anders is offline   Reply With Quote
Old 04-18-2010, 09:11 AM   #18
wenhuang
Member
 
Location: Raleigh, NC

Join Date: Feb 2010
Posts: 30
Default

Hi Simon,

Thanks for your reply. I ended up using tophat and it has no problem aligning overlapping reads by specifying a negative -r as suggested by Cole Trapnell.

I attached an example SAM file below. It is from a highly expressed gene on bovine chromosome one (ATPO). The Ensembl annotation is as below.

I am very eager to see what you get.

Thanks a lot.


chr1 protein_coding exon 720699 720764 . + . gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "1"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";
chr1 protein_coding CDS 720729 720764 . + 0 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "1"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN"; protein_id "ENSBTAP00000024326";
chr1 protein_coding start_codon 720729 720731 . + 0 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "1"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";
chr1 protein_coding exon 722101 722151 . + . gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "2"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";
chr1 protein_coding CDS 722101 722151 . + 0 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "2"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN"; protein_id "ENSBTAP00000024326";
chr1 protein_coding exon 723098 723208 . + . gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "3"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";
chr1 protein_coding CDS 723098 723208 . + 0 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "3"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN"; protein_id "ENSBTAP00000024326";
chr1 protein_coding exon 725362 725491 . + . gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "4"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";
chr1 protein_coding CDS 725362 725491 . + 0 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "4"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN"; protein_id "ENSBTAP00000024326";
chr1 protein_coding exon 726174 726286 . + . gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "5"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";
chr1 protein_coding CDS 726174 726286 . + 2 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "5"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN"; protein_id "ENSBTAP00000024326";
chr1 protein_coding exon 727512 727598 . + . gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "6"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";
chr1 protein_coding CDS 727512 727598 . + 0 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "6"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN"; protein_id "ENSBTAP00000024326";
chr1 protein_coding exon 727876 728057 . + . gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "7"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";
chr1 protein_coding CDS 727876 727986 . + 0 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "7"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN"; protein_id "ENSBTAP00000024326";
chr1 protein_coding stop_codon 727987 727989 . + 0 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "7"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";



Quote:
Originally Posted by Simon Anders View Post
If you use the full Python framework (as opposed to the the simple htseq-count script) you can adjust such stuff as you need them.

At the moment, however, HTSeq does not support paired-end reads at all. I'll add this functionality very soon, hopefully next week.

Your data would be very useful as test data. Could you maybe send me your SAM file (or parts of it)?

By the way, last time I tried, Bowtie was unable to deal with overlapping ends from the same fragment and did not align these at all. Have you used another software, or did they improve that?

Cheers
Simon
Attached Files
File Type: gz ATPO.paired.sam.tar.gz (202.5 KB, 9 views)
wenhuang is offline   Reply With Quote
Old 04-19-2010, 09:28 AM   #19
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Hi wenhuang

I've now added paired-end support for the SAM fomat to HTSeq and htseq-count. It would be great if you could download the new version 0.4.1 of HTSeq, try to do your counting job with it and tell me whether it worked correctly.

Cheers
Simon
Simon Anders is offline   Reply With Quote
Old 04-19-2010, 10:40 AM   #20
wenhuang
Member
 
Location: Raleigh, NC

Join Date: Feb 2010
Posts: 30
Default

Hi Simon,

Thank you for your work. Unfortunately, although I am able to get the yeast data provided on your website to work with htseq-count, I am not able to get my own data to work. The SAM file I used is two posts ago and I attach the GTF file to this post.

The error message I got was as below.

Again, thank you for your help.

Traceback (most recent call last):
File "/usr/local/bin/htseq-count", line 5, in <module>
pkg_resources.run_script('HTSeq==0.4.1', 'htseq-count')
File "/System/Library/Frameworks/Python.framework/Versions/2.5/Extras/lib/python/pkg_resources.py", line 489, in run_script

File "/System/Library/Frameworks/Python.framework/Versions/2.5/Extras/lib/python/pkg_resources.py", line 1207, in run_script
path = self.module_path
File "/Library/Python/2.5/site-packages/HTSeq-0.4.1-py2.5-macosx-10.5-i386.egg/EGG-INFO/scripts/htseq-count", line 5, in <module>
HTSeq.scripts.count.main()
File "/Library/Python/2.5/site-packages/HTSeq-0.4.1-py2.5-macosx-10.5-i386.egg/HTSeq/scripts/count.py", line 156, in main
opts.mode, opts.featuretype, opts.idattr, opts.quiet )
File "/Library/Python/2.5/site-packages/HTSeq-0.4.1-py2.5-macosx-10.5-i386.egg/HTSeq/scripts/count.py", line 68, in count_reads_in_features
if iv.chrom not in features.step_vectors:
AttributeError: 'NoneType' object has no attribute 'chrom'


Quote:
Originally Posted by Simon Anders View Post
Hi wenhuang

I've now added paired-end support for the SAM fomat to HTSeq and htseq-count. It would be great if you could download the new version 0.4.1 of HTSeq, try to do your counting job with it and tell me whether it worked correctly.

Cheers
Simon
Attached Files
File Type: txt ATPO.gtf.txt (3.1 KB, 25 views)
wenhuang is offline   Reply With Quote
Reply

Tags
cufflinks, degseq, deseq

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 03:14 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO