PDA

View Full Version : read counts per transcript, edgeR, tophat


chrisbala
04-08-2010, 09:36 AM
Hi everyone,

I'm using Illumina RNAseq to test for differential expression between two conditions. I have successfully mapped reads with tophat/bowtie, and used cufflinks/cuffcompare to test for differential expression.

A number of methods (edgeR, DEGseq, DESseq), however, ask for raw read counts rather FPKM.

Is there an easy way to go from tophat output (say the bedgraph or .sam) to raw read counts for each transcript (say using a gtf that I would supply)?

Thanks!

quinlana
04-08-2010, 02:26 PM
If I understand your question correctly, BEDTools (http://bedtools.googlecode.com) has a utility called coverageBed that will compute the raw counts of reads in a BAM file among features in a BED or GTF file.

For example:
$ coverageBed -abam alignedReads.bam -b transcripts.gtf > transcript.cov.txt

There is also a histogram (-hist) mode that will give you a histogram of coverage for each feature in the -b file. I have not yet updated the manual , but the help (-h) for this tools describes the output format.

Simon Anders
04-08-2010, 11:55 PM
Here is the script that I made for this purpose: http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

Simon

bioinfosm
04-09-2010, 08:24 AM
Yeah, I have used bedtools for similar analysis and it is quite intuitive

If I understand your question correctly, BEDTools (http://bedtools.googlecode.com) has a utility called coverageBed that will compute the raw counts of reads in a BAM file among features in a BED or GTF file.

For example:
$ coverageBed -abam alignedReads.bam -b transcripts.gtf > transcript.cov.txt

There is also a histogram (-hist) mode that will give you a histogram of coverage for each feature in the -b file. I have not yet updated the manual , but the help (-h) for this tools describes the output format.

Auction
04-09-2010, 11:54 AM
Here is the script that I made for this purpose: http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

Simon

Simon, are there any specific reason to use Python2.5, because the cluster I can use only have Python2.4. And I trid to build the source codes using Python 2.4. When I ran "python setup.py build" I got the warning message which doesn't matter.

/usr/lib64/python2.4/distutils/dist.py:236: UserWarning: Unknown distribution option: 'requires'
warnings.warn(msg)

However, when I ran "python setup.py install --prefix=...", I got the error message

byte-compiling $HOME/librarys/lib64/python2.4/site-packages/HTSeq/StepVector.py to StepVector.pyc
File $HOME/librarys/lib64/python2.4/site-packages/HTSeq/StepVector.py", line 390
start = index.start if index.start is not None else self.start_index()
^
SyntaxError: invalid syntax

The reason seems to that "Conditional expressions" is not supported in Python2.4, how can I change the codes without upgrading to Python 2.5

Thanks

Ying

Simon Anders
04-10-2010, 03:11 AM
When I started developing HTSeq, I had to make a decision on which Python version to require. I refrained from using any Python 2.6 features as many people still work with 2.5, but I didn't expect many 2.4 installations to be still around.

Python 2.5 brought a few nifty feature but I think I didn't use too many of them, and most are only syntactic sugar anyway. I guess that not too many changes would be needed to make it run on 2.4

But wouldn't it be much easier if you convinced your cluster admin to update the system? (Or install it yourself in your home directory: This takes only a couple of minute and Python typically builds from source out of the box.)

Python 2.5 was released 4 years ago, and by now, the release version is 2.6.5, and 2.7 is in alpha. You shouldn't need to waste your time to get software to run in such an out-of-date system.

Siva
04-10-2010, 07:24 AM
Python 2.5 was released 4 years ago, and by now, the release version is 2.6.5, and 2.7 is in alpha. You shouldn't need to waste your time to get software to run in such an out-of-date system.

Simon:
You would be surprised to know that one of the clusters I use needs Python 2.3!!:eek: Hope fully they update it soon. Fortunately we have another cluster but some one else has to help me load HTseq on it as I do not have access to it.

Siva

Auction
04-14-2010, 11:37 AM
When I started developing HTSeq, I had to make a decision on which Python version to require. I refrained from using any Python 2.6 features as many people still work with 2.5, but I didn't expect many 2.4 installations to be still around.

Python 2.5 brought a few nifty feature but I think I didn't use too many of them, and most are only syntactic sugar anyway. I guess that not too many changes would be needed to make it run on 2.4

But wouldn't it be much easier if you convinced your cluster admin to update the system? (Or install it yourself in your home directory: This takes only a couple of minute and Python typically builds from source out of the box.)

Python 2.5 was released 4 years ago, and by now, the release version is 2.6.5, and 2.7 is in alpha. You shouldn't need to waste your time to get software to run in such an out-of-date system.

The reason I stick on Python 2.4 is that several clusters I can use all use Rock 5.2+Python2.4. By the way, I upgraded my workstation to Python2.6 and installed HTSeq, but it doesn't indicate support for paired-ends. Are there specific options for paired-ends? Thanks

Auction
04-14-2010, 11:42 AM
If I understand your question correctly, BEDTools (http://bedtools.googlecode.com) has a utility called coverageBed that will compute the raw counts of reads in a BAM file among features in a BED or GTF file.

For example:
$ coverageBed -abam alignedReads.bam -b transcripts.gtf > transcript.cov.txt

There is also a histogram (-hist) mode that will give you a histogram of coverage for each feature in the -b file. I have not yet updated the manual , but the help (-h) for this tools describes the output format.

I tried to run coverageBed for my BAM file from pair-end RNA-seq. When I compared the result in a certain gene to my calculation using Bio::DB::Sam. It seems coverageBed don't consider the mate properly paired flag in the BAM file. Are there any options control for it? Thanks

quinlana
04-14-2010, 01:28 PM
I tried to run coverageBed for my BAM file from pair-end RNA-seq. When I compared the result in a certain gene to my calculation using Bio::DB::Sam. It seems coverageBed don't consider the mate properly paired flag in the BAM file. Are there any options control for it? Thanks

Hi,
You are correct, coverageBed processes every BAM entry in the BAM file provided. The rationale here is that one can use samtools to filter your data before passing to coverageBed. For example, to only count proper pairs, do the following (passing the output of samtools to coverageBed via a pipe, hence the "stdin" for the -abam parameter:

$ samtools -f 0x2 <BAM> | coverageBed -abam stdin -b transcripts.bed > transcripts.proper.cov

Does this help?

Auction
04-14-2010, 02:50 PM
Hi,
You are correct, coverageBed processes every BAM entry in the BAM file provided. The rationale here is that one can use samtools to filter your data before passing to coverageBed. For example, to only count proper pairs, do the following (passing the output of samtools to coverageBed via a pipe, hence the "stdin" for the -abam parameter:

$ samtools -f 0x2 <BAM> | coverageBed -abam stdin -b transcripts.bed > transcripts.proper.cov

Does this help?

I see, thanks.

chrisbala
04-15-2010, 07:57 AM
Hi Simon,

Sorry to trouble you with with what is surely a silly question. I've just installed HT-Seq from source on my Mac. I've just tried to run a couple of basic commands to see if things were working:

python -m HTSeq.scripts.qa s_3_sequence.txt

but I'm getting an error message:

Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/runpy.py", line 85, in run_module
loader = get_loader(mod_name)
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pkgutil.py", line 456, in get_loader
return find_loader(fullname)
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pkgutil.py", line 466, in find_loader
for importer in iter_importers(fullname):
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pkgutil.py", line 422, in iter_importers
__import__(pkg)
ImportError: No module named HTSeq.scripts

I'm running the script from within the HT-Seq directory. I placed the HT-Seq folder within my Applications directory.

Any idea what I've done wrong?

Simon Anders
04-15-2010, 08:33 AM
I've just installed HT-Seq from source on my Mac. I've just tried to run a couple of basic commands to see if things were working:

python -m HTSeq.scripts.qa s_3_sequence.txt

but I'm getting an error message:


It seems that Python does not know were you put the HTSeq modules.

Once you have unpacked the HTSeq tarball, you should run the commands

python setup.py build
python setup.py install

The second command puts it somewhere where Python should find it.

If this does not work, use only the first command. It makes a subdirectory 'build', and in there something like 'lib.macos' (or called similarly), and in there, 'HTSeq'. You can just tell Python to search for HTSeq there by writing

export PYTHONPATH=$PYTHONPATH:/<the_path_to_your_directory/build/lib.macos/

and Python will find it there.

Do not run the script in the directory that you have just unpacked, go somewhere else, as Python otherwise tends to get confused.

chrisbala
04-15-2010, 09:45 AM
Thanks Simon,

THIngs seem to be working...

Will be try DESeq as soon as I get these counts finalized.

One question: for the counting script, you ask for a gff, and elsewhere you use gtf. I've noticed in tophat as well that is some cases gffs are requested (or output) and sometimes gtf.

Is there are particular aspect of gff that makes it more suitable for this counting algorithm?

just curious...

Simon Anders
04-15-2010, 10:10 AM
Hi Chris,

GTF is just a tightening of the GFF specification, i.e., every GTF file is also a GFF file. The tightening is mainly with respect to the "feature type" column: in GTF, it has to be one of "exon", "CDS", "stop_codon", etc., while in GFF, it is not clearly said which words should be used.

In the htseq-count script, I give the user the option to tell the script how the features are named, and hence, it works with GFF as well.

See here:
GFF: http://www.sanger.ac.uk/resources/software/gff/spec.html
GTF: http://mblab.wustl.edu/GTF22.html

Simon

wenhuang
04-17-2010, 06:10 PM
Hi Simon,

Can I ask you a question that if I have paired end reads who are overlapping with each other, for a overlapped region, does HTSeq count it as single coverages if reads are from the same fragment (but different ends) or count it as covered twice?

Thanks!

Here is the script that I made for this purpose: http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

Simon

Simon Anders
04-18-2010, 01:16 AM
Can I ask you a question that if I have paired end reads who are overlapping with each other, for a overlapped region, does HTSeq count it as single coverages if reads are from the same fragment (but different ends) or count it as covered twice?


If you use the full Python framework (as opposed to the the simple htseq-count script) you can adjust such stuff as you need them.

At the moment, however, HTSeq does not support paired-end reads at all. I'll add this functionality very soon, hopefully next week.

Your data would be very useful as test data. Could you maybe send me your SAM file (or parts of it)?

By the way, last time I tried, Bowtie was unable to deal with overlapping ends from the same fragment and did not align these at all. Have you used another software, or did they improve that?

Cheers
Simon

wenhuang
04-18-2010, 10:11 AM
Hi Simon,

Thanks for your reply. I ended up using tophat and it has no problem aligning overlapping reads by specifying a negative -r as suggested by Cole Trapnell.

I attached an example SAM file below. It is from a highly expressed gene on bovine chromosome one (ATPO). The Ensembl annotation is as below.

I am very eager to see what you get.

Thanks a lot.


chr1 protein_coding exon 720699 720764 . + . gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "1"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";
chr1 protein_coding CDS 720729 720764 . + 0 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "1"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN"; protein_id "ENSBTAP00000024326";
chr1 protein_coding start_codon 720729 720731 . + 0 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "1"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";
chr1 protein_coding exon 722101 722151 . + . gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "2"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";
chr1 protein_coding CDS 722101 722151 . + 0 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "2"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN"; protein_id "ENSBTAP00000024326";
chr1 protein_coding exon 723098 723208 . + . gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "3"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";
chr1 protein_coding CDS 723098 723208 . + 0 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "3"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN"; protein_id "ENSBTAP00000024326";
chr1 protein_coding exon 725362 725491 . + . gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "4"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";
chr1 protein_coding CDS 725362 725491 . + 0 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "4"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN"; protein_id "ENSBTAP00000024326";
chr1 protein_coding exon 726174 726286 . + . gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "5"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";
chr1 protein_coding CDS 726174 726286 . + 2 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "5"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN"; protein_id "ENSBTAP00000024326";
chr1 protein_coding exon 727512 727598 . + . gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "6"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";
chr1 protein_coding CDS 727512 727598 . + 0 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "6"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN"; protein_id "ENSBTAP00000024326";
chr1 protein_coding exon 727876 728057 . + . gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "7"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";
chr1 protein_coding CDS 727876 727986 . + 0 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "7"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN"; protein_id "ENSBTAP00000024326";
chr1 protein_coding stop_codon 727987 727989 . + 0 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "7"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";



If you use the full Python framework (as opposed to the the simple htseq-count script) you can adjust such stuff as you need them.

At the moment, however, HTSeq does not support paired-end reads at all. I'll add this functionality very soon, hopefully next week.

Your data would be very useful as test data. Could you maybe send me your SAM file (or parts of it)?

By the way, last time I tried, Bowtie was unable to deal with overlapping ends from the same fragment and did not align these at all. Have you used another software, or did they improve that?

Cheers
Simon

Simon Anders
04-19-2010, 10:28 AM
Hi wenhuang

I've now added paired-end support for the SAM fomat to HTSeq and htseq-count. It would be great if you could download the new version 0.4.1 of HTSeq, try to do your counting job with it and tell me whether it worked correctly.

Cheers
Simon

wenhuang
04-19-2010, 11:40 AM
Hi Simon,

Thank you for your work. Unfortunately, although I am able to get the yeast data provided on your website to work with htseq-count, I am not able to get my own data to work. The SAM file I used is two posts ago and I attach the GTF file to this post.

The error message I got was as below.

Again, thank you for your help.

Traceback (most recent call last):
File "/usr/local/bin/htseq-count", line 5, in <module>
pkg_resources.run_script('HTSeq==0.4.1', 'htseq-count')
File "/System/Library/Frameworks/Python.framework/Versions/2.5/Extras/lib/python/pkg_resources.py", line 489, in run_script

File "/System/Library/Frameworks/Python.framework/Versions/2.5/Extras/lib/python/pkg_resources.py", line 1207, in run_script
path = self.module_path
File "/Library/Python/2.5/site-packages/HTSeq-0.4.1-py2.5-macosx-10.5-i386.egg/EGG-INFO/scripts/htseq-count", line 5, in <module>
HTSeq.scripts.count.main()
File "/Library/Python/2.5/site-packages/HTSeq-0.4.1-py2.5-macosx-10.5-i386.egg/HTSeq/scripts/count.py", line 156, in main
opts.mode, opts.featuretype, opts.idattr, opts.quiet )
File "/Library/Python/2.5/site-packages/HTSeq-0.4.1-py2.5-macosx-10.5-i386.egg/HTSeq/scripts/count.py", line 68, in count_reads_in_features
if iv.chrom not in features.step_vectors:
AttributeError: 'NoneType' object has no attribute 'chrom'


Hi wenhuang

I've now added paired-end support for the SAM fomat to HTSeq and htseq-count. It would be great if you could download the new version 0.4.1 of HTSeq, try to do your counting job with it and tell me whether it worked correctly.

Cheers
Simon

chrisbala
04-19-2010, 01:22 PM
Hi Simon,

I just want to clarify something about the gtf/gff formating and the counting by HTseq

My gff looks like this:

chr1 taeGut1_ensGene gene 8373168 8392590 . + . gene_id=ENSTGUT00000007148;Name=
chr1 taeGut1_ensGene mRNA 8373168 8392590 . + . gene_id=ENSTGUT00000007148;Name=;Parent=ENSTGUT00000007148
chr1 taeGut1_ensGene exon 8373168 8373328 . + . gene_id=ENSTGUT00000007148.;Name=;Parent=ENSTGUT00000007148
chr1 taeGut1_ensGene exon 8374845 8375006 . + . gene_id=ENSTGUT00000007148.;Name=;Parent=ENSTGUT00000007148
chr1 taeGut1_ensGene exon 8383527 8383695 . + . gene_id=ENSTGUT00000007148.;Name=;Parent=ENSTGUT00000007148
chr1 taeGut1_ensGene exon 8392390 8392590 . + . gene_id=ENSTGUT00000007148.;Name=;Parent=ENSTGUT00000007148


THe HTseq documentation specifies that the default is counting by exon, yet the output from for HTseq corresponds to the ensembl #s. (One number is output per ensembl id). Is this because I do not have unique ids for each exon? Or is it actually counting across the transcript? IN order to ACTUALLY count by exon, would I just have to include unique ids for each? (and remove the mRNA and and gene lines?) Does anyone have an easy way to to assign unique ids for each exon? thanks!

Simon Anders
04-19-2010, 01:50 PM
Hi Chris,

the default that you mention -- gene IDs but exon features -- is meant to count how many reads fall onto each gene. I put this as default as it seemed to me to be the most common use case.

Many users might prefer to count by transcript but this is actually a rather involved problem. If a read maps to an exon shared by many transcripts how do you decide which one to use? If you have a good algorithm to decide this, HTSeq gives you the tools to implement it conveniently, but I don't have one handy (am I'm eager to see what the cufflink people came up with once their paper is out).

I also have a script to count by exons, which I can give you if you need it. Giving each exon in the GTF file a unique ID is not a problem: just concatenate the transcript ID with the exon number. The problem is rather that each exon appears multiple times in an Ensembl GTF file, namely once for each transcript in which it appears. If you attempt to collapse all these multiple copies of an exon to a single entity, you'll notice that it is quite common for an exon to have have variants: Outer exons may have varying outer borders due to alternative transcription starts and ends (i.e., UTRs of varying length), and sometimes introns are not spliced out, i.e., an exon-intron-exon sequence in one transcript is just one long exon in the other one.

Hence, your job is first to decide what is, for your specific application, the right way to count these cases. Once you know this, coding it in Python with HTSeq is easy.

Simon

Simon Anders
04-19-2010, 01:59 PM
Thank you for your work. Unfortunately, although I am able to get the yeast data provided on your website to work with htseq-count, I am not able to get my own data to work. The SAM file I used is two posts ago and I attach the GTF file to this post.

I've fixed the bug.

chrisbala
04-19-2010, 02:19 PM
Hi Simon,

You might be able to tell, but I'm working with a "new" and fairly poorly annotated genome. So we actually don't have much info at all about alternative transcripts in ensembl. So I'm willing to be a little sloppy about how reads are assigned to exons shared among transcripts (assign to both). In fact it is because I suspect that assigning reads is going to be really tricky that I just want to test for differential expression of exons to identify some candidates for alternative splicing. I'm trying out cufflinks predictions as well...

But if you think this makes sense, I'd be thankful to use your script. If there any key format specs just let me know. would you post it here? Or somewhere where I can download?

Thanks,

Chris

blackgore
05-04-2010, 08:50 AM
Hi Simon,

this sounds like a very simple question, but I have to ask anyway...

The gene counts file in the DESeq vignette gives one gene count per gene, per lane, or seems to at least. Can I assume that this is a summed count over all exons of that gene? if so, then is there a need to worry about alternative splicings, where the same gene contains different exons? Do they all get summed to give a final, single overall value for the gene count, and is that ok...? Or am I missing something very basic?

Simon Anders
05-04-2010, 09:54 AM
The gene counts file in the DESeq vignette gives one gene count per gene, per lane, or seems to at least. Can I assume that this is a summed count over all exons of that gene? if so, then is there a need to worry about alternative splicings, where the same gene contains different exons? Do they all get summed to give a final, single overall value for the gene count, and is that ok...? Or am I missing something very basic?

The short answer:

No, you described it accurately.

When a gene's splicing ratios change, its expression strength surely will not stay constant either, so I am not overly worried about mistaking changes in splicing for changes in expression.

---

The long answer:

I wonder whether your being "worried" about alternative splicing might be due to a confusion about who has 'burden of proof".

You see, DESeq performs hypothesis testing against the null hypothesis of no change. Our aim is to reject the null hypothesis, i.e., to get convinced that, for a given gene, expression is different in the two experimental condition. Due to noise, we worry that an apparent difference is just a fluke of this noise, and hence, the expression difference has to be either strong enough to stand out against the background noise, or we need many replicates too keep the noise down.

Changes in isoform proportion are much harder to detect than changes of overall expression levels. So, if the concentration of transcripts of a gene goes down by a quarter, this might easily stick out from noise, even if we have only two or three replicates. However, if the overall expression levels stays as it is, but the ratio of isoform A to isoform B changes from 3:2 to 2:3, this causes more subtle changes in the counts and, in my opinion, two or so lanes per condition will usually be insufficient to be sure that such an observation is a biological effect and not just noise.

This is why we recommend to first go for counts by gene, summing over all exons. You will miss out on alternative splicing events but you will find something.

Strictly speaking, you might mistake alternative splicing for differential expression: if a gene contains a cassette exon that is present in nearly all transcripts in condition A and absent in nearly all transcripts in condition B, you will notice a difference in overall counts. Hence, what DESeq gives you is genes with changes in expression, which will typically be changes in overall expression strength but may also include changes in splicing patterns.

To distinguish this, you can also count by exon rather than by gene, and check for each exon whether its expression changes with experimental condition. If all exons of a gene change the same way, overall expression has changed but splicing ratios have maybe not, but if one exon falls out, it may be a cassette exon whose presence is influenced by your experimental condition.

Using DESeq with such a "counts by exon" table works fine as well. However, you have less counts per exon than per gene, and hence, you have less power to detect anything.

blackgore
06-28-2010, 06:08 AM
sorry Simon, I had meant to reply much sooner than this - thanks for your response, it was altogether very useful! :)

jlfmssm
06-28-2010, 10:09 PM
I try to run coverageBed like this:
/home/ay55/BEDTools/bin/genomeCoverageBed -ibam 1382_1_sorted.bam -i hg18.bed.sorted -g hg18.genome > hg18_seq.cov.txt

no error at all, but when I open hg18_seq.cov.txt, I got this
read block failed - CheckBlockHeader() returned false
Could not read header type

Any idea?




I tried to run coverageBed for my BAM file from pair-end RNA-seq. When I compared the result in a certain gene to my calculation using Bio::DB::Sam. It seems coverageBed don't consider the mate properly paired flag in the BAM file. Are there any options control for it? Thanks

skycreative
06-29-2010, 12:06 AM
Simon:
You would be surprised to know that one of the clusters I use needs Python 2.3!!:eek: Hope fully they update it soon. Fortunately we have another cluster but some one else has to help me load HTseq on it as I do not have access to it.

Siva

could you create the ENV on your home directory and add the path to the $PATH?

tangx_2010
06-04-2011, 11:57 PM
Dear Simon,
Very haoppy to use HTSeq and DEseq developped by you. I have several questions to ask you about.
1. How does HTSeq count pair-end reads by genes when both singleton and paired reads are presented? If the count is based on fragment, then paired two reads are counted by one, and singleton also counted by one, right?
2. How does HTSeq treats multiple mapped reads? Are they just counted several times by different genes?

sincerely tangx

Hi wenhuang

I've now added paired-end support for the SAM fomat to HTSeq and htseq-count. It would be great if you could download the new version 0.4.1 of HTSeq, try to do your counting job with it and tell me whether it worked correctly.

Cheers
Simon

Simon Anders
06-13-2011, 02:41 PM
1. How does HTSeq count pair-end reads by genes when both singleton and paired reads are presented? If the count is based on fragment, then paired two reads are counted by one, and singleton also counted by one, right?

Exactly. A paired read is only counted if both ends map to the same gene, and then, it is counted once, not twice, to this gene.


2. How does HTSeq treats multiple mapped reads? Are they just counted several times by different genes?

This is a bit an issue, as different aligners have different ways of reporting multiple hits, and the SAM specification is frustratingly unclear on this.

In general, a multiply aligned read should be discarded. (Imagine, genes A and B have partial sequence identity. If A is differentially expressed and B is not, any read originating from A that matches to both A and B will let B appear as differentially expressed, too, if it is counted for both. Hence, the prudent strategy is to only count reads that map uniquely to a gene.)

For now, HTSeq looks for the "NH" optional flag. If it indicates that more than one alignment is reported, the read is not counted. If you use the "--minaqual" option, you can also cause all reads with low alignment quality to be skipped, which is another way how some aligners tag multiple alignments. If neither of the two works for you, you should pre-filter the SAM file. It is easy to write such a filtering script with HTSeq.

jameslz
06-16-2011, 10:56 PM
Exactly. A paired read is only counted if both ends map to the same gene, and then, it is counted once, not twice, to this gene.



This is a bit an issue, as different aligners have different ways of reporting multiple hits, and the SAM specification is frustratingly unclear on this.

In general, a multiply aligned read should be discarded. (Imagine, genes A and B have partial sequence identity. If A is differentially expressed and B is not, any read originating from A that matches to both A and B will let B appear as differentially expressed, too, if it is counted for both. Hence, the prudent strategy is to only count reads that map uniquely to a gene.)

For now, HTSeq looks for the "NH" optional flag. If it indicates that more than one alignment is reported, the read is not counted. If you use the "--minaqual" option, you can also cause all reads with low alignment quality to be skipped, which is another way how some aligners tag multiple alignments. If neither of the two works for you, you should pre-filter the SAM file. It is easy to write such a filtering script with HTSeq.

I have a question:
The samtools flagstat result of the bam:

8236963 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
8236963 + 0 mapped (100.00%:-nan%)
8236963 + 0 paired in sequencing
4144407 + 0 read1
4092556 + 0 read2
6061528 + 0 properly paired (73.59%:-nan%)
7724492 + 0 with itself and mate mapped
512471 + 0 singletons (6.22%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

the line : 7724492 + 0 with itself and mate mapped (with itself : that both the forward and reverse are mapped, http://i.seqanswers.com/questions/80/interpreting-samtools-flagstat-output)

dose HTSeq handle such paired reads?

tangx_2010
06-21-2011, 01:19 AM
Exactly. A paired read is only counted if both ends map to the same gene, and then, it is counted once, not twice, to this gene.

Thank you for your reply. You are such a good man with patience. And there is another problem. I am runing HTSeq with pair-end and strand-specific RNA-seq data. It's very strange that most of the reads (> 90%) are counted as ambiguous. However, when I take the data as non strand-specific, HTSeq works very well. Waiting for your reply.
sincerely tangx

edge
06-22-2011, 08:49 PM
Hi quinlana,

I try the following command:

time samtools -f 0x2 /tophat_out/accepted_hits.bam | coverageBed -abam /tophat_out/accepted_hits.bam -b transcripts.gtf > transcripts.proper.cov &

Unfortunately, it gives the following error message:

BgzfStream ERROR: read block failed - invalid block header
BamHeader ERROR: could not read magic number


Do you mind to explain a little bit more about the correct command?
Kindly point out the error that I did in my example.
Thanks.

edge
06-22-2011, 08:52 PM
Hi Auction,

Is it this should be the correct command in order to identify the raw count of each transcript ?

samtools view -f 0x2 /tophat_out/accepted_hits.bam | coverageBed -abam /tophat_out/accepted_hits.bam -b transcripts.gtf > transcripts.proper.cov

Simon Anders
07-03-2011, 05:48 AM
Thank you for your reply. You are such a good man with patience. And there is another problem. I am runing HTSeq with pair-end and strand-specific RNA-seq data. It's very strange that most of the reads (> 90%) are counted as ambiguous. However, when I take the data as non strand-specific, HTSeq works very well. Waiting for your reply.
sincerely tangx
This is weird. Are you sure, you mean 'ambiguous', and not 'no_feature'?

oliviera
02-02-2012, 02:40 AM
In general, a multiply aligned read should be discarded. (Imagine, genes A and B have partial sequence identity. If A is differentially expressed and B is not, any read originating from A that matches to both A and B will let B appear as differentially expressed, too, if it is counted for both. Hence, the prudent strategy is to only count reads that map uniquely to a gene.)

Dear Simon,
I hope this thread will still find you well.
I am mapping reads from small RNAseq and would like to use HTseq/ DEseq. As miRNA map very often to more than one genomic location I would like to know your suggestions to deal with these multiple mapped reads with HTSeq. For what I saw on your previous feedback, you suggest to remove the multiple mapped reads but in my case that would mean that 80% of the reads are discarded... Could you please advise some methodology to use HTseq with miRNA data?

Olivier

Simon Anders
02-05-2012, 01:27 PM
Sorry, I've never worked myself with miRNA-Seq data. From what I've heard, a common approach is to first compile a list of all miRNA sequences in the genome, then find for each read a miRNA in the list. In other words: skip the alignment against the genome completely, instead just compare reads with miRNAs. Of course, htseq-count is not suitable for this, but with some Python knowledge, you can write your own script with HTSeq.

Baoqing
06-09-2013, 01:39 PM
Hi, Simon

I was trying to install HTseq on my local ubuntu machine by following your instruction below:
http://www-huber.embl.de/users/anders/HTSeq/doc/install.html#install

However, there are some error messages there. could you help me to get around this?

Thanks for your help!

bq@bq-VirtualBox:~/Desktop/apps/HTSeq-0.5.4p3$ ls
build_it doc HTSeq.egg-info MANIFEST.in README setup.cfg src
clean HTSeq LICENSE PKG-INFO scripts setup.py VERSION
bq@bq-VirtualBox:~/Desktop/apps/HTSeq-0.5.4p3$ python setup.py install --user
Could not import 'setuptools', falling back to 'distutils'.
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-2.7
creating build/lib.linux-x86_64-2.7/HTSeq
copying HTSeq/__init__.py -> build/lib.linux-x86_64-2.7/HTSeq
copying HTSeq/_HTSeq_internal.py -> build/lib.linux-x86_64-2.7/HTSeq
copying HTSeq/StepVector.py -> build/lib.linux-x86_64-2.7/HTSeq
copying HTSeq/_version.py -> build/lib.linux-x86_64-2.7/HTSeq
creating build/lib.linux-x86_64-2.7/HTSeq/scripts
copying HTSeq/scripts/__init__.py -> build/lib.linux-x86_64-2.7/HTSeq/scripts
copying HTSeq/scripts/qa.py -> build/lib.linux-x86_64-2.7/HTSeq/scripts
copying HTSeq/scripts/count.py -> build/lib.linux-x86_64-2.7/HTSeq/scripts
running build_ext
building 'HTSeq._HTSeq' extension
creating build/temp.linux-x86_64-2.7
creating build/temp.linux-x86_64-2.7/src
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/python2.7 -c src/_HTSeq.c -o build/temp.linux-x86_64-2.7/src/_HTSeq.o -w
gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro build/temp.linux-x86_64-2.7/src/_HTSeq.o -o build/lib.linux-x86_64-2.7/HTSeq/_HTSeq.so
building 'HTSeq._StepVector' extension
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/include/python2.7 -c src/StepVector_wrap.cxx -o build/temp.linux-x86_64-2.7/src/StepVector_wrap.o -w
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for Ada/C/ObjC but not for C++ [enabled by default]
g++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro build/temp.linux-x86_64-2.7/src/StepVector_wrap.o -o build/lib.linux-x86_64-2.7/HTSeq/_StepVector.so
running build_scripts
creating build/scripts-2.7
copying and adjusting scripts/htseq-qa -> build/scripts-2.7
copying and adjusting scripts/htseq-count -> build/scripts-2.7
changing mode of build/scripts-2.7/htseq-qa from 664 to 775
changing mode of build/scripts-2.7/htseq-count from 664 to 775
running install_lib
creating /home/bq/.local/lib/python2.7/site-packages/HTSeq
copying build/lib.linux-x86_64-2.7/HTSeq/_StepVector.so -> /home/bq/.local/lib/python2.7/site-packages/HTSeq
copying build/lib.linux-x86_64-2.7/HTSeq/_HTSeq.so -> /home/bq/.local/lib/python2.7/site-packages/HTSeq
copying build/lib.linux-x86_64-2.7/HTSeq/StepVector.py -> /home/bq/.local/lib/python2.7/site-packages/HTSeq
copying build/lib.linux-x86_64-2.7/HTSeq/_version.py -> /home/bq/.local/lib/python2.7/site-packages/HTSeq
copying build/lib.linux-x86_64-2.7/HTSeq/_HTSeq_internal.py -> /home/bq/.local/lib/python2.7/site-packages/HTSeq
creating /home/bq/.local/lib/python2.7/site-packages/HTSeq/scripts
copying build/lib.linux-x86_64-2.7/HTSeq/scripts/qa.py -> /home/bq/.local/lib/python2.7/site-packages/HTSeq/scripts
copying build/lib.linux-x86_64-2.7/HTSeq/scripts/count.py -> /home/bq/.local/lib/python2.7/site-packages/HTSeq/scripts
copying build/lib.linux-x86_64-2.7/HTSeq/scripts/__init__.py -> /home/bq/.local/lib/python2.7/site-packages/HTSeq/scripts
copying build/lib.linux-x86_64-2.7/HTSeq/__init__.py -> /home/bq/.local/lib/python2.7/site-packages/HTSeq
byte-compiling /home/bq/.local/lib/python2.7/site-packages/HTSeq/StepVector.py to StepVector.pyc
byte-compiling /home/bq/.local/lib/python2.7/site-packages/HTSeq/_version.py to _version.pyc
byte-compiling /home/bq/.local/lib/python2.7/site-packages/HTSeq/_HTSeq_internal.py to _HTSeq_internal.pyc
byte-compiling /home/bq/.local/lib/python2.7/site-packages/HTSeq/scripts/qa.py to qa.pyc
byte-compiling /home/bq/.local/lib/python2.7/site-packages/HTSeq/scripts/count.py to count.pyc
byte-compiling /home/bq/.local/lib/python2.7/site-packages/HTSeq/scripts/__init__.py to __init__.pyc
byte-compiling /home/bq/.local/lib/python2.7/site-packages/HTSeq/__init__.py to __init__.pyc
running install_scripts
creating /home/bq/.local/bin
copying build/scripts-2.7/htseq-qa -> /home/bq/.local/bin
copying build/scripts-2.7/htseq-count -> /home/bq/.local/bin
changing mode of /home/bq/.local/bin/htseq-qa to 775
changing mode of /home/bq/.local/bin/htseq-count to 775
running install_egg_info
Writing /home/bq/.local/lib/python2.7/site-packages/HTSeq-0.5.4p3.egg-info
bq@bq-VirtualBox:~/Desktop/apps$ python
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import HTSeq
>>>

Simon Anders
06-09-2013, 02:39 PM
I don't see any error messages.

"Could not import 'setuptools', falling back to 'distutils'." -- Well, that's what a fall-back is for: If you don't have setuptools, it uses distutils which works as well.

Baoqing
06-09-2013, 03:04 PM
sorry, just start to pick up the python, any reminder message freak me out.

Just set up the path right for htseq-count

then run it with my bam files:
samtools view accepted_hits.bam | htseq-count - gene.gtf > counts.txt

While the program was running, there was a warning "claimed to have an aligned mate which could be found. (Is the sam file properly sorted?)

Does this warning "complaining that my file is not pair mated, or do i have to add other arguments for this line to run?

Thanks a lot for your help!

Best,

rboettcher
06-10-2013, 01:35 AM
Your bam/sam file needs to be name sorted for HTSeq-count, see "samtools sort -n".

Regards

Baoqing
06-10-2013, 07:26 AM
Thank you, I did sort my bam files though. I have the program to finished running, and it turned out a dataset was produced, does that mean everything is good? Should I just ignore the warning message?

Best,

dpryan
06-10-2013, 08:59 AM
Thank you, I did sort my bam files though. I have the program to finished running, and it turned out a dataset was produced, does that mean everything is good? Should I just ignore the warning message?

Best,

See this thread (http://seqanswers.com/forums/showthread.php?t=30380), that's likely what's going on.

Baoqing
06-10-2013, 12:29 PM
awesome. The name sorted bam file worked perfect! Really appreciate it!

gwilymh
01-19-2014, 03:06 PM
If I understand your question correctly, BEDTools (http://bedtools.googlecode.com) has a utility called coverageBed that will compute the raw counts of reads in a BAM file among features in a BED or GTF file.

For example:
$ coverageBed -abam alignedReads.bam -b transcripts.gtf > transcript.cov.txt

There is also a histogram (-hist) mode that will give you a histogram of coverage for each feature in the -b file. I have not yet updated the manual , but the help (-h) for this tools describes the output format.

Just to clarify, is the depth of coverage of each feature synonymous with the raw counts of reads?