SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
STAR: ultrafast universal RNA-seq aligner alexdobin Bioinformatics 218 04-02-2018 06:59 PM
Primer design Universal tailed Amplicon paloma84 454 Pyrosequencing 2 11-26-2012 12:33 PM
Program edit an ace file to simultaneously extract read information from all contig? cllorens Bioinformatics 3 06-27-2012 09:20 AM
PubMed: Target-Enrichment Through Amplification of Hairpin-Ligated Universal Targets Newsbot! Literature Watch 0 03-25-2011 07:10 AM
PubMed: Development and quantitative analyses of a universal rRNA-subtraction protoco Newsbot! Literature Watch 1 03-12-2010 06:41 PM

Reply
 
Thread Tools
Old 06-01-2013, 03:48 AM   #21
Nicolas Nalpas
Member
 
Location: Ireland

Join Date: Apr 2011
Posts: 19
Default

Dear Wei,

I just had another question about featureCounts and the strand specificity parameter.
We have a strand-specific RNA-seq dataset, on which we initially used HTseq-count, so as you know HTseq-count has three options for the strand-specific parameter: --stranded no or --stranded yes or --stranded reverse and in our case the appropriate options is actually the latter: --stranded reverse. So now we want to use featureCounts (as we found it very quick and accurate) on this dataset, however the current standed options are: unstranded by default or -s for stranded. So after testing we understood that featureCounts unstranded correspond to HTseq-count --stranded no and that featureCounts -s correspond to HTseq-count --stranded yes. So as I said above the option that we want for this dataset is really --stranded reverse, which we were not able to find in featureCounts.
So I was wondering if I missed something in the featureCounts users guide and this option is already available? Or if it is not possible to do that, I was wondering if you have any plan to incorporate such option?

Thanks a lot for your help.
Regards,
Nicolas
Nicolas Nalpas is offline   Reply With Quote
Old 06-01-2013, 04:45 PM   #22
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Dear Nicolas,

We have just modified featureCounts to make it be able to perform reverse-stranded read counting. The latest version can be found in Subread v1.3.3-p5 (http://subread.sourceforge.net).

Providing a parameter '-s 2' to featureCounts will instruct it to perform reverse-stranded read counting.

Thank you for your suggestion and I hope it works for your dataset.

Best wishes,

Wei
shi is offline   Reply With Quote
Old 06-02-2013, 05:41 AM   #23
Nicolas Nalpas
Member
 
Location: Ireland

Join Date: Apr 2011
Posts: 19
Default

Dear Wei,
This is great, thanks a lot for being so helpful and available.
I will give this a try on our dataset.
Thanks a lot.
Regards,
Nicolas
Nicolas Nalpas is offline   Reply With Quote
Old 06-03-2013, 12:55 AM   #24
iris_aurelia
Member
 
Location: Netherlands

Join Date: Jul 2012
Posts: 22
Default

Hi Wei,

I have tested your test-dataset and it indeed gave the same results for both HTseq-count as FeatureCounts.
I think the difference is made by the NH tag in my BAM/SAM file. When I tried a single read that mapped 5 times it gives a different result.
According the documentation of HTseq-count it excludes all reads with more than one reported alignment (whenever the NH tag is available), which I believe FeatureCounts does not.
Using the mapping quality filter in FeatureCounts, it will produce the same result, since reads with more than one reported alignment have a lower mapping quality.

I have used Tophat for the alignment of my reads, followed by HTseq-count as well as FeatureCounts. I've attached a test set with 5 multimapped hits from the data I got from TopHat.
The commands I used were:

HTseq-count:
Code:
htseq-count MMread.sam annotation.gtf -s no -q > genecounts_MMread.txt
FeatureCounts:
Code:
featureCounts -a annotation.gtf -t exon -g 'gene_id' -i MMread.sam -T 5 -o feature_counts_MMread.txt
Thanks for the quick responses!

Iris
Attached Files
File Type: zip testdataMM.zip (1.6 KB, 6 views)
iris_aurelia is offline   Reply With Quote
Old 06-03-2013, 07:34 PM   #25
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Dear Iris,

Thanks for looking into this deeply and providing a reproducible example.

You were correct that featureCounts did not exclude multi-mapping reads from the summarization. We have just added a '-M' option to featureCounts to enable it to disregard multi-mapping reads during read summarization. When '-M' is provided, featureCounts will use the 'NH' tag to identify multi-mapping reads and then exclude them.

These changes have been incorporated into the latest version (1.3.4) of the Subread package. You can download it from http://subread.sourceforge.net . We have tested it and it worked fine.

Please let me know if you run into any problems.

Best wishes,

Wei
shi is offline   Reply With Quote
Old 06-03-2013, 09:57 PM   #26
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Quote:
Originally Posted by shi View Post
Dear Iris,

Thanks for looking into this deeply and providing a reproducible example.

You were correct that featureCounts did not exclude multi-mapping reads from the summarization. We have just added a '-M' option to featureCounts to enable it to disregard multi-mapping reads during read summarization. When '-M' is provided, featureCounts will use the 'NH' tag to identify multi-mapping reads and then exclude them.

These changes have been incorporated into the latest version (1.3.4) of the Subread package. You can download it from http://subread.sourceforge.net . We have tested it and it worked fine.

Please let me know if you run into any problems.

Best wishes,

Wei

Dear Iris,

Sorry, we have just flipped the meaning of the '-M' option. So now when running featureCounts in its default setting, multi-mapping reads will be excluded. The '-M' option is only useful if you want to count those multi-mapping reads (this is certainly not your case here).

The latest version is still 1.3.4 (the usage info for featureCounts and users guide has been updated).

Best wishes,

Wei
shi is offline   Reply With Quote
Old 06-04-2013, 12:03 AM   #27
iris_aurelia
Member
 
Location: Netherlands

Join Date: Jul 2012
Posts: 22
Default

Hi Wei,

Thanks, I'm definitely gonna try this out, since featureCounts is much faster than HTseq-count.
I'll let you know if I run into any other problems!

Iris
iris_aurelia is offline   Reply With Quote
Old 06-11-2013, 10:15 AM   #28
Nicolas Nalpas
Member
 
Location: Ireland

Join Date: Apr 2011
Posts: 19
Default

Dear Wei,


We have now tried featureCounts v1.3.5-p1 to perform unstranded, stranded and reverse-stranded read counting on our new strand-specific dataset. While the unstranded parameter for featureCounts gives results very similar to HTseq-count, it seems that when it comes to stranded or reverse-stranded parameters the results between featureCounts and HTseq-count are quite divergent (see pdf file enclosed). I am actually expecting results similar to the one provided by HTseq-count.

I may be wrong but I believe that at the moment, featureCounts is assigning reads only to genes present on the + strand when using the stranded (s 1) parameter. And when using the reverse stranded (s 2) parameter, reads will be assigned only to genes present on the strand.

However, it does not correspond to what I would call strand-specific read counts (I cannot talk for other RNA-seq users, but other RNA-seq users in our lab agree with me). Indeed, s 1 and s 2 parameters should in fact be one single parameter since users want read counts summarisation for genes present on + and strands. Strand-specific is not the difference between genes present on the + or strand.

What strand-specific actually means is that:
- (1) A read mapping to the + strand (for example read with flag 0 or 99 [its mate having flag of 147]) can be assigned to a gene present on the + strand (in this case it is a sense read)
- (2) A read mapping to the strand (for example read with flag 16 or 83 [its mate having flag of 163]) can be assigned to a gene present on the - strand (in this case it is also a sense read)
- (3) A read mapping to the + strand (for example read with flag 0 or 99 [its mate having flag of 147]) can be assigned to a gene present on the - strand (in this case it is an antisense read)
- (4) A read mapping to the strand (for example read with flag 16 or 83 [its mate having flag of 163]) can be assigned to a gene present on the + strand (in this case it is also an antisense read)

So I believe that, firstly when using the unstranded parameter, gene counts summarisation should contain summarisation for all reads corresponding to these 4 possibilities listed above; secondly when using the stranded (s 1) parameter, gene counts summarisation should contain summarisation for all reads corresponding to possibility (1) and (2); and thirdly when using the reverse-stranded (s 2) parameter, gene counts summarisation should contain summarisation for all reads corresponding to possibility (3) and (4).

So it also means that if a read is overlapping a gene present on the + strand and a gene present on the strand, this read should be considered as overlapped_genes (and not counted unless specified by O parameter on command line). In other terms, when you add number of reads with accepted-genes obtained from s 1 and s 2, you should obtain very close to the number obtained from s 0, indeed a read counted in s 1 should not also be counted in s 2.

Thanks a lot for all your help with this, I hope it was understandable (not the easiest to explain via email, otherwise let me know). And if I may suggest, it will be great if you can check these unstranded, stranded and reverse-stranded parameters for read summarisation.
Thanks a lot,
Nicolas
Attached Files
File Type: pdf Comparison_count.pdf (170.0 KB, 34 views)
Nicolas Nalpas is offline   Reply With Quote
Old 06-11-2013, 05:05 PM   #29
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Dear Nicolas,

Thanks for reporting the problem in much details. I certainly agree with your explanation of the stranded and unstranded read summarization and this is actually the counting scheme we implemented in featureCounts program. So I'm a bit confused with what you have observed.

Is it possible that you can provide us the first 10 reads you used for summarization and also your command for running featureCounts? This will help us to figure out what went wrong. Thanks.


Best wishes,

Wei
shi is offline   Reply With Quote
Old 06-12-2013, 01:31 AM   #30
Nicolas Nalpas
Member
 
Location: Ireland

Join Date: Apr 2011
Posts: 19
Default

Dear Wei,

Thanks for your reply.
Enclosed is a file containing my first 5,000 mapped reads (forgot to say in my earlier post that I used STAR to align my reads against Bos taurus 3.1.71 genome).
I checked again featureCounts on this file and I was able to reproduce my results of yesterday (using Bos taurus 3.1.71 annotation).
Also here is my commands for unstranded, stranded and reverse stranded:
Code:
featureCounts -a /workspace/storage/genomes/bostaurus/UMD3.1.71/annotation_file/Bos_taurus.UMD3.1.71.gtf -t exon -g gene_id -i 5000_reads.sam -o test.unstranded -s 0 -R -p -B
featureCounts -a /workspace/storage/genomes/bostaurus/UMD3.1.71/annotation_file/Bos_taurus.UMD3.1.71.gtf -t exon -g gene_id -i 5000_reads.sam -o test.stranded -s 1 -R -p -B
featureCounts -a /workspace/storage/genomes/bostaurus/UMD3.1.71/annotation_file/Bos_taurus.UMD3.1.71.gtf -t exon -g gene_id -i 5000_reads.sam -o test.reverse -s 2 -R -p -B
I hope this help, thanks a lot for all your help.
Regards,
Nicolas
Attached Files
File Type: gz 5000_reads.sam.gz (408.8 KB, 16 views)
Nicolas Nalpas is offline   Reply With Quote
Old 06-12-2013, 04:15 PM   #31
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Dear Nicolas,

Many thanks for sending us your read data which helped us to identify the problem. The problem was that in your SAM file about half the fragments had their forward read appearing before their reverse read and the other half had their forward read appearing after their reverse read. This is not what featureCounts expects to see since most aligners always put the forward read before the reverse read in their mapping output.

The change of read order does not affect unstranded read counting with featureCounts, but it resulted in around half of the fragments not being counted when stranded read counting was performed, which is what you observed.

Nonetheless, we are now modifying featureCounts to deal with this situation and a patch will be released soon.

Best wishes,

Wei
shi is offline   Reply With Quote
Old 06-12-2013, 04:23 PM   #32
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

I've written several utilities for post-processing on sam/bam files and I've learned that aligners definitely don't all report things in the same way. Another one is how they report singleton mates. Seems like there are a couple conventions for this.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 06-12-2013, 07:54 PM   #33
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Dear Nicolas,

We have just fixed this issue. Please try on the latest version 1.3.5-p3 (http://subread.sourceforge.net).

We have tested it and it worked fine in our end. Let me know if you ran into other problems.


Best wishes,

Wei
shi is offline   Reply With Quote
Old 06-13-2013, 02:00 AM   #34
Nicolas Nalpas
Member
 
Location: Ireland

Join Date: Apr 2011
Posts: 19
Default

Dear Wei,
I have just tried the new version of featureCounts, it works perfectly (see below).

Code:
## UNSTRANDED ##	accepted_gene	multi_mapping	Notfound_gene	Overlapped_genes
featureCounts with SAM containing only unique	 3,181,215 	 -   	 794,232 	 11,469 
Htseq-count with SAM containing only unique	 3,174,243 	 -   	 810,359 	 2,314
				
## STRANDED ##	accepted_gene	multi_mapping	Notfound_gene	Overlapped_genes
featureCounts with SAM containing only unique	 2,784,181 	 -   	 1,198,726 	 4,009 
Htseq-count with SAM containing only unique	 2,783,943 	 -   	 1,202,275 	 698
				
## STRANDED REVERSE ##	accepted_gene	multi_mapping	Notfound_gene	Overlapped_genes
featureCounts with SAM containing only unique	 421,127 	 -   	 3,564,919 	 870 
Htseq-count with SAM containing only unique	 420,840 	 -   	 3,566,004 	 72
I am definitely switching to featureCounts for my new analyses, since it really is an amazing software.
Thanks again for all your help and for always being available.
Best wishes,
Nicolas
Nicolas Nalpas is offline   Reply With Quote
Old 06-13-2013, 04:20 AM   #35
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Dear Nicolas,

Thanks for letting me know it worked for you. That's great. We really appreciate you putting up with our bugs and helping us to improve the program. I hope you find it useful for your analyses in the future.

Best wishes,

Wei
shi is offline   Reply With Quote
Old 06-14-2013, 03:10 AM   #36
bruce01
Senior Member
 
Location: .

Join Date: Mar 2011
Posts: 157
Default

Hi Wei,

I am wondering about some results I have had with tests I am running. I have been looking at appropriate quality cutoff values using Trimmomatic to trim reads based on quality score. I have tested multiple Q scores. Run parameters in featureCounts:

Code:
featureCounts -a /home/bin/my.gtf \
-i /home/data/quality/Q20/Aligned.out.sam \ 
-o /home/data/quality/Q20/Q20.featco.trimmo.counts \
-T 12 -p -P -C -R -D 500000
I run:

Code:
cut -f 2 *.counts.reads | sort | uniq -c > *.diags
giving me:

Code:
::::::::::::::
Q20.diags
::::::::::::::
  87805 ACCEPTED_2VOTE_GENE
14507634 ACCEPTED_GENE
4345122 MULTI_MAPPING
2315513 NOTFOUND_GENE
8709356 OVERLAPPED_GENES
  20001 PAIR_DISTANCE
::::::::::::::
Q40.diags
::::::::::::::
 149517 ACCEPTED_2VOTE_GENE
22640462 ACCEPTED_GENE
3425942 MULTI_MAPPING
4112768 NOTFOUND_GENE
  96425 OVERLAPPED_GENES
   2464 PAIR_DISTANCE
Can you see any reason that the untrimmed (ie Q40) alignment would have such a vastly reduced number of OVERLAPPED_GENES? Both input SAMs made in the same way using STAR with same parameters.

Anyone else with any ideas also very welcome to contribute!
bruce01 is offline   Reply With Quote
Old 06-14-2013, 05:35 AM   #37
yangliao
Member
 
Location: Melboune, VIC

Join Date: May 2013
Posts: 13
Default

Sorry that I misunderstood the issue.

Last edited by yangliao; 06-14-2013 at 06:18 AM.
yangliao is offline   Reply With Quote
Old 06-14-2013, 06:55 AM   #38
bruce01
Senior Member
 
Location: .

Join Date: Mar 2011
Posts: 157
Default

Quote:
Originally Posted by yangliao View Post
Sorry that I misunderstood the issue.
Still, I will try the Q filtering of featureCounts as it will remove a preprocessing step of the data. Thanks for your input.
bruce01 is offline   Reply With Quote
Old 06-16-2013, 05:01 PM   #39
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Dear bruce01,

The featureCounts results just tell you that after trimming you got a lot more fragments mapped to regions where there is more than one gene residing. In fact, for the untrimmed reads you got about 23 million fragments successfully assigned (the groups 'ACCEPTED_GENE' and 'ACCEPTED_2VOTE_GENE'), whereas for the trimmed reads you got only about 15 million fragments assigned. So you lost 8 million assigned fragments. I bet these 8 million reads were almost all affected by your trimming and their mapping locations were changed. But the strange thing here is why they were all remapped to the regions where there are overlapping genes? You will have to ask the developers of the aligner you used about this.

Using Q20 to filter out read bases seems to result in the removal of a lot of good quality bases. This will make the mapping of RNA-seq data harder because reads become shorter, especially for those exon-spanning reads, from my point of view. We do not perform read trimming when mapping RNA-seq reads. Part of the reason for this is that we use the Subread aligner which can tolerate quite a few sequencing errors.


Hope this helps.

Best wishes,

Wei
shi is offline   Reply With Quote
Old 06-24-2013, 06:16 AM   #40
Nicolas Nalpas
Member
 
Location: Ireland

Join Date: Apr 2011
Posts: 19
Default

Dear Wei,

Really sorry to bother you again, I have just one question as a matter of interest regarding featureCounts.

How do you handle your read count summarisation, when is a read assigned to a feature exactly? I know that HTseq-count has three summarisation modes (see pictures enclosed): union, intersection_strict, intersection_nonempty; so I am just interested to know what is yours (sorry I was not able to find the information in the users guide and paper).



Thanks a lot for all your help.
Best wishes,
Nicolas
Nicolas Nalpas is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 11:00 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO