SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
STAR: ultrafast universal RNA-seq aligner alexdobin Bioinformatics 218 04-02-2018 05:59 PM
Primer design Universal tailed Amplicon paloma84 454 Pyrosequencing 2 11-26-2012 11:33 AM
Program edit an ace file to simultaneously extract read information from all contig? cllorens Bioinformatics 3 06-27-2012 08:20 AM
PubMed: Target-Enrichment Through Amplification of Hairpin-Ligated Universal Targets Newsbot! Literature Watch 0 03-25-2011 06:10 AM
PubMed: Development and quantitative analyses of a universal rRNA-subtraction protoco Newsbot! Literature Watch 1 03-12-2010 05:41 PM

Reply
 
Thread Tools
Old 11-05-2013, 12:27 PM   #61
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

What Bruce suggested is the current way of getting assignment stats from using featureCounts. But apparently it is not user-friendly. It is on our to-do list to output assignment stats (with no need to turn on -R option) when read summarization is done, but it will take a week or two to implement it. We will make a new release in a day or so, but it is not going to be included in that release.

Best,
Wei
shi is offline   Reply With Quote
Old 11-12-2013, 09:22 PM   #62
bw.
Member
 
Location: San Francisco, CA

Join Date: Mar 2012
Posts: 21
Default

Thank you Wei. I switched to using featureCounts.
Its especially great that it now takes multiple .bam files.

Whatever statistics can be provided in future versions will be helpful.
bw. is offline   Reply With Quote
Old 11-14-2013, 08:17 PM   #63
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

We have just released Subread 1.4.2. The featureCounts program included in this release outputs an assignment summary file (*.summary) along with the read count file.

BTW, our featureCounts paper was just published on Bioinformatics. Here is the link to it:

http://bioinformatics.oxfordjournals...6F&keytype=ref

Wei
shi is offline   Reply With Quote
Old 12-16-2013, 11:54 AM   #64
AntonioMaceo
Junior Member
 
Location: San Francisco

Join Date: Dec 2013
Posts: 1
Default SAM/BAM parse error

Hi Wei,
I may have found a bug. It seems that the last line of my BAM file header is being parsed as the fist read, since it is quite long this causes the program to exit (line 697 in core.c). When I remove the last @PG line from the header and reheader the bam file your code runs to completion. Please see the attached header.
cheers,
Aaron
Attached Files
File Type: txt header.txt (3.8 KB, 12 views)
AntonioMaceo is offline   Reply With Quote
Old 12-17-2013, 02:03 AM   #65
chibouki
Junior Member
 
Location: France

Join Date: Dec 2011
Posts: 3
Default

Hello,
I think I may have a problem with my gtf file, which look like this:
##gff-version 3
##source-version geneious 6.1.6
seq_ref Geneious gene 1 109 . . . Name=exon1;created by=User

I received this error message:
Warning: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
The specified gene identifier attribute is 'exon'
The attributes included in your GTF annotation are 'Name=exon1;created by=User;modified by=User'

I hope you could help me
chibouki is offline   Reply With Quote
Old 12-17-2013, 01:00 PM   #66
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Quote:
Originally Posted by AntonioMaceo View Post
Hi Wei,
I may have found a bug. It seems that the last line of my BAM file header is being parsed as the fist read, since it is quite long this causes the program to exit (line 697 in core.c). When I remove the last @PG line from the header and reheader the bam file your code runs to completion. Please see the attached header.
cheers,
Aaron
Dear Aaron,

Thanks for reporting this. You are correct that there was a bug in featureCounts in dealing with long header lines. We have fixed this and released a patched version of Subread package (1.4.3-p1).

Best wishes,
Wei
shi is offline   Reply With Quote
Old 12-17-2013, 01:06 PM   #67
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Quote:
Originally Posted by chibouki View Post
Hello,
I think I may have a problem with my gtf file, which look like this:
##gff-version 3
##source-version geneious 6.1.6
seq_ref Geneious gene 1 109 . . . Name=exon1;created by=User

I received this error message:
Warning: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
The specified gene identifier attribute is 'exon'
The attributes included in your GTF annotation are 'Name=exon1;created by=User;modified by=User'

I hope you could help me

Your GTF file has an incorrect format. Have a look at this page for GTF format:

http://mblab.wustl.edu/GTF2.html
shi is offline   Reply With Quote
Old 12-17-2013, 01:17 PM   #68
chibouki
Junior Member
 
Location: France

Join Date: Dec 2011
Posts: 3
Default

Finally, we solve the problem by cutting the last column of sam file.

But we are disappointing, it seems that -d and -D options are only for pair-end?
chibouki is offline   Reply With Quote
Old 12-17-2013, 04:11 PM   #69
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

I don't understand how you solved the problem by changing the sam file? The problem is with your GTF annotation file, not your sam file.
shi is offline   Reply With Quote
Old 12-17-2013, 11:01 PM   #70
chibouki
Junior Member
 
Location: France

Join Date: Dec 2011
Posts: 3
Default

I know, I stilled receive an error message about the column nine of the gtf (I may have changed it since I posted here for the first time...) but it worked online after cutting the sam
chibouki is offline   Reply With Quote
Old 12-17-2013, 11:40 PM   #71
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Do not change your sam file, otherwise you may get unexpected results. Just change the 9th column of your gtf from for example

Name=exon1;created by=User;modified by=User

to

gene_id "exon1" (space delimited)
shi is offline   Reply With Quote
Old 01-31-2014, 12:54 PM   #72
sindrle
Senior Member
 
Location: Norway

Join Date: Aug 2013
Posts: 266
Default

Hi!
Im trying your package, using UCSC hg19 genes.GTF.

Im getting this error:

|| Load annotation file genes.gtf ... ||
|| Number of features is 0 ||
|| WARNING no features were loaded in format GTF. ||
|| annotation format can be specified using '-F'.

This is my code:

featureCounts(files=BAMs,file., annot.ext=gtf,isGTFAnnotationFile=TRUE,useMetaFeatures=TRUE,GTF.featureType=featureGENE,GTF.attrType=attributeGENE,nthreads=8, reportReads=TRUE)

"BAMs" are a string character with the file names
"gtf" is the gene.gtf
"attributeGENE" is "gene_id"

EDIT:
Had to delete this: GTF.attrType=attributeGENE
guess it was unnecessary

Last edited by sindrle; 01-31-2014 at 04:53 PM.
sindrle is offline   Reply With Quote
Old 01-31-2014, 05:22 PM   #73
sindrle
Senior Member
 
Location: Norway

Join Date: Aug 2013
Posts: 266
Default

Here is my comparison so far:

HS 68,2 56,9 57,2 64,7 59,7 63,8 60,9
FC 62.5 56.8 57.1 64.6 59.5 63.6 60.7

HTseq was run with "intersection strict", so maybe thats why it counts a little bit more reads I guess.

I like that featureCounts is implemented in R, its easy to run
On my Macbook the running time is about the same as for HTseq, but maybe its because I had the option "reportReads = TRUE".

I also like that it reports gene length and may handle multiple input files, it also names the output according to the input automatically.

What will make or break it is how easily its implemented with the DEseq2 workflow.. HTSeq is very easy to use.

Last edited by sindrle; 01-31-2014 at 05:55 PM.
sindrle is offline   Reply With Quote
Old 02-01-2014, 12:38 AM   #74
yangliao
Member
 
Location: Melboune, VIC

Join Date: May 2013
Posts: 13
Default

Dear Sindrle,

Could you please provide a couple of lines in "UCSC hg19 genes.GTF"? There are some GTF or GFF files not in the format that is currently supported by featureCounts.

Yang

Quote:
Originally Posted by sindrle View Post
Hi!
Im trying your package, using UCSC hg19 genes.GTF.

Im getting this error:

|| Load annotation file genes.gtf ... ||
|| Number of features is 0 ||
|| WARNING no features were loaded in format GTF. ||
|| annotation format can be specified using '-F'.

This is my code:

featureCounts(files=BAMs,file., annot.ext=gtf,isGTFAnnotationFile=TRUE,useMetaFeatures=TRUE,GTF.featureType=featureGENE,GTF.attrType=attributeGENE,nthreads=8, reportReads=TRUE)

"BAMs" are a string character with the file names
"gtf" is the gene.gtf
"attributeGENE" is "gene_id"

EDIT:
Had to delete this: GTF.attrType=attributeGENE
guess it was unnecessary
yangliao is offline   Reply With Quote
Old 02-01-2014, 02:42 AM   #75
sindrle
Senior Member
 
Location: Norway

Join Date: Aug 2013
Posts: 266
Default

Im a bit disappointed, after finishing read summary after 5.5 hours, R just crashed.. R version 3.0.2 Rsubread version 1.12.6.

Maybe I can read the "reported output", the .featureCount files thats 2x the size of the BAMs (!)

Heres the GTF:

chr1 unknown CDS 69091 70005 . + 0 gene_id "OR4F5"; gene_name "OR4F5"; p_id "P1230"; transcript_id "NM_001005484"; tss_id "TSS14428";
chr1 unknown exon 69091 70008 . + . gene_id "OR4F5"; gene_name "OR4F5"; p_id "P1230"; transcript_id "NM_001005484"; tss_id "TSS14428";
chr1 unknown start_codon 69091 69093 . + . gene_id "OR4F5"; gene_name "OR4F5"; p_id "P1230"; transcript_id "NM_001005484"; tss_id "TSS14428";
chr1 unknown stop_codon 70006 70008 . + . gene_id "OR4F5"; gene_name "OR4F5"; p_id "P1230"; transcript_id "NM_001005484"; tss_id "TSS14428";
chr1 unknown exon 134773 139696 . - . gene_id "LOC729737"; gene_name "LOC729737"; transcript_id "NR_039983"; tss_id "TSS18541";
chr1 unknown exon 139790 139847 . - . gene_id "LOC729737"; gene_name "LOC729737"; transcript_id "NR_039983"; tss_id "TSS18541";
sindrle is offline   Reply With Quote
Old 02-01-2014, 04:11 AM   #76
yangliao
Member
 
Location: Melboune, VIC

Join Date: May 2013
Posts: 13
Default

Dear Sindrle,

Sorry for the crash of R -- did it crash within a function of Rsubread or somewhere else? The ".featureCount" files are indeed very big -- they contain the assignment result of each single read (or fragment) in the input SAM or BAM files, and is uncompressed. That is why they are usually larger than the input BAM files.

The format of the GTF file seemed perfectly correct. I tried to put the first couple of lines you provided into a GTF file, and ran featureCounts on a paired-end BAM file. The result seemed correct -- featureCounts loaded 3 features and 2 meta-features from the GTF file, and assigned 286 reads to the 2 meta-features. The R object returned from featureCounts has the correct numbers on each meta-feature.

The only problem is that your command line has an additional parameter "GTF.featureType=featureGENE". What was the value of variable featureGENE? The value should be "exon" for the GTF file you're using.

Cheers,

Yang


Code:
        ==========     _____ _    _ ____  _____  ______          _____
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 1.12.6

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           o A.bam                                          ||
||                                                                            ||
||             Output file : ./.Rsubread_featureCounts_pid24234               ||
||             Annotations : in.gtf (GTF)                                     ||
||                                                                            ||
||                 Threads : 1                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : no                                               ||
||         Strand specific : no                                               ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : not counted                                      ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file in.gtf ...                                            ||
||    Number of features is 3                                                 ||
||    Number of meta-features is 2                                            ||
||    Number of chromosomes is 1                                              ||
||                                                                            ||
|| Process BAM file A.bam...                                                  ||
||    Assign reads to features...                                             ||
||    Total number of reads is : 1124980                                      ||
||    Number of successfully assigned reads is : 286 (0.0%)                   ||
||    Running time : 0.05 minutes                                             ||
||                                                                            ||
||                         Read assignment finished.                          ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//




Quote:
Originally Posted by sindrle View Post
Im a bit disappointed, after finishing read summary after 5.5 hours, R just crashed.. R version 3.0.2 Rsubread version 1.12.6.

Maybe I can read the "reported output", the .featureCount files thats 2x the size of the BAMs (!)

Heres the GTF:

chr1 unknown CDS 69091 70005 . + 0 gene_id "OR4F5"; gene_name "OR4F5"; p_id "P1230"; transcript_id "NM_001005484"; tss_id "TSS14428";
chr1 unknown exon 69091 70008 . + . gene_id "OR4F5"; gene_name "OR4F5"; p_id "P1230"; transcript_id "NM_001005484"; tss_id "TSS14428";
chr1 unknown start_codon 69091 69093 . + . gene_id "OR4F5"; gene_name "OR4F5"; p_id "P1230"; transcript_id "NM_001005484"; tss_id "TSS14428";
chr1 unknown stop_codon 70006 70008 . + . gene_id "OR4F5"; gene_name "OR4F5"; p_id "P1230"; transcript_id "NM_001005484"; tss_id "TSS14428";
chr1 unknown exon 134773 139696 . - . gene_id "LOC729737"; gene_name "LOC729737"; transcript_id "NR_039983"; tss_id "TSS18541";
chr1 unknown exon 139790 139847 . - . gene_id "LOC729737"; gene_name "LOC729737"; transcript_id "NR_039983"; tss_id "TSS18541";

Last edited by yangliao; 02-01-2014 at 04:21 AM.
yangliao is offline   Reply With Quote
Old 02-01-2014, 04:25 AM   #77
sindrle
Senior Member
 
Location: Norway

Join Date: Aug 2013
Posts: 266
Default

Quote:
Originally Posted by yangliao View Post
The only problem is that your command line has an additional parameter "GTF.featureType=featureGENE". What was the value of variable featureGENE? The value should be "exon" for the GTF file you're using.
I see!

R crashed making the "$counts" table.

featureGENE = "gene_id"

My goal was to count reads "at gene level", thus adding all exon reads for one gene together.

Is it, be the way, possible to count more than one feature in each run?
sindrle is offline   Reply With Quote
Old 02-01-2014, 05:55 AM   #78
yangliao
Member
 
Location: Melboune, VIC

Join Date: May 2013
Posts: 13
Default

It is good that we have found the problem.

Every meta-feature in featureCounts can have one or multiple features. Each row in the GTF file is a feature if the third column equals the feature type you wanted (e.g., "exon", or any value defined in variable "GTF.featureType"). FeatureCounts then groups the features that have the same meta-feature-id into meta-features.

For example, if a row in the GTF file contains gene_id "LOC729737", then the meta-feature-id for this feature is LOC729737.

If the attribute name is not "gene_id" in the GTF file, you need to specify the correct attribute name. For example, if a row in the GTF file contains feature-name "LOC729737", you need to give a parameter telling featureCounts the new attribute name: GTF.attrType="feature-name", so that featureCounts can find the correct meta-feature-id by searching the attribute name (feature-name) in every row in the GTF file.

Thereby, if you want to assign reads on meta-feature level, please set GTF.attrType="gene_id", GTF.featureType="exon" and useMetaFeatures=TRUE.

If you want to assign reads on feature level, please set all variables the same values as above, but only useMetaFeatures=FALSE. FeatureCounts will do what it should do.

FeatureCounts counts reads for every gene in one single run. The returned R object contains a count table where the rows are meta-features (or features, depending on what you need) , and the columns are the input SAM/BAM file names. Every number in the table is the number of reads (or fragments if you specified isPairedEnd=TRUE) in that input file assigned to that gene.

Cheers,

Yang

Quote:
Originally Posted by sindrle View Post
I see!

R crashed making the "$counts" table.

featureGENE = "gene_id"

My goal was to count reads "at gene level", thus adding all exon reads for one gene together.

Is it, be the way, possible to count more than one feature in each run?
yangliao is offline   Reply With Quote
Old 02-01-2014, 06:03 AM   #79
sindrle
Senior Member
 
Location: Norway

Join Date: Aug 2013
Posts: 266
Default

Thank you for the clarification! Then I know what I did wrong, and maybe thats why it crashed..
sindrle is offline   Reply With Quote
Old 02-02-2014, 07:21 AM   #80
sindrle
Senior Member
 
Location: Norway

Join Date: Aug 2013
Posts: 266
Default

Quick question, Im trying to run featureCounts with DEXSeq.

> head(adiExons$counts)
D104.bam D121.bam D153.bam D155.bam D161.bam D162.bam D167.bam D173.bam
WASH7P 0 0 0 0 0 0 0 0
WASH7P 9 4 2 6 4 5 5 7
WASH7P 8 1 2 5 0 1 0 2

How do I know which exon these are? Or more importantly, how does DEXSeq get the information it needs?
sindrle is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 10:30 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO