Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Since you specified '-O' option, which instructs featureCounts to assign the read to all its overlapping features (they are exons in this case since you specified '-f'), reads will be assigned to all their overlapping exons. That says a read will be counted multiple times if it overlaps multiple exons. You can see this in the assignment results for two of the exons:

    ENSG00000223972 1 13221 14409 + 1189 4
    ENSG00000223972 1 13225 14412 + 1188 4

    If you do not specify the '-O' option, reads overlapping with multiple exons will not be counted. Thus, you possibly will get zero count for both the two exons above.

    Comment


    • Im actually struggling with the following, I want to easily create a DESeqDataSet from FeatureCounts. My goal is to use the
      fpkm() function within DESeq2, not the one in FC.

      In the DESeq2 help it says Ill get a matrix which is normalized per kilobase of the union of basepairs in the GRangesList or GRanges
      of the mcols(object), and per million of mapped fragments, either using the robust median ratio
      method (robust=TRUE, default) or using raw counts (robust=FALSE). Defining a column mcols(object)$basepairs takes precedence over internal calculation of the kilobases for each row.

      My question is how I get the mcols(object)$basepairs from FC? Does it correspond to the fc$annotation$Length column?

      So this is what I did, but is it correct?

      mcols(ddsFullCountTable) <- fc$annotation$Length
      mcols(ddsFullCountTable)$value
      mcols(ddsFullCountTable) <- rename(mcols(ddsFullCountTable), c("value"="basepairs"))
      mcols(ddsFullCountTable)$basepairs
      fpkm <- fpkm(ddsFullCountTable)
      Last edited by sindrle; 05-27-2014, 02:10 PM. Reason: Nevermind.. I was tired in the head.

      Comment


      • I'm not sure what mcols(object)$basepairs represents, but fc$annotation$Length gives you the total number of exonic bases each gene has (each exonic base is counted exactly once). You probably will get more help if you post to bioconductor mailing list.

        You may also have a look at a case study of using featureCounts in RNA-seq data analysis, which includes the step of generating RPKM values for genes if that's what you want to have:

        Comment


        • Hi Everyone,

          I have a quick fundamental question about featureCounts.

          I ran my reads through STAR and used featureCounts to get the counts of reads to that were successfully aligned.

          Despite mapping my reads to the transcriptome in both STAR and featureCounts, I often get a lot of Unassigned_NoFeatures.

          Code:
          featureCounts -p -a $dirgtf -t exon -f -g gene_id -O -o ${sample}featurecounts.txt ${sample}star_sort.bam
          HTML Code:
          Status	{sample}star_sort.bam
          Assigned	3478951
          Unassigned_Ambiguity	0
          Unassigned_MultiMapping	992009
          Unassigned_NoFeatures	3061428
          Unassigned_Unmapped	0
          Unassigned_MappingQuality	0
          Unassigned_FragementLength	0
          Unassigned_Chimera	0
          Unassigned_Secondary	0
          Unassigned_Nonjunction	0
          Unassigned_Duplicate	0
          In contrast, when I run the STAR bam files through cufflinks, it does pick up all the reads and so I am getting huge discrepancies between my featureCounts readout and my cufflinks readout.

          I've also run the same fastq files through tophat and got a similar result.

          HTML Code:
          Status	{sample}tophat_sort.bam
          Assigned	3306323
          Unassigned_Ambiguity	0
          Unassigned_MultiMapping	1625816
          Unassigned_NoFeatures	2938541
          Unassigned_Unmapped	0
          Unassigned_MappingQuality	0
          Unassigned_FragementLength	0
          Unassigned_Chimera	0
          Unassigned_Secondary	0
          Unassigned_Nonjunction	0
          Unassigned_Duplicate	0
          Could someone explain why despite being successfully aligned by either tophat or STAR, featureCounts returns almost 50% Unassigned_NoFeatures? What would be the reason a read is mapped by the aligner but not by featureCounts? I tried to look in the featureCounts user guide and original article but couldn't figure it out.

          Thanks in advance.

          Comment


          • hi Shi and sindrle,

            mcols(object)$basepairs when defined by the user should be the union of basepairs from all exons (so no double counting of basepairs which are in more than one exon), so the same as fc$annotation$Length.

            Comment


            • Originally posted by travelk View Post
              Hi Everyone,

              I have a quick fundamental question about featureCounts.

              I ran my reads through STAR and used featureCounts to get the counts of reads to that were successfully aligned.

              Despite mapping my reads to the transcriptome in both STAR and featureCounts, I often get a lot of Unassigned_NoFeatures.

              Code:
              featureCounts -p -a $dirgtf -t exon -f -g gene_id -O -o ${sample}featurecounts.txt ${sample}star_sort.bam
              HTML Code:
              Status	{sample}star_sort.bam
              Assigned	3478951
              Unassigned_Ambiguity	0
              Unassigned_MultiMapping	992009
              Unassigned_NoFeatures	3061428
              Unassigned_Unmapped	0
              Unassigned_MappingQuality	0
              Unassigned_FragementLength	0
              Unassigned_Chimera	0
              Unassigned_Secondary	0
              Unassigned_Nonjunction	0
              Unassigned_Duplicate	0
              In contrast, when I run the STAR bam files through cufflinks, it does pick up all the reads and so I am getting huge discrepancies between my featureCounts readout and my cufflinks readout.

              I've also run the same fastq files through tophat and got a similar result.

              HTML Code:
              Status	{sample}tophat_sort.bam
              Assigned	3306323
              Unassigned_Ambiguity	0
              Unassigned_MultiMapping	1625816
              Unassigned_NoFeatures	2938541
              Unassigned_Unmapped	0
              Unassigned_MappingQuality	0
              Unassigned_FragementLength	0
              Unassigned_Chimera	0
              Unassigned_Secondary	0
              Unassigned_Nonjunction	0
              Unassigned_Duplicate	0
              Could someone explain why despite being successfully aligned by either tophat or STAR, featureCounts returns almost 50% Unassigned_NoFeatures? What would be the reason a read is mapped by the aligner but not by featureCounts? I tried to look in the featureCounts user guide and original article but couldn't figure it out.

              Thanks in advance.
              Just to be sure, you say you aligned to the transcriptome with both STAR and featurecounts. Do you mean you aligned to a genome with STAR and then used featureCounts to count hits to transcripts with a GTF? That's how it should work and if that is what you did then the discrepancy is between then GTF you have and the potential extent of unannotated features in your actual samples. Cufflinks, remember, actually assembles a transcriptome and quantifies it in the same run so it is likely to use more of your aligned data in counts to genes. FeatureCounts is bound by your GTF. It seems likely that 50% of your data is mapping to in annotated features. That's all. Consider running featureCounts with the gtf made by cufflinks. You can annotate that gtf by using cuffcompare with the -r <gtf annotation> to pull in the information from the Gtf you are already using.
              /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
              Salk Institute for Biological Studies, La Jolla, CA, USA */

              Comment


              • Sorry, I worded it poorly. Indeed, I provided STAR with a fasta file to generate the genome files it aligns my reads to which would suggest it is genome based. To clarify, I provide the exact same gtf file for both FeatureCounts and Cufflinks. (I use the fasta and gtf files provided together by Ensembl.) When you recommend to run FeatureCounts with the gtf made by cufflinks, do you mean the the transcripts.gtf file it generates along with the genes.fpkm_tracking? I'm a bit confused about which file you are talking about...

                Comment


                • ah, with cufflinks did you use the -G or -g option? also how did you determine that cufflinks assigned more reads to features?

                  you could also give htseq-count a try just for a second opinion. that does basically what featureCounts does. another thing you can try is to align directly to transcripts with bowtie and see how many reads successfully align. that may be the best ground truth to how much of your data actually intersects the gene annotation.
                  /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                  Salk Institute for Biological Studies, La Jolla, CA, USA */

                  Comment


                  • featureCounts' lack of strict intersect mode for paired end reads

                    Why doesn't featureCounts have a strict intersect mode for paired end reads, like htseq-count?
                    It's a great software in many other aspects, but this is a mistake

                    I've put a very simple example in attachment where htseq-count will count one fragment, and featureCounts 2 fragments.

                    To me, the count of one for HSPB6 is correct. There is one fragment that aligns entirely in the gene, while only one read of the other fragment aligns to the gene, therefore it should not be counted.

                    I did set the minimum overlap to 50, even though it is inconvenient if I trim my reads. I tried with the minimum overlap to 100, but then I get a count of 0 for all the genes since it only applies to reads and not fragments.
                    Attached Files

                    Comment


                    • unassigned reads

                      Hi all,

                      I have a question regarding some reads that have been excluded as "Unassigned_NoFeatures" but I have no idea why, so any clues would be really helpful!

                      Here is my example. I have a gtf file and looking at this gene:

                      Code:
                      grep Hba-x genes_ucsc_mm9.gtf | grep exon
                      chr11	unknown	exon	32176600 32176758	.	+	.	gene_id "Hba-x"; transcript_id "NM_010405"; gene_name "Hba-x"; p_id "P12021"; tss_id "TSS5558";
                      chr11	unknown	exon	32177569 32177773	.	+	.	gene_id "Hba-x"; transcript_id "NM_010405"; gene_name "Hba-x"; p_id "P12021"; tss_id "TSS5558";
                      chr11	unknown	exon	32177880 32178115	.	+	.	gene_id "Hba-x"; transcript_id "NM_010405"; gene_name "Hba-x"; p_id "P12021"; tss_id "TSS5558";
                      I run featureCounts with this command:
                      Code:
                      featureCounts -T 5 -a genes_ucsc_mm9.gtf -t exon -g gene_id -R -o ../features_counted_rev_try -p  -s 2 my_bam.bam
                      Extracting some random reads that I know that are in the gene (from igv):

                      Code:
                      $samtools view my_bam.bam | grep 113:8147:70638
                      HISEQ2000-06:419:C4HB0ACXX:6:1113:8147:70638	73	chr11	32177640	50	100M	*	0	0	TGCGGGCCCACGGCTTCAAGATCATGACCGCCGTAGGGGATGCGGTTAAGAGCATCGACAACCTCTCTAGTGCTTTGACTAAGCTGAGCGAGCTGCATGC@@@AD@DDDGHHHIIGHGFHIIHE@FGIIHIIGGEAEEBBDDECB;?8:ACACCCABABDBBDACDDDDA4>C>@::@>>CDCDDCDDDDBBDBDDCDCD	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:100	YT:Z:UU	XS:A:+	NH:i:1
                      $samtools view my_bam.bam |  grep 1307:1932:2934
                      HISEQ2000-06:419:C4HB0ACXX:6:1307:1932:2934	99	chr11	32176639	50	100M	=	32176684	955	TCCAACCCTCACCACCACCACCACCATGTCTCTGATGAAGAATGAGAGAGCTATCATCATGTCCATGTGGGAGAAGATGGCTGCTCAGGCCGAGCCCATT	@<@FFDDFHFHFHIGFFEHIIGHGIIEGD>DDEHFG@?GD@G>DHIGDFHGGHHHIEFIGG=FEIICHG@=AAEFF@6@CCEBD3@@>??<=@BBD?B9(	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:100	YT:Z:UU	XS:A:+	NH:i:1
                      HISEQ2000-06:419:C4HB0ACXX:6:1307:1932:2934	147	chr11	32176684	50	75M810N25M	=	32176639	-955	GAGAGCTATCATCATGTCCATGTGGGAGAAGATGGCTGCTCAGGCCGAGCCCATTGGCACTGAGACTCTAGAGAGGCTCTTCTGCAGCTACCCCCAGACG	:>>>@::>CCAAAACCCCDCB?CAC@:CCDCCABB@>@A;;6=AAB=.9AIFBFHCB9*?89*?9?9??9JIGEIIHGGIIGIIJIIHFCHHDFFFFBCC	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:100	YT:Z:UU	XS:A:+	NH:i:1
                      $samtools view my_bam.bam |  grep 2305:5148:22998
                      HISEQ2000-06:419:C4HB0ACXX:6:2305:5148:22998	73	chr11	32176646	50	100M	*	0	0	CTCACCACCACCACCACCATGTCTCTGATGAAGAATGAGAGAGCTATCATCATGTCCATGTGGGAGAAGATGGCTGCTCAGGCCGAGCCCATTGGCACTGCCCFFDFFFHHHGJJJIGGIBHHIGJJJJJJJJIJIJJJJJIJJIIJJJJIIIIHIIJIGHIJJGIJJIGHHFHFBDFFEECEDDDDDDDDDCACDCDD9	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:100	YT:Z:UU	XS:A:+	NH:i:1
                      I see that according to their coordinates and the gtf they should be counted.

                      But in the report (-R) I get them as unassigned no features:
                      Code:
                      $ grep 113:8147:70638 my_bam.bam.featureCounts 
                      HISEQ2000-06:419:C4HB0ACXX:6:1113:8147:70638	Unassigned_NoFeatures	*	*
                      $ grep 1307:1932:2934 my_bam.bam.featureCounts 
                      HISEQ2000-06:419:C4HB0ACXX:6:1307:1932:2934	Unassigned_NoFeatures	*	*
                      $  grep 2305:5148:22998 my_bam.bam.featureCounts 
                      HISEQ2000-06:419:C4HB0ACXX:6:2305:5148:22998	Unassigned_NoFeatures	*	*
                      What am I missing???
                      Thank you for your help!

                      Comment


                      • Why did you use the option '-s 2' (reversely stranded), given that your reads are located on the same strand as that of the target gene?

                        Comment


                        • Hi Shi,

                          Its because the reads for the rest of the genes are reversely stranded (!!!). Actually there has been an issue with possible contamination of that sample, because this gene is not supposed to be expressed at all in this sample and yet there are quite a few reads there. Could those reads be coming from another sample and how would that be??? (possibly the wrong place to be asking that but anyway, if anyone has an answer it would be nice to know)

                          Thanks for your answer! It helped a lot!

                          Comment


                          • Hi everybody,

                            what are your opinions on using RPKM or another alternative (unique counts, expected counts).

                            Can featureCounts generate RPKM, at least for comparison? Are there good reasons why it should not ?

                            Thanks,
                            Colin

                            Comment


                            • Dear Wei,

                              I would like to use featureCounts for counting multiple aligned reads, but instead of adding 1 to the count for each location y would like to add fractional counts (1/number of locations where the read mapped). Is this posible?

                              Thank you in advance for your reply.
                              Tamara

                              Comment


                              • Originally posted by colindaven View Post
                                what are your opinions on using RPKM or another alternative (unique counts, expected counts).

                                Can featureCounts generate RPKM, at least for comparison? Are there good reasons why it should not ?
                                Counts are needed for a proper statistical analyses, as is explained in the publications behind edgeR, voom and DESeq.

                                However, generating RPKM from featureCounts is easy, see:



                                for an example.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM
                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Yesterday, 06:37 PM
                                0 responses
                                11 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, Yesterday, 06:07 PM
                                0 responses
                                10 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-22-2024, 10:03 AM
                                0 responses
                                51 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-21-2024, 07:32 AM
                                0 responses
                                68 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X