Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #31
    Dear Nicolas,

    Many thanks for sending us your read data which helped us to identify the problem. The problem was that in your SAM file about half the fragments had their forward read appearing before their reverse read and the other half had their forward read appearing after their reverse read. This is not what featureCounts expects to see since most aligners always put the forward read before the reverse read in their mapping output.

    The change of read order does not affect unstranded read counting with featureCounts, but it resulted in around half of the fragments not being counted when stranded read counting was performed, which is what you observed.

    Nonetheless, we are now modifying featureCounts to deal with this situation and a patch will be released soon.

    Best wishes,

    Wei

    Comment


    • #32
      I've written several utilities for post-processing on sam/bam files and I've learned that aligners definitely don't all report things in the same way. Another one is how they report singleton mates. Seems like there are a couple conventions for this.
      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
      Salk Institute for Biological Studies, La Jolla, CA, USA */

      Comment


      • #33
        Dear Nicolas,

        We have just fixed this issue. Please try on the latest version 1.3.5-p3 (http://subread.sourceforge.net).

        We have tested it and it worked fine in our end. Let me know if you ran into other problems.


        Best wishes,

        Wei

        Comment


        • #34
          Dear Wei,
          I have just tried the new version of featureCounts, it works perfectly (see below).

          Code:
          ## UNSTRANDED ##	accepted_gene	multi_mapping	Notfound_gene	Overlapped_genes
          featureCounts with SAM containing only unique	 3,181,215 	 -   	 794,232 	 11,469 
          Htseq-count with SAM containing only unique	 3,174,243 	 -   	 810,359 	 2,314
          				
          ## STRANDED ##	accepted_gene	multi_mapping	Notfound_gene	Overlapped_genes
          featureCounts with SAM containing only unique	 2,784,181 	 -   	 1,198,726 	 4,009 
          Htseq-count with SAM containing only unique	 2,783,943 	 -   	 1,202,275 	 698
          				
          ## STRANDED REVERSE ##	accepted_gene	multi_mapping	Notfound_gene	Overlapped_genes
          featureCounts with SAM containing only unique	 421,127 	 -   	 3,564,919 	 870 
          Htseq-count with SAM containing only unique	 420,840 	 -   	 3,566,004 	 72
          I am definitely switching to featureCounts for my new analyses, since it really is an amazing software.
          Thanks again for all your help and for always being available.
          Best wishes,
          Nicolas

          Comment


          • #35
            Dear Nicolas,

            Thanks for letting me know it worked for you. That's great. We really appreciate you putting up with our bugs and helping us to improve the program. I hope you find it useful for your analyses in the future.

            Best wishes,

            Wei

            Comment


            • #36
              Hi Wei,

              I am wondering about some results I have had with tests I am running. I have been looking at appropriate quality cutoff values using Trimmomatic to trim reads based on quality score. I have tested multiple Q scores. Run parameters in featureCounts:

              Code:
              featureCounts -a /home/bin/my.gtf \
              -i /home/data/quality/Q20/Aligned.out.sam \ 
              -o /home/data/quality/Q20/Q20.featco.trimmo.counts \
              -T 12 -p -P -C -R -D 500000
              I run:

              Code:
              cut -f 2 *.counts.reads | sort | uniq -c > *.diags
              giving me:

              Code:
              ::::::::::::::
              Q20.diags
              ::::::::::::::
                87805 ACCEPTED_2VOTE_GENE
              14507634 ACCEPTED_GENE
              4345122 MULTI_MAPPING
              2315513 NOTFOUND_GENE
              8709356 OVERLAPPED_GENES
                20001 PAIR_DISTANCE
              ::::::::::::::
              Q40.diags
              ::::::::::::::
               149517 ACCEPTED_2VOTE_GENE
              22640462 ACCEPTED_GENE
              3425942 MULTI_MAPPING
              4112768 NOTFOUND_GENE
                96425 OVERLAPPED_GENES
                 2464 PAIR_DISTANCE
              Can you see any reason that the untrimmed (ie Q40) alignment would have such a vastly reduced number of OVERLAPPED_GENES? Both input SAMs made in the same way using STAR with same parameters.

              Anyone else with any ideas also very welcome to contribute!

              Comment


              • #37
                Sorry that I misunderstood the issue.
                Last edited by yangliao; 06-14-2013, 05:18 AM.

                Comment


                • #38
                  Originally posted by yangliao View Post
                  Sorry that I misunderstood the issue.
                  Still, I will try the Q filtering of featureCounts as it will remove a preprocessing step of the data. Thanks for your input.

                  Comment


                  • #39
                    Dear bruce01,

                    The featureCounts results just tell you that after trimming you got a lot more fragments mapped to regions where there is more than one gene residing. In fact, for the untrimmed reads you got about 23 million fragments successfully assigned (the groups 'ACCEPTED_GENE' and 'ACCEPTED_2VOTE_GENE'), whereas for the trimmed reads you got only about 15 million fragments assigned. So you lost 8 million assigned fragments. I bet these 8 million reads were almost all affected by your trimming and their mapping locations were changed. But the strange thing here is why they were all remapped to the regions where there are overlapping genes? You will have to ask the developers of the aligner you used about this.

                    Using Q20 to filter out read bases seems to result in the removal of a lot of good quality bases. This will make the mapping of RNA-seq data harder because reads become shorter, especially for those exon-spanning reads, from my point of view. We do not perform read trimming when mapping RNA-seq reads. Part of the reason for this is that we use the Subread aligner which can tolerate quite a few sequencing errors.


                    Hope this helps.

                    Best wishes,

                    Wei

                    Comment


                    • #40
                      Dear Wei,

                      Really sorry to bother you again, I have just one question as a matter of interest regarding featureCounts.

                      How do you handle your read count summarisation, when is a read assigned to a feature exactly? I know that HTseq-count has three summarisation modes (see pictures enclosed): union, intersection_strict, intersection_nonempty; so I am just interested to know what is yours (sorry I was not able to find the information in the users guide and paper).



                      Thanks a lot for all your help.
                      Best wishes,
                      Nicolas

                      Comment


                      • #41
                        Dear Nicolas,

                        No worries. Thanks for your interest in featureCounts.

                        featureCounts calls an overlap when there is at least 1bp overlap found between a read and a feature. This also applies for paired-end reads, ie. an overlap is called as long as 1bp overlap was found between one of the two reads and the feature.

                        featureCounts will check every feature to see if it overlaps with a read. If more than one overlapping feature was found, it will not assign this read to any feature (by default). However if '-O' is specified, featureCounts will assign it to all its overlapping features.

                        For paired-end reads, featureCounts also checks if both ends from the same fragment overlapping with the features. For example, if a fragment has only one end overlapped with a feature F1 but this fragment also has both ends overlapped with a feature F2, this fragment will be assigned to F2 because this overlap has a stronger support.

                        Hope this makes it clearer. Don't hesitate to ask me if you have any further questions.


                        Best wishes,

                        Wei

                        Comment


                        • #42
                          Dear Wei,

                          That is great, very helpful.
                          Thanks a lot.
                          Best wishes,
                          Nicolas

                          Comment


                          • #43
                            Hi Wei,

                            so to return to my issue above with the many OVERLAP reads resulting from using Trimmomatic:

                            Having contacted the authors of STAR we have determined that it is not the aligner at fault for the issue I have reported to you. I have attached 3 short SAM files; I took the first 100 OVERLAP reads from Trimmomatic trimmed reads as determined by featureCounts. I grepped these out of Trimmomatic qulity trimmed at Q30 ('trimmo'), hard-trimmed by 10bp ('trim') and not trimmed ('notrim') SAM. I then used featureCounts to again determine what features these reads map to. Reads were initially found using:
                            Code:
                            grep -m 100 OVER trimmo.featco.counts.reads | cut -f 1 | grep - ./trimmo/Aligned.out.sam > trimmo.100-over.sam
                            now when I rerun featureCounts on the 'trimmo' set:
                            Code:
                            ~/bin/subread-1.3.5/bin/featureCounts -a Bos_taurus.UMD3.1.70.gtf -i trimmo.100-over.sam -o trimmo.100-over.counts -p -P -C -R -D 500000
                            then
                            Code:
                            grep -v OVER trimmo.100-over.counts.reads | wc -l
                            returns 10. So now, from exactly the same 100 reads that were initially OVERLAP, 10% are assigned with clarity to a single feature. The other two sets, 'notrim' and 'trim', return the same results as previously, each having only 2 reads return OVERLAP.

                            Interestingly, or unfortunately, depending on viewpoint:
                            Code:
                            grep -f trimmo-100.over.sam notrim-100.over.sam | wc -l
                            returns 71. So 71 of ~200 lines are identical between 'trimmo' and 'notrim', ie there has been no trimming done on these reads and they align to the same position. But when these reads are under the header of 'trimmo' they produce OVERLAP, and under the header 'notrim' they produce ACCEPTED.

                            I have now taken all reads which are called OVERLAP in the trimmo analysis and grepped them from trimmo and notrim SAM, then compared those two files for number of lines which are identical.

                            Code:
                            wc -l trimmo.featco.reads.solo
                            7380972 trimmo.featco.reads.solo
                            wc -l notrim.trimmo-overlap.matched.lines.sam
                            5036198 notrim.trimmo-overlap.matched.lines.sam
                            So that's 68% identical, of which 100% in trimmo data are called OVERLAP, but those 5,036,198 from notrim data, identical reads to those called OVERLAP, give a breakdown thus (obviously there are nearly double the number of reads as this is PE data and the featureCounts run include -p -P flags):

                            Code:
                            28587 ACCEPTED_2VOTE_GENE
                            2616603 ACCEPTED_GENE
                            6191 NOTFOUND_GENE
                            55188 OVERLAPPED_GENES
                            234 PAIR_DISTANCE
                            Which is 96% ACCEPTED, 2% OVERLAP versus 100% OVERLAP from the trimmo data. Which is a massive 'discrepancy'.

                            Your thoughts on this would be greatly appreciated.
                            Attached Files
                            Last edited by bruce01; 06-25-2013, 01:50 PM. Reason: Further evidence of issue.

                            Comment


                            • #44
                              Hi Bruce,

                              The three SAM files you sent to me contained different numbers reads, so I'm not sure if you used the same set of reads in your evaluations.

                              Also, your 'trimmo' SAM file is not in the correct format. One of the read pairs had only one read included in the file, the other was missing. The ID of this read pair is:

                              HWI-ST600:257:C1MUWACXX:3:1101:8762:2175

                              featureCounts requires that for paired-end read data both ends must be included in the SAM/BAM file and the two reads from the same pair must be next to each other. It has a unpredicted behavior if this requirement is not met. The problematic read pair in your 'trimmo' SAM file is located at line 17, so it is possible all the subsequent read pairs were incorrectly summarized. This may explain the bizarre result you observed for this dataset.

                              I speculate that the discrepancy in your previous results was also due to the missing reads in your SAM files.

                              Best wishes,

                              Wei

                              Comment


                              • #45
                                Hi Wei,

                                I was hopeful this was the issue resolved. Using Trimmomatic, you are returned 'paired' and 'unpaired' fastq files, so I immediately thought that was a nice way to remove unpaired reads; I only aligned the paired fastq. I have a script to remove all unpaired lines in SAM, and have confirmed the new, paired sam. When it is run in featureCounts, I still get the very large amount of OVERLAP:

                                Code:
                                ::::::::::::::
                                trimmo.paired.diags
                                ::::::::::::::
                                  87664 ACCEPTED_2VOTE_GENE
                                11663844 ACCEPTED_GENE
                                2824684 MULTI_MAPPING
                                1925844 NOTFOUND_GENE
                                7234680 OVERLAPPED_GENES
                                  10940 PAIR_DISTANCE
                                ::::::::::::::
                                trimmo.diags
                                ::::::::::::::
                                  87710 ACCEPTED_2VOTE_GENE
                                11713945 ACCEPTED_GENE
                                2826089 MULTI_MAPPING
                                1929703 NOTFOUND_GENE
                                7380972 OVERLAPPED_GENES
                                  10940 PAIR_DISTANCE
                                The difference is 201,703 counts, which is half the number of reads removed, ~70% are from the OVERLAPPED_GENES, but the rest are not 'correctly' allocated.

                                Any other ideas about this?
                                Last edited by bruce01; 06-27-2013, 05:25 AM.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM
                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Yesterday, 06:37 PM
                                0 responses
                                10 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, Yesterday, 06:07 PM
                                0 responses
                                9 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-22-2024, 10:03 AM
                                0 responses
                                50 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-21-2024, 07:32 AM
                                0 responses
                                67 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X