Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Large number of Unassigned_NoFeature reads from featureCounts

    I have a bit of a mystery on my hands about an RNA-Seq Project.

    I am working with some mice lines, standard RNA-seq prep, paired end. Initial adapter trimming is done with cutadapt. At this point fastqc looks good across the board, so on to alignment.

    I'm aligning my reads to the pre-build mouse MM37(mm9) build from ensemble. I'm using tophat2 for alignment and am getting pretty good mapping. Here is an example.
    Left reads:
    Input : 27528769
    Mapped : 24331602 (88.4% of input)
    these: 3339296 (13.7%) have multiple alignments (92032 have >20)
    Right reads:
    Input : 27528769
    Mapped : 24655147 (89.6% of input)
    of these: 3347668 (13.6%) have multiple alignments (91524 have >20)
    89.0% overall read mapping rate.

    Aligned pairs: 23141474
    of these: 3172252 (13.7%) have multiple alignments
    760681 ( 3.3%) are discordant alignments
    81.3% concordant pair alignment rate.
    I ran tophat with the following options:
    tophat2 -p 6 -N 5 --read-edit-dist 5 -o $OUTPUT $INDEX_BASE $R1_FILE_PATH $R2_FILE_PATH
    Here are some stats on the alignment from samtools:
    SN raw total sequences: 48986749
    SN filtered sequences: 0
    SN sequences: 48986749
    SN is sorted: 1
    SN 1st fragments: 24331602
    SN last fragments: 24655147
    SN reads mapped: 48986749
    SN reads mapped and paired: 46282948 # paired-end technology bit set + both mates mapped
    SN reads unmapped: 0
    SN reads properly paired: 41476118 # proper-pair bit set
    SN reads paired: 48986749 # paired-end technology bit set
    SN reads duplicated: 0 # PCR or optical duplicate bit set
    SN reads MQ0: 720045 # mapped and MQ=0
    SN reads QC failed: 0
    SN non-primary alignments: 16364564
    SN total length: 4947661649 # ignores clipping
    SN bases mapped: 4947661649 # ignores clipping
    SN bases mapped (cigar): 4947661649 # more accurate
    SN bases trimmed: 0
    SN bases duplicated: 0
    SN mismatches: 58293637 # from NM fields
    SN error rate: 1.178206e-02 # mismatches / bases mapped (cigar)
    SN average length: 101
    SN maximum length: 101
    SN average quality: 34.4
    SN insert size average: 730.1
    SN insert size standard deviation: 1723.8
    SN inward oriented pairs: 22043384
    SN outward oriented pairs: 262611
    SN pairs with other orientation: 40818
    SN pairs on different chromosomes: 727628
    Next, I ran featureCounts as following:
    featureCounts -T $NUM_THREADS -M -a $GTF_FILE -o $out_file $TOPHAT_BAMFILE
    Note that I included the -M option to include multimapping reads. I did that because I was already losing a large amount of reads.

    Here is the problem I'm having, I am getting a very large number of "Unassigned_NoFeatures" when I run featureCounts. For example:

    Assigned 32997164
    Unassigned_Ambiguity 4502244
    Unassigned_MultiMapping 0
    Unassigned_NoFeatures 14503113
    Unassigned_Unmapped 0
    Unassigned_MappingQuality 0
    Unassigned_FragmentLength 0
    Unassigned_Chimera 0
    Unassigned_Secondary 0
    Unassigned_Nonjunction 0
    Unassigned_Duplicate 0
    Sometimes it's even around 50% of the reads are unassigned_nofeatures.

    I used the ensemble prebuilt bowtie indexes and accompanying .gtf annotation, so I can't imagine its an annotation issue. When I look at the reads in a genome browser they seem to largely (not completely) map to exons, and there aren't any super-over-expressed genes in the count list generated by featurecounts.

    Does anyone know what might be the cause of this issue? I am currently attempting to get counts with cufflinks to see if it's any better, or I might try htseq, but I am concerned the results will be similar.

    I read here that MAPQ scores are generated differently for all the different alignment tools and wondered if featureCounts might take that into account, but I couldn't find documentation about how featurecounts handles MAPQ scores from tophat2.

    Why am I losing so many reads at the count stage!?

    I would appreciate any help or insight you can offer, thanks very much in advance.

  • #2
    For reference cross-posted: https://www.biostars.org/p/253337/

    Please close this post with a cross-reference, if an acceptable answer is found in other forum(s).
    Last edited by GenoMax; 05-17-2017, 03:33 PM.

    Comment


    • #3
      repeats?

      I currently have the same issue. I wonder if it's related to the fact that featureCounts counts multimapping reads once per mapping position. If you have some reads that map repetitive positions outside genes, and each of those reads maps for example to 50 repeats in the genome, the Unassigned_NoFeatures count will be greatly inflated.

      Comment


      • #4
        That might be one contributing factor, but it's not enough to explain the scale of the read loss, at least not in my data. The level of multi-mapped reads I'm seeing is something on average of 10-15%, and the level of reads with unassigned_noFeature is sometimes around half of the total aligned reads.

        I have looked into this using multiple tools and always find the same thing, tried different parameter tuning, same thing. Looked at the alignments in a genome browser and it seems reads mostly are aligning to annotated regions.

        I talked to some more biology-heavy people about it and they suggested it might be some small RNAs or rRNAs that aren't annotated but are highly expressed or amplified. I did find one case where this was true.

        I noticed a spike in my GC plot in fastQC around 62%, blasted all of my overrepresented sequences, and found them to be a single 7s rRNA that wasn't annotated. So I feel like thats a possible contributing factor as well.

        I've decided that if there is sufficient depth for the counted reads though, that I'm going to proceed with the data. I feel like at 15-20x coverage we're still likely to be getting good expression ratios, and most of the data i'm looking at have even 80-100x coverage. So since I've been unable to locate the cause of this perfectly, I'm just filtering samples that don't meet a high coverage cutoff, because what else can we do, right?

        Anyhow, I wish you luck, let me know if you find anything else out about this.

        Comment


        • #5
          Originally posted by aprice67 View Post
          That might be one contributing factor, but it's not enough to explain the scale of the read loss, at least not in my data. The level of multi-mapped reads I'm seeing is something on average of 10-15%, and the level of reads with unassigned_noFeature is sometimes around half of the total aligned reads.
          You have to be careful here: The mapper counts reads that multimap. featureCounts counts mappings. Thus if you have 10% multimapped reads, and each of those reads maps 5 times to the non-annotated part, those reads will make up ~35% of all counts in featureCounts. At least that's how I understand it.

          Unannotated RNAs are of course also possible. You could look for areas of high coverage and substract the annotation gtf with bedtools to find those areas. Maybe I'll do that now with my data .

          Comment


          • #6
            That's a very nice thing to know, thanks for pointing this out. I feel slightly safer now. I wonder, are you working with mouse data as well?

            Comment


            • #7
              I wish! All so nicely annotated with all the tools for mouse. I'm working with weird fishes.

              Comment


              • #8
                As someone on the other side of the fence, let me assure you the grass is not quite so green as you might think. Maybe i'd rather have the weird fish. At least then I can say the annotation is to blame.

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Strategies for Sequencing Challenging Samples
                  by seqadmin


                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                  03-22-2024, 06:39 AM
                • seqadmin
                  Techniques and Challenges in Conservation Genomics
                  by seqadmin



                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                  Avian Conservation
                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                  03-08-2024, 10:41 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, Yesterday, 06:37 PM
                0 responses
                10 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, Yesterday, 06:07 PM
                0 responses
                9 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 03-22-2024, 10:03 AM
                0 responses
                49 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 03-21-2024, 07:32 AM
                0 responses
                67 views
                0 likes
                Last Post seqadmin  
                Working...
                X