Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Getting very low numbers of annotated reads from STAR/mm10

    Hi all,
    I'm trying to analyze RNA-seq data (mouse, multiplexed Nextera XT libraries) using STAR and I'm having the problem that I'm getting mostly "no feature" and hardly any annotated reads downstream.

    I trimmed reads with cutadapt, ran fastqc for QC and STAR using UCSC mm10 (from illumina ftp) and the associated gtf, running STAR with default parameters. I'm getting on average 85-95% uniquely aligned reads so I was assuming things work. However in the "gene counts" output of STAR (when using quantMode GeneCounts), I'm getting more than 50% of reads as "no feature". When running the STAR output bam files through either htseq-count or featureCounts, I'm getting mostly zeros.
    I've included some published (from the Linnarson group) fastq files as positive controls and I'm getting the same results with those files also, suggesting it's not our libraries but the pipeline that's the problem. Would be grateful for any suggestions!
    Best Wishes

  • #2
    Originally posted by analog900 View Post
    Hi all,
    I'm trying to analyze RNA-seq data (mouse, multiplexed Nextera XT libraries) using STAR and I'm having the problem that I'm getting mostly "no feature" and hardly any annotated reads downstream.

    I trimmed reads with cutadapt, ran fastqc for QC and STAR using UCSC mm10 (from illumina ftp) and the associated gtf, running STAR with default parameters. I'm getting on average 85-95% uniquely aligned reads so I was assuming things work. However in the "gene counts" output of STAR (when using quantMode GeneCounts), I'm getting more than 50% of reads as "no feature". When running the STAR output bam files through either htseq-count or featureCounts, I'm getting mostly zeros.
    I've included some published (from the Linnarson group) fastq files as positive controls and I'm getting the same results with those files also, suggesting it's not our libraries but the pipeline that's the problem. Would be grateful for any suggestions!
    Best Wishes
    Hi @analog900

    could you please post Log.final.out file and the first 4 lines of the ReadsPerGene.out.tab file?

    Cheers
    Alex

    Comment


    • #3
      Hi Alex,
      Thanks for replying!

      See below the first few lines from ReadsPerGene and afterwards the Log.final.out

      ReadsPerGene:
      N_unmapped 55905 55905 55905
      N_multimapping 48816 48816 48816
      N_noFeature 603492 868892 866859
      N_ambiguous 13500 2740 2813
      Xkr4 0 0 0
      Rp1 0 0 0
      Sox17 0 0 0
      Mrpl15 0 0 0
      Lypla1 0 0 0
      Tcea1 45 22 23
      Rgs20 12 5 7
      Atp6v1h 0 0 0
      Oprk1 0 0 0
      Npbwr1 0 0 0
      Rb1cc1 0 0 0
      Fam150a 0 0 0
      St18 0 0 0
      Pcmtd1 162 79 83
      Sntg1 0 0 0
      Rrs1 0 0 0
      Adhfe1 0 0 0


      --------------------------------
      Log.Final.Out:
      Started job on | Nov 26 21:01:50
      Started mapping on | Nov 26 21:01:50
      Finished on | Nov 26 21:02:50
      Mapping speed, Million of reads per hour | 74.44

      Number of input reads | 1,240,652
      Average input read length | 357
      UNIQUE READS:
      Uniquely mapped reads number | 1140207
      Uniquely mapped reads % | 91.90%
      Average mapped length | 353.64
      Number of splices: Total | 421644
      Number of splices: Annotated (sjdb) | 402181
      Number of splices: GT/AG | 416953
      Number of splices: GC/AG | 4241
      Number of splices: AT/AC | 209
      Number of splices: Non-canonical | 241
      Mismatch rate per base, % | 0.55%
      Deletion rate per base | 0.02%
      Deletion average length | 1.68
      Insertion rate per base | 0.01%
      Insertion average length | 1.22
      MULTI-MAPPING READS:
      Number of reads mapped to multiple loci | 48816
      % of reads mapped to multiple loci | 3.93%
      Number of reads mapped to too many loci | 2488
      % of reads mapped to too many loci | 0.20%
      UNMAPPED READS:
      % of reads unmapped: too many mismatches | 0.00%
      % of reads unmapped: too short | 3.76%
      % of reads unmapped: other | 0.20%

      Thanks so much in advance for your help!

      Comment


      • #4
        Nothing suspicious in the summary statistics...
        It might be a problem with annotations - could you point me to the genome/annotations you were using? And also the public .fastq files you mentioned.

        Comment


        • #5
          Hi Alex,

          I downloaded UCSC mm10 from here:

          Differential expression and time-series analysis for single-cell RNA-Seq and qPCR experiments.


          and downloaded "control data" from here:


          I used the first dozen of samples, starting with SRR1544693.

          Indexed the genome using the --sjdbGTFfile and --sjdbOverhang 299 because we sequenced 300bp paired-end (MiSeq v3 on Nextera XT). I'm assuming this parameter should not negatively affect the short 50bp reads in the Zeisel data (SRR.... files)?

          I recompiled STAR (v2.4.2) for long(er) reads by inserting:
          #define DEF_readSeqLengthMax 700
          into the IncludeDefine.h as described here:
          GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.


          As mentioned, mapping using default parameters (I tried outputting as SAM or coord.-sorted BAMS, didn't seem to matter).

          Thanks again for your help!

          Comment


          • #6
            Hi @analog900,

            I have mapped one of these samples to m38 assembly with Gencode M8 annotations, and got 1.2M reads mapped uniquely, out of which 600k (50%) were "no feature", i.e. did not overlap annotated exons - this is similar to what you got.

            To trick STAR into counting how many read map to entire genes (including both introns and exons), you can use --sjdbGTFfile gencode.vM8.annotation.gtf --sjdbGTFfeatureExon gene --sjdbGTFtagExonParentTranscript gene_id --sjdbGTFtagExonParentGene gene_name .
            This resulted in only 100k "no feature" reads (<10%).

            It seems like the explanation is that a large proportion of reads (~40%) map to the introns of the genes.

            Cheers
            Alex

            Comment


            • #7
              Hi Alex,
              Thanks so very much for your help. I'm still running a few more runs, but it does indeed help. In my own samples the decrease in percentage (no-feature) is not quite as dramatic as in the published controls, but getting a reduction of about 10%. Still leaves me with about 40% "no-feature". I'll look into this more carefully but thanks again so much for helping!

              Comment


              • #8
                continuation of problem

                Hi Alex,

                I work with the user who posted the original question. We have included the gene parameters for annotation. We have also added ERCCs and internal control transgenes to the fasta file and the gtf. For the ERCCs and transgenes, i triplicated the information modifying the column to include gene/transcript/exon for every entry. Among all data, we are getting no ReadsPerGene counts for our controls (the transgenes) - they should clearly be there. Also, the known genes that should be there for a specific cell type are not showing up with hits. Is there anything else we could augment to help STAR perform the ReadsPerGene counts better? Should we use some other tool and not rely on that? Please let me know if you prefer me to send you an example file.

                Also, i am using STARlong for the analysis - all the same abilities should be there in that script, right?

                Thank you!

                Comment


                • #9
                  Hi @viktoriag,

                  STAR GeneCount algorithm is very straightforward: all uniquely mapping reads that overlap exons of only one gene are counted. Multimapping reads are not counted, and reads overlapping exons of more than one gene are reported as "ambiguous". This algorithm produces exactly the same counts as htseq-count with default parameters.

                  To figure out what's going on with your transgenes, you can count the number of reads mapped to the transgene
                  (i) unique mappers only:
                  samtools view -c -q 255 Aligned.sortedByCoord.out.bam <Trasgene_Name>
                  (ii) unique+multipe
                  samtools view -c Aligned.sortedByCoord.out.bam <Trasgene_Name>

                  <Trasgene_Name> here is the name of one of your expressed transgenes in the fasta file.

                  The (i) number (divided by 2 if you have paired-end reads) should agree with the count for this gene in the ReadsPerGene file.

                  If this does not work, please send me an example file with ~1M reads, preferrably sorted BAM, and also the transgene portion of your GTF file.

                  Cheers
                  Alex

                  Originally posted by viktoriag View Post
                  Hi Alex,

                  I work with the user who posted the original question. We have included the gene parameters for annotation. We have also added ERCCs and internal control transgenes to the fasta file and the gtf. For the ERCCs and transgenes, i triplicated the information modifying the column to include gene/transcript/exon for every entry. Among all data, we are getting no ReadsPerGene counts for our controls (the transgenes) - they should clearly be there. Also, the known genes that should be there for a specific cell type are not showing up with hits. Is there anything else we could augment to help STAR perform the ReadsPerGene counts better? Should we use some other tool and not rely on that? Please let me know if you prefer me to send you an example file.

                  Also, i am using STARlong for the analysis - all the same abilities should be there in that script, right?

                  Thank you!

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Strategies for Sequencing Challenging Samples
                    by seqadmin


                    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                    03-22-2024, 06:39 AM
                  • seqadmin
                    Techniques and Challenges in Conservation Genomics
                    by seqadmin



                    The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                    Avian Conservation
                    Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                    03-08-2024, 10:41 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, Yesterday, 06:37 PM
                  0 responses
                  10 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, Yesterday, 06:07 PM
                  0 responses
                  9 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-22-2024, 10:03 AM
                  0 responses
                  49 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-21-2024, 07:32 AM
                  0 responses
                  67 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X