Seqanswers Leaderboard Ad

**alexdobin** · 12-23-2015, 11:27 AM

Originally posted by analog900 View Post

Hi all,
I'm trying to analyze RNA-seq data (mouse, multiplexed Nextera XT libraries) using STAR and I'm having the problem that I'm getting mostly "no feature" and hardly any annotated reads downstream.

I trimmed reads with cutadapt, ran fastqc for QC and STAR using UCSC mm10 (from illumina ftp) and the associated gtf, running STAR with default parameters. I'm getting on average 85-95% uniquely aligned reads so I was assuming things work. However in the "gene counts" output of STAR (when using quantMode GeneCounts), I'm getting more than 50% of reads as "no feature". When running the STAR output bam files through either htseq-count or featureCounts, I'm getting mostly zeros.
I've included some published (from the Linnarson group) fastq files as positive controls and I'm getting the same results with those files also, suggesting it's not our libraries but the pipeline that's the problem. Would be grateful for any suggestions!
Best Wishes

Hi @analog900

could you please post Log.final.out file and the first 4 lines of the ReadsPerGene.out.tab file?

Cheers
Alex

**analog900** · 12-23-2015, 01:16 PM

Hi Alex,
Thanks for replying!

See below the first few lines from ReadsPerGene and afterwards the Log.final.out

ReadsPerGene:
N_unmapped 55905 55905 55905
N_multimapping 48816 48816 48816
N_noFeature 603492 868892 866859
N_ambiguous 13500 2740 2813
Xkr4 0 0 0
Rp1 0 0 0
Sox17 0 0 0
Mrpl15 0 0 0
Lypla1 0 0 0
Tcea1 45 22 23
Rgs20 12 5 7
Atp6v1h 0 0 0
Oprk1 0 0 0
Npbwr1 0 0 0
Rb1cc1 0 0 0
Fam150a 0 0 0
St18 0 0 0
Pcmtd1 162 79 83
Sntg1 0 0 0
Rrs1 0 0 0
Adhfe1 0 0 0

--------------------------------
Log.Final.Out:
Started job on | Nov 26 21:01:50
Started mapping on | Nov 26 21:01:50
Finished on | Nov 26 21:02:50
Mapping speed, Million of reads per hour | 74.44

Number of input reads | 1,240,652
Average input read length | 357
UNIQUE READS:
Uniquely mapped reads number | 1140207
Uniquely mapped reads % | 91.90%
Average mapped length | 353.64
Number of splices: Total | 421644
Number of splices: Annotated (sjdb) | 402181
Number of splices: GT/AG | 416953
Number of splices: GC/AG | 4241
Number of splices: AT/AC | 209
Number of splices: Non-canonical | 241
Mismatch rate per base, % | 0.55%
Deletion rate per base | 0.02%
Deletion average length | 1.68
Insertion rate per base | 0.01%
Insertion average length | 1.22
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 48816
% of reads mapped to multiple loci | 3.93%
Number of reads mapped to too many loci | 2488
% of reads mapped to too many loci | 0.20%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 3.76%
% of reads unmapped: other | 0.20%

Thanks so much in advance for your help!

**alexdobin** · 12-23-2015, 04:05 PM

Nothing suspicious in the summary statistics...
It might be a problem with annotations - could you point me to the genome/annotations you were using? And also the public .fastq files you mentioned.

**analog900** · 12-23-2015, 04:25 PM

Hi Alex,

I downloaded UCSC mm10 from here:

Cufflinks

http://cole-trapnell-lab.github.io/cufflinks/igenome_table/index.html

Differential expression and time-series analysis for single-cell RNA-Seq and qPCR experiments.

and downloaded "control data" from here:

http://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP045452

I used the first dozen of samples, starting with SRR1544693.

Indexed the genome using the --sjdbGTFfile and --sjdbOverhang 299 because we sequenced 300bp paired-end (MiSeq v3 on Nextera XT). I'm assuming this parameter should not negatively affect the short 50bp reads in the Zeisel data (SRR.... files)?

I recompiled STAR (v2.4.2) for long(er) reads by inserting:
#define DEF_readSeqLengthMax 700
into the IncludeDefine.h as described here:

Build software better, together

https://github.com/PacificBiosciences/cDNA_primer/wiki/Bioinfx-study:-Optimizing-STAR-aligner-for-Iso-Seq-data

GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.

As mentioned, mapping using default parameters (I tried outputting as SAM or coord.-sorted BAMS, didn't seem to matter).

Thanks again for your help!

**alexdobin** · 12-30-2015, 01:24 PM

Hi @analog900,

I have mapped one of these samples to m38 assembly with Gencode M8 annotations, and got 1.2M reads mapped uniquely, out of which 600k (50%) were "no feature", i.e. did not overlap annotated exons - this is similar to what you got.

To trick STAR into counting how many read map to entire genes (including both introns and exons), you can use --sjdbGTFfile gencode.vM8.annotation.gtf --sjdbGTFfeatureExon gene --sjdbGTFtagExonParentTranscript gene_id --sjdbGTFtagExonParentGene gene_name .
This resulted in only 100k "no feature" reads (<10%).

It seems like the explanation is that a large proportion of reads (~40%) map to the introns of the genes.

Cheers
Alex

**analog900** · 01-03-2016, 01:29 PM

Hi Alex,
Thanks so very much for your help. I'm still running a few more runs, but it does indeed help. In my own samples the decrease in percentage (no-feature) is not quite as dramatic as in the published controls, but getting a reduction of about 10%. Still leaves me with about 40% "no-feature". I'll look into this more carefully but thanks again so much for helping!

**viktoriag** · 01-12-2016, 10:11 AM

continuation of problem

Hi Alex,

I work with the user who posted the original question. We have included the gene parameters for annotation. We have also added ERCCs and internal control transgenes to the fasta file and the gtf. For the ERCCs and transgenes, i triplicated the information modifying the column to include gene/transcript/exon for every entry. Among all data, we are getting no ReadsPerGene counts for our controls (the transgenes) - they should clearly be there. Also, the known genes that should be there for a specific cell type are not showing up with hits. Is there anything else we could augment to help STAR perform the ReadsPerGene counts better? Should we use some other tool and not rely on that? Please let me know if you prefer me to send you an example file.

Also, i am using STARlong for the analysis - all the same abilities should be there in that script, right?

Thank you!

**alexdobin** · 01-12-2016, 12:02 PM

Hi @viktoriag,

STAR GeneCount algorithm is very straightforward: all uniquely mapping reads that overlap exons of only one gene are counted. Multimapping reads are not counted, and reads overlapping exons of more than one gene are reported as "ambiguous". This algorithm produces exactly the same counts as htseq-count with default parameters.

To figure out what's going on with your transgenes, you can count the number of reads mapped to the transgene
(i) unique mappers only:
samtools view -c -q 255 Aligned.sortedByCoord.out.bam <Trasgene_Name>
(ii) unique+multipe
samtools view -c Aligned.sortedByCoord.out.bam <Trasgene_Name>

<Trasgene_Name> here is the name of one of your expressed transgenes in the fasta file.

The (i) number (divided by 2 if you have paired-end reads) should agree with the count for this gene in the ReadsPerGene file.

If this does not work, please send me an example file with ~1M reads, preferrably sorted BAM, and also the transgene portion of your GTF file.

Cheers
Alex

Originally posted by viktoriag View Post

Hi Alex,

I work with the user who posted the original question. We have included the gene parameters for annotation. We have also added ERCCs and internal control transgenes to the fasta file and the gtf. For the ERCCs and transgenes, i triplicated the information modifying the column to include gene/transcript/exon for every entry. Among all data, we are getting no ReadsPerGene counts for our controls (the transgenes) - they should clearly be there. Also, the known genes that should be there for a specific cell type are not showing up with hits. Is there anything else we could augment to help STAR perform the ReadsPerGene counts better? Should we use some other tool and not rely on that? Please let me know if you prefer me to send you an example file.

Also, i am using STARlong for the analysis - all the same abilities should be there in that script, right?

Thank you!

Topics	Statistics	Last Post
Evaluating Genome Sequencing for ECMO Patients in the NICU by seqadmin Started by seqadmin, 12-17-2024, 10:28 AM	0 responses 33 views 0 likes	Last Post by seqadmin 12-17-2024, 10:28 AM
New Genetic Toolkit Refines Studies on Gene Function and Disease by seqadmin Started by seqadmin, 12-13-2024, 08:24 AM	0 responses 48 views 0 likes	Last Post by seqadmin 12-13-2024, 08:24 AM
Study Links Brain Mechanism to Emotional Responses in Animals and Humans by seqadmin Started by seqadmin, 12-12-2024, 07:41 AM	0 responses 34 views 0 likes	Last Post by seqadmin 12-12-2024, 07:41 AM
Study Identifies Ribosomal RNA Fingerprints as Early Cancer Biomarkers by seqadmin Started by seqadmin, 12-11-2024, 07:45 AM	0 responses 46 views 0 likes	Last Post by seqadmin 12-11-2024, 07:45 AM

Seqanswers Leaderboard Ad

Announcement

Getting very low numbers of annotated reads from STAR/mm10

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News