SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
DEXseq - very low numbers of counts kajot RNA Sequencing 6 08-03-2018 11:04 PM
Annotated junctions file for STAR lre1234 RNA Sequencing 1 12-16-2015 12:31 PM
Very low read numbers from haloplex libraries henry.wood Illumina/Solexa 7 10-16-2015 01:15 AM
RNA expression - reads mapping outside of the annotated regions of the genome feralBiologist Bioinformatics 2 09-01-2014 11:51 PM
Suspiciously low read numbers with shotgun pipeline compared to amplicon pipeline Krisztab 454 Pyrosequencing 0 03-27-2014 07:05 AM

Reply
 
Thread Tools
Old 12-22-2015, 11:52 PM   #1
analog900
Member
 
Location: West Coast

Join Date: Oct 2014
Posts: 13
Default Getting very low numbers of annotated reads from STAR/mm10

Hi all,
I'm trying to analyze RNA-seq data (mouse, multiplexed Nextera XT libraries) using STAR and I'm having the problem that I'm getting mostly "no feature" and hardly any annotated reads downstream.

I trimmed reads with cutadapt, ran fastqc for QC and STAR using UCSC mm10 (from illumina ftp) and the associated gtf, running STAR with default parameters. I'm getting on average 85-95% uniquely aligned reads so I was assuming things work. However in the "gene counts" output of STAR (when using quantMode GeneCounts), I'm getting more than 50% of reads as "no feature". When running the STAR output bam files through either htseq-count or featureCounts, I'm getting mostly zeros.
I've included some published (from the Linnarson group) fastq files as positive controls and I'm getting the same results with those files also, suggesting it's not our libraries but the pipeline that's the problem. Would be grateful for any suggestions!
Best Wishes
analog900 is offline   Reply With Quote
Old 12-23-2015, 10:27 AM   #2
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Quote:
Originally Posted by analog900 View Post
Hi all,
I'm trying to analyze RNA-seq data (mouse, multiplexed Nextera XT libraries) using STAR and I'm having the problem that I'm getting mostly "no feature" and hardly any annotated reads downstream.

I trimmed reads with cutadapt, ran fastqc for QC and STAR using UCSC mm10 (from illumina ftp) and the associated gtf, running STAR with default parameters. I'm getting on average 85-95% uniquely aligned reads so I was assuming things work. However in the "gene counts" output of STAR (when using quantMode GeneCounts), I'm getting more than 50% of reads as "no feature". When running the STAR output bam files through either htseq-count or featureCounts, I'm getting mostly zeros.
I've included some published (from the Linnarson group) fastq files as positive controls and I'm getting the same results with those files also, suggesting it's not our libraries but the pipeline that's the problem. Would be grateful for any suggestions!
Best Wishes
Hi @analog900

could you please post Log.final.out file and the first 4 lines of the ReadsPerGene.out.tab file?

Cheers
Alex
alexdobin is offline   Reply With Quote
Old 12-23-2015, 12:16 PM   #3
analog900
Member
 
Location: West Coast

Join Date: Oct 2014
Posts: 13
Default

Hi Alex,
Thanks for replying!

See below the first few lines from ReadsPerGene and afterwards the Log.final.out

ReadsPerGene:
N_unmapped 55905 55905 55905
N_multimapping 48816 48816 48816
N_noFeature 603492 868892 866859
N_ambiguous 13500 2740 2813
Xkr4 0 0 0
Rp1 0 0 0
Sox17 0 0 0
Mrpl15 0 0 0
Lypla1 0 0 0
Tcea1 45 22 23
Rgs20 12 5 7
Atp6v1h 0 0 0
Oprk1 0 0 0
Npbwr1 0 0 0
Rb1cc1 0 0 0
Fam150a 0 0 0
St18 0 0 0
Pcmtd1 162 79 83
Sntg1 0 0 0
Rrs1 0 0 0
Adhfe1 0 0 0


--------------------------------
Log.Final.Out:
Started job on | Nov 26 21:01:50
Started mapping on | Nov 26 21:01:50
Finished on | Nov 26 21:02:50
Mapping speed, Million of reads per hour | 74.44

Number of input reads | 1,240,652
Average input read length | 357
UNIQUE READS:
Uniquely mapped reads number | 1140207
Uniquely mapped reads % | 91.90%
Average mapped length | 353.64
Number of splices: Total | 421644
Number of splices: Annotated (sjdb) | 402181
Number of splices: GT/AG | 416953
Number of splices: GC/AG | 4241
Number of splices: AT/AC | 209
Number of splices: Non-canonical | 241
Mismatch rate per base, % | 0.55%
Deletion rate per base | 0.02%
Deletion average length | 1.68
Insertion rate per base | 0.01%
Insertion average length | 1.22
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 48816
% of reads mapped to multiple loci | 3.93%
Number of reads mapped to too many loci | 2488
% of reads mapped to too many loci | 0.20%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 3.76%
% of reads unmapped: other | 0.20%

Thanks so much in advance for your help!
analog900 is offline   Reply With Quote
Old 12-23-2015, 03:05 PM   #4
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Nothing suspicious in the summary statistics...
It might be a problem with annotations - could you point me to the genome/annotations you were using? And also the public .fastq files you mentioned.
alexdobin is offline   Reply With Quote
Old 12-23-2015, 03:25 PM   #5
analog900
Member
 
Location: West Coast

Join Date: Oct 2014
Posts: 13
Default

Hi Alex,

I downloaded UCSC mm10 from here:

http://cole-trapnell-lab.github.io/c...ble/index.html

and downloaded "control data" from here:
http://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP045452

I used the first dozen of samples, starting with SRR1544693.

Indexed the genome using the --sjdbGTFfile and --sjdbOverhang 299 because we sequenced 300bp paired-end (MiSeq v3 on Nextera XT). I'm assuming this parameter should not negatively affect the short 50bp reads in the Zeisel data (SRR.... files)?

I recompiled STAR (v2.4.2) for long(er) reads by inserting:
#define DEF_readSeqLengthMax 700
into the IncludeDefine.h as described here:
https://github.com/PacificBioscience...r-Iso-Seq-data

As mentioned, mapping using default parameters (I tried outputting as SAM or coord.-sorted BAMS, didn't seem to matter).

Thanks again for your help!
analog900 is offline   Reply With Quote
Old 12-30-2015, 12:24 PM   #6
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Hi @analog900,

I have mapped one of these samples to m38 assembly with Gencode M8 annotations, and got 1.2M reads mapped uniquely, out of which 600k (50%) were "no feature", i.e. did not overlap annotated exons - this is similar to what you got.

To trick STAR into counting how many read map to entire genes (including both introns and exons), you can use --sjdbGTFfile gencode.vM8.annotation.gtf --sjdbGTFfeatureExon gene --sjdbGTFtagExonParentTranscript gene_id --sjdbGTFtagExonParentGene gene_name .
This resulted in only 100k "no feature" reads (<10%).

It seems like the explanation is that a large proportion of reads (~40%) map to the introns of the genes.

Cheers
Alex
alexdobin is offline   Reply With Quote
Old 01-03-2016, 12:29 PM   #7
analog900
Member
 
Location: West Coast

Join Date: Oct 2014
Posts: 13
Default

Hi Alex,
Thanks so very much for your help. I'm still running a few more runs, but it does indeed help. In my own samples the decrease in percentage (no-feature) is not quite as dramatic as in the published controls, but getting a reduction of about 10%. Still leaves me with about 40% "no-feature". I'll look into this more carefully but thanks again so much for helping!
analog900 is offline   Reply With Quote
Old 01-12-2016, 09:11 AM   #8
viktoriag
Junior Member
 
Location: California

Join Date: Jan 2016
Posts: 1
Default continuation of problem

Hi Alex,

I work with the user who posted the original question. We have included the gene parameters for annotation. We have also added ERCCs and internal control transgenes to the fasta file and the gtf. For the ERCCs and transgenes, i triplicated the information modifying the column to include gene/transcript/exon for every entry. Among all data, we are getting no ReadsPerGene counts for our controls (the transgenes) - they should clearly be there. Also, the known genes that should be there for a specific cell type are not showing up with hits. Is there anything else we could augment to help STAR perform the ReadsPerGene counts better? Should we use some other tool and not rely on that? Please let me know if you prefer me to send you an example file.

Also, i am using STARlong for the analysis - all the same abilities should be there in that script, right?

Thank you!
viktoriag is offline   Reply With Quote
Old 01-12-2016, 11:02 AM   #9
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Hi @viktoriag,

STAR GeneCount algorithm is very straightforward: all uniquely mapping reads that overlap exons of only one gene are counted. Multimapping reads are not counted, and reads overlapping exons of more than one gene are reported as "ambiguous". This algorithm produces exactly the same counts as htseq-count with default parameters.

To figure out what's going on with your transgenes, you can count the number of reads mapped to the transgene
(i) unique mappers only:
samtools view -c -q 255 Aligned.sortedByCoord.out.bam <Trasgene_Name>
(ii) unique+multipe
samtools view -c Aligned.sortedByCoord.out.bam <Trasgene_Name>

<Trasgene_Name> here is the name of one of your expressed transgenes in the fasta file.

The (i) number (divided by 2 if you have paired-end reads) should agree with the count for this gene in the ReadsPerGene file.

If this does not work, please send me an example file with ~1M reads, preferrably sorted BAM, and also the transgene portion of your GTF file.

Cheers
Alex

Quote:
Originally Posted by viktoriag View Post
Hi Alex,

I work with the user who posted the original question. We have included the gene parameters for annotation. We have also added ERCCs and internal control transgenes to the fasta file and the gtf. For the ERCCs and transgenes, i triplicated the information modifying the column to include gene/transcript/exon for every entry. Among all data, we are getting no ReadsPerGene counts for our controls (the transgenes) - they should clearly be there. Also, the known genes that should be there for a specific cell type are not showing up with hits. Is there anything else we could augment to help STAR perform the ReadsPerGene counts better? Should we use some other tool and not rely on that? Please let me know if you prefer me to send you an example file.

Also, i am using STARlong for the analysis - all the same abilities should be there in that script, right?

Thank you!
alexdobin is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:18 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO