SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
featureCounts: a universal read summarization program shi Bioinformatics 155 02-20-2020 03:21 PM
[bwa] BWA -a inconsistent results EReckase Bioinformatics 4 11-13-2014 12:35 PM
convert bam file to bedgraph format with inconsistent results zhiqiang Bioinformatics 0 07-26-2012 11:48 PM
Inconsistent results of cuffcompare, looking for help weibominingdata Bioinformatics 2 11-09-2011 08:47 PM
Very Bad Mapping Results with several mapping softwares xquan Bioinformatics 13 05-22-2011 11:31 PM

Reply
 
Thread Tools
Old 08-22-2016, 06:14 PM   #1
Sebastian_Quezada_R
Junior Member
 
Location: Australia

Join Date: Aug 2016
Posts: 7
Default Inconsistent results between mapping and read summarization

Hi everyone =) I present you a question that has been bothering me for a while now and, unfortunately, none of my colleagues have been able to help me.

I've finally got my biological replicates data back from the sequencing platforms so I can move into the DE analysis and make some stats and fancy graphs to show my boss =) but I'm having some problems right at the start of the pipeline.

I'll put you into context regarding my data: I'm sequencing little amounts of RNA, therefore no ribodepletion or polyA pull-down was possible. The whole transcriptome is being sequenced (kind of this single-cell sequencing using SPIA sequencing). The quality of the sequences is top-notch (no base with score under 40 in the fastqc report). I can post you the fastqc reports if you wanna look at them. Three biological replicates from two different zones of a single tissue. The project aims to detect the DE genes between those two zones of the tissue. I'm working with sheep.

I've mapped my reads using STAR with Ensembl sheep genome fasta and GTF annotation files, and this program told me that, on average, around 60% were uniquely mapped reads, around 38% were multimappers and around 2% were deemed as unmapped for all my files. So far so good 'cause I think they are expected percentages.

The problem comes when I used featureCounts, because when I read summarization it tells me that, on average, only around 10% of my reads were successfully assigned to a feature for all my files.

I've been thinking for a while, but to me it doesn't make any sense. If the sequence could not be properly mapped to any feature, they would have been deemed as unmapped by STAR in the first place.

I have no interest at the moment at multimappers, so I left them out in featureCounts, but anyway I was expecting to have more or less the same percentages than the STAR output, because If the reads were successfully mapped with STAR it means they recognize the feature inside the GTF file so, how can featureCounts not assign it to the feature if they are using the same reference files?

These are the codes used for:

Creating the genomic index:

module load STAR

STAR --runThreadN 16 \
--runMode genomeGenerate \
--genomeDir ~/RNASeqSQ/GenomeDir_ensembl/ \
--genomeFastaFiles ~/RNASeqSQ/ref_genome_ensembl/Sheep_genome_ensembl.fa \
--sjdbGTFfile ~/RNASeqSQ/ref_genome_ensembl/Sheep_genome_annotations_ensembl.gtf \
--sjdbOverhang 49


Mapping:

module load STAR

fqFiles=`find $1 -name '*.gz' -type f`

for fqFile in $fqFiles;do
STAR --runThreadN 16 \
--genomeDir ~/RNASeqSQ/GenomeDir_ensembl/ \
--readFilesIn $fqFile \
--readFilesCommand zcat \
--outFileNamePrefix ~/RNASeqSQ/mapped_data_ensembl/$(basename $fqFile .fastq.gz)_ensembl_ \
--outSAMstrandField intronMotif \
--sjdbGTFfile ~/RNASeqSQ/ref_genome_ensembl/Sheep_genome_annotations_ensembl.gtf \
--outSAMtype BAM Unsorted \
--outSAMunmapped Within \
--outFilterIntronMotifs RemoveNoncanonical
done


Reading summarization:

module load subread

featureCounts -a ../ref_genome_ensembl/Sheep_genome_annotations_ensembl.gtf \
-F GTF \
-o GvS_counts_full_ensembl.txt \
*_Aligned.out.bam


Another colleague of mine told me that I should go around this by making a de novo assembly, but, wouldn't that be too biased given that my data comes only from one tissue (brain) and only from one cell type from that tissue (cortical neurons). I'm trying to address if the cortical neurons transcriptome profiles from one side of the brain differs from another side of it.

Thanks for your support =)
Sebastian_Quezada_R is offline   Reply With Quote
Old 08-23-2016, 12:14 AM   #2
Michael.Ante
Senior Member
 
Location: Vienna

Join Date: Oct 2011
Posts: 123
Default

Hi Sebastian,

You should have a look at the rRNA content. Download just the sheep's rRNA sequences from NCBI as fasta, create a bowtie2 index, and map the reads with bowtie2 against them. The mapping rate is equal to the rRNA content in your sample. The remaining reads can be used for your STAR mapping.

Cheers,

Michael
Michael.Ante is offline   Reply With Quote
Old 08-23-2016, 03:18 PM   #3
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

What featureCounts does is it counts the number of reads overlapping with features included in the annotation. Reads that were reported as mapped by aligner but not overlapping any features will not be counted by featureCounts. For an RNA-seq dataset, typically you will see around 30-50 percent of reads that were mapped but not counted.

featureCounts outputs a counting summary file (called "GvS_counts_full_ensembl.txt.summary" for your data), which should be helpful for you to understand why some reads were not counted.
shi is offline   Reply With Quote
Old 08-23-2016, 04:55 PM   #4
Sebastian_Quezada_R
Junior Member
 
Location: Australia

Join Date: Aug 2016
Posts: 7
Default

Quote:
Originally Posted by Michael.Ante View Post
Hi Sebastian,

You should have a look at the rRNA content. Download just the sheep's rRNA sequences from NCBI as fasta, create a bowtie2 index, and map the reads with bowtie2 against them. The mapping rate is equal to the rRNA content in your sample. The remaining reads can be used for your STAR mapping.

Cheers,

Michael
Thanks Michael, I'll try that and see what I can get.

Quote:
Originally Posted by shi View Post
What featureCounts does is it counts the number of reads overlapping with features included in the annotation. Reads that were reported as mapped by aligner but not overlapping any features will not be counted by featureCounts. For an RNA-seq dataset, typically you will see around 30-50 percent of reads that were mapped but not counted.

featureCounts outputs a counting summary file (called "GvS_counts_full_ensembl.txt.summary" for your data), which should be helpful for you to understand why some reads were not counted.
I just checked that file and it tells me the following stats:

Status v |Sample > 1251.1G 1251.1S 1251.2G 1251.2S 658.1G 658.1S
1 Assigned 21298779 19203813 23927215 20277649 21917507 27799505
2 Unassigned_Ambiguity 231677 212367 246618 210001 292053 256487
3 Unassigned_MultiMapping 146004332 162822044 159017518 147923500 116264627 154408557
4 Unassigned_NoFeatures 33888112 33895039 35942035 29565235 24630110 39141330
5 Unassigned_Unmapped 6568710 4145532 5041164 5343282 37451751 4849700
6 Unassigned_MappingQuality 0 0 0 0 0 0
7 Unassigned_FragmentLength 0 0 0 0 0 0
8 Unassigned_Chimera 0 0 0 0 0 0
9 Unassigned_Secondary 0 0 0 0 0 0
10 Unassigned_Nonjunction 0 0 0 0 0 0
11 Unassigned_Duplicate 0 0 0 0 0 0


So most of my reads were assigned as multimappers, whereas STAR did not count them as such =/.

Thanks a lot for your help.

Do you think the de novo assembly pipeline would be useful to go around this problem as a backup plan?

Last edited by Sebastian_Quezada_R; 08-23-2016 at 09:01 PM.
Sebastian_Quezada_R is offline   Reply With Quote
Old 08-25-2016, 03:30 PM   #5
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Quote:
So most of my reads were assigned as multimappers, whereas STAR did not count them as such =/.
featureCounts uses 'NH' tag to identify multi-mapping reads. So you have a lot of reads with this tag having a value greater than 1 in your mapping result.
shi is offline   Reply With Quote
Old 08-25-2016, 06:03 PM   #6
Sebastian_Quezada_R
Junior Member
 
Location: Australia

Join Date: Aug 2016
Posts: 7
Default

Quote:
Originally Posted by shi View Post
featureCounts uses 'NH' tag to identify multi-mapping reads. So you have a lot of reads with this tag having a value greater than 1 in your mapping result.
Thanks, Shi. I finally got it properly. I'm going to start another thread regarding the issues I'm having with my data, so I can see if I can get some insight that my bioinformatician colleagues might have missed.

Thanks again =)
Sebastian_Quezada_R is offline   Reply With Quote
Old 08-25-2016, 10:41 PM   #7
Michael.Ante
Senior Member
 
Location: Vienna

Join Date: Oct 2011
Posts: 123
Default

Hi Sebastian,
the latest versions of STAR report the NH field. In older versions you should be able to report the number of hits with an according parameters setting.
Michael.Ante is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:13 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO