Hi everyone =) I present you a question that has been bothering me for a while now and, unfortunately, none of my colleagues have been able to help me.
I've finally got my biological replicates data back from the sequencing platforms so I can move into the DE analysis and make some stats and fancy graphs to show my boss =) but I'm having some problems right at the start of the pipeline.
I'll put you into context regarding my data: I'm sequencing little amounts of RNA, therefore no ribodepletion or polyA pull-down was possible. The whole transcriptome is being sequenced (kind of this single-cell sequencing using SPIA sequencing). The quality of the sequences is top-notch (no base with score under 40 in the fastqc report). I can post you the fastqc reports if you wanna look at them. Three biological replicates from two different zones of a single tissue. The project aims to detect the DE genes between those two zones of the tissue. I'm working with sheep.
I've mapped my reads using STAR with Ensembl sheep genome fasta and GTF annotation files, and this program told me that, on average, around 60% were uniquely mapped reads, around 38% were multimappers and around 2% were deemed as unmapped for all my files. So far so good 'cause I think they are expected percentages.
The problem comes when I used featureCounts, because when I read summarization it tells me that, on average, only around 10% of my reads were successfully assigned to a feature for all my files.
I've been thinking for a while, but to me it doesn't make any sense. If the sequence could not be properly mapped to any feature, they would have been deemed as unmapped by STAR in the first place.
I have no interest at the moment at multimappers, so I left them out in featureCounts, but anyway I was expecting to have more or less the same percentages than the STAR output, because If the reads were successfully mapped with STAR it means they recognize the feature inside the GTF file so, how can featureCounts not assign it to the feature if they are using the same reference files?
These are the codes used for:
Creating the genomic index:
module load STAR
STAR --runThreadN 16 \
--runMode genomeGenerate \
--genomeDir ~/RNASeqSQ/GenomeDir_ensembl/ \
--genomeFastaFiles ~/RNASeqSQ/ref_genome_ensembl/Sheep_genome_ensembl.fa \
--sjdbGTFfile ~/RNASeqSQ/ref_genome_ensembl/Sheep_genome_annotations_ensembl.gtf \
--sjdbOverhang 49
Mapping:
module load STAR
fqFiles=`find $1 -name '*.gz' -type f`
for fqFile in $fqFiles;do
STAR --runThreadN 16 \
--genomeDir ~/RNASeqSQ/GenomeDir_ensembl/ \
--readFilesIn $fqFile \
--readFilesCommand zcat \
--outFileNamePrefix ~/RNASeqSQ/mapped_data_ensembl/$(basename $fqFile .fastq.gz)_ensembl_ \
--outSAMstrandField intronMotif \
--sjdbGTFfile ~/RNASeqSQ/ref_genome_ensembl/Sheep_genome_annotations_ensembl.gtf \
--outSAMtype BAM Unsorted \
--outSAMunmapped Within \
--outFilterIntronMotifs RemoveNoncanonical
done
Reading summarization:
module load subread
featureCounts -a ../ref_genome_ensembl/Sheep_genome_annotations_ensembl.gtf \
-F GTF \
-o GvS_counts_full_ensembl.txt \
*_Aligned.out.bam
Another colleague of mine told me that I should go around this by making a de novo assembly, but, wouldn't that be too biased given that my data comes only from one tissue (brain) and only from one cell type from that tissue (cortical neurons). I'm trying to address if the cortical neurons transcriptome profiles from one side of the brain differs from another side of it.
Thanks for your support =)
I've finally got my biological replicates data back from the sequencing platforms so I can move into the DE analysis and make some stats and fancy graphs to show my boss =) but I'm having some problems right at the start of the pipeline.
I'll put you into context regarding my data: I'm sequencing little amounts of RNA, therefore no ribodepletion or polyA pull-down was possible. The whole transcriptome is being sequenced (kind of this single-cell sequencing using SPIA sequencing). The quality of the sequences is top-notch (no base with score under 40 in the fastqc report). I can post you the fastqc reports if you wanna look at them. Three biological replicates from two different zones of a single tissue. The project aims to detect the DE genes between those two zones of the tissue. I'm working with sheep.
I've mapped my reads using STAR with Ensembl sheep genome fasta and GTF annotation files, and this program told me that, on average, around 60% were uniquely mapped reads, around 38% were multimappers and around 2% were deemed as unmapped for all my files. So far so good 'cause I think they are expected percentages.
The problem comes when I used featureCounts, because when I read summarization it tells me that, on average, only around 10% of my reads were successfully assigned to a feature for all my files.
I've been thinking for a while, but to me it doesn't make any sense. If the sequence could not be properly mapped to any feature, they would have been deemed as unmapped by STAR in the first place.
I have no interest at the moment at multimappers, so I left them out in featureCounts, but anyway I was expecting to have more or less the same percentages than the STAR output, because If the reads were successfully mapped with STAR it means they recognize the feature inside the GTF file so, how can featureCounts not assign it to the feature if they are using the same reference files?
These are the codes used for:
Creating the genomic index:
module load STAR
STAR --runThreadN 16 \
--runMode genomeGenerate \
--genomeDir ~/RNASeqSQ/GenomeDir_ensembl/ \
--genomeFastaFiles ~/RNASeqSQ/ref_genome_ensembl/Sheep_genome_ensembl.fa \
--sjdbGTFfile ~/RNASeqSQ/ref_genome_ensembl/Sheep_genome_annotations_ensembl.gtf \
--sjdbOverhang 49
Mapping:
module load STAR
fqFiles=`find $1 -name '*.gz' -type f`
for fqFile in $fqFiles;do
STAR --runThreadN 16 \
--genomeDir ~/RNASeqSQ/GenomeDir_ensembl/ \
--readFilesIn $fqFile \
--readFilesCommand zcat \
--outFileNamePrefix ~/RNASeqSQ/mapped_data_ensembl/$(basename $fqFile .fastq.gz)_ensembl_ \
--outSAMstrandField intronMotif \
--sjdbGTFfile ~/RNASeqSQ/ref_genome_ensembl/Sheep_genome_annotations_ensembl.gtf \
--outSAMtype BAM Unsorted \
--outSAMunmapped Within \
--outFilterIntronMotifs RemoveNoncanonical
done
Reading summarization:
module load subread
featureCounts -a ../ref_genome_ensembl/Sheep_genome_annotations_ensembl.gtf \
-F GTF \
-o GvS_counts_full_ensembl.txt \
*_Aligned.out.bam
Another colleague of mine told me that I should go around this by making a de novo assembly, but, wouldn't that be too biased given that my data comes only from one tissue (brain) and only from one cell type from that tissue (cortical neurons). I'm trying to address if the cortical neurons transcriptome profiles from one side of the brain differs from another side of it.
Thanks for your support =)
Comment