Hi,
I have been trying to use StringTie for transcriptome re-assembly, guided by a reference GTF file.
Here is how I ran it:
# for each of the bam files from my project (aligned with tophat2):
stringtie file.bam -G reference.gtf -o file_stringtie.gtf -p 4 -v -C file_coverage.txt -A file_gene_abundance.out
# then merging all gtf files together:
stringtie --merge -G reference.gtf -p 4 -o all_merged.gtf gtf_list.txt
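For completeness, the per-sample step and the list file used by `--merge` can be sketched as a small shell function. This is only a sketch under assumptions: the function name `assemble_all`, the directory layout, and the output file names are hypothetical, and the StringTie flags are the ones from the commands above (minus `-v`).

```shell
# Hypothetical helper: assemble every BAM in a directory with the reference
# annotation and record each output GTF, one path per line, for the merge step.
# Usage: assemble_all BAM_DIR REFERENCE_GTF OUT_DIR
assemble_all() {
    bam_dir=$1
    ref_gtf=$2
    out_dir=$3
    mkdir -p "$out_dir"
    : > "$out_dir/gtf_list.txt"            # start from an empty list file
    for bam in "$bam_dir"/*.bam; do
        [ -e "$bam" ] || continue          # the glob matched nothing
        sample=$(basename "$bam" .bam)
        stringtie "$bam" -G "$ref_gtf" -p 4 \
            -o "$out_dir/${sample}_stringtie.gtf" \
            -C "$out_dir/${sample}_coverage.txt" \
            -A "$out_dir/${sample}_gene_abundance.out" || return 1
        echo "$out_dir/${sample}_stringtie.gtf" >> "$out_dir/gtf_list.txt"
    done
}
```

Calling `assemble_all bams reference.gtf assembled` would then leave `assembled/gtf_list.txt` ready to pass to `stringtie --merge`.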
It is very straightforward, and also incredibly fast compared to the cufflinks + cuffmerge pipeline.
But when I compare the number of transcripts found in the reference GTF file and in the output of Stringtie, it is dramatically different:
awk '$3=="transcript"' reference.gtf | wc -l
# 23963
awk '$3=="transcript"' all_merged.gtf | wc -l
# 57830
I expect and hope for new transcripts, but more than doubling the reference count seems like too large a difference (am I wrong?).
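To see how much of the increase comes from novel transcripts, the count can be split by transcript ID prefix. The sketch below assumes that novel transcripts in the merged file carry StringTie's default `MSTRG` label in `transcript_id` and that reference transcripts keep their original IDs; the helper name `count_by_prefix` is hypothetical.

```shell
# Hypothetical helper: count "transcript" records whose transcript_id starts
# with PREFIX (an empty PREFIX counts every transcript record).
# Usage: count_by_prefix FILE.gtf PREFIX
count_by_prefix() {
    awk -F '\t' -v p="$2" '
        # column 3 is the feature type, column 9 holds the attributes
        $3 == "transcript" && index($9, "transcript_id \"" p) > 0 { n++ }
        END { print n + 0 }
    ' "$1"
}
```

For example, `count_by_prefix all_merged.gtf MSTRG` would give the number of transcripts labelled as novel, and `count_by_prefix all_merged.gtf ""` the total.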
How can I make the pipeline more stringent?
Would you advise increasing the minimum input transcript coverage in the merging step, for example?
Also, looking at cuffmerge's parameters, the minimum isoform fraction is set to 0.05, while in StringTie it defaults to 0.01: is raising it the way to go?
I have tried these parameters:
stringtie --merge -c 2.5 -G reference.gtf -p 4 -o all_merged_bis.gtf gtf_list.txt
awk '$3=="transcript"' all_merged_bis.gtf | wc -l
# 57476
stringtie --merge -f 0.05 -G reference.gtf -p 4 -o all_merged_ter.gtf gtf_list.txt
awk '$3=="transcript"' all_merged_ter.gtf | wc -l
# 36164
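The two tests above change one parameter at a time. A small sweep that combines `-c` with several `-f` values and prints the transcript count for each run might look like the sketch below. The function name `sweep_merge` and the output file naming are hypothetical, and the values to try are up to you.

```shell
# Hypothetical helper: run the merge once per isoform-fraction value with a
# fixed minimum coverage, and print "min_cov <TAB> fraction <TAB> transcripts".
# Usage: sweep_merge REFERENCE_GTF GTF_LIST "F1 F2 ..." MIN_COV
sweep_merge() {
    ref_gtf=$1
    gtf_list=$2
    fractions=$3
    min_cov=$4
    for f in $fractions; do
        out="merged_c${min_cov}_f${f}.gtf"
        stringtie --merge -c "$min_cov" -f "$f" -G "$ref_gtf" -p 4 \
            -o "$out" "$gtf_list" || return 1
        # count transcript records in this merged annotation
        n=$(awk -F '\t' '$3 == "transcript" { n++ } END { print n + 0 }' "$out")
        printf '%s\t%s\t%s\n' "$min_cov" "$f" "$n"
    done
}
```

For example, `sweep_merge reference.gtf gtf_list.txt "0.01 0.05 0.1" 2.5` would write one merged GTF per fraction in the current directory and print one summary line for each.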
I am merging together results from about 60 bam files, so I guess the approach can be different than for smaller projects.
Thank you for any help and advice!
Best,