SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   Bioinformatics (http://seqanswers.com/forums/forumdisplay.php?f=18)
-   -   stringtie parameters (http://seqanswers.com/forums/showthread.php?t=66297)

sbcn 02-15-2016 03:05 AM

stringtie parameters
 
Hi,

I have been trying to use Stringtie for transcriptome re-assembly, based on a reference gtf file.
Here is how I ran it:

# for each of the bam files from my project (aligned with tophat2):
stringtie file.bam -G reference.gtf -o file_stringtie.gtf -p 4 -v -C file_coverage.txt -A file_gene_abundance.out

# then merging all gtf files together:
stringtie --merge -G reference.gtf -p 4 -o all_merged.gtf gtf_list.txt

It is very straightforward. It is also incredibly fast as compared to the cufflinks + cuffmerge pipeline.

But when I compare the number of transcripts found in the reference GTF file and in the output of Stringtie, it is dramatically different:
awk '$3=="transcript"' reference.gtf | wc -l
# 23963
awk '$3=="transcript"' all_merged.gtf | wc -l
# 57830

I expect and hope for new transcripts, but I think this is a bit too much difference (Am I wrong?).

How can I make the pipeline more stringent?

Would you advice to increase the minimum input transcript coverage for example, in the merging step?
Also, If I look at some of cuffmerge's parameters, the minimum isoform fraction is set to 0.05 while in stringtie it is set as 0.01 by default: is it the way to go?

I have tried these parameters:

stringtie --merge -c 2.5 -G reference.gtf -p 4 -o all_merged_bis.gtf gtf_list.txt
awk '$3=="transcript"' all_merged_bis.gtf | wc -l
# 57476

stringtie --merge -f 0.05 -G reference.gtf -p 4 -o all_merged_ter.gtf gtf_list.txt
awk '$3=="transcript"' all_merged_ter.gtf | wc -l
# 36164

I am merging together results from about 60 bam files, so I guess the approach can be different than for smaller projects.

Thank you for any help and advice!

Best,

SES 02-16-2016 08:44 AM

I would try gffcompare (by the same author) instead of "stringtie --merge" because it seems to be more stringent. I have also experienced the same issue that you report, but it is worse for a large genome. In my case, "stringtie --merge" generated 3X more transcripts than the reference, while gffcompare only generated about 2X more. You can also discard novel loci with gffcompare if you want to only consider the reference set.

Alternatively, you can increase the thresholds for stringtie to merge transcripts.

sbcn 02-17-2016 01:48 AM

Thanks a lot for your input.

I have now tried gffcompare, but it is actually a lot worse in my case:

gffcompare -r reference.gtf -s reference.fa -C -D -i gtf_list.txt

awk '$3=="transcript"' gffcmp.combined.gtf | wc -l
# 185653

As I understand it, gffcompare creates the union of all the gtf files given as an input, and as I am merging about 60 files, I get a huge final number of transcripts.

I think stringtie --merge is more appropriate in my case as it rather constructs a kind of consensus, so I will try and work on optimizing the parameters, although I would like to make sure not to be too stringent on some of them, and too flexible on others.

mpertea 02-18-2016 09:22 AM

It is very likely that most of the transcripts that make up the difference are intronic or intergenic single exon transcripts. Especially with such a large number of samples, there are many small fragments expressed all over the place. We are more aggressive in filtering these out in StringTie version 1.2.2 (just released today), so please give it a try.

The other ways to filter more of the transcripts are with the -f parameter just as mentioned before, or with the -F or -T parameters that filter out transcripts of very low abundance in the samples. We like filtering with -F and -T more than with the -f option, because -f filters transcripts that have a relative low abundance compared to the most abundant transcript in the bundle, even if sometimes the transcripts that are filtered out are highly expressed.

SES 02-18-2016 02:34 PM

Quote:

Originally Posted by mpertea (Post 189492)
It is very likely that most of the transcripts that make up the difference are intronic or intergenic single exon transcripts. Especially with such a large number of samples, there are many small fragments expressed all over the place. We are more aggressive in filtering these out in StringTie version 1.2.2 (just released today), so please give it a try.

The other ways to filter more of the transcripts are with the -f parameter just as mentioned before, or with the -F or -T parameters that filter out transcripts of very low abundance in the samples. We like filtering with -F and -T more than with the -f option, because -f filters transcripts that have a relative low abundance compared to the most abundant transcript in the bundle, even if sometimes the transcripts that are filtered out are highly expressed.

This is very helpful, thanks. One question I have would be about the merging that gffcompare does vs. the "stringtie --merge" method. It seems like "stringtie --merge" is the more appropriate method for joining libraries from different tissues, followed by an assessment with gffcompare. Is this correct? The docs say that gffcompare also does merging but it is not clear to how this relates to what "stringtie --merge" is doing.

mcsimenc 06-20-2016 10:23 AM

Can anyone suggest an interpretation of the following results using stringtie --merge: ?

Three stringtie assemblies with 29747, 30865, and 29863 transcripts are merged using stringtie --merge and the resulting gtf has only 25130 transcripts.

Am I losing information? I do not know the internal workings of stringtie --merge but I intuitively expect to have no fewer transcripts than the input assembly with the fewest transcripts.

Thanks!!
Matt

rajeev.vikram 02-07-2017 07:46 PM

Quote:

Originally Posted by mcsimenc (Post 195921)
Can anyone suggest an interpretation of the following results using stringtie --merge: ?

Three stringtie assemblies with 29747, 30865, and 29863 transcripts are merged using stringtie --merge and the resulting gtf has only 25130 transcripts.

Am I losing information? I do not know the internal workings of stringtie --merge but I intuitively expect to have no fewer transcripts than the input assembly with the fewest transcripts.

Thanks!!
Matt

Hello Matt,

According to my understanding, the number of merged transcripts presented depends on the relative expression of the input transcript files. As the literature stares, " generate a non-redundant set of transcripts observed in all the RNA-Seq samples assembled previously to generate a a global, unified set of transcripts (isoforms) across multiple RNA-Seq samples." which means, the merge option will only produce transcripts with robust expression (or whatever expression cutoff one selects). Are you using a reference transcriptome file in assembly? you can also use gff compare to check the accuracy of your files.

Cheers


All times are GMT -8. The time now is 08:05 AM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.