SEQanswers

02-10-2014, 02:40 PM   #1
ecSeq Bioinformatics
Senior Member
Location: Leipzig, Germany
Join Date: May 2012
Posts: 235
A multi-split mapping algorithm for circular RNA, splicing, trans-splicing, and ...

ABSTRACT:

Numerous high-throughput sequencing studies focus on detecting conventionally spliced mRNAs in RNA-seq data. However, non-standard RNAs arising through gene fusion, circularization, or trans-splicing are often neglected. We introduce a novel, unbiased algorithm to detect splice junctions from single-end cDNA sequences. In contrast to other methods, our approach accommodates multi-junction structures. Our method compares favorably with competing tools on conventionally spliced mRNAs and, with a gain of up to 40% of recall, systematically outperforms them on reads with multiple splits, trans-splicing and circular products. The algorithm is integrated into our mapping tool segemehl (www.bioinf.uni-leipzig.de/Software/segemehl/).

Steve Hoffmann, Christian Otto, Gero Doose, Andrea Tanzer, David Langenberger, Sabina Christ, Manfred Kunz, Lesca Holdt, Daniel Teupser, Jörg Hackermüller and Peter F Stadler: 'A multi-split mapping algorithm for circular RNA, splicing, trans-splicing, and fusion detection', Genome Biology, 15:R34, doi:10.1186/gb-2014-15-2-r34 (2014)
__________________
ecSeq Bioinformatics is Europe’s leading provider of hands-on bioinformatics workshops and professional data analysis in the field of Next-Generation Sequencing (NGS).

02-11-2014, 01:43 AM   #2
dietmar13
Senior Member
Location: Vienna
Join Date: Mar 2010
Posts: 107
circular RNA

dear segemehl programmer,

Which conditions are best for finding back-spliced (circular) transcripts from 50 PE Illumina reads?

I would run with the following parameters:

Code:
./segemehl.x -t 20 -T -Y -S -i $index -d $fa -q $fq/15607.1.fq.gz -p $fq/15607.2.fq.gz | gzip > $out/CoCa_15607.sam.gz
Should I change MEDAH?

How can I use Haarz to extract back-spliced reads specifically?

dietmar

PS: the documentation is for version 0.1.3 - is there a newer one?

02-11-2014, 03:57 AM   #3
luitpold
Junior Member
Location: europe
Join Date: Feb 2014
Posts: 3

Hi Dietmar,

build the source

>make
>make testrealign.x

to do the mapping:
>segemehl.x -q file.fq -d hg19.fa -i hg19.idx -S -s -o file.out

option -S turns on the splice feature. This includes all non-standard splicing events. The option -s shuts up the progress bar.

to call the junctions:
>testrealign.x -d hg19.fa -q file.out -n

Option -n is necessary to stop the program from realigning reads, which takes much longer.

Hope that helps!
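
The two commands above can be chained in a small wrapper. This is only a sketch: `map_and_call`, `run_step` and the `DRY_RUN` switch are hypothetical names introduced here, and the only segemehl options used are the ones quoted in this post.

```shell
# Sketch only: chains the mapping and junction-calling commands quoted above.
# run_step, map_and_call and DRY_RUN are illustrative helpers, not part of segemehl.

run_step() {
    # Print the command; execute it only when DRY_RUN is not set to 1.
    printf '+ %s\n' "$*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

map_and_call() {
    local reads=$1 genome=$2 index=$3 sam=$4
    # -S turns on split-read mapping, -s silences the progress bar (as described above).
    run_step segemehl.x -q "$reads" -d "$genome" -i "$index" -S -s -o "$sam" || return 1
    # -n skips the realignment step (as described above).
    run_step testrealign.x -d "$genome" -q "$sam" -n
}
```

With `DRY_RUN=1` set, `map_and_call file.fq hg19.fa hg19.idx file.out` only prints the two commands, which is a cheap way to check paths before starting a long run.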

02-11-2014, 05:27 AM   #4
dietmar13
Senior Member
Location: Vienna
Join Date: Mar 2010
Posts: 107
@luitpold

dear luitpold,

thank you!

but i always get this error:
Code:
testrealign.x: libs/memory.c:18: bl_realloc: Assertion `ptr != ((void *)0)' failed.
./testrealign_CoCa_CoNo.sh: line 11:  5078 Aborted                 (core dumped) ./testrealign.x -d $fa -q $out/CoCa_15607.sam -n -U $out/15607_splitfile.bed -T $out/15607_transsplit.bed
Any hint what could be wrong? Is the SAM file too large (42 GB)? I have 96 GB RAM.

dietmar

02-11-2014, 05:51 AM   #5
luitpold
Junior Member
Location: europe
Join Date: Feb 2014
Posts: 3

Hi Dietmar,

seems to be an "out of memory" issue. You might want to test it on a smaller SAM file … otherwise contact the developers directly …

02-11-2014, 06:20 AM   #6
luitpold
Junior Member
Location: europe
Join Date: Feb 2014
Posts: 3

Dietmar,

one more thought … is your SAM file sorted?

02-12-2014, 09:51 AM   #7
dietmar13
Senior Member
Location: Vienna
Join Date: Mar 2010
Posts: 107
thank you,

sorting solved the problem.

dietmar
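
For anyone hitting the same assertion: the thread does not say which sort order fixed it. Below is a minimal sketch, assuming coordinate order (reference name, then position) is what testrealign.x expects. `sort_sam` is a hypothetical helper built from standard Unix tools; `samtools sort` is the more usual route when it is installed.

```shell
# Sketch: coordinate-sort a plain-text SAM file with standard Unix tools.
# sort_sam is a hypothetical helper name, not part of segemehl or samtools.
sort_sam() {
    local in=$1 out=$2
    # Header lines start with '@' and stay on top in their original order.
    grep '^@' "$in" > "$out" || true
    # Alignment lines: sort by reference name (column 3), then numerically by position (column 4).
    grep -v '^@' "$in" | LC_ALL=C sort -t "$(printf '\t')" -k3,3 -k4,4n >> "$out"
}
```

Usage would look like `sort_sam CoCa_15607.sam CoCa_15607.sorted.sam`. Note that plain `sort` orders reference names lexicographically, which can differ from the order of the @SQ header lines.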

04-07-2014, 02:14 AM   #8
mamonster
Junior Member
Location: Taipei
Join Date: Feb 2014
Posts: 2

Dear segemehl development team,

Using segemehl on the Memczak 2013 Nature data sets, I managed to get tens of thousands of circular RNA splice junctions. However, when I compared them to the published data of Memczak, I found that 61 of the ~250 circular RNAs in the HEK293 cell line were not in the result I got from segemehl, which differs from what is reported in your manuscript. Do you think adding the trimming options (-Y -T) would make a difference?

Also, I found it difficult to use testrealign.x to look for junction sites in large SAM files. I tried the -B option to split the result by chromosome, but it still does not work: the resulting BED files are empty.

Thank you
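
A comparison like the one described above can be sketched as an exact coordinate match between two junction lists. The assumptions here are that both files are tab-separated with chromosome, start and end in the first three columns, and that both use the same coordinate convention; `junction_keys` and `count_missing` are hypothetical helper names.

```shell
# Sketch: count published junctions that have no exact coordinate match in a detected set.
# Assumes tab-separated input with chrom, start, end in columns 1-3 of both files.
junction_keys() {
    # Reduce a file to unique, sorted "chrom:start-end" keys, skipping comment and short lines.
    awk -F '\t' 'NF >= 3 && $1 !~ /^#/ { print $1 ":" $2 "-" $3 }' "$1" | LC_ALL=C sort -u
}

count_missing() {
    local published=$1 detected=$2 work
    work=$(mktemp -d)
    junction_keys "$published" > "$work/pub"
    junction_keys "$detected" > "$work/det"
    # comm -23 keeps only the keys that appear in the published list alone.
    LC_ALL=C comm -23 "$work/pub" "$work/det" | wc -l | tr -d ' '
    rm -rf "$work"
}
```

Exact matching is strict: BED uses 0-based, half-open intervals, so a list exported with 1-based coordinates would show up as misses even when the junctions agree. Checking a few "missing" entries by hand for an off-by-one shift may be worthwhile before concluding they were not detected.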

04-27-2014, 05:50 AM   #9
ecSeq Bioinformatics
Senior Member
Location: Leipzig, Germany
Join Date: May 2012
Posts: 235

If you are interested in how to use segemehl to detect fusion transcripts and/or circularized RNAs, I can recommend the following hands-on course:
Discovering standard and non-standard RNA transcripts - How to detect canonical splicing, circular RNAs, trans-splicing, and fusion transcripts

Developers of the algorithm will explain step by step how you can use segemehl to detect standard and non-standard transcripts.

05-17-2014, 10:29 PM   #10
ntn12
Junior Member
Location: Tallin
Join Date: May 2014
Posts: 7

Is there any article, paper, or study where segemehl has been used for finding fusion genes (e.g. one that shows a fusion gene found by segemehl)? Has segemehl been compared with other gene fusion finders? On average, how many fusion genes are reported per sample? What is the wet-lab validation rate of the fusions found by segemehl?

For my case, reporting hundreds or thousands of candidate fusion genes per sample is totally useless, because according to the medical/biological literature fusion genes are very rare events (i.e. 98% of all patient samples have zero fusions), and when they are indeed found there are only very few in one sample: 25 per sample would be the absolute maximum, and the average would be around 1 to 3 per sample. Please note that fusion genes are not SNPs/indels/alternative-splicing events. Here the scientific "null" hypothesis is that there are on average between 0 and 5 fusion genes per sample! This hypothesis can be rejected only with wet-lab data and NOT with in silico data! If a tool reports over 100 candidate fusion genes per sample, that tool already has a ~95% false positive rate!
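
The arithmetic behind the ~95% figure can be sketched as follows, taking the cap of 5 true fusions per sample as given: if at most 5 of 100 reported candidates can be real, at least 95 must be false. `min_fp_percent` is a hypothetical helper name, and the value it prints is a lower bound on the share of false calls among the reported candidates, not a measured rate.

```shell
# Sketch: lower bound on the percentage of false calls among reported candidates,
# given an assumed upper limit on the number of true fusions per sample.
min_fp_percent() {
    local reported=$1 max_true=$2
    awk -v r="$reported" -v t="$max_true" 'BEGIN {
        if (r <= 0) { print "0.0"; exit }
        fp = r - t                 # candidates that cannot all be true
        if (fp < 0) fp = 0         # fewer candidates than the cap: no forced false calls
        printf "%.1f\n", 100 * fp / r
    }'
}
```

For example, `min_fp_percent 100 5` prints `95.0`, and `min_fp_percent 1000 3` prints `99.7`.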

I would like to use it for finding pathogenetic/somatic fusion genes. I searched very hard and was not able to find anything which suggests that segemehl has ever been used for finding pathogenetic/somatic fusion genes.

Last edited by ntn12; 05-18-2014 at 12:46 AM.

05-18-2014, 01:05 AM   #11
Paul Newport
Member
Location: Bristol, UK
Join Date: May 2014
Posts: 10

Aren't most of these questions answered when reading the segemehl publication? They compared their tool with 7 other state-of-the-art tools and validated their results based on available RNA-seq datasets.

As far as I can judge the situation, the group that developed segemehl is a pure bioinformatics group and thus they did not perform any wet-lab validation, but implemented a tool that does what it should (compared to other algorithms). And since it was published only some months ago, I think we have to wait until we find any article where segemehl was used to find fusion genes.

I'm curious about these future publications, since the examples shown in the paper are quite impressive. But the future will show if segemehl is really that good.

05-18-2014, 03:30 AM   #12
ntn12
Junior Member
Location: Tallin
Join Date: May 2014
Posts: 7

Quote:
Originally Posted by Paul Newport
Aren't most of these questions answered when reading the segemehl publication? They compared their tool with 7 other state-of-the-art tools and validated their results based on available RNA-seq datasets.
...

Could you point to the publication where SEGEMEHL is used for finding fusion genes?

If you mean this:
http://bioinformatics.oxfordjournals...s.btu146.short

then there SEGEMEHL is compared to STAR, BOWTIE2, BWA-MEM, BLAT, etc., and not even one of these is a gene fusion finder! The word fusion is not mentioned even once in the entire article (except in the references). Fusion gene finders are, for example, SOAPfuse, deFuse, FusionHunter, etc. How does SEGEMEHL compare to these? Here is a nice comparison of fusion gene finders: http://code.google.com/p/fusioncatcher/wiki/comparison

Did I miss something here?

I mean by fusion genes this:
http://erc.endocrinology-journals.or.../R143.full.pdf

P.S. Read splitting is not the same as finding fusion genes!

Last edited by ntn12; 05-18-2014 at 10:15 AM.

05-18-2014, 04:58 AM   #13
ecSeq Bioinformatics
Senior Member
Location: Leipzig, Germany
Join Date: May 2012
Posts: 235

Dear ntn12,

thanks for your comments and questions.

segemehl itself is not a fusion-gene-finder. It is a mapping tool that can detect split-reads and its resulting set of these split-reads can be used to call fusion genes. But it has to be done in a separate downstream analysis and is not included in the segemehl algorithm. I hope that makes things clearer.

Last edited by ecSeq Bioinformatics; 05-19-2014 at 12:01 AM.

05-18-2014, 04:48 PM   #14
Paul Newport
Member
Location: Bristol, UK
Join Date: May 2014
Posts: 10

Quote:
Originally Posted by ntn12
Here is a nice comparison of fusion gene finders: http://code.google.com/p/fusioncatcher/wiki/comparison

Sorry, but I don't understand the list shown on the linked page.

My questions would be:
  1. Where do these 40 fusion genes come from?
  2. Why does only FusionCatcher find all of these?
  3. Why is this list on the FusionCatcher website?

That looks a bit suspicious to me!

05-18-2014, 05:00 PM   #15
Paul Newport
Member
Location: Bristol, UK
Join Date: May 2014
Posts: 10

Quote:
Originally Posted by Paul Newport
Where do these 40 fusion genes come from?

I just did some research and found on the FusionCatcher website:

FusionCatcher has been used originally for finding novel and known fusion genes in breast tumor cell lines BT474, SKBR3, MCF7, KPL4 as shown in the following articles:
  • S. Kangaspeska, S. Hultsch, H. Edgren, D. Nicorici, A. Murumägi, O.P. Kallioniemi, Reanalysis of RNA-sequencing data reveals several additional fusion genes with multiple isoforms, PLOS One 2012. http://dx.plos.org/10.1371/journal.pone.0048745
  • H. Edgren, A. Murumagi, S. Kangaspeska, D. Nicorici, V. Hongisto, K. Kleivi, I.H. Rye, S. Nyberg, M. Wolf, A.L. Borresen-Dale, O.P. Kallioniemi, Identification of fusion genes in breast cancer by paired-end RNA-sequencing, Genome Biology 2011, Vol. 12. http://genomebiology.com/2011/12/1/R6

These are the same two publications shown on the "comparison" page. So the 40 genes were predicted using FusionCatcher? Honestly?

05-18-2014, 07:30 PM   #16
ntn12
Junior Member
Location: Tallin
Join Date: May 2014
Posts: 7

Quote:
Originally Posted by ecSeq Bioinformatics
Dear ntn12,

thanks for your comments and questions.

segemehl itself is not a fusion-finder. It is a mapping tool that can detect split-reads and its resulting set of these split-reads can be used to call fusion genes. But it has to be done in a separate downstream analysis and is not included in the segemehl algorithm. I hope that makes things clearer.

Ok. I understand now that SEGEMEHL is not a fusion gene finder and has never been used for this. It has the same potential to be used as a fusion finder as, for example, BLAT/BOWTIE/BWA.

I got confused because the authors of SEGEMEHL claim in the title of their paper:

Hoffmann et al. A multi-split mapping algorithm for circular RNA, splicing, trans-splicing, and FUSION DETECTION, Genome Biol. 2014.

http://www.ncbi.nlm.nih.gov/pubmed/24512684

that SEGEMEHL does FUSION DETECTION when actually it does not.

05-18-2014, 07:55 PM   #17
ntn12
Junior Member
Location: Tallin
Join Date: May 2014
Posts: 7

Quote:
Originally Posted by Paul Newport
Sorry, but I don't understand the list shown on the linked page.

My questions would be:
  1. Where do these 40 fusion genes come from?
  2. Why does only FusionCatcher find all of these?
  3. Why is this list on the FusionCatcher website?

I do not know. We have not used FusionCatcher yet. We have been testing TopHat-fusion, FusionMap, ChimeraScan, and FusionFinder. We found it puzzling that all four give thousands of candidate fusion genes per sample (some even hundreds of thousands) when we know from the medical literature that there should not be more than 1-3 fusion genes per sample!!! Therefore one has 99% false positives here.

UPDATE: We started testing SOAPfuse and we are starting to like it!

Last edited by ntn12; 05-19-2014 at 05:48 AM.

05-18-2014, 11:25 PM   #18
ecSeq Bioinformatics
Senior Member
Location: Leipzig, Germany
Join Date: May 2012
Posts: 235

Quote:
Originally Posted by ntn12
I got confused because the authors of SEGEMEHL claim in the title of their paper:

Hoffmann et al. A multi-split mapping algorithm for circular RNA, splicing, trans-splicing, and FUSION DETECTION, Genome Biol. 2014.

http://www.ncbi.nlm.nih.gov/pubmed/24512684

that SEGEMEHL does FUSION DETECTION when actually it does not.

Dear ntn12,

please step gently here. The title of the paper is very clear and all claims are met. Before reading something into the title, you should actually read the paper. Everything is written in a very clear manner and all claims are confirmed by publicly available data.

Nevertheless, I do not understand your frustrations here. Perhaps you should directly contact the developers of the algorithm and seek a dialogue.

05-19-2014, 05:46 AM   #19
ntn12
Junior Member
Location: Tallin
Join Date: May 2014
Posts: 7

Quote:
Originally Posted by ecSeq Bioinformatics
Dear ntn12,

please step gently here. The title of the paper is very clear and all claims are met. Before reading something into the title, you should actually read the paper. Everything is written in a very clear manner and all claims are confirmed by publicly available data.

I am even more confused about SEGEMEHL after reading the paper.

The authors of this paper:

Hoffmann et al. A multi-split mapping algorithm for circular RNA, splicing, trans-splicing, and FUSION DETECTION, Genome Biol. 2014. http://www.ncbi.nlm.nih.gov/pubmed/24512684

clearly state in the title and in three other places throughout their article that:

"Here, we present a unified unbiased algorithm to detect splicing, trans-splicing and gene fusion events from single-end read data..."

"The algorithmic strategy to identify splicing, trans-splicing or gene fusion sites is based on a greedy, score-based seed chaining followed by a Smith-Waterman-like transition alignment."

"Implemented in the segemehl mapping tool, it readily identifies conventional splice junctions, collinear and non-collinear fusion transcripts, and trans-spliced RNAs, without the need for separate post-processing or an extensive computational overhead."


Also, I did not find even one fusion gene or fusion transcript found by SEGEMEHL in the same article. According to the last statement, SEGEMEHL should readily identify fusion transcripts without the need for separate post-processing.

We will use SOAPfuse for finding fusion genes because it performed really well in our tests.

Last edited by ntn12; 05-19-2014 at 06:04 AM.

05-19-2014, 06:27 AM   #20
ecSeq Bioinformatics
Senior Member
Location: Leipzig, Germany
Join Date: May 2012
Posts: 235

Dear ntn12,

I herewith take notice of your assumption that the segemehl developers wrote some statements which are confusing for you, so you will use SOAPfuse.

Last edited by ecSeq Bioinformatics; 05-19-2014 at 06:59 AM.