SEQanswers

12-04-2011, 03:01 PM   #1
shoegame2001
Member
 
Location: California

Join Date: Dec 2010
Posts: 21
RNA-seq SNP-calling without a complete reference

I am working on a project that seeks to call SNPs for a non-model organism with no existing reference genome or transcriptome using multiplexed Illumina RNA-seq data.

I used Trinity to assemble a partial 'reference' transcriptome of the most highly expressed transcripts for which we had sufficient coverage, as well as many fragments of lower-expressed transcripts. Then I used BWA to map all data for multiple individuals back to that reference, and finally used GATK to call SNPs.

However, I am running into an issue where reads derived from paralogous genes or a multigene family are mapping back to the same reference contig, creating false SNPs in divergent positions. My evidence of this is that in general one 'allele' (actually a slightly divergent gene) is supported by significantly fewer than half of the reads for a given individual that is called a heterozygote. These 'SNPs' are also generally observed across several individuals, leading me to believe that these are not sequencing/library prep errors.
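
The allele-balance observation above can be turned into a per-genotype check. The following is a minimal sketch, not something posted in the thread: it assumes that reads from a true heterozygote sample both alleles with probability 0.5 (allele-specific expression in RNA-seq can violate this), and the function names and the 0.01 cutoff are hypothetical choices for illustration.

```python
from math import exp, lgamma, log


def binom_two_sided_p(alt_reads: int, total_reads: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value for observing alt_reads out of total_reads.

    Sums the probability of every outcome that is no more likely than the
    observed one. Log-space arithmetic keeps it usable at high read depth.
    """
    def pmf(k: int) -> float:
        log_coeff = lgamma(total_reads + 1) - lgamma(k + 1) - lgamma(total_reads - k + 1)
        return exp(log_coeff + k * log(p) + (total_reads - k) * log(1 - p))

    probs = [pmf(k) for k in range(total_reads + 1)]
    observed = probs[alt_reads]
    # Small relative tolerance so ties are not lost to floating-point noise.
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-9)))


def skewed_het_call(alt_reads: int, total_reads: int, alpha: float = 0.01) -> bool:
    """True if a called heterozygote departs significantly from 50:50 support."""
    if total_reads == 0:
        return False  # no reads, nothing to test
    return binom_two_sided_p(alt_reads, total_reads) < alpha


if __name__ == "__main__":
    # Hypothetical counts: 5 of 40 reads support the alternative 'allele'.
    print(skewed_het_call(5, 40))
    print(skewed_het_call(20, 40))
```

A site where most called heterozygotes fail this test across several individuals would be a candidate paralog artifact rather than a real SNP.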

I think that I will be able to identify these cases with some statistic, but I am wondering if there is a good way to modify the corresponding SAM files to remove the mis-mapped reads, then re-genotype. Has anyone else encountered similar issues, and if so how did you deal with it?
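
One crude way to thin out mis-mapped reads in a SAM file before re-genotyping is to drop alignments with low mapping quality or many mismatches. The sketch below is only an assumption about how that might look, not a method from the thread: it relies on the `NM:i:` edit-distance tag that BWA writes, and the thresholds `min_mapq` and `max_edit_distance` are hypothetical. Note that a strict mismatch cutoff will also discard reads carrying several genuine variants, and that dropping one mate of a pair leaves the other mate's flags unadjusted.

```python
def filter_sam(lines, min_mapq=20, max_edit_distance=2):
    """Yield SAM header lines plus alignments that pass simple quality filters.

    lines: iterable of SAM text lines (e.g. an open file handle).
    min_mapq and max_edit_distance are illustrative thresholds, not recommendations.
    """
    for line in lines:
        if line.startswith("@"):
            yield line  # keep header lines unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # skip lines without the 11 mandatory SAM columns
        flag = int(fields[1])
        mapq = int(fields[4])
        if flag & 0x4:
            continue  # read is unmapped
        if mapq < min_mapq:
            continue  # ambiguous placement
        edit_distance = None
        for tag in fields[11:]:
            if tag.startswith("NM:i:"):
                edit_distance = int(tag[5:])
                break
        if edit_distance is not None and edit_distance > max_edit_distance:
            continue  # too divergent from the contig; possibly from a paralog
        yield line


if __name__ == "__main__":
    # Tiny made-up example; real input would be read from a .sam file.
    example = [
        "@SQ\tSN:contig1\tLN:1000\n",
        "r1\t0\tcontig1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\n",
        "r2\t0\tcontig1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:5\n",
    ]
    for kept in filter_sam(example):
        print(kept, end="")
```

The mapping-quality part alone can also be done with `samtools view -q`; the pure-Python version is shown only to make the logic explicit.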

Last edited by shoegame2001; 12-06-2011 at 03:49 PM.
12-07-2011, 05:44 PM   #2
htchu.taiwan
Junior Member
 
Location: Taiwan

Join Date: Dec 2011
Posts: 5

Hi, friend,

You may try my program: EBARDenovo for RNA-Seq.
https://sourceforge.net/projects/ebardenovo


It is a 64-bit Windows command-line program built on .NET.

EBARDenovo can assemble lower-expressed transcripts even when their coverage depth is very low (e.g., 1.5).


Frank H.T. Chu from Taiwan

12-14-2011, 03:24 PM   #3
Nico55
Junior Member
 
Location: Wa.

Join Date: Dec 2011
Posts: 7

I'm in the same boat, my friend. Right now I am using Oases for assembly; after trialing several assembly programs, I found it worked best with my transcriptomes. I then ran SOAPaligner in conjunction with SOAPsnp. This trial is still underway, and I will update you as soon as I compile my results. I would love to hear if you have made any progress using different programs or pipelines.
Thanks.

Last edited by Nico55; 12-14-2011 at 06:38 PM.
03-06-2012, 05:12 AM   #4
rururara
Member
 
Location: montreal

Join Date: Jan 2011
Posts: 31
RNA-seq SNP-calling without a complete reference

Hi all,

I also tried Oases for de novo transcriptome assembly and am quite satisfied with the output.
But now I am wondering how to obtain SNP positions from a de novo assembly.
Can we just rely on the SNP positions reported by variant callers (e.g., samtools, GigaBayes, FreeBayes), or do we need to write an in-house script?

In my case, I'm working with a diploid plant. Some people say that makes it easier, but for me it's still a challenge.

Hope to hear comments from you guys.
Thanks!
05-30-2012, 10:18 AM   #5
edge
Senior Member
 
Location: China

Join Date: Sep 2009
Posts: 199

Hi shoegame2001,

Did you figure out a solution to your problem?
I'm currently facing the same issue.
I have Illumina paired-end RNA-seq reads and a reference transcriptome.
However, I have no idea how to get SNP calls from my data set.
Thanks for any advice.
06-29-2012, 04:39 PM   #6
shoegame2001
Member
 
Location: California

Join Date: Dec 2010
Posts: 21

As far as I can tell, there is no software designed for SNP-calling in RNA-seq data in the absence of a reference genome. Aligning reads back to a de novo assembled transcriptome, then filtering on both the proportion of reads supporting the alternative allele in called heterozygotes and deviation from Hardy-Weinberg equilibrium, gives a more reliable SNP set, but I am afraid some false positives still slip through.
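
The Hardy-Weinberg part of that filter can be sketched as a per-site chi-square test on genotype counts. This is an illustrative sketch under stated assumptions, not the code actually used: the function names are hypothetical, the test assumes a biallelic site and enough individuals for the chi-square approximation to hold, and 3.841 is the standard critical value for 1 degree of freedom at p = 0.05.

```python
CHI2_CRIT_1DF_P05 = 3.841  # chi-square critical value, 1 degree of freedom, p = 0.05


def hwe_chi_square(n_ref_hom: int, n_het: int, n_alt_hom: int) -> float:
    """Chi-square statistic comparing genotype counts with Hardy-Weinberg expectations."""
    n = n_ref_hom + n_het + n_alt_hom
    if n == 0:
        return 0.0
    p = (2 * n_ref_hom + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_ref_hom, n_het, n_alt_hom)
    # Skip classes with zero expectation (monomorphic sites) to avoid dividing by zero.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def fails_hwe(n_ref_hom: int, n_het: int, n_alt_hom: int,
              critical: float = CHI2_CRIT_1DF_P05) -> bool:
    """True if the site deviates from Hardy-Weinberg proportions at the chosen level."""
    return hwe_chi_square(n_ref_hom, n_het, n_alt_hom) > critical


if __name__ == "__main__":
    # Hypothetical site where all 20 individuals were called heterozygous,
    # the pattern expected when two paralogs collapse onto one contig.
    print(fails_hwe(0, 20, 0))
    # Hypothetical site in perfect Hardy-Weinberg proportions.
    print(fails_hwe(25, 50, 25))
```

A heterozygote excess is the direction of deviation that collapsed paralogs would produce, so sites failing the test with more heterozygotes than expected are the ones most worth discarding.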
07-04-2012, 01:55 AM   #7
htchu.taiwan
Junior Member
 
Location: Taiwan

Join Date: Dec 2011
Posts: 5

Hi, friends,

You may try my program: EBARDenovo for RNA-Seq.
EBARDenovo can now output SNP locations in the contigs with the -P parameter.
Please check:
https://sourceforge.net/projects/ebardenovo

It is a 64-bit Windows command-line program built on .NET.
You can run it on a Windows PC with 16 GB of RAM for 30-40 GB of FASTQ RNA-seq data.
In our experiments, EBARDenovo is more accurate than Trinity and Oases.

Hsueh-Ting Chu
