SEQanswers

Old 06-18-2013, 06:00 PM   #1
Dario1984
Senior Member
 
Location: Sydney, Australia

Join Date: Jun 2011
Posts: 166
RNA-seq Mapping to Many Contigs

What advice do researchers who have previously done RNA-seq on a non-model organism have? I have RNA-seq data from sea urchin. The current version of the genome has 174772 contigs. So far I have tried generating a genome index with STAR. It used up all of the RAM, and the author said the mapping performance wasn't good on any genome with more than 50000 contigs. I have also tried de novo assembly with Trinity, and the number of genes and isoforms found was unrealistically large. Does anyone have a success story to share?
Old 06-18-2013, 06:48 PM   #2
peromhc
Senior Member
 
Location: Durham, NH

Join Date: Sep 2009
Posts: 108

Try filtering out contigs that have an FPKM of less than 1, or 0.5. This should get rid of a large number of likely junk contigs. There are tools in Trinity (RSEM or eXpress) to do this.

Also, you could try clustering with cd-hit-est to get rid of redundancy.
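
The FPKM filter suggested above can be sketched in a few lines of Python. This is only an illustration, assuming a tab-separated abundance table whose header has `transcript_id` and `FPKM` columns (the column names RSEM uses in its isoform results, as far as I know; check your own file). The contig names and values are made up.

```python
import csv
import io


def contigs_passing_fpkm(table_text, min_fpkm=1.0):
    """Return the IDs of contigs whose FPKM is at least min_fpkm."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return [row["transcript_id"] for row in reader
            if float(row["FPKM"]) >= min_fpkm]


# Made-up abundance table, for illustration only.
example = (
    "transcript_id\tFPKM\n"
    "contig_a\t12.40\n"
    "contig_b\t0.30\n"
    "contig_c\t1.00\n"
)

print(contigs_passing_fpkm(example))  # ['contig_a', 'contig_c']
```

The list of surviving IDs can then be used to subset the assembly FASTA before any clustering step.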
Old 06-18-2013, 08:23 PM   #3
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235

Dear Dario1984,

You may try the Subread aligner, which can deal with a large number of contigs.

http://subread.sourceforge.net/

Best wishes,

Wei
Old 06-18-2013, 09:00 PM   #4
Dario1984
Senior Member
 
Location: Sydney, Australia

Join Date: Jun 2011
Posts: 166

Thanks for alerting me to the CD-HIT program. I wasn't aware of it. Have you already published a journal article using those two steps?
Old 06-19-2013, 01:09 PM   #5
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161

Quote:
Originally Posted by Dario1984
What advice do researchers who have previously done RNA-seq on a non-model organism have? I have RNA-seq data from sea urchin. The current version of the genome has 174772 contigs. So far I have tried generating a genome index with STAR. It used up all of the RAM, and the author said the mapping performance wasn't good on any genome with more than 50000 contigs. I have also tried de novo assembly with Trinity, and the number of genes and isoforms found was unrealistically large. Does anyone have a success story to share?
To avoid RAM problems with STAR when there is a large number of contigs, try reducing --genomeChrBinNbits (=18 by default) to a smaller number, ~14 or less. The mapping speed will be slow by STAR's standards, but it may still be adequate.
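
As a rough guide to picking the value, the STAR manual (as I recall it) suggests scaling the parameter as min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength))). A minimal sketch of that arithmetic follows. Rounding to the nearest integer is my assumption, chosen because it reproduces the manual's example of 15 for a 3 Gb genome with 100,000 scaffolds. The 900 Mb genome length in the second call is a placeholder, not a figure from this thread.

```python
import math


def suggested_chr_bin_nbits(genome_length, n_references, read_length):
    """Scale --genomeChrBinNbits to the average reference length, capped at 18."""
    bases_per_reference = max(genome_length / n_references, read_length)
    return min(18, round(math.log2(bases_per_reference)))


# The manual's example: 3 Gb genome, 100,000 scaffolds.
print(suggested_chr_bin_nbits(3_000_000_000, 100_000, 100))  # 15

# 174772 contigs with an assumed (placeholder) 900 Mb genome and 50-base reads.
print(suggested_chr_bin_nbits(900_000_000, 174_772, 50))
```

The result would be passed to STAR at the genome generation step, alongside the usual indexing options.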
Old 06-19-2013, 07:33 PM   #6
Kennels
Senior Member
 
Location: Sydney

Join Date: Feb 2011
Posts: 149

Quote:
Originally Posted by Dario1984
Thanks for alerting me to the CD-HIT program. I wasn't aware of it. Have you already published a journal article using those two steps?
This paper should be a good reference:
https://www.biomedcentral.com/1471-2164/13/392
Old 06-20-2013, 10:00 PM   #7
Dario1984
Senior Member
 
Location: Sydney, Australia

Join Date: Jun 2011
Posts: 166

I used Subread on the data. Because the seed has to be matched exactly, it isn't suitable for mapping to a related organism's genome. Only 11% of my reads mapped. I can see it would be great for mapping to a high-quality reference genome, such as the human genome sequence.
Old 06-20-2013, 10:17 PM   #8
Jeremy
Senior Member
 
Location: Pathum Thani, Thailand

Join Date: Nov 2009
Posts: 190

Most of the reads in the Trinity assembly will be background RNA (remember, something like 80% of the genome is transcribed) and assembly junk. As already mentioned, mapping the reads to the Trinity assembly and excluding low-count sequences will remove this junk. I prefer to use raw read counts, so you can easily see what proportion of reads map to the 20-40K Trinity sequences you are left with. I have done something like that: from 370,000 Trinity sequences, 96% of the reads mapped to about 38,000 Trinity sequences, and the rest were discarded.
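
The raw-count filter described above can be sketched as follows. This is an illustration only: the per-sequence read counts would come from whatever mapper and counting tool you use, and the sequence names, counts, and threshold here are invented.

```python
def filter_by_read_count(counts, min_reads):
    """Keep sequences with at least min_reads mapped reads.

    Returns the kept sequences and the fraction of all counted reads
    that map to them.
    """
    kept = {name: n for name, n in counts.items() if n >= min_reads}
    total = sum(counts.values())
    fraction = sum(kept.values()) / total if total else 0.0
    return kept, fraction


# Invented counts: one well-supported sequence and three low-count ones.
counts = {"seq1": 960, "seq2": 25, "seq3": 10, "seq4": 5}
kept, fraction = filter_by_read_count(counts, min_reads=100)
print(sorted(kept), fraction)  # ['seq1'] 0.96
```

Reporting the retained fraction next to the number of kept sequences makes it easy to see how much data a given threshold throws away.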
Old 06-21-2013, 03:32 AM   #9
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235

Quote:
Originally Posted by Dario1984
I used Subread on the data. Because the seed has to be matched exactly, it isn't suitable for mapping to a related organism's genome. Only 11% of my reads mapped. I can see it would be great for mapping to a high-quality reference genome, such as the human genome sequence.
Hi Dario,

Could you please provide a bit more info about your data, such as read length, single-end or paired-end, etc.? There could be many reasons contributing to low mappability. Although Subread does not allow mismatches in the seeds, these seeds are quite short (16bp), so I do not really think this was the reason you got a low mapping percentage when mapping your reads to a related species.

One thing that may be worth trying is to set -m=1 to test how many reads have a 16bp substring perfectly matched with the reference. If you still get a low percentage, this may simply tell you that your reads are very different from the reference.
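
The idea behind that check can be illustrated independently of Subread. The sketch below is not Subread's algorithm; it simply counts the reads that share at least one exact 16-base substring with a reference on either strand. The set-based index is only practical for small test sequences, and the reference and reads are made up.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (upper or lower case)."""
    return seq.translate(str.maketrans("acgtACGT", "tgcaTGCA"))[::-1]


def kmers(seq, k):
    """Return the set of all substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fraction_with_exact_seed(reads, reference, k=16):
    """Fraction of reads sharing an exact k-mer with either reference strand."""
    ref = reference.lower()
    index = kmers(ref, k) | kmers(reverse_complement(ref), k)
    hits = sum(1 for read in reads if kmers(read.lower(), k) & index)
    return hits / len(reads) if reads else 0.0


# Made-up 40-base reference and three 25-base reads.
reference = "gattacagcctagcatcgatgcaagtcctaggattcagca"
reads = [
    reference[5:30],                      # exact forward-strand copy
    reverse_complement(reference[10:35]), # exact reverse-strand copy
    "a" * 25,                             # unrelated read
]
print(fraction_with_exact_seed(reads, reference))
```

A low fraction on real data would point the same way as the -m=1 test: the reads and the reference rarely agree over 16 consecutive bases.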

Best regards,

Wei
Old 06-21-2013, 09:18 AM   #10
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286

What happens if you only take the 50000 biggest contigs from your reference? A lot of times these draft assemblies have many small contigs that aren't going to contain useful information for gene expression analysis anyway. That is, they will mostly not contain coding regions, or if they do, it's only one, maybe two exons, and you can't assign orthology anyway.
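
Selecting the biggest contigs is simple to script. Here is a minimal sketch that parses FASTA text and keeps the n longest records. It reads everything into memory, which is an assumption that may not suit a very large assembly, and the contig names and sequences are invented.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Use the first word of the header as the name.
            name, chunks = (line[1:].split() or [""])[0], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def largest_contigs(records, n):
    """Return the n longest records, longest first."""
    return sorted(records, key=lambda rec: len(rec[1]), reverse=True)[:n]


# Invented three-contig assembly.
fasta = ">c1\nACGT\n>c2 note\nACGTACGT\nAC\n>c3\nA\n"
top = largest_contigs(read_fasta(fasta), 2)
print([name for name, seq in top])  # ['c2', 'c1']
```

For the case in this thread, n would be 50000 and the kept records would be written back out as a reduced reference before indexing.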
Old 06-26-2013, 05:00 PM   #11
Dario1984
Senior Member
 
Location: Sydney, Australia

Join Date: Jun 2011
Posts: 166

I think the related genome is too distant. I took 100 random reads and used BLAST to get an impression of what the mapping would be like. Two representative examples, each for one read of a 50-base read pair, are

Code:
>Scaffold915 
          Length = 323013

 Score = 42.1 bits (21), Expect = 0.006
 Identities = 39/45 (86%)
 Strand = Plus / Plus

                                                           
Query: 6      ttccagacaaaacagacaacaaatcataatcataaatatcatttg 50
              |||| ||||||| ||||||||  || |||| ||||||||||||||
Sbjct: 261960 ttcctgacaaaatagacaacatttcttaattataaatatcatttg 262004
and

Code:
>Scaffold476 
          Length = 632255

 Score = 40.1 bits (20), Expect = 0.025
 Identities = 20/20 (100%)
 Strand = Plus / Minus

                                  
Query: 8      caagaatttttttgatgaaa 27
              ||||||||||||||||||||
Sbjct: 568677 caagaatttttttgatgaaa 568658
I will proceed by implementing the filtering strategies for de novo assembly.
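
For reference, the numbers in the first alignment above can be recomputed from the two aligned strings. The sketch below counts identical positions in the ungapped alignment and also finds the longest run of consecutive matches. It only covers the aligned region (bases 6 to 50 of the read), so it says nothing about the unaligned first five bases.

```python
from itertools import groupby

# Aligned query and subject strings copied from the Scaffold915 hit.
query = "ttccagacaaaacagacaacaaatcataatcataaatatcatttg"
sbjct = "ttcctgacaaaatagacaacatttcttaattataaatatcatttg"


def identity_and_longest_exact_run(a, b):
    """Return (matches, alignment length, longest run of consecutive matches)."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have equal length")
    same = [x == y for x, y in zip(a, b)]
    longest = max(
        (sum(1 for _ in run) for is_match, run in groupby(same) if is_match),
        default=0,
    )
    return sum(same), len(same), longest


matches, length, longest = identity_and_longest_exact_run(query, sbjct)
print(matches, length, longest)  # 39 45 14
```

The 39/45 agrees with the BLAST report, and the longest exact stretch in this aligned region is 14 bases, shorter than the 16bp seed length mentioned earlier in the thread.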