SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
question about mapping reads back to an endosymbiont genome NGS_Newbie RNA Sequencing 0 09-26-2011 06:20 PM
Converting reads mapped from transcriptome back to genome golharam Bioinformatics 12 09-19-2011 02:49 PM
Need Help, gap between two contigs formed by mapping ketan_bnf Bioinformatics 1 06-14-2011 11:36 PM
Reads and contigs (Newbler) FelipeAd Bioinformatics 3 01-12-2011 07:13 AM
BWA and Velvet : Anamolies in contigs and read mapping to reference apratap Bioinformatics 1 01-29-2010 12:44 PM

Reply
 
Thread Tools
Old 02-23-2012, 06:05 AM   #1
cerebralrust
Junior Member
 
Location: Sweden

Join Date: Jan 2012
Posts: 8
Exclamation Too few reads mapping back to contigs

I assembled plant transcriptome 454 data (non normalised) using trinity after the following

1)pre processing (removal of adaptors, vector contamination)
2)removal of rRna sequences
3)removal of chloroplast and mitochondrial genes using bwa

From 3,70,929 reads, i got 21,486 contigs. When i mapped the reads to the contigs using bwa, only 44,678 reads were used in the assembly. What am i doing wrong here? I randomly blasted the contigs to observe that they share over 90% similarity with related legume proteins (although many were hypothetical).
However, only a small percentage of the contigs align to the transcript assemblies of related legumes when mapped using bwa.

The velvet assembly of the same data resulted in 15,323 contigs with lesser n50 value, n90 value, max length etc.
MIRA assembly resulted in more contigs and more reads being used but lesser n50, n90 and avg length of contig.
Why are only 44,678 reads being used? Any advice is greatly appreciated.
cerebralrust is offline   Reply With Quote
Old 02-23-2012, 07:30 AM   #2
seqret
Junior Member
 
Location: Upstate New York

Join Date: Sep 2009
Posts: 1
Default

I assume you used newbler, and that you had about 370 thousand reads? Did you check the 454NewblerMetrics.txt file and/or the 454ReadStatus.txt file to determine how many reads the assembler thought it used? I would guess that the assembly was very fragmented so that many of the reads ended up in contigs that were too small to report. When doing transcriptome assemblies, Newbler has some rules about what gets reported as isotigs, contigs, or not reported at all --- don't remember them all off the top of my head.

Also, you did tell the assembler that this is a cdna assembly project, correct?
seqret is offline   Reply With Quote
Old 02-23-2012, 08:50 AM   #3
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

@seqret ... note his first line. He used Trinity, not newbler. Then he used Velvet and MIRA.

Quote:
Originally Posted by cerebralrust View Post
I assembled plant transcriptome 454 data (non normalised) using trinity

I have been thinking about this problem. Hard to tell without looking at the data. However it is possible that Trinity, Velvet and MIRA are not up to the task. If you are recommending using Newbler then I heartily agree with that idea.
westerman is offline   Reply With Quote
Old 02-23-2012, 01:37 PM   #4
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,177
Default

I'm wondering if the problem is not with the assembly but with the mapping. Is bwa the best tool to use here, or were the options used appropriate? (I'm asking because I'm not that familiar with bwa.) Frankly, if I had a set of contigs (putative transcripts) and wanted to map raw 454 reads back to them just to count I would use blat.
kmcarr is offline   Reply With Quote
Old 02-23-2012, 08:42 PM   #5
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

For 454, I recommend bwasw, bowtie2, smalt or tmap. Blat is a bit slow and does not output SAM.
lh3 is offline   Reply With Quote
Old 02-23-2012, 09:19 PM   #6
Jeremy
Senior Member
 
Location: Pathum Thani, Thailand

Join Date: Nov 2009
Posts: 190
Default

I would recommend Newbler since it has been specifically designed for 454 data.
I am assuming that by mapping the reads back you are trying to get read counts per contig/isotig/isogroup yes?

If you use newbler you can get read counts per contig from the 454ReadStatus.txt file that is produced when you perform a transcriptome assembly. Just do a grep for 'Assembled' and count the number of times each contig appears, if you have different samples in different lanes you can do the appropriate grep to subset them also. This file lists the 3` and 5` match of each read so you effectively count each read twice. I don't think that is a problem since the reads are generally pretty long to begin with. This method means that some contigs may have a zero or low read count, but it does count every read so that should not be a problem after you sum the read counts of contigs to form read counts per isotig.

Alternatively you can grep 'Assembled', and make a subset of the assembled reads and then map them back to your contigs using GSMapper. I recommend only using reads with the assembled status to minimise false mapping. I use mapping for SNP deiscovery also, so I set -ais 1 which means that the mapped read needs to be a very good match.

Last edited by Jeremy; 02-23-2012 at 09:22 PM.
Jeremy is offline   Reply With Quote
Old 02-25-2012, 12:27 AM   #7
cerebralrust
Junior Member
 
Location: Sweden

Join Date: Jan 2012
Posts: 8
Default

Thank you for all your suggestions, members!

@ seqret : As Rick pointed out, i've never used Newbler.

@ Rick : Using Newbler is not an option, i guess, since it is not open source and we got the sequenced data from a collaborator in the US. Perhaps my only option is to standardise mira parameters to improve the assembly?

@kmcarr : I was wondering about the mapping also. I will try mapping with bwasw and bowtie2 on the suggestion of lh3 since i require results in sam format also.

@lh3 : I will try all, compare and pick the best one.

@Jeremy : As i mentioned before, Newbler is not an option since it is not open source and i'm a poor undergraduate student. But i will keep your suggestions in mind for the future.

I suppose i'm left with the option of using mira with various combinations of parameters to get the best assembly.

If it may be of help to anyone, I should not have used Trinity for this data considering :

According to one of key developers of Trinity - Brian J. Haas' option:

"Ultimately, Trinity might not be the best tool for assembling 454 data, since coverage won't be anywhere near what is expected from Illumina in most cases, and Trinity exploits the high coverage data as part of reconstructing transcripts. The current version of Newbler is supposed to work especially well for 454 transcriptome data, so I encourage you to give that a try if you haven't already."
cerebralrust is offline   Reply With Quote
Old 02-25-2012, 06:56 AM   #8
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,177
Default

Quote:
Originally Posted by cerebralrust View Post
@Jeremy : As i mentioned before, Newbler is not an option since it is not open source and i'm a poor undergraduate student. But i will keep your suggestions in mind for the future.
Newbler may be proprietary but proprietary != $. You can obtain Newbler free of charge by completing the software request at this webpage. Note: I'm not sure if there are any restrictions for non-USA distribution.
kmcarr is offline   Reply With Quote
Old 02-26-2012, 10:25 PM   #9
Jeremy
Senior Member
 
Location: Pathum Thani, Thailand

Join Date: Nov 2009
Posts: 190
Default

Once you do get Newbler, you should use the .sff file(s) for assembly and mapping. This file has the quality scores as well as the fasta sequence so it will produce much better results than just a .txt of the sequence.
Jeremy is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 12:20 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO