![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
question about mapping reads back to an endosymbiont genome | NGS_Newbie | RNA Sequencing | 0 | 09-26-2011 07:20 PM |
Converting reads mapped from transcriptome back to genome | golharam | Bioinformatics | 12 | 09-19-2011 03:49 PM |
Need Help, gap between two contigs formed by mapping | ketan_bnf | Bioinformatics | 1 | 06-15-2011 12:36 AM |
Reads and contigs (Newbler) | FelipeAd | Bioinformatics | 3 | 01-12-2011 08:13 AM |
BWA and Velvet : Anamolies in contigs and read mapping to reference | apratap | Bioinformatics | 1 | 01-29-2010 01:44 PM |
![]() |
|
Thread Tools |
![]() |
#1 |
Junior Member
Location: Sweden Join Date: Jan 2012
Posts: 8
|
![]()
I assembled plant transcriptome 454 data (non normalised) using trinity after the following
1)pre processing (removal of adaptors, vector contamination) 2)removal of rRna sequences 3)removal of chloroplast and mitochondrial genes using bwa From 3,70,929 reads, i got 21,486 contigs. When i mapped the reads to the contigs using bwa, only 44,678 reads were used in the assembly. What am i doing wrong here? I randomly blasted the contigs to observe that they share over 90% similarity with related legume proteins (although many were hypothetical). However, only a small percentage of the contigs align to the transcript assemblies of related legumes when mapped using bwa. The velvet assembly of the same data resulted in 15,323 contigs with lesser n50 value, n90 value, max length etc. MIRA assembly resulted in more contigs and more reads being used but lesser n50, n90 and avg length of contig. Why are only 44,678 reads being used? Any advice is greatly appreciated. |
![]() |
![]() |
![]() |
#2 |
Junior Member
Location: Upstate New York Join Date: Sep 2009
Posts: 1
|
![]()
I assume you used newbler, and that you had about 370 thousand reads? Did you check the 454NewblerMetrics.txt file and/or the 454ReadStatus.txt file to determine how many reads the assembler thought it used? I would guess that the assembly was very fragmented so that many of the reads ended up in contigs that were too small to report. When doing transcriptome assemblies, Newbler has some rules about what gets reported as isotigs, contigs, or not reported at all --- don't remember them all off the top of my head.
Also, you did tell the assembler that this is a cdna assembly project, correct? |
![]() |
![]() |
![]() |
#3 | |
Rick Westerman
Location: Purdue University, Indiana, USA Join Date: Jun 2008
Posts: 1,104
|
![]()
@seqret ... note his first line. He used Trinity, not newbler. Then he used Velvet and MIRA.
Quote:
I have been thinking about this problem. Hard to tell without looking at the data. However it is possible that Trinity, Velvet and MIRA are not up to the task. If you are recommending using Newbler then I heartily agree with that idea. |
|
![]() |
![]() |
![]() |
#4 |
Senior Member
Location: USA, Midwest Join Date: May 2008
Posts: 1,178
|
![]()
I'm wondering if the problem is not with the assembly but with the mapping. Is bwa the best tool to use here, or were the options used appropriate? (I'm asking because I'm not that familiar with bwa.) Frankly, if I had a set of contigs (putative transcripts) and wanted to map raw 454 reads back to them just to count I would use blat.
|
![]() |
![]() |
![]() |
#5 |
Senior Member
Location: Boston Join Date: Feb 2008
Posts: 693
|
![]()
For 454, I recommend bwasw, bowtie2, smalt or tmap. Blat is a bit slow and does not output SAM.
|
![]() |
![]() |
![]() |
#6 |
Senior Member
Location: Pathum Thani, Thailand Join Date: Nov 2009
Posts: 190
|
![]()
I would recommend Newbler since it has been specifically designed for 454 data.
I am assuming that by mapping the reads back you are trying to get read counts per contig/isotig/isogroup yes? If you use newbler you can get read counts per contig from the 454ReadStatus.txt file that is produced when you perform a transcriptome assembly. Just do a grep for 'Assembled' and count the number of times each contig appears, if you have different samples in different lanes you can do the appropriate grep to subset them also. This file lists the 3` and 5` match of each read so you effectively count each read twice. I don't think that is a problem since the reads are generally pretty long to begin with. This method means that some contigs may have a zero or low read count, but it does count every read so that should not be a problem after you sum the read counts of contigs to form read counts per isotig. Alternatively you can grep 'Assembled', and make a subset of the assembled reads and then map them back to your contigs using GSMapper. I recommend only using reads with the assembled status to minimise false mapping. I use mapping for SNP deiscovery also, so I set -ais 1 which means that the mapped read needs to be a very good match. Last edited by Jeremy; 02-23-2012 at 10:22 PM. |
![]() |
![]() |
![]() |
#7 |
Junior Member
Location: Sweden Join Date: Jan 2012
Posts: 8
|
![]()
Thank you for all your suggestions, members!
@ seqret : As Rick pointed out, i've never used Newbler. @ Rick : Using Newbler is not an option, i guess, since it is not open source and we got the sequenced data from a collaborator in the US. Perhaps my only option is to standardise mira parameters to improve the assembly? @kmcarr : I was wondering about the mapping also. I will try mapping with bwasw and bowtie2 on the suggestion of lh3 since i require results in sam format also. @lh3 : I will try all, compare and pick the best one. @Jeremy : As i mentioned before, Newbler is not an option since it is not open source and i'm a poor undergraduate student. But i will keep your suggestions in mind for the future. I suppose i'm left with the option of using mira with various combinations of parameters to get the best assembly. If it may be of help to anyone, I should not have used Trinity for this data considering : According to one of key developers of Trinity - Brian J. Haas' option: "Ultimately, Trinity might not be the best tool for assembling 454 data, since coverage won't be anywhere near what is expected from Illumina in most cases, and Trinity exploits the high coverage data as part of reconstructing transcripts. The current version of Newbler is supposed to work especially well for 454 transcriptome data, so I encourage you to give that a try if you haven't already." |
![]() |
![]() |
![]() |
#8 | |
Senior Member
Location: USA, Midwest Join Date: May 2008
Posts: 1,178
|
![]() Quote:
|
|
![]() |
![]() |
![]() |
#9 |
Senior Member
Location: Pathum Thani, Thailand Join Date: Nov 2009
Posts: 190
|
![]()
Once you do get Newbler, you should use the .sff file(s) for assembly and mapping. This file has the quality scores as well as the fasta sequence so it will produce much better results than just a .txt of the sequence.
|
![]() |
![]() |
![]() |
Thread Tools | |
|
|