#1
Member
Location: Vienna Join Date: Jan 2016
Posts: 10
Dear colleagues,
I recently asked a laboratory how they extended the annotations on a transcriptome for a non-model organism, and they pointed me to their GitHub. The assembly/annotation pipeline in that repository is the one I originally used to assemble and annotate the transcriptome of a closely related species. In the past this pipeline used blastx to query the UniRef50 database; the laboratory now queries UniProtKB instead.

I just defended my thesis, and one of the criticisms from my committee was the poor annotation of my assembly. So of course I wanted to try blasting against this other database (in case that was what improved this lab's assembly). Imagine my surprise when UniProtKB gave worse annotation than UniRef50!

Not knowing much about the criteria for selecting a database to BLAST against, I did a bit of reading. According to Suzek et al., 2007, UniRef is a set of clustered sequences from UniProt that hides redundant sequences; this reduces the size of the database you're blasting against, which speeds up similarity searches. From what I understand it also "improves detection of distant relationships". So my understanding is that I am getting better results from UniRef50 because sequences need at the very least 50% sequence identity. Can anyone correct me if I'm wrong?

THE SECOND QUESTION

What would you suggest to improve functional annotation? Obviously increasing the sequencing depth would be one suggestion, but in my case that is no longer possible. Given what I currently have, what can be done? Is there another database you would suggest blasting against?
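One way to put numbers on the UniRef50 versus UniProtKB comparison is to count how many contigs get at least one hit under an e-value cutoff in each search. Below is a minimal sketch, assuming both searches were saved in BLAST tabular format (`-outfmt 6` with the default 12 columns). The function names, the `1e-5` cutoff, and the sample rows are illustrative assumptions, not part of the pipeline discussed in this thread.

```python
def annotated_queries(tabular_lines, max_evalue=1e-5):
    """Return the set of query IDs with at least one hit at or below max_evalue.

    Assumes the default BLAST tabular columns (-outfmt 6):
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    hits = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            hits.add(query_id)
    return hits


def annotation_rate(tabular_lines, total_contigs, max_evalue=1e-5):
    """Fraction of all contigs that have at least one accepted hit."""
    return len(annotated_queries(tabular_lines, max_evalue)) / total_contigs


# Made-up rows for illustration only (not real results).
uniref50_rows = [
    "contig1\tUniRef50_A\t62.0\t200\t76\t0\t1\t600\t1\t200\t1e-40\t150",
    "contig2\tUniRef50_B\t35.0\t120\t78\t0\t1\t360\t1\t120\t2e-08\t55",
    "contig3\tUniRef50_C\t30.0\t50\t35\t0\t1\t150\t1\t50\t0.5\t28",
]
print(annotation_rate(uniref50_rows, total_contigs=4))  # 2 of 4 contigs pass the cutoff
```

Running the same function on the output from each database, with the same cutoff and the same contig count, would give two directly comparable rates. It says nothing about whether the hits are informative, only whether they exist.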

#2
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,082
As you have discovered first hand, annotation is hard, no matter what tool you use. Ultimately, annotation requires careful inspection of the results and weighing of the evidence before making a final judgement.
Can you tell us what kind of genome you are working with (haploid, diploid, etc.; number of chromosomes; percentage of repeat sequence)? How does your assembly compare to that of the close relative you refer to (in terms of number of contigs, N50, etc.)? If a closely related species has already been sequenced and annotated, then one reason your annotation looks poor could be that your assembly is not very good (unless the closely related genome has theirs wrong). In that case you may want to take a fresh look at redoing the assembly.

#3
Member
Location: Vienna Join Date: Jan 2016
Posts: 10
Hi Genomax,
I am working with a diploid eukaryotic transcriptome (not a genome) from a coral species in the Acropora spp. complex. A genome is available for another Acropora species [avg. sequence length ~1700 bp; N50 = ~2200 bp]. However, N50 is often misleading, as it measures the contiguity of contigs and not their accuracy; in transcriptome assembly the optimal contig length is not known a priori, so N50 carries little information. Similarly, for transcriptome assembly, other reference-free measures (e.g. median contig length and number of contigs) can be misleading, or even meaningless, and should be avoided.

Therefore, I assessed the quality of my transcriptome assembly using Transrate; Transrate uses a reference genome/transcriptome to compare the quality of assembly. Because the A. digitifera genome is not annotated, I used the annotated transcriptome of A. millepora. For my assembly in Trinity, Transrate showed an initial score of 0.1316 and an optimized score of 0.2336. For comparison, approximately 50% of the de novo assemblies from the NCBI Transcriptome Shotgun Assembly database produce an overall score of 0.22 and an optimized score of 0.35. So my assembly is somewhat sub-optimal.
Last edited by moldach; 02-01-2016 at 12:24 PM. Reason: punctuation
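To make the point about N50 concrete, here is a minimal sketch of how it is usually computed: sort the contig lengths from longest to shortest and report the length at which the running total first reaches half of all assembled bases. The function name and the toy contig lengths are assumptions for illustration only.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of the assembled bases."""
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Two toy assemblies with the same total size (4000 bp), made up for illustration.
even_assembly = [1000, 1000, 1000, 1000]
merged_assembly = [3000, 500, 500]  # e.g. three contigs joined into one, correctly or not

print(n50(even_assembly), n50(merged_assembly))
```

The second toy assembly has the higher N50 simply because one contig is longer. Nothing in the calculation checks whether that long contig is a real transcript or a chimera, which is why the number measures contiguity and not accuracy.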

#4
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,082
Have you done searches against the annotated transcriptome from the Matz lab (blastn searches in addition to tblastx perhaps)? That would be your best bet to find quick homologies. You may have already done that, though, to get to the point where you are.
Depending on how much time you want to spend on this, you could try extending the searches to refseq_genomic (and other databases), but it would be a lot of work to pore over the results and make informed decisions. You will only get so far with just searches.

#5
Member
Location: Vienna Join Date: Jan 2016
Posts: 10
Quote:
Quote:
I've been talking with a lab about potentially providing support for assembling and annotating a number of transcriptomes, and one of their concerns was the poor annotation results. So really, for the sake of making myself more employable, it would be very helpful if you could elaborate a bit on extending searches to refseq_genomic. Do you know how common this is in non-model organism assembly? What are we talking about in terms of time spent vs. rewards? Obviously an assembly is never 100% complete, but there comes a point at which the returns no longer justify the time and cost. What other databases besides refseq_genomic could be used?
You mentioned poring over the results and making informed decisions. This I don't quite understand. Do you mean that some annotations will be erroneous? Maybe a hypothetical example would help.
Thank you very much
Last edited by moldach; 02-03-2016 at 11:50 AM. Reason: clarity, brevity, punctuation

#6
Senior Member
Location: Cambridge Join Date: Sep 2010
Posts: 116
Dear Moldach,
It looks like the de novo assembly needs to be done properly first. Assuming you were using Illumina: for that you really need to start from a cDNA library with a 350-600 bp fragment size, then sequence it on the MiSeq or HiSeq in 2x250 or 2x300 bp run mode (read the Illumina cDNA library prep protocol, fragmentation section). Or do PacBio's isoseq... (If you did Illumina 1x75 bp or 1x100 bp, it would not cut it very well...)
Then process your data through flash or panda (preassembly), and then do an incremental pure de novo assembly, starting from 10k reads and going up. Check the most abundant transcripts for completion, and add them to the "vector.seq" database so they won't interfere with the next round of the assembly for the less abundant things. You can use MIRA or any other assembler in EST mode (you can also try CLC or DNASTAR's ngen).
Then combine the final edition of the vector.seq database with your final contigs and:
1. use it as a reference for mapping reads to it (to get the relative abundance)
2. annotate your reference by blastx
I wouldn't rely on any reference-based methods if the similarity between the beasts is less than 95% on the DNA level.
Markiyan.
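The "start from 10k reads and go up" idea can be sketched as a schedule of subsample sizes, one per assembly round. This is only an illustration: the function name and the tenfold growth factor are assumptions, and the 31 million total is used purely as an example value.

```python
def subsample_schedule(total_reads, start=10_000, factor=10):
    """Read counts for successive assembly rounds, ending with the full dataset.

    start and factor are hypothetical defaults; pick values that suit your data.
    """
    if start <= 0 or factor <= 1:
        raise ValueError("start must be positive and factor must be greater than 1")
    sizes = []
    size = start
    while size < total_reads:
        sizes.append(size)
        size *= factor
    sizes.append(total_reads)  # the last round uses every read
    return sizes


print(subsample_schedule(31_000_000))
# [10000, 100000, 1000000, 10000000, 31000000]
```

Each round would assemble a random subsample of that size, after the abundant transcripts found in earlier rounds have been added to the screening database.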

#7
Member
Location: Vienna Join Date: Jan 2016
Posts: 10
Quote:
I assembled using two libraries to capture time-specific isoforms. Each library had roughly 15 million reads, so a total of 31 million reads were used for transcriptome assembly. I know that good annotation starts with a good assembly (**** in = **** out), I get it. So obviously suggesting > 100 million reads for a de novo assembly is good advice for future experimental design; however, our lab only had that much money, so it is what it is. I'm really looking for ways to improve this assembly, but thanks for your kind suggestions.
OK, so only one published genome exists for this genus, so how would I know how similar the species would be on a DNA level?

#8
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,082
Quote:
It is quite possible that you have reached (or are close to) the point where the return on time investment is not going to be worth it with the assemblies you have. Since you are not going to generate additional data, what you have is what you have.
Quote:
Quote:

#9
Member
Location: Vienna Join Date: Jan 2016
Posts: 10
Thank you very much Genomax and Markiyan for your time and valuable feedback.

#10
Senior Member
Location: Cambridge Join Date: Sep 2010
Posts: 116
Quote:
I would still try the iterative cDNA assembly approach, because it helps greatly with removing all those spurious links that chimeric reads create between highly expressed and low-expressed transcripts. Even if only 2-5% of your reads are chimeric, they can still cause a lot of trouble: one would expect at least 3-4 orders of magnitude of dynamic range, so a template expressed 10^4 times more would have a lot of chimeric links to the low-expressed ones. If you assemble only 10K or so reads at first, you get the most highly expressed transcripts; then you can remove them from the next iterations, so the highly expressed chimeric part is simply masked off instead of confusing the assembler. Increase your dataset by 5-50X at a time (avoid getting contigs with more than 500X coverage).
One can use nearly any DNA assembler for this. I did exactly this with a snail transcriptome in 2009 (sequenced on 454 FLX), using sff2phd & PHRAP over 3 iterations, and got way better results than from newbler v2.0 in cDNA mode over a single pass. PS: results were evaluated in consed.
Quote:
1. formatdb it into a BLAST database and blastn / tblastx some de novo transcriptome contigs against it
2. simply try mapping your reads against it using bwa or similar. PS: pay attention to the FASTA IDs, not all mappers like the default NCBI format!
Also look at the % of mapped reads and the "SNP" density to get a rough idea of similarity.
Markiyan.
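For option 1, a rough DNA-level similarity figure can be pulled out of the blastn results and compared with the 95% threshold mentioned earlier in the thread. Below is a minimal sketch, assuming the search was saved in BLAST tabular format (`-outfmt 6`, default 12 columns). Taking the best hit per contig by bitscore and weighting by alignment length are assumptions of this sketch, not an established protocol, and the sample rows are made up.

```python
def mean_identity_of_best_hits(tabular_lines):
    """Length-weighted mean percent identity over the best hit of each query.

    Assumes the default BLAST tabular columns (-outfmt 6):
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    best = {}  # query ID -> (bitscore, percent identity, alignment length)
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query = fields[0]
        pident, length, bitscore = float(fields[2]), int(fields[3]), float(fields[11])
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, pident, length)
    total_length = sum(length for _, _, length in best.values())
    if total_length == 0:
        raise ValueError("no hits found")
    weighted = sum(pident * length for _, pident, length in best.values())
    return weighted / total_length


# Made-up rows for illustration only (not real results).
blastn_rows = [
    "contig1\tscaffold7\t98.0\t500\t10\t0\t1\t500\t1001\t1500\t0.0\t900",
    "contig1\tscaffold9\t90.0\t200\t20\t0\t1\t200\t301\t500\t1e-60\t300",
    "contig2\tscaffold2\t92.0\t500\t40\t0\t1\t500\t2001\t2500\t0.0\t700",
]
print(mean_identity_of_best_hits(blastn_rows))
```

This only reflects the regions that aligned at all, so it will tend to overestimate similarity; the fraction of contigs with no hit, and the mapped-read percentage from option 2, are worth looking at alongside it.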

Tags: annotation, blast, database, uniprot, uniref