SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
automated remote blasting issue queueing Bioinformatics 1 05-04-2015 02:45 AM
BLAST+ and blasting against the NCBI database kevluv93 Bioinformatics 4 04-25-2015 01:25 AM
Blasting contigs against reference database cyanoevo Bioinformatics 4 01-27-2015 04:54 AM
Blasting your blastx results against your own database? noobie Bioinformatics 1 06-30-2012 02:55 AM
using blast+ for remote blasting rangel Bioinformatics 2 03-29-2012 02:30 PM

Reply
 
Thread Tools
Old 02-01-2016, 09:03 AM   #1
moldach
Member
 
Location: Vienna

Join Date: Jan 2016
Posts: 10
Default Selecting the best database for blasting

Dear colleagues,

I recently asked a laboratory how they extended annotations on a transcriptome for a non-model organism; they pointed me towards their github.

Now this assembly/annotation pipeline in github was what i used to originally assemble and annotate a closely related species transcriptome. In the past this pipeline used blastx to query the uniref50 database. Now this laboratory is querying UniprotKB.

I just defended my thesis and one of the criticisms from my committee was the poor annotation of my assembly. So of course I wanted to try blasting against this other database (if that was what improved this labs assembly).

Imagine my surprise when the UniprotKB resulted in worse annotation than uniref50!

Not knowing much of anything about criteria when selecting a database to BLAST I did a bit of reading. According to Suzek et al., 2007 uniref is a clustered sequences from Uniprot that hides redundant sequences; this results in a size reduction of database your blasting against which increases the speed of similarity search. From what I understand it also "improves detection of distant relationships".

So my understanding is that I am getting better results from Uniref50 because sequences need at the very least 50% sequence identity. Can anyone correct me if I'm wrong.


THE SECOND QUESTION

What would you suggest to improve functional annotation? Obviously increasing the sequencing depth of coverage would be one suggestion but in my case is no longer possible. Given what I have currently what can be done? Is there another database you would suggest blasting against?
moldach is offline   Reply With Quote
Old 02-01-2016, 09:41 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,494
Default

As you have discovered first hand, doing annotation is hard, no matter what tool you use. Ultimately annotation requires careful inspection of results, weighing of evidence before making a final judgement.

Can you tell us what kind of genome you are working with (haploid, diploid etc, # of chromosomes, percentage of repeat sequence). How does your assembly compare to the close relative (in terms of # of contigs, N50 etc) that you refer to?

If there is a closely related species the has been available/annotated then one of the reasons your annotation looks poor could be that your assembly is not very good (unless the closely related genome has theirs wrong). You may want to take a fresh look at redoing the assembly in that case.
GenoMax is offline   Reply With Quote
Old 02-01-2016, 11:16 AM   #3
moldach
Member
 
Location: Vienna

Join Date: Jan 2016
Posts: 10
Default More information

Hi Genomax,

I am working with a diploid eukaryotic transcriptome (not genome) from a coral species in the Acropora spp. complex.

There is another Acropora species which there is an available genome for [avg. sequence length ~1700bp; N50=~2200bp].

However, N50 is often misleading as it measures the continuity of contigs and not their accuracy; in transcriptome assembly the optimal contig is not known a priori and therefore carries little information . Similarly, for transcriptome assembly, these reference-free measures, as well as others (e.g. median contig length and number of contigs) can be misleading, or even meaningless, and should be avoided .

Therefore, I assessed the quality of my transcriptome assembly using Transrate; Transrate uses a reference genome/transcriptome to compare the quality of assembly. Because the A. digitifera genome is not annotated I used the annotated transcriptome of A. millepora. For my assembly, Transrate showed an initial score of 0.1316, and an optimized score of 0.2336 in Trinity. For comparison, approximately 50% of the de novo assemblies from the NCBI Transcriptome Shotgun Assembly database produce an overall score of 0.22 and optimized score of 0.35.

So my assembly is somewhat sub-optimal

Last edited by moldach; 02-01-2016 at 11:24 AM. Reason: punctuation
moldach is offline   Reply With Quote
Old 02-01-2016, 02:20 PM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,494
Default

Have you done searches against the annotated transcriptome from the Matz lab (blastn searches in addition to tblastx perhaps)? That would be your best bet to find quick homologies. You may have already done that though to get to the point where you are at.

Depending on how much time you want to spend on this you could try extending the searches to refseq_genomic (and other databases) but it would be a lot of work to pore through the results and make informed decisions. You will only get so far with just searches.
GenoMax is offline   Reply With Quote
Old 02-03-2016, 09:16 AM   #5
moldach
Member
 
Location: Vienna

Join Date: Jan 2016
Posts: 10
Default

Quote:
Originally Posted by GenoMax View Post
Have you done searches against the annotated transcriptome from the Matz lab (blastn searches in addition to tblastx perhaps)? That would be your best bet to find quick homologies. You may have already done that though to get to the point where you are at.
I have not done searches against the annotated transcriptome with blastn. What exactly needs to be done when doing a blastn in addition to a blastx search? Would you just concatenate the resulting output file of both blastx and blastn?

Quote:
Originally Posted by GenoMax View Post
Depending on how much time you want to spend on this you could try extending the searches to refseq_genomic (and other databases) but it would be a lot of work to pore through the results and make informed decisions. You will only get so far with just searches.
I'm thinking time-wise that for this project I only want to spend enough time extending annotations via a complementary blastn search. However, this is a graduate project that really should have been wrapped up by now.

I've been talking with a lab about potentially providing support for assembling/annotating a number of transcriptomes and one of the concerns was the poor annotation results. So really, in the sake of making myself more employable, it would be very helpful if you could elaborate a bit on doing extended searches to refseq_genomic.

Do you know how common this is with non-model organism assembly?

What are we talking about in terms of time spent vs rewards? - obviously an assembly is never 100% complete, but there comes a point at which the returns will not be sufficient to justify the time/cost.

What other databases besides refseq_genomic could be used?

You mentioned poring through the results and make informed decisions. This is don't quite understand. Do you mean that some annotations will be erroneous? Maybe a hypothetical example would help

Thank you very much

Last edited by moldach; 02-03-2016 at 10:50 AM. Reason: clarity, brevity, punctuation
moldach is offline   Reply With Quote
Old 02-08-2016, 02:31 AM   #6
Markiyan
Member
 
Location: Cambridge

Join Date: Sep 2010
Posts: 93
Exclamation Looks like the de novo transcriptome assembly needs to be properly done first...

Dear Moldach,

It looks like the denovo assembly needs to be done properly first.
Assumming you were using illumina:
For that you really need to start from cDNA library with 350-600 bp fragment size, than sequence it on the miseq or hiseq in 2x250 or 2x300 bp run mode (read the illumina cDNA library prep protocol, fragmentation section).
Or do PacBio's isoseq...
(If you did Illumina 1x75 bp or 1x100bp - it would not cut it very well...)
Than process you data through the flash or panda (preassembly), and than do an incremental pure de novo assembly starting from 10k read and going up.

Check the most abundant transcripts for completeion, and add them to the "vector.seq" database, so they wouldn't interfere with the next round of the assembly for the less abundant things.

You can use MIRA or any other assembler in the est mode (can also try with CLC or DNASTAR's ngen).

Than combine the final edition of the vector.seq database with your final contigs and:
1. use it as reference for mapping reads to it (to get the relative abundance)
2. annotate your reference by blastx

I wouln't rely on any reference based methods if the similarity between the beasts is less than 95% on the DNA level.

Markiyan.
Markiyan is offline   Reply With Quote
Old 02-08-2016, 02:43 PM   #7
moldach
Member
 
Location: Vienna

Join Date: Jan 2016
Posts: 10
Default

Quote:
Originally Posted by Markiyan View Post
It looks like the denovo assembly needs to be done properly first.
Assumming you were using illumina:
For that you really need to start from cDNA library with 350-600 bp fragment size, than sequence it on the miseq or hiseq in 2x250 or 2x300 bp run mode
I used HiSeq250 Paired-end reads, this was contracted out by BGI Hong Kong (unfortunately the details of their library prep aren't available so I'm not sure what fragment size of cDNA library they used).

I assembled using two libraries to capture time-specific isoforms. Each library had roughly 15 million reads, so a total of 31 million reads were used for transcriptome assembly. I know that good annotation starts with a good assembly (**** in=**** out) - i get it. So obviously suggesting > 100 million reads for a de novo assembly is good advice for future experimental design, however, our lab only had that much money so it is what it is.

I'm really looking for ways to improve this assembly, but thanks for you kind suggestions.

Quote:
Originally Posted by Markiyan View Post
I wouln't rely on any reference based methods if the similarity between the beasts is less than 95% on the DNA level.
OK so only one published genome exists for this genus so how would I know how similar species would be on a DNA level?
moldach is offline   Reply With Quote
Old 02-08-2016, 03:38 PM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,494
Default

Quote:
Originally Posted by moldach View Post

What are we talking about in terms of time spent vs rewards? - obviously an assembly is never 100% complete, but there comes a point at which the returns will not be sufficient to justify the time/cost.
I don't have an informed answer since that would depend on the data. Generally for an annotation project it would be ideal to have some genomic DNA data to give at least low pass coverage. Having genomic DNA would allow you to assemble that data/build gene models and to see how much of the potentially expressible component you are recovering in your transcriptome data.

It is quite possible that you have reached (or are close to) that point where the return on time investment is not going to be worth it, with the assemblies you have. Since you are not going to generate additional data what you have is what you have.

Quote:
What other databases besides refseq_genomic could be used?
You could use genpept, trembl. Also other searches such as psi-blast/delta-blast.

Quote:
You mentioned poring through the results and make informed decisions. This is don't quite understand. Do you mean that some annotations will be erroneous? Maybe a hypothetical example would help
You probably realize that automated blast searches (or any similarity searches) are only going to take you a part of the way. At some point you may need to do more thorough searches (with same query) with different parameters (e.g. similarity matrices). Once you have potential hits, you would need to manually look at regions of homology (and they may not be extensive, think a conserved active site), construct/edit sequence alignments, to see if you can extend the annotation from a well known protein from a distant species to yours. This type of work is pretty tedious/time consuming and not something you want to do as a side project. Hope that helps.
GenoMax is offline   Reply With Quote
Old 02-08-2016, 05:10 PM   #9
moldach
Member
 
Location: Vienna

Join Date: Jan 2016
Posts: 10
Default

Thank you very much Genomax and Markiyan for your time and valuable feedback.
moldach is offline   Reply With Quote
Old 02-09-2016, 03:37 AM   #10
Markiyan
Member
 
Location: Cambridge

Join Date: Sep 2010
Posts: 93
Lightbulb

Quote:
Originally Posted by moldach View Post
I used HiSeq250 Paired-end reads, this was contracted out by BGI Hong Kong (unfortunately the details of their library prep aren't available so I'm not sure what fragment size of cDNA library they used).

I assembled using two libraries to capture time-specific isoforms. Each library had roughly 15 million reads, so a total of 31 million reads were used for transcriptome assembly.
I'm really looking for ways to improve this assembly, but thanks for you kind suggestions.
So the input dataset theoretically looks quite good. Assuming the data is clean of the adapters (fastqc it first!).

I would still try doing iterative cDNA assembly approach, because it helps grealty with removal of all those spurious links to highly expressed transcripts from the low expressed transcripts by the chimeric reads. Even if you have only 2-5% of them, they still can cause a lot of trouble, because one would expect at least 3-4 orders of magnitude dynamic range, so 10^4 more expressed template would have a lot of chimera links to low expressed ones.

If you assemble only 10K or so reads at first, you would get the most expressed ones, than you can remove them from the next iterations, so highly expressed chimeric part would be simply masked off instead of confusing the assembler.
Increase your dataset by 5-50X at a time (avoid getting contigs with more than 500X coverage).

One can use nearly any DNA assembler for this (I've done exactly this with snail transcriptome in 2009 (done with 454 flx) using sff2phd & PHRAP over 3 iterations) and got way better results than from the newbler v2.0 in the cDNA mode over a single pass.
PS: results were evaluated in the consed.

Quote:
Originally Posted by moldach View Post
OK so only one published genome exists for this genus so how would I know how similar species would be on a DNA level?
Get the published genome fasta file and try 2 things:
1. formatdb into a blast database and blastn / tblastx some denovo transcriptome contigs against it
2. simply try mapping your reads against it using bwa or similar.
PS: pay attention to fasta_ID's, not all mappers like default NCBI format!
Also see what the % of mapped reads and "SNP" density to give you some roughf idea of similarity.

Markiyan.
Markiyan is offline   Reply With Quote
Reply

Tags
annotation, blast, database, uniprot, uniref

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:45 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO