Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Selecting the best database for blasting

    Dear colleagues,

    I recently asked a laboratory how they extended annotations on a transcriptome for a non-model organism; they pointed me towards their github.

    Now this assembly/annotation pipeline in github was what i used to originally assemble and annotate a closely related species transcriptome. In the past this pipeline used blastx to query the uniref50 database. Now this laboratory is querying UniprotKB.

    I just defended my thesis and one of the criticisms from my committee was the poor annotation of my assembly. So of course I wanted to try blasting against this other database (if that was what improved this labs assembly).

    Imagine my surprise when the UniprotKB resulted in worse annotation than uniref50!

    Not knowing much of anything about criteria when selecting a database to BLAST I did a bit of reading. According to Suzek et al., 2007 uniref is a clustered sequences from Uniprot that hides redundant sequences; this results in a size reduction of database your blasting against which increases the speed of similarity search. From what I understand it also "improves detection of distant relationships".

    So my understanding is that I am getting better results from Uniref50 because sequences need at the very least 50% sequence identity. Can anyone correct me if I'm wrong.


    THE SECOND QUESTION

    What would you suggest to improve functional annotation? Obviously increasing the sequencing depth of coverage would be one suggestion but in my case is no longer possible. Given what I have currently what can be done? Is there another database you would suggest blasting against?

  • #2
    As you have discovered first hand, doing annotation is hard, no matter what tool you use. Ultimately annotation requires careful inspection of results, weighing of evidence before making a final judgement.

    Can you tell us what kind of genome you are working with (haploid, diploid etc, # of chromosomes, percentage of repeat sequence). How does your assembly compare to the close relative (in terms of # of contigs, N50 etc) that you refer to?

    If there is a closely related species the has been available/annotated then one of the reasons your annotation looks poor could be that your assembly is not very good (unless the closely related genome has theirs wrong). You may want to take a fresh look at redoing the assembly in that case.

    Comment


    • #3
      More information

      Hi Genomax,

      I am working with a diploid eukaryotic transcriptome (not genome) from a coral species in the Acropora spp. complex.

      There is another Acropora species which there is an available genome for [avg. sequence length ~1700bp; N50=~2200bp].

      However, N50 is often misleading as it measures the continuity of contigs and not their accuracy; in transcriptome assembly the optimal contig is not known a priori and therefore carries little information . Similarly, for transcriptome assembly, these reference-free measures, as well as others (e.g. median contig length and number of contigs) can be misleading, or even meaningless, and should be avoided .

      Therefore, I assessed the quality of my transcriptome assembly using Transrate; Transrate uses a reference genome/transcriptome to compare the quality of assembly. Because the A. digitifera genome is not annotated I used the annotated transcriptome of A. millepora. For my assembly, Transrate showed an initial score of 0.1316, and an optimized score of 0.2336 in Trinity. For comparison, approximately 50% of the de novo assemblies from the NCBI Transcriptome Shotgun Assembly database produce an overall score of 0.22 and optimized score of 0.35.

      So my assembly is somewhat sub-optimal
      Last edited by moldach; 02-01-2016, 12:24 PM. Reason: punctuation

      Comment


      • #4
        Have you done searches against the annotated transcriptome from the Matz lab (blastn searches in addition to tblastx perhaps)? That would be your best bet to find quick homologies. You may have already done that though to get to the point where you are at.

        Depending on how much time you want to spend on this you could try extending the searches to refseq_genomic (and other databases) but it would be a lot of work to pore through the results and make informed decisions. You will only get so far with just searches.

        Comment


        • #5
          Originally posted by GenoMax View Post
          Have you done searches against the annotated transcriptome from the Matz lab (blastn searches in addition to tblastx perhaps)? That would be your best bet to find quick homologies. You may have already done that though to get to the point where you are at.
          I have not done searches against the annotated transcriptome with blastn. What exactly needs to be done when doing a blastn in addition to a blastx search? Would you just concatenate the resulting output file of both blastx and blastn?

          Originally posted by GenoMax View Post
          Depending on how much time you want to spend on this you could try extending the searches to refseq_genomic (and other databases) but it would be a lot of work to pore through the results and make informed decisions. You will only get so far with just searches.
          I'm thinking time-wise that for this project I only want to spend enough time extending annotations via a complementary blastn search. However, this is a graduate project that really should have been wrapped up by now.

          I've been talking with a lab about potentially providing support for assembling/annotating a number of transcriptomes and one of the concerns was the poor annotation results. So really, in the sake of making myself more employable, it would be very helpful if you could elaborate a bit on doing extended searches to refseq_genomic.

          Do you know how common this is with non-model organism assembly?

          What are we talking about in terms of time spent vs rewards? - obviously an assembly is never 100% complete, but there comes a point at which the returns will not be sufficient to justify the time/cost.

          What other databases besides refseq_genomic could be used?

          You mentioned poring through the results and make informed decisions. This is don't quite understand. Do you mean that some annotations will be erroneous? Maybe a hypothetical example would help

          Thank you very much
          Last edited by moldach; 02-03-2016, 11:50 AM. Reason: clarity, brevity, punctuation

          Comment


          • #6
            Looks like the de novo transcriptome assembly needs to be properly done first...

            Dear Moldach,

            It looks like the denovo assembly needs to be done properly first.
            Assumming you were using illumina:
            For that you really need to start from cDNA library with 350-600 bp fragment size, than sequence it on the miseq or hiseq in 2x250 or 2x300 bp run mode (read the illumina cDNA library prep protocol, fragmentation section).
            Or do PacBio's isoseq...
            (If you did Illumina 1x75 bp or 1x100bp - it would not cut it very well...)
            Than process you data through the flash or panda (preassembly), and than do an incremental pure de novo assembly starting from 10k read and going up.

            Check the most abundant transcripts for completeion, and add them to the "vector.seq" database, so they wouldn't interfere with the next round of the assembly for the less abundant things.

            You can use MIRA or any other assembler in the est mode (can also try with CLC or DNASTAR's ngen).

            Than combine the final edition of the vector.seq database with your final contigs and:
            1. use it as reference for mapping reads to it (to get the relative abundance)
            2. annotate your reference by blastx

            I wouln't rely on any reference based methods if the similarity between the beasts is less than 95% on the DNA level.

            Markiyan.

            Comment


            • #7
              Originally posted by Markiyan View Post
              It looks like the denovo assembly needs to be done properly first.
              Assumming you were using illumina:
              For that you really need to start from cDNA library with 350-600 bp fragment size, than sequence it on the miseq or hiseq in 2x250 or 2x300 bp run mode
              I used HiSeq250 Paired-end reads, this was contracted out by BGI Hong Kong (unfortunately the details of their library prep aren't available so I'm not sure what fragment size of cDNA library they used).

              I assembled using two libraries to capture time-specific isoforms. Each library had roughly 15 million reads, so a total of 31 million reads were used for transcriptome assembly. I know that good annotation starts with a good assembly (**** in=**** out) - i get it. So obviously suggesting > 100 million reads for a de novo assembly is good advice for future experimental design, however, our lab only had that much money so it is what it is.

              I'm really looking for ways to improve this assembly, but thanks for you kind suggestions.

              Originally posted by Markiyan View Post
              I wouln't rely on any reference based methods if the similarity between the beasts is less than 95% on the DNA level.
              OK so only one published genome exists for this genus so how would I know how similar species would be on a DNA level?

              Comment


              • #8
                Originally posted by moldach View Post

                What are we talking about in terms of time spent vs rewards? - obviously an assembly is never 100% complete, but there comes a point at which the returns will not be sufficient to justify the time/cost.
                I don't have an informed answer since that would depend on the data. Generally for an annotation project it would be ideal to have some genomic DNA data to give at least low pass coverage. Having genomic DNA would allow you to assemble that data/build gene models and to see how much of the potentially expressible component you are recovering in your transcriptome data.

                It is quite possible that you have reached (or are close to) that point where the return on time investment is not going to be worth it, with the assemblies you have. Since you are not going to generate additional data what you have is what you have.

                What other databases besides refseq_genomic could be used?
                You could use genpept, trembl. Also other searches such as psi-blast/delta-blast.

                You mentioned poring through the results and make informed decisions. This is don't quite understand. Do you mean that some annotations will be erroneous? Maybe a hypothetical example would help
                You probably realize that automated blast searches (or any similarity searches) are only going to take you a part of the way. At some point you may need to do more thorough searches (with same query) with different parameters (e.g. similarity matrices). Once you have potential hits, you would need to manually look at regions of homology (and they may not be extensive, think a conserved active site), construct/edit sequence alignments, to see if you can extend the annotation from a well known protein from a distant species to yours. This type of work is pretty tedious/time consuming and not something you want to do as a side project. Hope that helps.

                Comment


                • #9
                  Thank you very much Genomax and Markiyan for your time and valuable feedback.

                  Comment


                  • #10
                    Originally posted by moldach View Post
                    I used HiSeq250 Paired-end reads, this was contracted out by BGI Hong Kong (unfortunately the details of their library prep aren't available so I'm not sure what fragment size of cDNA library they used).

                    I assembled using two libraries to capture time-specific isoforms. Each library had roughly 15 million reads, so a total of 31 million reads were used for transcriptome assembly.
                    I'm really looking for ways to improve this assembly, but thanks for you kind suggestions.
                    So the input dataset theoretically looks quite good. Assuming the data is clean of the adapters (fastqc it first!).

                    I would still try doing iterative cDNA assembly approach, because it helps grealty with removal of all those spurious links to highly expressed transcripts from the low expressed transcripts by the chimeric reads. Even if you have only 2-5% of them, they still can cause a lot of trouble, because one would expect at least 3-4 orders of magnitude dynamic range, so 10^4 more expressed template would have a lot of chimera links to low expressed ones.

                    If you assemble only 10K or so reads at first, you would get the most expressed ones, than you can remove them from the next iterations, so highly expressed chimeric part would be simply masked off instead of confusing the assembler.
                    Increase your dataset by 5-50X at a time (avoid getting contigs with more than 500X coverage).

                    One can use nearly any DNA assembler for this (I've done exactly this with snail transcriptome in 2009 (done with 454 flx) using sff2phd & PHRAP over 3 iterations) and got way better results than from the newbler v2.0 in the cDNA mode over a single pass.
                    PS: results were evaluated in the consed.

                    Originally posted by moldach View Post
                    OK so only one published genome exists for this genus so how would I know how similar species would be on a DNA level?
                    Get the published genome fasta file and try 2 things:
                    1. formatdb into a blast database and blastn / tblastx some denovo transcriptome contigs against it
                    2. simply try mapping your reads against it using bwa or similar.
                    PS: pay attention to fasta_ID's, not all mappers like default NCBI format!
                    Also see what the % of mapped reads and "SNP" density to give you some roughf idea of similarity.

                    Markiyan.

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM
                    • seqadmin
                      Techniques and Challenges in Conservation Genomics
                      by seqadmin



                      The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                      Avian Conservation
                      Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                      03-08-2024, 10:41 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, Yesterday, 06:37 PM
                    0 responses
                    10 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, Yesterday, 06:07 PM
                    0 responses
                    10 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-22-2024, 10:03 AM
                    0 responses
                    51 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-21-2024, 07:32 AM
                    0 responses
                    67 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X