  • Too few reads mapping back to contigs

    I assembled plant transcriptome 454 data (non-normalised) using Trinity after the following steps:

    1) pre-processing (removal of adaptors and vector contamination)
    2) removal of rRNA sequences
    3) removal of chloroplast and mitochondrial genes using bwa

    From 370,929 reads, I got 21,486 contigs. When I mapped the reads back to the contigs using bwa, only 44,678 reads mapped, which suggests that only those were used in the assembly. What am I doing wrong here? I BLASTed a random selection of the contigs and observed that they share over 90% similarity with proteins from related legumes (although many were hypothetical).
    However, only a small percentage of the contigs align to the transcript assemblies of related legumes when mapped using bwa.

    The Velvet assembly of the same data resulted in 15,323 contigs with lower N50, N90, maximum length etc.
    The MIRA assembly resulted in more contigs and more reads being used, but lower N50, N90 and average contig length.
    Why are only 44,678 reads being used? Any advice is greatly appreciated.
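
For readers comparing the N50 and N90 figures mentioned above, here is a minimal sketch of how these statistics are usually computed from a list of contig lengths. The function name `nx` and the example lengths are made up for illustration; they do not come from any of the assemblies in this thread.

```python
def nx(lengths, percent):
    """Return the Nx value: the largest length L such that contigs of
    length >= L together contain at least `percent` % of all assembled bases."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length


# Hypothetical contig lengths (in bases), for illustration only.
contig_lengths = [1200, 900, 800, 400, 300, 200, 200]

n50 = nx(contig_lengths, 50)
n90 = nx(contig_lengths, 90)
print("N50:", n50, "N90:", n90)
```

Note that N50 and N90 describe only the length distribution of the contigs. They say nothing about how many reads ended up in the assembly, so they are best read alongside the read counts.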

  • #2
    I assume you used Newbler, and that you had about 370 thousand reads? Did you check the 454NewblerMetrics.txt file and/or the 454ReadStatus.txt file to determine how many reads the assembler thought it used? I would guess that the assembly was very fragmented, so that many of the reads ended up in contigs that were too small to report. When doing transcriptome assemblies, Newbler has some rules about what gets reported as isotigs, contigs, or not reported at all; I don't remember them all off the top of my head.

    Also, you did tell the assembler that this is a cDNA assembly project, correct?


    • #3
      @seqret ... note his first line. He used Trinity, not Newbler. Then he used Velvet and MIRA.

      Originally posted by cerebralrust
      I assembled plant transcriptome 454 data (non normalised) using trinity

      I have been thinking about this problem. It is hard to tell without looking at the data. However, it is possible that Trinity, Velvet and MIRA are not up to the task. If you are recommending Newbler, then I heartily agree with that idea.


      • #4
        I'm wondering if the problem is not with the assembly but with the mapping. Is bwa the best tool to use here, and were the options used appropriate? (I'm asking because I'm not that familiar with bwa.) Frankly, if I had a set of contigs (putative transcripts) and wanted to map raw 454 reads back to them just to count them, I would use blat.


        • #5
          For 454, I recommend bwasw, bowtie2, smalt or tmap. Blat is a bit slow and does not output SAM.
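
Whichever of these mappers is used, the number of reads that map back can be counted directly from the SAM output. The sketch below assumes single-end reads (as with 454 data) and relies only on the SAM specification: header lines start with `@`, the read name is column 1, the FLAG is column 2, and bit 0x4 marks an unmapped read. The function name and the example records are made up for illustration.

```python
def count_mapped_reads(sam_lines):
    """Return (mapped, total): distinct read names with at least one
    mapped alignment, and distinct read names seen in the SAM records."""
    mapped = set()
    seen = set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        seen.add(name)
        if not flag & 0x4:  # bit 0x4 set means the read is unmapped
            mapped.add(name)
    return len(mapped), len(seen)


# Hypothetical SAM records, for illustration only.
example = [
    "@SQ\tSN:contig1\tLN:1000",
    "read1\t0\tcontig1\t10\t60\t50M\t*\t0\t0\t*\t*",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "read3\t16\tcontig1\t200\t60\t50M\t*\t0\t0\t*\t*",
    "read3\t256\tcontig1\t500\t0\t50M\t*\t0\t0\t*\t*",  # secondary hit of read3
]

mapped, total = count_mapped_reads(example)
print(f"{mapped} of {total} reads mapped ({100 * mapped / total:.1f}%)")
```

Counting distinct read names rather than lines keeps a read with several reported alignments from being counted more than once. With a real file, `open("mapped.sam")` (a hypothetical filename) can be passed in place of the list. The total is only meaningful if the mapper writes unmapped reads to the SAM file as well.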


          • #6
            I would recommend Newbler since it has been specifically designed for 454 data.
            I am assuming that by mapping the reads back you are trying to get read counts per contig/isotig/isogroup, yes?

            If you use Newbler you can get read counts per contig from the 454ReadStatus.txt file that is produced when you perform a transcriptome assembly. Just do a grep for 'Assembled' and count the number of times each contig appears; if you have different samples in different lanes, you can do the appropriate grep to subset them as well. This file lists the 3' and 5' match of each read, so you effectively count each read twice. I don't think that is a problem since the reads are generally pretty long to begin with. This method means that some contigs may have a zero or low read count, but it does count every read, so that should not be a problem after you sum the read counts of contigs to form read counts per isotig.
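
The grep-and-count step described above could be sketched in Python as follows. The column layout is an assumption, not something confirmed in this thread: the sketch assumes 454ReadStatus.txt is tab-separated with the read accession in column 1, the read status in column 2, the 5' contig in column 3 and the 3' contig in column 6. Check the header line of your own file before relying on it. The function name and the example rows are hypothetical.

```python
from collections import Counter


def count_reads_per_contig(status_lines):
    """Count 5' and 3' contig hits for reads whose status is exactly
    'Assembled'. Each read contributes two counts, as described above."""
    counts = Counter()
    for line in status_lines:
        fields = line.rstrip("\n").split("\t")
        # Assumed layout: accno, status, 5' contig, 5' pos, 5' strand,
        #                 3' contig, 3' pos, 3' strand
        if len(fields) < 8 or fields[1] != "Assembled":
            continue  # skips the header and reads with any other status
        counts[fields[2]] += 1  # contig holding the 5' end
        counts[fields[5]] += 1  # contig holding the 3' end
    return counts


# Hypothetical rows in the assumed layout, for illustration only.
example = [
    "Accno\tRead Status\t5' Contig\t5' Position\t5' Strand\t3' Contig\t3' Position\t3' Strand",
    "readA\tAssembled\tcontig00001\t12\t+\tcontig00001\t410\t-",
    "readB\tAssembled\tcontig00001\t300\t+\tcontig00002\t35\t+",
    "readC\tSingleton",
]

counts = count_reads_per_contig(example)
for contig, n in sorted(counts.items()):
    print(contig, n)
```

The status column is compared for an exact match here, unlike a plain grep for 'Assembled', which would also match any longer status string that contains that word.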

            Alternatively, you can grep 'Assembled', make a subset of the assembled reads, and then map them back to your contigs using GSMapper. I recommend only using reads with the Assembled status to minimise false mapping. I also use mapping for SNP discovery, so I set -ais 1, which means that the mapped read needs to be a very good match.
            Last edited by Jeremy; 02-23-2012, 10:22 PM.


            • #7
              Thank you for all your suggestions, members!

              @seqret: As Rick pointed out, I've never used Newbler.

              @Rick: Using Newbler is not an option, I guess, since it is not open source and we got the sequenced data from a collaborator in the US. Perhaps my only option is to standardise MIRA parameters to improve the assembly?

              @kmcarr: I was wondering about the mapping too. I will try mapping with bwasw and bowtie2, as lh3 suggested, since I also need the results in SAM format.

              @lh3: I will try all of them, compare, and pick the best one.

              @Jeremy: As I mentioned before, Newbler is not an option since it is not open source and I'm a poor undergraduate student. But I will keep your suggestions in mind for the future.

              I suppose I'm left with the option of using MIRA with various combinations of parameters to get the best assembly.

              In case it helps anyone: I should not have used Trinity for this data, considering the opinion of Brian J. Haas, one of the key developers of Trinity:

              "Ultimately, Trinity might not be the best tool for assembling 454 data, since coverage won't be anywhere near what is expected from Illumina in most cases, and Trinity exploits the high coverage data as part of reconstructing transcripts. The current version of Newbler is supposed to work especially well for 454 transcriptome data, so I encourage you to give that a try if you haven't already."


              • #8
                Originally posted by cerebralrust
                @Jeremy : As i mentioned before, Newbler is not an option since it is not open source and i'm a poor undergraduate student. But i will keep your suggestions in mind for the future.
                Newbler may be proprietary but proprietary != $. You can obtain Newbler free of charge by completing the software request at this webpage. Note: I'm not sure if there are any restrictions for non-USA distribution.


                • #9
                  Once you do get Newbler, you should use the .sff file(s) for assembly and mapping. This file contains the quality scores as well as the sequence, so it will produce much better results than a plain .txt file of the sequences alone.
