  • Illumina 454 de novo hybrid

    Hi all,

    I am currently trying to assemble a 5 Mb bacterial genome. I have 43 bp single-end reads from an Illumina Genome Analyzer IIx and 500 bp double-end reads from a Roche 454 GS FLX, and was wondering if anyone has had any luck with hybrid de novo assembly for these library types. I have read that assembling the Illumina data in Velvet, using EMBOSS to cut the resulting contigs down a bit, and then combining them with the 454 data in Newbler gets decent results, but that was two years ago, so I don't know whether there is a higher-quality pipeline to go with now.

    Thanks
    Last edited by pmart1; 06-30-2014, 01:35 PM.
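
The "cut the resulting contigs down a bit" step above can be sketched in a few lines. This is only an illustration of the idea (turning long Velvet contigs into overlapping pseudo-reads that a second assembler can ingest), not the exact procedure from the older pipeline; the function name `split_contig` and the default fragment size and overlap are assumptions chosen for the example.

```python
def split_contig(seq, size=2000, overlap=500):
    """Cut one contig into overlapping fragments (pseudo-reads).

    size and overlap are illustrative defaults, not values from the thread.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    if len(seq) <= size:
        return [seq]  # short contigs are passed through unchanged
    step = size - overlap
    fragments = []
    start = 0
    while True:
        fragments.append(seq[start:start + size])
        if start + size >= len(seq):
            break  # the last fragment reached the end of the contig
        start += step
    return fragments


if __name__ == "__main__":
    contig = "ACGT" * 1250  # a synthetic 5000 bp contig
    for i, frag in enumerate(split_contig(contig), start=1):
        print(f">contig1_part{i} length={len(frag)}")
```

Consecutive fragments share `overlap` bases, so the downstream assembler can stitch them back together where no other evidence contradicts the join.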

  • #2
    Use the MIRA assembler and make sure you feed it untrimmed Illumina reads (do not do quality trimming on your own). It will remove the Illumina adapters itself. Regarding the 454 data ... I cannot recommend any good adapter removal tool for it, except the one I wrote.



    • #3
      Originally posted by martin2
      I cannot recommend any good adapter removal tool for it, except the one I wrote.
      Share the code then



      • #4
        Thank you for the suggestion, Martin! I will have to try that. I've never used MIRA before. Also, would you be able to share your code for the 454 trimming?



        • #5
          Hi, as it happens I did all the development on my own, so currently I only offer data cleanup (or even assembly) as a service. It is not only the code (28k lines of Python) but also a collection of artefacts which I found more 'manually' than by any 'computer-based' approach. They may not be abundant in one dataset, yet you may hit them in some other one later on ...

          I am a molecular biologist, and with some datasets (transcriptomes) I had a lot of fun looking for restriction sites and ligation products, trying to work out how they emerged and how to generalize queries for them. To date I have developed/tested it on 2227 datasets, and I had better not count how many times I re-calculated all of them from scratch once I realized something had been escaping me. (You wouldn't believe that I am still finding datasets produced by yet another lab protocol with yet another batch of primers/adapters and associated issues.)

          It even works on at least some WGS IonTorrent datasets, as the lab protocols are just the same. If I am not mistaken, IonTorrent was started by people who left 454, so some ideas and issues are common to both.

          Unfortunately, I cannot share the code or even the queries. You can find the URL in my profile.

          --
          For your particular case, I think it is better to get more sequencing data; 43 bp reads are too short these days and I doubt they are worth the effort.



          • #6
            > ... and 250bp double end reads from a Roche 454 GS FLX ...

            What do you have? Isn't this Illumina instead?



            • #7
              Pardon me, the 454 reads are 500 bp double-ended. I'm actually an intern in a lab, and all of this (including Linux) is extremely new to me.



              • #8
                That sounds like the Illumina mate-pair protocol. What is the name of the file, and what are the first one or two entries in it?



                • #9
                  The libraries are from an old strain that perished in a power outage so we cannot run any further sequencing.



                  • #10
                    (Strain name).sff. I'm not sure how to open it.



                    • #11
                      sffinfo (Strain name).sff | head -n 100



                      • #12
                        Magic Number: 0x2E736666
                        Version: 0001
                        Index Offset: 551942992
                        Index Length: 4131866
                        # of Reads: 206557
                        Header Length: 840
                        Key Length: 4
                        # of Flows: 800
                        Flowgram Code: 1
                        Flow Chars: (sequence data)
                        Key Sequence: TCAG

                        >(Strain name)
                        Run Prefix:
                        Region #:
                        XY Location:

                        Run Name:
                        Analysis Name:


                        Thank you so much for the patience and help.
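
For readers wondering where the fields printed above come from: they sit in the fixed-size common header at the start of every SFF file, and the magic number 0x2E736666 is simply the ASCII bytes ".sff". Below is a minimal sketch of reading that header with Python's standard library, assuming the usual big-endian SFF layout (31 fixed bytes, then the flow characters and the key sequence). The function name `parse_sff_common_header` and the dictionary keys are made up for this example.

```python
import struct

SFF_MAGIC = 0x2E736666  # the bytes ".sff" read as a big-endian integer

# magic, version, index offset, index length, number of reads,
# header length, key length, flows per read, flowgram format code
COMMON_HEADER = struct.Struct(">I4sQIIHHHB")  # 31 bytes in total


def parse_sff_common_header(data):
    """Decode the common header from the first bytes of an SFF file."""
    (magic, version, index_offset, index_length, n_reads,
     header_length, key_length, n_flows, flowgram_code) = COMMON_HEADER.unpack_from(data, 0)
    if magic != SFF_MAGIC:
        raise ValueError("not an SFF file (bad magic number)")
    pos = COMMON_HEADER.size
    flow_chars = data[pos:pos + n_flows].decode("ascii")
    pos += n_flows
    key_sequence = data[pos:pos + key_length].decode("ascii")
    return {
        "version": version.hex(),
        "index_offset": index_offset,
        "index_length": index_length,
        "number_of_reads": n_reads,
        "header_length": header_length,  # includes padding to a multiple of 8
        "key_length": key_length,
        "number_of_flows": n_flows,
        "flowgram_code": flowgram_code,
        "flow_chars": flow_chars,
        "key_sequence": key_sequence,
    }
```

With the values posted above, 31 fixed bytes plus 800 flow characters plus a 4-base key give 835 bytes, which pads up to the reported header length of 840. To try it on a real file, open it in binary mode and pass the first few kilobytes, for example `parse_sff_common_header(open(path, "rb").read(4096))`, where `path` is whatever your .sff file is called.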



                        • #13
                          OK, so this is likely Titanium sequencing with the General Library Preparation protocol or Amplicon/paired-end ..., so read length is up to 500 nt. The best option would be to feed it into Newbler:

                          runAssembly -o (Strain name) -mi 90 -ml 80 -consed -scaffold -cpu 2 (Strain name).sff GAIIxdata.fastq



                          For non-Roche assemblers you have to convert the SFF file to FASTA and quality files first:

                          sffinfo -s (Strain name).sff > (Strain name).fasta
                          sffinfo -q (Strain name).sff > (Strain name).fasta.qual
                          Last edited by martin2; 06-30-2014, 02:31 PM. Reason: Typo
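
Some assemblers want a single FASTQ file rather than the separate sequence and quality files written by `sffinfo -s` and `sffinfo -q`. A minimal sketch of merging the two is below. It assumes both files list the reads in the same order, that the quality file holds space-separated integer Phred scores, and that Phred+33 encoding is wanted; the helper names `read_records` and `fasta_qual_to_fastq` are made up for this example.

```python
def read_records(lines):
    """Yield (name, payload) pairs from FASTA-style lines.

    Payload lines are joined with spaces so the same parser works for
    both sequence files and numeric quality files.
    """
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, " ".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, " ".join(chunks)


def fasta_qual_to_fastq(fasta_lines, qual_lines):
    """Combine matching FASTA and QUAL records into FASTQ text (Phred+33)."""
    out = []
    # zip stops at the shorter input, so check the read counts separately
    for (name, seq), (qname, quals) in zip(read_records(fasta_lines),
                                           read_records(qual_lines)):
        seq = seq.replace(" ", "")
        scores = [int(q) for q in quals.split()]
        if name != qname or len(seq) != len(scores):
            raise ValueError(f"sequence/quality mismatch at read {name}")
        # cap at 93 so every score maps to a printable ASCII character
        qual_str = "".join(chr(min(s, 93) + 33) for s in scores)
        out.append(f"@{name}\n{seq}\n+\n{qual_str}\n")
    return "".join(out)
```

Both functions accept any iterable of lines, so open file handles can be passed directly. Note that the sequences written by `sffinfo` are already clipped according to the trim points stored in the SFF file, so no extra trimming is applied here.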



                          • #14
                            I do have Newbler on my other computer. When I get home, I will run it and post the results. Thank you very much for all of your help! It is much appreciated.



                            • #15
                              I was able to run it and these were the resulting metrics.

                              scaffoldMetrics
                              numberOfScaffolds = 51;
                              numberOfBases = 145548;
                              avgScaffoldSize = 2853;
                              N50ScaffoldSize = 2594, 19;
                              largestScaffoldSize = 12532;
                              numberOfScaffoldContigs = 51;
                              numberOfScaffoldContigBases = 145548;
                              avgScaffoldContigSize = 2853;
                              N50ScaffoldContigSize = 2594, 19;
                              largestScaffoldContigSize = 12532;
                              scaffoldEndMetrics
                              NoEdges = 97, 95.1%;
                              OneEdge = 1, 1.0%;
                              TwoEdges = 4, 3.9%;
                              ManyEdges = 0, 0.0%;
                              scaffoldGapMetrics
                              BothNoEdges = 0, 0.0%;
                              OneNoEdges = 0, 0.0%;
                              BothOneEdge = 0, 0.0%;
                              MultiEdges = 0, 0.0%;
                              largeContigMetrics
                              numberOfContigs = 2050;
                              numberOfBases = 1802379;
                              avgContigSize = 879;
                              N50ContigSize = 888;
                              largestContigSize = 12532;
                              Q40PlusBases = 1705911, 94.65%;
                              Q39MinusBases = 96468, 5.35%;
                              largeContigEndMetrics
                              NoEdges = 4052, 98.8%;
                              OneEdge = 29, 0.7%;
                              TwoEdges = 16, 0.4%;
                              ManyEdges = 3, 0.1%;
                              allContigMetrics
                              numberOfContigs = 4866;
                              numberOfBases = 2738478;
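
As a reading aid for the metrics above: N50 is the length of the contig (or scaffold) at which the running total, taken from the longest down, first reaches half of all assembled bases. The sketch below shows that standard definition and re-checks two of the reported averages; it is not Newbler's own code, and the function name `n50` is chosen for this example.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no lengths given")


if __name__ == "__main__":
    # Averages reported above, recomputed from the posted totals
    print(1802379 // 2050)  # large contigs: numberOfBases / numberOfContigs
    print(145548 // 51)     # scaffolds: numberOfBases / numberOfScaffolds
```

The large contigs add up to about 1.8 Mb, well short of the roughly 5 Mb genome mentioned in the first post, so the N50 here describes only the part of the genome that actually assembled.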
