
  • Newbler with fastq from sff

    I know this is an odd request, but here goes:

    I want to extract an SFF from an 8 kb library to paired FASTQ files, and then assemble those FASTQs with Newbler (I know, it's not recommended, but bear with me). Thus far, no matter how I do this, I have not been able to get Newbler to treat the reads as paired ends. I have tried the workarounds I can find for Illumina PE data (#0/1 #0/1 as trailing), as well as any other tricks I can figure out.

    Does anyone have any advice on this?

    Thanks!

    =matt

  • #2
    I'm not sure you'll be able to do this (as in, not sure if you can trick Newbler into using FASTQ for scaffolding). I might be wrong. Why do you want to do this?

    But assuming it is possible, have you checked that the orientation of the reads is correct for Illumina paired-end data? For example, you might need to check that the first read is forward and the second is reverse.

    Also it's usually /1 /2 as suffixes.
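A quick way to sanity-check the suffixes before handing a FASTQ to Newbler is to tally which read names carry both a /1 and a /2 mate. This is only a minimal sketch: the function name `mate_summary` is made up for illustration, and it assumes the mate number is the last thing in the header after a slash.

```python
from collections import defaultdict


def mate_summary(headers):
    """Count read names that have both /1 and /2 mates.

    `headers` is any iterable of FASTQ header lines (the lines starting
    with '@'). Hypothetical helper, not part of Newbler or any library.
    """
    mates = defaultdict(set)
    unsuffixed = 0
    for header in headers:
        name = header.strip().lstrip("@")
        # Split on the last slash: "READ#0/1" -> ("READ#0", "/", "1")
        base, sep, mate = name.rpartition("/")
        if sep and mate in ("1", "2"):
            mates[base].add(mate)
        else:
            unsuffixed += 1
    complete = sum(1 for seen in mates.values() if seen == {"1", "2"})
    return {
        "complete_pairs": complete,
        "orphans": len(mates) - complete,
        "unsuffixed": unsuffixed,
    }


if __name__ == "__main__":
    print(mate_summary(["@a#0/1", "@a#0/2", "@b#0/1", "@c"]))
```

If `complete_pairs` comes back as zero, the problem is in the naming rather than anything Newbler is doing.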



    • #3
      So, we HAVE used illumina PE data for newbler using /1 /2 suffixes, and it's perfectly happy to treat them as paired reads. I'm not sure if newbler is also looking for some other character (illumina read names have a ton of colons in them, e.g.) in the naming to make it check for pairs.

      From what I've read, orientation for paired FASTQ needs to be inward-facing. I'm not sure whether that orientation is generated for our set, but that would show up during the assembly, not during the read QC, which is where the number of paired reads used from each library is reported.



      • #4
        Originally posted by mscholz
        So, we HAVE used illumina PE data for newbler using /1 /2 suffixes, and it's perfectly happy to treat them as paired reads. I'm not sure if newbler is also looking for some other character (illumina read names have a ton of colons in them, e.g.) in the naming to make it check for pairs.
        Yes, I have too. The newer header format (Illumina 1.8+) with colons isn't compatible; here's a blog post about how to convert to the older format:

        One unfortunate drawback of working with Illumina sequences is the many changes to the format of their fastq readfiles. The quality scoring has been changed several times since the first Solexa rea…
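For reference, the header conversion can be sketched in a few lines of Python. This is not the script from the blog post. It assumes the usual Illumina 1.8+ layout (`@instrument:run:flowcell:lane:tile:x:y read:filter:control:index`), and joining instrument, run and flowcell with underscores is just one way to build an old-style name; `to_old_header` is a hypothetical helper name.

```python
import re

# Illumina 1.8+ header: "@inst:run:flowcell:lane:tile:x:y read:filter:control:index"
_CASAVA18 = re.compile(
    r"^@(?P<inst>[^:\s]+):(?P<run>[^:\s]+):(?P<fc>[^:\s]+):"
    r"(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"\s+(?P<read>[12]):[YN]:\d+:(?P<index>\S*)$"
)


def to_old_header(header):
    """Rewrite a 1.8+ header as old-style '@name:lane:tile:x:y#index/read'.

    Headers that do not match the 1.8+ pattern are returned unchanged.
    """
    header = header.strip()
    m = _CASAVA18.match(header)
    if m is None:
        return header
    index = m.group("index") or "0"  # fall back to "0" when no index is given
    return "@{}_{}_{}:{}:{}:{}:{}#{}/{}".format(
        m.group("inst"), m.group("run"), m.group("fc"),
        m.group("lane"), m.group("tile"), m.group("x"), m.group("y"),
        index, m.group("read"),
    )


if __name__ == "__main__":
    print(to_old_header("@HWI-ST1234:42:C0ABCACXX:1:1101:1234:5678 1:N:0:ATCACG"))
```

Only every fourth line of a FASTQ file (the `@` header) should be passed through this; the sequence, `+` and quality lines stay as they are.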


        From what I've read, orientation for paired FASTQ needs to be inward-facing. I'm not sure whether that orientation is generated for our set, but that would show up during the assembly, not during the read QC, which is where the number of paired reads used from each library is reported.
        OK, maybe you can post the top bit of your 454NewblerMetrics.txt file as this helps with debugging. The other possibility is that your sequences are too short.

        I note you didn't say why you wanted to do this. Just to say that if you wanted to use Newbler with SFF files from another - competing - platform with different mate-pair linker sequences, then there is a hacky way of doing this.



        • #5
          Wait wait wait!

          These are native 454 sffs that are being extracted from 454 runs. I haven't bothered letting a run go to completion for a while, since the first outputs during qc to the command line are whether read sets are being treated as paired or not.

          The reasoning is convoluted, but involves feeding multiple library sets into Newbler, of which the 8 kb library may or may not be 454 data, so our informatics team would prefer to have the pipeline stable regardless of the sequencing type for the 8 kb library. That means that all SFFs have to be converted to FASTQs for this to work. As far as I can tell, the method you sent the link to works great for Illumina data into Newbler, but doesn't seem to work in any permutation for extracted and adapter-split 454 reads.

          If you really think that the length of the 454 subreads may be causing the problem I can size filter and try again, but it really looks to my unpracticed eye that it has something to do with the headers.

          Once I have a completed run, I'll be happy to post the metrics file. In the meantime, this is what the headers currently look like:

          sff ->fastq edit (read truncated for visibility)
          @HK0J9ML02GGB5G#0/1
          CGCGAGGAAATACGGTCGACGCGGGCGGCGATCAC
          +
          88?=444<;;9698<<8444??444422@@???ABB==
          @HK0J9ML02IA4J3#0/1



          • #6
            Newbler will not check for the linker in FASTQ files, so you'll have to provide the forward and reverse reads separately; this much I think you already know. You could use either Newbler itself to split the reads (best option), or the sff_extract tool. For the first case, run Newbler with your 8 kb SFF file and the -tr flag, and look for the 454TrimmedReads.fna and qual files.

            For mate pairs, I recommend using the -p option to force Newbler to treat them as such. I don't think it will work with FASTQ files, even when set up correctly.
            The headers need to be adjusted as per http://contig.wordpress.com/2011/01/...her-platforms/, or when you run sff_extract as per http://flxlexblog.wordpress.com/2012...substr-mg1655/



            • #7
              Yes, for my own edification I just tried this:

              test.fastq
              Code:
              @test1/1
              ATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATAT
              +
              IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
              @test1/2
              ATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATAT
              +
              IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
              @test2/1
              ATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATAT
              +
              IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
              @test2/2
              ATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATAT
              +
              IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
              Code:
              runAssembly test.fastq
              Created assembly project directory P_2012_10_31_10_52_07_runAssembly
              1 read file successfully added.
                  test.fastq  (Fastq dataset, with standard scores)
              Doesn't work ... but if you fake up Illumina headers (test2.fastq):

              Code:
              @HWI-0001_0001_AAAAAA:1:1:1:1#ATCACG/1
              ATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATAT
              +
              IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
              @HWI-0001_0001_AAAAAA:1:1:1:1#ATCACG/2
              ATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATAT
              +
              IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
              @HWI-0001_0001_AAAAAA:1:1:1:2#ATCACG/1
              ATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATAT
              +
              IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
              @HWI-0001_0001_AAAAAA:1:1:1:2#ATCACG/2
              ATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATAT
              +
              IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
              Code:
              runAssembly test2.fastq
              Created assembly project directory P_2012_10_31_10_53_56_runAssembly
              1 read file successfully added.
                  test.fastq  (Illumina paired-end dataset, with standard scores)
              Assembly computation starting at: Wed Oct 31 10:53:56 2012  (v2.6 (20110517_1502))
              Indexing test.fastq (with quality scores)...
                -> 4 reads, 304 bases, 4 marked as matepairs.
              It does ...

              If you go back to the first version and add -p as Lex says, it does add it as a paired-end dataset, but doesn't mark the reads as mate-pairs:

              Code:
              runAssembly -p test.fastq
              Created assembly project directory P_2012_10_31_10_55_01_runAssembly
              1 read file successfully added as explicit paired-end files.
                  test.fastq  (Fastq paired-end dataset, with standard scores)
              Assembly computation starting at: Wed Oct 31 10:55:01 2012  (v2.6 (20110517_1502))
              Indexing test.fastq (with quality scores)...
                -> 4 reads, 304 bases.
              Interesting!
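Going by the toy test above, one thing to try is renaming the split 454 reads so their headers look like the faked Illumina ones. Below is a rough Python sketch of that renaming. The function name, the fixed `HWI-0001_0001_AAAAAA` prefix and the `ATCACG` index are placeholders copied from the test file, and whether Newbler then pairs real 454 mates correctly is untested here.

```python
import re

# Matches "@NAME/1", "@NAME/2" and "@NAME#0/1"-style headers.
_MATE = re.compile(r"^@(?P<name>\S+?)(?:#\S*)?/(?P<mate>[12])$")


def illumina_like_headers(headers, prefix="HWI-0001_0001_AAAAAA", index="ATCACG"):
    """Rename mate headers to old-style Illumina-looking names.

    Each distinct read name gets a serial number used as the last
    coordinate, so both mates of a pair share one name. Returns the new
    headers plus the name -> serial map so the original names can be
    recovered afterwards.
    """
    serial = {}
    renamed = []
    for header in headers:
        m = _MATE.match(header.strip())
        if m is None:
            raise ValueError("no /1 or /2 mate suffix in header: " + header)
        number = serial.setdefault(m.group("name"), len(serial) + 1)
        renamed.append("@{}:1:1:1:{}#{}/{}".format(prefix, number, index, m.group("mate")))
    return renamed, serial


if __name__ == "__main__":
    new_headers, mapping = illumina_like_headers(
        ["@test1/1", "@test1/2", "@test2/1", "@test2/2"]
    )
    print("\n".join(new_headers))
```

Run on the four headers from test.fastq, this reproduces the header lines of test2.fastq.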



              • #8
                Originally posted by mscholz
                The reasoning is convoluted, but involves using multiple library sets into newbler, of which the 8kB library may or may not be 454 data, so our informatics team would prefer to have the pipeline stable regardless of the sequencing type for the 8kb library.
                Separately, I'd suggest always using SFF files with Newbler when they are available, because of the additional signal information contained in the flowgrams. That tends to give better results.
