  • How to add a suffix to fastq file

    Hi everyone,

    I'm trying to do some alignments, but my latest Illumina data came with a rather strange suffix. For a pair, the read identifiers look like this:
    @HWI-ST201:195:BB0036ABXX:6:1101:1407:1941 1:N:0:
    @HWI-ST201:195:BB0036ABXX:6:1101:1407:1941 2:N:0:

    This seems to confuse a number of programs. Does anyone know of a good script to trim off that 1:N:0: and just add a more standard .1 or .F kind of thing?

    Thanks

  • #2
    Your question made me finally write a blogpost about this: http://contig.wordpress.com/2011/09/...-fastq-header/. The awk command mentioned can be adjusted as needed. Also, check out http://en.wikipedia.org/wiki/Fastq for a comparison of old- and new-style headers.
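
For anyone who would rather use Python than awk, here is a minimal sketch of the same kind of conversion. It is not the command from the blog post. It assumes plain four-line FASTQ records (no wrapped sequence lines) and rewrites a new-style header such as "@name 1:N:0:" to the old-style "@name/1". The names `to_old_style` and `convert` are made up for this example.

```python
import re

# New-style header: "@<name> <read>:<filtered Y/N>:<control number>:<index>"
NEW_STYLE = re.compile(r"^@(\S+) ([12]):[YN]:\d+:\S*$")


def to_old_style(header: str) -> str:
    """Turn '@name 1:N:0:' into '@name/1'; return other headers unchanged."""
    header = header.rstrip("\n")
    match = NEW_STYLE.match(header)
    if match is None:
        return header
    name, read_number = match.groups()
    return f"@{name}/{read_number}"


def convert(lines):
    """Yield FASTQ lines with only the header (every 4th line) rewritten.

    Counting lines instead of testing for a leading '@' matters because
    a quality line may also start with '@'.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        yield to_old_style(line) if i % 4 == 0 else line
```

To use it, open the FASTQ file, pass the file object to `convert`, and write each yielded line to a new file followed by a newline. Swap "/" for "." in `to_old_style` if your aligner wants the .1 / .2 style instead.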


    • #3
      That looks great, thank you. I was all set to attempt to write my own script, but I'm very new to actually writing scripts, so I'm glad it didn't come to that.


      • #4
        Hi,

        Is it possible to ADD a term to the Illumina FASTQ file? Will that interfere with programs that map reads to a reference? Or will those programs generally ignore everything after the "#"?

        Thanks,
        Andor


        • #5
          You can add whatever you want to the read name, but you can't add new lines or anything to the lines with bases or qualities.


          • #6
            Originally posted by cement_head
            Is it possible to ADD a term to the Illumina FASTQ file? Will that interfere with programs that map reads to a reference? Or will those programs generally ignore everything after the "#"?
            As Brian Bushnell said, the FASTQ header line is free text. Quoting from Cock et al. 2010:

            ‘@’ title line which often holds just a record identifier. This is a free format field with no length limit—allowing arbitrary annotation or comments to be included...
            However, you can't assume that downstream programs, e.g. aligners, don't expect more stringent constraints, e.g. the absence of blank spaces. Also, some programs expect PE reads to have the same name (I'm not sure whether the FASTQ spec requires this).


            • #7
              I've also seen a couple programs require paired-end read names to end in /1 and /2, even though that's neither a standard nor common practice (in fact, it's a stupid requirement). You'd be surprised how easy it is to break some aligners...


              • #8
                Originally posted by dpryan
                I've also seen a couple programs require paired-end read names to end in /1 and /2, even though that's neither a standard nor common practice (in fact, it's a stupid requirement). You'd be surprised how easy it is to break some aligners...
                Indeed, and what is more annoying is when programs discard anything after the first part of the name because then the pair information is lost. Examples: seqtk and the readfq library (I submitted a patch for the Perl version a long time ago, and this fix is on github).

                If the pair information has been lost, or you need to adjust the format for some aligner, you can use Pairfq (specifically with the subcommand addinfo). For this one simple task, though, it is probably just as easy to write a shell command. If the tool you are using is otherwise useful, it would probably be worth asking its developers to support Illumina FASTQ files.
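
To make the problem concrete, here is a small sketch of what "discard anything after the first part of the name" does to the two header styles. The function `first_token` is a stand-in written for this example; it is not code from seqtk or readfq.

```python
def first_token(header: str) -> str:
    """Keep only the text before the first whitespace, as some parsers do."""
    return header.split(None, 1)[0]


# New-style headers: the mate number lives after the space, so it is dropped.
new_r1 = "@HWI-ST201:195:BB0036ABXX:6:1101:1407:1941 1:N:0:"
new_r2 = "@HWI-ST201:195:BB0036ABXX:6:1101:1407:1941 2:N:0:"
new_style_collides = first_token(new_r1) == first_token(new_r2)

# Old-style headers: the mate number is part of the first token, so it survives.
old_r1 = "@HWI-ST201:195:BB0036ABXX:6:1101:1407:1941/1"
old_r2 = "@HWI-ST201:195:BB0036ABXX:6:1101:1407:1941/2"
old_style_collides = first_token(old_r1) == first_token(old_r2)
```

Here `new_style_collides` is True and `old_style_collides` is False: after truncation the two new-style mates can no longer be told apart, while the old-style /1 and /2 names stay distinct.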


                • #9
                  Originally posted by SES
                  Indeed, and what is more annoying is when programs discard anything after the first part of the name because then the pair information is lost. Examples: seqtk and the readfq library (I submitted a patch for the Perl version a long time ago, and this fix is on github).
                  I just want to mention that everything in the BBTools package does NOT do this, so you can subsample, normalize, trim, filter, etc. while leaving the names intact. But some pipelines require that everything after the first whitespace be truncated, on the assumption that these are comments. For example, SAM format requires read 1 and read 2 to have the exact same name, while in Illumina's output they have different names (/1 and /2, for example). So BBMap has a couple of related flags: "trimreaddescriptions", which will truncate everything after the first whitespace (for both reads and reference contigs), default false; and "keepnames", which will force read 1 and read 2 to retain their original names, even though the resulting SAM file will not technically be spec-compliant (it's still useful in many situations). By default, for paired reads, read 1 and read 2 will both get the full name of read 1 so as to produce a valid SAM file.

                  Originally posted by dpryan
                  I've also seen a couple programs require paired-end read names to end in /1 and /2, even though that's neither a standard nor common practice (in fact, it's a stupid requirement). You'd be surprised how easy it is to break some aligners...
                  Also, BBTools makes use of that information for autodetecting whether a single file is paired and interleaved, but it can be overridden. And it's certainly not required.
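
As a rough illustration of that kind of autodetection, here is a sketch that guesses from the first two headers whether a file is interleaved. It is only an assumption about how such a check could work, not BBTools' actual logic, and the names `mate_info` and `looks_interleaved` are invented for this example.

```python
import re


def mate_info(header: str):
    """Return (base_name, mate_number); mate_number is None if no tag is found."""
    header = header.rstrip("\n")
    if header.startswith("@"):
        header = header[1:]
    name, _, comment = header.partition(" ")
    # Old style: the name itself ends in /1 or /2.
    old = re.match(r"^(.*)/([12])$", name)
    if old:
        return old.group(1), int(old.group(2))
    # New style: the comment after the space starts with "1:" or "2:".
    new = re.match(r"^([12]):[YN]:", comment)
    if new:
        return name, int(new.group(1))
    return name, None


def looks_interleaved(headers) -> bool:
    """Guess whether the first two headers are mate 1 and mate 2 of one pair."""
    if len(headers) < 2:
        return False
    name1, mate1 = mate_info(headers[0])
    name2, mate2 = mate_info(headers[1])
    return name1 == name2 and mate1 == 1 and mate2 == 2
```

Pass it the header lines of the first two records in the file. A real tool would need to look at more than two records and offer an override, since a file with no recognizable mate tags can still be paired.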
                  Last edited by Brian Bushnell; 08-27-2014, 09:23 AM.


                  • #10
                    Originally posted by Brian Bushnell
                    I just want to mention that everything in the BBTools package does NOT do this, so you can subsample, normalize, trim, filter, etc. while leaving the names intact. But some pipelines require that everything after the first whitespace be truncated, on the assumption that these are comments. For example, SAM format requires read 1 and read 2 to have the exact same name, while in Illumina's output they have different names (/1 and /2, for example). So BBMap has a couple of related flags: "trimreaddescriptions", which will truncate everything after the first whitespace (for both reads and reference contigs), default false; and "keepnames", which will force read 1 and read 2 to retain their original names, even though the resulting SAM file will not technically be spec-compliant (it's still useful in many situations). By default, for paired reads, read 1 and read 2 will both get the full name of read 1 so as to produce a valid SAM file.
                    That is helpful information. I guess the developers of some tools assume you are only going to be mapping to a reference and working with SAM files (thus, trimming the read names to be valid). Of course, this is not the case for many of us but I can see how that assumption is valid for some (possibly most) use cases.
