  • Is the PRINSEQ fastq-format weird?

    Hi,

    I have been using Prinseq to trim my Illumina data, and the fastq output looks like this:

    Code:
    @HWI-ST486:386:D1UMHACXX:3:1101:1399:2119 1:N:0:TGACCA
    TCGTATCTGTAATCATGAACTTGTCAACGGCTACCTGGTTTCTGTCCT
    +HWI-ST486:386:D1UMHACXX:3:1101:1399:2119 1:N:0:TGACCA
    1:BDBEDFHHGHGCBGHGCHHIJGIFGGIGEHCHJJIJHGH<FDGDHF
    @HWI-ST486:386:D1UMHACXX:3:1101:1367:2139 1:N:0:TGACCA
    ATGTTTTTTGGGGTTATAACAGGGTGGAGCGCTTTATGCGACTTCGCCCTTT
    +HWI-ST486:386:D1UMHACXX:3:1101:1367:2139 1:N:0:TGACCA
    1=DDDEDHHDHHJJIGIFDFGIIJHGI?GHJIFHI@GG@F@;AH=?=BDCEE
    Why is the header repeated on the third line? Mapping with bowtie seems to work fine, but when I try to collapse identical reads with fastx, it will not accept the format.

  • #2

    The specification for fastq is that the third line starts with +, optionally followed by the read identifier.
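To make that rule concrete, here is a minimal sketch of a check for one record. The function name `plus_line_is_valid` is made up for illustration, and it assumes the usual reading that a repeated identifier, when present, matches the first line exactly.

```python
def plus_line_is_valid(header, plus_line):
    """Return True if plus_line is '+' alone or '+' followed by the header's identifier.

    Assumption: when the identifier is repeated, it must equal the text
    after '@' on the first line of the record.
    """
    if not header.startswith("@") or not plus_line.startswith("+"):
        return False
    repeated = plus_line[1:]
    return repeated == "" or repeated == header[1:]


header = "@HWI-ST486:386:D1UMHACXX:3:1101:1399:2119 1:N:0:TGACCA"
print(plus_line_is_valid(header, "+"))               # bare '+' form
print(plus_line_is_valid(header, "+" + header[1:]))  # repeated-identifier form
```

Both forms pass this check, so the output in the first post is valid fastq, even if some tools only accept the bare `+` form.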

    • #3
      I want to use FastX to collapse identical reads in the Fastq file and keep the count in the header.

      It seems that Prinseq only outputs fastq with the identifier in the third line, and that fastx will not take this input.

      Do you know of other programs that will collapse in this way?
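If no existing tool fits, collapsing can also be sketched in a few lines of Python. This is only a rough illustration, not a replacement for a tested program: it assumes strict 4-line records, the function names `collapse_reads` and `to_fasta` are made up, and the `>rank-count` header layout is an illustrative choice. Note that collapsing throws away the quality values, so the output is FASTA.

```python
from collections import Counter


def collapse_reads(lines):
    """Count identical sequences in 4-line FASTQ records.

    Returns (sequence, count) pairs, most frequent first.
    """
    counts = Counter()
    for i, line in enumerate(lines):
        # In a strict 4-line record the sequence is the second line.
        if i % 4 == 1:
            counts[line.strip()] += 1
    return counts.most_common()


def to_fasta(collapsed):
    """Yield FASTA lines; the '>rank-count' header is an assumed layout."""
    for rank, (seq, count) in enumerate(collapsed, start=1):
        yield f">{rank}-{count}"
        yield seq


fastq = [
    "@read1", "ACGT", "+read1", "IIII",
    "@read2", "TTTT", "+", "IIII",
    "@read3", "ACGT", "+read3", "IIII",
]
for out_line in to_fasta(collapse_reads(fastq)):
    print(out_line)
```

Because only the second line of each record is read, it does not matter whether the third line carries the identifier or not.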

      • #4
        Prinseq option -no_qual_header should only show + in the third line.

        • #5
          fastx? Do you mean the tool in the FASTX-Toolkit? Have you tried setting the quality score parameter -Q 33?

          • #6
            Originally posted by ddb:
            Prinseq option -no_qual_header should only show + in the third line.
            That is correct, but I can see how it could be confusing that prinseq prints the qual header by default when most tools default to omitting it. Previous versions of the program did not have this option, which was annoying because you had to run another command to remove the header from that line.
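For files that were already written with the repeated header, that extra step can be sketched as below. This is a minimal illustration with a made-up function name, `strip_plus_header`, and it assumes strict 4-line records with no wrapped sequence or quality lines. It goes by line position rather than by the leading character, because a quality line can itself start with `+` or `@`.

```python
def strip_plus_header(lines):
    """Yield FASTQ lines with the third line of each record reduced to '+'."""
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        # Assumes strict 4-line records, so every third line is the '+' line.
        if i % 4 == 2:
            if not line.startswith("+"):
                raise ValueError(f"line {i + 1}: expected '+', got {line!r}")
            line = "+"
        yield line


record = [
    "@HWI-ST486:386:D1UMHACXX:3:1101:1399:2119 1:N:0:TGACCA",
    "TCGTATCTGTAATCATGAACTTGTCAACGGCTACCTGGTTTCTGTCCT",
    "+HWI-ST486:386:D1UMHACXX:3:1101:1399:2119 1:N:0:TGACCA",
    "1:BDBEDFHHGHGCBGHGCHHIJGIFGGIGEHCHJJIJHGH<FDGDHF",
]
for out_line in strip_plus_header(record):
    print(out_line)
```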

            • #7
              Thanks for the replies! I hadn't noticed that option in Prinseq.

              • #8
                Anyone have any idea what prinseq's 5maxd, 3maxd and exactmaxd stats mean? There is no explanation in the manual.

                • #9
                  It is related to duplicates. This section from the help menu has some explanation.

                  -stats_dupl
                  Outputs the number of exact duplicates (exact), 5' duplicates
                  (5), 3' duplicates (3), exact duplicates with reverse
                  complements (exactrevcom) and 5'/3' duplicates with reverse
                  complements (revcomp), and total number of duplicates (total).
                  The maximum number of duplicates is given under the value name
                  with an additional "maxd" (e.g. exactmaxd or 5maxd).
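As a rough illustration of how I read the exact / exactmaxd pair, here is a small sketch. It is not PRINSEQ's code: the function name `exact_duplicate_stats` is made up, and it assumes that only the copies beyond the first occurrence of a sequence count as duplicates, which the help text does not spell out.

```python
from collections import Counter


def exact_duplicate_stats(sequences):
    """Return total exact duplicates and the largest duplicate count for one sequence.

    Assumption: the first occurrence of a sequence is the original and
    every further identical copy is one duplicate.
    """
    counts = Counter(sequences)
    extra = [n - 1 for n in counts.values()]
    return {"exact": sum(extra), "exactmaxd": max(extra, default=0)}


reads = ["ACGT", "ACGT", "ACGT", "TTTT", "GGCC", "GGCC"]
print(exact_duplicate_stats(reads))
```

Under that assumption, "exact" is the sum over all sequences and "exactmaxd" is the value for the single most duplicated sequence. The 5' and 3' variants would presumably apply the same idea to reads that match at one end only.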
