  • Illumina Fastq Header Search

    Dear All,

    I would like to retrieve sequences (fastq format) from an Illumina fastq data file using the first part of the sequence header.

    Example of an Illumina fastq header:
    @X01032:109:000000000-AGKF7:1:1101:11950:1779 1:N:0:1

    My query:
    @X01032:109:000000000-AGKF7:1:1101:11950:1779

    I tried usearch (fastx_getseqs), seqtk, and seqret but nothing works because of the special characters (e.g. ":","-") in the header. A simple grep like

    Code:
    grep "@X01032:109:000000000-AGKF7:1:1101:11950:1779" -A 3 in.fastq
    would work but it would take a long time to finish. I could reformat the headers but I prefer not to (if possible).

    Is there a tool out there that would work with Illumina fastq files?

    Thanks for the help!

  • #2
    You can do that with "filterbyname.sh" in the BBMap package.

    filterbyname.sh in=reads.fq out=filtered.fq include=t names=names.txt

    ...where names.txt has one name per line. Alternatively, you can pass the name directly with "names=X01032:109:000000000-AGKF7:1:1101:11950:1779". The program will still include reads whose headers have additional, non-matching text after the first whitespace. You should not include the leading "@" in the query, as it is not part of the name. If you do include the leading "@" for whatever reason, add the flag "truncateheadersymbol".
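
For readers who want the matching rule above spelled out, here is a minimal Python sketch of it. It is not BBMap's implementation, only an illustration under stated assumptions: the FASTQ file has exactly four lines per record, the read name is the header without its leading "@" and cut at the first whitespace, and the function names (`read_name`, `read_fastq`, `filter_by_name`) are hypothetical. The second record in the demo data and both sequences are made up.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple

Record = Tuple[str, str, str, str]  # header, sequence, separator, quality


def read_name(header: str) -> str:
    """Return the read name: no leading '@', cut at the first whitespace."""
    if header.startswith("@"):
        header = header[1:]
    fields = header.split(None, 1)
    return fields[0] if fields else ""


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield records from a FASTQ file with exactly four lines per record."""
    while True:
        lines = [handle.readline().rstrip("\n") for _ in range(4)]
        if not lines[0]:
            return  # end of file (or an empty header line)
        yield lines[0], lines[1], lines[2], lines[3]


def filter_by_name(
    handle: TextIO, names: Iterable[str], include: bool = True
) -> Iterator[Record]:
    """Yield records whose name is (include=True) or is not (include=False) listed."""
    # Normalising the queries the same way as the headers means a query
    # written with a leading '@' still matches.
    wanted = {read_name(n) for n in names if n.strip()}
    for record in read_fastq(handle):
        if (read_name(record[0]) in wanted) == include:
            yield record


if __name__ == "__main__":
    # Made-up demo data; only the first header comes from the question.
    demo = io.StringIO(
        "@X01032:109:000000000-AGKF7:1:1101:11950:1779 1:N:0:1\nACGT\n+\nIIII\n"
        "@X01032:109:000000000-AGKF7:1:1101:12000:1800 1:N:0:1\nTTGA\n+\nIIII\n"
    )
    query = ["@X01032:109:000000000-AGKF7:1:1101:11950:1779"]
    for rec in filter_by_name(demo, query):
        print("\n".join(rec))
```

Because the names are held in a set, each read costs one hash lookup, so the file is scanned once no matter how many names are requested. The ":" and "-" characters need no escaping because names are compared as plain strings.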


    • #3
      Works - problem solved!

      Dear Brian,

      thanks for your suggestion!

      I downloaded bbmap and I tried filterbyname.sh

      Code:
      filterbyname.sh in=in.fq out=out.fq names=select.list include=t truncateheadersymbol
      
      Input is being processed as unpaired
      Time:               53.202 seconds.
      Reads Processed:    5747570 	108.03k reads/sec
      Bases Processed:    2296943848 	43.17m bases/sec
      Reads Out:          65246
      Bases Out:          25944173
      Number of reads for in.fq: 5,747,570
      Number of headers selected: 66,182
      Number of reads for out.fq: 65,246

      Works great and I really like the output summary!

      Question 1: Is there a way (setting) to get a list of the records that did not match?

      Question 2: bbmap seems to be a nice and very useful collection of tools - thanks a lot! - but is there an overview or a summary that describes the tools briefly?

      Thanks for the help!


      • #4
        Originally posted by loba17
        Question 2: bbmap seems to be a nice and very useful collection of tools - thanks a lot! - but is there an overview or a summary that describes the tools briefly?

        Thanks for the help!
        See this thread for a recap of many things BBMap can do: http://seqanswers.com/forums/showthread.php?t=58221

        I would suggest trying outu=filename with your command to see if that captures reads that did not match.


        • #5
          Originally posted by GenoMax
          I would suggest trying outu=filename with your command to see if that captures reads that did not match.
          You know, to be consistent, I should really add that (I'll make a note to do so)! Unfortunately, filterbyname does not currently support "outu". Instead, you need to run it twice: once with "include=t" to capture the matching reads, and once with "include=f" to capture the non-matching reads.
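
As an alternative to two runs, the split can be sketched in one pass with a short Python script. This is only an illustration, not part of BBMap: `partition_fastq` and `name_of` are hypothetical names, and the sketch assumes a FASTQ file with exactly four lines per record and names compared up to the first whitespace. It also returns the requested names that were never seen. In post #3, 66,182 headers were supplied but only 65,246 reads were written, so some names apparently matched nothing or were listed more than once.

```python
import io
from typing import Iterable, Set, TextIO


def name_of(header: str) -> str:
    """Read name: header without a leading '@', cut at the first whitespace."""
    if header.startswith("@"):
        header = header[1:]
    fields = header.split(None, 1)
    return fields[0] if fields else ""


def partition_fastq(
    source: TextIO, names: Iterable[str], out_match: TextIO, out_rest: TextIO
) -> Set[str]:
    """Write listed reads to out_match and all others to out_rest.

    Returns the set of requested names that never appeared in the input.
    """
    wanted = {name_of(n) for n in names if n.strip()}
    seen: Set[str] = set()
    while True:
        # Assumes four lines per record; lines keep their trailing newline.
        record = [source.readline() for _ in range(4)]
        if not record[0]:
            break  # end of file
        name = name_of(record[0])
        if name in wanted:
            out_match.writelines(record)
            seen.add(name)
        else:
            out_rest.writelines(record)
    return wanted - seen


if __name__ == "__main__":
    # Made-up demo reads; real use would pass open file handles instead.
    src = io.StringIO("@r1 1:N:0:1\nACGT\n+\nIIII\n@r2 1:N:0:1\nTTGA\n+\nIIII\n")
    hit, miss = io.StringIO(), io.StringIO()
    not_found = partition_fastq(src, ["r1", "r9"], hit, miss)
    print("names not found:", sorted(not_found))
```

Writing records straight to the two output handles keeps memory use proportional to the name list rather than to the read file.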


          • #6
            Thanks

            Dear Brian, thanks for the clarification and the help.


            • #7
              My Python script with a Galaxy interface is in the peterjc/pico_galaxy repository on GitHub (Galaxy tools and wrappers for sequence analysis).
