  • alexd106
    Junior Member
    • Dec 2011
    • 4

    remove suffix from fastq sequence ID

    Dear all,

    I have paired-end Illumina sequences in two large (20 GiB) FASTQ files, one containing the forward reads, the other the reverse reads. The sequence IDs in each file carry either a /1 or a /2 suffix. I would like to remove these suffixes from all reads (for some downstream analysis) and output two FASTQ files.

    i.e.

    change

    @HWI-ST182_0249:5:1101:1093:2017#GTATGACG/1
    NCAGCTGCAGGGAGTTAATTCACAGCAGTTGAGAGCCCTTGCTGTACCAACAAAGGGATGTGTGATCTCCCGGTCCCTCTGCCCCCTCCCCTCCCAGCCGC
    +HWI-ST182_0249:5:1101:1093:2017#GTATGACG/1
    BS\cacccegggehgghhhhh_ghhhhhhhhhhhhghhhhhhhhgghhhhhhhhhhhbghghghhhgeggedd`bb^bbbbbbaaaaaa_abaaabbaaaa

    to

    @HWI-ST182_0249:5:1101:1093:2017#GTATGACG
    NCAGCTGCAGGGAGTTAATTCACAGCAGTTGAGAGCCCTTGCTGTACCAACAAAGGGATGTGTGATCTCCCGGTCCCTCTGCCCCCTCCCCTCCCAGCCGC
    +HWI-ST182_0249:5:1101:1093:2017#GTATGACG
    BS\cacccegggehgghhhhh_ghhhhhhhhhhhhghhhhhhhhgghhhhhhhhhhhbghghghhhgeggedd`bb^bbbbbbaaaaaa_abaaabbaaaa

    I am new to bioinformatics and would appreciate a few pointers on the best way to get this done.
    Thanks a million
    Alex
  • rahularjun86
    Member
    • Jan 2011
    • 58

    #2
    Dear Alex,
    You can use a Perl script: read the file line by line, split each line that starts with @HWI or +HWI on "/", and print only the first part. Use an else branch to print the remaining sequence and quality lines unchanged.
    Alternatively, you can use Unix awk: set FS to "/" in the BEGIN block, then print $1 for every line that starts with the sequence ID prefix @HWI or +HWI.
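    The awk approach described above might look like the following minimal sketch. It assumes that the IDs contain no other "/" before the suffix and that every ID line starts with @HWI or +HWI. The function name and the file names are placeholders chosen for illustration.

```shell
# Strip everything from the first "/" onward on ID lines only.
# Reads FASTQ on stdin and writes FASTQ on stdout.
strip_suffix_awk() {
    awk 'BEGIN { FS = "/" }
         /^[@+]HWI/ { print $1; next }   # ID lines: keep the part before "/"
         { print }                        # sequence and quality lines: unchanged
    '
}

# Hypothetical file names:
# strip_suffix_awk < reads_1.fastq > reads_1.nosuffix.fastq
```

    Sequence and quality lines are printed whole, so a "/" inside a quality string is left alone.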
    Best wishes,
    Rahul
    Rahul Sharma,
    Ph.D
    Frankfurt am Main, Germany

    • ehlin
      Member
      • Jan 2012
      • 12

      #3
      Hi Alex, while perl scripting is a good option, if you are new to bioinformatics there might be easier options for you. For example, FASTX-Toolkit:

      • alexd106
        Junior Member
        • Dec 2011
        • 4

        #4
        Hi Rahul,

        Thank you very much for your suggestions. As I mentioned, I am new to bioinformatics and am just trying to teach myself some Perl (and have never used awk). Would you mind providing a little more detail on the Perl code you would use? No worries if not.

        Cheers
        Alex

        • kmcarr
          Senior Member
          • May 2008
          • 1181

          #5
          awk is good but sed might be faster and easier to learn.

          Code:
          sed -i.bak -e '/^[@+]HWI/ s/\/[12]$//' <yourFileName>
          This sed script will look for lines starting with @HWI or +HWI, strip off either a /1 or /2 from the ends of those lines and save the result to the same file name as the original. The original file will be saved as <yourFileName>.bak.
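          If you would rather leave the original file untouched, the same sed expression can write to a new file instead of editing in place. This is only a sketch: the function name and the file names are placeholders.

```shell
# Write a suffix-free copy of one FASTQ file, leaving the input untouched.
# $1 = input file, $2 = output file (both names are placeholders).
strip_suffix_sed() {
    sed -e '/^[@+]HWI/ s/\/[12]$//' "$1" > "$2"
}

# Example for a forward/reverse pair (hypothetical file names):
# strip_suffix_sed reads_1.fastq reads_1.nosuffix.fastq
# strip_suffix_sed reads_2.fastq reads_2.nosuffix.fastq
```

          Because the substitution only runs on lines matching the /^[@+]HWI/ address, a quality string that happens to end in /1 or /2 is not modified.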

          • alexd106
            Junior Member
            • Dec 2011
            • 4

            #6
            Thanks very much for the info.

            All the best
            Alex

            • rahularjun86
              Member
              • Jan 2011
              • 58

              #7
              Hi Alex,

              Following is the perl code:
              Code:
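              use strict;
              use warnings;

              my ($file_in, $file_out) = @ARGV;
              die "Usage: $0 Input_file.fq Out_file.fq\n" unless defined $file_out;

              open my $in,  '<', $file_in  or die "Cannot read $file_in: $!";
              open my $out, '>', $file_out or die "Cannot write $file_out: $!";

              my $num = 0;
              while (my $line = <$in>) {
                  chomp $line;
                  # Header (@HWI...) and separator (+HWI...) lines carry the read ID.
                  if ($line =~ /^[+\@]HWI/) {
                      $num++;
                      $line =~ s/\/[12]$//;   # strip only a trailing /1 or /2
                  }
                  # Sequence and quality lines pass through unchanged.
                  print {$out} "$line\n";
              }
              close $out or die "Cannot close $file_out: $!";

              # Each read has two ID lines (@ and +), so halve the count.
              printf "\nProcessed reads: %d\n", $num / 2;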
              Usage: perl program_name.pl Input_file.fq Out_file.fq
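              The read count printed by the script can be cross-checked on the output file with a short awk sketch like the one below. It assumes strict four-line FASTQ records (no wrapped sequences); the function name is a placeholder.

```shell
# Print the read count and the number of ID lines that still end in /1 or /2.
# $1 = FASTQ file to check (placeholder name).
check_suffixes() {
    awk 'NR % 4 == 1 { reads++ }                                # one @ line per record
         (NR % 4 == 1 || NR % 4 == 3) && /\/[12]$/ { left++ }   # ID lines with a suffix
         END { printf "reads=%d with_suffix=%d\n", reads, left }' "$1"
}

# Hypothetical usage:
# check_suffixes Out_file.fq
```

              A with_suffix value of 0 suggests that every ID line was stripped.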
              Last edited by rahularjun86; 03-13-2012, 07:04 AM.
              Rahul Sharma,
              Ph.D
              Frankfurt am Main, Germany

              • alexd106
                Junior Member
                • Dec 2011
                • 4

                #8
                Dear all, thanks for all the really useful suggestions. What a great community this is. I hope I can contribute sometime in the future when I have a little more experience.

                [ehlin] I thought of using FASTX-Toolkit but couldn't see the appropriate tool. I looked at

                $ fastx_renamer -h
                usage: fastx_renamer [-n TYPE] [-h] [-z] [-v] [-i INFILE] [-o OUTFILE]
                Part of FASTX Toolkit 0.0.10 by A. Gordon ([email protected])

                [-n TYPE] = rename type:
                SEQ - use the nucleotides sequence as the name.
                COUNT - use simply counter as the name.

                but it looks like the renaming is restricted to either a sequence or counter.

                The sed command seemed to do the trick, and I will look at the Perl solution in an attempt to educate myself.
                Cheers again
                Alex
