  • remove suffix from fastq sequence ID

    Dear all,

    I have paired-end Illumina sequences in two large (20 GiB) FASTQ files, one containing the forward reads, the other the reverse reads. Each file contains sequence IDs with either a /1 or /2 suffix. I would like to remove these suffixes from all reads (for some downstream analysis) and output two FASTQ files.

    i.e.

    change

    @HWI-ST182_0249:5:1101:1093:2017#GTATGACG/1
    NCAGCTGCAGGGAGTTAATTCACAGCAGTTGAGAGCCCTTGCTGTACCAACAAAGGGATGTGTGATCTCCCGGTCCCTCTGCCCCCTCCCCTCCCAGCCGC
    +HWI-ST182_0249:5:1101:1093:2017#GTATGACG/1
    BS\cacccegggehgghhhhh_ghhhhhhhhhhhhghhhhhhhhgghhhhhhhhhhhbghghghhhgeggedd`bb^bbbbbbaaaaaa_abaaabbaaaa

    to

    @HWI-ST182_0249:5:1101:1093:2017#GTATGACG
    NCAGCTGCAGGGAGTTAATTCACAGCAGTTGAGAGCCCTTGCTGTACCAACAAAGGGATGTGTGATCTCCCGGTCCCTCTGCCCCCTCCCCTCCCAGCCGC
    +HWI-ST182_0249:5:1101:1093:2017#GTATGACG
    BS\cacccegggehgghhhhh_ghhhhhhhhhhhhghhhhhhhhgghhhhhhhhhhhbghghghhhgeggedd`bb^bbbbbbaaaaaa_abaaabbaaaa

    I am new to bioinformatics and would appreciate a few pointers on the best way to get this done.
    Thanks a million
    Alex

  • #2
    Dear Alex,
    You can use a Perl script: read the file line by line, split each line that starts with @HWI or +HWI on "/", and print only the first part. Use an else branch to print the remaining sequence and quality lines unchanged.
    Or you can use Unix awk: set FS to "/" in the BEGIN block, then print $1 if the line is a sequence ID starting with @HWI or +HWI.
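The awk approach described above could look roughly like the following. This is a minimal sketch, not a tested pipeline: the function name strip_suffix_awk is made up for illustration, and it assumes that no sequence ID contains a "/" other than the one in the trailing /1 or /2, because everything after the first "/" on a header line is dropped.

```shell
# Hypothetical helper: print a FASTQ file with everything after the first "/"
# removed from header lines that start with @HWI or +HWI.
strip_suffix_awk() {
  awk 'BEGIN { FS = "/" }                 # split fields on "/"
       /^[@+]HWI/ { print $1; next }      # header line: keep only the first field
       { print }                          # sequence and quality lines: unchanged
      ' "$1"
}

# Example call (file names are placeholders):
# strip_suffix_awk forward_reads.fastq > forward_reads.nosuffix.fastq
```

Lines are matched by their prefix, so a quality line that happened to begin with @HWI or +HWI would also be treated as a header. That is very unlikely, but worth knowing.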
    Best wishes,
    Rahul
    Rahul Sharma,
    Ph.D
    Frankfurt am Main, Germany



    • #3
      In reply to alexd106's original post:
      Hi Alex, while Perl scripting is a good option, there might be easier options if you are new to bioinformatics. For example, FASTX-Toolkit:



      • #4
        Hi Rahul,

        Thank you very much for your suggestions. As I mentioned, I am new to bioinformatics and am just trying to teach myself some Perl (and have never used awk). Would you mind providing a little more detail on the Perl code you would use? No worries if not.

        Cheers
        Alex



        • #5
          awk is good but sed might be faster and easier to learn.

          Code:
          sed -i.bak -e '/^[@+]HWI/ s/\/[12]$//' <yourFileName>
          This sed script will look for lines starting with @HWI or +HWI, strip off either a /1 or /2 from the ends of those lines and save the result to the same file name as the original. The original file will be saved as <yourFileName>.bak.
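If you would rather not edit the file in place, the same sed expression can write its result to a new file and leave the original untouched. A minimal sketch: the function name strip_suffix_sed and the file names in the comments are placeholders, not part of any tool.

```shell
# Hypothetical helper: strip a trailing /1 or /2 from header lines starting with
# @HWI or +HWI, reading from $1 and writing a new file $2 (no in-place editing).
strip_suffix_sed() {
  in=$1
  out=$2
  sed -e '/^[@+]HWI/ s/\/[12]$//' "$in" > "$out"
}

# Example calls for a forward/reverse pair (placeholder file names):
# strip_suffix_sed reads_1.fastq reads_1.nosuffix.fastq
# strip_suffix_sed reads_2.fastq reads_2.nosuffix.fastq
```

Either way you end up with two copies of each 20 GiB file on disk (the .bak file or the new output), so check that you have the space first.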



          • #6
            Thanks very much for the info.

            All the best
            Alex



            • #7
              Hi Alex,

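              Here is the Perl code:
              Code:
                use strict;
                use warnings;

                die "Usage: perl $0 Input_file.fq Out_file.fq\n" unless @ARGV == 2;
                my ($file_in, $file_out) = @ARGV;

                # Lexical filehandles and three-argument open
                open my $in,  '<', $file_in  or die "Cannot read $file_in: $!";
                open my $out, '>', $file_out or die "Cannot write $file_out: $!";

                my $num = 0;    # number of header lines (@HWI or +HWI) seen
                while (my $line = <$in>) {
                    chomp $line;
                    if ($line =~ /^[+\@]HWI/) {
                        # Header line: remove only a trailing /1 or /2
                        $num++;
                        $line =~ s{/[12]$}{};
                    }
                    print {$out} "$line\n";
                }
                close $in;
                close $out or die "Cannot close $file_out: $!";

                # Each read has two header lines (@ and +), hence the division by 2
                my $pr = $num / 2;
                print "\nProcessed reads: $pr\n";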
              Usage: perl program_name.pl Input_file.fq Out_file.fq
              Last edited by rahularjun86; 03-13-2012, 07:04 AM.
              Rahul Sharma,
              Ph.D
              Frankfurt am Main, Germany
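Whichever approach you use, a quick sanity check on the output is to count the header lines that still end in /1 or /2. A minimal sketch: the function name count_suffixed_headers is made up, and it assumes a strict four-line-per-record FASTQ file, so the headers are the odd-numbered lines.

```shell
# Hypothetical helper: count header lines (lines 1 and 3 of each 4-line record,
# i.e. every odd-numbered line) that still end in /1 or /2.
count_suffixed_headers() {
  awk 'NR % 2 == 1 && /\/[12]$/ { n++ } END { print n + 0 }' "$1"
}

# Example call (placeholder file name); a fully cleaned file should give 0:
# count_suffixed_headers Out_file.fq
```

Only odd-numbered lines are checked, so a quality string that happens to end in /1 or /2 is not counted.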



              • #8
                Dear all, thanks for all the really useful suggestions. What a great community this is. I hope I can contribute sometime in the future when I have a little more experience.

                [ehlin] I thought of using FASTX-Toolkit but couldn't see the appropriate tool. I looked at

                $ fastx_renamer -h
                usage: fastx_renamer [-n TYPE] [-h] [-z] [-v] [-i INFILE] [-o OUTFILE]
                Part of FASTX Toolkit 0.0.10 by A. Gordon ([email protected])

                [-n TYPE] = rename type:
                SEQ - use the nucleotides sequence as the name.
                COUNT - use simply counter as the name.

                but it looks like the renaming is restricted to either a sequence or counter.

                The sed solution seemed to do the trick, and I will look at the Perl solution in an attempt to educate myself.
                Cheers again
                Alex
