Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Fastq and Perl

    Hello everybody!
    I got a script problem in Perl. I want to extract the sequences of my Fastq file (the second line of each block of four lines), regroups the identical sequences and count the number of times they are in the original file.
    So I made a little script Perl....................that doesn't work... Can anyone help me please?
    Here is my script:
    Code:
    $fastq="C:/Users/fastqtest.fastq";
    open(IN, "$fastq");
    $resultat="C:/Users/resultat.fastq";
    open(SEQ, ">$resultat");
    my %Hash;
        while (<IN>)
        {
            my $seq=<IN>;
            for (my $i = 0; $i <= IN.length(); $i++)
            {
                if ($i % 4 == 1)
                {
                    $Hash{$seq}++;
                }
            }
        }
        print SEQ $Hash;
    close (IN);
    close (SEQ);
    print "Script executed";
    <STDIN>

  • #2
    For any kind of work involving the processing of sequence files, I strongly recommend that you use BioPerl. One of the modules included in BioPerl is SeqIO, which will handle the input and output of FASTQ files.



    The rest is pretty straightforward, and just involves pattern matching of your sequences after you've imported them.

    Comment


    • #3
      Also, if you want feedback or help with any kind of script (custom or e.g. software developed by someone else), it is much better to provide an example of the error message(s) that you are encountering when you run it, plus details/an example of the input files you have and the output that you expect.

      Comment


      • #4
        there are a couple of things wrong with your script but first of all, it would always be a good idea if you seek help on a non-working script to say what exactly isn't working.

        Anyway, one thing that will definitely not work as you expected is
        Code:
        print SEQ $Hash;
        The scalar $Hash doesn't exist, instead, you have a hash called %Hash (BTW: my advice would be to give yuor variables meaningful names but that's just a style issue). You can not print out an entire hash like that.

        Instead, you have to iterate over the elements like so (just one option of several):

        Code:
        foreach my $seq ( keys %Hash){
          print "$seq\n";
        }
        You might also want to sort your sequences so they would always come out in the same order:

        Code:
        foreach my $seq ( sort keys %Hash){
          print "$seq\n";
        }
        Another thing to change is the loop to iterate over the input file, which I would have written like so:

        Code:
        my $i = 0;
        while (my $seq = <IN>){
          chomp $seq; # assuming you don't want to store the newlines
          if ( ++$i % 4 == 1){
            $Hash{$seq}++;
          }
        }
        (untested)

        It's true that you can do this with BioPerl and it's a lot safer that way because you then don't rely on the blocks of seq data to be exactly 4 lines long (which they don't have to be to be valid fastq) but it is also a lot slower to slurp a fastq file with the SeqIO module which can be a problem for large files, so I also use your quick and dirty approach for large files where I know that they will always have 4 lines per sequence.
        Last edited by tospo; 04-04-2013, 05:30 AM. Reason: just noticed an error in my code

        Comment


        • #5
          or use fastx-x toolkit :

          FASTQ/A Collapser
          Collapsing identical sequences in a FASTQ/A file into a single sequence (while maintaining reads counts)

          Comment


          • #6
            Hi,

            As toby said giving examples would be more helpful.

            When using perl you should always use

            Code:
            use warnings;
            use strict;
            at the beginning of your program in order to see errors and prevent uses of uninitialised variables.

            In your case I believe $Hash is not initialised, it is not a hash ref, and so the line will not be printed and if you had the warnings in place you would get an error here.

            Code:
            print SEQ $Hash;
            try

            Code:
            for each my $i (keys %Hash){
            [INDENT]print SEQ "$i => $Hash{$i}\n";[/INDENT]
            
            }

            Comment


            • #7
              First of all thank you for your help all of you. Now, it works on an extract of my file I just have to test it on the total file to see if it will really works on 8Go.
              Second: yes I know that an exemple of the errors is better than any explanation but as you saw, I didn't put the "use warnings" so I just had a blank file and didn't have any explanation why. Next time I won't forget it!
              By the way, I know there is Bioperl, but for very big files (like this fastq file) I read that a "homemade" script was better even if it's, as said tospo a "dirty approach", it's a little faster.

              Comment


              • #8
                Another approach:

                Convert the fastq to an unaligned .bam, so that you have one read per line.

                Then

                Code:
                samtools view unaligned.bam | cut -f 10 |sort | uniq -c |sort -nr > read_counts.txt

                Comment


                • #9
                  Thank you swbarnes, I noted your approach and will try it!

                  It's probably not the good place to ask it, but I don't wan't to create a new thread just for this question so here it is: what do you think of these three books:
                  * Perl Cookbook, 2nd Edition
                  By Tom Christiansen, Nathan Torkington
                  Publisher: O'Reilly Media

                  * Mastering Perl for Bioinformatics
                  By James Tisdall
                  Publisher: O'Reilly Media

                  * Beginning Perl for Bioinformatics
                  An Introduction to Perl for Biologists
                  By James Tisdall
                  Publisher: O'Reilly Media

                  Have a good evening (it's evening where I am )

                  Comment


                  • #10
                    Originally posted by Kawaccino View Post

                    Have a good evening (it's evening where I am )
                    The books are fine but while you are waiting on them how about starting here right now: http://korflab.ucdavis.edu/unix_and_Perl/

                    They also have a book available if you prefer that http://unixandperl.com/
                    Last edited by GenoMax; 03-25-2013, 09:41 AM.

                    Comment


                    • #11
                      The cookbook is useful, but you can always google for that kind of stuff.

                      The bioinformatics ones are not very helpful at all, in my opinion. Just jump to the lama or the camel ones. I don't think the bioinformatic-specific parts are helpful at all. I suspect they are all pretty out of date too.

                      Comment


                      • #12
                        Originally posted by Kawaccino View Post
                        It's probably not the good place to ask it, but I don't wan't to create a new thread just for this question so here it is: what do you think of these three books:
                        * Perl Cookbook, 2nd Edition
                        By Tom Christiansen, Nathan Torkington
                        Publisher: O'Reilly Media

                        * Mastering Perl for Bioinformatics
                        By James Tisdall
                        Publisher: O'Reilly Media

                        * Beginning Perl for Bioinformatics
                        An Introduction to Perl for Biologists
                        By James Tisdall
                        Publisher: O'Reilly Media
                        I like the "Beginning Perl for Bioinformatics" to get started (still recommend it to students) but it doesn't really cover BioPerl, so you have to bear in mind that parsing a Blast report file "manually" is a good exercise but not what you would normally do in the real world. The BioPerl HowTos are an excellent resource for getting started.

                        Comment


                        • #13
                          While I don't like re-inventing the wheel, I'm not a fan of BioPerl - a lot of overhead and dependencies to do simple tasks like reading in FASTQ files.

                          Comment


                          • #14
                            Originally posted by NGSfan View Post
                            While I don't like re-inventing the wheel, I'm not a fan of BioPerl - a lot of overhead and dependencies to do simple tasks like reading in FASTQ files.
                            Right, flame thrower in position...check
                            Seriously, I find that surprising. Sure, there are cases - like reading a fastq file IF you can 100% rely on it being organised into 4 lines per sequence - where I would also not use the BioPerl modules simply for performance reasons.

                            However, I would usually advise against this. even simple text file formats have their gotchas and you may see your code break in the most unexpected ways because you didn't know that the file format you use as input may legally have a line break somewhere or a character you didn't expect. Much better to use the existing wheel...

                            For most situations where you would reach for a Perl script, performance is not so crucial. Who cares if my one-off script runs overnight if I could put it together in 5 minutes using the extensive BioPerl library collection rather than spending a week doing it all from scratch?

                            Comment


                            • #15
                              Originally posted by swbarnes2 View Post
                              The bioinformatics ones are not very helpful at all, in my opinion. Just jump to the lama or the camel ones. I don't think the bioinformatic-specific parts are helpful at all. I suspect they are all pretty out of date too.
                              I agree completely, those books are more than a decade old and the modern best practices are not in agreement with a lot of the things taught in those bioinfo books. That being said, there are some good exercises and worked examples in the Tisdall books, but as a whole I think it is better to pick up a more recently published book if one is just starting.

                              Originally posted by NGSfan View Post
                              While I don't like re-inventing the wheel, I'm not a fan of BioPerl - a lot of overhead and dependencies to do simple tasks like reading in FASTQ files.
                              I agree on that last part, there are more efficient ways to process large sequence files (a pure-Perl approach being one). Though, I still use BioPerl everyday for working with alignments, phylogenetic trees, population data, etc., so it definitely has it's place. IMO, I think you'd be foolish to not use a Bio* package for those things because it would require a lot of effort to implement your own solution in some cases, and the performance gain would not be worth it. Maybe we'll see a more efficient FASTQ parser come out of the GSoC this year .

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              27 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              30 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              26 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X