Seqanswers Leaderboard Ad

**TobyH** · 03-25-2013, 03:02 AM

For any kind of work involving the processing of sequence files, I strongly recommend that you use BioPerl. One of the modules included in BioPerl is SeqIO, which will handle the input and output of FASTQ files.

Page not found · GitHub Pages

http://www.bioperl.org/wiki/Module:Bio::SeqIO

The rest is pretty straightforward, and just involves pattern matching of your sequences after you've imported them.

**TobyH** · 03-25-2013, 03:05 AM

Also, if you want feedback or help with any kind of script (custom or e.g. software developed by someone else), it is much better to provide an example of the error message(s) that you are encountering when you run it, plus details/an example of the input files you have and the output that you expect.

**tospo** · 03-25-2013, 03:18 AM

there are a couple of things wrong with your script but first of all, it would always be a good idea if you seek help on a non-working script to say what exactly isn't working.

Anyway, one thing that will definitely not work as you expected is

Code:

print SEQ $Hash;

The scalar $Hash doesn't exist, instead, you have a hash called %Hash (BTW: my advice would be to give yuor variables meaningful names but that's just a style issue). You can not print out an entire hash like that.

Instead, you have to iterate over the elements like so (just one option of several):

Code:

foreach my $seq ( keys %Hash){
  print "$seq\n";
}

You might also want to sort your sequences so they would always come out in the same order:

Code:

foreach my $seq ( sort keys %Hash){
  print "$seq\n";
}

Another thing to change is the loop to iterate over the input file, which I would have written like so:

Code:

my $i = 0;
while (my $seq = <IN>){
  chomp $seq; # assuming you don't want to store the newlines
  if ( ++$i % 4 == 1){
    $Hash{$seq}++;
  }
}

(untested)

It's true that you can do this with BioPerl and it's a lot safer that way because you then don't rely on the blocks of seq data to be exactly 4 lines long (which they don't have to be to be valid fastq) but it is also a lot slower to slurp a fastq file with the SeqIO module which can be a problem for large files, so I also use your quick and dirty approach for large files where I know that they will always have 4 lines per sequence.

**NicoBxl** · 03-25-2013, 03:18 AM

or use fastx-x toolkit :

FASTQ/A Collapser
Collapsing identical sequences in a FASTQ/A file into a single sequence (while maintaining reads counts)

FASTX-Toolkit

http://hannonlab.cshl.edu/fastx_toolkit/index.html

**davidblaney** · 03-25-2013, 03:24 AM

Hi,

As toby said giving examples would be more helpful.

When using perl you should always use

Code:

use warnings;
use strict;

at the beginning of your program in order to see errors and prevent uses of uninitialised variables.

In your case I believe $Hash is not initialised, it is not a hash ref, and so the line will not be printed and if you had the warnings in place you would get an error here.

Code:

print SEQ $Hash;

try

Code:

for each my $i (keys %Hash){
[INDENT]print SEQ "$i => $Hash{$i}\n";[/INDENT]

}

**Kawaccino** · 03-25-2013, 05:56 AM

First of all thank you for your help all of you. Now, it works on an extract of my file I just have to test it on the total file to see if it will really works on 8Go.
Second: yes I know that an exemple of the errors is better than any explanation but as you saw, I didn't put the "use warnings" so I just had a blank file and didn't have any explanation why. Next time I won't forget it!
By the way, I know there is Bioperl, but for very big files (like this fastq file) I read that a "homemade" script was better even if it's, as said tospo a "dirty approach", it's a little faster.

**swbarnes2** · 03-25-2013, 08:18 AM

Another approach:

Convert the fastq to an unaligned .bam, so that you have one read per line.

Then

Code:

samtools view unaligned.bam | cut -f 10 |sort | uniq -c |sort -nr > read_counts.txt

**Kawaccino** · 03-25-2013, 09:19 AM

Thank you swbarnes, I noted your approach and will try it!

It's probably not the good place to ask it, but I don't wan't to create a new thread just for this question so here it is: what do you think of these three books:
* Perl Cookbook, 2nd Edition
By Tom Christiansen, Nathan Torkington
Publisher: O'Reilly Media

* Mastering Perl for Bioinformatics
By James Tisdall
Publisher: O'Reilly Media

* Beginning Perl for Bioinformatics
An Introduction to Perl for Biologists
By James Tisdall
Publisher: O'Reilly Media

Have a good evening (it's evening where I am

)

**GenoMax** · 03-25-2013, 09:36 AM

Originally posted by Kawaccino View Post

Have a good evening (it's evening where I am

)

The books are fine but while you are waiting on them how about starting here right now: http://korflab.ucdavis.edu/unix_and_Perl/

They also have a book available if you prefer that http://unixandperl.com/

**swbarnes2** · 03-25-2013, 09:46 AM

The cookbook is useful, but you can always google for that kind of stuff.

The bioinformatics ones are not very helpful at all, in my opinion. Just jump to the lama or the camel ones. I don't think the bioinformatic-specific parts are helpful at all. I suspect they are all pretty out of date too.

**tospo** · 03-26-2013, 06:02 AM

Originally posted by Kawaccino View Post

It's probably not the good place to ask it, but I don't wan't to create a new thread just for this question so here it is: what do you think of these three books:
* Perl Cookbook, 2nd Edition
By Tom Christiansen, Nathan Torkington
Publisher: O'Reilly Media

* Mastering Perl for Bioinformatics
By James Tisdall
Publisher: O'Reilly Media

* Beginning Perl for Bioinformatics
An Introduction to Perl for Biologists
By James Tisdall
Publisher: O'Reilly Media

I like the "Beginning Perl for Bioinformatics" to get started (still recommend it to students) but it doesn't really cover BioPerl, so you have to bear in mind that parsing a Blast report file "manually" is a good exercise but not what you would normally do in the real world. The BioPerl HowTos are an excellent resource for getting started.

**NGSfan** · 03-26-2013, 06:19 AM

While I don't like re-inventing the wheel, I'm not a fan of BioPerl - a lot of overhead and dependencies to do simple tasks like reading in FASTQ files.

**tospo** · 03-26-2013, 06:31 AM

Originally posted by NGSfan View Post

While I don't like re-inventing the wheel, I'm not a fan of BioPerl - a lot of overhead and dependencies to do simple tasks like reading in FASTQ files.

Right, flame thrower in position...check

Seriously, I find that surprising. Sure, there are cases - like reading a fastq file IF you can 100% rely on it being organised into 4 lines per sequence - where I would also not use the BioPerl modules simply for performance reasons.

However, I would usually advise against this. even simple text file formats have their gotchas and you may see your code break in the most unexpected ways because you didn't know that the file format you use as input may legally have a line break somewhere or a character you didn't expect. Much better to use the existing wheel...

For most situations where you would reach for a Perl script, performance is not so crucial. Who cares if my one-off script runs overnight if I could put it together in 5 minutes using the extensive BioPerl library collection rather than spending a week doing it all from scratch?

**SES** · 03-27-2013, 06:39 AM

Originally posted by swbarnes2 View Post

The bioinformatics ones are not very helpful at all, in my opinion. Just jump to the lama or the camel ones. I don't think the bioinformatic-specific parts are helpful at all. I suspect they are all pretty out of date too.

I agree completely, those books are more than a decade old and the modern best practices are not in agreement with a lot of the things taught in those bioinfo books. That being said, there are some good exercises and worked examples in the Tisdall books, but as a whole I think it is better to pick up a more recently published book if one is just starting.

Originally posted by NGSfan View Post

While I don't like re-inventing the wheel, I'm not a fan of BioPerl - a lot of overhead and dependencies to do simple tasks like reading in FASTQ files.

I agree on that last part, there are more efficient ways to process large sequence files (a pure-Perl approach being one). Though, I still use BioPerl everyday for working with alignments, phylogenetic trees, population data, etc., so it definitely has it's place. IMO, I think you'd be foolish to not use a Bio* package for those things because it would require a lot of effort to implement your own solution in some cases, and the performance gain would not be worth it. Maybe we'll see a more efficient FASTQ parser come out of the GSoC this year

.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 27 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 30 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 26 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Fastq and Perl

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News