Seqanswers Leaderboard Ad

**simonandrews** · 09-27-2010, 12:20 AM

Originally posted by edge View Post

Hi simon,
Based on your understanding and experience regarding fastq format read, what is the best way I should do when receiving the fastq read from ILLUMINA?

I think the posts above have covered what I'd do, one way or another.

I'd run the sequences through FastQC. This will generate a per-base quality plot and a per-sequence quality plot. The per base plot will tell you if there is a systematic loss of quality as your run progresses. If there is you might want to consider trimming a bit off the length of all of your sequences.

You can also see if there is a sub-population of sequences which have universally poor quality within a generally good run. This is much more unusual in my experience, but if it is a problem in your data you could filter these out as well.

I have to say that in most cases we don't filter our data unless the qualities are really poor. Many downstream analyses can take the assigned quality values into account so they don't need to have it all removed, but this may be different in your case depending on what you're intending doing with the data.

**edge** · 09-27-2010, 12:30 AM

Thanks a lot and again for your explanation in detail, simon

I'm trying out my data regarding your suggested pipeline now.
Will ask your advice or comment if I'm facing any error or misunderstanding when dealing for the data read later.
Thanks again, simon ^^

**Seq_in_Perl** · 09-30-2010, 09:45 AM

Include Std dev?

Excellent post! Two questions:

1) Is there a way to include standard dev of the quality scores along with the average using the one-liner?

2) I agree with Simon in that the quality of a read degrades as the length of the read increases. Is there a way (using the same one-liner) one can limit to say 30 base pairs instead of the entire line of a seq read and get the ave and std dev of the quality scores?

Thanks, and your help is immense!

Originally posted by Brugger View Post

Drio: Your kung-fu is faulty as the score is not reset after each entry. Actually so is mine above, but I never tested it on more than one entry... I am sure that there is something to be said about doing oneliners later on Fridays.

More important this makes the analysis above wrong. If you had looked at the avg scores in greater detail you would have caught it as the quality value should not exceed 40.

My initial oneliner should have been:

Code:

cat fastq_file | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} 
split(""); print "  avg score: " .($score/length($_))."\n";$code=0;'

And to do the filtration how ever one solution would be:

Code:

cat fastq_file | perl -ne '$t=$_ . <> . <>; $_ = <>; chomp; map{ $score += ord($_)-64} split(""); print "$t$_\n" if ($score/length($_)) > 36; $score= 0'

Another bug is that the \n for the quality line would be included in the calculations and there by making them wrong. However, as the same logic is used for all entries this makes the avg. values comparable.

**simonandrews** · 09-30-2010, 11:10 AM

Originally posted by Seq_in_Perl View Post

Is there a way to include standard dev of the quality scores along with the average using the one-liner?

One other thing to mention here - quality scores are nowhere near normally distributed. They are heavily skewed towards high values, with a fairly rapid descent to values close to zero (especially for Illumina data where they use a score of 2 as a sort of flag rather than a true probability measure). This means that using mean and stdev tend to give a fairly poor reflection of the range of values you have. We tend to prefer looking at percentiles through the distribution to get a clearer picture of what's going on (we often find that the 25th, 50th, 75th and 90th percentiles can all have the same value, but that the 10th can vary wildly).

**Brugger** · 10-01-2010, 02:30 AM

Originally posted by Seq_in_Perl View Post

Excellent post! Two questions:

1) Is there a way to include standard dev of the quality scores along with the average using the one-liner?

2) I agree with Simon in that the quality of a read degrades as the length of the read increases. Is there a way (using the same one-liner) one can limit to say 30 base pairs instead of the entire line of a seq read and get the ave and std dev of the quality scores?

Thanks, and your help is immense!

Code:

cat fastq_file | perl -ne 'chomp;$_ = substr($_,0,30); print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $_=ord($_)-64; $s+=$_;$s2=$_*$_;} split(""); $l=length($_); print "  avg score: " .($score/$l)." stddev".(sqrt(($l*$s2-$s*$s)/($l*($l-1)))) ."\n";$s=$s2=$l=0;'

I believe (not really tested). But now things are getting silly. A proper script would increase readability greatly...

And I agree with simon, this does not really make any sense. We normally run bwa with aln with a '-q 15' flag to remove low quality ends of if you do assemblies, velvet comes with a suite of scripts for clipping off the low quality ends.

**sterding** · 12-14-2011, 02:03 PM

An IMPORTANT bug

Thanks for the code snip -- very nice!

I just found a bug in the code -- without chomp the newline character, the score will be wrong. Here is the new code:

cat fastq_file | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>; chomp; $score=0; map{ $score += ord($_)-33}
split(""); print " avg score: " .($score/length($_))."\n";'

I tested this by following example:
@HWI-ST570:30

0EKUABXX:3:2208:7666:82627 2:N:0:
AGGTGGGGGGGGGTGGGGGTGTGGGGTGGGGTGGGTGGGTGTTGGGGGGATGGGGGTGTGAGGTGGGGGGGGGGGGGGGGGGGGGGGTTATGGTGT
+
################################################################################################

I also change the above code to Phred+33, which works for Illumina 1.8+ and Sanger. For Illumina 1.5+, use ord($_)-64

**strawbaubz** · 10-07-2015, 01:24 AM

hi

I tried your code for getting quality averages, howvere this is what it returned to me:

@M01232:82:000000000-AHHB7:1:2119:22801:25267 1:N:0:33 avg score: 635857.304635762
@M01232:82:000000000-AHHB7:1:2119:10835:25269 1:N:0:33 avg score: 635856.986754967
@M01232:82:000000000-AHHB7:1:2119:17952:25279 1:N:0:33 avg score: 635852.953642384

I think I have an idea of how they got to this score but what am I doing wrong or how can I fix this?
I copied the code exactley and added my fastq file.

Also once/if this can be fixed is it possible to run multiple separate fastq files at a time instead of having to do one at a time. All my files contains 2 mill reads per file, it takes incredibly long.

Any help?

Auberi

**strawbaubz** · 10-07-2015, 01:36 AM

Din't see the post prior to mine and that seems to have fixed my bug as I worked with ILLUMINA 1.8+.
Sorry about that.

Thanks

**bi_maniac** · 10-27-2015, 06:02 AM

Sorry. It is a little off topic but ...

I am trying to merge paired ends fastq files.

Currently I have only been able to perform this task with BBmerge, mergePairs.py and a custom software developed by myself.

I consistently get low percentages of merging: about 20% when limiting quality of overlap above 90%.

Is this normal? I assumed that merging would be much above 50%.

Thanks.

**simonandrews** · 10-27-2015, 06:06 AM

Originally posted by bi_maniac View Post

Sorry. It is a little off topic but ...

I am trying to merge paired ends fastq files.

Currently I have only been able to perform this task with BBmerge, mergePairs.py and a custom software developed by myself.

I consistently get low percentages of merging: about 20% when limiting quality of overlap above 90%.

Is this normal? I assumed that merging would be much above 50%.

Thanks.

It depends on the insert size you have in your library. If a lot of your inserts are long enough that the overlap between your reads will be short or non-existent then you won't be able to merge them. If the reads do overlap, but the quality is poor then you might have trouble uniquely identifying the correct place to join the reads.

**bi_maniac** · 10-27-2015, 09:04 AM

Thanks. If I understand correctly, given a library and a particular gene to be studied, insert size should be known. OK?

**simonandrews** · 10-27-2015, 09:05 AM

Originally posted by bi_maniac View Post

Thanks. If I understand correctly, given a library and a particular gene to be studied, insert size should be known. OK?

Whoever made the library should have a rough idea of the size selection they used when creating the library from which you can work out the insert size range. It will be an approximate range though, not a fixed value.

**bi_maniac** · 10-27-2015, 09:23 AM

Thanks a lot. You are being very helpful

**GenoMax** · 10-27-2015, 09:29 AM

@bi_maniac: While you wait to hear back from people who made the libraries, you can estimate the insert size based on data at hand. See Brian's suggestion here: http://seqanswers.com/forums/showpos...13&postcount=2

**Brian Bushnell** · 10-27-2015, 09:38 AM

It would help if you could post the insert-size histogram from merging, by the way. If it is monotonically increasing and then suddenly drops to zero just before 2*(read length), then they mostly don't overlap. If there is a prominent bell-shaped peak well below 2*(read length), then the problem is likely quality. But at a 20% merge rate I assume they mostly don't overlap.

Topics	Statistics	Last Post
ASHG 2024 Highlights – Part Two by seqadmin Started by seqadmin, 11-08-2024, 11:09 AM	0 responses 35 views 0 likes	Last Post by seqadmin 11-08-2024, 11:09 AM
ASHG 2024 Highlights – Part One by seqadmin Started by seqadmin, 11-08-2024, 06:13 AM	0 responses 28 views 0 likes	Last Post by seqadmin 11-08-2024, 06:13 AM
Seq-Scope Expands Possibilities for High-Resolution Gene Expression Analysis by seqadmin Started by seqadmin, 11-01-2024, 06:09 AM	0 responses 32 views 0 likes	Last Post by seqadmin 11-01-2024, 06:09 AM
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, 10-30-2024, 05:31 AM	0 responses 23 views 0 likes	Last Post by seqadmin 10-30-2024, 05:31 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News