**maubp** · 01-18-2010, 05:48 AM

Try using [ code ] text [ /code ] to stop the forum software showing the FASTQ read funny (with extra spaces).

It looks like something went wrong with the sequencing at about 20bp. I would ask your sequencing center to check this run.

**kmcarr** · 01-18-2010, 06:34 AM

Wei,

Have you really checked ALL of your sequences, or did you just look at the first few hundred at the beginning of the FASTQ file? The Illumina GA sorts the reads by tile and then by x-coordinate. This means that if you look at the beginning of an Illumina-generated FASTQ file you are seeing the reads from the very left edge of tile number 1. Reads at the edge of any tile are generally of very poor quality; I typically see hundreds of useless reads, with many, many N's, at the beginning of a FASTQ file.

Ask the facility which generated the sequence to provide you with data about overall run quality.

**Wei-HD** · 01-18-2010, 06:40 AM

Thanks Kmcarr! I only checked the first several FASTQ sequences. Because the file is really big, I could not check the whole file quickly. Could you kindly tell me how I could check all the sequences? The file is about 3 GB of text.

Thanks in advance!

**strob** · 01-18-2010, 06:47 AM

FASTX-Toolkit

http://hannonlab.cshl.edu/fastx_toolkit/

Here you can find some useful tools.
You can also use them in Galaxy.

**MattB** · 01-18-2010, 06:52 AM

Hi Wei,

Have a look at the 'FASTX Statistics' and 'FASTQ Quality Chart' programs that are part of the FASTX toolkit (http://hannonlab.cshl.edu/fastx_toolkit/index.html). They provide a nice way to evaluate quality of your reads across the whole file.

Matt

**MattB** · 01-18-2010, 06:53 AM

Looks like strob just beat me to it

**maubp** · 01-18-2010, 07:06 AM

Originally posted by Wei-HD

Because the file is really big, I could not check the whole file quickly. Could you kindly tell me how I could check all the sequences?

What is your favourite scripting language? Being able to answer this kind of question, or variations on it, yourself (rather than being limited to what a toolkit may provide) can be very helpful.

Here is a fairly general script in Python using Biopython which could be adapted for counting the N's in any supported sequence file format.

Code:

from collections import defaultdict

from Bio import SeqIO

# Maps "number of N in a read" -> "number of reads with that many N"
tally = defaultdict(int)

# Assumes all N are in upper case
for record in SeqIO.parse("example.fastq", "fastq"):
    tally[record.seq.count("N")] += 1

if not tally:
    # An empty tally means the file contained no records at all
    print("Did not find any reads")
else:
    print("N count, occurrences")
    # +1 so that the largest N count is included in the output
    for n_count in range(0, max(tally.keys()) + 1):
        print(n_count, tally[n_count])

(If you want a less general FASTQ only version, this can be made a lot faster)
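As a rough idea of what a FASTQ-only version could look like, here is a minimal sketch using only the standard library. It assumes the simple four-lines-per-record layout (no line wrapping inside the sequence or quality strings) and upper-case N; the function name `tally_n_counts` is made up for illustration and is not part of Biopython or any toolkit.

```python
from collections import Counter
from itertools import islice


def tally_n_counts(path):
    """Return a Counter mapping N-count per read -> number of reads.

    Assumes four lines per record (title, sequence, '+', quality),
    i.e. no line wrapping, and that all N are in upper case.
    """
    tally = Counter()
    with open(path) as handle:
        # The sequence is the 2nd line of each record: line indices 1, 5, 9, ...
        for seq_line in islice(handle, 1, None, 4):
            tally[seq_line.count("N")] += 1
    return tally


def print_tally(tally):
    """Print the distribution in the same layout as the script above."""
    if not tally:
        print("Did not find any reads")
        return
    print("N count, occurrences")
    for n_count in range(max(tally) + 1):
        print(n_count, tally[n_count])
```

You would call it as `print_tally(tally_n_counts("example.fastq"))`. Because it never builds record objects or decodes quality scores, it should cope better with a multi-gigabyte file, at the cost of breaking on line-wrapped FASTQ.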

**Wei-HD** · 01-18-2010, 07:23 AM

Hi all,

Thanks so much for your help!

My work is mostly wet-lab experiments, so I am not familiar with writing scripts. I do not know how to use the FASTX toolkit.

Maubp, thanks for your script, but can you tell me what you mean by counting the N's? Would the number of N's in each short sequence tell me the quality of my data?

Thanks

**MattB** · 01-18-2010, 07:29 AM

Wei, have a look at Galaxy (http://main.g2.bx.psu.edu/)

It has a very user-friendly interface for the FASTX tools.

**maubp** · 01-18-2010, 07:57 AM

Originally posted by Wei-HD

Maubp, thanks for your script, but can you tell me what you mean by counting the N's? Would the number of N's in each short sequence tell me the quality of my data?

Your example record was odd - it had 45 "N" characters in the 72bp sequence (the rest was 6 A, 3 C, 11 T and 7 G). In a good run I would expect very few "N" characters (hopefully none). Thus looking at the distribution in the number of "N" characters per read seemed a reasonable way to evaluate your data. This would help you answer the question "Does this affect all my reads or just some?".
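To turn such a tally into a direct answer to that question, something along these lines could work. This is only a sketch: the helper name `fraction_with_at_least` is hypothetical, and the numbers in the example tally are invented purely to show the call, not taken from any real run.

```python
def fraction_with_at_least(tally, threshold):
    """Fraction of reads containing at least `threshold` N characters.

    `tally` maps N-count per read -> number of reads with that count.
    """
    total = sum(tally.values())
    if total == 0:
        return 0.0
    affected = sum(reads for n_count, reads in tally.items()
                   if n_count >= threshold)
    return affected / total


# Invented tally for illustration only: 900 reads with no N,
# 60 reads with one N, and 40 reads with 45 N's.
example_tally = {0: 900, 1: 60, 45: 40}

# Share of reads with any N at all
print(fraction_with_at_least(example_tally, 1))
# Share of reads that are at least half N (36 of 72 bp)
print(fraction_with_at_least(example_tally, 36))
```

If only a small fraction of reads is badly affected, the odd record is probably an edge-of-tile artefact rather than a failed run.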

RNA Seq 72 bp data