Old 10-25-2012, 08:44 AM   #3
Jan_R
Junior Member
 
Location: Seattle

Join Date: Jul 2011
Posts: 8

Here are my 2 cents:

1.
... or in the dataset. If you find the exact same sequence very often in your raw data while checking its quality (e.g. with FastQC), it is most probably an artifact such as a sequencing primer or adaptor. But it can also be rRNA. BLAST such sequences to find out. If you do not want them in your data, find a way to remove them.
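If you want a quick look at this without FastQC, here is a rough Python sketch that counts identical read sequences. It assumes plain four-line FASTQ records (no wrapped sequence lines); the function names and the 1% cutoff are made up for illustration and are not what FastQC does internally.

```python
from collections import Counter
from itertools import islice


def count_sequences(fastq_lines):
    """Count how often each read sequence occurs.

    Assumes every record is exactly four lines, so the sequence
    is the 2nd line of each group of four.
    """
    counts = Counter()
    for seq in islice(fastq_lines, 1, None, 4):
        counts[seq.strip()] += 1
    return counts


def overrepresented(fastq_lines, min_fraction=0.01):
    """Return (sequence, count) pairs that make up at least
    min_fraction of all reads, most frequent first.

    The 0.01 default is an arbitrary illustrative cutoff.
    """
    counts = count_sequences(fastq_lines)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(seq, n) for seq, n in counts.most_common()
            if n / total >= min_fraction]


if __name__ == "__main__":
    # Toy data; for a real file pass an open file handle instead,
    # e.g. open("reads.fastq") (hypothetical file name).
    toy = ["@r1", "ACGT", "+", "IIII",
           "@r2", "ACGT", "+", "IIII",
           "@r3", "TTTT", "+", "IIII"]
    for seq, n in overrepresented(toy, min_fraction=0.5):
        print(seq, n)
```

Whatever comes out on top is what you would then BLAST. Note that this keeps every distinct sequence in memory, so for a big run you may want to look at only the first few hundred thousand reads.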

2.
... Illumina files usually come in the FASTQ format: http://en.wikipedia.org/wiki/FASTQ_format
The size of the files would be some hundred MB.
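To make the format a bit more concrete, here is a small Python sketch of how a FASTQ record is laid out: a header line starting with "@", the sequence, a separator line starting with "+", and one quality character per base. It assumes four-line records and a Phred+33 quality encoding; older Illumina pipelines used Phred+64, so check which one your files have. The function names are my own, not from any library.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality_string) from FASTQ lines.

    Assumes exactly four lines per record. An incomplete record
    at the end of the input is silently dropped.
    """
    stripped = (line.rstrip("\r\n") for line in lines)
    # zip over the same iterator four times groups lines in fours.
    for header, seq, plus, qual in zip(*[stripped] * 4):
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line FASTQ record: %r" % header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch: %r" % header)
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Convert a quality string to integer Phred scores.

    offset=33 is an assumption (Sanger / Illumina 1.8+);
    use offset=64 for older Illumina encodings.
    """
    return [ord(c) - offset for c in qual]


if __name__ == "__main__":
    record = ["@r1\n", "ACGT\n", "+\n", "II5!\n"]
    for read_id, seq, qual in parse_fastq(record):
        print(read_id, seq, phred_scores(qual))
```

For real work you would normally use an existing parser rather than rolling your own, but this shows what the tools are reading.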

3.
... I also highly recommend the FASTX toolkit. Combining these tools is perfect for getting your data in shape.