MiSeq fastq output: 250-251 bp reads


dsobral
12-10-2013, 04:09 AM
Hello,

I'm getting reads from a MiSeq machine and noticing that many are 251 bp instead of 250 bp, and that the last base has a highly skewed base composition. From what I read in the manual, the sequencer always runs read_length + 1 imaging cycles, but only read_length of them are analyzed (for phasing etc.). Shouldn't the final fastq have all reads at the same length, and only 250 bp? Is it OK to cut off the final base, or is it a sign that something is wrong?

Also, a side note: some reads are shorter than 250 bp, and I assume that is because of adapter trimming. If this is the case, then both read1 and read2 of a pair should be trimmed to the same length; otherwise the pair shouldn't be trusted, no?

Thanks,
Daniel

mcnelson.phd
12-10-2013, 04:29 AM
Hi Daniel,

You're correct in that the actual read length of an Illumina run is always N+1 because the extra base is used for phasing/pre-phasing analysis. Ideally that last base should be trimmed off because it's not properly quality checked.

As for reads < 250bp, if you're using a Nextera kit and had Trim Adapters checked in the sample sheet, then you're correct about why you have shorter reads. If you're seeing that the two reads of a pair aren't the same length, then you're also probably seeing that read 1 is shorter than read 2. This would be an issue with the trimming where the base quality of read 2 dropped low enough that the adapter sequence wasn't properly called and thus couldn't be recognized to be trimmed. Some third-party apps can do a much better job of trimming so you may want to try those.
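
For anyone who wants to drop the extra base by hand, here is a minimal Python sketch. It assumes plain 4-line FASTQ records (no wrapped sequence lines) and a target length of 250; the function name `trim_fastq` and the `max_len` parameter are made up for illustration and are not part of any Illumina tool. Reads already at or below the target length pass through unchanged.

```python
import io


def trim_fastq(in_handle, out_handle, max_len=250):
    """Copy 4-line FASTQ records, truncating sequence and quality to max_len.

    Returns the number of reads that were actually shortened.
    Assumption: each record is exactly four lines (header, seq, '+', qual).
    """
    shortened = 0
    while True:
        header = in_handle.readline()
        if not header:
            break  # end of input
        seq = in_handle.readline().rstrip("\r\n")
        in_handle.readline()  # separator line starting with '+', rewritten below
        qual = in_handle.readline().rstrip("\r\n")
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch in " + header.strip())
        if len(seq) > max_len:
            shortened += 1
        out_handle.write(header.rstrip("\r\n") + "\n")
        out_handle.write(seq[:max_len] + "\n")
        out_handle.write("+\n")
        out_handle.write(qual[:max_len] + "\n")
    return shortened


# Tiny in-memory demo: one 251 bp read and one 249 bp read.
demo_in = io.StringIO(
    "@read1\n" + "A" * 251 + "\n+\n" + "I" * 251 + "\n"
    "@read2\n" + "C" * 249 + "\n+\n" + "I" * 249 + "\n"
)
demo_out = io.StringIO()
n_shortened = trim_fastq(demo_in, demo_out, max_len=250)
```

For real files the same function can be called with handles from `gzip.open(path, "rt")` and `gzip.open(path, "wt")`. Dedicated trimming tools also handle adapters and quality, which this sketch does not attempt.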

dsobral
12-10-2013, 04:33 AM
Thanks for the reply. Very useful information.

The only thing I'm still puzzled about is why some reads are 250 bp and others are 251 bp.

Daniel

GenoMax
12-10-2013, 04:40 AM
Some facilities set up a run as (n+1) cycles, where n is the number of bases you asked to be sequenced.

If you did not set this run up yourself then it is possible that the original run was set up as 250 x 251 bp (if one read is consistently 250 or less and other is 251 bp or less depending on trimming).

dsobral
12-10-2013, 04:54 AM
I would understand if there was some obvious consistency.
What I observe is that for the same run, read1 OR read2 can be either 250 or 251bp (and sometimes 249bp!) with no apparently consistent pattern. I'm suspicious that the behaviour is coming from adapter trimming.

Counts | Read1 | Read2
4223 | 250 | 248
7940 | 250 | 249
58517 | 250 | 250
130842 | 250 | 251
10571 | 251 | 248
21321 | 251 | 249
145959 | 251 | 250
331396 | 251 | 251
...
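
A table like the one above can be produced with a short Python sketch along these lines. It assumes both files are plain 4-line FASTQ with reads in the same order; the helper names `read_lengths` and `pair_length_counts` are hypothetical, and the demo records are toy data, not the reads from this run.

```python
import io
from collections import Counter


def read_lengths(handle):
    """Yield the sequence length of each record in a 4-line FASTQ stream."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # second line of every record is the sequence
            yield len(line.rstrip("\r\n"))


def pair_length_counts(r1_handle, r2_handle):
    """Count how often each (read1 length, read2 length) combination occurs.

    Note: zip() stops at the shorter file, so unequal read counts go unnoticed.
    """
    return Counter(zip(read_lengths(r1_handle), read_lengths(r2_handle)))


# Toy in-memory demo with two read pairs.
r1 = io.StringIO("@a/1\nACGT\n+\nIIII\n@b/1\nACGTA\n+\nIIIII\n")
r2 = io.StringIO("@a/2\nACG\n+\nIII\n@b/2\nACGTA\n+\nIIIII\n")
counts = pair_length_counts(r1, r2)

print("Counts | Read1 | Read2")
for (len1, len2), n in sorted(counts.items()):
    print(n, len1, len2, sep=" | ")
```

For gzipped MiSeq output, handles from `gzip.open(path, "rt")` can be passed in place of the `io.StringIO` objects.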

GenoMax
12-10-2013, 04:58 AM
Adapter trimming can't be the cause of it (unless this was set up as a longer run originally than 250 bp).

Did you run this yourself? If not, you should ask the facility that ran it how the original run was set up.

dsobral
12-10-2013, 05:11 AM
I didn't run it myself, but it was using Nextera V2 250x250.
Adapter trimming was on (I guess by default).

Thanks,
Daniel

dsobral
12-10-2013, 05:13 AM
PS: although the data has these peculiarities, I used it for de novo assembly of a bacterium, and it gave good results...

I only noticed because when I tried Edena on the full data, it complained about the read sizes...

I was just wondering what to make of it.

Thanks