SEQanswers

Old 07-03-2016, 03:56 AM   #1
apredeus
Senior Member
 
Location: Bioinformatics Institute, SPb

Join Date: Jul 2012
Posts: 149
MiSeq producing various length reads

Hello all

I'm processing a micro-RNA-seq experiment for a collaborator of ours, and I'm seeing something very unusual. They sequenced three samples on a MiSeq, with an expected read length of 51. Instead, I see lots of reads that are all N (NNNNNNNNNNNNNN) with a length of 20-21, and quite a few reads of intermediate length too.

Do you have any idea why this might have happened?
Old 07-03-2016, 04:33 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,766

Are you saying that the reads contain actual N's, or just that they are shorter than 51 bp?

If there are N's then that may indicate a failure of basecalling. It could be due to overloading. Generally sequencing facilities will not release this kind of data.

If that is a result of some sort of post-run data processing (where they replaced the adapter sequences with N's for example, don't know if BaseSpace does something like that) then you would need to ask. If you ignore/strip the N's is the rest of the data good quality?
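
One way to answer both questions is to tally read lengths and count the all-N reads directly from the FASTQ. The following is only a minimal sketch, not something posted in this thread: it assumes plain 4-line FASTQ records (no wrapped sequence lines), and the function name `summarize_fastq` and the tiny example records are made up for illustration.

```python
from collections import Counter
import io


def summarize_fastq(handle):
    """Tally read lengths and count reads that consist only of N.

    Assumes strict 4-line FASTQ records (header, sequence, '+', qualities).
    """
    lengths = Counter()
    all_n = 0
    for i, line in enumerate(handle):
        if i % 4 != 1:  # the sequence is the 2nd line of each record
            continue
        seq = line.strip()
        lengths[len(seq)] += 1
        if seq and set(seq.upper()) == {"N"}:
            all_n += 1
    return lengths, all_n


# Tiny made-up example; for real data pass an open file handle instead.
example = io.StringIO(
    "@r1\nNNNNN\n+\n#####\n"
    "@r2\nACGTACGT\n+\nIIIIIIII\n"
    "@r3\nACGNN\n+\nIII##\n"
)
lengths, all_n = summarize_fastq(example)
print(sorted(lengths.items()), all_n)  # prints [(5, 2), (8, 1)] 1
```

For gzipped files, opening with `gzip.open(path, "rt")` and passing that handle should work the same way.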
Old 07-03-2016, 05:39 AM   #3
apredeus
Senior Member
 
Location: Bioinformatics Institute, SPb

Join Date: Jul 2012
Posts: 149

There are a bunch of NNNNN reads that are 20 bp long, and there are a bunch of other reads that are not all N but have variable length. I'll try to align them to see whether they at least look like micro-RNA, but the problem is that you need to clip the adapters, and that is hard to do on variable-length reads.

From the FastQC report it does not look like the flow cell was overloaded, though. It looks like there's a small bubble there, but that's all.

It was not a sequencing facility that did it - a small institute ran it on their own MiSeq. So they may well have done something wrong; they don't often run this sort of library - mostly they sequence virus strains.
Old 07-03-2016, 10:10 AM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Reads don't come off the machine with variable length unless you set the Illumina software to trim the adapters during base-calling or demultiplexing or something (not sure exactly when it happens), or they've been postprocessed in some way. You should ask how the data was generated, or better yet, see if you can get the raw fastq data.
Old 07-03-2016, 10:13 AM   #5
apredeus
Senior Member
 
Location: Bioinformatics Institute, SPb

Join Date: Jul 2012
Posts: 149

Those were supposed to be raw fastq. But you are right, I was thinking along the same lines. I'll just come over and get the data from the device myself.
Old 07-03-2016, 07:22 PM   #6
jdk787
josh kinman
 
Location: Austin

Join Date: Apr 2014
Posts: 60

I've seen this with short small RNA libraries when using MiSeq Reporter to demultiplex with automatic adapter trimming.

To fix this you can re-demultiplex the run with BCL2FastQ, or remove the adapter sequences from your sample sheet and re-demultiplex with MiSeq Reporter. Then just trim the adapters yourself.
Old 07-04-2016, 03:45 AM   #7
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,766

Quote:
Originally Posted by apredeus View Post
Those were supposed to be raw fastq. But you are right, I was thinking along the same lines. I'll just come over and get the data from the device myself.
If you can't get the raw data or can't get the facility to re-run the analysis then just trim the N's off. One can safely assume that Illumina would know how to identify their own adapter sequences. It sounds like they are masked by the default demux process.

@Brian: What is an easy way to trim those N's using BBMap? I should add this to my BBMap tricks thread.
Old 07-04-2016, 08:47 AM   #8
jdk787
josh kinman
 
Location: Austin

Join Date: Apr 2014
Posts: 60

Quote:
Originally Posted by GenoMax View Post
If you can't get the raw data or can't get the facility to re-run the analysis then just trim the N's off. One can safely assume that Illumina would know how to identify their own adapter sequences. It sounds like they are masked by the default demux process.
I couldn't find this info for MiSeq Reporter, but I did see this in the Bcl2FastQ guide:

--mask-short-adapter-reads arg (=22) smallest number of remaining bases (after masking bases below the minimum trimmed read length) below which whole read is masked

So it looks possible that the adapters are being correctly identified, but the read remaining after trimming is shorter than 22 bp and is therefore being masked with N's.

Since this is micro RNA, I think it is worth trying to re-demultiplex without adapter trimming, or changing this setting, in order to unmask these reads instead of removing them. Doing this has worked for me when sequencing small RNA libraries on the MiSeq.
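
To make the suspected mechanism concrete, here is a toy model of "trim the adapter, then N-mask whatever is left if it is too short". It is a sketch under loud assumptions, not the Bcl2FastQ implementation: the function name `trim_and_mask`, the exact-match adapter search, the placeholder adapter sequence, and the choice to keep the masked read at the trimmed length are all hypothetical. Only the default threshold of 22 comes from the option text quoted above.

```python
def trim_and_mask(read, adapter, min_remaining=22):
    """Toy model: trim at the first exact adapter match, then replace the
    remaining bases with N if fewer than min_remaining are left."""
    pos = read.find(adapter)
    if pos == -1:
        return read               # no adapter found: read is left untouched
    insert = read[:pos]           # bases in front of the adapter
    if len(insert) < min_remaining:
        return "N" * len(insert)  # too short: the whole remainder is masked
    return insert


ADAPTER = "GATTACAGATTACA"        # placeholder, NOT a real Illumina adapter

short_read = "ACGT" * 5 + ADAPTER  # 20-base insert, below the threshold
long_read = "ACGT" * 6 + ADAPTER   # 24-base insert, above the threshold

print(trim_and_mask(short_read, ADAPTER))  # 20 N's
print(trim_and_mask(long_read, ADAPTER))   # the 24-base insert, unmasked
```

Under this model a 20-21 nt insert comes out as an all-N read of length 20-21, which matches what was reported at the top of the thread. Lowering `min_remaining` would leave such reads unmasked.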
Old 07-04-2016, 09:24 AM   #9
jdk787
josh kinman
 
Location: Austin

Join Date: Apr 2014
Posts: 60

From the MiSeq Reporter User Guide:

Masking Short Reads
MiSeq Reporter includes a setting that prevents reads that have been almost entirely trimmed or masked from confounding downstream analysis, which is based on the following criteria:
- If the adapter is encountered within the first 32 bases of the read, the adapter sequence is N-masked.
- If the adapter is identified in the first 32 bases and the read includes ten or more bases from the start of the adapter, the entire read is N-masked. This ten-base limit is controlled by the configuration setting NMaskShortAdapterReads.
Old 07-04-2016, 11:26 AM   #10
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Quote:
Originally Posted by GenoMax View Post
One can safely assume that Illumina would know how to identify their own adapter sequences.
I'd like to think so...

Quote:
What is an easy way to trim those N's using BBMap? I should add this to my BBMap tricks thread.
You can use BBDuk or Reformat with "qtrim=rl trimq=1". That will only trim leading and trailing bases with a Q-score below 1, which means Q0, which means N (in either fasta or fastq format). The BBMap package automatically changes the Q-scores of Ns that are above 0 to 0, and of called bases with Q-scores below 2 to 2, since occasionally some Illumina software versions produce odd things like a handful of Q0 called bases or Ns with Q>0, neither of which makes any sense on the Phred scale.
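
For anyone who wants to see the effect without BBTools at hand, the end result of that trimming can be approximated in a few lines of Python. This is a rough sketch, not BBDuk's code: BBDuk decides by quality score, whereas this hypothetical `trim_n_ends` helper looks at the base letters directly, which should give the same result only if every N is Q0 and no called base is.

```python
def trim_n_ends(seq, qual):
    """Strip leading and trailing N bases, keeping qualities in sync.

    Internal Ns are left alone, as with end-only quality trimming.
    """
    start = 0
    end = len(seq)
    while start < end and seq[start] in "Nn":
        start += 1
    while end > start and seq[end - 1] in "Nn":
        end -= 1
    return seq[start:end], qual[start:end]


print(trim_n_ends("NNACGTNNN", "!!IIII!!!"))  # ('ACGT', 'IIII')
print(trim_n_ends("NNNN", "!!!!"))            # ('', '')
```

Fully masked reads come out empty, so they would still need to be dropped by a minimum-length filter afterwards.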

@jdk787, thanks for posting the specific details of what's going on. Looks like defaults that make sense in many cases but not for small RNAs.