Old 09-24-2014, 11:48 AM   #1
Junior Member
Location: Bloomington, Indiana

Join Date: Sep 2014
Posts: 4
Demultiplexing HiSeq 2000 reads containing an N at the 5' end

Hello all:

I have an issue that I don't think has been covered yet in the SEQanswers community.

We recently performed CAGE, a method for large-scale profiling of 5' mRNA ends, and obtained a large number of high-quality Illumina HiSeq single-end (SE) reads of 50 bp.

This experiment contains data from 8 separate experiments, and so the 5' ends of the reads were barcoded with 8 trinucleotides, as follows:

sample_1 ACC
sample_2 CAC
sample_3 AGT
sample_4 GCG
sample_5 ATG
sample_6 TAC
sample_7 ACG
sample_8 GCT

Of course, these samples need to be demultiplexed so they can be analyzed separately. I did so using the FASTX-Toolkit's FASTX Barcode Splitter, as follows:

cat myCompleteCAGEfile.fastq | fastx_barcode_splitter.pl --bcfile mybarcodes.txt --bol --exact --suffix ".txt" --prefix /my_directory/demulti-

I chose the --exact flag because the barcodes are only three bases in length, so I reasoned it was best to demand a precise match and then rescue the unmatched reads after the fact.

The demultiplexing job above worked well, and I was left with a small (<5%) but not insignificant number of unmatched reads. The largest class of these unmatched reads has an N at the first base.

For example, one of the first reads begins with NCT, for which the barcode (N)CT would correspond to Sample 8: GCT.

I ran the command again with a tolerance for a single mismatch, but this causes conflicts between possible barcodes. As far as I know, the command does not allow mismatch tolerance to be restricted to a specific base, which would be ideal in this case. The program also rejects a degenerate barcode file that includes the N.
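To make the conflict concrete, here is a minimal Python sketch that treats the first barcode base as a wildcard and lists which samples each N-prefixed pattern could match. It only uses the barcode table above; the function names are hypothetical and nothing here is part of the FASTX-Toolkit.

```python
from collections import defaultdict

# Barcode table copied from the list above.
BARCODES = {
    "sample_1": "ACC",
    "sample_2": "CAC",
    "sample_3": "AGT",
    "sample_4": "GCG",
    "sample_5": "ATG",
    "sample_6": "TAC",
    "sample_7": "ACG",
    "sample_8": "GCT",
}


def leading_n_candidates(barcodes):
    """Map each 'N' + last-two-bases pattern to every sample it could match."""
    candidates = defaultdict(list)
    for sample, code in barcodes.items():
        candidates["N" + code[1:]].append(sample)
    return dict(candidates)


def unambiguous_leading_n(barcodes):
    """Keep only the patterns that point to exactly one sample."""
    return {
        pattern: samples[0]
        for pattern, samples in leading_n_candidates(barcodes).items()
        if len(samples) == 1
    }


if __name__ == "__main__":
    for pattern, samples in sorted(leading_n_candidates(BARCODES).items()):
        print(pattern, samples)
```

With this table, NCC, NGT, NTG and NCT each point to a single sample, while NAC (CAC or TAC) and NCG (GCG or ACG) stay ambiguous, so only part of the leading-N reads could be assigned this way.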

I've considered using a set of piped Linux commands, including cut and sed, but this would be trickier than it needs to be, and I expect there is another way to rescue these 'single leading N' unmatched reads. Can anyone point me in another direction? It may be possible to do this using CASAVA, but I have very limited experience with that software package.

Thanks in advance,

rtraborn
Old 09-24-2014, 12:00 PM   #2
Senior Member
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,972

If you recovered 95% of the reads that you are interested in, do you really need the remaining 5%? Generally an N indicates that the basecaller could not decide which base to call. In your case the last two bases (CT) are unique among your barcodes, so your hypothesis as stated above may hold true, i.e. (N)CT must really be a GCT. You could recover the remaining reads following that logic with some code, but if you are happy with the 95% then I would say ignore the rest.
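The recovery logic described here could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not a tested pipeline step: it assumes standard four-line FASTQ records, reuses the barcode table from the first post, and rescues a leading-N read only when its last two barcode bases belong to exactly one barcode. The function names are hypothetical and the reads in the demo are made up.

```python
import io

# Barcode table from the first post.
BARCODES = {
    "sample_1": "ACC", "sample_2": "CAC", "sample_3": "AGT", "sample_4": "GCG",
    "sample_5": "ATG", "sample_6": "TAC", "sample_7": "ACG", "sample_8": "GCT",
}


def build_rescue_table(barcodes):
    """Return {'N' + suffix: sample} for suffixes owned by exactly one barcode."""
    by_suffix = {}
    for sample, code in barcodes.items():
        by_suffix.setdefault(code[1:], []).append(sample)
    return {"N" + suffix: samples[0]
            for suffix, samples in by_suffix.items() if len(samples) == 1}


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from a 4-line FASTQ.

    Stops at end of file or at the first empty header line.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def rescue_leading_n(handle, barcodes):
    """Split unmatched reads into rescued (per sample) and still unmatched."""
    table = build_rescue_table(barcodes)
    rescued, still_unmatched = {}, []
    for record in read_fastq(handle):
        sample = table.get(record[1][:3])  # first three bases of the read
        if sample is None:
            still_unmatched.append(record)
        else:
            rescued.setdefault(sample, []).append(record)
    return rescued, still_unmatched


if __name__ == "__main__":
    # Two made-up reads: NCT is rescuable, NAC is ambiguous (CAC or TAC).
    demo = io.StringIO("@r1\nNCTAAAA\n+\n#IIIIII\n@r2\nNACGGGG\n+\n#IIIIII\n")
    rescued, leftover = rescue_leading_n(demo, BARCODES)
    print(sorted(rescued), len(leftover))
```

The records are returned untrimmed; whether to strip the barcode before mapping is left to the downstream step.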
GenoMax
Old 09-24-2014, 02:29 PM   #3
Junior Member
Location: Bloomington, Indiana

Join Date: Sep 2014
Posts: 4

That's a very good point: this is an edge case, and I don't necessarily need to hold up the rest of the analysis on account of <5% of the reads.

That said, I'm still interested in finding a solution to this problem so I can incorporate it in a pipeline that I'm building. If I find one I'll post it to this thread.
rtraborn
Old 09-24-2014, 02:49 PM   #4
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

I agree with GenoMax; just throw those away. 3bp tags are really short; with an N, you have 2bp, and an indication that the other 2 bases are probably low quality, or else why would the other be an N? Remember that there are miscalled bases in barcodes, too. If you accept barcodes with an N, a single miscalled base will cause cross-contamination.

Of course, you already have some (like ACC and ACG) that are only a single base apart, so I hope the study is not sensitive to cross-contamination. But keeping the ones with N calls will just make the noise greater, because a 2bp code can be 1 substitution away from 3 or 4 other codes, thus increasing the chances of generating a valid code from a random sub.
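The distance argument can be checked directly against the barcode list from the first post. The following Python sketch (helper names are hypothetical) lists the full barcodes that differ by a single substitution, then does the same for the two-base suffixes that remain once the first base is an N.

```python
from itertools import combinations

# Barcodes from the first post, in sample order.
BARCODES = ["ACC", "CAC", "AGT", "GCG", "ATG", "TAC", "ACG", "GCT"]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def pairs_at_distance_one(codes):
    """All barcode pairs that a single miscalled base can convert into each other."""
    return [(a, b) for a, b in combinations(codes, 2) if hamming(a, b) == 1]


def suffix_neighbors(codes):
    """For each distinct 2-base suffix, the other suffixes one substitution away."""
    suffixes = sorted({code[1:] for code in codes})
    return {
        s: [t for t in suffixes if t != s and hamming(s, t) == 1]
        for s in suffixes
    }


if __name__ == "__main__":
    print(pairs_at_distance_one(BARCODES))
    for suffix, neighbors in suffix_neighbors(BARCODES).items():
        print(suffix, neighbors)
```

For this table, five pairs of full barcodes are one substitution apart (ACC/ACG, CAC/TAC, GCG/ACG, GCG/GCT, ATG/ACG), and the suffixes CC, CG and CT each have three neighboring suffixes, which is the extra cross-contamination risk being described.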
Brian Bushnell
Old 09-25-2014, 09:47 AM   #5
Junior Member
Location: Bloomington, Indiana

Join Date: Sep 2014
Posts: 4

Hi Brian:

Good points. I'll likely just keep these reads separate and go ahead with the analysis without them; not having them will not change the results, and we certainly have a tremendous number of reads. We are setting up to do similar 5' end profiling experiments in our lab, and when we do so we'll use much longer barcodes so we don't run into these ambiguity problems.

Best regards,

rtraborn

cage, demultiplexing, transcriptome analysis
