SEQanswers

Old 07-11-2012, 01:02 PM   #1
Mouth_Breather
Getting a list of all index sequences

Hi!

For an internal project, we are trying to get the actual index sequence for each read (all reads, whether or not they wind up in the undetermined indices bin).

We are using CASAVA 1.8.2. From what I can tell from the CASAVA User Guide, this information is only present in the binary .bcl files.

Does anyone know of another way to retrieve this information, other than writing a script that parses binary?

Thanks for reading!
Old 07-11-2012, 04:01 PM   #2
kmcarr

Mouth,

The actual barcode read for each read is recorded in the definition line of the FASTQ file. Here is the format of the Illumina FASTQ produced by CASAVA 1.8.x

Code:
@HWI-ST957:100:D0V52ACXX:6:1101:1221:2161 1:N:0:CGATGT
The sequence at the end of the defline is the barcode actually read for that cluster.
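
To make the layout of that defline concrete, here is a minimal sketch that splits one CASAVA 1.8-style defline into labelled fields. The function name parse_defline and the field labels are my own (hypothetical); the order of the fields follows the Illumina header layout: instrument, run, flowcell, lane, tile, x, y, then read number, filter flag, control bits and index.

```shell
# Split one CASAVA 1.8-style defline into labelled fields.
# parse_defline is a hypothetical helper name, not part of CASAVA.
parse_defline() {
    # Use both ':' and the single space as field separators.
    printf '%s\n' "$1" | awk -F'[: ]' '{
        sub(/^@/, "", $1)   # drop the leading "@" from the instrument name
        printf "instrument=%s run=%s flowcell=%s lane=%s tile=%s x=%s y=%s read=%s filtered=%s control=%s index=%s\n",
               $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11
    }'
}
```

Calling it as parse_defline '@HWI-ST957:100:D0V52ACXX:6:1101:1221:2161 1:N:0:CGATGT' should report index=CGATGT as the last field.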

Here is an example showing the variation in the barcode recorded in the defline. It grabs the first 1000 deflines from a gzipped FASTQ file, splits each defline at ":", takes the 10th field (the barcode), sorts the barcodes and counts how many times each unique one occurs.

Code:
zgrep '^@HWI' CTRL1_CGATGT_L006_R1_001.fastq.gz | head -1000 | cut -d":" -f10 | sort | uniq -c
      1 CGATAT
      2 CGATGA
      1 CGATGG
    987 CGATGT
      2 CGCTGT
      1 CGGTGT
      6 TGATGT
You can see that 98.7% match the expected barcode and there are a few with mismatches.

Be aware, though, that if CASAVA demultiplexing was run with default settings, no mismatches are allowed in the barcode. You will only see differences between the barcode actually read and the configured barcode if you set up the CASAVA run (configureBclToFastq.pl) with --mismatches=1.
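
If you want the tally over a whole file rather than the first 1000 reads, something along these lines might work. It is only a sketch: count_barcodes is a name I made up, and it assumes strict 4-line FASTQ records and CASAVA 1.8-style deflines where the index is the last ":"-separated field. Picking every 4th line instead of grepping also avoids counting quality lines that happen to start with "@".

```shell
# Tally the index sequence recorded in each FASTQ defline.
# Reads uncompressed FASTQ on stdin; count_barcodes is a hypothetical helper.
# Assumes 4-line records with the index as the last ':'-separated defline field.
count_barcodes() {
    awk -F':' 'NR % 4 == 1 { counts[$NF]++ }   # line 1 of every record is the defline
               END { for (bc in counts) printf "%d\t%s\n", counts[bc], bc }' |
        sort -k1,1nr -k2,2                      # most frequent barcode first
}
```

Usage would be something like: gzip -dc CTRL1_CGATGT_L006_R1_001.fastq.gz | count_barcodes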
Old 07-11-2012, 04:33 PM   #3
Mouth_Breather

Hi, thanks for the reply! I'm aware of the ability to grab the barcode sequence from the FASTQ files, but I was not aware that it shows the actual variations for barcodes that have 1 mismatch. Thanks for that.

But I also want to see the barcodes that have 2 or more mismatches. Are those recorded somewhere other than the .bcl files?
Old 07-12-2012, 02:26 AM   #4
kmcarr

Quote:
Originally Posted by Mouth_Breather
But I also want to see the barcodes for which mismatches are 2 and greater. Are those recorded somewhere other than the .bcl files?
Those are the reads in the FASTQ files under the Undetermined_indices/Sample_laneX directories (X being the lane number).
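
To pick out the barcodes with 2 or more mismatches from those files, a rough sketch like the following could be used. The function name barcode_mismatches is hypothetical, and it assumes the deflines in the undetermined FASTQ files carry the index read in the same last-field position as the demultiplexed ones. It counts each observed index and reports how many positions differ from the expected barcode you pass in (a length difference is counted as extra mismatches).

```shell
# Print "count<TAB>barcode<TAB>mismatches" for every index seen in the deflines.
# $1 is the expected (configured) barcode; uncompressed FASTQ arrives on stdin.
# barcode_mismatches is a hypothetical helper; assumes 4-line FASTQ records.
barcode_mismatches() {
    awk -F':' -v expected="$1" '
        # Position-by-position comparison; extra or missing bases count as mismatches.
        function hamming(a, b,    i, n, d) {
            n = length(a) > length(b) ? length(a) : length(b)
            d = 0
            for (i = 1; i <= n; i++)
                if (substr(a, i, 1) != substr(b, i, 1)) d++
            return d
        }
        NR % 4 == 1 { counts[$NF]++ }
        END {
            for (bc in counts)
                printf "%d\t%s\t%d\n", counts[bc], bc, hamming(bc, expected)
        }' |
        sort -k3,3n -k1,1nr -k2,2   # fewest mismatches first, then by count
}
```

For example, piping the uncompressed undetermined FASTQ through barcode_mismatches CGATGT | awk '$3 >= 2' would keep only the indices that are 2 or more mismatches away from CGATGT.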