SEQanswers

Old 10-11-2011, 04:13 AM   #1
giampe
Member
 
Location: Bari, Italy

Join Date: Aug 2009
Posts: 22
convert base call files (*.bcl) into files (*_qseq.txt)

We have a set of data files from a multiplexed sequencing run on a HiScan SQ machine, and we now need to obtain a FASTQ file (in txt format) for each of our samples. Could someone describe a step-by-step command-line procedure for this? This is our first experience with this kind of data.

So far we have been able to install the OLB software and launch the command as given in the user guide for Off-Line Basecaller v1.9 (Nov. 2010):
./bustard.py --CIF /srv/illumina/Runs/111006_H112_0131_AB0B0VABXX/Data/Intensities/ --make --with-qseq

At this point we don't know whether the output directories contain the right files in the right format.

Thanks for your time and your help.
Old 10-11-2011, 05:03 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,961
Default

If you are going to use the new version of CASAVA (v.1.8.x) then the fastq conversion and de-multiplexing are done with a single command starting with the BCL files. You will need access to the entire flowcell folder for this to work.

Minimally the process will be something like this:

Quote:
configureBclToFastq.pl --input-dir provide_location_to_Basecalls_dir --sample-sheet Location_of_SampleSheet.csv
You appear to have access to the Illumina software, so you should be able to download the relevant manuals in PDF format. Since there are many options for the above command that could be relevant in your specific case, it would be best to refer to the CASAVA manual for detailed help.

PS: "qseq" files are no longer produced by the new version of CASAVA. You will get "fastq" format sequence files with Sanger encoding for the quality values. By default all sequences, including those that would fail the quality filter, are written to these files. Look for other threads on this forum for discussions on this issue.
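Since failed reads may end up in the fastq files, here is a rough Python sketch of dropping them afterwards. It assumes the CASAVA 1.8-style read header, in which the part after the space looks like `<read>:<is_filtered>:<control>:<index>` and `is_filtered` is `Y` for reads that failed the filter. That header layout is my understanding of the format and is not stated in this thread, so check it against your own files first.

```python
def passes_filter(header):
    """Return True unless the header marks the read as filtered out.

    Assumed layout (not confirmed in this thread):
    '@<instrument>:...:<y> <read>:<is_filtered>:<control>:<index>'
    where is_filtered is 'Y' for reads that FAILED the quality filter.
    """
    parts = header.rstrip("\n").split(" ", 1)
    if len(parts) != 2:
        return True  # no second field to inspect, so keep the read
    fields = parts[1].split(":")
    return not (len(fields) >= 2 and fields[1] == "Y")


def keep_passing(lines):
    """Yield 4-line FASTQ records whose header passes the filter."""
    lines = list(lines)
    for i in range(0, len(lines) - 3, 4):
        record = lines[i:i + 4]
        if passes_filter(record[0]):
            yield record
```

To use it on a file, read the lines with `open(path)` and write the records yielded by `keep_passing` to a new file.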

Last edited by GenoMax; 10-11-2011 at 05:07 AM.
Old 10-11-2011, 06:13 AM   #3
giampe
Member
 
Location: Bari, Italy

Join Date: Aug 2009
Posts: 22
Default

Hi GenoMax,
thanks for your quick reply. According to the CASAVA 1.7 user guide (PDF, Rev. A), however, the .bcl converter is not included in CASAVA.
Where can we find the configureBclToFastq.pl script? Is it in the CASAVA package?
We have setupBclToQseq.py in Off-Line Basecaller v1.9, but we are not able to create its input files with the bustard.py script as the user guide describes.

Thanks a lot!
Old 10-11-2011, 07:20 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,961
Default

It sounds like you are going to stick with CASAVA v.1.7 for this processing (instead of v.1.8.x, which was the info I had provided before, so please ignore that info).

In that case, this will be a two step process. In the first step you will convert the BCL to qseq files. This will be followed by actual de-multiplexing.

The following assumes that you have the entire flowcell folder available; otherwise this will not work.

While in the "Basecalls" directory you can issue the following command to do step 1 of the process (bcl to qseq conversion).

Quote:
setupBclToQseq.py -b . -o . -P .clocs --in-place
Run "make/distmake" to actually run the bcl conversion in the Basecalls directory after executing setupBclToQseq.py command.

After this conversion is complete, you can do step 2 (de-multiplexing). You will need to provide a "SampleSheet.csv" file that has the info about tags you have used. It would be best to refer to the manual for the exact format of this file. Remember not to use any spaces (and/or special characters) in sample names. The actual command to do the de-multiplexing is below:

Quote:
demultiplex.pl --input-dir /Path_to/Basecalls_directory --sample-sheet /path_to/SampleSheet.csv --alignment-config /path_to/config.template.txt --qseq-mask "Replace_with_correct_qseq_mask_code"
You can eliminate the --qseq-mask and the command will automatically determine this info.

A "Demultiplexed" directory will be created in the "Basecalls" directory after running the demultiplex.pl command. You will need to change to "Demultiplexed" directory and execute the "make/distmake" equivalent commands to complete the demultiplexing process.

The *qseq* files will be distributed in "bins" labelled 001 .. 0xx, depending on the number of indexes in your samples. At the end of the demultiplexing process you will find a SamplesDirectories.csv file in the "Demultiplexed" directory that provides a "key" to where your samples are located in the "bin" directories.

Note: Each of these processes could take several hours to complete (depending on how many clusters you had in the lanes), so you will need to be patient. You can use multiple CPUs by providing the appropriate switch to make (or to the SGE/distmake process).

Last edited by GenoMax; 10-11-2011 at 07:31 AM.
Old 10-12-2011, 06:54 AM   #5
giampe
Member
 
Location: Bari, Italy

Join Date: Aug 2009
Posts: 22
Default

Hi GenoMax,
thanks for your helpful suggestions. We are biologists without strong informatics skills, so we have attached a PDF file showing the structure of our Linux server. Could you take a look at it and check whether the software and data folders are in the correct locations?

Following your suggestion, we launched the command like this:

[serlab-carso:bin]# ./setupBclToQseq.py -b /srv/illumina/Runs/111006_H112_0131_AB0B0VABXX/Data/Intensities/BaseCalls/ -o --in-place -P .clocs
INFO:setupBclToQseq:setupBclToQseq.py version 1.9.0
INFO:setupBclToQseq:Creating output directory /root/OLB_1.9/OLB-1.9.0/bin/--in-place
INFO:setupBclToQseq:Configuring /root/OLB_1.9/OLB-1.9.0/share/makefiles/bclToQseq/Makefile to /root/OLB_1.9/OLB-1.9.0/bin/--in-place/Makefile
INFO:setupBclToQseq:Creating the 'Makefile.config'
INFO:setupBclToQseq:Output directory successfully initialized. Type 'make' in /root/OLB_1.9/OLB-1.9.0/bin/--in-place to start the conversion
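The log above reports an output directory literally named `--in-place`, which suggests that `-o` took the next token as its value because no directory was given after it. The Python sketch below reproduces that parsing behaviour with the standard `optparse` module. Whether setupBclToQseq.py really parses its options this way is an assumption; the option definitions here only mirror the flags seen in this thread and are not taken from the OLB source.

```python
from optparse import OptionParser


def parse(argv):
    """Parse a flag set that mimics the command shown in the thread.

    Hypothetical stand-in for the real script: only the flag names
    come from the thread, the destinations are made up.
    """
    parser = OptionParser()
    parser.add_option("-b", dest="base_calls_dir")
    parser.add_option("-o", dest="output_dir")
    parser.add_option("-P", dest="positions_format")
    parser.add_option("--in-place", action="store_true",
                      dest="in_place", default=False)
    options, _ = parser.parse_args(argv)
    return options


# "-o" with no directory swallows the following token as its value.
as_run = parse(["-b", ".", "-o", "--in-place", "-P", ".clocs"])
print(as_run.output_dir, as_run.in_place)

# Giving "-o" its own value keeps "--in-place" as a separate switch.
as_suggested = parse(["-b", ".", "-o", ".", "-P", ".clocs", "--in-place"])
print(as_suggested.output_dir, as_suggested.in_place)
```

If the real script behaves the same way, the form suggested earlier in the thread (`-b . -o . -P .clocs --in-place`, run from inside the BaseCalls directory) would avoid the stray directory.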

and we obtained qseq.txt files, as you can see in the PDF file.
But now the second step, demultiplexing, doesn't work. Why? Do you have an explanation?
Sorry, I realize we are making a lot of requests, but at the moment you are the only person giving us help!
Attached Files
File Type: pdf helpgenomax (2).pdf (478.6 KB, 49 views)
Old 10-12-2011, 10:48 AM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,961
Default

I am glad that at least part 1 has worked correctly.

Based on the error you attached it appears that your samplesheet file may not be formatted correctly.

Is it in "comma separated value (csv)" format? If you are making this file on a Windows machine and then moving it to your server, use the "dos2unix" utility on your Unix server to convert it from DOS format to Unix format.

Make sure you have no spaces or special characters (things like $, #, @) anywhere in the samplesheet file. Replace spaces with "_" (underscore); that works well.
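The checks described above can be sketched in a few lines of Python. This is only a rough pre-flight check under my own assumptions: the set of allowed characters (letters, digits, underscore, hyphen, period, comma) is a guess at what is safe, and the function knows nothing about which columns the sample sheet should contain, so the manual remains the reference for the actual format.

```python
import re

# Assumed "safe" character set; tighten or loosen to match the manual.
ALLOWED = re.compile(r"^[A-Za-z0-9_.,\-]*$")


def check_sample_sheet(raw):
    """Return a list of problems found in the raw bytes of a sample sheet."""
    problems = []
    if b"\r" in raw:
        problems.append("DOS line endings found; run dos2unix")
    text = raw.decode("ascii", errors="replace")
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # ignore the empty string after the final newline
    comma_counts = set()
    for number, line in enumerate(lines, 1):
        if not ALLOWED.match(line):
            problems.append("line %d: space or special character" % number)
        comma_counts.add(line.count(","))
    if len(comma_counts) > 1:
        problems.append("rows have differing numbers of commas")
    return problems
```

Call it as `check_sample_sheet(open("SampleSheet.csv", "rb").read())`; an empty list means none of these particular problems were found.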



Old 10-13-2011, 05:50 AM   #7
giampe
Member
 
Location: Bari, Italy

Join Date: Aug 2009
Posts: 22
Default

Dear GenoMax,
thanks for your suggestion; the problem with the demultiplexing command was indeed in the sample sheet.csv.
With the demultiplex.pl command we have now obtained output directories in the Demultiplexed folder, such as the 001 directory shown in the PDF file. However, we don't understand in what order our sample libraries appear (our sample sheet.csv is attached), and the files still seem to be in qseq.txt format rather than FASTQ format.
How do we get a single fastq.txt file (4 rows per sequence) for each of our samples?

Sorry for so many requests!
Attached Files
File Type: pdf 001.pdf (561.2 KB, 17 views)
File Type: pdf samplesheet.pdf (114.8 KB, 17 views)
Old 10-13-2011, 07:31 AM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,961
Default

Giampe,

There should be a SamplesDirectories.csv file created in the "Demultiplexed" directory after the demultiplexing step completion that will tell you which "bin" (001, 002 etc) each sample was put in. Look for that info in the last column.
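A small Python sketch of reading that key file, in case it helps. The only thing taken from this thread is that the bin is in the last column. Which column holds the sample name, and whether the file starts with a header row, are assumptions you would need to check by opening the file, so both are left as parameters.

```python
import csv
import io


def bins_by_sample(csv_text, sample_col, has_header=True):
    """Map each sample name to the bin directories listed for it.

    sample_col: zero-based index of the column holding the sample name
    (an assumption to verify against the real file). The bin is taken
    from the last column of each row.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if has_header:
        rows = rows[1:]
    mapping = {}
    for row in rows:
        bins = mapping.setdefault(row[sample_col], [])
        if row[-1] not in bins:
            bins.append(row[-1])
    return mapping
```

For example, `bins_by_sample(open("SamplesDirectories.csv").read(), sample_col=1)` would return a dictionary keyed by whatever is in the second column.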

You will need to run at least a "sequence"-only analysis to get the sequence files. This is specified in the "config.template.txt" file. Again, check the manual or send an example of the file you used.

There should be a "GERALD_*" directory in each of the bins (001, 002 etc). That directory will contain final sequence files. Unfortunately they will be called s_*_sequence.txt, so you will need to appropriately rename them (we rename with sample name/tag info) before you copy them out of each bin/GERALD* dir.
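For orientation only, this is roughly how one qseq line relates to the four-line FASTQ record asked about earlier in the thread. It is not the GERALD route described above, just a Python sketch under assumptions of mine: that a qseq line has 11 tab-separated fields (machine, run, lane, tile, x, y, index, read number, sequence, quality, filter flag), that unknown bases are written as ".", and that qualities are stored with an offset of 64. None of this is stated in the thread, so verify it against your own files.

```python
def qseq_to_fastq(line, to_sanger=True):
    """Convert one qseq line to a FASTQ record.

    Returns (record, passed), where passed is True when the assumed
    filter flag in the last field is "1".
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError("expected 11 tab-separated fields, got %d"
                         % len(fields))
    machine, run, lane, tile, x, y, index, read, seq, qual, flag = fields
    name = "%s_%s:%s:%s:%s:%s#%s/%s" % (
        machine, run, lane, tile, x, y, index, read)
    seq = seq.replace(".", "N")  # assumed marker for an unknown base
    if to_sanger:
        # Shift the assumed offset-64 qualities to offset 33 (Sanger).
        qual = "".join(chr(ord(c) - 31) for c in qual)
    record = "@%s\n%s\n+\n%s\n" % (name, seq, qual)
    return record, flag == "1"
```

The read name built here is one common style, chosen for illustration; the names in the s_*_sequence.txt files may differ.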

Old 10-13-2011, 10:08 AM   #9
giampe
Member
 
Location: Bari, Italy

Join Date: Aug 2009
Posts: 22
Default

Hi GenoMax,
OK, we have found the SamplesDirectories.csv file in the "Demultiplexed" directory, where we can see six 00_ directories, each with several qseq.txt files. Some of these files are empty (0 KB), and we noticed that some qseq.txt files in the same directory have the same lane number and the same barcode. So is there more than one file for each sample?
How do we run a "sequence"-only analysis to get the sequence files? We don't see the "config.template.txt" file or the GERALD_ directory. Where are they?

Thank you again!
Old 10-13-2011, 11:20 AM   #10
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,961
Default

Here is the relevant bit of info I had originally included with the command line for demultiplex.pl. You have to provide the configuration file for creating the final sequence files.

--alignment-config /path_to/config.template.txt

This configuration file is for GERALD where you will specify that you want a sequence only analysis (ANALYSIS sequence). You will find exact information about how to format this file in the manual (page 23 of CASAVA v.1.7 manual).
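Before re-running, it may be worth confirming that the config file really contains the ANALYSIS line. A minimal Python sketch follows, with two assumptions that do not come from this thread: that "#" starts a comment, and that a setting may carry a lane prefix such as `12345678:ANALYSIS sequence`. The manual is the reference for the real syntax.

```python
def analysis_settings(config_text):
    """Return the values of all ANALYSIS lines found in the config text."""
    found = []
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumed comment marker
        if not line:
            continue
        parts = line.split()
        key = parts[0].split(":")[-1]  # drop an assumed lane prefix
        if key == "ANALYSIS" and len(parts) > 1:
            found.append(parts[1])
    return found
```

For a sequence-only run you would expect `analysis_settings(open("config.template.txt").read())` to return `["sequence"]`.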

Please re-run the demultiplex.pl step with this command line option (providing the config file) to get the actual sequence files. You will need to specify an additional option for your "make" command as follows: "make -j no_of_cpu ALIGN=yes" (this is required to get the GERALD to run).
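If you prefer to launch make from a script, the command above can be assembled as shown in this Python sketch. The function name and the use of `os.cpu_count()` as a default are my own choices; only `make`, `-j` and `ALIGN=yes` come from the post.

```python
import os


def make_command(jobs=None, align=True):
    """Build the argument list for the make step described above."""
    jobs = jobs or os.cpu_count() or 1  # fall back to one job
    command = ["make", "-j", str(jobs)]
    if align:
        command.append("ALIGN=yes")  # needed for GERALD to run, per the post
    return command
```

It could then be run with `subprocess.run(make_command(8), cwd="/path_to/Demultiplexed", check=True)`, where the path is a placeholder for your own Demultiplexed directory.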



Last edited by GenoMax; 10-13-2011 at 11:25 AM.
Old 10-14-2011, 03:37 AM   #11
giampe
Member
 
Location: Bari, Italy

Join Date: Aug 2009
Posts: 22
Default

Hi GenoMax,
we are frustrated! When we provide a config.template.txt to the demultiplexing command, we do not get the expected result; instead it returns a different error message. There must be something wrong in our config.template.txt file. We are sending you our samplesheet file; could you write a config.template.txt file for us? We have read page 24 of the CASAVA manual, but we find its explanation of the formatting confusing. We want to perform the ANALYSIS sequence for all samples.
Another question: in which folder should we put the config.template.txt file?

Sorry, and thanks for your help. We hope for a quick reply!

P.S. you can also send information to my email address: annalisa79@hotmail.it
or skype account: giampe79
Attached Files
File Type: txt sample_sheet.txt (1.9 KB, 15 views)
Old 10-20-2011, 08:26 AM   #12
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,961
Default

I am sorry I did not see your last message till just now. Let me have a look and I will respond.

Note: See the response below. I will attach a config.txt file to it soon.


Last edited by GenoMax; 10-20-2011 at 08:47 AM.
Old 10-20-2011, 08:45 AM   #13
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,961
Default

Try using the attached samplesheet file. I have already converted it into Unix format. I had to gzip it, so you will need to unzip it before using it.

Both files can be in any location. Just provide the full path to the respective files for corresponding command line switches (if not present in the local directory) when you run the demultiplex.pl command.
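On the server, `gunzip` will unpack the attachment. If that is not convenient, the same thing can be sketched with Python's standard `gzip` module as below. The function names are made up for illustration, and the line-ending conversion is only a precaution in case the file is edited on Windows again.

```python
import gzip


def gunzip_bytes(data):
    """Decompress gzip data and normalise DOS line endings to Unix."""
    return gzip.decompress(data).replace(b"\r\n", b"\n")


def gunzip_file(source_path, target_path):
    """Unpack source_path (a .gz file) into target_path."""
    with open(source_path, "rb") as source:
        unpacked = gunzip_bytes(source.read())
    with open(target_path, "wb") as target:
        target.write(unpacked)
```

For the attachment in this post that would be `gunzip_file("sample_sheet_giampe.csv.gz", "sample_sheet_giampe.csv")`.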
Attached Files
File Type: gz sample_sheet_giampe.csv.gz (457 Bytes, 15 views)
File Type: txt config.template.txt (503 Bytes, 14 views)

Last edited by GenoMax; 10-20-2011 at 08:57 AM.