SEQanswers

Old 08-19-2013, 11:45 AM   #1
rzeng
Member
 
Location: houston

Join Date: Aug 2013
Posts: 19
Default Questions for barcode file splitting, forward/reverse data sorting

Hi, I am fairly new to sequence analysis and a complete newcomer here. My questions may be too basic, but answers would help me get my sequence analysis work off to a good start. Can anyone help? I would really appreciate it!

All I have are three separate raw data files in FASTQ format (each with more than 40 million reads): the forward reads (data 1), the barcode reads (data 2), and the reverse reads (data 3). The records in all three files correspond to each other and are in the same order from the beginning. I used a separate barcode file (6 different barcodes) to split data 2 into 6 files with the Galaxy barcode splitter. Now I am stuck and cannot continue until I figure out the following questions:

1. How can I sort the forward reads (data 1) and the reverse reads (data 3) using the 6 files generated by the barcode splitter? Is there software to do this? By the way, I do not have much bioinformatics background, so any good suggestions are welcome.

2. How do I know whether there are adapter sequences in the forward/reverse reads from data 1 and 3, and where they are? Knowing this would help me trim adapters from the original sequences.

The following is just one example extracted from my original data. All I have are these 3 data files and a separate barcode file. I do not have any other information, such as how the barcodes were designed or how the library was constructed.



Data 1 (forward read)

@IPAR1:2:1:4029:1196:1#0/1 ATTTTGCCACATACAAAAGAATCTACGTTCTTCTCAGCACCTCATGGAATCTTCTCTAAAATATATCATATAATAGGACACAAAAGAA
+ BHGHHHHHHHHGDDFHHHGGDGHFHFHHHHGD>GEEG>GFHHHHFHBBHFHHHHEHHHHHHBAFHHBBEHHHFEHGBECEHFHHFAHF


Data 2 (barcode read; TGACCTTG is the barcode tag, and I do not yet know what the ATCTCGT after the tag is)

@IPAR1:2:1:4029:1196:1#0/2
TGACCTTGATCTCGT
+
HIHIIGIIIH8CCDC


Data 3 (reverse read)

@IPAR1:2:1:4029:1196:1#0/3 GATATAATGGATGGGATTATTTCAATCTTTTATCTATTGAGGCTTCTTTTGTGTCCTATTATATGATATATTTTAGAGAAGATTCCAT
+ IIHIIIIHIIDEGGGEBG>GIIFIHHIHIIIIIFIDE4G@GG<GGEGBGG?AACCIIBIIBDIIIFDII>IIIIDIH@DFIGBI@IEE
Old 08-19-2013, 11:53 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077
Default

Quote:
Originally Posted by rzeng View Post
I used a separate barcode file (6 different barcodes) to split data 2 into 6 files with the Galaxy barcode splitter. Now I am stuck and cannot continue until I figure out the following questions:

1. How can I sort the forward reads (data 1) and the reverse reads (data 3) using the 6 files generated by the barcode splitter? Is there software to do this? By the way, I do not have much bioinformatics background, so any good suggestions are welcome.

2. How do I know whether there are adapter sequences in the forward/reverse reads from data 1 and 3, and where they are? Knowing this would help me trim adapters from the original sequences.
If you managed to get the forward and reverse reads into separate files for each sample then you have made good progress. At this stage you probably want to do some QC on the files.

Here is a link for some practical info to get you started: http://en.wikibooks.org/wiki/Next_Ge...Pre-processing

As for some of your other questions, use the "search" function on this site with clever combinations of keywords. You will find many past threads with the answers you need (and additional information you may not have thought to ask about).
Old 08-19-2013, 12:35 PM   #3
rzeng
Member
 
Location: houston

Join Date: Aug 2013
Posts: 19
Default Reply GenoMax

GenoMax,

Thanks very much for your answer. However, my problem now is how to separate the forward/reverse reads using the separate barcode files (generated from data 2), given that there are more than 40 million reads in each of the forward and reverse files.






Old 08-19-2013, 05:33 PM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077
Default

Can you provide some additional information about where you got the original data and in what format? What did the file names look like? Generally a provider will demultiplex the samples for you (using the Illumina pipeline software). It is much simpler to do it that way.

You may be able to use the script from the Qiime package: http://qiime.org/scripts/split_libraries_fastq.html to do the demultiplexing as suggested in this thread: http://seqanswers.com/forums/showthread.php?t=24215
GenoMax is offline   Reply With Quote
Old 08-20-2013, 09:42 AM   #5
rzeng
Member
 
Location: houston

Join Date: Aug 2013
Posts: 19
Default

Thank you GenoMax,

Sadly, I took over someone else's project without knowing much about the background of the original data (I cannot even get in contact with the person who prepared these data).

The FASTQ data consist of three files (forward reads in data 1, barcode reads in data 2, and reverse reads in data 3), each containing more than 40,000,000 sequences. The names/IDs of the sequences in the three files are very similar; only the XXXX fields vary, as shown below.

Name/ID of sequence data

@IPAR1:2:1:XXXX:XXXX:1#0/1 forward read data 1
@IPAR1:2:1:XXXX:XXXX:1#0/2 barcode read data 2
@IPAR1:2:1:XXXX:XXXX:1#0/3 reverse read data 3

All the sequences in each file are in the same order.

For example,

the 18th sequence in data 1 is
@IPAR1:2:1:4029:1196:1#0/1 ATTTTGCCACATACAAAAGAATCTACGTTCTTCTCAGCACCTCATGGAATCTTCTCTAAAATATATCATATAATAGGACACAAAAGAA
+ BHGHHHHHHHHGDDFHHHGGDGHFHFHHHHGD>GEEG>GFHHHHFHBBHFHHHHEHHHHHHBAFHHBBEHHHFEHGBECEHFHHFAHF

the 18th sequence (15 bp) in data 2 is

@IPAR1:2:1:4029:1196:1#0/2
TGACCTTGATCTCGT
+
HIHIIGIIIH8CCDC

the 18th sequence in data 3 is
@IPAR1:2:1:4029:1196:1#0/3 GATATAATGGATGGGATTATTTCAATCTTTTATCTATTGAGGCTTCTTTTGTGTCCTATTATATGATATATTTTAGAGAAGATTCCAT
+ IIHIIIIHIIDEGGGEBG>GIIFIHHIHIIIIIFIDE4G@GG<GGEGBGG?AACCIIBIIBDIIIFDII>IIIIDIH@DFIGBI@IEE

However, the barcode sequence (15 bp) in data 2 does not appear in the sequences of data 1 and 3, so I cannot sort the sequences in data 1 and 3 using data 2 directly. I do have separate barcode information for splitting the more than 40,000,000 barcode sequences in data 2: 6 different 8-mer barcode sequences that overlap with the sequences in data 2. For example, TGACCTTG matches the first 8 bases of the sequence in

@IPAR1:2:1:4029:1196:1#0/2
TGACCTTGATCTCGT
+
HIHIIGIIIH8CCDC


So I used this barcode information to split data 2 into 6 different files (each representing one sample). At this point, I need to go back to the forward/reverse sequences in data 1 and 3 and split them into 6 different files too.

PS. Sorry for the complicated explanation above, but my case is really different from the other cases of Illumina read data I can find anywhere.

Any suggestions would be very much appreciated!
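
The barcode matching described in this post (comparing the first 8 bases of each 15 bp tag read against a list of known 8-mers) can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the names `BARCODES`, `assign_sample`, and `tally_samples` are made up, only the barcode TGACCTTG comes from this thread, the second barcode and both sample names are placeholders, and only exact matches are counted.

```python
from collections import Counter

# Hypothetical barcode table: only TGACCTTG comes from the thread; the
# second entry and both sample names are placeholders for illustration.
BARCODES = {
    "TGACCTTG": "sample_1",
    "ACGTACGT": "sample_2",
}
BARCODE_LENGTH = 8


def assign_sample(tag_read, barcodes=BARCODES, length=BARCODE_LENGTH):
    """Return the sample name for a tag read, or None if no barcode matches.

    Only the first `length` bases are compared, and only exact matches count.
    """
    return barcodes.get(tag_read[:length])


def tally_samples(tag_reads):
    """Count how many tag reads fall into each sample (None = unassigned)."""
    return Counter(assign_sample(read) for read in tag_reads)
```

Running `tally_samples` over the tag sequences first is a cheap way to see how many reads would end up unassigned before splitting the much larger read files.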
Old 08-20-2013, 02:02 PM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077
Default

Quote:
@IPAR1:2:1:4029:1196:1#0/1
This is the important part of the three files that you need to look at. If you look at the description of the FASTQ format (Illumina sequence identifiers), that string uniquely identifies a cluster. The /1, /2, /3 on the end signify that these are R1 = forward read, R2 = tag read, and R3 = reverse read (as you have already figured out).
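
As a rough illustration of how such an identifier breaks down, here is a minimal Python sketch. It assumes the ID layout seen in this thread (`@cluster#tag/read_number`); the function name `parse_read_id` is made up for this example and is not part of any Illumina or QIIME tool.

```python
def parse_read_id(header):
    """Split an ID such as '@IPAR1:2:1:4029:1196:1#0/1' into its parts.

    Returns (cluster, tag, read_number), where `cluster` is everything
    between '@' and '#', `tag` is the text between '#' and the last '/',
    and `read_number` is the integer after the last '/'.
    """
    name = header.strip().split()[0]      # ignore anything after whitespace
    if not name.startswith("@"):
        raise ValueError(f"ID does not start with '@': {header!r}")
    cluster, hash_sep, rest = name[1:].partition("#")
    tag, slash_sep, read_number = rest.rpartition("/")
    if not hash_sep or not slash_sep:
        raise ValueError(f"unexpected ID layout: {header!r}")
    return cluster, tag, int(read_number)
```

Two records belong to the same cluster when the first element of the returned tuple is identical, whatever the read number is.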

So for the following tag read:

@IPAR1:2:1:4029:1196:1#0/2
TGACCTTGATCTCGT
+
HIHIIGIIIH8CCDC

The two corresponding real reads are the /1 and /3 records. In the Illumina pipeline the tag read is automatically taken into account and then added to the ID lines of R1 and R2 (the reverse read takes the R2 designation), like so:

Quote:
@HWUSI-EAS100R:6:73:941:1973#NNNNN/1 (NNNNN = tag)
When you split the files (either with your own script or with the one from QIIME), make sure that you add the tag sequence to the ID; otherwise it may be difficult to keep track of it later on.
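
If you write your own script, rewriting the ID could look something like the sketch below. It assumes the header passed in is the ID only (no sequence on the same line) and has the `...#0/N` layout shown in this thread; `add_tag_to_id` is a made-up helper name, and renumbering the reverse read from /3 to /2 is optional.

```python
def add_tag_to_id(header, tag, read_number=None):
    """Replace the placeholder after '#' in a read ID with the tag sequence.

    '@IPAR1:2:1:4029:1196:1#0/3' with tag 'TGACCTTG' and read_number=2
    becomes '@IPAR1:2:1:4029:1196:1#TGACCTTG/2'. If read_number is None,
    the original number after the last '/' is kept.
    """
    prefix, hash_sep, rest = header.strip().partition("#")
    _old_tag, slash_sep, old_number = rest.rpartition("/")
    if not hash_sep or not slash_sep:
        raise ValueError(f"unexpected ID layout: {header!r}")
    number = old_number if read_number is None else read_number
    return f"{prefix}#{tag}/{number}"
```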

You should also format the files so that they are in the correct FASTQ format:

Quote:
@ID
Sequence goes on this line
+
Quality values for corresponding bases on this line
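
A minimal Python sketch of that reformatting step, assuming the records in your files really look like the two-line examples posted above (`@ID SEQUENCE` on one line, `+ QUALITIES` on the next). The function name `normalize_record` is made up for illustration.

```python
def normalize_record(header_line, plus_line):
    """Turn a flattened two-line record into the four standard FASTQ lines.

    `header_line` looks like '@ID SEQUENCE' and `plus_line` like
    '+ QUALITIES'. Returns [id, sequence, '+', quality].
    """
    read_id, _, sequence = header_line.strip().partition(" ")
    plus, _, quality = plus_line.strip().partition(" ")
    if not read_id.startswith("@") or plus != "+":
        raise ValueError("not a flattened FASTQ record")
    # Every base needs exactly one quality character.
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return [read_id, sequence, "+", quality]
```

Writing the four returned strings, one per line, gives a record in the layout shown in the quote above.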
Old 08-20-2013, 02:41 PM   #7
rzeng
Member
 
Location: houston

Join Date: Aug 2013
Posts: 19
Default

Thanks GenoMax, that helps a lot!

So my data ID lines do NOT have tags on them. Does that mean my data have not been processed by the Illumina pipeline?

Can I ask Illumina to add the tags to the ID lines using the Illumina pipeline, or can I do it myself by downloading the pipeline? I am confused because, as I understand it, the tags should ALREADY have been added when I got the raw data for R1 and R2, right?
Old 08-20-2013, 03:19 PM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077
Default

Your files have been processed by the Illumina pipeline, but the samples have not been demultiplexed. If the samples had been demultiplexed by your sequence provider, they would have given you just two files per sample (R1 and R2 reads). You seem to have three files, with the tag in a separate file.

If you are able to ask the provider to demultiplex the samples, that would be the best solution, but since this data is old it may not be feasible at this time.
Old 08-21-2013, 12:33 PM   #9
rzeng
Member
 
Location: houston

Join Date: Aug 2013
Posts: 19
Default

GenoMax,

I have split the more than 40,000,000 tag reads and grouped them into 6 separate files using the 6 different barcode sequences I was given separately. I want to confirm with you that the next step is to use my own script or the one from QIIME to split the R1 and R2 files using EACH of the 6 separate files, right? R1 and R2 do not contain the barcode tags (or barcode sequences), but they do have similar ID header lines, as follows.

R1 file
@IPAR1:2:1:4029:1196:1#0/1

Split barcode file
@IPAR1:2:1:4029:1196:1#0/2

R2 file
@IPAR1:2:1:4029:1196:1#0/3

Can the QIIME script help me split R1 and R2 using just these ID header lines?

Last edited by rzeng; 08-21-2013 at 12:37 PM.
Old 08-21-2013, 01:15 PM   #10
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077
Default

It sounds like even though your files were not demultiplexed, they were somehow sorted on the tags so that all the corresponding R1 and R3 reads (based on R2) were in the same order in the original files. If you have managed to separate the samples into 6 files, are you able to write a script that adds the "tag" to the IDs of the separated files, so that you end up with something like the following (NNNNN stands for the tag):

R1 file
@IPAR1:2:1:4029:1196:1#NNNNN/1

R2 file
@IPAR1:2:1:4029:1196:1#NNNNN/2

If you are not able to do this yourself, then the better option is below.

The script included in the QIIME package will take as input the R1 file along with the R2 (tag) file and will then produce separate files for each of your samples (it sounds like you have 6). You will then repeat the process with the R3 file along with the R2 (tag) file to produce the corresponding files that will contain the paired-end reads.

You can run the QIIME script as follows (disclaimer: I have not used the QIIME script myself, but based on the information on its help page I expect it to work as noted below).

Code:
$ split_libraries_fastq.py -i /path_to/Read1.fastq -b /path_to/Read2.fastq --store_demultiplexed_fastq --barcode-type 8 --sample-id replace_with_sample_name -o output_dir_name
Repeat for the paired read:
Code:
$ split_libraries_fastq.py -i /path_to/Read3.fastq -b /path_to/Read2.fastq --store_demultiplexed_fastq --barcode-type 8 --sample-id replace_with_sample_name -o output_dir_name
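
If you would rather write your own script than use QIIME, the same idea can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the QIIME implementation: it assumes standard four-line FASTQ records with no blank lines, that the read file and the tag file are in the same order, and that the barcode is the first 8 bases of each tag read. The function names `read_fastq`, `assign_reads`, and `split_to_files` are made up for illustration, and `barcodes` is a dictionary (barcode sequence to sample name) that you would fill from your own barcode file.

```python
from itertools import zip_longest


def read_fastq(handle):
    """Yield (id, sequence, quality) from a standard four-line FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:                      # end of file (or a blank line)
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()                   # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def assign_reads(read_handle, tag_handle, barcodes, barcode_length=8):
    """Yield (sample, id, sequence, quality) for each read, in file order.

    The sample is looked up from the tag read at the same position; reads
    whose tag has no exact barcode match are labelled 'unassigned'.
    """
    pairs = zip_longest(read_fastq(read_handle), read_fastq(tag_handle))
    for read, tag in pairs:
        if read is None or tag is None:
            raise ValueError("read file and tag file have different lengths")
        # IDs must agree once the trailing /1, /2 or /3 is removed.
        if read[0].rsplit("/", 1)[0] != tag[0].rsplit("/", 1)[0]:
            raise ValueError(f"ID mismatch: {read[0]} vs {tag[0]}")
        sample = barcodes.get(tag[1][:barcode_length], "unassigned")
        yield (sample, *read)


def split_to_files(read_path, tag_path, barcodes, out_prefix):
    """Write one FASTQ file per sample, named '<out_prefix>_<sample>.fastq'."""
    outputs = {}
    try:
        with open(read_path) as reads, open(tag_path) as tags:
            for sample, read_id, sequence, quality in assign_reads(
                reads, tags, barcodes
            ):
                if sample not in outputs:
                    outputs[sample] = open(f"{out_prefix}_{sample}.fastq", "w")
                outputs[sample].write(f"{read_id}\n{sequence}\n+\n{quality}\n")
    finally:
        for handle in outputs.values():
            handle.close()
```

As with the QIIME commands, `split_to_files` would be run twice against the same tag file: once with the R1 file and once with the R3 file, using a different `out_prefix` each time. Because both runs walk the tag file in the same order, the per-sample outputs should stay paired.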

Last edited by GenoMax; 08-21-2013 at 01:53 PM.