Old 08-20-2013, 09:42 AM   #5
rzeng
Member
 
Location: houston

Join Date: Aug 2013
Posts: 19

Thank you GenoMax,

Sadly, I took over someone else's project without knowing much about the background of the original data (I cannot even get in contact with the person who prepared these data).

The FASTQ data consist of three data sets (forward sequences in data 1, barcode sequences in data 2 and reverse sequences in data 3), each containing more than 40,000,000 different sequences. The names/IDs of the sequences in the three data sets are very similar apart from the XXXX fields, as shown below.

Name/ID of sequence data

@IPAR1:2:1:XXXX:XXXX:1#0/1 forward read data 1
@IPAR1:2:1:XXXX:XXXX:1#0/2 barcode read data 2
@IPAR1:2:1:XXXX:XXXX:1#0/3 reverse read data 3

All the sequences in each data set are in the same order.

For example,

the 18th sequence in data 1 is
@IPAR1:2:1:4029:1196:1#0/1
ATTTTGCCACATACAAAAGAATCTACGTTCTTCTCAGCACCTCATGGAATCTTCTCTAAAATATATCATATAATAGGACACAAAAGAA
+
BHGHHHHHHHHGDDFHHHGGDGHFHFHHHHGD>GEEG>GFHHHHFHBBHFHHHHEHHHHHHBAFHHBBEHHHFEHGBECEHFHHFAHF

the 18th sequence (15 bp) in data 2 is

@IPAR1:2:1:4029:1196:1#0/2
TGACCTTGATCTCGT
+
HIHIIGIIIH8CCDC

the 18th sequence in data 3 is
@IPAR1:2:1:4029:1196:1#0/3
GATATAATGGATGGGATTATTTCAATCTTTTATCTATTGAGGCTTCTTTTGTGTCCTATTATATGATATATTTTAGAGAAGATTCCAT
+
IIHIIIIHIIDEGGGEBG>GIIFIHHIHIIIIIFIDE4G@GG<GGEGBGG?AACCIIBIIBDIIIFDII>IIIIDIH@DFIGBI@IEE

However, the barcode sequence (15 bp) in data 2 does not appear within the sequences in data 1 and 3, so I cannot sort the sequences in data 1 and 3 using data 2 directly. I do have extra barcode information for splitting the 400,000,000 barcode sequences in data 2: six different 8-mer barcode sequences, each of which overlaps sequences in data 2. For example, TGACCTTG overlaps the sequence in

@IPAR1:2:1:4029:1196:1#0/2
TGACCTTGATCTCGT
+
HIHIIGIIIH8CCDC
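
A minimal sketch of that barcode lookup, assuming the 8-mer sits at the start of the 15 bp barcode read (as TGACCTTG does in the record above) and that an exact match is required. The function name assign_sample and the sample label "sample1" are hypothetical; the other five barcodes are not given in this post and are left out.

```python
def assign_sample(barcode_read, barcodes, length=8):
    """Return the sample whose barcode equals the first `length` bases, else None.

    Assumption: the known 8-mer is a prefix of the barcode read.
    """
    return barcodes.get(barcode_read[:length])


# Only TGACCTTG comes from the post; "sample1" is a made-up label.
barcodes = {"TGACCTTG": "sample1"}
print(assign_sample("TGACCTTGATCTCGT", barcodes))
```

If the 8-mer can start at another offset, or if mismatches should be tolerated, the slice and the exact dictionary lookup would both need to change.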


So, I used this barcode information to split data 2 into 6 different files (each representing one sample). At this point, I need to go back to the forward/reverse sequences in data 1 and 3 and split them into 6 different files too.
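
Because the three data sets are in the same order, one possible approach is to read them in lockstep and write each forward/reverse record to the file for the sample that its barcode read maps to. The sketch below assumes uncompressed FASTQ with exactly four lines per record, IDs that differ only in the part after the last "/", and a barcode at the start of the barcode read. All function names, file names and the "unassigned" label are hypothetical.

```python
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from 4-line FASTQ records."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:
            return  # end of file (an incomplete trailing record is ignored)
        yield tuple(record)


def read_key(header):
    """Drop the trailing /1, /2 or /3 so IDs can be compared across files."""
    return header.rsplit("/", 1)[0]


def split_by_barcode(fwd_path, bc_path, rev_path, barcodes, out_prefix, length=8):
    """Split forward and reverse reads by the barcode read at the same position.

    Returns a dict of sample name -> number of read pairs written.
    Reads with no matching barcode go to the "unassigned" files.
    """
    counts = {}
    outputs = {}  # sample -> (forward file handle, reverse file handle)
    try:
        with open(fwd_path) as fwd, open(bc_path) as bc, open(rev_path) as rev:
            # zip stops at the shortest file, so files of unequal length
            # are truncated silently; compare record counts separately.
            for r1, r2, r3 in zip(read_fastq(fwd), read_fastq(bc), read_fastq(rev)):
                if not (read_key(r1[0]) == read_key(r2[0]) == read_key(r3[0])):
                    raise ValueError("files out of sync at " + r1[0])
                sample = barcodes.get(r2[1][:length], "unassigned")
                if sample not in outputs:
                    out1 = open(f"{out_prefix}{sample}_1.fastq", "w")
                    out3 = open(f"{out_prefix}{sample}_3.fastq", "w")
                    outputs[sample] = (out1, out3)
                out1, out3 = outputs[sample]
                out1.write("\n".join(r1) + "\n")
                out3.write("\n".join(r3) + "\n")
                counts[sample] = counts.get(sample, 0) + 1
    finally:
        for out1, out3 in outputs.values():
            out1.close()
            out3.close()
    return counts
```

A call might look like split_by_barcode("data1.fastq", "data2.fastq", "data3.fastq", {"TGACCTTG": "sample1"}, "split_"), with the remaining five barcodes added to the dictionary. Streaming avoids holding tens of millions of read IDs in memory, and the ID check stops the run if the files turn out not to be in the same order after all.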

PS. Sorry for the complicated explanation above, but my case is really different from the other cases of Illumina read data I could find anywhere.

Any suggestions would be much appreciated!