Hi,
I have gone through various SeqAns posts on duplicate removal but could not find the answer I need. I am a molecular biologist new to bioinformatics, so I have a few questions.
I have Illumina DNA 2x100 paired-end reads. FastQC analysis indicated a large number of duplicates, which seems to be correct. Since the dataset is too big, I wanted to remove the duplicates, so I used Galaxy. I first ran FASTQ Groomer and then FASTX Collapse on the R1 and R2 reads separately. My plan was to first remove duplicates, then filter and trim my sequences, and finally assemble them with Velvet. As far as I know, Velvet requires the paired-end reads to be shuffled prior to assembly. I therefore have a few questions about my approach:
1) The FASTX Collapse tool gives the sequences its own headers, so it seems that the paired-end information is lost. Am I right, or have only the headers changed while the information is still there? If so, where is it?
2) I ran grooming and FASTX Collapse on the R1 and R2 reads separately. Should I first shuffle my reads with Velvet and then run FASTX Collapse on the shuffled sequences, OR
3) should I first join the paired-end data and then run the FASTX tool? In that case, how do I do the shuffling with Velvet?
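To make clear what I am after, here is a minimal Python sketch of duplicate removal that keeps mates together. It is only an illustration under my own assumptions: the function names `read_fastq` and `dedup_pairs` are made up, the input is plain 4-line FASTQ with R1 and R2 records in the same order, and a pair counts as a duplicate only when both the R1 and the R2 sequence have been seen together before. It is not meant to describe what the FASTX tool or Velvet does internally. The output is interleaved (R1 record, then its R2 mate), which is what I understand "shuffled" to mean.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, quality) from an iterable of 4-line FASTQ records."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            return  # end of input (or a truncated trailing record)
        header, seq, _plus, qual = (s.rstrip("\n") for s in record)
        yield header, seq, qual


def dedup_pairs(r1_lines, r2_lines):
    """Keep the first occurrence of each (R1 seq, R2 seq) pair.

    Returns interleaved FASTQ lines: each R1 record followed by its R2 mate.
    Assumes both inputs list the pairs in the same order.
    """
    seen = set()
    out = []
    pairs = zip(read_fastq(r1_lines), read_fastq(r2_lines))
    for (h1, s1, q1), (h2, s2, q2) in pairs:
        key = (s1, s2)  # a duplicate must match on BOTH mates
        if key in seen:
            continue
        seen.add(key)
        out.extend([h1, s1, "+", q1, h2, s2, "+", q2])
    return out


if __name__ == "__main__":
    # Tiny made-up example: pair b duplicates pair a; pair c differs in R2.
    r1 = ["@a/1", "ACGT", "+", "IIII", "@b/1", "ACGT", "+", "IIII",
          "@c/1", "ACGT", "+", "IIII"]
    r2 = ["@a/2", "TTTT", "+", "IIII", "@b/2", "TTTT", "+", "IIII",
          "@c/2", "GGGG", "+", "IIII"]
    print("\n".join(dedup_pairs(r1, r2)))
```

Note that this keeps every unique pair in memory, so it would not scale to my full dataset as written. It is only meant to show the behaviour I am looking for: original headers preserved and mates never separated.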
I would appreciate it if someone could answer these questions.
Regards,
Archana