Hi everyone,
I am new to this forum. I have recently been working with about 100 Gb of data from an Illumina HiSeq 2000. Before assembly, I want to remove reads that contain sequencing errors or are highly repetitive, based on k-mer frequencies. I used Meryl to count the k-mers because it supports k-mer sizes larger than 32. I set k to 59 and output the k-mers that occur more than 5 times. After that, however, I have no idea how to pick out the reads that the low-abundance k-mers came from. Should I use cd-hit-est-2d to align the 101 bp reads against the low-abundance k-mers? Since the k-mers used as the reference (e.g. 59-mers) are shorter than the 101 bp query reads, will it correctly separate the reads into a matched set and an unmatched set? Could someone kindly give me any suggestions? I am really lost.
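To make the goal concrete, the filtering step described above can be sketched without any alignment: slide a window of length k over each read and keep the read only if every k-mer in it is in the set of abundant k-mers. This is only a minimal Python sketch under several assumptions: the function names are hypothetical, k-mers are treated as canonical (a k-mer and its reverse complement count as the same), and the whole k-mer set is held in memory, which would not scale to 100 Gb of reads without a more compact structure. Here the counts are computed by the script itself on toy data; in practice they would come from the k-mer counter's output.

```python
from collections import Counter

# Translation table for complementing A/C/G/T; other symbols (e.g. N) pass through.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def kmers(read, k):
    """Yield every canonical k-mer of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield canonical(read[i:i + k])


def count_kmers(reads, k):
    """Count canonical k-mers across all reads (stand-in for a real k-mer counter)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def split_reads(reads, solid, k):
    """Split reads into (kept, rejected).

    A read is kept only if all of its k-mers are in `solid`, the set of
    abundant k-mers. Reads shorter than k have no k-mers and are kept.
    """
    kept, rejected = [], []
    for read in reads:
        if all(km in solid for km in kmers(read, k)):
            kept.append(read)
        else:
            rejected.append(read)
    return kept, rejected


if __name__ == "__main__":
    # Toy data: k = 5 and a threshold of 2 stand in for k = 59 and 5.
    toy_reads = ["ACGTACGTAC"] * 3 + ["ACGTTCGTAC"]
    counts = count_kmers(toy_reads, 5)
    solid = {km for km, n in counts.items() if n > 2}
    kept, rejected = split_reads(toy_reads, solid, 5)
    print(len(kept), "kept;", len(rejected), "rejected")
```

Requiring every k-mer to be abundant is the strictest choice; a looser rule, such as tolerating a few low-abundance k-mers per read, would only change the condition inside `split_reads`.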
Best regards