Hi!
I am using QIIME to process my dataset of 2 million joined PE-reads (~500 bp).
I want to use cd-hit as my OTU picker. cd-hit requires a fastq file in which every sequence starts with the 5' forward amplification primer.
With QIIME, I could convert the fastq to a fasta and a quality file.
I then ran split_libraries.py:
split_libraries.py -f in.fasta -m mapping.txt -q qualityfile -l 200 --keep-primer -a 0 -H 6 -o Split_lib -b 8 -M 2
The --keep-primer option keeps the primer sequence in each read, -l 200 discards reads shorter than 200 bp, -a 0 discards reads containing any ambiguous base (N), and -H 6 discards reads with homopolymer runs longer than 6.
The problem is that, while this command runs smoothly, the corresponding quality file is not processed, so I cannot convert the resulting fasta back to fastq to run cd-hit.
As far as I can see, there are several options:
1) Run QIIME's split_libraries_fastq.py instead, but the nice options of the fasta script are not available there.
What would be the most equivalent command line for split_libraries_fastq.py with:
- removal of ambiguous base calls
- removal of short reads
- removal of barcodes
- removal of reads with long homopolymer calls
- keeping the primers
?
2) Compare the headers in the resulting fasta to the headers in the quality file, and do some kind of whitelisting. I have no idea how to do that, though.
3) Find a way to tell split_libraries.py to process the quality file as well. I thought that was what -q in split_libraries.py would do, but it did not produce a filtered quality file.
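To make option 2 more concrete, here is a rough, untested-on-real-data sketch of what I imagine the whitelisting could look like in plain Python. It assumes that the split_libraries.py output headers look like ">SampleID_0 ORIGINAL_ID orig_bc=..." (the original read ID as the second field) and that the quality file uses the original read ID as the first field of each header. The function names are made up for illustration. It only selects the quality records of retained reads; it does not trim the scores to match the removed barcodes, so that would still have to be handled before building a fastq.

```python
def read_records(lines):
    """Yield (header, body_lines) from FASTA-style or .qual-style lines."""
    header, body = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, body
            header, body = line[1:], []
        else:
            body.append(line)
    if header is not None:
        yield header, body


def retained_original_ids(filtered_fasta_lines):
    """Collect the original read IDs of the reads that survived filtering.

    ASSUMPTION: the original ID is the second whitespace-separated field
    of each header in the filtered fasta.
    """
    ids = set()
    for header, _ in read_records(filtered_fasta_lines):
        fields = header.split()
        if len(fields) >= 2:
            ids.add(fields[1])
    return ids


def whitelist_qual(qual_lines, keep_ids):
    """Yield the lines of only those quality records whose ID is in keep_ids.

    ASSUMPTION: the read ID is the first field of each quality header.
    """
    for header, body in read_records(qual_lines):
        if header.split()[0] in keep_ids:
            yield ">" + header
            for line in body:
                yield line


if __name__ == "__main__":
    # Tiny in-memory example; with real data, pass open file handles instead.
    filtered = [
        ">S1_0 readA orig_bc=ACGTACGT new_bc=ACGTACGT bc_diffs=0",
        "ACGTACGT",
    ]
    qual = [">readA", "30 30 30 30", ">readB", "10 10 10"]
    for out_line in whitelist_qual(qual, retained_original_ids(filtered)):
        print(out_line)
```

With real files, the same functions should work on open file handles, for example the seqs.fna from the Split_lib directory and my original quality file, since both are just iterated line by line.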
Any help is really, really appreciated.