Is there an established command-line parameter (or pre-written script) to specify that certain reads should or could be present in multiple copies in an assembly? For instance, I have reason to suspect that the genome may have up to 8 copies of one of the short contigs from a previous round of assembly, but even if I copy the contig sequence file several times, it only aligns with itself and with the highest-scoring bridging sequence from a gap-closure PCR sequencing reaction. Sometimes the bridges to other contigs co-assemble, with significant mismatch errors where they don't match one of the flanking contigs; other times the lower-quality sequencing reaction doesn't assemble with anything.
To give an example: I have a circular genome with the following order of sequenced contigs (assembled from 454-type reads):
A-B-C-D-E-C-D-F-G, where each - represents unsequenced intervening DNA (for which I have PCR products for Sanger-type gap-closure sequencing).
With the intervening DNA sequenced, the "default" options give an assembly of something like A-B-C-D-F-G. I'd rather see an assembly something like
A-B-C + C-D-E + C-D-F + D-F-G (+ separates assembly contigs) or even A-B-C-D-F-G + D-E-C.
Is it something with the -preassemble command-line option? I don't think it's -retain_duplicates. My method so far has been to manually create a separate chromat_dir for each pre-assembled contig or supercontig, place all my putative repeated contigs in all subdirectories (e.g. A, B, C, and D in one directory; C, D, F, and G in another), then run the phrap assembly program on each directory and check manually in consed whether any subdirectory has a "duplicate" contig assembled in a different location than in the other directories. This is time-consuming both computationally and on the review end. Note also that the phred_crossmatch scripts don't seem to run properly on symbolically linked files, so I end up with hundreds to thousands of unneeded sequencing files on my computer (i.e. I'd rather use ln -s to point to the relevant copies in the main edit_dir/chromat_dir than cp).
I'm running 64-bit Ubuntu on an Intel i5. I'd be willing to use another package that is free for academic use, on Linux or on a PC (Windows 7 or XP).
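To make the manual setup described above concrete, here is a minimal Python sketch of how the per-group chromat_dir directories could be populated automatically. It only arranges files and does not run phred, phrap, or consed. The function name `build_group_dirs`, the `contig_files` and `groups` mappings, and all file and directory names are hypothetical placeholders, not part of any existing package. It copies files by default, since the symlinked files reportedly break the phred_crossmatch scripts; `use_copy=False` switches to symlinks for anyone who wants to test that.

```python
import os
import shutil
from pathlib import Path


def build_group_dirs(source_dir, contig_files, groups, dest_root, use_copy=True):
    """Create <dest_root>/<group>/chromat_dir for each group of contigs.

    source_dir   -- main chromat_dir holding the original read files
    contig_files -- hypothetical mapping: contig name -> list of read file names
    groups       -- hypothetical mapping: group name -> list of contig names
    use_copy     -- copy files (default) or create symlinks instead
    Returns a mapping of group name -> list of paths that were placed.
    """
    source_dir = Path(source_dir)
    dest_root = Path(dest_root)
    placed = {}
    for group_name, contigs in groups.items():
        chromat_dir = dest_root / group_name / "chromat_dir"
        chromat_dir.mkdir(parents=True, exist_ok=True)
        placed[group_name] = []
        for contig in contigs:
            for name in contig_files[contig]:
                src = source_dir / name
                dst = chromat_dir / name
                # Remove a stale file or (possibly broken) link from an earlier run.
                if dst.exists() or dst.is_symlink():
                    dst.unlink()
                if use_copy:
                    shutil.copy2(src, dst)
                else:
                    os.symlink(src.resolve(), dst)
                placed[group_name].append(dst)
    return placed
```

A call for the example above might look like `build_group_dirs("chromat_dir", contig_files, {"ABCD": ["A", "B", "C", "D"], "CDFG": ["C", "D", "F", "G"]}, "runs")`, after which the assembly would still have to be run separately in each `runs/<group>` directory.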