I am trying to make something out of a de novo assembly of WGS data from a heterozygous plant.
It won't go past the contig stage (scaffolding improved the average contig length by 1 nt!) for several reasons.
However, the reference sequence (the species of one of the parents) is 85% covered by contigs, up to a few hundred contigs deep in places. I would like to create localized consensus sequences by collapsing all the reads in each covered stretch into one consensus, without "inserting" the reference sequence (as GATK's FastaAlternateReferenceMaker would do). This should be possible by checking bedtools genomecov output for stretches with coverage, isolating all reads for each region from the SAM file, and then creating a consensus for that stretch, essentially making a scaffold. That would make it easier to get some use out of the assembly attempt, for instance as a BLAST target for finding certain genes or promoter regions. For an entire plant genome, however, even I will run out of patience trying to do this by hand!
Is anybody good with Python or another language who could write a script? As I said, it should be possible by finding uninterrupted coverage stretches from bedtools genomecov, grouping the reads with their alignment info from the SAM file, and then using something like GATK's FastaAlternateReferenceMaker on each stretch.
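A minimal sketch of the steps described above, in plain Python with no external libraries. It is an illustration under assumptions, not a finished tool: the function names `covered_intervals` and `consensus` are made up for this example, the depth rows are assumed to be already parsed from per-base output such as `bedtools genomecov -d` into `(chrom, pos, depth)` tuples, and the reads are assumed to be `(pos, cigar, seq)` taken from SAM columns 4, 6 and 10. The consensus is a simple majority vote per reference position, so heterozygous sites collapse to the more frequent base, insertions relative to the reference are dropped, and positions with no base evidence become `N` rather than the reference base.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation letter.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def covered_intervals(depth_rows, min_depth=1):
    """Merge per-base rows (chrom, pos, depth) into uninterrupted
    covered stretches, returned as (chrom, start, end), 1-based inclusive."""
    intervals = []
    current = None  # [chrom, start, end] of the stretch being extended
    for chrom, pos, depth in depth_rows:
        if depth >= min_depth:
            if current and current[0] == chrom and pos == current[2] + 1:
                current[2] = pos  # extend the open stretch
            else:
                if current:
                    intervals.append(tuple(current))
                current = [chrom, pos, pos]  # open a new stretch
        elif current:
            intervals.append(tuple(current))  # coverage gap closes the stretch
            current = None
    if current:
        intervals.append(tuple(current))
    return intervals


def consensus(reads, start, end):
    """Majority-vote consensus over reference positions start..end
    (1-based, inclusive). reads: iterable of (pos, cigar, seq)."""
    columns = [Counter() for _ in range(end - start + 1)]
    for pos, cigar, seq in reads:
        ref, qry = pos, 0
        for length, op in CIGAR_RE.findall(cigar):
            length = int(length)
            if op in "M=X":  # consumes both reference and read
                for i in range(length):
                    if start <= ref + i <= end:
                        columns[ref + i - start][seq[qry + i].upper()] += 1
                ref += length
                qry += length
            elif op in "IS":  # consumes read only (insertions are ignored)
                qry += length
            elif op in "DN":  # consumes reference only
                ref += length
            # H and P consume neither
    # No reference base is ever inserted: uncovered positions become N.
    return "".join(c.most_common(1)[0][0] if c else "N" for c in columns)


if __name__ == "__main__":
    rows = [("chr1", 1, 0), ("chr1", 2, 3), ("chr1", 3, 5), ("chr1", 4, 0)]
    print(covered_intervals(rows))
    print(consensus([(2, "2M", "AC"), (2, "2M", "AC")], 2, 3))
```

For a whole plant genome the per-read Python loop will be slow, so this is mainly useful for checking the logic on a few regions before looking for an existing tool that does the same thing.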
Maybe all that's needed is an executable Linux wrapper script?
Any suggestions? Maybe there is already a tool out there that I have missed?