I'm trying to run the GATK Unified Genotyper on a set of coordinate-sorted BAM files, but when I run the UG it gives me this error message:
"Input files reads and reference have incompatible contigs: Order of contigs differences, which is unsafe."
From reading, I believe I need to re-order my reference FASTA file to match the contig order in the header of my coordinate-sorted BAM files, but I'm not sure how to do that. Is there code somewhere that will let me re-order my reference file to match a given BAM file's order?
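As far as I understand it, the order I need to match is the order of the @SQ lines in the BAM header, where each contig name sits in the SN: field. Here is a rough sketch of how I think that order could be pulled out, assuming the header has already been dumped to text (for example with `samtools view -H`). The function name `contig_order_from_header` is just something I made up for illustration.

```python
def contig_order_from_header(header_text):
    """Return contig names in the order their @SQ lines appear in a SAM/BAM header."""
    names = []
    for line in header_text.splitlines():
        # Only @SQ lines describe reference sequences; skip @HD, @RG, @PG, etc.
        if not line.startswith("@SQ"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs; SN holds the contig name.
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
                break
    return names
```

The returned list would then be the target order for the reference records.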
I'm using a reference that comprises ~5k small genomes, some of which are in pieces (~188k total sequence records). The file size is 7.3 GB.
I also think that maybe (probably?) I need to pull out specific references and run the UG on one reference at a time. It's a metagenomic project, and I was hoping to get results for the whole thing at once, but that might not be realistic. Even if I pull out single, well-covered genome references, some of them will be in hundreds of pieces themselves, so I'd still need a way to order my reference. I could probably write something in Perl to do this, but I'm not too strong a coder, and I'm worried that I'd have memory issues trying to hash 188k sequences and juggle them around.
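To make the memory worry concrete, here is a rough, untested-at-scale sketch of the kind of approach I have in mind, written in Python rather than Perl. Instead of hashing the sequences themselves, it only stores the byte offsets of each record and then copies records one at a time in the requested order, so the dictionary holds just a name and two integers per record. The function names `index_fasta` and `reorder_fasta` are hypothetical, and I'm assuming the record name is the text after `>` up to the first whitespace and that every name in the desired order exists in the reference.

```python
def index_fasta(path):
    """Map each record name to its (start, end) byte offsets in the file.

    Only names and offsets are kept in memory, never the sequences.
    """
    offsets = {}
    name, start, pos = None, 0, 0
    with open(path, "rb") as fh:
        for line in fh:
            if line.startswith(b">"):
                # Close the previous record at the start of this header line.
                if name is not None:
                    offsets[name] = (start, pos)
                fields = line[1:].split()
                name = fields[0].decode() if fields else ""
                start = pos
            pos += len(line)
        # Close the final record at end of file.
        if name is not None:
            offsets[name] = (start, pos)
    return offsets


def reorder_fasta(src, dst, order):
    """Write the records named in `order` from src to dst, in that order."""
    offsets = index_fasta(src)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for name in order:
            start, end = offsets[name]  # raises KeyError if the name is missing
            fin.seek(start)
            chunk = fin.read(end - start)  # one record in memory at a time
            fout.write(chunk)
            # Guard against a source file whose last record lacks a newline.
            if not chunk.endswith(b"\n"):
                fout.write(b"\n")
```

Something like `reorder_fasta("reference.fa", "reference.reordered.fa", names)` would be the intended call, where `names` is the contig order taken from the BAM header and both file names are placeholders. Records not listed in `names` are simply left out of the output.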
Can anyone offer me some guidance on this?