tirohia 09-26-2017 06:31 PM

removing chromosomes from a bam file.
I'm trying to upload a BAM file, from an alignment to GRCh38 that I've done, to a Google Genomics dataset, associating it with the reference set for GRCh38. The upload fails, and the reason given in the log file reads:


reference names must be a subset of those of the requested
    reference set: missing ["chr1" "chr10" "chr11" "chr11_KI270721v1_random" "chr12"
    "chr13" "chr14" "chr14_GL000009v2_random" "chr14_GL000194v1_random" "chr14_GL000225v1_random" ...

It goes on to list another 40 or so fragments. If I do a quick


samtools idxstats cal1.bam
I do indeed get a whole bunch of chromosome fragments listed. The best I can come up with is that the reference set on Google Genomics doesn't include all the fragments, hence the "reference names must be a subset" message.
The obvious way to test this is to remove those chromosomes from the BAM file. Unfortunately,


samtools view -b cal1.bam chr1 chr2 chr3 > cal-sub-1.bam
samtools index cal-sub-1.bam cal-sub-1.bai
samtools idxstats cal-sub-1.bam

produces a BAM file from which all the reads on the fragments have indeed been removed. idxstats still lists the fragments themselves, though, and when I try to load the file into the Google Genomics dataset I get the same error.

How do I remove all references to the fragments from the BAM file? Or is that not what the Google Genomics upload is objecting to?


GenoMax 09-27-2017 03:22 AM

Did you check the headers from the subset BAM files? Those may still contain the offending chromosomes.
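
A rough sketch of that check, not tested against the Google Genomics upload: `samtools view -H` prints the header, and `samtools view -b` with regions copies the full header, so the `@SQ` lines for the fragments will still be there after subsetting. The function name `filter_sq_header` and the file names `header.sam` and `cal-sub-1.reheader.bam` are hypothetical, chosen only for illustration. The function keeps every non-`@SQ` header line and only those `@SQ` lines whose `SN:` name is passed as an argument.

```shell
# Hypothetical helper: read SAM header text on stdin, write it to stdout,
# dropping every @SQ line whose SN: name is not among the arguments.
filter_sq_header() {
    awk -F'\t' -v keep="$*" '
        BEGIN {
            # Build a lookup table of the sequence names to keep.
            n = split(keep, names, " ")
            for (i = 1; i <= n; i++) want[names[i]] = 1
        }
        # Pass through @HD, @RG, @PG, @CO and any other non-@SQ line.
        $1 != "@SQ" { print; next }
        {
            # Print an @SQ line only if its SN: field is an exact match.
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SN:/ && (substr($i, 4) in want)) { print; next }
        }
    '
}

# Possible use with the files from this thread (assumes samtools is installed):
#   samtools view -H cal-sub-1.bam                     # inspect the current header
#   samtools view -H cal-sub-1.bam | filter_sq_header chr1 chr2 chr3 > header.sam
#   samtools reheader header.sam cal-sub-1.bam > cal-sub-1.reheader.bam
```

One caveat: a BAM record stores its reference as an index into the `@SQ` list, and `samtools reheader` only swaps the header text. Replacing the header this way is therefore only safe if the kept sequences stay at the same positions in the list, and reads whose mates map to a dropped fragment may still need separate handling.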

