09-26-2017, 06:31 PM   #1
Location: Auckland, NZ

Join Date: Nov 2011
Posts: 46
Removing chromosomes from a BAM file

I'm trying to upload a BAM file, from an alignment to GRCh38 that I've done, to a Google Genomics dataset, associating it with the reference set for GRCh38. The upload fails, and the reason given in the logfile reads:

reference names must be a subset of those of the requested
    reference set: missing ["chr1" "chr10" "chr11" "chr11_KI270721v1_random" "chr12"
    "chr13" "chr14" "chr14_GL000009v2_random" "chr14_GL000194v1_random" "chr14_GL000225v1_random" ...
It goes on to list another 40 or so fragments. If I run a quick

samtools idxstats cal1.bam
I do indeed get a whole bunch of chromosome fragments listed. The best I can come up with is that the reference set on Google Genomics doesn't include all the fragments, hence the "reference names must be a subset" message.
The obvious way to test this is to remove those chromosomes from the BAM file. Unfortunately,

samtools view -b cal1.bam chr1 chr2 chr3 > cal-sub-1.bam
samtools index cal-sub-1.bam cal-sub-1.bai
samtools idxstats cal-sub-1.bam
returns a BAM file from which all the reads on the fragments have indeed been removed. It still lists the fragments themselves, though, so when I try to load it into the Google Genomics dataset I get the same error.

How do I remove all references to the fragments from the BAM file? Or is that not what the Google Genomics upload is objecting to?
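
For what it's worth, here is a minimal sketch of one thing that might work: the fragment names probably survive because they are still declared as @SQ lines in the BAM header, so the idea is to filter those lines out of the SAM text stream before converting back to BAM. The helper name filter_sq and the output file name are hypothetical, and this is an untested assumption rather than a confirmed fix. Two caveats: reads whose mate maps to a removed fragment may make the final conversion fail, and since the error also lists "chr1" itself, the reference set may simply use different names (something to check before stripping anything).

```shell
#!/bin/sh
# filter_sq (hypothetical helper): reads SAM text on stdin and prints it
# unchanged, except that @SQ header lines are kept only when their SN: name
# is in the space-separated list given as the first argument.
filter_sq() {
  awk -F '\t' -v keep="$1" '
    BEGIN {
      # Build a lookup table of the reference names to keep.
      n = split(keep, names, " ")
      for (i = 1; i <= n; i++) want[names[i]] = 1
    }
    /^@SQ/ {
      # Print the @SQ line only if its SN: field names a wanted reference.
      for (i = 2; i <= NF; i++)
        if ($i ~ /^SN:/ && (substr($i, 4) in want)) { print; break }
      next
    }
    { print }   # every other header line and all alignment lines pass through
  '
}

# Intended use with samtools (assumed workflow, file names are placeholders):
#   samtools view -h cal1.bam chr1 chr2 chr3 \
#     | filter_sq "chr1 chr2 chr3" \
#     | samtools view -b -o cal-sub-1.clean.bam -
#   samtools index cal-sub-1.clean.bam
#   samtools idxstats cal-sub-1.clean.bam
```

If this behaves as hoped, idxstats on the new file should list only the references that were kept.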

tirohia