I have WGS data (2x100bp) that has been aligned to hg19. Using
HLA-VBSeq, I HLA typed the data, and now the researcher I am working with is interested in two allele sequences for HLA-E for gene editing purposes. I would really appreciate some feedback on my workflow since I have never done this type of analysis before.
Using samtools, I subset the list of reads covering the genomic region of HLA-E, then used picard FilterSamReads to output the filtered bam file. I converted the bam to paired fastq and then aligned the fastq to the HLA-E genomic DNA sequences for alleles E*01:03:01:01 and E*01:01:01:01 from the
IMGT/HLA database using BWA-MEM. I then sorted and indexed my bam files for each allele for viewing in IGV. Now, I have a couple questions:
1. Should I have aligned the filtered bam to the nucleotide coding sequences for both alleles instead of the genomic sequences?
2. The researcher wants all sequences aligned to E*01:03:01:01 and E*01:01:01:01. Would it be appropriate to convert the bam files to fasta files using
seqtk to deliver the sequences?
Code:
samtools bam2fq input.bam | seqtk seq -A > output.fa
I'm not sure if my workflow is correct, or if delivering the fasta for each allele is appropriate for gene editing downstream. Help?