We have written a little tool to sort reads in an allele-specific manner and thought we might share it in case someone else wants to do something similar.
SNPsplit is an allele-specific alignment sorter that reads alignment files in SAM/BAM format and determines the allelic origin of reads covering known SNP positions. For this to work, the library must have been aligned to a genome in which all SNP positions were masked with the ambiguity base 'N', using an aligner that can handle a reference genome containing ambiguous nucleobases, such as Bowtie 2 or TopHat. In addition, a list of all known SNP positions between the two genomes must be provided with the option --snp_file.
The SNP information to generate N-masked genomes needs to be acquired elsewhere, e.g. for different strains of mice you can find variant call files at the Mouse Genomes Project page at http://www.sanger.ac.uk/resources/mouse/genomes/. A description of how to generate N-masked genomes is beyond the scope of SNPsplit at the current time, but it might be added in the future.
Note that the determination of overlaps correctly handles the CIGAR operations M (alignment match, which may be a sequence match or a mismatch), D (deletion in the read), I (insertion in the read) and N (skipped regions, used for splice mapping by TopHat). Other CIGAR operations are currently not supported.
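To illustrate what handling these four CIGAR operations involves, here is a minimal Python sketch that finds which read base is aligned to a given reference position. This is not SNPsplit's own code; the function name `read_base_at` and its arguments are made up for this example, and rejecting every other operation simply mirrors the "not supported" statement above.

```python
import re

# Only the four operations named in the post are accepted in this sketch.
CIGAR_VALID = re.compile(r"(\d+[MIDN])+")
CIGAR_OPS = re.compile(r"(\d+)([MIDN])")


def read_base_at(ref_pos, aln_start, cigar, seq):
    """Return the read base aligned to 1-based reference position ref_pos.

    aln_start is the 1-based leftmost reference position of the alignment
    (the POS field of a SAM record). Returns None if the position is not
    covered by an M block, e.g. it lies in a deletion or a spliced region.
    """
    if not CIGAR_VALID.fullmatch(cigar):
        raise ValueError(f"unsupported CIGAR string: {cigar}")

    ref = aln_start  # next reference position to be consumed
    idx = 0          # next 0-based index into the read sequence

    for length, op in CIGAR_OPS.findall(cigar):
        n = int(length)
        if op == "M":
            # M consumes both the reference and the read.
            if ref <= ref_pos < ref + n:
                return seq[idx + (ref_pos - ref)]
            ref += n
            idx += n
        elif op == "I":
            # Insertion in the read: consumes read bases only.
            idx += n
        else:
            # D (deletion in the read) and N (skipped region):
            # both consume reference positions only.
            ref += n
    return None
```

For example, with an alignment starting at position 100 and CIGAR 2M3D2M, a SNP at position 105 falls on the third base of the read, while a SNP at position 103 lies inside the deletion and cannot be called from this read.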
• Supports single-end and paired-end BAM/SAM alignment files
• In paired-end mode, paired and singleton alignments may be merged or treated separately
• Supports Hi-C BAM files generated by HiCUP
• Individual output files for genome 1-specific, genome 2-specific and unassigned alignments
• Optional output file for conflicting alignments
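The output categories listed above could be decided roughly as in the following Python sketch. Please treat the decision rule as an assumption for illustration rather than SNPsplit's exact logic (the User Guide is the authority on that); the function `classify_read` and both dictionary layouts are hypothetical and do not reflect the format of the --snp_file input.

```python
def classify_read(observed, snps):
    """Sort one read into an allele category.

    observed: {(chrom, pos): base} for the read bases at SNP positions
    snps:     {(chrom, pos): (genome1_base, genome2_base)}

    Assumed rule (not taken from the SNPsplit source): a read is specific
    for a genome if it carries at least one allele of that genome and none
    of the other; reads carrying alleles of both genomes are conflicting.
    """
    g1_hits = 0
    g2_hits = 0
    for key, base in observed.items():
        alleles = snps.get(key)
        if alleles is None:
            continue  # not a known SNP position
        if base == alleles[0]:
            g1_hits += 1
        elif base == alleles[1]:
            g2_hits += 1
        # any other base (e.g. a sequencing error) is uninformative

    if g1_hits and g2_hits:
        return "conflicting"
    if g1_hits:
        return "genome1"
    if g2_hits:
        return "genome2"
    return "unassigned"
```

A read that covers no SNP position, or only shows bases matching neither genome, ends up as unassigned in this sketch.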
Here you can access the documentation for more information on the SNPsplit workflow: SNPsplit User Guide
Here is an example paired-end report: SNPsplit PE report
Here is an example Hi-C report: SNPsplit Hi-C report
SNPsplit is available for download from here: http://www.bioinformatics.babraham.a...ects/SNPsplit/. Comments welcome.