  • kevinrue
    Member
    • Jan 2013
    • 25

    HISAT2 / hisat2_extract_snps_haplotypes_VCF.py: only first variant in output

    Dear all,

    I am trying to apply the hisat2_extract_snps_haplotypes_VCF.py script to merged VCF files: one file per human chromosome, where all files contain genotypes for the same set of samples.

    A typical command I run is:

    hisat2_extract_snps_haplotypes_VCF.py --non-rs --verbose /scratch/genomes/grch37/84/fasta/Homo_sapiens.GRCh37.dna.primary_assembly.fa vcf/chr14_PASS_vep.vcf.gz HiSat2/extract/chr14_PASS_vep

    Facts:
    • The script typically takes between 20 min and 2 h 20 min, depending on the chromosome length.
    • However, the output SNP and haplotype files each contain a single line (i.e. one variant). I compared, and each time it is the first variant in the processed VCF file.
    • I do not see any error or output message, despite the --verbose option.
    • Considering the run time, it seems that the script is processing all variants, but not writing anything to the output after the first one.


    Is there some aspect of my VCF files that I should check for compatibility with the script?

    Many thanks in advance.

    Kevin
  • kevinrue
    Member
    • Jan 2013
    • 25

    #2
    Problem identified (not solved)

    Got it!

    I drilled down into the code of hisat2_extract_snps_haplotypes_VCF.py
    and found the following bit of code, which skips any variant whose ID (3rd column) is identical to that of the previous variant.

    Code:
    if varID == prev_varID:
        continue
    In my case, the VCF was generated from WGS, and variants are not yet annotated with any RSID (existing or novel).
    As a result, they all have the ID ".", which leads the code above to skip every variant after the first one. This also explains why the run time is still proportional to the chromosome length: the script does scan all variants in the VCF file, it just discards them.
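
    A possible alternative, given the check above, would be to give every variant a distinct ID before running the script. The sketch below is only an illustration under assumptions: the helper names (fill_missing_id, fill_missing_ids) and the CHROM_POS_REF_ALT naming scheme are made up for this example, and it has not been verified that the HISAT2 script accepts IDs of this form.

```python
import gzip


def open_text(path, mode):
    """Open a plain or gzip-compressed file in text mode, based on the extension."""
    if path.endswith(".gz"):
        return gzip.open(path, mode + "t")
    return open(path, mode)


def fill_missing_id(line):
    """Return the VCF line with a '.' ID replaced by CHROM_POS_REF_ALT.

    Header lines and records that already have an ID are returned unchanged.
    The naming scheme is an assumption chosen for this example.
    """
    if line.startswith("#"):
        return line
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 5 and fields[2] == ".":
        chrom, pos, _, ref, alt = fields[:5]
        fields[2] = "{}_{}_{}_{}".format(chrom, pos, ref, alt)
    return "\t".join(fields) + "\n"


def fill_missing_ids(in_path, out_path):
    """Copy a VCF file, assigning an ID to every record whose ID is '.'."""
    with open_text(in_path, "r") as src, open_text(out_path, "w") as dst:
        for line in src:
            dst.write(fill_missing_id(line))
```

    For example, fill_missing_ids("vcf/chr14_PASS_vep.vcf.gz", "vcf/chr14_PASS_vep.ids.vcf.gz") would write a copy in which consecutive variants no longer share the ID ".". Note that the output is written with plain gzip, not bgzip, so it cannot be indexed with tabix as is.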

    My solution:
    I'll just build a genome index using the UCSC snp146Common.txt file (as the current pre-built index available on the HISAT website seems to be based on dbSNP 144)

    Any comment welcome!
    Kevin
