
**GobiJerboa** · 08-09-2018, 08:33 AM

I came across this issue while trying to run Annovar on Strelka2 output.
Annovar throws an error without the GT tag so I did the following at the command line to add in GT information:

Set filenames

Code:

strelka_output_file="somatic.indels.passed.vcf"
strelka_mod="somatic.indels.passed.GTmod.vcf"

Add a GT FORMAT line to the VCF header:
Find the first ##FORMAT line in the header.
grep -n prints the line "n"umber and -m 1 stops after the first "m"atch.
sed with a leading "number"i inserts text before the specified line, e.g. a leading 8i makes the new text line 8 of the output.

Code:

first_format_num=$(grep -n -m 1 '##FORMAT' "$strelka_output_file" | cut -d : -f 1)
sed "$first_format_num"'i##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">' "$strelka_output_file" > "$strelka_mod"
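
If the file has no ##FORMAT line at all, first_format_num ends up empty and the sed call fails. Below is a minimal sketch of the same step with a guard for that case, run on a toy header. The temporary files and the toy header content are made up for illustration, and the one-line "i" form assumes GNU sed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy stand-in for the Strelka2 VCF (hypothetical content, illustration only).
demo_in=$(mktemp)
demo_out=$(mktemp)
{
  printf '##fileformat=VCFv4.1\n'
  printf '##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNORMAL\tTUMOR\n'
} > "$demo_in"

# Line number of the first ##FORMAT header line (empty if there is none).
first_format_num=$(grep -n -m 1 '^##FORMAT' "$demo_in" | cut -d : -f 1 || true)

if [ -z "$first_format_num" ]; then
  echo "no ##FORMAT line found in $demo_in" >&2
  exit 1
fi

# Insert the GT definition just before that line (GNU sed one-line "i" form).
sed "${first_format_num}"'i##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">' "$demo_in" > "$demo_out"

cat "$demo_out"
```

The GT line should come out as the second line of the toy file, directly above the existing ##FORMAT entry.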

Use sed with extended "r"egular expression support (-r) to edit the file "i"n place (-i).
All lines of my strelka output have the format string BCN50:DP:DP2:DP50:FDP50:SUBDP50:TAR:TIR:TOR
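Find "BCN50:" and replace it with "GT:" followed by the \1 capture group, so GT becomes the first key in the format string.
Find the tab following TOR and insert "0/0:" right after it, i.e. at the start of the Normal sample column. (In the replacement, \10 is the \1 group followed by a literal 0.)
From the same TOR starting point, also match the Normal column (anything except tabs) and the tab after it, then insert "0/1:" at the start of the tumor column.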

Code:

sed -ri 's|(BCN50:)|GT:\1|g' "$strelka_mod"
sed -ri 's|(:TOR\t)|\10/0:|g' "$strelka_mod"
sed -ri 's|(:TOR\t[^\t]*\t)|\10/1:|g' "$strelka_mod"
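
The same three edits can also be made in one pass by addressing the VCF columns directly, which avoids depending on the exact format string. This is a rough sketch using awk, assuming column 9 is FORMAT, column 10 is the Normal sample and column 11 is the tumor sample, as in the sed commands above. The toy record is shortened from a real line, and the GT header line still has to be added separately.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Shortened toy record (hypothetical, illustration only): three FORMAT keys, two samples.
demo_vcf=$(mktemp)
printf 'chr1\t803750\t.\tTA\tT\t.\tPASS\tSOMATIC\tBCN50:DP:TOR\t0.06:38:7,4\t0.09:25:4,2\n' > "$demo_vcf"

gt_line=$(awk 'BEGIN { FS = OFS = "\t" }
  /^#/ { print; next }      # pass header lines through unchanged
  {
    $9  = "GT:" $9          # FORMAT column: GT goes first
    $10 = "0/0:" $10        # Normal sample (placeholder genotype)
    $11 = "0/1:" $11        # tumor sample (placeholder genotype)
    print
  }' "$demo_vcf")

printf '%s\n' "$gt_line"
```

Because it works on columns rather than on the text ":TOR", this version should not care which FORMAT keys a given record carries.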

This changes the lines of strelka output from this:

Code:

chr1	803750	.	TA	T	.	PASS	IC=10;IHP=12;MQ=54.88;MQ0=0;NT=ref;QSI=35;QSI_NT=35;RC=11;RU=A;SGT=ref->het;SOMATIC;SomaticEVS=6.65;TQSI=2;TQSI_NT=2	BCN50:DP:DP2:DP50:FDP50:SUBDP50:TAR:TIR:TOR	0.06:38:38:35.07:2.02:0.00:31,40:0,0:7,4	0.09:25:25:23.33:2.12:0.00:17,22:4,4:4,2

To this:

Code:

chr1	803750	.	TA	T	.	PASS	IC=10;IHP=12;MQ=54.88;MQ0=0;NT=ref;QSI=35;QSI_NT=35;RC=11;RU=A;SGT=ref->het;SOMATIC;SomaticEVS=6.65;TQSI=2;TQSI_NT=2	GT:BCN50:DP:DP2:DP50:FDP50:SUBDP50:TAR:TIR:TOR	0/0:0.06:38:38:35.07:2.02:0.00:31,40:0,0:7,4	0/1:0.09:25:25:23.33:2.12:0.00:17,22:4,4:4,2

Admittedly, the inserted 0/0 and 0/1 values are placeholders and don't necessarily reflect the true homozygous / heterozygous status, but they were enough to get Annovar to run.
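
One possible refinement is to derive the genotypes from the SGT entry in the INFO column (e.g. SGT=ref->het in the line above) instead of hard-coding them. The sketch below assumes that the state before "->" describes the Normal sample and the state after it the tumor sample, and that ref, het and hom map to 0/0, 0/1 and 1/1. That mapping is an assumption made for illustration, not something checked against the Strelka2 documentation, and anything unrecognised falls back to "./.".

```shell
#!/usr/bin/env bash
set -euo pipefail

# Shortened toy record (hypothetical, illustration only).
demo_rec=$(printf 'chr1\t803750\t.\tTA\tT\t.\tPASS\tNT=ref;SGT=ref->het;SOMATIC\tBCN50:DP:TOR\t0.06:38:7,4\t0.09:25:4,2')

sgt_line=$(printf '%s\n' "$demo_rec" | awk 'BEGIN {
    FS = OFS = "\t"
    # Assumed mapping from SGT states to GT strings (illustrative, unverified).
    gt["ref"] = "0/0"; gt["het"] = "0/1"; gt["hom"] = "1/1"
  }
  /^#/ { print; next }                     # pass header lines through unchanged
  {
    normal = "./."; tumor = "./."          # fallback when SGT is missing or unrecognised
    n = split($8, info, ";")
    for (i = 1; i <= n; i++) {
      if (info[i] ~ /^SGT=/) {
        split(substr(info[i], 5), states, "->")
        if (states[1] in gt) normal = gt[states[1]]
        if (states[2] in gt) tumor  = gt[states[2]]
      }
    }
    $9 = "GT:" $9; $10 = normal ":" $10; $11 = tumor ":" $11
    print
  }')

printf '%s\n' "$sgt_line"
```

For the example record this gives the same 0/0 and 0/1 as the hard-coded version, but a record with a different SGT value would get different genotypes under the assumed mapping.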

Adding GT tag to Strelka Output
