SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
GATK - GenomeAnalysisTK - SomaticIndelDetector output Jane M Bioinformatics 2 08-20-2013 06:45 AM
read group: GATK or BWA option? m_elena_bioinfo Bioinformatics 9 12-09-2012 10:53 AM
GATK: GenomeAnalysisTK.jar file jorge Bioinformatics 2 06-05-2011 09:59 PM
No enrichment option Michael L. Altshuler Illumina/Solexa 2 01-03-2011 07:48 AM
Cufflinks with the -G option gen2prot Bioinformatics 0 12-23-2010 09:42 AM

Reply
 
Thread Tools
Old 01-13-2012, 02:51 AM   #1
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 239
Default GATK - GenomeAnalysisTK - option -R

Hello,

I am using GATK for the first time. I try to use SomaticIndelDetectorWalker. I have difficulties with the option -R ref.fasta.

I read:
Quote:
--refseq String NA Name of RefSeq transcript annotation file. If specified, indels will be annotated with GENOMIC/UTR/INTRON/CODING and with the gene name
From this, I understand that this option is not mandatory. When I try:
Quote:
java -Xmx2g -jar GenomeAnalysisTK.jar -T SomaticIndelDetector -o indels.vcf -verbose indels.txt -I:normal ../../../../data/patient1/s_garma-fibros_converted.bam -I:tumor ../../../../data/patient1/s_garma-296_converted_sorted.bam
I have the error:
Quote:
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 1.4-15-gcd43f01):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
##### ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
##### ERROR
##### ERROR MESSAGE: Walker requires a reference but none was provided.
##### ERROR ------------------------------------------------------------------------------------------
It seems that I should provide a reference fasta file. Then, I tried:

Quote:
java -Xmx2g -jar GenomeAnalysisTK.jar -R ~/hg19.fasta -T SomaticIndelDetector -o indels.vcf -verbose indels.txt -I:normal ../../../../data/patient1/s_garma-fibros_converted.bam -I:tumor ../../../../data/patient1/s_garma-296_converted_sorted.bam
but I have this error:
Quote:
##### ERROR A USER ERROR has occurred (version 1.4-15-gcd43f01):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
##### ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
##### ERROR
##### ERROR MESSAGE: Invalid command line: Failed to load reference dictionary
I think the problem is my fasta file. The one that I provide contains only:
Code:
chr1	249250621	6	50	51	
chr2	243199373	254235646	50	51	
chr3	198022430	502299013	50	51	
chr4	191154276	704281898	50	51	
chr5	180915260	899259266	50	51	
chr6	171115067	1083792838	50	51	
chr7	159138663	1258330213	50	51	
chr8	146364022	1420651656	50	51	
chr9	141213431	1569942965	50	51	
chr10	135534747	1713980672	50	51	
chr11	135006516	1852226121	50	51	
chr12	133851895	1989932775	50	51	
chr13	115169878	2126461715	50	51	
chr14	107349540	2243934998	50	51	
chr15	102531392	2353431536	50	51	
chr16	90354753	2458013563	50	51	
chr17	81195210	2550175419	50	51	
chr18	78077248	2632994541	50	51	
chr19	59128983	2712633341	50	51	
chr20	63025520	2772944911	50	51	
chr21	48129895	2837230949	50	51	
chr22	51304566	2886323449	50	51	
chrX	155270560	2938654113	50	51	
chrY	59373566	3097030091	50	51	
chrMt	16571	3157591135	50	51
Could someone explains me what is this refseq.fasta file precisely?
Thanks for any help
Jane M is offline   Reply With Quote
Old 01-13-2012, 03:41 AM   #2
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 239
Default

I would like to add this question: I intend to use only the nucleotides which have a PHRED over 30. I haven't found an option to do that. Is it possible?
Jane M is offline   Reply With Quote
Old 01-15-2012, 11:04 PM   #3
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 239
Default

Any idea?
I compared my refSeq.fasta file to the exampleFASTA.fasta file. I've seen that the example file contains the nucleotides of the chromosome 3. Which is definitively not the format that I've provided...
In my data, I've a folder containing one file per chromosome. Each file contains the nucleotide for the given chromosome. Should I merge those files to get the one I need?
Thank you
Jane M is offline   Reply With Quote
Old 01-16-2012, 12:32 AM   #4
zee
NGS specialist
 
Location: Malaysia

Join Date: Apr 2008
Posts: 249
Default

The "-R" option is for your reference genome fasta file. This is not to be confused with the refseq ROD for GATK (http://www.broadinstitute.org/gsa/wiki/index.php/RefSeq)

I think you will need to generate a sequence dictionary using Picard that has the name
~/hg19.dict.

java -jar $PICARD/CreateSequenceDictionary R=~/hg19.fasta O=~/hg19.dict

This may solve it, however later versions of GATK do it on-the-fly.

The help desk at http://getsatisfaction.com/gsa are very responsive.
zee is offline   Reply With Quote
Old 01-16-2012, 02:12 AM   #5
ulz_peter
Senior Member
 
Location: Graz, Austria

Join Date: Feb 2010
Posts: 219
Default

So...
It seems the file you provided os no fasta file, as fasta files should look like
Code:
>header
ACGTACGACTCGACTACAGCAC
>header2
ACTAGCTGCATGCATC
and so on. You provided the Dictionary. You may download the fasta files for the chromosomes here:
http://hgdownload.cse.ucsc.edu/golde...9/chromosomes/

And then unzip them and merge them into one fasta file (GATK is picky about the order of the fasta file, it should be chr1, chr2 ,chr3, ...chrX, chrY), for example using the cat command in Linux; You may also try to download the hg19.2bit file and extract the sequence using the twoBitToFa binary.

Another source for a human reference genome would be the 1000genomes project:
ftp://ftp.1000genomes.ebi.ac.uk/vol1...cal/reference/

Hope that helps,
Peter
ulz_peter is offline   Reply With Quote
Old 01-16-2012, 08:17 AM   #6
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 239
Default

Thanks for the answers.
I downloaded chr1.fa.gz from http://hgdownload.cse.ucsc.edu/golde...9/chromosomes/, I unzipped it. I tried to rerun both:

Quote:
java -Xmx2g -jar GenomeAnalysisTK.jar --refseq ~/chr1.fasta -T SomaticIndelDetector --minCoverage 10 -o indels.vcf -verbose indels.txt -I:normal ../../../../data/patient1/s_garma-fibros_converted.bam -I:tumor ../../../../data/patient1/s_garma-296_converted_sorted.bam
Quote:
java -Xmx2g -jar GenomeAnalysisTK.jar --refseq ~/chr1.fa -T SomaticIndelDetector --minCoverage 10 -o indels.vcf -verbose indels.txt -I:normal ../../../../data/patient1/s_garma-fibros_converted.bam -I:tumor ../../../../data/patient1/s_garma-296_converted_sorted.bam
but I still have the same error message... Isn't it suppose to work, even with a part of the genome?
Jane M is offline   Reply With Quote
Old 01-16-2012, 09:17 AM   #7
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 239
Default

Ok, so I made some progress! I changed the --refseq into -R option and I changed my s_garma-fibros_converted.bam file into s_garma-fibros_converted_sorted.bam file which is the sorted bam file.
Now, I have a new error:

Quote:
##### ERROR MESSAGE: SAM/BAM file ../../../../data/patient1/s_garma-fibros_converted_sorted.bam is malformed: SAM file doesn't have any read groups defined in the header. The GATK no longer supports SAM files without read groups
I got the BAM file this way:
Quote:
samtools view -b -S -t /home/m/hg19.fa s_garma-fibros_converted.sam > s_garma-fibros_converted.bam
samtools sort s_garma-fibros_converted.bam s_garma-fibros_converted_sorted
samtools index s_garma-fibros_converted_sorted.bam
Here http://www.broadinstitute.org/gsa/wi...s_for_the_GATK, I read that:
The file must be binary (.bam).
The file must be indexed.
The file must be sorted in coordinate order with respect to the reference (i.e. the contig ordering in your bam must exactly match that of the reference you are using).
The file must have a proper bam header with read groups. Each read group must contain the platform (PL) and sample (SM) tags. For the platform value, we currently support 454, LS454, Illumina, Solid, ABI_Solid, and CG (all case-insensitive).
Each read in the file must be associated with exactly one read group.

The first 2 steps are ok. For the index, I used the default option, it was enough to visualize the files in IGV but I don't know if it is the right index for this use.
And I haven't done the last 2 steps: I don"t know if the file has a proper bam header with read groups and if each read in the file is associated with exactly one read group, because I cannot open a bam file:

Code:
more s_garma-fibros_converted_sorted.bam
\�@��qrF�!���m>(Lj��m�c� 7�,7�cr"�1e��9���1X3�8�
                                                     ��5�PCY�N(ǒ���,6��^L�%D�<CN8ψ�{�/vϘ!,�.�g*��284��yf^L�Xa<s�S

[m@U1009-PCJane patient1]$ more s_garma-fibros_converted.sam
@PG	ID:illumina_export2sam.pl	VN:2.0.0	CL:/usr/local/bin/illumina_export2sam.pl --read1=s_gar
ma-fibros_1_export.txt --read2=s_garma-fibros_2_export.txt
HWI-ST584_81:4:1101:1198:2065	89	chr7	156837088	28	76M	*	0	0	GGCTTG
AACAACGGAAATGTGTCAAATGTGTCAGCTCCCAGCTCAGAGACTGGGAGACCAGGCCGAGGCGCCGGCN	######################################
######################################	BC:Z:AGTCAA	XD:Z:75A	SM:i:28	AS:i:0
HWI-ST584_81:4:1101:1198:2065	165	*	0	0	*	chr7	156837088	0	GGAGCC
CTGCTGCGTAGTNNNNNNNNNNACACGGTGTATTATTACTTTCCCAGGACCACCGTAACAAAGTAGCACA	CCCFFFFFHHHHGJHGIH####################
######################################	BC:Z:AGTCAA
HWI-ST584_81:4:1101:1174:2078	73	chr5	41048452	254	76M	*	0	0	AAACGT
GTTTTCCATAGGTCTACCAATTTTGGGTGAATTATCTCAGGCAGTATCTTCAAAAGCCCTATTGCACCAG	CCCFFFFDHHHHHIIIJJIJJIJJJJJJJJJJJGHGJJ
JIJIJJJJIJJJIIJJJJJJJIIJGIJJJIIIJJHHHA	BC:Z:AGTCAA	XD:Z:76	SM:i:359	AS:i:0
I can only open the SAM file. How can I check if my bam file is correctly formated if I cannot open it?
Jane M is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:50 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO