**mmmm** · 12-04-2013, 04:13 AM

Yes, the alignment looks very good in IGV. I think there is an issue with the GFF file, as the genes do not appear annotated in IGV (only the sequence appears, without gene annotation). I do not know how to fix the GFF file.

**GenoMax** · 12-04-2013, 04:14 AM

Can you post an example of the GFF file (first few lines would be fine)?

**mmmm** · 12-04-2013, 04:22 AM

>complete genome
NC_022544.1 RefSeq region 1 4814801 . + . ID=id0;Dbxref=taxon:568709;Is_circular=true;gbkey=Src;genome=genomic;mol_type=genomic DNA;serovar=Typhimurium;strain=DT2;sub-species=enterica
NC_022544.1 RefSeq gene 169 255 . + . ID=gene0;Name=thrL;Dbxref=GeneID:17155329;gbkey=Gene;gene=thrL;locus_tag=STMDT2_00011
NC_022544.1 RefSeq CDS 169 255 . + 0 ID=cds0;Name=YP_008642919.1;Parent=gene0;Dbxref=Genbank:YP_008642919.1,GeneID:17155329;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_008642919.1;transl_table=11
NC_022544.1 RefSeq gene 337 2799 . + . ID=gene1;Name=thrA;Dbxref=GeneID:17159252;gbkey=Gene;gene=thrA;locus_tag=STMDT2_00021
NC_022544.1 RefSeq CDS 337 2799 . + 0 ID=cds1;Name=YP_008642920.1;Parent=gene1;Dbxref=Genbank:YP_008642920.1,GeneID:17159252;gbkey=CDS;product=aspartokinase I%2Fhomoserine dehydrogenase I;protein_id=YP_008642920.1;transl_table=11
NC_022544.1 RefSeq gene 2801 3730 . + . ID=gene2;Name=thrB;Dbxref=GeneID:17159249;gbkey=Gene;gene=thrB;locus_tag=STMDT2_00031
NC_022544.1 RefSeq CDS 2801 3730 . + 0 ID=cds2;Name=YP_008642921.1;Parent=gene2;Dbxref=Genbank:YP_008642921.1,GeneID:17159249;gbkey=CDS;product=Homoserine kinase;protein_id=YP_008642921.1;transl_table=11
NC_022544.1 RefSeq gene 3734 5020 . + . ID=gene3;Name=thrC;Dbxref=GeneID:17159250;gbkey=Gene;gene=thrC;locus_tag=STMDT2_00041
NC_022544.1 RefSeq CDS 3734 5020 . + 0 ID=cds3;Name=YP_008642922.1;Parent=gene3;Dbxref=Genbank:YP_008642922.1,GeneID:17159250;gbkey=CDS;product=threonine synthase;protein_id=YP_008642922.1;transl_table=11
NC_022544.1 RefSeq gene 5114 5887 . - . ID=gene4;Name=yaaA;Dbxref=GeneID:17159251;gbkey=Gene;gene=yaaA;locus_tag=STMDT2_00051
NC_022544.1 RefSeq CDS 5114 5887 . - 0 ID=cds4;Name=YP_008642923.1;Parent=gene4;Dbxref=Genbank:YP_008642923.1,GeneID:17159251;gbkey=CDS;product=hypothetical protein;protein_id=YP_008642923.1;transl_table=11
NC_022544.1 RefSeq gene 5966 7396 . - . ID=gene5;Name=yaaJ;Dbxref=GeneID:17159391;gbkey=Gene;gene=yaaJ;locus_tag=STMDT2_00061
NC_022544.1 RefSeq CDS 5966 7396 . - 0 ID=cds5;Name=YP_008642924.1;Parent=gene5;Dbxref=Genbank:YP_008642924.1,GeneID:17159391;gbkey=CDS;product=putative amino-acid transport protein;protein_id=YP_008642924.1;transl_table=11
NC_022544.1 RefSeq gene 7665 8618 . + . ID=gene6;Name=talB;Dbxref=GeneID:17159395;gbkey=Gene;gene=talB;locus_tag=STMDT2_00071
NC_022544.1 RefSeq CDS 7665 8618 . + 0 ID=cds6;Name=YP_008642925.1;Parent=gene6;Dbxref=Genbank:YP_008642925.1,GeneID:17159395;gbkey=CDS;product=transaldolase B;protein_id=YP_008642925.1;transl_table=11
NC_022544.1 RefSeq gene 8729 9319 . + . ID=gene7;Name=mog;Dbxref=GeneID:17159215;gbkey=Gene;gene=mog;locus_tag=STMDT2_00081
NC_022544.1 RefSeq CDS 8729 9319 . + 0 ID=cds7;Name=YP_008642926.1;Parent=gene7;Dbxref=Genbank:YP_008642926.1,GeneID:17159215;gbkey=CDS;product=molybdopterin biosynthesis Mog protein;protein_id=YP_008642926.1;transl_table=11
NC_022544.1 RefSeq gene 9376 9942 . - . ID=gene8;Name=yaaH;Dbxref=GeneID:17158379;gbkey=Gene;gene=yaaH;locus_tag=STMDT2_00091
NC_022544.1 RefSeq CDS 9376 9942 . - 0

**GenoMax** · 12-04-2013, 04:34 AM

Can you remove the first line from the file (please make a backup copy of the original file in case something goes wrong)

Code:

>complete genome

and then try? That would make it a gff format file (http://www.sanger.ac.uk/resources/so.../gff/spec.html).

To make it a gff3 format file you will have to replace that first line with the following two lines (see http://www.sequenceontology.org/gff3.shtml):

Code:

##gff-version 3 
##sequence-region NC_022544.1 1 4814801

**mmmm** · 12-04-2013, 05:08 AM

After editing the first 2 lines in the GFF file as you kindly suggested, the gene annotations still cannot be seen in IGV (I can see only the sequence but not the genes). Your advice is much appreciated.

**GenoMax** · 12-04-2013, 05:26 AM

Have you renamed the GFF file as "your_file_name.gff3"?

IGV expects the GFF3 files to have that extension (http://www.broadinstitute.org/software/igv/GFF) and they also need to be tab-delimited (which your example above does not appear to be).

**GenoMax** · 12-04-2013, 06:28 AM

After taking out the first two lines from your example, adding the two lines for GFF3 meta-data and then converting to tab-delimited text the file appears to work with IGV.

Take these two lines out:

Code:

>complete genome
NC_022544.1 RefSeq region 1 4814801 . + . ID=id0;Dbxref=taxon:568709;Is_circular=true;gbkey=Src;genome=genomic;mol_type=genomic DNA;serovar=Typhimurium;strain=DT2;sub-species=enterica

Replace with:

Code:

##gff-version 3 
##sequence-region NC_022544.1 1 4814801
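
The header swap described above could also be scripted. The following is a minimal Python sketch, not part of the original replies: the function name `fix_gff3_header` is hypothetical, and it assumes the file has a FASTA-style title line starting with `>` and a single `region` row whose coordinates can be reused for the `##sequence-region` directive.

```python
def fix_gff3_header(lines):
    """Return the GFF lines with GFF3 directive lines placed at the top.

    Assumption: the 'region' row holds the sequence name, start and end
    that belong in the ##sequence-region directive.
    """
    header = ["##gff-version 3"]
    body = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Drop blank lines and the FASTA-style title (">complete genome").
        if not line.strip() or line.startswith(">"):
            continue
        # Avoid a duplicate version directive if one is already present.
        if line.startswith("##gff-version"):
            continue
        fields = line.split(None, 8)
        # Turn the whole-genome 'region' row into a directive line.
        if len(fields) >= 5 and fields[2] == "region":
            header.append(f"##sequence-region {fields[0]} {fields[3]} {fields[4]}")
            continue
        body.append(line)
    return header + body
```

Reading the file with `open(path).readlines()`, passing the list to this function, and writing the result joined by newlines to a new `.gff3` file would leave the original untouched as a backup.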

Attached Files

IGV_Cap.PNG (14.2 KB, 31 views)

**mmmm** · 12-05-2013, 02:11 AM

How do you convert a text (gff3) file to tab-delimited, please?
I used:

awk '{ for(i=1;i<=NF;i++){if(i==NF){printf("%s\n",$NF);}else {printf("%s\t",$i)}}}' file.gff3

but it did not convert the file to tab-delimited:

##gff-version 3
##sequence-region NC_022544.1 1 4814801

NC_022544.1 RefSeq gene 169 255 . + . ID=gene0;Name=thrL;Dbxref=GeneID:17155329;gbkey=Gene;gene=thrL;locus_tag=STMDT2_00011
NC_022544.1 RefSeq CDS 169 255 . + 0 ID=cds0;Name=YP_008642919.1;Parent=gene0;Dbxref=Genbank:YP_008642919.1,GeneID:17155329;gbkey=CDS;gene=thrL;product=thr operon leader peptide;protein_id=YP_008642919.1;transl_table=11
NC_022544.1 RefSeq gene 337 2799 . + . ID=gene1;Name=thrA;Dbxref=GeneID:17159252;gbkey=Gene;gene=thrA;locus_tag=STMDT2_00021
NC_022544.1 RefSeq CDS 337 2799 . + 0 ID=cds1;Name=YP_008642920.1;Parent=gene1;Dbxref=Genbank:YP_008642920.1,GeneID:17159252;gbkey=CDS;gene=thrA;product=aspartokinase I%2Fhomoserine dehydrogenase I;protein_id=YP_008642920.1;transl_table=11
NC_022544.1 RefSeq gene 2801 3730 . + . ID=gene2;Name=thrB;Dbxref=GeneID:17159249;gbkey=Gene;gene=thrB;locus_tag=STMDT2_00031
NC_022544.1 RefSeq CDS 2801 3730 . + 0 ID=cds2;Name=YP_008642921.1;Parent=gene2;Dbxref=Genbank:YP_008642921.1,GeneID:17159249;gbkey=CDS;gene=thrB;product=Homoserine kinase;protein_id=YP_008642921.1;transl_table=11
NC_022544.1 RefSeq gene 3734 5020 . + . ID=gene3;Name=thrC;Dbxref=GeneID:17159250;gbkey=Gene;gene=thrC;locus_tag=STMDT2_00041
NC_022544.1 RefSeq CDS 3734 5020 . + 0 ID=cds3;Name=YP_008642922.1;Parent=gene3;

**GenoMax** · 12-05-2013, 04:36 AM

Try this unix command. Adjust the file names accordingly.

Code:

$ tr ' ' \\t < original.gff3 > tab_converted.gff3

There is a "space" between the two single quotes in the command above.

Best to put the two top metadata lines in after the conversion.
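
One caveat with a blanket space-to-tab conversion is that it also replaces spaces inside the attributes column (for example `product=thr operon leader peptide`). A minimal Python sketch of an alternative is shown below. It is an illustration rather than something from the original replies: the function name `to_tab_delimited` is hypothetical, and it assumes the first eight columns never contain spaces, so only the first eight separators are converted.

```python
def to_tab_delimited(line):
    """Convert one whitespace-separated GFF line to tab-delimited form.

    Only the first eight separators become tabs, so spaces inside the
    ninth (attributes) column are preserved. Directive/comment lines
    starting with '#' and blank lines are returned unchanged.
    """
    line = line.rstrip("\r\n")
    if not line.strip() or line.startswith("#"):
        return line
    # maxsplit=8 yields at most nine fields; the ninth keeps its spaces.
    return "\t".join(line.split(None, 8))
```

Because lines starting with `#` are skipped, the two metadata lines can stay in the file during conversion. Applying the function to each line of `original.gff3` and writing the results to `tab_converted.gff3` mirrors the file names used in the command above.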

**mmmm** · 12-05-2013, 05:35 AM

Thank you so much for your advice and time. I managed to view the genes in IGV and now have a coverage.txt file showing which genes are covered or missed relative to the reference.

But I have a very simple technical issue: when I open coverage.txt in Excel or LibreOffice, I cannot sort the coverage values in ascending order because the values are written on a separate line:

NC_022544.1 RefSeq gene 2096621 2097676 . - . ID=gene2037;Name=cbiG;Dbxref=GeneID:17157414;gbkey=Gene;gene=cbiG;locus_tag=STMDT2_20011
2505 1056 1056 1
NC_022544.1 RefSeq CDS 2096621 2097676 . - 0 "ID=cds1963;Name=YP_008644883.1;Parent=gene2037;Dbxref=Genbank:YP_008644883.1,GeneID:17157414;gbkey=CDS;gene=cbiG;product=cobalamin biosynthesis protein;protein_id=YP_008644883.1;transl_table=11"
2505 1056 1056 0.9
NC_022544.1 RefSeq gene 3145124 3146200 . - . ID=gene2977;Name=STMDT2_29251;Dbxref=GeneID:17156485;gbkey=Gene;locus_tag=STMDT2_29251
1716 1077 1077 0.6

**dpryan** · 12-05-2013, 05:52 AM

Something like the following will put each record on a single line.

Code:

awk 'BEGIN{first=1; ORS="";}{if(first==1) {print $0; first=0;} else {print "\t" $0 "\n"; first=1}}' coverage.txt > coverage.single_line.txt

**mmmm** · 12-05-2013, 06:04 AM

I am afraid that did not work.

**GenoMax** · 12-05-2013, 06:21 AM

Devon's solution works for me. What is happening in your case?

**mmmm** · 12-05-2013, 06:24 AM

NC_022544.1 RefSeq gene 2096621 2097676 0 - 0 ID=gene2037;Name=cbiG;Dbxref=GeneID:17157414;gbkey=Gene;gene=cbiG;locus_tag=STMDT2_20011
2505 1056 1056 1 NC_022544.1 RefSeq CDS 2096621
2505 1056 1056 1
NC_022544.1 RefSeq gene 3145124 3146200 0 - 0 ID=gene2977;Name=STMDT2_29251;Dbxref=GeneID:17156485;gbkey=Gene;locus_tag=STMDT2_29251
1716 1077 1077 1 NC_022544.1 RefSeq CDS 3145124
1716 1077 1077 1

**GenoMax** · 12-05-2013, 06:27 AM

Wonder if this is a unix vs PC/Mac file format issue. Are you moving the file among machines before running the script? I pasted your original sample into a new file on unix and did not have any problem.
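
If line endings are the cause, a version that tolerates Windows-style `\r\n` endings may help. The following is a minimal Python sketch under stated assumptions, not a confirmed fix: the function name `join_and_sort` is hypothetical, the file is assumed to alternate strictly between a feature line and a line of numbers, and the last number on each numeric line is assumed to be the coverage value to sort on (the thread does not say which column that is).

```python
def join_and_sort(text):
    """Join each feature line with the numeric line that follows it,
    then sort the joined rows by the last numeric column (ascending).

    str.splitlines() handles both Unix and Windows line endings, so
    stray carriage returns do not end up inside the joined rows.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    rows = []
    # Pair line 1 with line 2, line 3 with line 4, and so on.
    for feature, stats in zip(lines[0::2], lines[1::2]):
        rows.append(feature.strip() + "\t" + "\t".join(stats.split()))
    # Assumption: the final column is the coverage value.
    rows.sort(key=lambda row: float(row.rsplit("\t", 1)[1]))
    return rows
```

Reading `coverage.txt` as one string, passing it to this function, and writing the returned rows joined by newlines gives a file with one record per line that a spreadsheet can sort.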
