![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
Beginner Question: Generate WIG file | el_Davido | Bioinformatics | 11 | 09-12-2012 09:08 AM |
samtools sort bam file error: generate non-existent file | mediator | Bioinformatics | 0 | 03-05-2012 09:42 PM |
how to generate ini file | cjose | Bioinformatics | 4 | 07-22-2011 11:18 AM |
generate data from ewbt file | zhaowei | Bioinformatics | 1 | 01-28-2011 01:58 PM |
Celera Assembler (WGS) - splice site file? | dan | Bioinformatics | 4 | 09-28-2009 03:56 AM |
![]() |
|
Thread Tools |
![]() |
#1 |
Member
Location: Netherlands Join Date: Jul 2012
Posts: 10
|
![]()
Hi,
I have been trying for a while to submit our WGS genome data to NCBI. I had to overcome some obstetricals, and I'm almost there, but this last obstetrical seems to be to big for me. I have to make a AGP file witch describes the assembly of a larger sequence object (scaffolds) from smaller objects (contigs). We have some sort of script to generate the AGP file. But it is't working correctly. And my expertise isn't not that great to make it work. So my question to the community is: Who has a working script and is willing to help to improve mine or share theirs? Thank you in advance! |
![]() |
![]() |
![]() |
#2 |
PhD Student
Location: Denmark Join Date: Jul 2012
Posts: 164
|
![]()
Can you post an example of what you current script is producing? May be it can be modified if there's just a few simple formatting issues.
Also some information on which assembler produced the contigs would be useful as well. |
![]() |
![]() |
![]() |
#3 |
Junior Member
Location: California Join Date: Oct 2011
Posts: 4
|
![]()
I also have a similar question. I used SOAPdenovo to assemble a genome but I don't know how to generate an AGP file from the output. Has anyone done this before?
Thanks! |
![]() |
![]() |
![]() |
#4 | |
Member
Location: Netherlands Join Date: Jul 2012
Posts: 10
|
![]() Quote:
I created 7 nucleotide sequences different with N-regions to invoke error messages. Here is the outcome of the Perl script to create an AGP file: Code:
scaffold_1 1 265 1 W contig_1 1 265 + scaffold_1 266 530 2 W contig_2 1 265 + scaffold_2 1 407 1 W contig_3 1 407 + scaffold_3 1 309 1 W contig_4 1 309 + scaffold_3 310 674 2 W contig_5 1 365 + scaffold_4 1 340 1 N 340 scaffold yes paired-ends scaffold_5 1 249 1 N 249 scaffold yes paired-ends scaffold_5 250 509 2 W contig_6 1 260 + scaffold_5 510 928 3 N 419 scaffold yes paired-ends scaffold_6 1 504 1 N 504 scaffold yes paired-ends scaffold_7 1 402 1 W contig_7 1 402 + scaffold_7 403 702 2 N 300 scaffold yes paired-ends scaffold_7 703 1156 3 W contig_8 1 454 + scaffold_4 1 340 1 N 340 scaffold yes paired-ends [Warning (w32)] gap at the beginning of object scaffold_4 [Warning (w31)] gap at the end of object scaffold_4 [Warning (w34)] no components in object scaffold_4 scaffold_5 1 249 1 N 249 scaffold yes paired-ends [Warning (w32)] gap at the beginning of object scaffold_5 scaffold_5 510 928 3 N 419 scaffold yes paired-ends [Warning (w31)] gap at the end of object scaffold_5 scaffold_6 1 504 1 N 504 scaffold yes paired-ends [Warning (w32)] gap at the beginning of object scaffold_6 [Warning (w31)] gap at the end of object scaffold_6 [Warning (w34)] no components in object scaffold_6 Scaffold_5: contains nucleotide region (260), adjacent on both sides by a N-region. This is not allowed, because of the 'gaps' on both sides of the 'contig'. Scaffold_6: contains in the middle nucleotide region <200 nt, (so only the N-region is noticed in the AGP file. But only a N-region is also not allowed. My question: How can I get around these error messages? How can I modify the AGP script where it knows to skip the first or last N-region, or when there is only a N-region, to skip this scaffold, in the AGP file? (in the fasta file it is not printed, of course). Here is my input: Code:
>scaffold1 TACAATATATGTATACAAAACTGATAGATGCTACGTGGTGAATTGGGCATATTGTAAATATGTATATGTATGTATGTAAACCTAGGGTTTTAGCATTTTTGCATTGGTCAGTGGGGGCCATCTTCTCAGTTTTCTAATAATATTGTTTCAAGACTGATGTGTGTGGGATTTTCGAATGAATTATTCTGGTTCAACGTTATATAAGGTATATGAATTTAAAATAAAATATGCTGTTTCATTCTGTAGGGTGTGTGCCTCTGACTGANNNNNNNNAGTCTCTTTTTGTCGGAGTGGTAAGTCTAGGTTGTCAGATTAAATGACATTATTTATTTGCTGGCTTGTTCTGTGACACACACTGTAATTACTAGTAGCCTGTATTCTGGGGTTATGACGTCATTCATCGCAGTTACTGTAAGANNNNTTTCATCCAGTGGAGAGCTGTAATCTCTGCATCTGGGGATCAGAGGTATTGTGTCTGTAATCTCTCCATCTCAGTTTGCATGTCCCGTTCTTACCGCCTTCTCTGCTATTGAACCGGTTACAGATATATCATTACAATTAAACATTATTGATACACTATTACATATGTACACTACAAGTAGTGACTGACATATCTGAAAAGACACACAGGTGACCCCGTACAGAGAATTTTGTAATTGCCAATTTTGTGAATCCTTATACACAAC >scaffold2 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGATCTGGAACATTTGACTTCCTTTAAAACCGGATTGACGGGTTTTATAGGACGATCAGCCTCCAAACGCACCCTAGCAATATAGTATAAGACATTACTTAGGTGCTTTTGTGTTAGCGTTTCTCAGACAACAATTTATTTGATATTGGTTTTGGGATTTCTCTGACAACAATTTATTTTATACATTGGTAATGCTGCAGGTTTGTGTGCACTTGAACACAACGTTACTTTACAAAGGTTATTGCTTCAAGAGATATGACCCATGAACCATTATTTCTAGGAAGCACCAAGTACCCTTAGTTCATCAACTTCCTGCAGCCCAAATGCCTTTCATATAGAGCAACTTCCCACAGTATACAATTGAATCAAACTAAAGAGAATATGCTTTTCTATATTTCTGCTCCCTAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN >scaffold3 GATCTGGAACATTTGACTTCCTTTAAAACCGGATTGACGGGTTTTATAGGACGATCAGCCTCCAAACGCACCCTAGCAATATAGTATAAGACATTACTTAGGTGCTTTTGTGTTAGCGTTTCTCAGACAACAATTTATTTGATATTGGTTTTGGGATTTCTCTGACAACAATTTATTTTATACATTGGTAATGCTGCAGGTTTGTGTGCACTTGAACACAACGTTACTTTACAAAGGTTATTGCTTCAAGAGATATGACCCATGAACCATTATTTCTAGGAAGCACCAAGTACCCTTAGTTCATCAACTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTTAATTTTATTTGGTACATGTACTTACACATACAATGATGACATACAATATATGTATACAAAACTGATAGATGCTACGTGGTGAATTGGGCATATTGTAAATATGTATATGTATGTATGTAAACCTAGGGTTTTAGCATTTTTGCATTGGTATGGCCTGTATAAGGTCAAAAAACGTAGTCTCTTTTTGTCGGAGTGGTAAGTCTAGGTAGACAGTCAGGCTGACATGCAGACAAACAGTATGCCCCGGCTGGAAAATTCAGTGCTGTCCACTGGTTGGGGTAGAGGCAGACTTTTCAGGAGGGGGCTGTTAATACATTTGCCCATTTTGTAATAAAACTGTCACTTGCAGTCTCCGTAAAGGG >scaffold4 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN >scaffold5 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGATCTGGAACATTTGACTTCCTTTAAAACCGGATTGACGGGTTTTATAGGACGATCAGCCTCCAAACGCACCCTAGCAATATAGTATAAGACATTACTTAGGTGCTTTTGTGTTAGCGTTTCTCAGACAACAATTTATTTGATATTGGTTTTGGGATTTCTCTGACAACAATTTATTTTATACATTGGTAATGCTGCAGGTTTGTGTGCACTTGAACACAACGTTACTTTACAAAGGTTATTGCTTCAAGAGATATGACCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN >scaffold6 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGATCTGGAACATTTGACTTCCTTTAAAACCGGATTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN >scaffold7 TTTATTTTATACATTGGTAATGCTGCAGGTTTGTGTGCACTTGAACACAACGTTACTTTACAAAGGTTATTGCTTCAAGAGATATGACCCATGAACCATTATTTCTAGGAAGCACCAAGTACCCTTAGTTCATCAACTTCCTGCAGCCCAAATGCCTTTCATATAGAGCAACTTCCCACAGTATACAATTGAATCAAACTAAAGAGAATATGCTTTTCTATATTTCTGCTCCCTAAAGGGATGCTGGCTGGTGGCATATTTTGCAGCTCCCCATACTTTGAAAAACGCTGATAAATGGCTACTAAAGGCAAGTATTCATTATCTTTATAACTTTGAATGTATAGTTGCAAATGGGCCCTTGGCCAAATAATACATAATAGAATAAGCTGGGTAATGCATAATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTTAATTTTATTTGGTACATGTACTTACACATACAATGATGACATACAATATATGTATACAAAACTGATAGATGCTACGTGGTGAATTGGGCATATTGTAAATATGTATATGTATGTATGTAAACCTAGGGTTTTAGCATTTTTGCATTGGTATGGCCTGTATAAGGTCAAAAAACGTAGTCTCTTTTTGTCGGAGTGGTAAGTCTAGGTAGACAGTCAGGCTGACATGCAGACAAACAGTATGCCCCGGCTGGAAAATTCAGTGCTGTCCACTGGTTGGGGTAGAGGCAGACTTAAAATATTGTTTTACCACAGCCCATGGGGCAGTGTATGGCCTGTATAAGGTCAAAAAACGTCCAGAATTCTTTGGCATTTTCAATTGTTAATTTAAGTTGAACTGAAGAGGAATAGAGGTGCTTGTGTTTGTGTGTTTGTGTGTGTGTGTGCATATGTG |
|
![]() |
![]() |
![]() |
#5 |
Junior Member
Location: usa Join Date: Dec 2010
Posts: 6
|
![]()
i am not sure if this code can help. It generates contig .agp file by using scaffolds.
www.hmpdacc.org/doc/fasta2apg.pl
__________________
wwq ![]() |
![]() |
![]() |
![]() |
#6 | |
Member
Location: Netherlands Join Date: Jul 2012
Posts: 10
|
![]() Quote:
Dear WWQ413, thank you for trying. I will try this on my scaffolds. Did you already try to submit your file generated by your script to NCBI? |
|
![]() |
![]() |
![]() |
#7 | |
Member
Location: Netherlands Join Date: Jul 2012
Posts: 10
|
![]() Quote:
|
|
![]() |
![]() |
![]() |
#8 |
Junior Member
Location: California Join Date: Oct 2011
Posts: 4
|
![]()
Try copying this link to your browser, that seemed to work for me:
http://www.hmpdacc.org/doc/fasta2apg.pl |
![]() |
![]() |
![]() |
#9 |
Senior Member
Location: NL, Leiden Join Date: Feb 2010
Posts: 245
|
![]()
There is a script in the Velvet package called fasta2agp.pl, it is located in the 'contrib' folder.
|
![]() |
![]() |
![]() |
#10 |
Member
Location: DE Join Date: Dec 2012
Posts: 65
|
![]()
fasta2agp.pl requires Bio::SeqIO...can this module be found seperately or do I need to install BioPerl?
I tried pip install with no luck. Bear with me I'm new at this. Hopefully this is the one and only time I need to submit a WGS assembly to NCBI. |
![]() |
![]() |
![]() |
#11 |
Member
Location: Belgium Join Date: Aug 2011
Posts: 18
|
![]()
Hi,
I tried to use the fasta2agp.pl script but I get errors when submitting to NCBI WGS. The script is from 2010 and there is a new AGPv2,0 version since 2012. The script worked but there is an error counting the fragments (column 4) and it gives errors when submitting to WGS. Can anyone help with this issue? See first lines: # Generated from SOAPdenovo assembly file B12041.fa using script fasta2agp.pl B12041_scaffold_1 1 8630 0 W B12041_1 1 8630 + B12041_scaffold_1 8631 8665 1 N 35 fragment yes B12041_scaffold_1 8666 30966 1 W B12041_2 1 22301 + B12041_scaffold_1 30967 30984 2 N 18 fragment yes B12041_scaffold_1 30985 89730 2 W B12041_3 1 58746 + B12041_scaffold_1 89731 89731 3 N 1 fragment yes B12041_scaffold_1 89732 105108 3 W B12041_4 1 15377 + |
![]() |
![]() |
![]() |
#12 | |
Member
Location: Belgium Join Date: Aug 2011
Posts: 18
|
![]() Quote:
#!/usr/bin/perl ### david.studholme@tsl.ac.uk ### Generates contigs (in FastA) and scaffolding information (in AGP) from Velvet 'contigs.fa' supercontigs file ### Use entirely at you own risk!! There may be bugs! ### modified by chienchi@lanl.gov 2010/09/13 ### add flags -i -size -o ### change naming convention for HMP assembly purpose ### modified by mpop@umd.edu 2010/9/15 ### added flag -name ### allowing to change names of sequences ### modified by mpop@umd.edu 2010/9/27 ### removed 10bp limit for inter-contig gaps ### modified by annelies.haegeman@ilvo.vlaanderen.be 2014/11/27 ### fixed counter $i to restart from 1 at each scaffold + unique fragment number generation use strict; use warnings ; use Bio::SeqIO ; use Getopt::Long; use File::Basename; my $file; my $outDir; my $scaf_size_cutoff=0; my $name_prefix="AGP"; GetOptions( 'i=s' => \$file, # scaf seq in fasta 'size=i' => \$scaf_size_cutoff, 'o=s' => \$outDir, 'name=s' => \$name_prefix); die "Usage: $0 -i <sequence file> -o <out_dir> [-size <scaf_size_cutoff>] [-name <name>]\n" unless $file; my ($file_name, $path, $suffix)=fileparse("$file", qr/\.[^.]*/); my ($sample_name,$center)=split /_/,$file_name; ### Output file for contigs in Fasta format ### The various output file names can be tuned here. my $contig_outfile = "$outDir/$name_prefix.contigs.fa"; my $scaffolds_outfile = "$outDir/$name_prefix.scaffolds.fa"; my $agp_outfile = "$outDir/$name_prefix.agp"; #open (FILE, ">$outdir/$contig_outfile") and # warn "Will write contigs to file '$contig_outfile' to $outDir Directory\n" or # die "Failed to write to file '$fasta_outfile'\n"; my $contig_out = Bio::SeqIO->new('-file' => ">$contig_outfile", '-format' => 'Fasta') and warn "Will write contigs to file '$contig_outfile' to $outDir Directory\n" or die "Failed to write to file '$contig_outfile'\n"; open (AGP_FILE, ">$agp_outfile") and warn "Will write contigs to file $agp_outfile to $outDir Directory\n" or die "Failed to write to file $agp_outfile\n"; #open (SCAFF_FILE, ">$outDir/$scaffolds_outfile") and my $scaff_out = Bio::SeqIO->new('-file' => ">$scaffolds_outfile", '-format' => 'Fasta') and warn "Will write scaffolds to file $scaffolds_outfile to $outDir Directory\n" or die "Failed to write to file $scaffolds_outfile\n"; print AGP_FILE "# Generated from SOAPdenovo assembly file $file using script $0\n"; warn "Scaffold Sequence cutoff = $scaf_size_cutoff nt\n" if ($scaf_size_cutoff); my $i = 0;# a counter, used for generating unique contig names my $inseq = Bio::SeqIO->new('-file' => "<$file", '-format' => 'Fasta' ) ; while (my $seq_obj = $inseq->next_seq ) { my $supercontig_id = $seq_obj->id ; my $supercontig_seq = $seq_obj->seq ; my $supercontig_desc = $seq_obj->description ; my $supercontig_length = length($supercontig_seq); $i=0; #reset each time you have a new scaffold # only process the long supercontigs next if ($supercontig_length < $scaf_size_cutoff); ### NCBI do not allow coverage and length information in the FastA identifier ### e.g. NODE_1160_length_397673_cov_14.469489 is an illegal FastA ID ### So we will replace these with simple numbers if ($supercontig_id =~ m/NODE_(\d+)_length_\d+_cov_\d+/ or $supercontig_id =~ m/^(\d+)$/ or $supercontig_id =~ m/^scaffold(\d+)$/) { $supercontig_id = "${name_prefix}_scaffold_$1"; } # print SCAFF_FILE ">$supercontig_id\n$supercontig_seq\n"; my $scf = Bio::PrimarySeq->new(-seq => "$supercontig_seq", -id => "$supercontig_id"); $scaff_out->write_seq($scf); my $start_pos = 1; # keep track of whereabouts in this supercontig we are my %substring_sequences; foreach my $substring_sequence ( split /(N+)/i, $supercontig_seq ) { #warn "\n$substring_sequence\n" if $supercontig_id eq '1160'; for #debugging only $i++; # a counter, used for generating unique contig names ### Define the AGP column contents my $object1 = $supercontig_id; my $object_beg2 = $start_pos; my $object_end3 = $start_pos + length($substring_sequence) - 1; my $part_number4 = $i; my $component_type5; my $component_id6a; my $gap_length6b; my $component_beg7a; my $gap_type7b; my $component_end8a; my $linkage8b; my $orientation9a; my $filler9b; if ( $substring_sequence =~ m/^N+$/i ) { ### This is poly-N gap between contigs $component_type5 = 'N'; $gap_length6b = length($substring_sequence); $gap_type7b = 'fragment'; $linkage8b = 'yes'; $filler9b = ''; } elsif ( $substring_sequence =~ m/^[ACGT]+$/i ) { ### This is a contig $component_type5 = 'W'; $component_id6a = "$supercontig_id"."_"."$i"; $component_beg7a = 1; $component_end8a = length($substring_sequence); $orientation9a = '+'; ### Print FastA formatted contig # print FILE ">$component_id6a\n$substring_sequence\n"; my $ctg = Bio::PrimarySeq->new(-seq => "$substring_sequence", -id => "$component_id6a"); $contig_out->write_seq($ctg); } else { die "Illegal characters in sequence\n$substring_sequence\n"; } $start_pos += length ($substring_sequence); if ($component_type5 eq 'N') { ### print AGP line for gap print AGP_FILE "$object1\t$object_beg2\t$object_end3\t$part_number4\t$component_type5\t$gap_length6b\t$gap_type7b\t$linkage8b\t$filler9b\n"; } else { ### print AGP line for contig print AGP_FILE "$object1\t$object_beg2\t$object_end3\t$part_number4\t$component_type5\t$component_id6a\t$component_beg7a\t$component_end8a\t$orientation9a\n"; } } } $contig_out->close(); close AGP_FILE; $scaff_out->close(); |
|
![]() |
![]() |
![]() |
Tags |
ncbi, wgs |
Thread Tools | |
|
|