SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Transcript mapping without a reference genome gringer Bioinformatics 3 09-02-2020 11:07 PM
Aligning/Mapping Illumina reads to reference in Geneious problem Illusive Man Illumina/Solexa 5 09-04-2013 04:04 AM
Hiseq reads -assembly n mapping on a reference genome Bgansw Illumina/Solexa 0 11-08-2012 08:43 PM
Assisted de novo genome assembly? Create new consensus mapping reads to reference? zmartine Bioinformatics 8 02-10-2012 12:31 AM
Targeted Genome Assembly for region poorly represented in reference genome? gumbos Bioinformatics 1 01-09-2012 04:01 PM

Reply
 
Thread Tools
Old 11-04-2013, 07:46 PM   #1
Bgansw
Member
 
Location: Sydney, Australia

Join Date: Nov 2011
Posts: 24
Default Illumina - mapping genome on reference - how to extract assembly?

Hello,

I'm having trouble extracting assembled reads after using soapaligner.

I mapped my raw reads to my reference genome using soap aligner, how do I then extract the assembly and submit for annotation? I tried converting soap to sam format, then using,

# This is for extracting mapping reads from bam files
samtools view -F4 input.bam > /data/bowtieOutput/mapped.sam

converted sam to bam and bam to fastq, then fastq to fasta -
I need a genebank file to upload sequences to http://rast.nmpdr.org/rast.cgi

Please help!

bgansw
Bgansw is offline   Reply With Quote
Old 11-04-2013, 08:32 PM   #2
jimmybee
Senior Member
 
Location: Adelaide, Australia

Join Date: Sep 2010
Posts: 119
Default

Technically thats not really an assembly.

TO get a consensus sequence, you should look at the variant calling section of the samtools manual (http://samtools.sourceforge.net/mpileup.shtml):

Quote:
samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq
jimmybee is offline   Reply With Quote
Old 11-05-2013, 02:27 AM   #3
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838
Default

Quote:
Originally Posted by jimmybee View Post
To get a consensus sequence, you should look at the variant calling section of the samtools manual (http://samtools.sourceforge.net/mpileup.shtml):
But bear in mind that the vcf2fq script is designed for SNPs, not INDELs. If there is an INDEL in your sequence relative to the reference, then the INDEL and a few flanking bases will be changed to lower case, but not replaced. This means that any fastq file generated from this script will have the same length as the input reference sequence.

I sent a patch to their sourceforge page to fix this (and allow more compact partial vcf files with a provided reference sequence), but I don't think it's been implemented yet.
gringer is offline   Reply With Quote
Old 11-19-2013, 05:54 PM   #4
Bgansw
Member
 
Location: Sydney, Australia

Join Date: Nov 2011
Posts: 24
Default

Hello Jimmybee and Gringer,

Thanx for your guidance with this.

I've been trying the commands (samtools, bcftools).. and I'm getting an error:

bgansw@BioinfoSecond:~/bwa/samtools-0.1.18$ samtools mpileup -uf 533240.5.fna N3_on505.bam |bcftools/bcftools view -cg - |bcftools/vcfutils.pl vcf2fq > N3_on_505_assembly_extracted.fastq
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
[afs] 0:0.000 1:0.000 2:0.000
Use of uninitialized value in length at bcftools/vcfutils.pl line 544, <> line 27.
Use of uninitialized value in length at bcftools/vcfutils.pl line 544, <> line 27.

I've tried installing samtools 1.19, and running this.. but I keep getting the same error.

Do you'll know what I'm doing wrong??

Really really appreciate your help. Thanx v v much

Bgansw
Bgansw is offline   Reply With Quote
Old 11-19-2013, 07:43 PM   #5
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838
Default

Quote:
Originally Posted by Bgansw View Post

Code:
bgansw@BioinfoSecond:~/bwa/samtools-0.1.18$ samtools mpileup -uf 533240.5.fna N3_on505.bam |bcftools/bcftools view -cg - |bcftools/vcfutils.pl vcf2fq > N3_on_505_assembly_extracted.fastq
...
Use of uninitialized value in length at bcftools/vcfutils.pl line 544, <> line 27.
Use of uninitialized value in length at bcftools/vcfutils.pl line 544, <> line 27.
I recommend checking your file names. I get an error one line after this when I give an incorrect fasta file name to samtools mpileup. There is a chance that your fasta file is called
Code:
533240.5.fa
rather than
Code:
533240.5.fna
gringer is offline   Reply With Quote
Old 11-19-2013, 08:08 PM   #6
Bgansw
Member
 
Location: Sydney, Australia

Join Date: Nov 2011
Posts: 24
Default

Hi Gringer,

Thanx f the tip. I've gone over all the file names, and have changed the 53340.5.fna to 533240.5.fa but I'm still getting the same error. Does this command work well for you? I'm trying to figure out what the problem could be.

Do u know if theres any other way to extract the consensus sequence after mapping one genome on another using soap aligner?
Bgansw is offline   Reply With Quote
Old 11-19-2013, 08:37 PM   #7
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838
Default

Quote:
Does this command work well for you?
I have a slightly different processing script, but the command you have seems to work for me on my mitochondrial data (i.e. it produces a fastq file at the end of it). Here's the actual command sequence that I run for my processing:

Code:
    echo "** getting VCF for ${y} (max depth = ${maxdepth}) **";
    pv results/${y}.bam | samtools mpileup -L ${maxdepth} -uf db/hs_mtDNA_ref.fasta - | \
        bcftools view -v -g - > results/${y}.vcf
    echo "** getting consensus FASTQ for ${y} **";
    perl scripts/vcf2fq.pl -Q 20 -L 20 -f db/hs_mtDNA_ref.fasta results/${y}.vcf > results/con_${y}.fastq
    echo "** converting ${y} consensus FASTQ to plain FASTA **";
    perl scripts/fastq2fasta.pl results/con_${y}.fastq | perl -pe 'tr/acgt/ACGT/' > results/con_${y}.fasta
I've attached my modified vcf2fq script to this post [also the fairly trivial fastq2fasta], maybe it will help to diagnose the problem.
Attached Files
File Type: pl fastq2fasta.pl (578 Bytes, 41 views)
File Type: pl vcf2fq.pl (4.2 KB, 43 views)
gringer is offline   Reply With Quote
Old 01-09-2014, 06:03 PM   #8
xmubingo
Member
 
Location: China guangzhou

Join Date: Sep 2013
Posts: 13
Default

Quote:
Originally Posted by gringer View Post
But bear in mind that the vcf2fq script is designed for SNPs, not INDELs. If there is an INDEL in your sequence relative to the reference, then the INDEL and a few flanking bases will be changed to lower case, but not replaced. This means that any fastq file generated from this script will have the same length as the input reference sequence.

I sent a patch to their sourceforge page to fix this (and allow more compact partial vcf files with a provided reference sequence), but I don't think it's been implemented yet.
Hi gringer, I got a problem, i found that fastq file generated from vcf2fq have the different length as the input reference sequence. Do you have any update news about vcf2fq ?
xmubingo is offline   Reply With Quote
Old 01-09-2014, 07:18 PM   #9
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838
Default

Quote:
Hi gringer, I got a problem, i found that fastq file generated from vcf2fq have the different length as the input reference sequence. Do you have any update news about vcf2fq ?
Is this with the standard vcftools script, or the modified version that I wrote? What is the length difference? My modified version will produce a different sequence length because of INDELs -- any insertion or deletion will change the length of the sequence.
gringer is offline   Reply With Quote
Old 01-10-2014, 02:06 AM   #10
xmubingo
Member
 
Location: China guangzhou

Join Date: Sep 2013
Posts: 13
Default

Quote:
Originally Posted by gringer View Post
Is this with the standard vcftools script, or the modified version that I wrote? What is the length difference? My modified version will produce a different sequence length because of INDELs -- any insertion or deletion will change the length of the sequence.
hi gringer, thanks for your reply. I use standard vcfutils.pl of samtools(Version: 0.1.19-44428cd). Here is my code:

Code:
samtools mpileup -uf ${refFile} ${uniquelyMappedBAMFile} > ${mpileupFile}
bcftools view -cg ${mpileupFile} | vcfutils.pl vcf2fq > ${cnsFileFQ}
seqtk seq -l 0 -A ${cnsFileFQ} > ${cnsFileFA}
the cns.fq file contains "A,T,C,G,a,t,c,g,n". but i check length of sequences in fastq file, it doesn't same with length in reference sequence. so i guess the error coming from command "bcftools view -cg ${mpileupFile} | vcfutils.pl vcf2fq > ${cnsFileFQ}". i will check the command again.

Last edited by xmubingo; 01-12-2014 at 03:26 PM.
xmubingo is offline   Reply With Quote
Old 01-10-2014, 04:37 PM   #11
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838
Default

Is this behaviour consistent regardless of what BAM file you use?

Your command line looks correct (it at least matches the signature of the samtools man page), so the next thing I would check would be inputs/outputs. Note that the current vcfutils vcf2fq implementation needs every base to be specified, so the output of bcftools should be one "variant" line per base. Unfortunately I can't really help you on this much more without actual examples.
gringer is offline   Reply With Quote
Old 01-12-2014, 03:08 AM   #12
Tong.W
Member
 
Location: China

Join Date: Jul 2012
Posts: 14
Default

Do not use a reference sequence to assembly. They are different. If you want to get a consistent after soap,you can use SOAPsnp. This software will generate a new consistent sequence for you.
Tong.W is offline   Reply With Quote
Old 01-12-2014, 04:42 PM   #13
xmubingo
Member
 
Location: China guangzhou

Join Date: Sep 2013
Posts: 13
Default

Quote:
Originally Posted by gringer View Post
Is this behaviour consistent regardless of what BAM file you use?

Your command line looks correct (it at least matches the signature of the samtools man page), so the next thing I would check would be inputs/outputs. Note that the current vcfutils vcf2fq implementation needs every base to be specified, so the output of bcftools should be one "variant" line per base. Unfortunately I can't really help you on this much more without actual examples.
Thanks, gringer!! I think you figure out the problem. I use the following command to check the input of vcf2fq. I found that the output of bcftools did not cover each base.
Code:
bcftools view -cg ${mpileupFile} | more
For example, i pick up several lines of bcftools ouput. You can see that contig "AGTP01133720.1" ends in position of 23,295bp, but actually "AGTP01133720.1" has a length of 23,339bp. More clearly, contig "AGTP01133827.1" starts in position of 30bp, not in the beginning.
Code:
AGTP01133720.1  23292   .       A       .       33      .       DP=1;AF1=0;AC1=0;DP4=0,1,0,0;MQ=50;FQ=-30       PL      0
AGTP01133720.1  23293   .       A       .       33      .       DP=1;AF1=0;AC1=0;DP4=0,1,0,0;MQ=50;FQ=-30       PL      0
AGTP01133720.1  23294   .       A       .       33      .       DP=1;AF1=0;AC1=0;DP4=0,1,0,0;MQ=50;FQ=-30       PL      0
AGTP01133720.1  23295   .       A       .       33      .       DP=1;AF1=0;AC1=0;DP4=0,1,0,0;MQ=50;FQ=-30       PL      0
AGTP01133827.1  30      .       G       .       33      .       DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=50;FQ=-30       PL      0
AGTP01133827.1  31      .       G       .       33      .       DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=50;FQ=-30       PL      0
AGTP01133827.1  32      .       C       .       33      .       DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=50;FQ=-30       PL      0
AGTP01133827.1  33      .       T       .       36      .       DP=2;AF1=0;AC1=0;DP4=1,1,0,0;MQ=50;FQ=-33       PL      0
AGTP01133827.1  34      .       A       .       36      .       DP=2;AF1=0;AC1=0;DP4=1,1,0,0;MQ=50;FQ=-33       PL      0
AGTP01133827.1  35      .       G       .       36      .       DP=2;AF1=0;AC1=0;DP4=1,1,0,0;MQ=50;FQ=-33       PL      0
So, gringer, what is the cause of my problem?
1. the bam file i used? for reads didn't cover each base in reference sequence.
2. or the parameters i used in "samtools mpileup"?

Great thanks!!

Last edited by xmubingo; 01-12-2014 at 04:51 PM.
xmubingo is offline   Reply With Quote
Old 01-12-2014, 05:42 PM   #14
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838
Default

Quote:
Originally Posted by xmubingo View Post
1. the bam file i used? for reads didn't cover each base in reference sequence.
2. or the parameters i used in "samtools mpileup"?
This looks like it could be a bug in samtools mpileup, in not reporting all locations when a fasta file is specified. Does the mpileup output include all locations in the fasta file?

Code:
samtools mpileup -f ${refFile} ${uniquelyMappedBAMFile} | less
gringer is offline   Reply With Quote
Old 01-12-2014, 06:06 PM   #15
xmubingo
Member
 
Location: China guangzhou

Join Date: Sep 2013
Posts: 13
Default

Quote:
Originally Posted by gringer View Post
This looks like it could be a bug in samtools mpileup, in not reporting all locations when a fasta file is specified. Does the mpileup output include all locations in the fasta file?

Code:
samtools mpileup -f ${refFile} ${uniquelyMappedBAMFile} | less
No, you can see:
Code:
AGTP01133720.1  23292   A       1       ,       >
AGTP01133720.1  23293   A       1       ,       ;
AGTP01133720.1  23294   A       1       ,       9
AGTP01133720.1  23295   A       1       ,$      8
AGTP01133827.1  30      G       1       ^S.     C
AGTP01133827.1  31      G       1       .       C
AGTP01133827.1  32      C       1       .       C
AGTP01133827.1  33      T       2       .^S,    F>
AGTP01133827.1  34      A       2       .,      FG
AGTP01133827.1  35      G       2       .,      FI
xmubingo is offline   Reply With Quote
Old 01-12-2014, 06:57 PM   #16
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838
Default

I've filed a bug for this in github:

https://github.com/samtools/samtools/issues/112

In the meantime, you'll either have to use my script with the reference fasta file (in which case INDELs will change the length of the generated sequence), or add a dummy SAM line for each sequence in the reference fasta file that covers the entire range of the sequence.
gringer is offline   Reply With Quote
Old 01-13-2014, 10:16 PM   #17
xmubingo
Member
 
Location: China guangzhou

Join Date: Sep 2013
Posts: 13
Default

Quote:
Originally Posted by gringer View Post
I've filed a bug for this in github:

https://github.com/samtools/samtools/issues/112

In the meantime, you'll either have to use my script with the reference fasta file (in which case INDELs will change the length of the generated sequence), or add a dummy SAM line for each sequence in the reference fasta file that covers the entire range of the sequence.
Hi gringer, Thanks a lot!! I will try your code.
xmubingo is offline   Reply With Quote
Old 01-15-2014, 10:53 PM   #18
xmubingo
Member
 
Location: China guangzhou

Join Date: Sep 2013
Posts: 13
Default

Quote:
Originally Posted by gringer View Post
I've filed a bug for this in github:

https://github.com/samtools/samtools/issues/112

In the meantime, you'll either have to use my script with the reference fasta file (in which case INDELs will change the length of the generated sequence), or add a dummy SAM line for each sequence in the reference fasta file that covers the entire range of the sequence.
Hi gringer, i think find a way to solve this problem. Although the consensus sequence(cns.fa -> cns.fa) generated by vcf2fq doesn't have same length as it in reference sequence. but i find the sequence in cns.fa just lacks of some 'n' in its tail. So, we can add some 'n' to the sequences to make its length equal to its original length. for example:

cns.fa
Code:
>seq1
nnnnnnnnnnnnnnnAAAATTTTTCCCCGGGGgggccccGGTTTg
cns.fa.fixed
Code:
>seq1
nnnnnnnnnnnnnnnAAAATTTTTCCCCGGGGgggccccGGTTTgnnnnnnnnnnnnn
xmubingo is offline   Reply With Quote
Old 12-09-2014, 06:03 PM   #19
jiewencai
Junior Member
 
Location: China

Join Date: Oct 2012
Posts: 3
Smile It's amazing

Quote:
Originally Posted by gringer View Post
I have a slightly different processing script, but the command you have seems to work for me on my mitochondrial data (i.e. it produces a fastq file at the end of it).

I've attached my modified vcf2fq script to this post [also the fairly trivial fastq2fasta], maybe it will help to diagnose the problem.
The original vcf2fq can't producing sequence containing indels, and it's really bother me a lot,but this one can produce sequence containing indels,that's amazing.
jiewencai is offline   Reply With Quote
Old 11-17-2015, 04:43 PM   #20
Yuhuan Meng
Junior Member
 
Location: Guangzhou

Join Date: Nov 2015
Posts: 2
Default

I have a problem, if the new genome.fa contain indels, the position would be changed, if I call the genes with gtf from new genome, the called genes will not with the same regions compared with original genes in the original genome.
Yuhuan Meng is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:13 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO