Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Illumina - mapping genome on reference - how to extract assembly?

    Hello,

    I'm having trouble extracting assembled reads after using soapaligner.

    I mapped my raw reads to my reference genome using soap aligner, how do I then extract the assembly and submit for annotation? I tried converting soap to sam format, then using,

    # This is for extracting mapping reads from bam files
    samtools view -F4 input.bam > /data/bowtieOutput/mapped.sam

    converted sam to bam and bam to fastq, then fastq to fasta -
    I need a genebank file to upload sequences to http://rast.nmpdr.org/rast.cgi

    Please help!

    bgansw

  • #2
    Technically thats not really an assembly.

    TO get a consensus sequence, you should look at the variant calling section of the samtools manual (http://samtools.sourceforge.net/mpileup.shtml):

    samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq

    Comment


    • #3
      Originally posted by jimmybee View Post
      To get a consensus sequence, you should look at the variant calling section of the samtools manual (http://samtools.sourceforge.net/mpileup.shtml):
      But bear in mind that the vcf2fq script is designed for SNPs, not INDELs. If there is an INDEL in your sequence relative to the reference, then the INDEL and a few flanking bases will be changed to lower case, but not replaced. This means that any fastq file generated from this script will have the same length as the input reference sequence.

      I sent a patch to their sourceforge page to fix this (and allow more compact partial vcf files with a provided reference sequence), but I don't think it's been implemented yet.

      Comment


      • #4
        Hello Jimmybee and Gringer,

        Thanx for your guidance with this.

        I've been trying the commands (samtools, bcftools).. and I'm getting an error:

        bgansw@BioinfoSecond:~/bwa/samtools-0.1.18$ samtools mpileup -uf 533240.5.fna N3_on505.bam |bcftools/bcftools view -cg - |bcftools/vcfutils.pl vcf2fq > N3_on_505_assembly_extracted.fastq
        [mpileup] 1 samples in 1 input files
        <mpileup> Set max per-file depth to 8000
        [afs] 0:0.000 1:0.000 2:0.000
        Use of uninitialized value in length at bcftools/vcfutils.pl line 544, <> line 27.
        Use of uninitialized value in length at bcftools/vcfutils.pl line 544, <> line 27.

        I've tried installing samtools 1.19, and running this.. but I keep getting the same error.

        Do you'll know what I'm doing wrong??

        Really really appreciate your help. Thanx v v much

        Bgansw

        Comment


        • #5
          Originally posted by Bgansw View Post

          Code:
          bgansw@BioinfoSecond:~/bwa/samtools-0.1.18$ samtools mpileup -uf 533240.5.fna N3_on505.bam |bcftools/bcftools view -cg - |bcftools/vcfutils.pl vcf2fq > N3_on_505_assembly_extracted.fastq
          ...
          Use of uninitialized value in length at bcftools/vcfutils.pl line 544, <> line 27.
          Use of uninitialized value in length at bcftools/vcfutils.pl line 544, <> line 27.
          I recommend checking your file names. I get an error one line after this when I give an incorrect fasta file name to samtools mpileup. There is a chance that your fasta file is called
          Code:
          533240.5.fa
          rather than
          Code:
          533240.5.f[b][i]n[/i][/b]a

          Comment


          • #6
            Hi Gringer,

            Thanx f the tip. I've gone over all the file names, and have changed the 53340.5.fna to 533240.5.fa but I'm still getting the same error. Does this command work well for you? I'm trying to figure out what the problem could be.

            Do u know if theres any other way to extract the consensus sequence after mapping one genome on another using soap aligner?

            Comment


            • #7
              Does this command work well for you?
              I have a slightly different processing script, but the command you have seems to work for me on my mitochondrial data (i.e. it produces a fastq file at the end of it). Here's the actual command sequence that I run for my processing:

              Code:
                  echo "** getting VCF for ${y} (max depth = ${maxdepth}) **";
                  pv results/${y}.bam | samtools mpileup -L ${maxdepth} -uf db/hs_mtDNA_ref.fasta - | \
                      bcftools view -v -g - > results/${y}.vcf
                  echo "** getting consensus FASTQ for ${y} **";
                  perl scripts/vcf2fq.pl -Q 20 -L 20 -f db/hs_mtDNA_ref.fasta results/${y}.vcf > results/con_${y}.fastq
                  echo "** converting ${y} consensus FASTQ to plain FASTA **";
                  perl scripts/fastq2fasta.pl results/con_${y}.fastq | perl -pe 'tr/acgt/ACGT/' > results/con_${y}.fasta
              I've attached my modified vcf2fq script to this post [also the fairly trivial fastq2fasta], maybe it will help to diagnose the problem.
              Attached Files

              Comment


              • #8
                Originally posted by gringer View Post
                But bear in mind that the vcf2fq script is designed for SNPs, not INDELs. If there is an INDEL in your sequence relative to the reference, then the INDEL and a few flanking bases will be changed to lower case, but not replaced. This means that any fastq file generated from this script will have the same length as the input reference sequence.

                I sent a patch to their sourceforge page to fix this (and allow more compact partial vcf files with a provided reference sequence), but I don't think it's been implemented yet.
                Hi gringer, I got a problem, i found that fastq file generated from vcf2fq have the different length as the input reference sequence. Do you have any update news about vcf2fq ?

                Comment


                • #9
                  Hi gringer, I got a problem, i found that fastq file generated from vcf2fq have the different length as the input reference sequence. Do you have any update news about vcf2fq ?
                  Is this with the standard vcftools script, or the modified version that I wrote? What is the length difference? My modified version will produce a different sequence length because of INDELs -- any insertion or deletion will change the length of the sequence.

                  Comment


                  • #10
                    Originally posted by gringer View Post
                    Is this with the standard vcftools script, or the modified version that I wrote? What is the length difference? My modified version will produce a different sequence length because of INDELs -- any insertion or deletion will change the length of the sequence.
                    hi gringer, thanks for your reply. I use standard vcfutils.pl of samtools(Version: 0.1.19-44428cd). Here is my code:

                    Code:
                    samtools mpileup -uf ${refFile} ${uniquelyMappedBAMFile} > ${mpileupFile}
                    bcftools view -cg ${mpileupFile} | vcfutils.pl vcf2fq > ${cnsFileFQ}
                    seqtk seq -l 0 -A ${cnsFileFQ} > ${cnsFileFA}
                    the cns.fq file contains "A,T,C,G,a,t,c,g,n". but i check length of sequences in fastq file, it doesn't same with length in reference sequence. so i guess the error coming from command "bcftools view -cg ${mpileupFile} | vcfutils.pl vcf2fq > ${cnsFileFQ}". i will check the command again.
                    Last edited by xmubingo; 01-12-2014, 04:26 PM.

                    Comment


                    • #11
                      Is this behaviour consistent regardless of what BAM file you use?

                      Your command line looks correct (it at least matches the signature of the samtools man page), so the next thing I would check would be inputs/outputs. Note that the current vcfutils vcf2fq implementation needs every base to be specified, so the output of bcftools should be one "variant" line per base. Unfortunately I can't really help you on this much more without actual examples.

                      Comment


                      • #12
                        Do not use a reference sequence to assembly. They are different. If you want to get a consistent after soap,you can use SOAPsnp. This software will generate a new consistent sequence for you.

                        Comment


                        • #13
                          Originally posted by gringer View Post
                          Is this behaviour consistent regardless of what BAM file you use?

                          Your command line looks correct (it at least matches the signature of the samtools man page), so the next thing I would check would be inputs/outputs. Note that the current vcfutils vcf2fq implementation needs every base to be specified, so the output of bcftools should be one "variant" line per base. Unfortunately I can't really help you on this much more without actual examples.
                          Thanks, gringer!! I think you figure out the problem. I use the following command to check the input of vcf2fq. I found that the output of bcftools did not cover each base.
                          Code:
                          bcftools view -cg ${mpileupFile} | more
                          For example, i pick up several lines of bcftools ouput. You can see that contig "AGTP01133720.1" ends in position of 23,295bp, but actually "AGTP01133720.1" has a length of 23,339bp. More clearly, contig "AGTP01133827.1" starts in position of 30bp, not in the beginning.
                          Code:
                          AGTP01133720.1  23292   .       A       .       33      .       DP=1;AF1=0;AC1=0;DP4=0,1,0,0;MQ=50;FQ=-30       PL      0
                          AGTP01133720.1  23293   .       A       .       33      .       DP=1;AF1=0;AC1=0;DP4=0,1,0,0;MQ=50;FQ=-30       PL      0
                          AGTP01133720.1  23294   .       A       .       33      .       DP=1;AF1=0;AC1=0;DP4=0,1,0,0;MQ=50;FQ=-30       PL      0
                          AGTP01133720.1  23295   .       A       .       33      .       DP=1;AF1=0;AC1=0;DP4=0,1,0,0;MQ=50;FQ=-30       PL      0
                          AGTP01133827.1  30      .       G       .       33      .       DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=50;FQ=-30       PL      0
                          AGTP01133827.1  31      .       G       .       33      .       DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=50;FQ=-30       PL      0
                          AGTP01133827.1  32      .       C       .       33      .       DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=50;FQ=-30       PL      0
                          AGTP01133827.1  33      .       T       .       36      .       DP=2;AF1=0;AC1=0;DP4=1,1,0,0;MQ=50;FQ=-33       PL      0
                          AGTP01133827.1  34      .       A       .       36      .       DP=2;AF1=0;AC1=0;DP4=1,1,0,0;MQ=50;FQ=-33       PL      0
                          AGTP01133827.1  35      .       G       .       36      .       DP=2;AF1=0;AC1=0;DP4=1,1,0,0;MQ=50;FQ=-33       PL      0
                          So, gringer, what is the cause of my problem?
                          1. the bam file i used? for reads didn't cover each base in reference sequence.
                          2. or the parameters i used in "samtools mpileup"?

                          Great thanks!!
                          Last edited by xmubingo; 01-12-2014, 05:51 PM.

                          Comment


                          • #14
                            Originally posted by xmubingo View Post
                            1. the bam file i used? for reads didn't cover each base in reference sequence.
                            2. or the parameters i used in "samtools mpileup"?
                            This looks like it could be a bug in samtools mpileup, in not reporting all locations when a fasta file is specified. Does the mpileup output include all locations in the fasta file?

                            Code:
                            samtools mpileup -f ${refFile} ${uniquelyMappedBAMFile} | less

                            Comment


                            • #15
                              Originally posted by gringer View Post
                              This looks like it could be a bug in samtools mpileup, in not reporting all locations when a fasta file is specified. Does the mpileup output include all locations in the fasta file?

                              Code:
                              samtools mpileup -f ${refFile} ${uniquelyMappedBAMFile} | less
                              No, you can see:
                              Code:
                              AGTP01133720.1  23292   A       1       ,       >
                              AGTP01133720.1  23293   A       1       ,       ;
                              AGTP01133720.1  23294   A       1       ,       9
                              AGTP01133720.1  23295   A       1       ,$      8
                              AGTP01133827.1  30      G       1       ^S.     C
                              AGTP01133827.1  31      G       1       .       C
                              AGTP01133827.1  32      C       1       .       C
                              AGTP01133827.1  33      T       2       .^S,    F>
                              AGTP01133827.1  34      A       2       .,      FG
                              AGTP01133827.1  35      G       2       .,      FI

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              9 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X