Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Bgansw
    Member
    • Nov 2011
    • 24

    Illumina - mapping genome on reference - how to extract assembly?

    Hello,

    I'm having trouble extracting assembled reads after using soapaligner.

    I mapped my raw reads to my reference genome using soap aligner, how do I then extract the assembly and submit for annotation? I tried converting soap to sam format, then using,

    # This is for extracting mapping reads from bam files
    samtools view -F4 input.bam > /data/bowtieOutput/mapped.sam

    converted sam to bam and bam to fastq, then fastq to fasta -
    I need a genebank file to upload sequences to http://rast.nmpdr.org/rast.cgi

    Please help!

    bgansw
  • jimmybee
    Senior Member
    • Sep 2010
    • 119

    #2
    Technically thats not really an assembly.

    TO get a consensus sequence, you should look at the variant calling section of the samtools manual (http://samtools.sourceforge.net/mpileup.shtml):

    samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq

    Comment

    • gringer
      David Eccles (gringer)
      • May 2011
      • 845

      #3
      Originally posted by jimmybee View Post
      To get a consensus sequence, you should look at the variant calling section of the samtools manual (http://samtools.sourceforge.net/mpileup.shtml):
      But bear in mind that the vcf2fq script is designed for SNPs, not INDELs. If there is an INDEL in your sequence relative to the reference, then the INDEL and a few flanking bases will be changed to lower case, but not replaced. This means that any fastq file generated from this script will have the same length as the input reference sequence.

      I sent a patch to their sourceforge page to fix this (and allow more compact partial vcf files with a provided reference sequence), but I don't think it's been implemented yet.

      Comment

      • Bgansw
        Member
        • Nov 2011
        • 24

        #4
        Hello Jimmybee and Gringer,

        Thanx for your guidance with this.

        I've been trying the commands (samtools, bcftools).. and I'm getting an error:

        bgansw@BioinfoSecond:~/bwa/samtools-0.1.18$ samtools mpileup -uf 533240.5.fna N3_on505.bam |bcftools/bcftools view -cg - |bcftools/vcfutils.pl vcf2fq > N3_on_505_assembly_extracted.fastq
        [mpileup] 1 samples in 1 input files
        <mpileup> Set max per-file depth to 8000
        [afs] 0:0.000 1:0.000 2:0.000
        Use of uninitialized value in length at bcftools/vcfutils.pl line 544, <> line 27.
        Use of uninitialized value in length at bcftools/vcfutils.pl line 544, <> line 27.

        I've tried installing samtools 1.19, and running this.. but I keep getting the same error.

        Do you'll know what I'm doing wrong??

        Really really appreciate your help. Thanx v v much

        Bgansw

        Comment

        • gringer
          David Eccles (gringer)
          • May 2011
          • 845

          #5
          Originally posted by Bgansw View Post

          Code:
          bgansw@BioinfoSecond:~/bwa/samtools-0.1.18$ samtools mpileup -uf 533240.5.fna N3_on505.bam |bcftools/bcftools view -cg - |bcftools/vcfutils.pl vcf2fq > N3_on_505_assembly_extracted.fastq
          ...
          Use of uninitialized value in length at bcftools/vcfutils.pl line 544, <> line 27.
          Use of uninitialized value in length at bcftools/vcfutils.pl line 544, <> line 27.
          I recommend checking your file names. I get an error one line after this when I give an incorrect fasta file name to samtools mpileup. There is a chance that your fasta file is called
          Code:
          533240.5.fa
          rather than
          Code:
          533240.5.f[b][i]n[/i][/b]a

          Comment

          • Bgansw
            Member
            • Nov 2011
            • 24

            #6
            Hi Gringer,

            Thanx f the tip. I've gone over all the file names, and have changed the 53340.5.fna to 533240.5.fa but I'm still getting the same error. Does this command work well for you? I'm trying to figure out what the problem could be.

            Do u know if theres any other way to extract the consensus sequence after mapping one genome on another using soap aligner?

            Comment

            • gringer
              David Eccles (gringer)
              • May 2011
              • 845

              #7
              Does this command work well for you?
              I have a slightly different processing script, but the command you have seems to work for me on my mitochondrial data (i.e. it produces a fastq file at the end of it). Here's the actual command sequence that I run for my processing:

              Code:
                  echo "** getting VCF for ${y} (max depth = ${maxdepth}) **";
                  pv results/${y}.bam | samtools mpileup -L ${maxdepth} -uf db/hs_mtDNA_ref.fasta - | \
                      bcftools view -v -g - > results/${y}.vcf
                  echo "** getting consensus FASTQ for ${y} **";
                  perl scripts/vcf2fq.pl -Q 20 -L 20 -f db/hs_mtDNA_ref.fasta results/${y}.vcf > results/con_${y}.fastq
                  echo "** converting ${y} consensus FASTQ to plain FASTA **";
                  perl scripts/fastq2fasta.pl results/con_${y}.fastq | perl -pe 'tr/acgt/ACGT/' > results/con_${y}.fasta
              I've attached my modified vcf2fq script to this post [also the fairly trivial fastq2fasta], maybe it will help to diagnose the problem.
              Attached Files

              Comment

              • xmubingo
                Member
                • Sep 2013
                • 13

                #8
                Originally posted by gringer View Post
                But bear in mind that the vcf2fq script is designed for SNPs, not INDELs. If there is an INDEL in your sequence relative to the reference, then the INDEL and a few flanking bases will be changed to lower case, but not replaced. This means that any fastq file generated from this script will have the same length as the input reference sequence.

                I sent a patch to their sourceforge page to fix this (and allow more compact partial vcf files with a provided reference sequence), but I don't think it's been implemented yet.
                Hi gringer, I got a problem, i found that fastq file generated from vcf2fq have the different length as the input reference sequence. Do you have any update news about vcf2fq ?

                Comment

                • gringer
                  David Eccles (gringer)
                  • May 2011
                  • 845

                  #9
                  Hi gringer, I got a problem, i found that fastq file generated from vcf2fq have the different length as the input reference sequence. Do you have any update news about vcf2fq ?
                  Is this with the standard vcftools script, or the modified version that I wrote? What is the length difference? My modified version will produce a different sequence length because of INDELs -- any insertion or deletion will change the length of the sequence.

                  Comment

                  • xmubingo
                    Member
                    • Sep 2013
                    • 13

                    #10
                    Originally posted by gringer View Post
                    Is this with the standard vcftools script, or the modified version that I wrote? What is the length difference? My modified version will produce a different sequence length because of INDELs -- any insertion or deletion will change the length of the sequence.
                    hi gringer, thanks for your reply. I use standard vcfutils.pl of samtools(Version: 0.1.19-44428cd). Here is my code:

                    Code:
                    samtools mpileup -uf ${refFile} ${uniquelyMappedBAMFile} > ${mpileupFile}
                    bcftools view -cg ${mpileupFile} | vcfutils.pl vcf2fq > ${cnsFileFQ}
                    seqtk seq -l 0 -A ${cnsFileFQ} > ${cnsFileFA}
                    the cns.fq file contains "A,T,C,G,a,t,c,g,n". but i check length of sequences in fastq file, it doesn't same with length in reference sequence. so i guess the error coming from command "bcftools view -cg ${mpileupFile} | vcfutils.pl vcf2fq > ${cnsFileFQ}". i will check the command again.
                    Last edited by xmubingo; 01-12-2014, 04:26 PM.

                    Comment

                    • gringer
                      David Eccles (gringer)
                      • May 2011
                      • 845

                      #11
                      Is this behaviour consistent regardless of what BAM file you use?

                      Your command line looks correct (it at least matches the signature of the samtools man page), so the next thing I would check would be inputs/outputs. Note that the current vcfutils vcf2fq implementation needs every base to be specified, so the output of bcftools should be one "variant" line per base. Unfortunately I can't really help you on this much more without actual examples.

                      Comment

                      • Tong.W
                        Member
                        • Jul 2012
                        • 14

                        #12
                        Do not use a reference sequence to assembly. They are different. If you want to get a consistent after soap,you can use SOAPsnp. This software will generate a new consistent sequence for you.

                        Comment

                        • xmubingo
                          Member
                          • Sep 2013
                          • 13

                          #13
                          Originally posted by gringer View Post
                          Is this behaviour consistent regardless of what BAM file you use?

                          Your command line looks correct (it at least matches the signature of the samtools man page), so the next thing I would check would be inputs/outputs. Note that the current vcfutils vcf2fq implementation needs every base to be specified, so the output of bcftools should be one "variant" line per base. Unfortunately I can't really help you on this much more without actual examples.
                          Thanks, gringer!! I think you figure out the problem. I use the following command to check the input of vcf2fq. I found that the output of bcftools did not cover each base.
                          Code:
                          bcftools view -cg ${mpileupFile} | more
                          For example, i pick up several lines of bcftools ouput. You can see that contig "AGTP01133720.1" ends in position of 23,295bp, but actually "AGTP01133720.1" has a length of 23,339bp. More clearly, contig "AGTP01133827.1" starts in position of 30bp, not in the beginning.
                          Code:
                          AGTP01133720.1  23292   .       A       .       33      .       DP=1;AF1=0;AC1=0;DP4=0,1,0,0;MQ=50;FQ=-30       PL      0
                          AGTP01133720.1  23293   .       A       .       33      .       DP=1;AF1=0;AC1=0;DP4=0,1,0,0;MQ=50;FQ=-30       PL      0
                          AGTP01133720.1  23294   .       A       .       33      .       DP=1;AF1=0;AC1=0;DP4=0,1,0,0;MQ=50;FQ=-30       PL      0
                          AGTP01133720.1  23295   .       A       .       33      .       DP=1;AF1=0;AC1=0;DP4=0,1,0,0;MQ=50;FQ=-30       PL      0
                          AGTP01133827.1  30      .       G       .       33      .       DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=50;FQ=-30       PL      0
                          AGTP01133827.1  31      .       G       .       33      .       DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=50;FQ=-30       PL      0
                          AGTP01133827.1  32      .       C       .       33      .       DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=50;FQ=-30       PL      0
                          AGTP01133827.1  33      .       T       .       36      .       DP=2;AF1=0;AC1=0;DP4=1,1,0,0;MQ=50;FQ=-33       PL      0
                          AGTP01133827.1  34      .       A       .       36      .       DP=2;AF1=0;AC1=0;DP4=1,1,0,0;MQ=50;FQ=-33       PL      0
                          AGTP01133827.1  35      .       G       .       36      .       DP=2;AF1=0;AC1=0;DP4=1,1,0,0;MQ=50;FQ=-33       PL      0
                          So, gringer, what is the cause of my problem?
                          1. the bam file i used? for reads didn't cover each base in reference sequence.
                          2. or the parameters i used in "samtools mpileup"?

                          Great thanks!!
                          Last edited by xmubingo; 01-12-2014, 05:51 PM.

                          Comment

                          • gringer
                            David Eccles (gringer)
                            • May 2011
                            • 845

                            #14
                            Originally posted by xmubingo View Post
                            1. the bam file i used? for reads didn't cover each base in reference sequence.
                            2. or the parameters i used in "samtools mpileup"?
                            This looks like it could be a bug in samtools mpileup, in not reporting all locations when a fasta file is specified. Does the mpileup output include all locations in the fasta file?

                            Code:
                            samtools mpileup -f ${refFile} ${uniquelyMappedBAMFile} | less

                            Comment

                            • xmubingo
                              Member
                              • Sep 2013
                              • 13

                              #15
                              Originally posted by gringer View Post
                              This looks like it could be a bug in samtools mpileup, in not reporting all locations when a fasta file is specified. Does the mpileup output include all locations in the fasta file?

                              Code:
                              samtools mpileup -f ${refFile} ${uniquelyMappedBAMFile} | less
                              No, you can see:
                              Code:
                              AGTP01133720.1  23292   A       1       ,       >
                              AGTP01133720.1  23293   A       1       ,       ;
                              AGTP01133720.1  23294   A       1       ,       9
                              AGTP01133720.1  23295   A       1       ,$      8
                              AGTP01133827.1  30      G       1       ^S.     C
                              AGTP01133827.1  31      G       1       .       C
                              AGTP01133827.1  32      C       1       .       C
                              AGTP01133827.1  33      T       2       .^S,    F>
                              AGTP01133827.1  34      A       2       .,      FG
                              AGTP01133827.1  35      G       2       .,      FI

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                06-02-2026, 10:05 AM
                              • SEQadmin2
                                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                by SEQadmin2


                                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                Introduction

                                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                05-22-2026, 06:42 AM
                              • SEQadmin2
                                Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                by SEQadmin2

                                Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                05-06-2026, 09:04 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Yesterday, 08:59 AM
                              0 responses
                              14 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 12:03 PM
                              0 responses
                              22 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 11:40 AM
                              0 responses
                              19 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-28-2026, 11:40 AM
                              0 responses
                              32 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...