Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Consensus FASTA from BAM files

    Hi,

    I'm currently trying to get a consensus sequence that is specific to individual humans, from the 1000 genomes project data available in BAM format at ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/

    I want to do multiple sequence alignments of certain loci between individuals, so I need the individual sequence in FASTA format. Currently, I've tried to do samtools pileup -c, followed by conversion to fastq (samtools) and then to fasta (maq tools):

    samtools pileup -c -f /gcm/opt/genomes/hg18-chromosomes.fa ${file}.bam > ${file}.pileup
    /opt/samtools/samtools-0.1.7a/misc/samtools.pl pileup2fq ${file}.pileup > ${file}.fastq
    /opt/maq/maq-0.7.1/bin/fq_all2std.pl fq2fa ${file}.fastq > ${file}.fasta

    The FASTQ generated from consensus pileup seems ok, although the whole sequence is on a single line. Using MAQ's fq2fa, however, this is converted into a much smaller FASTA file, with quality score data instead of sequence in there.

    So, any ideas what I may be doing wrong or alternative tool suggestions to accomplish this task, in case the converters are buggy, would be much appreciated.

    Thanks!

  • #2
    Hi,

    Take a look at this post, which may be helpful to your situation:



    Douglas

    Comment


    • #3
      good! thanks very much!

      Comment


      • #4
        I have complete genomics data which I converted to BAM using CGA tools. However the fastq conversion from .bam is giving an error "too many .bam files". This I believe is due to large size of the bam(~660 GB). The bam is large mainly due to numerous N's in the file(complete genomics data). Can these N's be cropped to make the file smaller for execution by samtools?

        Comment


        • #5
          My understanding was to use :
          samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq

          With the reference from 1000 genomes being hs37d5.fa and the BAM file for the specific patient downloaded.

          Does that command get you a consensus sequence though (once you convert fq to fa)?

          And is there a way to just get a consensus chromosome (example chr4) instead of processing the whole thing?

          Comment


          • #6
            FQ to FA?

            Okay, this command indeed seems to produce a decent FQ file:

            samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq

            I keep reading conflicting posts (fastx, awk, fastq2fasta.pl,etc.) on how to convert that FQ output to FASTA (fa file)

            My question
            ::::
            Given my source is from 1000 genomes, what would be the most reliable way of changing that FQ file to FA?
            ::::

            Keep in mind, my goal is to get a consensus chromosome.

            Comment


            • #7
              Originally posted by Gabeloooooo View Post

              My question
              ::::
              Given my source is from 1000 genomes, what would be the most reliable way of changing that FQ file to FA?
              ::::

              Keep in mind, my goal is to get a consensus chromosome.
              flexlex posted this tool from Heng Li in a recent thread that can do the conversion you are looking for.



              Look at the examples included on that page.

              Comment


              • #8
                Thank you! Will try it out and report back on success/failure

                Comment


                • #9
                  Conversion seems to work fine. Well, it gives no errors and data 'looks' good

                  Is there a way to get the consensus sequence aligned with the reference?

                  Say the reference FASTA file has 90,354,753 bp and the ouput from
                  'samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq'
                  gives a file with 243,189,260 bp

                  My question
                  ::::
                  Is there a process where I could end up with a consensus sequence that is also 90,354,753 bp (same as reference)?
                  ::::


                  For this particular case, I don't care if getting this result discards a lot of info about the patient.

                  Comment


                  • #10
                    How to generate consensus sequence using BAM file?

                    Hi! I have a BAM file, which is generated by using BWA, and I'd like to generate a consensus sequence or a set of contigs. How can I generate a consensus sequence from BAM file? Which tools can do it?
                    Thanks!

                    Comment


                    • #11
                      I have found that getting a fastq out of vcfutils is pretty much never what I want. So I altered the program so it will output fasta instead. Happily, it's a perl program, which means that it is a plain ordinary text file, so it's easy to modify.

                      You are looking for these lines

                      Code:
                        print "\@$chr\n"; &v2q_print_str($seq);
                        print "+\n"; &v2q_print_str($qual)
                      Change them to this

                      Code:
                        print "\[COLOR="Red"]>[/COLOR]$chr\n"; &v2q_print_str($seq);
                        [COLOR="Red"]#[/COLOR]print "+\n"; &v2q_print_str($qual)
                      You are changing the @ of fastq format to the > of fasta format, and the # means "Skip this line". That's the line where the script would print the quality scores.

                      Comment


                      • #12
                        Dear swbarnes2,
                        Can the program you said generate consensus sequence file from alignment file(BAM file)? And how can I get the perl script? Can you send it to me via e-mail? My e-mail adress is [email protected].I'm looking forward your reply as soon as possible. Thanks very much!

                        Comment


                        • #13
                          Originally posted by binlangman View Post
                          Dear swbarnes2,
                          Can the program you said generate consensus sequence file from alignment file(BAM file)? And how can I get the perl script? Can you send it to me via e-mail? My e-mail adress is [email protected].I'm looking forward your reply as soon as possible. Thanks very much!
                          @swbarnes2 was referring to editing the "samtools-0.1.19/bin/vcfutils.pl" script (that you will find in the samtools package). The lines quoted are almost at the end of that script file.

                          Comment


                          • #14
                            generate consensus from a BAM file

                            I run the following command in oder to generate consensus from a BAM file:
                            samtools mpileup -uf NC_010473.fasta sample.sorted.bam | bcftools view -cg - | vcfutils.pl vcf2fq > consensus.fq
                            Why is the output not a fastq file? And the output looks strange. The format of output is as follows:
                            @NC_010473
                            AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTC
                            TGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGG
                            TCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTAC
                            ACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGT
                            AACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGG
                            CTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGT
                            ACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
                            ...............................................................
                            Why does the result look so strange?

                            Comment


                            • #15
                              How is that not a fastq file? It's just multiline. The first line is the name, beginning with "@". Then follows the DNA sequence. Then you'll see a +, maybe the name will repeat, and then you'll get the quality string.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              12 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              51 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              68 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X