Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Convert 1000-Genomes-proje BAM to FASTA (aligned to reference, grouped by chromosome)

    Hi,

    I would like to use some data from the 1000 Genome project. As I understand, the BAM files for the individuals contain difference-reads to a reference sequence (HG18/HG19).

    Now, given the reference sequence as (compressed) FASTA, one file for each chromosome, and the BAM file for let's say NA12283, I would like to obtain the FASTA files for each chromosome of NA12283. Is this possible, and how?

    I tried to download a bulk of (s/b)am2X tools, but they only extract all the reads from the BAM file and write them into one FASTA file, without (re-)creating the actual chromosomes with respect to the reference.

    Thanks and best Regards,
    Daniel

  • #2
    I'm not quite sure what you're asking. BAM files contain reads (mapped and potentially unmapped) to a reference sequence. You indicate you already have the reference, so is the question "How can I extract reads from individual chromosomes from a BAM file?". If not, then this answer might be no use

    You could just subset the bam file (http://www.1000genomes.org/faq/how-d...ction-bam-file) by chromosome and then extract the reads with bam2fastq (http://www.hudsonalpha.org/gsl/software/bam2fastq.php) I'm assuming you want fastq and not fasta, as if you convert to fasta you will lose quality information. If you do want fasta then the fastq->fasta conversion is trivial and implemented in many forms.

    Comment


    • #3
      Well, maybe I am just not quite sure, whether what I want to do really makes sense

      For instance, given reference Chromosome 1 (of a/the reference genome), and some reads of human NA12283 from the 1000 Genome project, I just ask myself: "What does the Chromosome 1 of NA12283 look like". I would assume (maybe this is my mistake?) that I can just overlay all the mapped reads to the reference Chromosome 1 and then obtain "the" Chromosome 1 of NA12283 - at least roughly.

      If all this makes (biologically) sense, then I would be interested in the FASTA file of Chromosome 1 of NA12283 - computed from the reference chromosome and the mapped reads.

      About your solution: Once converted to FASTQ - don't I lose all the mapping information? I really want to *combine* the reads and the corresponding reference chromosome, not only convert reads to reads.

      Comment


      • #4
        Ah, so you want to create a consensus sequence from the mapped reads per chromosome for a given individual?

        So you could subset your bam's by chromosome and then do something like;

        samtools pileup -cf ref.fa aln.bam | samtools.pl pileup2fq -D100 > cns.fastq

        Comment


        • #5
          For 4X coverage, there is essentially no way to generate a good consensus.

          Comment


          • #6
            Originally posted by Bukowski View Post
            Ah, so you want to create a consensus sequence from the mapped reads per chromosome for a given individual?

            So you could subset your bam's by chromosome and then do something like;

            samtools pileup -cf ref.fa aln.bam | samtools.pl pileup2fq -D100 > cns.fastq
            I tried to run this but then I realized pileup was changed to mpileup and -c is not supported. so i run mpileup -uf, but I get may errors.

            Comment


            • #7
              perhaps you can use the variant calls from this sample along with the reference sequence to sort of come up with 'the' sequence for this individual
              --
              bioinfosm

              Comment


              • #8
                I'm trying to do the exact same thing! Did you ever find a way to get a decent fasta complete chromosome from one of the 1000 genomes samples?

                My understanding is the reference they use is hs37d5.fa.

                So, can you use a BAM/BAI file combo from patient HG00XXX and map the reads to hs37d5.fa to get the rough 'genome' of patient HG00XXX?

                Comment


                • #9
                  See here for one way to get a .vcf file with SNPs and indels from the .bam file, or a consensus sequence:


                  The consensus sequence generated by this method has the problem that it only applies the SNPs to the reference sequence, but not the indels.
                  The .vcf file is better since it includes both SNPs and indels.

                  The .vcf file can be converted to a .fasta sequence using this tool:

                  However, note that this tool will only take into account indels of length up to 2 bases (as of January 2013). You may want to write your own script to insert all the indels (including the longer ones) from the .vcf into the .fasta.

                  This method should get the whole sequence from the .bam file, however, I don't know how to extract individual chromosomes from it.

                  Comment


                  • #10
                    The mpileup method seems to work... I've two questions.

                    1. Is there a way to know how many SNPs I should expect in this 'consensus' sequence? Say I want to know how many SNPs HGxxxx has in his chromosome 2, etc.

                    2. When you say the VCF file is better, is there another way to get a VCF from 1000 genomes, other than using the mpileup method?

                    I was under the impression the ftp site only contained BAM files?

                    Thanks a lot!

                    Comment


                    • #11
                      1. There could be, but I don't know of one. I usually just check if the last position numbers in the .vcf file are close to the number of base pairs in the chromosome. (Note that if you use the whole genome, then the last positions will be the last positions in the last supercontig, not the whole genome.)

                      2. 1000 Genomes has some .vcf files in their /release folder (description of contents here: http://www.1000genomes.org/faq/what-bas-file). As for other methods to generate a .vcf file, there might be some, but I personally don't know of any.

                      I'm not sure what you mean by the 1000 Genomes FTP site only containing .bam files. You can see the link above for all the file types they have. But, sorry, I'm not sure what you're asking, maybe you can clarify?

                      I'd also like to mention that I found out that when you give mpileup a whole genome .bam file from the 1000 Genomes site, and generate a .vcf from it, then the first number of each line in the .vcf is the number of the chromosome (1...22, X, Y or the ID of a supercontig). Therefore, it is possible to write a simple script to split the .vcf into several .vcf-s, one for each chromosome, and by mapping the changes to individual chromosome fasta files, get the sequences of each chromosome separately.

                      Comment


                      • #12
                        I mean, do the VCF files in the release folder contain all SNPs for a given sample (ex: all SNPs for HG00254)?

                        Say I want all the SNPs of a consensus chromosome 3 for HG00254, my understanding was that the easiest method it to get the BAM, create mpileup, create fasta from that

                        Comment


                        • #13
                          I downloaded one of the .vcf files to see, and, as far as I can tell, they don't contain the SNPs of any samples, just the collective SNPs of all the samples, with the population frequencies of each SNP. Something like this for each SNP:
                          pos: 123456, ref: A, alt: T, african_freq: 0.12, american_freq: 0.01, european_freq: 0.5, asian_freq: 0.14

                          I think you can only get all SNPs and indels of a given sample by doing what you said, i.e. getting the .bam, running mpileup on it, and generating a .fasta from the .vcf.

                          Comment


                          • #14
                            Thanks a lot! Will do it this way.

                            The other thing I don't get is how they sometimes indicate in the 1000 genomes browser a certain genotype, yet I can't find it in the 1000 genomes data.

                            An example is rs1805007 for sample HG00108. It says T|C in the genome browser (see url below), yet I can't find any instances of a T in the 3 tracks provided by 1000 genomes.

                            URL in 1000 genomes browser:

                            Comment


                            • #15
                              Sorry, I don't have much experience with the genome browser. I see C in two of the three HG00108 tracks shown in the genome browser (the exome one is empty). There reference genome shows C|G. Where are you seeing the T|C? (I think linking to the genome browser doesn't work, by the way, so it would be better if you described it.)

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM
                              • seqadmin
                                The Impact of AI in Genomic Medicine
                                by seqadmin



                                Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                                02-26-2024, 02:07 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-14-2024, 06:13 AM
                              0 responses
                              34 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-08-2024, 08:03 AM
                              0 responses
                              72 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-07-2024, 08:13 AM
                              0 responses
                              81 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-06-2024, 09:51 AM
                              0 responses
                              68 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X