
  • What is the reference sequence? (Can I find it in a .bai index?)

    I'm working with the new 1000 Genomes (pilot2) release, which has .bai files associated with the new .bam files. I've used 'samtools pileup' to create the pileup files, but the reference column is populated with 'N'. I am trying to determine the reference base (without simply looking at the NCBI human reference and comparing positions), but can't work out how. Is there a way to a) use the .bai file when making the pileup so that the reference base column is filled in correctly, or b) extract the reference directly from the .bai file?

    Sorry if I've missed something obvious, but I can't find anything in the samtools documentation that answers my question.

    Thanks,
    Jonathan

  • #2
    Going to try setting the -f flag when running samtools pileup.



    • #3
      Hi,

      have you managed to solve this problem at all? I am seeing exactly the same and can't quite figure out where I've gone wrong.

      Thanks,

      Jacky



      • #4
        I ended up downloading the NCBI human reference build 36 in FASTA format (split by chromosome), and then using the -f flag when creating the pileup. The reference column seems to be correct after doing this.

        samtools pileup -f ref.fasta alignment.bam > alignment.pu

        Hope that helps!
        Jonathan



        • #5
          Hmm, that's curious. I have tried that with RefSeq as a reference but I still don't see the reference base. Maybe it's an issue with the format of the FASTA header (it contains a colon in my case).

          Thanks for your fast reply anyhow!
          Jacky
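The header theory above can be checked before rerunning the pileup. As far as I understand it, samtools looks up each reference sequence by name, so a BAM whose header says `X` will not match a FASTA record named `chrX` (or one with a longer, colon-containing ID). The following is a minimal sketch, not a samtools feature: the function names are made up for illustration, the FASTA name is assumed to be the first whitespace-delimited token after `>`, and the header lines are assumed to come from `samtools view -H alignment.bam`.

```python
def fasta_names(fasta_lines):
    """Collect sequence names: the first whitespace-delimited token after '>'."""
    names = []
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            names.append(tokens[0] if tokens else "")
    return names


def bam_header_names(header_lines):
    """Collect the SN: values from @SQ lines of a SAM/BAM text header."""
    names = []
    for line in header_lines:
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names


def missing_from_fasta(fasta_lines, header_lines):
    """Return BAM reference names that have no same-named FASTA record."""
    available = set(fasta_names(fasta_lines))
    return [name for name in bam_header_names(header_lines) if name not in available]


# Toy data for illustration only (not taken from any real file).
example_fasta = [">chrX toy description\n", "ACGT\n"]
example_header = ["@HD\tVN:1.0\n", "@SQ\tSN:X\tLN:4\n"]
print(missing_from_fasta(example_fasta, example_header))  # ['X']
```

If the returned list is non-empty, renaming the FASTA records to match the BAM header (or using the reference the alignments were actually made against) would be the first thing to try.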



          • #6
            Still can't get correct reference sequence column

            Hi,

            I still can't get a pileup file where the reference sequence column shows bases instead of "N"s. I'd like to create pileup files of sequences from the 1000 Genomes Project aligned with NCBI human reference sequences. I am using the -f flag (to supply the reference sequence in FASTA format) and also need the -c flag (so that the pileup file includes the consensus sequence for the original .bam file). From the main samtools-0.1.7a folder on a GNU/Linux computer, I've typed many variants of the following:

            ./samtools pileup -cf /ifs/scratch/.../humanReferenceGenome/UCSCBuild36/chrX.fa /ifs/scratch/.../fatherAlignment/NA12891.chromX.ILLUMINA.bwa.CEU.high_coverage.20100517.bam > NA12891.ChrX.UCSC36.pileup

            The resulting pileup file contains the NA12891 consensus sequence. I've tried using a number of reference sequences, including builds 36.1, 36.2, 36.3 and 37 of the NCBI reference genome and build 36 of the UCSC reference genome, in the hope that one of these reference sequences would also appear in the pileup file. I would very much appreciate any suggestions.

            Thanks,
            Rebecca
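When comparing many reference builds like this, it can help to measure how much of the reference column is actually `N` in each output rather than eyeballing the file. Below is a small sketch under two assumptions: the pileup is tab-separated and the reference base is the third column. The function name is hypothetical.

```python
from collections import Counter


def reference_base_counts(pileup_lines):
    """Tally the values seen in the third (reference base) column of a pileup."""
    counts = Counter()
    for line in pileup_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        counts[fields[2].upper()] += 1
    return counts


# Toy lines for illustration only (not real 1000 Genomes output).
example = [
    "X\t100\tN\tA\t30\t0\t60\t3\t...\tIII\n",
    "X\t101\tN\tC\t30\t0\t60\t3\t...\tIII\n",
    "X\t102\tg\tG\t30\t0\t60\t3\t...\tIII\n",
]
counts = reference_base_counts(example)
print(counts["N"], sum(counts.values()))  # 2 3
```

A file where every row counts as `N` suggests the reference was never matched at all, which points back at the sequence names rather than at the choice of build.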
