Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Getting sequence for a gene from a BAM file

    I've done RNA-seq of a number of samples, I've aligned them with Tophat2 and I've done some interesting analyses with DESeq2, but now I would like to look at the actual sequences. As far as I can understand, that should be possible, using the accepted_hits.bam files I have after the alignment.

    What I want to do is basically look at the sequences for each bam-file I have, for an arbitrary gene, and see how that gene's sequence differs in my samples from a reference and its sequence for that gene (I'm using hg19 from iGenomes).

    Exactly does one go about doing this? I've looked around, and using SAMTOOLS FAIDX seems to be a part of it, though I don't understand how I get my actual bam-files into that workflow...

  • #2
    Do you just want to visually look, or do you want to extract all reads overlapping a gene or do you just want to call variants from your RNAseq data?

    The solutions, in order, would be IGV, samtools view (use a BED file file the exon coordinates), and GATK or one of the other common variant callers.

    Comment


    • #3
      promo alert ...

      This thing is pretty good and fast for viewing bam files ...
      https://github.com/NCIP/alview Native GUI for view BAM files.

      Web Demo : https://cgwb.nci.nih.gov/cgi-bin/alview

      Comment


      • #4
        @Richard: Looks cool. Can you look at multiple BAM files at once?

        Comment


        • #5
          Coming soon !

          The command line generate beaucoup images and place in slide show and just flip through real quick is the current solution; but I'm getting enough requests that I'll put that in. It's designed to accommodate multiple images; I just didn't test it so I disabled it for now.

          But I'll fix that in next few days.

          stay tuned.

          Comment


          • #6
            Originally posted by dpryan View Post
            Do you just want to visually look, or do you want to extract all reads overlapping a gene or do you just want to call variants from your RNAseq data?

            The solutions, in order, would be IGV, samtools view (use a BED file file the exon coordinates), and GATK or one of the other common variant callers.
            Still new to bioinformatics, so I'm not sure exactly which of those applies... I'm in essence looking for a way to check if there are mutations in a single gene I'm interested in in my different samples. This was just an idea I had for further analysis of my samples, i.e. divide them into different mutations in this gene, and do DE-analysis on them as per the resulting grouping.

            Comment


            • #7
              The first or third solution. Visually looking at things is probably the simplest if you just want a quick answer and you only have one or two shortish genes of interest. Variant calling (i.e., finding mutations) itself would be more appropriate for the whole genome (you'll likely get better results that way rather than just calling variants in a couple genes).

              Comment


              • #8
                Ah, thanks!

                I've done tried to learned the IGV program, and I have successfully gotten to the point where I can see the reference (hg19, from their server) and my own reads (from my bam-files + indexed bam files, .bai) on top of them. Now I'm having some trouble actually checking the sequence, in that they seem to be... wrong.

                I'm interested in the known cancer somatic mutation in KRAS: http://cancer.sanger.ac.uk/cosmic/mu...verview?id=532. As far as COSMIC tells me, it's supposed to be situated at chr12:25398276-25398286, and it's supposed to be a Glycine. Looking at the reference, it says that the amino acid is a Glycine, but looking at the actual codon it reads as GCC. And, unless I'm super stupid or have totally misread my codon chart, is supposed to be a Alanine. Or is the "Sequence" row in the reference the consensus sequence for my actual sample? It does say "reference sequence" when I hover over it...
                Attached Files
                Last edited by ErikFas; 09-04-2014, 03:40 AM.

                Comment


                • #9
                  Don't forget to consider the strand

                  Comment


                  • #10
                    Oh, yeah... I've worked too much with proteins Just to be clear, the sequence GCC is reverse complementary to the original sequence, i.e. GGC, which is Glycine? What is the reasoning for the software to correctly write out Glycine for the amino acid, but GCC for the sequence? I'm assuming there's a good reason for it?

                    Comment


                    • #11
                      Sequence is always shown 5'->3' and typically only for the + strand (unless you're looking at a transcript sequence by itself, in which case it'll always be the strand you're interested in). So it's showing the + strand sequence and knows from the annotation file that there's a protein coding gene on the opposite strand and only reverses that for you (otherwise, it'd have to swap which way it orients things as you scroll...and that'd get confusing).

                      Comment


                      • #12
                        This has been a long overdue read-up on strands, apparently. So, if I understand you and my old coursebook correctly...
                        1. DNA template strand (coding, -) is transcribed into mRNA sense strand (+)
                        2. mRNA (+) is reverse transcribed into cDNA (both +/-), which is then sequenced
                        3. IGV is showing the cDNA sense (+) strand as the sequence (GCC), which is non-coding, but showing the complementary (-) strand as the amino acid (Glycine=GGC), which is coding

                        ... or? Sorry for being so slow on the uptake, I really do appreciate the help!

                        Comment


                        • #13
                          Originally posted by ErikFas View Post
                          This has been a long overdue read-up on strands, apparently. So, if I understand you and my old coursebook correctly...
                          1. DNA template strand (coding, -) is transcribed into mRNA sense strand (+)
                          2. mRNA (+) is reverse transcribed into cDNA (both +/-), which is then sequenced
                          3. IGV is showing the cDNA sense (+) strand as the sequence (GCC), which is non-coding, but showing the complementary (-) strand as the amino acid (Glycine=GGC), which is coding

                          ... or? Sorry for being so slow on the uptake, I really do appreciate the help!
                          When a DNA sequence is stored, we only write down the + strand, so that's what'll normally be shown...regardless of the strand you're actually interested in. If a gene is on the - strand and a program knows that and is showing you the resulting protein sequence, it'll be smart enough to reverse complement things (yes, this is due to the sense strand of the mRNA matching the - strand of the DNA...again, because the gene is on the - strand) before translating into amino acids.

                          In short, DNA sequence = + strand, protein sequence = whatever is appropriate given the annotation. So yeah, you might have to count from the right and reverse complement things if, for example, you want to look for the 3rd base in codon.

                          Comment


                          • #14
                            Alright, thanks a lot! That does clear that up. However...

                            I just checked another mutation I'm interested in, the PIK3CA (H1047R) mutation (http://cancer.sanger.ac.uk/cosmic/sa...iew?id=1998442). Without changing anything in the view, I go over to that location, and look at the sequence and the mutation, which is supposed to be a H to an R. The sequence reads CAT, which IS actually an Histidine, without having to do any reading from the right and complementary! The same goes for the mutation, CGT, into an Arginine, also without any non-default reading.

                            Why is it that some of the sequence seems to need a non-standard reading of the sequence to be right, while others do not? Again, the only thing I changed was the actual zoom in IGV (i.e. the chromosal coordinates), nothing else.

                            Comment


                            • #15
                              PIK3CA is on the plus strand. You'd have to do things differently for genes on the - strand (like KRAS).

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Essential Discoveries and Tools in Epitranscriptomics
                                by seqadmin




                                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                04-22-2024, 07:01 AM
                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Today, 11:49 AM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 08:47 AM
                              0 responses
                              16 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              61 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              60 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X