  • Computer Scientist diving into bioinformatics... where to start?

    I'm a professor of Computer Science looking to learn the basics of sequencing technology. In particular, I'm interested in understanding the various file formats.

    The C.S. stuff is straightforward for me, but the bio stuff is rather... challenging: I have a high-school level of understanding of biology, which only gets me so far. For example, I know what a nucleotide is, I understand the basic idea of how codons produce proteins, and I know roughly how DNA differs from RNA. But this newer terminology around sequencing technology is hard to learn because there doesn't seem to be a good set of references. For example, I searched for a while trying to learn what a CIGAR string is, only to get tons of smoking and Freud references. The lexicon you folks use is unfortunately borrowed from mainstream English and therefore hard to google ("lane", "read", "run").

    My question: is there a good reference, free or not, that quickly takes one through the lexicon required to absorb file-format descriptions? I have already read Larry Hunter's excellent "The Processes of Life" but it doesn't spend much time specifically on sequencing.

  • #2
    Check out Kimball's biology for a good tutorial on the science.



    Youtube is also great, it's fun to have a visualization of the problem at hand.

    The basic situation is this: we now have reference genomes for many organisms, including mouse and human. These were brand new as of about 10 years ago. Since then, scientists have hunkered down over them trying to make sense of them. We speak of the RNA and DNA and their products in the cell in terms of various "omics" : genome, transcriptome, exome, proteome, etc. It was once hard to get measurements of these, but now it's easy and the data IS HUGE. It's a tsunami.

    The tsunami does wind up in big files: fasta, fastq, sam, bigwig, bed, csv, xls, magetab, etc. I was just chasing down bigbed and the spec. is buried in the supplementary section of the scholarly publication. Most folks are pretty happy to find a command line tool or friendly library that handles a particular file type and only hunt down the spec when they really need to crack it open. Google is your friend, as is seqanswers.

    I'd recommend looking at fasta, which you can figure out by staring at it and SAM/BAM which is much more complicated but is the place where alignment data winds up these days. The sourceforge page for samtools has the spec.
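To make the "figure it out by staring at it" point concrete, here is a minimal sketch of a FASTA reader in Python. It assumes the common layout: a header line starting with ">" followed by one or more sequence lines. The function name `parse_fasta` and the example records are made up for illustration; real projects would normally use an existing library instead.

```python
from typing import Dict, Iterable, List, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record name (first word after '>') to its full sequence."""
    chunks: Dict[str, List[str]] = {}
    name: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: the name is the first whitespace-delimited word.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks[name] = []
        elif name is not None:
            # Sequence line: a record may be wrapped over several lines.
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


# Made-up example data, not taken from any real genome.
example = """>seq1 a made-up record
ACGTACGT
TTGA
>seq2
GGCC
"""

records = parse_fasta(example.splitlines())
for record_name, sequence in records.items():
    print(record_name, len(sequence), sequence)
```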



    • #3
      Originally posted by Richard Finney
      Check out Kimball's biology for a good tutorial on the science.

      Thank you, Richard, for pointing me at this resource. It might prove useful, but as an experiment I typed "cigar" into its search appliance. No results found.

      Youtube is also great, it's fun to have a visualization of the problem at hand.
      Agreed, it is great. Finding the right stuff to watch, however, seems daunting for a newcomer.

      Most folks are pretty happy to find a command line tool or friendly library that handles a particular file type and only hunt down the spec when they really need to crack it open. Google is your friend, as is seqanswers.
      I agree Google (and this new-found forum) are wonderful resources. However, I can't find a definition for "cigar" in either place. For Google, the term is overwhelmed by mainstream usage. For seqanswers, the term is so familiar and implicitly understood in this community that no one bothers to define it.

      Wikipedia is often the best friend of folks in my situation; however it too is ignorant here (http://en.wikipedia.org/wiki/Cigar_%28disambiguation%29 .)

      I'd recommend looking at fasta, which you can figure out by staring at it and SAM/BAM which is much more complicated but is the place where alignment data winds up these days. The sourceforge page for samtools has the spec.
      The SAM/BAM spec begins by discussing things like "clipped alignments", "chimeric reads", "split alignments", and "flow orders". There are no definitions for these terms.

      Please don't misconstrue my posts as complaining or whining. I know that ongoing research is often not accompanied by tutorial-level resources. I am just hoping that these terms are defined somewhere, and that you folks might know where.

      Cheers.



      • #4
        Originally posted by Fixee
        My question: is there a good reference, free or not, that quickly takes one through the lexicon required to absorb file-format descriptions?
        Hi Fixee, I don't think this is the answer you are looking for, but the answer is: No. There simply is no such reference. Even when you ask on SEQanswers or elsewhere, the best answer you often get is "you have to take a look at the source code", or look at the spec, if there's one and if it's in sync with the source, which it ain't necessarily. Or contact the developers directly.

        The reason is that these file formats, and the way they are used, change quickly with the technology, and people make up tools and pipelines in quite an ad hoc manner. Now most work with sam/bam because there are tools that support it and it somehow turned out to be the consensus. As soon as a different technology (stroboscope sequencing or something entirely new) comes along, file formats and tools will change again. Also, many sequencing centers have in-house file formats which are only needed for in-house tools which address in-house problems, and many questions revolve around how to convert in and out of the in-house formats into sam/bam and back, and so on.

        It's absolutely not like in c.s., where you carefully draw up specifications for file formats covering all possible cases, being future-proof etc. Quite the opposite.

        My recommendation for you would be, as you are a professor: find a sequencing center / genomic research center with which you can collaborate on a specific, concrete project. This way you will learn what problems the researchers in the trenches face, and you can identify actual needs which you can address in your own research.

        Trying to "understand file formats" will be pretty difficult and boring without having a specific question you want to answer about the data contained in the files. I always try to understand only what I need to solve the current problem, then move on to the next one. File formats are just syntax and technicalities anyway; unfortunately we have to deal with them all the time for practical reasons, but from a scientific viewpoint nothing could be less interesting.

        Regarding CIGAR, just enter

        cigar string

        in Google; that improves the search results tremendously. The first hit, for example, is a PDF of the SAM spec.



        • #5
          Hi Fixee,

          I am in a similar situation to yours - I have a computer science background and have been working on analyzing next generation sequencing data. From a CS point of view, the file formats themselves can be quite interesting (yeah, cs people are weird like that). Sorted BAM files can be indexed with .BAI files to allow you to quickly find all intervals that overlap a query range. The index uses a binning scheme that is also used for the UCSC genome browser. Other approaches to quick range queries include Nested Containment Lists (NCL) and R-trees. Another interesting thing about the BAM format is that it uses BGZF compression, which supports random access seeking and adheres to the GZIP specs. You might find the Tabix tool interesting - it takes some of the ideas from the BAM spec and applies them to generic tab-delimited files of genomic ranges.
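As a rough illustration of the binning scheme mentioned above, here is a Python transcription of the `reg2bin` routine that the SAM specification gives in C. It assumes 0-based, half-open coordinates `[beg, end)` and the standard six-level hierarchy (bin sizes from 512 Mbp down to 16 kbp). Treat it as a sketch for building intuition, not as a replacement for the code in samtools.

```python
def reg2bin(beg: int, end: int) -> int:
    """Return the smallest bin that fully contains the interval [beg, end).

    Levels, from finest to coarsest, cover 16 kbp, 128 kbp, 1 Mbp, 8 Mbp,
    64 Mbp and 512 Mbp. Each level's first bin number is the count of bins
    in all coarser levels: 4681, 585, 73, 9, 1 and 0 respectively.
    """
    end -= 1  # switch to the last base actually covered
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)  # 16 kbp bins, offset 4681
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)  # 128 kbp bins, offset 585
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)   # 1 Mbp bins, offset 73
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)   # 8 Mbp bins, offset 9
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)   # 64 Mbp bins, offset 1
    return 0  # the single 512 Mbp bin covering everything


# A short read inside one 16 kbp window lands in a leaf bin; an interval
# straddling a window boundary is pushed up to a coarser bin.
print(reg2bin(100, 200))
print(reg2bin(16000, 17000))
```

The design point is that an alignment is stored under exactly one bin, so a range query only needs to inspect the handful of bins (at most a few per level) that can overlap the query interval.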

          As for the specific question of what is the CIGAR, I think the SAM/BAM spec has some good information about it. After lots of searching, I found that it stands for "Compact Idiosyncratic Gapped Alignment Report". Maybe doing a google search of "cigar match insertion" would help. Basically, the CIGAR describes how to align one sequence to another and includes things like matches, insertions, and deletions. And as you delve deeper into learning about the CIGAR string, the acronym "MIDNSHP" may become part of your vocabulary as well.
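To show what the letters mean in practice, here is a minimal sketch of a CIGAR parser in Python. The operation descriptions paraphrase the SAM spec; "=" and "X" come from later revisions of the spec than the "MIDNSHP" set named above. The function names and the example string are made up for illustration.

```python
import re
from typing import List, Tuple

# One-line meaning of each CIGAR operation, paraphrased from the SAM spec.
CIGAR_OPS = {
    "M": "alignment match (the bases may match or mismatch)",
    "I": "insertion relative to the reference",
    "D": "deletion from the reference",
    "N": "skipped region of the reference (e.g. an intron)",
    "S": "soft clipping (clipped bases are still stored in the read)",
    "H": "hard clipping (clipped bases are not stored in the read)",
    "P": "padding",
    "=": "sequence match",
    "X": "sequence mismatch",
}

_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string such as '3M1I3M' into (length, operation) pairs."""
    if cigar == "*":  # '*' means no CIGAR is available
        return []
    tokens = _TOKEN.findall(cigar)
    # Re-join the tokens to make sure nothing in the input was skipped.
    if "".join(length + op for length, op in tokens) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in tokens]


def reference_length(cigar: str) -> int:
    """Number of reference bases the alignment spans."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def query_length(cigar: str) -> int:
    """Number of read bases described (everything except hard clips)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


example = "2S3M1I3M1D5M"  # made-up alignment
for n, op in parse_cigar(example):
    print(f"{n:>3} {op}  {CIGAR_OPS[op]}")
print("reference bases spanned:", reference_length(example))
print("read bases described:", query_length(example))
```

In this reading, a "clipped alignment" is simply one whose CIGAR starts or ends with S or H: part of the read was left out of the alignment.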

          Hope that helps. I can point you to some paper references, as the binning scheme and BGZF compression are somewhat hidden in other papers and references.

          BAMseek



          • #6
            Originally posted by Azazel
            My recommendation for you would be, as you are a professor: find a sequencing center / genomic research center with which you can collaborate on a specific, concrete project.
            Hi Azazel. Thanks for the response and the info on the vagaries of file formats these days.

            I indeed already have a project I'm engaged with. We've settled on SAM/BAM, so I'm trying to drill down and understand the bio meanings of the various fields (I understand the CS stuff just fine).



            • #7
              Originally posted by BAMseek
              I am in a similar situation as you - I have a computer science background and have been working on analyzing next generation sequencing data.
              All right! Now we're talking... where did you get the bio background to get up to speed on this stuff?

              From a CS point of view, the file formats themselves can be quite interesting (yeah, cs people are weird like that).
              Wait, I'm confused... how is that weird?

              Hope that helps. I can point you to some paper references, as the binning scheme and BGZF compression are somewhat hidden in other papers and references.
              Thanks BAMseek. However, what I'm REALLY interested in is learning a little bit of the bio stuff so I understand what is stored in these BAM files and how the data are used. What's a "clipped alignment?" What is "deep resequencing?" What's a "lane?" These things are highly google-resistant.

              Cheers.



              • #8
                However, what I'm REALLY interested in is learning a little bit of the bio stuff so I understand what is stored in these BAM files and how the data are used. What's a "clipped alignment?" What is "deep resequencing?" What's a "lane?" These things are highly google-resistant.
                Yeah, that is a tough one - I am not aware of one source that would cover all of this stuff. SeqAnswers has been a great resource in helping me piece together this information. Maybe others can point to some good summary posts, such as this.

                For a broad overview, a google search of "next generation sequencing" might be helpful. To learn about lanes, you may want to look up information about the Illumina flow cell. There are 8 lanes on a flow cell, and one sample can be sequenced per lane (more if you barcode/multiplex the samples). If you are working with ABI SOLiD technology, then an understanding of color space (as compared to base space) would be essential. Also, knowing what single-end reads and paired-end reads are will be helpful in understanding the reads in a BAM file. Deep resequencing is one of the applications of next gen sequencing, along with ChIP-Seq, RNA-Seq, ... Reading the initial papers on these applications (Wold for ChIP-Seq and Mortazavi for RNA-seq) helped me a lot. I'll let you know if I can think of anything more specific.
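On the single-end versus paired-end point: in a SAM/BAM record this shows up in the FLAG field, an integer whose bits each answer a yes/no question about the read. The sketch below decodes it in Python. The bit values are the ones listed in the SAM spec; the short names and the function name `decode_flag` are my own labels, chosen for illustration.

```python
from typing import List

# Bit values from the SAM spec; the short names are informal labels.
FLAG_BITS = {
    0x1: "paired",            # read comes from paired-end sequencing
    0x2: "proper_pair",       # both mates aligned as the aligner expected
    0x4: "unmapped",          # this read did not align
    0x8: "mate_unmapped",     # the other read of the pair did not align
    0x10: "reverse",          # read aligned to the reverse strand
    0x20: "mate_reverse",     # mate aligned to the reverse strand
    0x40: "first_in_pair",    # this is the first read of the pair
    0x80: "second_in_pair",   # this is the second read of the pair
    0x100: "secondary",       # secondary alignment
    0x200: "qc_fail",         # failed platform/vendor quality checks
    0x400: "duplicate",       # PCR or optical duplicate
    0x800: "supplementary",   # supplementary (e.g. chimeric) alignment
}


def decode_flag(flag: int) -> List[str]:
    """List the names of all bits set in a SAM FLAG value, lowest bit first."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# 99 and 147 are a commonly seen combination for the two mates of a pair.
for value in (0, 99, 147):
    print(value, decode_flag(value))
```

A single-end read never has the "paired" bit set, so a FLAG of 0 or 16 (forward or reverse strand) is typical for single-end data.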



                • #9
                  All right, well thanks to everyone who pitched in here. I'm going to dig into the resources mentioned, read some more posts here at seqanswers, and perhaps bug you all when I get stuck. Many thanks!



                  • #10
                    If you're not already familiar...qualifying a google search with "site:seqanswers.com" will likely make the results a lot more relevant....



                    • #11
                      Originally posted by ECO
                      If you're not already familiar...qualifying a google search with "site:seqanswers.com" will likely make the results a lot more relevant....
                      Cheers... I've already collected some reading material with "site:seqanswers.com" and "filetype: pdf" (another useful google feature).

                      [You must remove the space after the "filetype:" above. I added a space because without it, this vBulletin site rewrites the colon-p as an emoticon and HTML encoding the colon with &#59; wasn't working.]
                      Last edited by Fixee; 06-23-2011, 01:06 PM.

