**Richard Finney** · 06-22-2011, 07:47 PM

Check out Kimball's biology for a good tutorial on the science.

http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/

Youtube is also great, it's fun to have a visualization of the problem at hand.

The basic situation is this: we now have reference genomes for many organisms, including mouse and human. These were brand new as of about 10 years ago. Since then, scientists have hunkered down over them trying to make sense of them. We speak of the RNA and DNA and their products in the cell in terms of various "omics": genome, transcriptome, exome, proteome, etc. It was once hard to get measurements of these, but now it's easy and the data IS HUGE. It's a tsunami.

The tsunami does wind up in big files: fasta, fastq, sam, bigwig, bed, csv, xls, magetab, etc. I was just chasing down bigbed, and the spec is buried in the supplementary section of the scholarly publication. Most folks are pretty happy to find a command line tool or friendly library that handles a particular file type and only hunt down the spec when they really need to crack it open. Google is your friend, as is seqanswers.

I'd recommend looking at fasta, which you can figure out by staring at it, and SAM/BAM, which is much more complicated but is the place where alignment data winds up these days. The sourceforge page for samtools has the spec.
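To illustrate how little there is to FASTA, here is a minimal Python sketch of a reader. It assumes the common convention of a `>` header line followed by one or more sequence lines. The function name `parse_fasta` and the two records are made up for illustration; real files have more corner cases (comments, lowercase masking, very long sequences) that this ignores.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {record_id: sequence}.

    The record id is taken to be the first whitespace-delimited token
    after '>' (an assumption; tools differ on how they treat the rest
    of the header line).
    """
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines of one record are simply concatenated.
            records[current_id] += line
    return records


# Hypothetical two-record example.
example = """>chr_demo a made-up record
ACGTACGT
TTGA
>read_2
GGCC
"""
records = parse_fasta(example.splitlines())
print(records)
```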

**Fixee** · 06-22-2011, 09:02 PM

Originally posted by Richard Finney

Check out Kimball's biology for a good tutorial on the science.

http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/

Thank you, Richard, for pointing me at this resource. It might prove useful, but as an experiment I typed "cigar" into its search appliance. No results found.

Youtube is also great, it's fun to have a visualization of the problem at hand.

Agreed, it is great. Finding the right stuff to watch, however, seems daunting for a newcomer.

Most folks are pretty happy to find a command line tool or friendly library that handles a particular file type and only hunt down the spec when they really need to crack it open. Google is your friend, as is seqanswers.

I agree Google (and this new-found forum) are wonderful resources. However, I can't find a definition for "cigar" in either place. For Google, the term is overwhelmed by mainstream usage. For seqanswers, the term is so familiar and implicitly understood in this community that no one bothers to define it.

Wikipedia is often the best friend of folks in my situation; however, it too is ignorant here (http://en.wikipedia.org/wiki/Cigar_%28disambiguation%29).

I'd recommend looking at fasta, which you can figure out by staring at it, and SAM/BAM, which is much more complicated but is the place where alignment data winds up these days. The sourceforge page for samtools has the spec.

The SAM/BAM spec begins by discussing things like "clipped alignments", "chimeric reads", "split alignments", and "flow orders". There are no definitions for these terms.

Please don't misconstrue my posts as complaining or whining. I know that ongoing research is often not accompanied by tutorial-level resources. I am just hoping that these terms are defined somewhere, and that you folks might know where.

Cheers.

**Azazel** · 06-22-2011, 09:41 PM

Originally posted by Fixee

My question: is there a good reference, free or not, that quickly takes one through the lexicon required to absorb file-format descriptions?

Hi Fixee, I don't think this is the answer you are looking for, but the answer is: No. There simply is no such reference. Even when you ask on SEQanswers or elsewhere, the best answer you often get is "you have to take a look at the source code" or look at the spec, if there's one and if it's in sync with the source, which it ain't necessarily. Or contact the developers directly.

The reason is that these file formats and the way they are used change quickly with the technology, and people make up tools and pipelines in quite an ad hoc manner. Now most work with sam/bam because there are tools that support it and it somehow turned out to be the consensus. As soon as a different technology (stroboscope sequencing or something entirely new) comes along, file formats and tools will change again. Also, many sequencing centers have in-house file formats which are only needed for in-house tools which address in-house problems, and many questions revolve around how to convert in and out of the in-house formats into sam/bam and back and so on.

It's absolutely not like in CS, where you carefully draw up specifications for file formats covering all possible cases, being future-proof etc. Quite the opposite.

My recommendation for you would be, as you are a professor: find a sequencing center / genomic research center with which you can collaborate on a specific, concrete project. This way you will learn what the problems are the researchers in the trenches face, and you can identify actual needs which you can address in your own research.

Trying to "understand file formats" will be pretty difficult and boring without having a specific question you want to answer about the data contained in the files. I always try to understand only what I need to solve the current problem, then move on to the next one. File formats are just syntax and technicalities anyway; unfortunately we have to deal with them all the time for practical reasons, but from a scientific viewpoint nothing could be less interesting.

2.) Regarding CIGAR, just enter "cigar string" in Google; that improves the search results tremendously. The first hit, for example, is a PDF of the SAM spec.

**BAMseek** · 06-23-2011, 12:53 AM

Hi Fixee,

I am in a similar situation as you - I have a computer science background and have been working on analyzing next generation sequencing data. From a CS point of view, the file formats themselves can be quite interesting (yeah, cs people are weird like that). Sorted BAM files can be indexed with .BAI files to allow you to quickly find all intervals that overlap a query range. The index uses a binning scheme that is also used for the UCSC genome browser. Other approaches to do quick range queries include Nested Containment Lists (NCL) and R-trees. Another interesting thing about the BAM format is that it uses BGZF compression, which supports random access seeking and adheres to the GZIP specs. You might find the Tabix tool interesting - it takes some of the ideas from the BAM spec and applies them to generic tab-delimited files of genomic ranges.
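To make the binning idea concrete, here is a rough Python sketch of the scheme as the SAM spec describes it: a hierarchy of bins of size 512 Mbp, 64 Mbp, 8 Mbp, 1 Mbp, 128 kbp and 16 kbp, where each alignment is stored in the smallest bin that fully contains it. The spec gives this as C; the transliteration below and the helper names are mine, and coordinates are assumed to be zero-based, half-open, and below 2^29.

```python
from typing import List


def reg2bin(beg: int, end: int) -> int:
    """Return the smallest bin that fully contains [beg, end)."""
    end -= 1  # work with the last covered position
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)  # 16 kbp bins start at 4681
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)  # 128 kbp bins start at 585
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)   # 1 Mbp bins start at 73
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)   # 8 Mbp bins start at 9
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)   # 64 Mbp bins start at 1
    return 0                                       # the single 512 Mbp bin


def reg2bins(beg: int, end: int) -> List[int]:
    """Return every bin that may hold alignments overlapping [beg, end)."""
    end -= 1
    bins = [0]
    for offset, shift in ((1, 26), (9, 23), (73, 20), (585, 17), (4681, 14)):
        bins.extend(range(offset + (beg >> shift), offset + (end >> shift) + 1))
    return bins


print(reg2bin(0, 16384))   # fits in the first 16 kbp bin
print(reg2bin(0, 16385))   # spills over, so it moves up one level
print(reg2bins(0, 1))      # one candidate bin per level
```

A range query then only has to look at the handful of bins returned by `reg2bins` rather than scanning the whole file, which is why the index stays small and lookups stay fast.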

As for the specific question of what is the CIGAR, I think the SAM/BAM spec has some good information about it. After lots of searching, I found that it stands for "Compact Idiosyncratic Gapped Alignment Report". Maybe doing a google search of "cigar match insertion" would help. Basically, the CIGAR describes how to align one sequence to another and includes things like matches, insertions, and deletions. And as you delve deeper into learning about the CIGAR string, the acronym "MIDNSHP" may become part of your vocabulary as well.
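As a small illustration of what the CIGAR encodes, here is a Python sketch that splits a CIGAR string into (length, operation) pairs and works out how many read bases and how many reference bases it covers. It assumes the operation letters from the SAM spec (M, I, D, N, S, H, P, plus the later `=` and `X`); the function names and the example string are made up. It also shows what a "clipped alignment" looks like: the leading `3S` marks three read bases that are present in the read but left out of the alignment (soft clipping), whereas `H` (hard clipping) would mean the bases were dropped from the stored sequence altogether.

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CIGAR_FULL = re.compile(r"(\d+[MIDNSHP=X])+")

# Which operations advance along the read and along the reference.
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string such as '3S5M2I4M1D6M' into (length, op) pairs."""
    if not CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def query_length(cigar: str) -> int:
    """Number of read bases the CIGAR accounts for (soft clips included)."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


# Hypothetical example: 3 soft-clipped bases, then matches with one
# 2-base insertion and one 1-base deletion relative to the reference.
cigar = "3S5M2I4M1D6M"
print(parse_cigar(cigar))
print(query_length(cigar), reference_span(cigar))
```

Note that M means "aligned position", which can be either a match or a mismatch; only the `=` and `X` operations distinguish the two.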

Hope that helps. I can point you to some paper references, as the binning scheme and BGZF compression are somewhat hidden in other papers and references.

BAMseek

**Fixee** · 06-23-2011, 09:55 AM

Originally posted by Azazel

My recommendation for you would be, as you are a professor: find a sequencing center / genomic research center with which you can collaborate on a specific, concrete project.

Hi Azazel. Thanks for the response and the info on the vagaries of file formats these days.

I indeed already have a project I'm engaged with. We've settled on SAM/BAM, so I'm trying to drill down and understand the bio meanings of the various fields (I understand the CS stuff just fine).

**Fixee** · 06-23-2011, 10:01 AM

Originally posted by BAMseek

I am in a similar situation as you - I have a computer science background and have been working on analyzing next generation sequencing data.

All right! Now we're talking... where did you get the bio background to get up to speed on this stuff?

From a CS point of view, the file formats themselves can be quite interesting (yeah, cs people are weird like that).

Wait, I'm confused... how is that weird?

Hope that helps. I can point you to some paper references, as the binning scheme and BGZF compression are somewhat hidden in other papers and references.

Thanks BAMseek. However, what I'm REALLY interested in is learning a little bit of the bio stuff so I understand what is stored in these BAM files and how the data are used. What's a "clipped alignment?" What is "deep resequencing?" What's a "lane?" These things are highly google-resistant.

Cheers.

**BAMseek** · 06-23-2011, 10:55 AM

However, what I'm REALLY interested in is learning a little bit of the bio stuff so I understand what is stored in these BAM files and how the data are used. What's a "clipped alignment?" What is "deep resequencing?" What's a "lane?" These things are highly google-resistant.

Yeah, that is a tough one - I am not aware of one source that would cover all of this stuff. SeqAnswers has been a great resource in helping me piece together this information. Maybe others can point to some good summary posts, such as this.

For a broad overview, a Google search of "next generation sequencing" might be helpful. To learn about lanes, you may want to look up information about the Illumina flow cell. There are 8 lanes on a flow cell, and one sample can be sequenced per lane (more if you barcode/multiplex the samples). If you are working with ABI SOLiD technology, then an understanding of color space (as compared to base space) would be essential. Also, knowing what single-end reads and paired-end reads are will be helpful in understanding the reads in a BAM file. Deep resequencing is one of the applications of next gen sequencing, along with ChIP-Seq, RNA-Seq, ... Reading the initial papers on these applications (Wold for ChIP-Seq and Mortazavi for RNA-seq) helped me a lot. I'll let you know if I can think of anything more specific.
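One place where the single-end versus paired-end distinction shows up directly in a SAM/BAM record is the FLAG field, a bitmask. The Python sketch below decodes it using the bit values from the SAM spec; the short label strings and the function name `decode_flag` are my own shorthand, not official names.

```python
from typing import List

# Bit values as listed in the SAM spec; the labels are informal shorthand.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> List[str]:
    """Return the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


# 99 and 147 are a typical pair of values for the two mates of a
# properly aligned paired-end read; 0 is a plain forward single-end read.
for flag in (99, 147, 0):
    print(flag, decode_flag(flag))
```

For a single-end run the `paired` bit is never set, so the mate-related bits carry no meaning there.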

**Fixee** · 06-23-2011, 11:30 AM

All right, well thanks to everyone who pitched in here. I'm going to dig into the resources mentioned, read some more posts here at seqanswers, and perhaps bug you all when I get stuck. Many thanks!

**ECO** · 06-23-2011, 12:02 PM

If you're not already familiar...qualifying a google search with "site:seqanswers.com" will likely make the results a lot more relevant....

**Fixee** · 06-23-2011, 01:00 PM

Originally posted by ECO

If you're not already familiar...qualifying a google search with "site:seqanswers.com" will likely make the results a lot more relevant....

Cheers... I've already collected some reading material with "site:seqanswers.com" and "filetype: pdf" (another useful google feature).

[You must remove the space after the "filetype:" above. I added a space because without it, this vBulletin site rewrites the colon-p as an emoticon, and HTML-encoding the colon wasn't working.]


Computer Scientist diving into bioinformatics... where to start?