SEQanswers

07-21-2014, 05:12 AM   #1
SquirrelSeq
Member
 
Location: Germany

Join Date: May 2013
Posts: 10
Strand specificity of genome fasta files (hg19)

Hello everybody,

do genomic reference FASTA files, such as the well-known hg19.fa, usually contain
one continuous physical strand, meaning that sense (plus) and antisense (minus) sections appear in their natural order,
or a virtual concatenation of only the sense (plus) sections, which physically alternate between the two single strands?

Regards,
SquirrelSeq
07-21-2014, 09:02 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Reference fasta files contain (or rather, define) only the plus strand of a genome. Sequences are concatenated only where there is evidence that they are physically joined, so hg19.fa, for example, contains only 25 main sequences: one for each of the 22 autosomes, plus X, Y, and the mitochondrial genome. There can also be some shorter additional sequences, but those represent human variation or imperfections in the sequenced genome and can be safely ignored for the purpose of this discussion.
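As an illustration of the "25 main sequences" point, here is a minimal Python sketch that lists the sequences in a FASTA stream and picks out the main ones. It assumes UCSC-style names (chr1 to chr22, chrX, chrY, chrM); the function names and the toy input are made up for this example, and a different naming scheme would need a different pattern.

```python
import io
import re

def fasta_lengths(handle):
    """Return {sequence_name: length} for a FASTA text stream."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The name is the first word after '>'
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths

# Assumption: UCSC-style names for the main sequences
MAIN = re.compile(r"^chr([1-9]|1[0-9]|2[0-2]|X|Y|M)$")

def main_sequences(lengths):
    """Keep only names that look like a main chromosome or chrM."""
    return [name for name in lengths if MAIN.match(name)]

# Toy input standing in for a real reference file
toy = io.StringIO(">chr1\nACGTAC\nGT\n>chr6_apd_hap1\nACG\n>chrM\nGATC\n")
lengths = fasta_lengths(toy)
print(lengths)
print(main_sequences(lengths))
```

For a real file, the same function could be called on `open("hg19.fa")` in place of the toy stream.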

The direction from which genes are read is totally unrelated to a fasta genome file; there will be some genes on the plus strand, and some on the minus strand, and of course (for human) the majority is non-coding anyway. A transcriptome fasta is different, though - it should have one sequence per gene or gene isoform, representing the sense in which it is read, rather than however it appears in the genome.
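To make the genome versus transcriptome distinction concrete, the following Python sketch extracts a feature from a reference sequence and reverse-complements it when the annotation says "-". The coordinates are assumed to be 1-based and inclusive (as in GFF-style annotation); the toy chromosome and the function names are invented for illustration only.

```python
# Translation table for complementing DNA (N stays N)
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def feature_sequence(chrom_seq, start, end, strand):
    """Return a feature in its own reading orientation.

    start and end are 1-based inclusive positions on the strand
    stored in the FASTA file (assumed GFF-style coordinates).
    """
    piece = chrom_seq[start - 1:end]
    if strand == "+":
        return piece
    if strand == "-":
        return reverse_complement(piece)
    raise ValueError("strand must be '+' or '-'")

# Toy chromosome: one gene on the stored strand, one on the opposite strand
chrom = "GGATGAAATAGCCTCAGGGCATAA"
print(feature_sequence(chrom, 3, 11, "+"))   # ATGAAATAG
print(feature_sequence(chrom, 14, 22, "-"))  # ATGCCCTGA
```

Both results start with ATG, although only the first appears literally in the stored sequence; this is the sense in which a transcriptome fasta differs from a genome fasta.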

Last edited by Brian Bushnell; 07-21-2014 at 09:06 AM.
07-22-2014, 04:53 AM   #3
SquirrelSeq
Member
 
Location: Germany

Join Date: May 2013
Posts: 10

Hello Brian,

thank you for the answer.

The whole confusion is caused by mixing up two different definitions of "plus"/"+" and "minus"/"-": one used when people talk about single genes, the other when they talk about genes on whole-chromosome fasta references.

My concatenation scenario did of course not mean joining physically unjoined sequences, as I know the typical per-chromosome structure of FASTA files very well. Instead, assuming that the "+"/"-" terminology was used consistently, I imagined a concatenation of adjacent sequence regions in sense (protein-coding) orientation, following the alternation of strand usage for coding along the chromosome, as a complicated but consistent solution. This would of course mean an arbitrary definition of which strand is "+" and which is "-" in noncoding regions, and furthermore positional jumping with respect to the dsDNA molecule.

From the perspective of FASTA files and gene annotation practices, your explanation/understanding is useful. However, since it is not consistent with the independent definition of "plus" and "minus", people get confused and misunderstand each other, as I have seen in many cases.

Therefore, some facts:

1. "In genomic FASTA reference files, all lines are from the same strand".
https://www.biostars.org/p/78884/
Which molecular strand is selected for the FASTA reference of a double-stranded chromosome is fully arbitrary and has nothing to do with "+" or "-".

2. However, the only way to give the published strand of the assembly an identity is to refer to the GENES that are coded in "sense" on this selected strand, meaning that...

3. if gene annotation tools specify a gene as coded on the "-" strand, it solely means that this gene is coded on the strand antisense to the arbitrarily published one. This is not to be confused with the reference-independent definition of "+" and "-" strand, which is...

4. "Molecular biologists call a single strand of DNA sense (or positive (+) ) if an RNA version of the same sequence is translated or translatable into protein."
"The two complementary strands of double-stranded DNA (dsDNA) are usually differentiated as the "sense" strand and the "antisense" strand. The DNA sense strand looks like the messenger RNA (mRNA) and can be used to read the expected protein code by human eyes (e.g. ATG codon = Methionine amino acid)."
http://en.wikipedia.org/wiki/Sense_(molecular_biology)

I hope this helps with future questions on this topic.

Best regards,
SquirrelSeq
07-22-2014, 08:48 AM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Hmm, I guess "+" and "-" are overloaded terms. When dealing with the human genome, the people I worked with generally talked about reads mapping to the plus strand, or having the 'A' allele of a SNP on the plus strand, with the assumption that plus meant the strand represented in the fasta file, NOT any gene that happened to be at that location. Because a majority of the human genome is noncoding, most of it cannot be described as plus or minus using a gene-centric definition, but the strands still need to be described somehow for clarity.
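The reference-centric usage described above is also how the SAM format records read orientation: bit 0x10 of the FLAG field marks a read aligned to the reverse strand of the reference, with no regard to any gene at that location. A minimal Python sketch of decoding that bit follows; the function name and the example flag values are chosen here for illustration.

```python
FLAG_UNMAPPED = 0x4   # SAM FLAG bit: read is unmapped
FLAG_REVERSE = 0x10   # SAM FLAG bit: read aligned to the reverse strand

def alignment_strand(flag):
    """Return '+' or '-' relative to the strand stored in the FASTA
    reference, or None when the read is unmapped."""
    if flag & FLAG_UNMAPPED:
        return None
    return "-" if flag & FLAG_REVERSE else "+"

# A few example FLAG values
for flag in (0, 16, 99, 147, 4):
    print(flag, alignment_strand(flag))
```

Whether a "-" read is sense or antisense to a gene can only be decided by additionally comparing it with the strand given in the gene annotation.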
All times are GMT -8.