SEQanswers





Old 04-20-2009, 11:11 AM   #1
behoward
Member
 
Location: USA

Join Date: Mar 2009
Posts: 13
454 read orientation

Hi everyone,

I am looking at a 454 dataset and I am wondering whether the read sequences (they are in a FASTA file) are usually in the same direction as the original mRNAs, or whether they can be the reverse complement.

This will determine which BLAT query type I use during alignment: either q=rna or q=dna. I think with standard (non-454) ESTs you don't know the orientation, so you have to use q=dna. However, this can give you unwanted duplicate alignments.
Old 04-20-2009, 11:28 AM   #2
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178

The typical protocol for sequencing RNA with 454 is to make ds cDNA, fragment it (nebulizer, covaris, etc.) then use a standard genomic library prep kit from Roche. This means polishing (blunting) the ends and attaching the sequencing adapters in a non-directional manner. Thus the reads you get will be a mixture of both directions.
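To make "a mixture of both directions" concrete, here is a minimal Python sketch. The sequences and the helper names (`reverse_complement`, `read_orientation`) are made up for illustration, and exact substring matching stands in for a real aligner, which would report orientation in its strand column instead.

```python
# Hypothetical illustration: with a non-directional library prep,
# a read may come from either strand of the ds cDNA.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_orientation(read: str, transcript: str) -> str:
    """Classify a read as 'sense', 'antisense', or 'unmatched'.

    Uses exact substring matching only, which is an assumption made
    to keep the example short.
    """
    if read in transcript:
        return "sense"
    if reverse_complement(read) in transcript:
        return "antisense"
    return "unmatched"


# Made-up transcript and reads, for demonstration only.
transcript = "ATGGCGTACGTTAGCCTGA"
sense_read = "GCGTACGTT"
antisense_read = reverse_complement("TTAGCCTGA")

print(read_orientation(sense_read, transcript))
print(read_orientation(antisense_read, transcript))
```

Both reads come from the same transcript, but only the first matches it as written; the second matches only after reverse complementing, which is why the query strand cannot be assumed for this kind of library.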
Old 04-20-2009, 12:01 PM   #3
behoward
Member
 
Location: USA

Join Date: Mar 2009
Posts: 13

Thanks! I guess I have to use q=dna, then.

The dataset I am looking at is a public 454 GS20 dataset from the paper "Sampling the Arabidopsis Transcriptome with massively parallel pyrosequencing" (Weber et al, Plant Physiology May 2007). Kmcarr, I think I remember from a previous post that you have some experience with this particular dataset.

Do you have any guess whether the original researchers used q=rna in the BLAT alignment? I remember they had about 11% of the reads that don't map to the genome. But if I use q=dna, I get a larger percent mapping to TAIR7.

Also, if I do use q=dna, I guess I will only want to 'count' reads once when they map to a gene and its reverse complement. However, I would want to keep both matches when a read maps to multiple genes (say paralogs, or duplicate genes). I'm not sure how to tell these two cases apart... Anyone have any suggestions?
Old 04-20-2009, 01:10 PM   #4
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178

Quote:
Originally Posted by behoward View Post
The dataset I am looking at is a public 454 GS20 dataset from the paper "Sampling the Arabidopsis Transcriptome with massively parallel pyrosequencing" (Weber et al, Plant Physiology May 2007). Kmcarr, I think I remember from a previous post that you have some experience with this particular dataset.

Do you have any guess whether the original researchers used q=rna in the BLAT alignment? I remember they had about 11% of the reads that don't map to the genome. But if I use q=dna, I get a larger percent mapping to TAIR7.

Also, if I do use q=dna, I guess I will only want to 'count' reads once when they map to a gene and its reverse complement. However, I would want to keep both matches when a read maps to multiple genes (say paralogs, or duplicate genes). I'm not sure how to tell these two cases apart... Anyone have any suggestions?

Man! That dataset just won't die. When I said I had some familiarity with the data I was understating it a bit. I was one of the authors and performed all of the bioinformatics. I used the default BLAT settings for query and target type, i.e. both -q and -t set to dna. However, BLAT will only output a single alignment for a read at a given location; it will not report both the forward and reverse alignment of a read, so you don't have to worry about that.

You are correct that you will find equally good alignments to paralogous genes. You will have to decide how you want to approach assigning or counting those reads.
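One possible way to tell "same place, either strand" apart from "several different places" is to group each read's hits by target coordinates and ignore the strand column. The Python sketch below assumes headerless, tab-separated PSL lines with the standard 21 columns; the function names and the example hits are hypothetical, and the overlap rule is just one reasonable choice.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    q_name: str
    strand: str
    t_name: str
    t_start: int
    t_end: int


def parse_psl_line(line: str) -> Hit:
    """Parse one tab-separated PSL data line (21 columns, no header)."""
    f = line.rstrip("\n").split("\t")
    return Hit(f[9], f[8], f[13], int(f[15]), int(f[16]))


def distinct_loci(hits):
    """Merge one read's hits that overlap on the same target sequence,
    ignoring strand. Returns a list of (t_name, start, end) tuples."""
    loci = []
    for h in sorted(hits, key=lambda h: (h.t_name, h.t_start)):
        if loci and loci[-1][0] == h.t_name and h.t_start < loci[-1][2]:
            name, start, end = loci[-1]
            loci[-1] = (name, start, max(end, h.t_end))
        else:
            loci.append((h.t_name, h.t_start, h.t_end))
    return loci


def loci_per_read(psl_lines):
    """Map each read name to its list of distinct target loci."""
    by_read = defaultdict(list)
    for line in psl_lines:
        if line.strip():
            hit = parse_psl_line(line)
            by_read[hit.q_name].append(hit)
    return {read: distinct_loci(hits) for read, hits in by_read.items()}


def make_line(q_name, strand, t_name, t_start, t_end):
    """Build a made-up PSL line; unused columns are filled with '0'."""
    fields = ["0"] * 21
    fields[8], fields[9], fields[13] = strand, q_name, t_name
    fields[15], fields[16] = str(t_start), str(t_end)
    return "\t".join(fields)


# Made-up hits: read1 hits one Chr1 locus on both strands plus a Chr3 locus.
example = [
    make_line("read1", "+", "Chr1", 100, 200),
    make_line("read1", "-", "Chr1", 100, 200),
    make_line("read1", "+", "Chr3", 5000, 5100),
    make_line("read2", "+", "Chr2", 50, 150),
]
result = loci_per_read(example)
print(result)
```

A read with more than one locus in `result` is a candidate paralog or duplicate-gene hit; a read with exactly one locus is counted once regardless of strand.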

You will also find many poor alignments of reads to the genome. You should play with the pslReps program to filter your initial BLAT output. pslReps is meant to retain only the best alignment if a query sequence aligns to multiple target locations. If there is a group of alignments that are equally good (or nearly so), they will all be retained.
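The "keep the best, plus anything nearly as good" idea can be sketched in a few lines of Python. To be clear, this is not pslReps's actual scoring or algorithm: the score (matches minus mismatches) and the `near_top` fraction are simplifying assumptions chosen for illustration, and the example lines are made up.

```python
from collections import defaultdict


def simple_score(fields):
    """Assumed score: matches minus mismatches (first two PSL columns)."""
    return int(fields[0]) - int(fields[1])


def keep_near_best(psl_lines, near_top=0.01):
    """For each query, keep alignments whose score is within `near_top`
    (a fraction) of that query's best score."""
    by_query = defaultdict(list)
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 21 or not fields[0].isdigit():
            continue  # skip header or malformed lines
        by_query[fields[9]].append((simple_score(fields), line))
    kept = []
    for scored in by_query.values():
        best = max(score for score, _ in scored)
        cutoff = best - abs(best) * near_top
        kept.extend(line for score, line in scored if score >= cutoff)
    return kept


def make_line(matches, mismatches, q_name, t_name):
    """Build a made-up PSL line; unused columns are filled with '0'."""
    fields = ["0"] * 21
    fields[0], fields[1] = str(matches), str(mismatches)
    fields[8], fields[9], fields[13] = "+", q_name, t_name
    return "\t".join(fields)


# Made-up hits: two equally good alignments and one poor one for read1.
example = [
    make_line(100, 0, "read1", "Chr1"),
    make_line(100, 0, "read1", "Chr3"),
    make_line(60, 5, "read1", "Chr5"),
    make_line(80, 2, "read2", "Chr2"),
]
filtered = keep_near_best(example)
print(len(filtered))
```

Here the two tied read1 alignments survive and the poor one is dropped, which mirrors the behaviour described above for equally good hits.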
Old 04-25-2009, 01:17 PM   #5
behoward
Member
 
Location: USA

Join Date: Mar 2009
Posts: 13

Well, thanks again

I guess I came to the right person! I suppose the good thing about a dataset that won't die is that you must get a ton of citations.

Cheers,
Brian
All times are GMT -8.