SEQanswers

Old 01-03-2012, 02:42 PM   #1
RogerH
Junior Member
 
Location: Townsville, Australia

Join Date: Dec 2011
Posts: 6
Assembling transcriptomes with trimmed or untrimmed data

I recently assembled the transcriptomes of 5 different algal species using Oases.

I found that I get higher N50 values and maximum contig lengths with the untrimmed data. Furthermore, although the percentage of reads used in the assembly is higher with trimmed data, trimming discards so much data overall that the untrimmed assemblies end up incorporating more reads in absolute terms.
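
In case it helps, by N50 I mean the contig length at which contigs of that size or longer cover at least half the total assembly length; a minimal sketch of the calculation in plain Python (the example lengths are made up):

Code:
# N50: sort contig lengths in descending order and return the length
# at which the running total first reaches half the assembly size.
def n50(contig_lengths):
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2.0
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0

# Example: total = 270, half = 135; 80 + 70 = 150 >= 135, so N50 = 70.
print(n50([80, 70, 50, 40, 30]))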

I know one side effect of sequencing errors is a higher RAM requirement, but besides that, is there any other negative (or positive) effect if I use untrimmed data for my assembly?

RAM wasn't really an issue for me, since I had access to a high-performance cluster with several nodes of 64 GB RAM each.
Old 01-12-2012, 02:46 PM   #2
RogerH
Junior Member
 
Location: Townsville, Australia

Join Date: Dec 2011
Posts: 6

Is there really nobody with an answer to this question?
Old 01-12-2012, 03:37 PM   #3
Kennels
Senior Member
 
Location: Sydney

Join Date: Feb 2011
Posts: 149

Hi,

I'm not sure what type of reads you are using, but if you are using Illumina reads, you should always trim off the first 12 to 15 bases, as they show substantial base-composition bias.
Did you do a FastQC quality check?
If you see severe biases at the 5' end, you should trim them off. I also trim some bases from the 3' end, depending on how dramatically the read quality falls off there. In addition, I filter out reads containing even one base below a certain Q score.
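
Roughly, that procedure looks like the sketch below (plain Python; the file names and thresholds are placeholders, a dedicated trimmer such as Trimmomatic or the FASTX-Toolkit does the same job much faster, and for paired-end data you would process both mates together so the files stay in sync):

Code:
HEAD_TRIM = 12   # drop the biased 5' bases
TAIL_TRIM = 15   # drop the low-quality 3' bases (a fixed number here,
                 # for simplicity; quality-based trimming is better)
MIN_Q = 20       # discard reads with any base below this Phred score

def phred(qual_char, offset=33):
    # ASCII quality character -> Phred score
    # (offset 33 for Sanger / Illumina 1.8+ FASTQ).
    return ord(qual_char) - offset

with open("reads.fastq") as fin, open("reads.trimmed.fastq", "w") as fout:
    while True:
        header = fin.readline().rstrip()
        if not header:
            break
        seq = fin.readline().rstrip()
        plus = fin.readline().rstrip()
        qual = fin.readline().rstrip()
        # Trim a fixed number of bases from each end.
        seq = seq[HEAD_TRIM:len(seq) - TAIL_TRIM]
        qual = qual[HEAD_TRIM:len(qual) - TAIL_TRIM]
        # Keep the read only if every remaining base meets MIN_Q.
        if seq and all(phred(q) >= MIN_Q for q in qual):
            fout.write(f"{header}\n{seq}\n{plus}\n{qual}\n")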

If you use untrimmed reads, you may get more contigs, but they will be quite unreliable due to misassemblies and possible chimeras.

Cheers
Old 01-12-2012, 03:56 PM   #4
RogerH
Junior Member
 
Location: Townsville, Australia

Join Date: Dec 2011
Posts: 6

Hi,

Thanks for the reply. Yes, I'm using Illumina 100 bp paired-end data.

My supervisor told me that I should just try both trimmed and untrimmed data, and then suggested that I use the untrimmed assembly for annotation. But I did fear that there might be a problem with that.

I ran FastQC on my data; there is a bit of a problem with the per-base sequence content (G/C/A/T proportions) in the first 10 bp (due to the not-so-random random primers used in Illumina library preparation, I believe), and the Q value of the last 15-20 bases drops off considerably.

The problem is that I'm pressed for time, so before Christmas I decided to stop working on the assembly and go ahead with the annotation step (which takes a considerable amount of time using Blast2GO).
Old 01-12-2012, 04:13 PM   #5
Kennels
Senior Member
 
Location: Sydney

Join Date: Feb 2011
Posts: 149

Sounds like you are doing exactly the same thing as me.
I am also using Blast2GO now; I have around 240k transcripts, and it will probably take 2 weeks or more, as I am doing it through the web interface.
I also have 100 nt paired-end Illumina reads, and the first 10 bases or so look like yours. It is indeed due to the not-so-random nature of the random hexamers used in the library prep. I trimmed off the first 12 for good measure, though trimming off 15 is not unusual.

Unfortunately, I don't think the annotation of the untrimmed data would be reliable, particularly since you say the Q score at the 3' end also drops off a lot. I would recommend using the trimmed data.
Old 01-12-2012, 04:33 PM   #6
RogerH
Junior Member
 
Location: Townsville, Australia

Join Date: Dec 2011
Posts: 6

Thanks, this is really helpful.

I'm mainly interested in finding a handful of housekeeping genes and another handful of genes of interest, so I can design qPCR primers for unsequenced species; I don't need a complete transcriptome at this stage.

I did manage to find some sequences that matched published sequences of my key enzyme, but there were also some weird results.

As I said, I'm a bit behind with my PhD (who isn't?) and hard pressed for time this year. So I think I will just go ahead and try to find my enzymes in the assembly I have, but also assemble a better transcriptome on the side.

Annotating 240k transcripts in 2 weeks is fairly ambitious based on my experience. I annotated 5 different species in parallel, and the one with 80k transcripts took over a month; but maybe I was doing something wrong. I might look into a local BLAST to speed things up for my annotation of the untrimmed data.
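
Something like the following is what I have in mind for the local BLAST step (a rough sketch driving the NCBI BLAST+ command-line tools from Python; the file names, database, and parameters are placeholders, and BLAST+ must be installed and on the PATH):

Code:
import subprocess

# One-time step: format a protein reference (e.g. Swiss-Prot)
# as a BLAST database.
subprocess.run(
    ["makeblastdb", "-in", "uniprot_sprot.fasta",
     "-dbtype", "prot", "-out", "sprot_db"],
    check=True,
)

# blastx the assembled transcripts against it; XML output
# (-outfmt 5) is the format Blast2GO can load.
subprocess.run(
    ["blastx", "-query", "transcripts.fasta", "-db", "sprot_db",
     "-evalue", "1e-5", "-num_threads", "8",
     "-outfmt", "5", "-out", "transcripts_vs_sprot.xml"],
    check=True,
)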
Old 01-12-2012, 04:38 PM   #7
Kennels
Senior Member
 
Location: Sydney

Join Date: Feb 2011
Posts: 149

Actually, thank you for the info. It's my first time using Blast2GO, so I'll keep in mind the time expected.
Cheers.