SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   Bioinformatics (http://seqanswers.com/forums/forumdisplay.php?f=18)
-   -   Aligning transcriptome with trimmed or untrimmed data (http://seqanswers.com/forums/showthread.php?t=16588)

RogerH 01-03-2012 02:42 PM

Aligning transcriptome with trimmed or untrimmed data
 
I recently assembled the transcriptomes of 5 different algal species using Oases.

I found that I get higher N50 values and longer maximal contigs with the untrimmed data. Furthermore, although the percentage of reads used is higher with trimmed data, trimming massively reduces the total amount of data, so in absolute terms more of my reads end up in the assembly when I use the untrimmed data.
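For reference, N50 can be computed directly from the contig lengths; a minimal Python sketch (the function name `n50` is just illustrative, not any tool's API):

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Total length is 2900; half (1450) is first reached at the 800 bp contig.
print(n50([1000, 800, 500, 300, 200, 100]))  # -> 800
```

Note that N50 rewards long contigs regardless of correctness, which is why untrimmed data can inflate it even when the assembly is less reliable.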

I know one side effect of sequencing errors is a higher RAM requirement, but besides that, is there any other negative (or positive) effect if I use untrimmed data for my assembly?

RAM wasn't really an issue for me, since I had access to a high performance computer with several nodes with 64 GB RAM each.

RogerH 01-12-2012 02:46 PM

Is there really nobody with an answer to this question?

Kennels 01-12-2012 03:37 PM

Quote:

Originally Posted by RogerH (Post 60813)
I recently aligned the transcriptomes of 5 different algal species, using oases. [...]

Hi,

I'm not sure what type of reads you are using, but if you are using Illumina reads, you should always trim off the first 12 to 15 bases, as they show substantial base-composition bias.
Did you do a FastQC quality check?
If you see severe bias at the 5' end, trim it off. I also trim some bases from the 3' end, depending on how sharply the read quality falls off there. In addition, I filter out any read containing even one base below a chosen Q-score threshold.

If you use untrimmed reads, you may get more contigs, but the assembly will be less reliable, with misassemblies and possibly chimeric contigs.
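The trimming and filtering steps above can be sketched roughly in Python (thresholds and the `trim_record` helper are illustrative; dedicated tools such as Trimmomatic or fastx_toolkit do this properly and much faster):

```python
def trim_record(seq, qual, head=12, min_q=20, phred_offset=33):
    """Cut `head` bases off the 5' end, trim the 3' end back past any
    trailing bases with quality < min_q, and return None if any remaining
    base is still below min_q (i.e. discard the whole read)."""
    seq, qual = seq[head:], qual[head:]
    # 3' quality trim: drop trailing low-quality bases
    while qual and ord(qual[-1]) - phred_offset < min_q:
        seq, qual = seq[:-1], qual[:-1]
    # filter: discard reads that still contain any low-quality base
    if any(ord(q) - phred_offset < min_q for q in qual):
        return None
    return seq, qual

# 'I' is Q40, '#' is Q2 in Phred+33: the two trailing '#' bases get trimmed.
print(trim_record("ACGTACGTACGTACGTACGT", "I" * 18 + "##"))
```

Applied over a FASTQ file (4 lines per record), this mirrors the 5'-trim, 3'-trim, and whole-read filter described above.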

Cheers

RogerH 01-12-2012 03:56 PM

Hi,

Thanks for the reply. Yes, I'm using Illumina 100bp paired-end data.

My supervisor told me that I should just try both trimmed and untrimmed data, and then suggested that I use the untrimmed assembly for annotation. But I did fear that there might be a problem with that.

I used FastQC on my data; there is a bit of a problem with the per-base GC/AT content in the first 10 bp (due to the not-so-random "random" primers used for Illumina library preparation, I believe), and the Q value of the last 15-20 bases drops off considerably.

The problem is that I'm pressed for time, so before Christmas I decided to stop working on the assembly and go ahead with the annotation step (which takes a considerable amount of time using Blast2GO).

Kennels 01-12-2012 04:13 PM

Quote:

Originally Posted by RogerH (Post 61711)
Hi,

Thanks for the reply. Yes, I'm using Illumina 100bp paired-end data. [...]

Sounds like you are doing exactly the same thing as me.
I am also using Blast2GO now; I have around 240k transcripts, and this will probably take 2 weeks or more since I am doing it through the web interface.
I also have 100-nt paired-end Illumina reads, and the first 10 bases or so look like yours. It is indeed due to the not-so-random nature of the random hexamers used in the library prep. I trimmed off the first 12 for good measure, though trimming off 15 is not unusual.

Unfortunately, I don't think the annotation of the untrimmed data would be reliable, particularly since you say the Q score at the 3' end also drops off a lot. I would recommend using the trimmed data.

RogerH 01-12-2012 04:33 PM

Thanks, this is really helpful.

I'm mainly interested in finding a handful of housekeeping genes and another handful of genes of interest, to design qPCR primers for unsequenced species, not in a complete transcriptome at this stage.

I did manage to find some sequences that matched published sequences of my key enzyme, but there were also some weird results.

As I said, I'm a bit behind with my PhD (who isn't?), and I'm hard-pressed for time this year. So I think I will just go ahead and try to find my enzymes with the assembly I have, but also assemble a better transcriptome on the side.

240k transcripts in 2 weeks is fairly ambitious based on my experience. I annotated 5 different species in parallel, and the one with 80k transcripts took over a month, though maybe I was doing something wrong. I might look into a local BLAST to speed things up for my annotation of the untrimmed data.
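A local BLAST run could look roughly like the following sketch; the blastx flags (`-query`, `-db`, `-evalue`, `-outfmt`, `-num_threads`, `-out`) are real BLAST+ options, but the file names and the `blastx_cmd` helper are made up for illustration:

```python
import subprocess  # used only by the commented-out run() call below

def blastx_cmd(query_fasta, protein_db, out_file, evalue=1e-05, threads=8):
    """Build a blastx command line producing a tabular hit table (outfmt 6)."""
    return [
        "blastx",
        "-query", query_fasta,
        "-db", protein_db,      # built beforehand with: makeblastdb -in db.fasta -dbtype prot
        "-evalue", str(evalue),
        "-outfmt", "6",         # tab-separated hit table
        "-num_threads", str(threads),
        "-out", out_file,
    ]

cmd = blastx_cmd("transcripts.fasta", "swissprot", "hits.tsv")
# subprocess.run(cmd, check=True)  # uncomment once BLAST+ is installed locally
```

The resulting tabular hits could then be loaded into Blast2GO for the GO-mapping step, skipping the slow web-based BLAST stage.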

Kennels 01-12-2012 04:38 PM

Actually, thank you for the info. It's my first time using Blast2GO, so I'll keep the expected time in mind.
Cheers.
