Dear all,
I have some questions about the steps before a de novo assembly. I have normalized cDNA MiSeq (paired-end) data from two marine nematode species (no reference genome is available for any marine nematode), which I want to assemble into a transcriptome. The sequencing company has already done some preprocessing for me:
1. Quality trimming: We trim low quality ends (< Q20) with FastX 0.0.13 [1].
2. Adapter trimming: The adapters are trimmed only at the end (at least 10bp
overlap and 90% match) with cutadapt 1.2.1 [3].
3. Quality filtering: Using FastX 0.0.13 and ShortRead 1.16.3, we remove in
succession small reads (length < 50 bp), polyA-reads (more than 90% of the
bases equal A), ambiguous reads (containing N), low quality reads (more than
50% of the bases < Q25) and artifact reads (all but 3 bases in the read equal one
base type).
4. Making pairing consistent: Filtering reads may remove one read of a pair and
make paired fastq-files inconsistent. In this step we remove reads that belong
to broken pairs and save them in separate fastq
5. Removal of contaminants: Using bowtie 2.0.0-beta5, we identify reads that
align to phixillumina and remove them.
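To make steps 3 and 4 concrete, this is how I understand the filters, written as a rough Python sketch. It is not the company's actual FastX/ShortRead commands. The function names, the Phred+33 quality encoding, and my reading of "all but 3 bases" as "at most 3 bases differ from the most common base" are all my own assumptions.

```python
def phred33(qual_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(c) - 33 for c in qual_string]


def keep_read(seq, quals, min_len=50, polya_frac=0.9,
              lowq=25, lowq_frac=0.5, artifact_slack=3):
    """Return True if a read passes the filters of step 3, applied in order."""
    seq = seq.upper()
    n = len(seq)
    if n < min_len:                       # small reads (length < 50 bp)
        return False
    if seq.count("A") > polya_frac * n:   # polyA reads (> 90% A)
        return False
    if "N" in seq:                        # ambiguous reads
        return False
    low = sum(1 for q in quals if q < lowq)
    if low > lowq_frac * n:               # > 50% of bases below Q25
        return False
    # Artifact reads: assumed to mean at most `artifact_slack` bases
    # differ from the single most common base.
    if max(seq.count(b) for b in "ACGT") >= n - artifact_slack:
        return False
    return True


def split_pairs(reads1, reads2):
    """Step 4: keep only intact pairs; everything else goes to orphans.

    reads1 and reads2 map a read id to its record (hypothetical layout).
    """
    shared = reads1.keys() & reads2.keys()
    paired1 = {k: v for k, v in reads1.items() if k in shared}
    paired2 = {k: reads2[k] for k in paired1}
    orphans = {k: v for k, v in reads1.items() if k not in shared}
    orphans.update({k: v for k, v in reads2.items() if k not in shared})
    return paired1, paired2, orphans
```

If this reading is right, a read is dropped as soon as it fails any single filter, and its mate then ends up in the separate broken-pairs file.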
That is where their work ends and mine begins. I imported the sequences into CLCbio and trimmed off the cDNA adapters, which had been needed to amplify my normalized cDNA libraries and increase the amount of cDNA.
My questions are:
- Before a de novo assembly there is the option to merge paired-end reads, which gives two data sets: one with merged sequences and one without. Is it a good idea to merge paired-end reads, or should the de novo assembly start from the original fastq files? Or should I do both: merge the paired-end data and use the merged sequences together with the original data in the assembly?
- During de novo assembly there is the option of scaffolding, and I'm not sure whether it is a good idea. It will indeed create longer contigs, but does it cause problems downstream during annotation? I mean: if two genes are in very close proximity (or even on opposite strands), they may end up in one contig. When blasting this contig, won't you miss one of the two genes?
- How is it possible that 10% of the reads did not map when I mapped them back to the transcriptome?
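To make the first question concrete, this is roughly what I understand by "merging" a read pair. It is a naive Python sketch that ignores base qualities. It is not how CLCbio or any specific tool does it, and the function names and the thresholds (`min_overlap`, `max_mismatch_frac`) are my own assumptions.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA sequence (upper case, ACGTN only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10, max_mismatch_frac=0.1):
    """Merge a pair if the end of read1 overlaps the start of revcomp(read2).

    Tries the longest overlap first and returns the merged sequence,
    or None when no acceptable overlap is found.
    """
    read2_rc = revcomp(read2)
    longest = min(len(read1), len(read2_rc))
    for olen in range(longest, min_overlap - 1, -1):
        tail = read1[-olen:]
        head = read2_rc[:olen]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * olen:
            # Keep read1 in full and append the non-overlapping rest of read2.
            return read1 + read2_rc[olen:]
    return None
```

Pairs for which this returns None would be the "not merged" data set, which is why I am unsure whether the assembler should get the merged reads, the original pairs, or both.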
Thanks in advance