Seqanswers Leaderboard Ad

**TiborNagy** · 02-17-2014, 02:51 AM

It is a small RNA sequenceing?

**Zapages** · 02-17-2014, 02:01 PM

Originally posted by TiborNagy View Post

It is a small RNA sequenceing?

It was just total mRNA to cDNA and then sequencing using Illumina pipeline. The data was downloaded from NCBI GEO/SRA.

**mastal** · 02-17-2014, 03:39 PM

Originally posted by Zapages View Post

I was wondering what I was doing wrong? Should I trim the sequences so that there are only 30 bp left? I would be losing approximately 50 bps.

Also all the quality of the reads are great by being between 38 and 24.

Any ideas will be appreciated on how I could go about fixing this issue.

Thank you,

Zapages

EDIT: I was able to fix the Kmers, by removing the last 30 bps in the reads. I was wondering if this was correct manner of handling this or should I have used a different strategy for this? Everything passed except for high duplication levels, which is to be expected due to the data set being RNA-Seq.

Having a look at the sequence of the kmers at the 3' end in your plot, it looks like it is part of the Illumina adapter sequence, so maybe the trimming procedure didn't remove all the adapters. Try running trimmomatic with the option '-k 10' (default is '-k 5'), it will give a better idea of what the over-represented kmers are.

**GenoMax** · 02-17-2014, 04:36 PM

Originally posted by mastal View Post

Try running trimmomatic with the option '-k 10' (default is '-k 5'), it will give a better idea of what the over-represented kmers are.

@Maria: Did you mean to say fastqc?

**mastal** · 02-18-2014, 02:19 AM

@GenoMax, Yes, I meant FastQC, thanks for spotting the error.

**relipmoc** · 02-18-2014, 06:29 AM

Originally posted by Zapages View Post

It was just total mRNA to cDNA and then sequencing using Illumina pipeline. The data was downloaded from NCBI GEO/SRA.

Could you tell us the SRR number?

**Zapages** · 02-23-2014, 01:40 PM

Thank you GenoMax and mastal. I tried to use the non-integrative mode for FastQC and unfortunately I received the heap out of memory error in terminal was trying to extract kmer (-k 10 values).

Regardless, I tried different settings and I ended up doing various tests and trimming the datasets to 50bps. The original length was 101 or 76 bps. I believe this should be fine, right?

@relipmoc, the SRR data set was: SRR330569

**relipmoc** · 02-25-2014, 01:37 AM

try skewer

Originally posted by Zapages View Post

@relipmoc, the SRR data set was: SRR330569

The adapter sequence of the first pair is different from canonical TruSeq3 adapter.
>Real/1
AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
>Real/2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGAT

the following are TruSeq3 adapters
>TruSeq3/1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
>TruSeq3/2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA

Nevertheless, when I used trimmomatic for trimming, I got similar Kmer Content plot as yours.

Then I used skewer for trimming instead and got a better Kmer Content plot.

The following are log information:
COMMAND LINE: skewer -x AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG -y AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGAT -t 8 SRR330569_1.fastq SRR330569_2.fastq
Input file: SRR330569_1.fastq
Paired file: SRR330569_2.fastq
trimmed: SRR330569_1.trimmed-Q0L18-pair1.fastq, SRR330569_1.trimmed-Q0L18-pair2.fastq

Parameters used:
-- 3' end adapter sequence (-x): AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
-- paired 3' end adapter sequence (-y): AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGAT
-- maximum error ratio allowed (-r): 0.100
-- maximum indel error ratio allowed (-d): 0.030
-- minimum read length allowed after trimming (-l): 18
-- file format (-f): Sanger/Illumina 1.8+ FASTQ (auto detected)
-- number of concurrent threads (-t): 8
Tue Feb 25 11:32:32 2014 >> started
|=================================================>| (100.00%)
Tue Feb 25 11:34:14 2014 >> done (102.664s)
27005344 read pairs processed; of these:
1058 ( 0.00%) short read pairs filtered out after trimming by size control
924 ( 0.00%) empty read pairs filtered out after trimming by size control
27003362 (99.99%) read pairs available; of these:
11959533 (44.29%) trimmed read pairs available after processing
15043829 (55.71%) untrimmed read pairs available after processing
log has been saved to "SRR330569_1.trimmed-Q0L18.log".

Attached Files

SRR330569_1.trimmed-Q0L18.log.txt (3.0 KB, 36 views)

**tonybolger** · 03-04-2014, 08:11 AM

Originally posted by Zapages View Post

Hi Everyone,

I have weird situation. I am getting high Kmers on my 3' end (right end) of the Illumina reads. I have been using trimmomatic 0.32 and its been fairly easy to work with. Also I have paired end sequences and this is present with both forward and reverse reads.

Based on relipmoc's reply, and the vintage of the dataset, i would say these are TruSeq2 adapters (i haven't downloaded the dataset to verify) - have you tried using the TruSeq2 PE adapter file supplied with Trimmomatic?

**relipmoc** · 03-06-2014, 01:33 AM

Comparison of the trimming results of skewer and trimmomatic

The trimmed reads of skewer were aligned to dsim reference using tophat, with the following command:

Code:

$ tophat dsim SRR330569_1.trimmed-Q0L18-pair1.fastq SRR330569_1.trimmed-Q0L18-pair2.fastq

where dsim is the prefix of bowtie2 index files which were built from the dsim-r1.4 reference genome (ftp://ftp.flybase.net/genomes/Drosop...-r1.4.fasta.gz).

the content of "tophat_out/align_summary.txt" is:

PHP Code:


Left reads:

          Input     :  27003362

           Mapped   :  22637908 (83.8% of input)

            of these:  13345683 (59.0%) have multiple alignments (2504776 have >20)

Right reads:

          Input     :  27003362

           Mapped   :  21275973 (78.8% of input)

            of these:  12805304 (60.2%) have multiple alignments (2504737 have >20)

81.3% overall read mapping rate.



Aligned pairs:  20542511

     of these:  12581418 (61.2%) have multiple alignments

                 2509666 (12.2%) are discordant alignments

66.8% concordant pair alignment rate.

The trimmed reads of trimmomatic were aligned to dsim reference using tophat, with the following command:

Code:

$ tophat dsim real-10-pair1.paired.fq real-10-pair2.paired.fq

where real-10-pair1.paired.fq real-10-pair2.paired.fq were produced by trimmomatic-0.32.jar using default parameter (ILLUMINACLIP:real-adapters:2:30:10:1:1 MINLEN:18).

the content of "tophat_out/align_summary.txt" is:

PHP Code:


Left reads:

          Input     :  26987807

           Mapped   :  17819297 (66.0% of input)

            of these:  10586873 (59.4%) have multiple alignments (2255543 have >20)

Right reads:

          Input     :  26987807

           Mapped   :  16885216 (62.6% of input)

            of these:  10213191 (60.5%) have multiple alignments (2255529 have >20)

64.3% overall read mapping rate.



Aligned pairs:  15637748

     of these:   9694646 (62.0%) have multiple alignments

                 2074039 (13.3%) are discordant alignments

50.3% concordant pair alignment rate.

**Apexy** · 03-06-2014, 10:06 PM

Hi Zapages and other,
If you plan on performining a de novo assembly of the trimmed reads, it may be good for you to have a look at this paper. http://journal.frontiersin.org/Journ...014.00017/full

**tonybolger** · 03-27-2014, 05:35 AM

I've finally gotten around to testing this and it appears that the appropriate included adapter sequences (TruSeq2-PE) do a perfectly fine job with this data. Of course, it is important to actually use the correct adapters

Using just adapter trimming (ILLUMINACLIP:adapters/TruSeq2-PE.fa:30:12:1:true, MINLEN:18), the aligned pairs are 20,335,738, slightly behind (~1.0%) the 20,542,511 achieved by skewer.

When the suggested LEADING:3 / TRAILING:3 steps are added, to remove the super-low quality stretches, 22,222,781 pairs align, about 8.2% more than skewer.

Addition of further quality trimming, e.g using MAXINFO:40:0.9, brings that up to 22,910,973 pairs aligning, ~11.5% more.

In any case, as pointed out by Apexy above, very aggressive trimming isn't always a good idea - it depends on the purpose, amount of data etc.

**modi2020** · 04-13-2014, 11:35 AM

Hi Toney,

I noticed your ILLUMINACLIP command and wanted to verify I got the manual correctly. So the manual say its:
ILLUMINACLIP:<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip
threshold>:<simple clip threshold>:<minAdapterLength>:<keepBothReads>

and the command you specified is ILLUMINACLIP:adapters/TruSeq2-PE.fa:30:12:1:true, MINLEN:18

As I understand, 30 is the seed mismatch, 12 is the palindrome clip threshold, 1 is the simple clip thrshold but I don't know what true is when I compare it to the command in the manual. I suppose its for keepBothReads but isn't that specified last.

Thank you

Originally posted by tonybolger View Post

I've finally gotten around to testing this and it appears that the appropriate included adapter sequences (TruSeq2-PE) do a perfectly fine job with this data. Of course, it is important to actually use the correct adapters

Using just adapter trimming (ILLUMINACLIP:adapters/TruSeq2-PE.fa:30:12:1:true, MINLEN:18), the aligned pairs are 20,335,738, slightly behind (~1.0%) the 20,542,511 achieved by skewer.

When the suggested LEADING:3 / TRAILING:3 steps are added, to remove the super-low quality stretches, 22,222,781 pairs align, about 8.2% more than skewer.

Addition of further quality trimming, e.g using MAXINFO:40:0.9, brings that up to 22,910,973 pairs aligning, ~11.5% more.

In any case, as pointed out by Apexy above, very aggressive trimming isn't always a good idea - it depends on the purpose, amount of data etc.

**mastal** · 04-13-2014, 11:56 AM

True is for 'keep both reads', and it is specified last in the ILLUMINACLIP command.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 25 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 24 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

High amounts of Kmer at the 3' end after Trimming

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News