SEQanswers

Old 08-27-2013, 05:56 PM   #1
nicole_01
Junior Member
 
Location: Brisbane

Join Date: Aug 2013
Posts: 5
Trimming Illumina PE sequences with Trimmomatic

Hi all,

I'm new to NGS work and have a few questions that I hope someone can help me with.

I have strand-specific paired-end (PE) Illumina data (100 bp). The company already did a clean-up (adaptor filtering etc.). I checked my two files with FastQC and quality-wise they look good (over Q30 on average), but there is a slight drop at the 3' end to about Q28, and the other graphs show a bit more variation for the first 5-7 bases. I just did a test run with PrinSeq and Trimmomatic on a few hundred sequences, and Trimmomatic seems to give me the nicer output, which Trinity might like better. PrinSeq repeats the sequence identifier on the separator line (line 1: sequence identifier; line 2: the actual read; line 3: + followed by the sequence identifier again; line 4: quality information). Trimmomatic doesn't; it only has the + on line 3, which matches the input file.

The two things I'm interested in doing are a HEADCROP of 7 nucleotides and, ideally, using TRAILING to trim by quality at the 3' end. According to the manual, TRAILING seems to work with a quality score of 1, 2 or 3 (3 should be used). What does that mean? I'd like to cut anything below Q30 at my 3' end. Some posts here on other Trimmomatic questions seem to suggest I can write 30 as well; is that true? Could I just say TRAILING:30, or does it have to be 3 (whatever that means)?
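
[Editor's sketch] To make the two steps concrete, here is a rough Python illustration of what a head crop of 7 followed by a 3' quality trim at Q30 would do to a single read. This is not Trimmomatic's code. The function names, the made-up read, and the Phred+64 offset are all assumptions for illustration, and it assumes the trailing step drops bases whose score is strictly below the threshold.

```python
def headcrop(seq, qual, n):
    """Drop the first n bases (and their quality characters) from a read."""
    return seq[n:], qual[n:]


def trailing(seq, qual, threshold, offset=64):
    """Drop bases from the 3' end while their Phred score is below threshold.

    offset=64 is an assumption; use 33 for Phred+33 data.
    """
    end = len(qual)
    while end > 0 and ord(qual[end - 1]) - offset < threshold:
        end -= 1
    return seq[:end], qual[:end]


# Made-up 12-base read with Phred+64 qualities: 'h' = Q40, 'B' = Q2
seq = "ACGTACGTACGT"
qual = "hhhhhhhhhhBB"

seq, qual = headcrop(seq, qual, 7)   # analogous to HEADCROP:7
seq, qual = trailing(seq, qual, 30)  # analogous to TRAILING:30
print(seq, qual)
```

The threshold is just an integer Phred score, so 30 is as valid a value as 3 in this picture.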

Strand-specificity doesn't really matter here, does it (my data is RF directionality)? Can I still write the /1 file as my first input file and the /2 as my second input file or would I have to change that?
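
[Editor's sketch] Trimming works on each read on its own, so the usual order (the /1 file first, the /2 file second) should presumably still apply. Below is a minimal sketch of how the paired-end command line could be assembled. The jar name, file names, and output prefix are hypothetical placeholders, the helper function is invented for illustration, and the steps are listed in the order they should run.

```python
def trimmomatic_pe_command(jar, read1, read2, prefix, phred=64):
    """Build an argument list for a Trimmomatic paired-end run.

    All file names are hypothetical; phred must be 33 or 64.
    """
    return [
        "java", "-jar", jar, "PE", f"-phred{phred}",
        read1, read2,                                   # /1 file, then /2 file
        f"{prefix}_1_paired.fq", f"{prefix}_1_unpaired.fq",
        f"{prefix}_2_paired.fq", f"{prefix}_2_unpaired.fq",
        "HEADCROP:7", "TRAILING:30",                    # steps run left to right
    ]


cmd = trimmomatic_pe_command("trimmomatic.jar", "reads_1.fq", "reads_2.fq", "trimmed")
print(" ".join(cmd))
```

The list only prints the command; nothing is executed, so it is safe to adapt the placeholders before running it for real.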

I'm also struggling with phred33/phred64. I read that 33 is for Illumina version 1.8, and I also read the wiki page most people seem to refer to on this, but the format my sequence ID matches isn't clearly tied to a version. It's very difficult to get information from my sequencing company, so I hope to figure out myself which version they might have used. My sequence ID looks like this:
instrument:run id:1101:1374:1950#ATCAGAA/1
Is there a way to figure out the Illumina version based on this?

Thank you so much for your help and apologies for the lengthy post.

Nicole
Old 08-27-2013, 07:01 PM   #2
JackieBadger
Senior Member
 
Location: Halifax, Nova Scotia

Join Date: Mar 2009
Posts: 381

"TRAILING:30" will work
However, Q20 is almost universally accepted as the threshold (99% base call accuracy, if I remember correctly), although this probably stems from the wide use of 454 in the early stages of NGS. Q30 (99.9%) is a good minimum for Illumina, but you could justify keeping any bases >Q20.
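
[Editor's sketch] The percentages above follow from the standard Phred definition, Q = -10 * log10(p), where p is the probability that the base call is wrong. A small Python check (the function name is made up for illustration):

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


for q in (20, 28, 30):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:.4f}, accuracy {100 * (1 - p):.2f}%")
```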

Take a look here: http://en.wikipedia.org/wiki/FASTQ_format
Were your samples run on a HiSeq?
If there is any confusion (your seq ids may have been edited without you knowing, for example) you can figure out the phred encoding from the characters used in your quality data
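
[Editor's sketch] One way to do this is to look at the lowest and highest characters in the quality lines. The sketch below is a heuristic, not a definitive test: the cut-offs (59 as the lowest code seen in the +64 family, 74 as the usual top of Phred+33 Illumina data) come from the FASTQ format page linked above, and the function name and example strings are made up.

```python
def guess_phred_offset(quality_lines):
    """Guess the quality offset (33 or 64) from quality strings.

    Returns None when the observed characters fit both encodings.
    """
    codes = [ord(c) for line in quality_lines for c in line]
    if not codes:
        return None
    lowest, highest = min(codes), max(codes)
    if lowest < 59:    # characters this low cannot occur with offset 64
        return 33
    if highest > 74:   # above 'J', beyond the usual Phred+33 Illumina range
        return 64
    return None        # ambiguous: look at more reads


print(guess_phred_offset(["BPSabcdefgh"]))       # made-up Phred+64-style string
print(guess_phred_offset(["#1=DDFFFHHHHHJJJ"]))  # made-up Phred+33-style string
```

Running it over many reads, not just the first few, makes the guess more reliable.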
Old 08-27-2013, 09:11 PM   #3
nicole_01
Junior Member
 
Location: Brisbane

Join Date: Aug 2013
Posts: 5

Thanks Jackie.

Yes, my samples were run on a HiSeq 2000. The company just got back to me: supposedly it was run through the Illumina pipeline v1.5, and the base quality values run from 2 to 41.

I tried to figure out the ASCII codes. With v1.5 (so Phred+64, if I found the right information) I'd have to start looking from 66, because 0 and 1 don't exist anymore and 2 is that weird "B", correct? That's where I stop understanding it. With v1.5, B is supposed to occur only at the end of a read and means Q<15 without a specific quality value attached, yet looking at the first 100 sequences, about 90% start with a BP/ or BS/. How can that be? Do you think they told me the wrong version number?
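
[Editor's sketch] The arithmetic can be checked in a few lines of Python, assuming the Phred+64 offset and the 2 to 41 range quoted above. The helper names and the example quality string are made up; the last helper simply counts how long the run of B characters at the 3' end of a read is.

```python
OFFSET = 64  # Phred+64, as assumed for pipeline v1.5


def phred_to_char(q, offset=OFFSET):
    """Quality score -> the character that encodes it."""
    return chr(q + offset)


def char_to_phred(c, offset=OFFSET):
    """Quality character -> the score it encodes."""
    return ord(c) - offset


def trailing_b_run(qual):
    """Length of the run of 'B' characters at the 3' end of a quality string."""
    return len(qual) - len(qual.rstrip("B"))


print(phred_to_char(2), phred_to_char(41))  # lowest and highest expected characters
print(char_to_phred("B"))
print(trailing_b_run("hhhhgfedBBBB"))       # made-up quality string
```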

Thanks
Old 08-28-2013, 05:39 AM   #4
ddb
Member
 
Location: Europe

Join Date: Feb 2012
Posts: 13

If it is just the 3rd line header that is the problem with Prinseq then you can use the command line flag

-no_qual_header

Your output should then show just a + on the 3rd line. This seems to be quite a common cause of confusion with the default PrinSeq output.
Old 08-28-2013, 09:54 AM   #5
HESmith
Senior Member
 
Location: Bethesda MD

Join Date: Oct 2009
Posts: 505

Nicole,

The sequences are not listed randomly, and the first reads are usually low quality (i.e., lots of Bs). Check the quality scores of reads from the middle of the data set for a more accurate representation of the whole.
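
[Editor's sketch] A minimal way to pull records from the middle of a FASTQ file in Python, assuming the plain four-lines-per-record layout. The function names and the tiny in-memory demo file are made up; for a real file, open it with open() and pick a skip value that lands well inside the data set.

```python
import io
from itertools import islice


def fastq_records(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def sample_slice(handle, skip, count):
    """Skip the first `skip` records, then return the next `count` records."""
    return list(islice(fastq_records(handle), skip, skip + count))


# Tiny made-up file: the first record is all B, the later ones are not
demo = io.StringIO(
    "@r1\nACGT\n+\nBBBB\n"
    "@r2\nACGT\n+\nhhhh\n"
    "@r3\nACGT\n+\nhhgB\n"
)
middle = sample_slice(demo, skip=1, count=1)
print(middle)
```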
Old 08-28-2013, 05:12 PM   #6
nicole_01
Junior Member
 
Location: Brisbane

Join Date: Aug 2013
Posts: 5

Thanks HESmith and ddb!
All times are GMT -8.