Seqanswers Leaderboard Ad

**Adrian_H** · 07-23-2010, 08:40 AM

You can use something like PIQA http://bioinfo.uh.edu/PIQA/ to generate boxplots showing the average read quality. The low quality at the end of the reads is not unusual, and you might well get better results after trimming the sequences back - it's definitely worth a try. Maq can do this itself - just set the sequence length to something shorter with the -1 option, or you can use a program like fastx_trimmer (http://hannonlab.cshl.edu/fastx_toolkit/ )

**R diggity** · 08-04-2010, 10:01 AM

Thanks for directing me towards PIQA, that's exactly the sort of package I was looking for. I can't seem to get it to work, but for now I'm using Maq's built-in read length feature to "shorten" reads. I assume this increases the possibility of false mapping, but it has increased my overall percentage mapped greatly (near 50%).

**Adrian_H** · 08-04-2010, 11:03 AM

Just looked at this message again - I think that your quality scores are in the newer Illumina FASTQ format. (This is different from the old Solexa format). Those B's at the end are Illumina's QC signal saying not to use that base (although I've certainly seen that those are sometimes still mappable)

As far as I know, Maq expects qualities to be in Sanger Fastq encoding, which means that you should to convert them before using Maq. Some of the other aligners have built in support for adjusting the scales, but most still assume the Sanger/Phred FASTQ scale unless told otherwise.

there are some little scripts to do the conversion here: http://seqanswers.com/forums/showthread.php?t=5210

See the wikipedia article on fastq for the details on all the different formats: http://en.wikipedia.org/wiki/FASTQ_format

**R diggity** · 08-04-2010, 02:59 PM

Ah, thanks for the lead. It appears as though I do have Illumina 1.5+. I'm not certain I understand how to run the scripts to convert these reads. Can I place the code within an existing Maq script?

**raela** · 08-05-2010, 04:41 AM

MAQ will actually do this for you - look into maq sol2sanger. I believe it's just `maq sol2sanger old.fastq new.fq` (or something like that) - it's how I've converted the reads I use. Also, for nice read statistics, try out FastQC: http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/ It allows you to use a java GUI to view the statistics, or it will generate an HTML page for you on the command line.

**R diggity** · 08-05-2010, 07:34 AM

FastQC also looks like a great way to evaluate reads, thanks. Does the sol2sanger script properly convert 1.5 illumina formats? I've tested it out and had slightly better alignment, but I thought Solexa fastq used a different range of characters.

**Adrian_H** · 08-05-2010, 07:37 AM

I don't think sol2sanger does the 1.5 Illumina format.

What I have been doing is just to pre-process the files using this script:

Code:

#!/usr/bin/perl

use strict;
use warnings;

my $count = 0;
while (<>) {
    chomp;
    if ($count++ % 4 == 3) { tr/\x40-\xff\x00-\x3f/\x21-\xe0\x21/; }
    print "$_\n";
}

Just save it as something like ill2sanger.pl, and then run:

perl ill2sanger.pl < YourData_illumina.fastq.txt > YourData_sanger.fastq.txt

**raela** · 08-05-2010, 07:56 AM

My scores went from something like
bbbbbbbbbbbbbbbbbbbb`bbbbbabbbbbbbabbbba_W]^`babTa]aaWaY]aa^bbR`b[][MW^^]ZX[_`
to
CCCCCCCCCCCCCCCCCCCCACCCCCBCCCCCCCBCCCCB@8>?ACBC5B>BB8B:>BB?CC3AC<><.8??>;9<@A

I'm 95% sure I used maq to do it.. I don't believe the perl script I tried once really worked. I use BWA to align, though, so definitely check how it works before going with it.

**R diggity** · 08-05-2010, 10:05 AM

Thanks for explaining how to use the script! I used ill2sanger successfully.

Original read qualities:

\^][Iba[GTVIYXYUUZ\VY^IXa`\_TGRIIZY]RDFY\][BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Converted read qualities:

=?><*CB<(57*:9:66;=7:?*9BA=@5(3**;:>3%':=><################################

Would it make more sense for the "#" characters to be "!" in Sanger format?

Edit: Also, does FastQC require a certain format for analysis?

**Adrian_H** · 08-05-2010, 12:54 PM

Yeah, I guess it would make some sense for the # to be a !, but I doubt that it makes much practical difference if you're using a cut-off of 10 or 15 for doing trimming.

**R diggity** · 08-05-2010, 06:50 PM

Thanks to everyone for the suggestions so far. I've learned more in the past week than when I started looking at this project several months ago.

So, after mapping reads with Maq under many different thresholds I seem to only get 30% mapped. FastQC makes me feel better about the reads (low duplication, average Phred qualities near 30). I'm fairly certain my problems stem from one of two issues:

1. Maq's mismatch (-n) has a maximum of 3 for shorter reads, but I'm using 75nt reads.
2. The species in question are highly diverged (~10my). I may need to rethink my entire approach.

Are there known assemblies between highly diverged species?

**Adrian_H** · 08-05-2010, 08:13 PM

Oh, so there's no reference genome for the actual organism that you sequenced??

I don't know anything about comparative assembly, but one thought is how about aligning a subset of the reads with something much more tolerant of errors/divergence (SSAHA2 can be set up to report all local alignments, or maybe just BLAST, but I don't know what's out there) to get a better sense of what is going on? Or can you consider doing de novo assembly? What sort of organism is this?

Topics	Statistics	Last Post
Evaluating Genome Sequencing for ECMO Patients in the NICU by seqadmin Started by seqadmin, 12-17-2024, 10:28 AM	0 responses 33 views 0 likes	Last Post by seqadmin 12-17-2024, 10:28 AM
New Genetic Toolkit Refines Studies on Gene Function and Disease by seqadmin Started by seqadmin, 12-13-2024, 08:24 AM	0 responses 49 views 0 likes	Last Post by seqadmin 12-13-2024, 08:24 AM
Study Links Brain Mechanism to Emotional Responses in Animals and Humans by seqadmin Started by seqadmin, 12-12-2024, 07:41 AM	0 responses 34 views 0 likes	Last Post by seqadmin 12-12-2024, 07:41 AM
Study Identifies Ribosomal RNA Fingerprints as Early Cancer Biomarkers by seqadmin Started by seqadmin, 12-11-2024, 07:45 AM	0 responses 46 views 0 likes	Last Post by seqadmin 12-11-2024, 07:45 AM

Seqanswers Leaderboard Ad

Announcement

Comparative genome analysis (using illumina reads)

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News