Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Comparative genome analysis (using illumina reads)

    Hi all,

    I'm involved in a project involving SNP detection for various candidate genes. I've started by using Maq to align my reads using a mismatch value of 3. For some reason, I've only been able to align around 20% of the reads. I'm wondering if this has something to do with:

    1) The format of the quality scores. I'm assuming they are something like solexa, but I just want to double check that this is what Maq "wants". Here is an example before fastq conversion.

    @7:1:22:390#0/1
    AAATACGATGTAGAAACCACATATTTTGAAACAATATGCAACAACAAACTGTGAATTAAATCAACGCATATGAAA
    +7:1:22:390#0/1
    ab``\HRZ`\T]`]]aOR_[]][^Q_SZP][YP\W\VWU\LS\Y]_]UUPVNMN]TQYY\MHN^`BBBBBBBBBB


    2) The actual quality of the reads. A quick glance at my reads file shows a lot of Bs on the 3' end. I joined this project after the reads were obtained, so I know little about the sample quality. Is there a way to determine average quality?

    Overall, I estimate close to 24x coverage. I know Maq should be able to deal with low quality scores, but I'm wondering if this is one of those special cases requiring trimming. Short of acquiring new reads, is there an appropriate way to approach this issue?
    Last edited by R diggity; 07-23-2010, 07:06 AM.

  • #2
    You can use something like PIQA http://bioinfo.uh.edu/PIQA/ to generate boxplots showing the average read quality. The low quality at the end of the reads is not unusual, and you might well get better results after trimming the sequences back - it's definitely worth a try. Maq can do this itself - just set the sequence length to something shorter with the -1 option, or you can use a program like fastx_trimmer (http://hannonlab.cshl.edu/fastx_toolkit/ )

    Comment


    • #3
      Thanks for directing me towards PIQA, that's exactly the sort of package I was looking for. I can't seem to get it to work, but for now I'm using Maq's built-in read length feature to "shorten" reads. I assume this increases the possibility of false mapping, but it has increased my overall percentage mapped greatly (near 50%).
      Last edited by R diggity; 08-04-2010, 12:18 PM.

      Comment


      • #4
        Just looked at this message again - I think that your quality scores are in the newer Illumina FASTQ format. (This is different from the old Solexa format). Those B's at the end are Illumina's QC signal saying not to use that base (although I've certainly seen that those are sometimes still mappable)

        As far as I know, Maq expects qualities to be in Sanger Fastq encoding, which means that you should to convert them before using Maq. Some of the other aligners have built in support for adjusting the scales, but most still assume the Sanger/Phred FASTQ scale unless told otherwise.

        there are some little scripts to do the conversion here: http://seqanswers.com/forums/showthread.php?t=5210

        See the wikipedia article on fastq for the details on all the different formats: http://en.wikipedia.org/wiki/FASTQ_format
        Last edited by Adrian_H; 08-05-2010, 07:50 PM.

        Comment


        • #5
          Ah, thanks for the lead. It appears as though I do have Illumina 1.5+. I'm not certain I understand how to run the scripts to convert these reads. Can I place the code within an existing Maq script?

          Comment


          • #6
            MAQ will actually do this for you - look into maq sol2sanger. I believe it's just `maq sol2sanger old.fastq new.fq` (or something like that) - it's how I've converted the reads I use. Also, for nice read statistics, try out FastQC: http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/ It allows you to use a java GUI to view the statistics, or it will generate an HTML page for you on the command line.

            Comment


            • #7
              FastQC also looks like a great way to evaluate reads, thanks. Does the sol2sanger script properly convert 1.5 illumina formats? I've tested it out and had slightly better alignment, but I thought Solexa fastq used a different range of characters.

              Comment


              • #8
                I don't think sol2sanger does the 1.5 Illumina format.

                What I have been doing is just to pre-process the files using this script:

                Code:
                #!/usr/bin/perl
                
                use strict;
                use warnings;
                
                my $count = 0;
                while (<>) {
                    chomp;
                    if ($count++ % 4 == 3) { tr/\x40-\xff\x00-\x3f/\x21-\xe0\x21/; }
                    print "$_\n";
                }
                Just save it as something like ill2sanger.pl, and then run:

                perl ill2sanger.pl < YourData_illumina.fastq.txt > YourData_sanger.fastq.txt
                Last edited by Adrian_H; 08-05-2010, 07:51 PM.

                Comment


                • #9
                  My scores went from something like
                  bbbbbbbbbbbbbbbbbbbb`bbbbbabbbbbbbabbbba_W]^`babTa]aaWaY]aa^bbR`b[][MW^^]ZX[_`
                  to
                  CCCCCCCCCCCCCCCCCCCCACCCCCBCCCCCCCBCCCCB@8>?ACBC5B>BB8B:>BB?CC3AC<><.8??>;9<@A

                  I'm 95% sure I used maq to do it.. I don't believe the perl script I tried once really worked. I use BWA to align, though, so definitely check how it works before going with it.

                  Comment


                  • #10
                    Thanks for explaining how to use the script! I used ill2sanger successfully.

                    Original read qualities:

                    \^][Iba[GTVIYXYUUZ\VY^IXa`\_TGRIIZY]RDFY\][BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

                    Converted read qualities:

                    =?><*CB<(57*:9:66;=7:?*9BA=@5(3**;:>3%':=><################################

                    Would it make more sense for the "#" characters to be "!" in Sanger format?

                    Edit: Also, does FastQC require a certain format for analysis?
                    Last edited by R diggity; 08-05-2010, 10:16 AM.

                    Comment


                    • #11
                      Yeah, I guess it would make some sense for the # to be a !, but I doubt that it makes much practical difference if you're using a cut-off of 10 or 15 for doing trimming.
                      Last edited by Adrian_H; 08-05-2010, 07:49 PM.

                      Comment


                      • #12
                        Thanks to everyone for the suggestions so far. I've learned more in the past week than when I started looking at this project several months ago.

                        So, after mapping reads with Maq under many different thresholds I seem to only get 30% mapped. FastQC makes me feel better about the reads (low duplication, average Phred qualities near 30). I'm fairly certain my problems stem from one of two issues:

                        1. Maq's mismatch (-n) has a maximum of 3 for shorter reads, but I'm using 75nt reads.
                        2. The species in question are highly diverged (~10my). I may need to rethink my entire approach.

                        Are there known assemblies between highly diverged species?

                        Comment


                        • #13
                          Oh, so there's no reference genome for the actual organism that you sequenced??

                          I don't know anything about comparative assembly, but one thought is how about aligning a subset of the reads with something much more tolerant of errors/divergence (SSAHA2 can be set up to report all local alignments, or maybe just BLAST, but I don't know what's out there) to get a better sense of what is going on? Or can you consider doing de novo assembly? What sort of organism is this?
                          Last edited by Adrian_H; 08-05-2010, 08:43 PM.

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Advancing Precision Medicine for Rare Diseases in Children
                            by seqadmin




                            Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                            12-16-2024, 07:57 AM
                          • seqadmin
                            Recent Advances in Sequencing Technologies
                            by seqadmin



                            Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                            Long-Read Sequencing
                            Long-read sequencing has seen remarkable advancements,...
                            12-02-2024, 01:49 PM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 12-17-2024, 10:28 AM
                          0 responses
                          33 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 12-13-2024, 08:24 AM
                          0 responses
                          49 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 12-12-2024, 07:41 AM
                          0 responses
                          34 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 12-11-2024, 07:45 AM
                          0 responses
                          46 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X