Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Comparative genome analysis (using illumina reads)

    Hi all,

    I'm involved in a project involving SNP detection for various candidate genes. I've started by using Maq to align my reads using a mismatch value of 3. For some reason, I've only been able to align around 20% of the reads. I'm wondering if this has something to do with:

    1) The format of the quality scores. I'm assuming they are something like solexa, but I just want to double check that this is what Maq "wants". Here is an example before fastq conversion.

    @7:1:22:390#0/1
    AAATACGATGTAGAAACCACATATTTTGAAACAATATGCAACAACAAACTGTGAATTAAATCAACGCATATGAAA
    +7:1:22:390#0/1
    ab``\HRZ`\T]`]]aOR_[]][^Q_SZP][YP\W\VWU\LS\Y]_]UUPVNMN]TQYY\MHN^`BBBBBBBBBB


    2) The actual quality of the reads. A quick glance at my reads file shows a lot of Bs on the 3' end. I joined this project after the reads were obtained, so I know little about the sample quality. Is there a way to determine average quality?

    Overall, I estimate close to 24x coverage. I know Maq should be able to deal with low quality scores, but I'm wondering if this is one of those special cases requiring trimming. Short of acquiring new reads, is there an appropriate way to approach this issue?
    Last edited by R diggity; 07-23-2010, 07:06 AM.

  • #2
    You can use something like PIQA http://bioinfo.uh.edu/PIQA/ to generate boxplots showing the average read quality. The low quality at the end of the reads is not unusual, and you might well get better results after trimming the sequences back - it's definitely worth a try. Maq can do this itself - just set the sequence length to something shorter with the -1 option, or you can use a program like fastx_trimmer (http://hannonlab.cshl.edu/fastx_toolkit/ )

    Comment


    • #3
      Thanks for directing me towards PIQA, that's exactly the sort of package I was looking for. I can't seem to get it to work, but for now I'm using Maq's built-in read length feature to "shorten" reads. I assume this increases the possibility of false mapping, but it has increased my overall percentage mapped greatly (near 50%).
      Last edited by R diggity; 08-04-2010, 12:18 PM.

      Comment


      • #4
        Just looked at this message again - I think that your quality scores are in the newer Illumina FASTQ format. (This is different from the old Solexa format). Those B's at the end are Illumina's QC signal saying not to use that base (although I've certainly seen that those are sometimes still mappable)

        As far as I know, Maq expects qualities to be in Sanger Fastq encoding, which means that you should to convert them before using Maq. Some of the other aligners have built in support for adjusting the scales, but most still assume the Sanger/Phred FASTQ scale unless told otherwise.

        there are some little scripts to do the conversion here: http://seqanswers.com/forums/showthread.php?t=5210

        See the wikipedia article on fastq for the details on all the different formats: http://en.wikipedia.org/wiki/FASTQ_format
        Last edited by Adrian_H; 08-05-2010, 07:50 PM.

        Comment


        • #5
          Ah, thanks for the lead. It appears as though I do have Illumina 1.5+. I'm not certain I understand how to run the scripts to convert these reads. Can I place the code within an existing Maq script?

          Comment


          • #6
            MAQ will actually do this for you - look into maq sol2sanger. I believe it's just `maq sol2sanger old.fastq new.fq` (or something like that) - it's how I've converted the reads I use. Also, for nice read statistics, try out FastQC: http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/ It allows you to use a java GUI to view the statistics, or it will generate an HTML page for you on the command line.

            Comment


            • #7
              FastQC also looks like a great way to evaluate reads, thanks. Does the sol2sanger script properly convert 1.5 illumina formats? I've tested it out and had slightly better alignment, but I thought Solexa fastq used a different range of characters.

              Comment


              • #8
                I don't think sol2sanger does the 1.5 Illumina format.

                What I have been doing is just to pre-process the files using this script:

                Code:
                #!/usr/bin/perl
                
                use strict;
                use warnings;
                
                my $count = 0;
                while (<>) {
                    chomp;
                    if ($count++ % 4 == 3) { tr/\x40-\xff\x00-\x3f/\x21-\xe0\x21/; }
                    print "$_\n";
                }
                Just save it as something like ill2sanger.pl, and then run:

                perl ill2sanger.pl < YourData_illumina.fastq.txt > YourData_sanger.fastq.txt
                Last edited by Adrian_H; 08-05-2010, 07:51 PM.

                Comment


                • #9
                  My scores went from something like
                  bbbbbbbbbbbbbbbbbbbb`bbbbbabbbbbbbabbbba_W]^`babTa]aaWaY]aa^bbR`b[][MW^^]ZX[_`
                  to
                  CCCCCCCCCCCCCCCCCCCCACCCCCBCCCCCCCBCCCCB@8>?ACBC5B>BB8B:>BB?CC3AC<><.8??>;9<@A

                  I'm 95% sure I used maq to do it.. I don't believe the perl script I tried once really worked. I use BWA to align, though, so definitely check how it works before going with it.

                  Comment


                  • #10
                    Thanks for explaining how to use the script! I used ill2sanger successfully.

                    Original read qualities:

                    \^][Iba[GTVIYXYUUZ\VY^IXa`\_TGRIIZY]RDFY\][BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

                    Converted read qualities:

                    =?><*CB<(57*:9:66;=7:?*9BA=@5(3**;:>3%':=><################################

                    Would it make more sense for the "#" characters to be "!" in Sanger format?

                    Edit: Also, does FastQC require a certain format for analysis?
                    Last edited by R diggity; 08-05-2010, 10:16 AM.

                    Comment


                    • #11
                      Yeah, I guess it would make some sense for the # to be a !, but I doubt that it makes much practical difference if you're using a cut-off of 10 or 15 for doing trimming.
                      Last edited by Adrian_H; 08-05-2010, 07:49 PM.

                      Comment


                      • #12
                        Thanks to everyone for the suggestions so far. I've learned more in the past week than when I started looking at this project several months ago.

                        So, after mapping reads with Maq under many different thresholds I seem to only get 30% mapped. FastQC makes me feel better about the reads (low duplication, average Phred qualities near 30). I'm fairly certain my problems stem from one of two issues:

                        1. Maq's mismatch (-n) has a maximum of 3 for shorter reads, but I'm using 75nt reads.
                        2. The species in question are highly diverged (~10my). I may need to rethink my entire approach.

                        Are there known assemblies between highly diverged species?

                        Comment


                        • #13
                          Oh, so there's no reference genome for the actual organism that you sequenced??

                          I don't know anything about comparative assembly, but one thought is how about aligning a subset of the reads with something much more tolerant of errors/divergence (SSAHA2 can be set up to report all local alignments, or maybe just BLAST, but I don't know what's out there) to get a better sense of what is going on? Or can you consider doing de novo assembly? What sort of organism is this?
                          Last edited by Adrian_H; 08-05-2010, 08:43 PM.

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Strategies for Sequencing Challenging Samples
                            by seqadmin


                            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                            03-22-2024, 06:39 AM
                          • seqadmin
                            Techniques and Challenges in Conservation Genomics
                            by seqadmin



                            The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                            Avian Conservation
                            Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                            03-08-2024, 10:41 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, Yesterday, 06:37 PM
                          0 responses
                          8 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, Yesterday, 06:07 PM
                          0 responses
                          8 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 03-22-2024, 10:03 AM
                          0 responses
                          49 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 03-21-2024, 07:32 AM
                          0 responses
                          67 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X