Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • R diggity
    Member
    • Jun 2010
    • 12

    Comparative genome analysis (using illumina reads)

    Hi all,

    I'm involved in a project involving SNP detection for various candidate genes. I've started by using Maq to align my reads using a mismatch value of 3. For some reason, I've only been able to align around 20% of the reads. I'm wondering if this has something to do with:

    1) The format of the quality scores. I'm assuming they are something like solexa, but I just want to double check that this is what Maq "wants". Here is an example before fastq conversion.

    @7:1:22:390#0/1
    AAATACGATGTAGAAACCACATATTTTGAAACAATATGCAACAACAAACTGTGAATTAAATCAACGCATATGAAA
    +7:1:22:390#0/1
    ab``\HRZ`\T]`]]aOR_[]][^Q_SZP][YP\W\VWU\LS\Y]_]UUPVNMN]TQYY\MHN^`BBBBBBBBBB


    2) The actual quality of the reads. A quick glance at my reads file shows a lot of Bs on the 3' end. I joined this project after the reads were obtained, so I know little about the sample quality. Is there a way to determine average quality?

    Overall, I estimate close to 24x coverage. I know Maq should be able to deal with low quality scores, but I'm wondering if this is one of those special cases requiring trimming. Short of acquiring new reads, is there an appropriate way to approach this issue?
    Last edited by R diggity; 07-23-2010, 07:06 AM.
  • Adrian_H
    Member
    • Feb 2010
    • 10

    #2
    You can use something like PIQA http://bioinfo.uh.edu/PIQA/ to generate boxplots showing the average read quality. The low quality at the end of the reads is not unusual, and you might well get better results after trimming the sequences back - it's definitely worth a try. Maq can do this itself - just set the sequence length to something shorter with the -1 option, or you can use a program like fastx_trimmer (http://hannonlab.cshl.edu/fastx_toolkit/ )

    Comment

    • R diggity
      Member
      • Jun 2010
      • 12

      #3
      Thanks for directing me towards PIQA, that's exactly the sort of package I was looking for. I can't seem to get it to work, but for now I'm using Maq's built-in read length feature to "shorten" reads. I assume this increases the possibility of false mapping, but it has increased my overall percentage mapped greatly (near 50%).
      Last edited by R diggity; 08-04-2010, 12:18 PM.

      Comment

      • Adrian_H
        Member
        • Feb 2010
        • 10

        #4
        Just looked at this message again - I think that your quality scores are in the newer Illumina FASTQ format. (This is different from the old Solexa format). Those B's at the end are Illumina's QC signal saying not to use that base (although I've certainly seen that those are sometimes still mappable)

        As far as I know, Maq expects qualities to be in Sanger Fastq encoding, which means that you should to convert them before using Maq. Some of the other aligners have built in support for adjusting the scales, but most still assume the Sanger/Phred FASTQ scale unless told otherwise.

        there are some little scripts to do the conversion here: http://seqanswers.com/forums/showthread.php?t=5210

        See the wikipedia article on fastq for the details on all the different formats: http://en.wikipedia.org/wiki/FASTQ_format
        Last edited by Adrian_H; 08-05-2010, 07:50 PM.

        Comment

        • R diggity
          Member
          • Jun 2010
          • 12

          #5
          Ah, thanks for the lead. It appears as though I do have Illumina 1.5+. I'm not certain I understand how to run the scripts to convert these reads. Can I place the code within an existing Maq script?

          Comment

          • raela
            Member
            • Apr 2010
            • 39

            #6
            MAQ will actually do this for you - look into maq sol2sanger. I believe it's just `maq sol2sanger old.fastq new.fq` (or something like that) - it's how I've converted the reads I use. Also, for nice read statistics, try out FastQC: http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/ It allows you to use a java GUI to view the statistics, or it will generate an HTML page for you on the command line.

            Comment

            • R diggity
              Member
              • Jun 2010
              • 12

              #7
              FastQC also looks like a great way to evaluate reads, thanks. Does the sol2sanger script properly convert 1.5 illumina formats? I've tested it out and had slightly better alignment, but I thought Solexa fastq used a different range of characters.

              Comment

              • Adrian_H
                Member
                • Feb 2010
                • 10

                #8
                I don't think sol2sanger does the 1.5 Illumina format.

                What I have been doing is just to pre-process the files using this script:

                Code:
                #!/usr/bin/perl
                
                use strict;
                use warnings;
                
                my $count = 0;
                while (<>) {
                    chomp;
                    if ($count++ % 4 == 3) { tr/\x40-\xff\x00-\x3f/\x21-\xe0\x21/; }
                    print "$_\n";
                }
                Just save it as something like ill2sanger.pl, and then run:

                perl ill2sanger.pl < YourData_illumina.fastq.txt > YourData_sanger.fastq.txt
                Last edited by Adrian_H; 08-05-2010, 07:51 PM.

                Comment

                • raela
                  Member
                  • Apr 2010
                  • 39

                  #9
                  My scores went from something like
                  bbbbbbbbbbbbbbbbbbbb`bbbbbabbbbbbbabbbba_W]^`babTa]aaWaY]aa^bbR`b[][MW^^]ZX[_`
                  to
                  CCCCCCCCCCCCCCCCCCCCACCCCCBCCCCCCCBCCCCB@8>?ACBC5B>BB8B:>BB?CC3AC<><.8??>;9<@A

                  I'm 95% sure I used maq to do it.. I don't believe the perl script I tried once really worked. I use BWA to align, though, so definitely check how it works before going with it.

                  Comment

                  • R diggity
                    Member
                    • Jun 2010
                    • 12

                    #10
                    Thanks for explaining how to use the script! I used ill2sanger successfully.

                    Original read qualities:

                    \^][Iba[GTVIYXYUUZ\VY^IXa`\_TGRIIZY]RDFY\][BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

                    Converted read qualities:

                    =?><*CB<(57*:9:66;=7:?*9BA=@5(3**;:>3%':=><################################

                    Would it make more sense for the "#" characters to be "!" in Sanger format?

                    Edit: Also, does FastQC require a certain format for analysis?
                    Last edited by R diggity; 08-05-2010, 10:16 AM.

                    Comment

                    • Adrian_H
                      Member
                      • Feb 2010
                      • 10

                      #11
                      Yeah, I guess it would make some sense for the # to be a !, but I doubt that it makes much practical difference if you're using a cut-off of 10 or 15 for doing trimming.
                      Last edited by Adrian_H; 08-05-2010, 07:49 PM.

                      Comment

                      • R diggity
                        Member
                        • Jun 2010
                        • 12

                        #12
                        Thanks to everyone for the suggestions so far. I've learned more in the past week than when I started looking at this project several months ago.

                        So, after mapping reads with Maq under many different thresholds I seem to only get 30% mapped. FastQC makes me feel better about the reads (low duplication, average Phred qualities near 30). I'm fairly certain my problems stem from one of two issues:

                        1. Maq's mismatch (-n) has a maximum of 3 for shorter reads, but I'm using 75nt reads.
                        2. The species in question are highly diverged (~10my). I may need to rethink my entire approach.

                        Are there known assemblies between highly diverged species?

                        Comment

                        • Adrian_H
                          Member
                          • Feb 2010
                          • 10

                          #13
                          Oh, so there's no reference genome for the actual organism that you sequenced??

                          I don't know anything about comparative assembly, but one thought is how about aligning a subset of the reads with something much more tolerant of errors/divergence (SSAHA2 can be set up to report all local alignments, or maybe just BLAST, but I don't know what's out there) to get a better sense of what is going on? Or can you consider doing de novo assembly? What sort of organism is this?
                          Last edited by Adrian_H; 08-05-2010, 08:43 PM.

                          Comment

                          Latest Articles

                          Collapse

                          • SEQadmin2
                            From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                            by SEQadmin2


                            Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                            The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                            ...
                            Yesterday, 10:05 AM
                          • SEQadmin2
                            Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                            by SEQadmin2


                            With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                            Introduction

                            Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                            05-22-2026, 06:42 AM
                          • SEQadmin2
                            Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                            by SEQadmin2

                            Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                            Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                            05-06-2026, 09:04 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by SEQadmin2, Yesterday, 12:03 PM
                          0 responses
                          19 views
                          0 reactions
                          Last Post SEQadmin2  
                          Started by SEQadmin2, Yesterday, 11:40 AM
                          0 responses
                          14 views
                          0 reactions
                          Last Post SEQadmin2  
                          Started by SEQadmin2, 05-28-2026, 11:40 AM
                          0 responses
                          29 views
                          0 reactions
                          Last Post SEQadmin2  
                          Started by SEQadmin2, 05-26-2026, 10:12 AM
                          0 responses
                          31 views
                          0 reactions
                          Last Post SEQadmin2  
                          Working...