  • Human reference sequence and tools like MAQ

    How do people use that?

    Do people use each chromosome separately, or combine the FASTA sequences and run them through maq or the tool of choice? I had this issue earlier when I used contigs and the headers were not easily parseable by tools. Are there any standards on which chromosome sequences to use, or on editing their headers before using them as a reference?

    It is just taking me too long to manage the reference, and I am looking for a more efficient way to do so!
    --
    bioinfosm

  • #2
    Reference

    For Maq, I split the reference by chromosome and submit each chromosome's Maq run to a different node on my cluster. With Soap, I do the same thing; splitting the reference into 5 chunks also works for Soap runs. The cluster I was using couldn't handle Soap if I gave it all my reads and all the chromosomes at once, and I ran into memory problems. It works fine if you split the reference into chunks, or split up your reads as well. I use the hg18 chromosomes from the UCSC Genome Browser; the headers are simply chr1, chr2, ...
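
    A minimal sketch of that per-chromosome setup (the chromosome list, file names, and the SGE-style qsub submission are my assumptions; adapt to your scheduler):
    Code:
    #!/bin/bash
    # Hypothetical sketch: one maq job per hg18 chromosome on an SGE-style cluster.
    # Assumes per-chromosome FASTAs from UCSC (chr1.fa ...) and reads already in reads.bfq.
    for CHR in chr{1..22} chrX chrY; do
        maq fasta2bfa $CHR.fa $CHR.bfa                      # build the binary reference once
        echo "maq map $CHR.map $CHR.bfa reads.bfq" | qsub -N maq_$CHR
    done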
    Wes Beckstead
    Predoctoral Fellow in Bioinformatics
    Boston University Partnership with NIH
    [email protected]



    • #3
      For maq, one should not split the reference genome; otherwise the mapping quality will be useless and you cannot run subsequent analyses such as mapmerge and assemble.

      If you feel maq is too slow, you can split the reads into 2-million-read chunks. It will take 5-7 CPU hours to align one chunk to the human reference, which is not fast but is acceptable.
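
      As a rough sketch of that chunked workflow (reads.fastq, hg18.bfa, and the reads@*.bfq part names are assumptions; the naming follows maq's out.prefix form for fastq2bfq -n, so verify against your version):
      Code:
      # Split the reads into 2-million-read chunks, align each chunk against the
      # whole reference, then merge so mapping qualities and assemble remain valid.
      maq fastq2bfq -n 2000000 reads.fastq reads
      for BFQ in reads@*.bfq; do
          maq map ${BFQ%.bfq}.map hg18.bfa $BFQ
      done
      maq mapmerge all.map reads@*.map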



      • #4
        I agree with lh3: don't split the reference genome. With any tool, if you do this you'll need to combine the results and sort them post-alignment.
        I've been using maq, soap, eland and novoalign. Eland's quite fast if you want exact, 1- or 2-mismatch alignments and you don't care about quality. Building the index for eland is also very quick.
        There is a script with the maq package that will do the automated chunk mapping. If you use novoalign, tidy up the FASTA headers in your genome first; it will save downstream perl/sed jobs. Indexing a full human genome on a server with at least 8 GB of RAM takes about 4-5 minutes (see the sketch below).
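
        A sketch of that novoalign indexing step (the file names are assumptions):
        Code:
        # Build a novoalign index from the combined hg18 FASTA.
        novoindex hg18.ndx hg18.fa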



        • #5
          Genome Viewer - Annotations

          Hello,
          I would like to know if any of you are able to view annotations along with the assemblies. Any information about which Genome Viewer would be best for such viewing will be appreciated.

          I am looking at data from a whole-genome sequencing effort using 454 FLX.

          Cheers

          Sm



          • #6
            Originally posted by mukatira View Post
            Hi Mukatira,

            I would recommend the UCSC genome browser or gbrowse; gbrowse will be much easier to set up. I would suggest that you align all your reads and calculate the read density, to see which regions are significantly higher than others. Then you can load the density as wiggle tracks (a format developed by UCSC) into either of these genome browsers, along with your read annotations.
            For Gbrowse you should use a mysql/postgresql back-end database, which performs better than the in-memory DBs.
            Use DAS, or upload public annotation tracks (human genes, conservation, etc.) from the public browsers where available.
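
            For reference, a minimal wiggle custom track looks like this (the track name and the 100 bp bins are made-up values):
            Code:
            track type=wiggle_0 name="read_density" description="maq read depth, 100bp bins"
            fixedStep chrom=chr1 start=1 step=100 span=100
            12
            15
            9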

            Hope this helps.



            • #7
              Originally posted by mukatira View Post
              Mukatira,

              if you have your tags clustered, and not more than 10 million regions, you can upload them into ElDorado with a mouse click in RegionMiner.

              The downside is that it is not a public database. However, you can have unlimited free access for up to two weeks after registering, so think about the questions you want to ask beforehand and register then. After two weeks it will be gone...

              With more than 10 million regions, you would require the GGA on site, which delivers all the data and programs locally.


              Split human reference genome
              We developed our own mapping. We do not split the reference genome, for several good reasons. Most fast methods are hash-tree based, where a split would not make much sense. We pre-processed the genomes extensively, analysed them for the shortest unique information content, and load "the genome" into main memory in its entirety (32 GB of memory required). Then we map the reads. Any tag length above 9 bp, and any number of point mutations and indels, is allowed.

              Pretty fast: 10 million SOLiD tags, perfect and unique, in less than 30 minutes. Increasing the number of allowed indels slows things down considerably, but it is usually done in a couple of hours.
              For RNA-seq we use an additional reference: all potential splice junctions, pre-calculated.

              Cheers

              Klaus



              • #8
                Hmmm, 10 million reads in 30 minutes. So you're doing that with 1 CPU? That's incredibly fast, about 5,555 reads/second. Whoah! I bet indexing the genome rather than the reads means that performance will scale linearly.




                • #9
                  Zee,

                  in fact a "best unique match" search is now is even faster. See the Helicos forum for some more detail on what that means. I will post some benchmarks in a couple of weeks.

                  Yes, we do it on one CPU. In fact, my earlier posting was outdated. The GMS has tody comes with 64 GB main memory. So... no desktop system


                  Yes, is scales linear.

                  Cheers

                  Klaus



                  • #10
                    Originally posted by zee View Post
                    Zee - is there more information on how one could go about setting up Gbrowse as a visualization tool for aligned reads? Any conversion tools from an alignment, say a MAQ .map file, to .bed or wiggle that would be useful to see?
                    --
                    bioinfosm



                    • #11
                      Hi,

                      It's not that difficult actually.

                      First you should convert your alignments to gene feature format (GFF). You should also have all your reference genome sequence definitions in GFF. There are some good templates around.

                      You will need a mysql/postgresql setup with database create/write/read permissions. A gbrowse.conf file for your database will also need to exist. A good example is http://ftp.hapmap.org/jimwatsonseque...wsequence.conf

                      1) Create your gbrowse database in Mysql.
                      2) Use bp_load_gff.pl or bp_bulk_load_gff.pl to load the GFF files of your read alignments, chromosome names, genes, etc. (see the sketch after this list).
                      3) Make sure the source and method columns in each GFF file are specific enough to define each track. Have a look at the fgroup table in the gbrowse schema.
                      4) Edit the gbrowse.conf file to reflect your track names (method:source).
                      5) Ensure your gbrowse.conf has the correct username/hostname/password to connect to the database containing your data.
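
                      A rough sketch of steps 1 and 2 (the database name, file names, and the exact GFF2 group-column syntax are assumptions; check your BioPerl/Bio::DB::GFF version):
                      Code:
                      # Step 1: create an empty MySQL database for gbrowse.
                      mysqladmin create gbrowse_reads
                      # Step 2: initialize the Bio::DB::GFF schema (-c) and bulk-load the GFF files.
                      bp_bulk_load_gff.pl -d gbrowse_reads -c chroms.gff read_alignments.gff genes.gff
                      # read_alignments.gff might contain lines like (method "match", source "maq"):
                      #   chr1  maq  match  10468  10503  37  +  .  Target read_0001 1 36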

                      If all the data was loaded correctly it should be straight forward to see the tracks. I usually edit an existing configuration file.

                      Some example files can be found on:
                      I have some scripts to do novoalign->gff conversion. For maq you could probably take a look at maqview (from the maq page) because it's quite a good browser and reads the .map file directly.




                      • #12
                        chunks

                        Originally posted by lh3 View Post
                        For maq, one should not split the reference genome; otherwise the mapping quality will be useless and you cannot run subsequent analyses such as mapmerge and assemble.

                        If you feel maq is too slow, you can split the reads into 2-million-read chunks. It will take 5-7 CPU hours to align one chunk to the human reference, which is not fast but is acceptable.
                        I know this thread is quite old, but it is useful for me nonetheless. I have 20 million reads (10 million pairs) to align. If I split these files into 2-million-read chunks, do I need to ensure that the pairs stay together in each chunk, i.e. that each 2-million-read chunk holds 1 million pairs, or does this not matter?

                        Secondly, are you able to refer me to the command that can do this splitting within maq?

                        Thank you

                        L



                        • #13
                          From the maq manual:
                          Code:
                          maq fastq2bfq [-n nreads] in.read.fastq out.read.bfq|out.prefix
                          
                          Convert reads in FASTQ format to Maq’s BFQ (binary FASTQ) format.
                          
                          OPTIONS:
                          -n INT 	number of reads per file [not specified]
                          As the read files for the two ends are converted separately, I guess you won't do anything wrong as long as you use the same -n value for both end files.

                          Follow-ups:
                          maq map steps for all bfq parts (using the corresponding PE part files, of course),
                          followed by
                          maq mapmerge out.aln.map in.aln1.map in.aln2.map [...]
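
                          Concretely, a sketch of that pairing (the part-file names assume maq's out.prefix naming for fastq2bfq -n; verify against your version):
                          Code:
                          maq fastq2bfq -n 2000000 reads1.fastq r1    # 5' reads -> [email protected], [email protected], ...
                          maq fastq2bfq -n 2000000 reads2.fastq r2    # 3' reads; the same -n keeps the parts in step
                          maq map aln1.map all_chrom.bfa [email protected] [email protected]
                          maq map aln2.map all_chrom.bfa [email protected] [email protected]
                          maq mapmerge out.aln.map aln1.map aln2.map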

                          See the maq documentation for this, too.

                          Best
                          -Jonathan



                          • #14
                            Layla,

                            You should have each member of the pair in a separate file. In other words, whatever splitting you do needs to be done on each of the files containing mate 1 and mate 2.
                            The maq.pl easyrun command should take care of the process. I would recommend splitting into chunks of 2 million reads, so it should do 5 iterations.

                            Assuming you have 1 side in reads1.fq and the other in reads2.fq

                            Something like:

                            maq.pl easyrun db.bfa reads1.fq reads2.fq


                            should work fine.
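
                            Here db.bfa would be the binary reference built once from the combined chromosome FASTA, roughly as below (file names assumed; note that some maq.pl versions accept the FASTA directly):
                            Code:
                            maq fasta2bfa all_chrom.fa db.bfa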




                            • #15
                              Thank you, Jonathan and Zee,

                              I guess this should be good enough: split both files into 2-million-read chunks and then run the following command 5 times:
                              ./maq match out.map all_chrom.bfa 2million_5prime.bfq 2million_3prime.bfq

                              Any idea how long this should take on a 32 GB Mac?

                              Running the above command for all 20 million reads in one go took 7 days!

                              Whilst on this subject, is anyone able to guide me with a specific data-filtering issue before I use Batman?

                              Cheers again

                              L
