  • ReferenceMapper: Annotation files & known SNPs

    Dear All,

    I am currently analysing a 454 dataset using ReferenceMapper (part of the 454 package). I have captured and sequenced a contiguous 5Mb region of human chromosome 21. I have managed to map all my reads onto a reference sequence with ReferenceMapper. However, I have not managed to download and upload the right files containing the annotation and known SNPs which are crucial to data interpretation. I know the files have to be downloaded from UCSC's ftp server.

    I was wondering if someone can help me with that.

    Many thanks,
    N

  • #2
    Nigel,

    From the UCSC FTP directory ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/database you want the files refGene.txt.gz and snp130.txt.gz

    If you want to create files which contain only the region of interest (your 5 Mbp of Chr21) you could use the Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables) to define the table and region of interest then output to a file just this subset.
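
    As a scripted alternative to the Table Browser, the same subsetting can be sketched in Python. This is a minimal sketch, assuming the usual UCSC layout for snp130.txt.gz (tab-separated, with chrom, chromStart and chromEnd as the second, third and fourth columns, 0-based half-open coordinates). The function names are hypothetical. refGene.txt.gz orders its columns differently, so the column indices would need changing for that file.

```python
import gzip


def subset_region(lines, chrom, start, end, chrom_col=1, start_col=2, end_col=3):
    """Yield the rows that lie entirely inside chrom:[start, end).

    start and end are 0-based, half-open (the convention assumed here for
    UCSC database dumps). Column indices are assumptions and may need
    adjusting for tables other than the snp table.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[chrom_col] != chrom:
            continue
        row_start = int(fields[start_col])
        row_end = int(fields[end_col])
        if row_start >= start and row_end <= end:
            yield line


def subset_gz_file(path_in, path_out, chrom, start, end):
    """Hypothetical helper: filter a gzipped table into a plain-text file."""
    with gzip.open(path_in, "rt") as src, open(path_out, "w") as dst:
        dst.writelines(subset_region(src, chrom, start, end))
```

    For example, subset_gz_file("snp130.txt.gz", "snp130_region.txt", "chr21", region_start, region_end) would write only the rows inside the chosen window, where region_start and region_end are the 0-based bounds of the captured region.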

    • #3
      Thank you

      Thanks very much kmcarr. I'll give this a try.

      Cheers,
      N

      • #4
        ReferenceMapper Issues

        Hi kmcarr,

        I have downloaded the files and uploaded them into ReferenceMapper. When I include the 'Genome annotation' and 'known SNP' files, the information in the HCDiffs table maps to genes and SNPs on the wrong chromosome: chr2 instead of chr21. I figured I need to input a targeted region. When I type in the string chr21:****-*****, I get a blank HCDiffs file.

        Do you know what could be causing this? Your help is kindly appreciated.

        Cheers,
        N

        • #5
          Nigel,

          In the Results Files look at 454RefStatus.txt. What fraction of your reads map to Chr2 vs. Chr21?

          • #6
            454Refstatus?

            Hi kmcarr,
            Here's what the RefStatus file looks like (see below). Most of my reads map to chr21 (83.26%)! What could be causing this? Could it be because of the design of the reference FASTA file?

            File:/data2/RAYMOND/chr21/chr21_451_12/mapping/454RefStatus.txt
            Lines:3
            Modified:Thu Sep 24 18:53:57 BST 2009
            _________________________________________
            Reference Accession      Num Unique Matching Reads  Pct of All Unique Matches  Pct of All Reads  Pct Coverage of Reference  Description
            chr21:40999701-46944323  427445                     100.0%                     64.0%             83.26%

            Cheers,
            N.

            • #7
              Nigel,

              This information tells me that you were only attempting to map reads to a 5.9Mbp segment of Chr21 which you targeted. There would be no reads mapped to Chr2 since it was not included as part of the reference. Is this information from the mapping performed when you entered information into the targeted region field (i.e. chr21:40999701-46944323) or from a mapping where no targeted region was specified? What is the reference file you are providing to the gsMapper program; is it the entire human genome or is it a file with just your target region?

              BTW, the 83.26% is what fraction of the reference is covered by reads. You had 64% of your reads (427,445 of them) map to this 5.9Mbp region of chr21. This is what you would be hoping to see in a seq capture experiment.
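
              The figures in that 454RefStatus.txt line can be cross-checked with a little arithmetic. The sketch below assumes the region coordinates are 1-based and inclusive, and that "Pct of All Reads" is the mapped reads as a share of every read in the run. The derived totals are estimates, not values reported by the software.

```python
# Values copied from the 454RefStatus.txt line quoted in the thread.
region_start, region_end = 40_999_701, 46_944_323
mapped_reads = 427_445
pct_of_all_reads = 64.0   # share of all reads that mapped to the region
pct_coverage = 83.26      # share of the reference covered by at least one read

# Length of the reference, assuming 1-based inclusive coordinates.
region_length = region_end - region_start + 1

# Estimated total reads in the run, derived from the 64% figure.
total_reads = mapped_reads / (pct_of_all_reads / 100)

# Estimated number of reference bases covered by reads.
covered_bases = region_length * pct_coverage / 100

print(f"region length: {region_length:,} bp")
print(f"estimated total reads: {total_reads:,.0f}")
print(f"estimated covered bases: {covered_bases:,.0f}")
```

              Under those assumptions the region is about 5.9 Mbp, the run contained roughly 668,000 reads, and a little under 5 Mbp of the reference is covered.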
              Last edited by kmcarr; 09-29-2009, 05:02 AM.

              • #8
                Hi kmcarr,

                My reference sequence was chr21:40999701-46944323, not the entire genome. What puzzles me is that when I upload the annotation and known SNP, the HCDiffs table contains genes and SNPs that map to chr2. When I input chr21:40999701-46944323 in the target sequence slot to target the output, the HCDiffs table is blank!!

                What should I check?

                Many thanks,
                N.
                Last edited by Nigel; 09-29-2009, 05:19 AM.

                • #9
                  Analysing ReferenceMapper HCDiff File with Excel

                  Is it possible to export an HCDiff file from ReferenceMapper as an Excel file?

                  Cheers,
                  N

                  • #10
                    Originally posted by Nigel
                    My reference sequence was chr21:40999701-46944323, not the entire genome. What puzzles me is that when I upload the annotation and known SNP, the HCDiffs table contains genes and SNPs that map to chr2. When I input chr21:40999701-46944323 in the target sequence slot to target the output, the HCDiffs table is blank!!
                    There might be a name mismatch between your reference FASTA file and the reference annotation and SNP files. You say that your reference file only contains the sequence region of interest. What is the definition line of your reference FASTA file? This would be the very first line and should start with a ">"; spacing and punctuation are critical.
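
                    A quick way to test for such a mismatch is to compare the sequence names in the FASTA file with the chromosome names used in the annotation or SNP table. The sketch below is only an illustration: the function names are hypothetical, the chromosome column index is an assumption about the UCSC table layout, and whether ReferenceMapper matches names exactly this way is not confirmed in this thread.

```python
def fasta_sequence_names(lines):
    """Return the first whitespace-delimited token of every '>' definition line."""
    names = []
    for line in lines:
        if line.startswith(">") and line[1:].strip():
            names.append(line[1:].split()[0])
    return names


def names_missing_from_table(fasta_names, table_lines, chrom_col):
    """Return FASTA names that never appear in the table's chromosome column."""
    chroms = set()
    for line in table_lines:
        if line.strip():
            chroms.add(line.rstrip("\n").split("\t")[chrom_col])
    return [name for name in fasta_names if name not in chroms]


# Illustrative data: the FASTA name mirrors the accession shown in
# 454RefStatus.txt; the SNP row is made up for the example.
fasta = [">chr21:40999701-46944323\n", "ACGTACGT\n"]
snp_rows = ["585\tchr21\t41000000\t41000001\trsExample\n"]
print(names_missing_from_table(fasta_sequence_names(fasta), snp_rows, chrom_col=1))
```

                    Any name printed by the last line is a reference sequence that the table never mentions by that exact string, which would be consistent with annotations landing on the wrong sequence or not at all.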

                    Is it possible to export a HCDiff file from ReferenceMapper as an excel file?
                    I wish it were possible, but I have not found a way. You can, however, extract most of the information from the 454HCDiffs.txt file in a tab-delimited format that Excel can import:

                    Code:
                    # Keep only the summary lines (those starting with ">") and strip the leading ">"
                    grep '^>' 454HCDiffs.txt | sed -e 's/^>//' > HCDiffs_table.tsv
                    You can then open the HCDiffs_table.tsv file with Excel.

                    • #11
                      As far as I know, the mapper won't automatically offset coordinates in the snp file to reflect the subsequence you used in your mapping. You should extract the SNPs from the regions of interest from the snp file and re-coordinate them to match their actual coordinates in the fasta file you used. For example, if you are using chr 2 from 10,001-20,000 as your reference, extract those SNPs from the snp file and subtract 10,000 from every listed coordinate.
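
                      The extract-and-subtract step described above could be sketched as follows. This is a minimal sketch, assuming a UCSC-style SNP table with chrom, chromStart and chromEnd in the second, third and fourth columns and 0-based half-open coordinates. The function name is hypothetical, and the region bounds are given as 1-based inclusive positions, matching the 10,001-20,000 example.

```python
def recoordinate(lines, chrom, region_start, region_end,
                 chrom_col=1, start_col=2, end_col=3):
    """Keep rows inside chrom:region_start-region_end and shift their coordinates.

    region_start and region_end are 1-based and inclusive. Table coordinates
    are assumed to be 0-based, half-open, so the offset to subtract is
    region_start - 1 (10,000 for a region starting at 10,001).
    """
    offset = region_start - 1
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[chrom_col] != chrom:
            continue
        start = int(fields[start_col])
        end = int(fields[end_col])
        if start < offset or end > region_end:
            continue  # outside the subsequence used as the reference
        fields[start_col] = str(start - offset)
        fields[end_col] = str(end - offset)
        yield "\t".join(fields) + "\n"


# Made-up rows for illustration only.
rows = [
    "585\tchr2\t10000\t10001\trsA\n",   # first base of the region
    "585\tchr2\t19999\t20000\trsB\n",   # last base of the region
    "585\tchr2\t25000\t25001\trsC\n",   # outside the region
    "585\tchr21\t10500\t10501\trsD\n",  # different chromosome
]
shifted = list(recoordinate(rows, "chr2", 10_001, 20_000))
```

                      The chromosome name is left unchanged here. If the FASTA definition line uses a different name for the subsequence, that column would presumably need rewriting to match as well.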

                      • #12
                        I have done a similar experiment: capture arrays followed by 454 sequencing. I found that when I used only the capture region as the reference, a lot of reads from other, similar regions of the genome were aligned onto the reference despite multiple mismatches. I think I also need to tighten the minimum alignment identity, but it's a trade-off between excluding sequences from other regions and keeping sequences from the target region that are legitimately different. I am about to redo the alignment using the whole genome to see if that fixes it. Has anyone else had problems like this?

                        • #13
                          Using the whole genome instead of just the target region as the reference improves the alignment immensely. It also has the benefit of outputting the data with the actual chromosomal coordinates.

                          • #14
                            I have had the same experience. Lots of spurious reads (due to imperfect capture) are no longer forcibly aligned to a restricted region. However, whole-genome alignment of 454 data is very time- and resource-consuming!
                            --
                            bioinfosm

                            • #15
                              I did mine on the cluster supplied with the machine. On that computer the analysis took only about 4 hours for a full gasket (2 halves) against a 2.5 Gb genome. But even if it takes a week on a desktop, it is well worth it. My average read depth went from 24 to about 12, and looking at the difference with something like EagleView shows just how much better the alignment becomes.
