Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Alignment of resequenced population genome data to reference genome (A. lyrata)

    I am working with population specific re-sequenced genome data from A. lyrata. I need to align it to the reference genome which is already available and then prepare population specific sequence. I will also do variant calling but for now alignment of fragment to reference genome is what I am interested in.

    Could anyone suggest of a pipeline/apps combination that would be helpful for this purpose?

    Thanks,

  • #2
    I suggest BBMap for alignment. However, generating a haploid genome from a diploid organism by aligning to a reference is nontrivial; I don't know of a good pipeline for it. Generally, I recommend calling variations from the aligned reads, discarding all variations that are not homozygous and high-confidence, applying those to the reference, and reiterating until there are no further changes.

    Comment


    • #3
      Thanks Brian. I have genome reseq data from two population (6 individuals per population). So, i hope it should be enough for preparing population specific genome. I will follow your instructions and see what I find.

      First, I need to get BBmap running which I am trying to do since two days. i am not very good at running programs via command and I think that my limitation. i will follow up with you if I have additional questions.

      Comment


      • #4
        BBMap should be very easy to run, so if you are having trouble, please tell me!

        Particularly, what operating system are you using? It's easiest to run in Linux, but any OS will work. A.lyrata is pretty small, at ~207Mbp, so you shouldn't have any memory problems even on a laptop.

        Comment


        • #5
          My laptop is window 7 (64 bit) but I have also install VMware and loaded Ubuntu on it. BBmap was suggest by one another person on this forum but running it on both linux and windows didn't quite work for me. I am not sure why. I am going to give it a try tomorrow once more and will report you the problem. Also, I need some examples to work with (or say documentation) that would make my analyses and interpretation of results more clear, and I am not finding that either.

          Hopefully you will be able to help.

          Comment


          • #6
            Originally posted by Brian Bushnell View Post
            I suggest BBMap for alignment. However, generating a haploid genome from a diploid organism by aligning to a reference is nontrivial; I don't know of a good pipeline for it. Generally, I recommend calling variations from the aligned reads, discarding all variations that are not homozygous and high-confidence, applying those to the reference, and reiterating until there are no further changes.
            Hi Brian,
            I am visiting bbmap after a while again and came across this post I made. While I have been successful at doing my analyses at some level I still need to figure out a lot.

            Regarding reiteration until there are no further changes - I think you meant making a haploid genome for that populations and again aligning the reads to this haploid until no new variants are discovered, rite? If yes, how do you account for multiple variations found at one particular loci; since I have 6 population samples I may/will have different homozygous polymorphism at one loci. How do you weigh at selecting one polymorphisms vs. the other one?

            Also, could you let me let me know if there are specific guidelines to filter variants after mapping.

            Thanks,

            Comment


            • #7
              There was small problem when I prepared the index of lyrata genome, which I didn't notice until now.

              A. lyrata genome until now has above a thousand scaffolds, so several have not a specified chromosome number. The genome I a using is in this link ftp://ftp.ensemblgenomes.org/pub/pla...is_lyrata/dna/ and named "Arabidopsis_lyrata.v.1.0.30.dna_sm.toplevel.fa.gz".

              The ones that are named 1-8 actually represent chromosome # 1 through 8 and don't have "chr" tag infront of it. I think the scaffold_9 is a mitochondrial genome while scaffold_10 is a chloroplast genome. The remaining scaffolds have not been specified a particular chr number.

              When I indexed the genome using bbmap.sh ref=lyrata_genome.fa
              the index was built, but the index contains only one chromosome and is designated chr#1, while the actual scaffold and chromosome were assigned under the "id" column. Also, I see that the number of scaffolds reported in the index is 695; did this happen because the maximum number chromosome was exceeded? and it automatically assigned the 1000s scaffold under the ID column. Sorry, if I am making any wrong assumptions.


              A part of the scaffolds.txt is reported below:
              #chrom id start length name
              1 1 8000 1005 scaffold_1118 dna_sm:scaffold scaffold:JGI8X:scaffold_1118:1:1005:1
              1 2 9305 1017 scaffold_1117 dna_sm:scaffold scaffold:JGI8X:scaffold_1117:1:1017:1
              1 3 10622 1032 scaffold_1116 dna_sm:scaffold scaffold:JGI8X:scaffold_1116:1:1032:1
              1 4 11954 1037 scaffold_1113 dna_sm:scaffold scaffold:JGI8X:scaffold_1113:1:1037:1
              1 5 13291 1047 scaffold_1112 dna_sm:scaffold scaffold:JGI8X:scaffold_1112:1:1047:1

              Summary.txt is reported:

              #Summary
              #Generated on Mon Feb 15 12:38:16 EST 2016
              #Version 5
              chroms 1
              bases 206892135
              defined 183778159
              undefined 23113976
              contigs 3648
              scaffolds 695
              interpad 300
              source /media/everestial007/Seagate-ExtHDD/DATA_analyses/genomeDATAanalyses/bbmap_alignment/lyrataINDEX/lyrata_genome.fa
              bytes 210159440
              last modified 1446233302000

              I don't want to remove scaffolds that don't have a known assembly positions as it will impact the scores of the aligned sequence.

              Could you please help me with this issue?
              Last edited by everestial; 02-15-2016, 01:49 PM. Reason: typo

              Comment


              • #8
                Originally posted by everestial View Post
                Regarding reiteration until there are no further changes - I think you meant making a haploid genome for that populations and again aligning the reads to this haploid until no new variants are discovered, rite? If yes, how do you account for multiple variations found at one particular loci; since I have 6 population samples I may/will have different homozygous polymorphism at one loci. How do you weigh at selecting one polymorphisms vs. the other one?
                Where there are polymorphisms, you can either go with the ref allele, or the most common allele, or use an N to indicate a degenerate base. None of those 3 is necessarily the "correct" choice.

                Also, could you let me let me know if there are specific guidelines to filter variants after mapping.
                If you are making a haploid consensus, then calling variants with a >50% allele fraction makes sense to me. Anything beyond that would be specific to the variant caller.

                There was small problem when I prepared the index of lyrata genome, which I didn't notice until now.

                A. lyrata genome until now has above a thousand scaffolds, so several have not a specified chromosome number. The genome I a using is in this link ftp://ftp.ensemblgenomes.org/pub/pla...is_lyrata/dna/ and named "Arabidopsis_lyrata.v.1.0.30.dna_sm.toplevel.fa.gz".

                The ones that are named 1-8 actually represent chromosome # 1 through 8 and don't have "chr" tag infront of it. I think the scaffold_9 is a mitochondrial genome while scaffold_10 is a chloroplast genome. The remaining scaffolds have not been specified a particular chr number.

                When I indexed the genome using bbmap.sh ref=lyrata_genome.fa
                the index was built, but the index contains only one chromosome and is designated chr#1, while the actual scaffold and chromosome were assigned under the "id" column. Also, I see that the number of scaffolds reported in the index is 695; did this happen because the maximum number chromosome was exceeded? and it automatically assigned the 1000s scaffold under the ID column. Sorry, if I am making any wrong assumptions.
                Don't worry about BBMap's internal coordinate system, which squishes all the sequences together into something it calls a chromosome for legacy reasons. All the scaffolds will still be present unless some were below the lower length limit (150bp or something like that by default). BBMap supports billions of scaffolds and a real genome won't go over the limit in practice. I just checked with stats.sh and indeed that file has 695 scaffolds:

                Code:
                A       C       G       T       N       IUPAC   Other   GC      GC_stdev
                0.3195  0.1805  0.1804  0.3196  0.1108  0.0000  0.0000  0.3609  0.0440
                
                Main genome scaffold total:             695
                Main genome contig total:               3648
                Main genome scaffold sequence total:    206.668 MB
                Main genome contig sequence total:      183.778 MB      11.076% gap
                Main genome scaffold N/L50:             4/24.465 MB
                Main genome contig N/L50:               247/227.391 KB
                Max scaffold length:                    33.133 MB
                Max contig length:                      1.188 MB
                Number of scaffolds > 50 KB:            49
                % main genome in scaffolds > 50 KB:     97.23%
                
                
                Minimum         Number          Number          Total           Total           Scaffold
                Scaffold        of              of              Scaffold        Contig          Contig
                Length          Scaffolds       Contigs         Length          Length          Coverage
                --------        --------------  --------------  --------------  --------------  --------
                    All                    695           3,648     206,667,935     183,778,159    88.92%
                   1 KB                    695           3,648     206,667,935     183,778,159    88.92%
                 2.5 KB                    675           3,628     206,641,073     183,751,297    88.92%
                   5 KB                    464           3,301     205,766,616     182,948,721    88.91%
                  10 KB                    235           2,905     204,152,087     181,412,911    88.86%
                  25 KB                     72           2,558     201,790,437     179,128,000    88.77%
                  50 KB                     49           2,471     200,933,512     178,561,895    88.87%
                 100 KB                     23           2,299     199,189,015     177,284,754    89.00%
                 250 KB                     14           2,199     197,897,279     176,426,941    89.15%
                 500 KB                      9           2,051     196,089,052     175,118,424    89.31%
                   1 MB                      9           2,051     196,089,052     175,118,424    89.31%
                 2.5 MB                      8           2,021     194,182,311     173,245,969    89.22%
                   5 MB                      8           2,021     194,182,311     173,245,969    89.22%
                  10 MB                      8           2,021     194,182,311     173,245,969    89.22%
                  25 MB                      2             434      58,246,127      52,399,375    89.96%

                Comment


                • #9
                  Thanks for clarification Brian !

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Techniques and Challenges in Conservation Genomics
                    by seqadmin



                    The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                    Avian Conservation
                    Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                    03-08-2024, 10:41 AM
                  • seqadmin
                    The Impact of AI in Genomic Medicine
                    by seqadmin



                    Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                    02-26-2024, 02:07 PM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, 03-14-2024, 06:13 AM
                  0 responses
                  34 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-08-2024, 08:03 AM
                  0 responses
                  72 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-07-2024, 08:13 AM
                  0 responses
                  81 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-06-2024, 09:51 AM
                  0 responses
                  68 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X