  • Where to get aligned datasets?

    I usually download the data from GenBank,
    but it's tedious to align the sequences and filter out the possible errors,
    incorrect insertions, or distant, poorly matching strains.

    Others must have done the same thing ...

    It would be useful to provide the aligned data to others,
    so they needn't redo it.
    But I couldn't find any. GenBank doesn't seem interested
    in providing it, or in storing it and making other people's uploads available.

  • #2
    What datasets are you referring to?

    If you are looking for gene level pre-compiled alignments then "Homologene" is the place you want to visit. Here is an example: http://www.ncbi.nlm.nih.gov/homologene/?term=brca2

    UCSC provides alignments. Look in the alignments section: http://hgdownload.soe.ucsc.edu/downloads.html#human

    Ensembl also has similar information available: http://www.ensembl.org/info/website/...s/compara.html

    Genome level alignments are also at Ensembl: http://www.ensembl.org/info/genome/c.../analyses.html


    • #3
      I'm mainly doing influenza sequencing.

      So I need aligned datasets of ~10000 sequences of length 838-2280 nucleotides
      for avian influenza: the 8 segments, with 15 different strains for the HA and 9 for the
      NA, each of these probably divided into a Eurasian and a North American lineage.

      Earlier I had human mitochondrial DNA here, 15000 sequences of length 16680.
      I also (occasionally) did Dengue (the 4 groups), Ebola, etc.
      Today I was trying Helicobacter pylori ...

      It's always the same problem: it takes hours to generate suitable aligned datasets.


      • #4
        A search brought this up. You must have seen this already: http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html

        Then there is http://www.fludb.org/brc/home.spg?decorator=influenza

        For Mitochondria: http://www.ncbi.nlm.nih.gov/genome/organelle/

        As you know firsthand, it takes time and effort to create meaningful MSAs. I am going to speculate that NCBI creates those for genes of model organisms/common genomes using the limited resources they have.

        You should consider making your own alignments available since that would save someone else some frustration.


        • #5
          For influenza, I think the best approach is to download all the ~400000 unaligned GenBank sequences
          in FASTA format, which they provide in one file of ~650 MB.
          But then you must filter for segments and groups, align, sort, etc.
          I do this regularly, ~1-2 times per year, for the ~130000 avian sequences,
          producing 5+2+9+16 aligned files. It takes 10-20 hours.
          If only one person in the world had to do this ... it would save much time.
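
The filtering step described above could look roughly like the following. This is only a minimal sketch, not the workflow actually used in this thread: the function names, the length window, and the ambiguity threshold are all hypothetical, and the header format in the demo is made up for illustration. It reads FASTA records and keeps those whose length falls in a given window and which contain few non-ACGT characters.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def filter_records(records: Iterable[Record], min_len: int, max_len: int,
                   max_ambiguous: float = 0.01) -> List[Record]:
    """Keep records inside the length window with few non-ACGT characters.

    The thresholds are hypothetical; choose them per segment and dataset.
    """
    kept = []
    for header, seq in records:
        if not seq or not (min_len <= len(seq) <= max_len):
            continue
        ambiguous = sum(1 for base in seq if base not in "ACGT")
        if ambiguous / len(seq) > max_ambiguous:
            continue
        kept.append((header, seq))
    return kept


# Made-up demo data; real headers and lengths will differ.
demo = """>seqA segment=4
ACGTACGTAC
>seqB segment=4
acgtnnnnnn
>seqC segment=6
ACGT
""".splitlines()

kept = filter_records(read_fasta(demo), min_len=8, max_len=12, max_ambiguous=0.2)
```

With the demo data, only `seqA` survives: `seqB` has too many ambiguous bases and `seqC` is too short. For a real ~650 MB file you would pass an open file handle to `read_fasta` instead of a list, so records are streamed rather than loaded at once. The alignment itself would still be done by a dedicated aligner afterwards.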

          Ideally you would have ~100 files with aligned sequences for the strains, each with an index.
          The files would be sorted by best neighbor match. From these you can extract and filter whatever you want.
          flugenome.org did something like this, but it is no longer being updated.
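
One possible reading of "sorted by best neighbor match" is a greedy ordering in which each sequence is followed by the remaining sequence most similar to it. The sketch below assumes that interpretation, which may not be what the poster does, and uses plain Hamming distance on already aligned, equal-length sequences. The function names are hypothetical.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (name, aligned sequence)


def hamming(a: str, b: str) -> int:
    """Number of alignment columns in which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def neighbor_order(aligned: List[Record]) -> List[Record]:
    """Greedy ordering: start with the first record, then repeatedly
    append the remaining record closest to the one just placed."""
    if not aligned:
        return []
    ordered = [aligned[0]]
    remaining = list(aligned[1:])
    while remaining:
        last_seq = ordered[-1][1]
        best = min(range(len(remaining)),
                   key=lambda i: hamming(last_seq, remaining[i][1]))
        ordered.append(remaining.pop(best))
    return ordered


# Made-up toy alignment.
toy = [("a", "AAAA"), ("b", "TTTT"), ("c", "AAAT"), ("d", "TTTA")]
names = [name for name, _ in neighbor_order(toy)]
```

For the toy data the resulting order is `a, c, b, d`. The loop compares every placed sequence with all remaining ones, so the cost grows quadratically with the number of sequences. For ~10000 sequences in pure Python this would be slow, and a vectorized or compiled distance function would be the more realistic choice.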


          Flu comes from birds. Whenever
          it jumps to new hosts you want to know where it came from
          (for the genome and for each of the 8 segments separately), how it evolved,
          and whether/where there is pandemic danger.

          And then, less regularly, the human and swine sequences for special types,
          when the flu season starts and there are new variants or such.

          I assume it's similar for other organisms: the data should be provided
          in filtered, sorted, aligned form.

          I could easily make my files available from my HD. Where should I put them so others will find them?
          Best to send them on micro-SD.

          What's MSA?
          Last edited by gsgs; 11-26-2015, 05:55 AM.


          • #6
            MSA = Multiple sequence alignment

            Doesn't NCBI allow you to do something similar to flugenome here (it is limited to 1000 genomes)? http://www.ncbi.nlm.nih.gov/genomes/...i?go=alignment

            That said, I agree with you that the analysis you are doing would be a useful resource for the flu community. But since the number of people working on flu must be relatively small, can't you propose internally (at a relevant meeting/working group) that a resource such as this be created and then hosted by the group?

            Or you could write to NCBI and the group that manages the flu database and see if they would be interested in presenting the data the way you are proposing.


            • #7
              It's not just the MSA: you must remove/separate errors, nonmatches,
              single-nucleotide insertions (==> probably errors), pseudo-recombinations, wrong segments,
              wrong or missing strain classifications, and such.
              And then sort the sequences. And these are typically 10000 sequences.
              It can be done, but it takes some time (or tedious automation ...)
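
As an illustration of one of these cleanup steps, the sketch below flags alignment columns in which exactly one sequence has a base and every other sequence has a gap, which is how a single-nucleotide insertion usually shows up in an MSA. It is a hypothetical helper, not the poster's actual procedure. It assumes `-` as the gap character and only makes sense for alignments with many sequences.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (name, aligned sequence)


def singleton_insertions(aligned: List[Record], gap: str = "-") -> Dict[str, List[int]]:
    """Map sequence name -> 0-based columns where only that sequence is non-gap."""
    if not aligned:
        return {}
    width = len(aligned[0][1])
    if any(len(seq) != width for _, seq in aligned):
        raise ValueError("all sequences must have the same aligned length")
    flagged: Dict[str, List[int]] = {}
    for col in range(width):
        holders = [name for name, seq in aligned if seq[col] != gap]
        if len(holders) == 1:
            flagged.setdefault(holders[0], []).append(col)
    return flagged


# Made-up toy alignment: s3 carries an extra base in column 3.
toy_alignment = [
    ("s1", "ACG-TA"),
    ("s2", "ACG-TA"),
    ("s3", "ACGATA"),
]
suspects = singleton_insertions(toy_alignment)
```

Here `suspects` is `{"s3": [3]}`. Whether a flagged sequence is removed, corrected, or merely set aside remains a judgment call. A run of adjacent flagged columns in the same sequence would point to a longer insertion rather than a single-nucleotide one.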

              I've been talking with the GenBank flu expert by email since 2006.
              They are not interested. GenBank flu has improved since
              2006, though: more features, more uniform = more computer-friendly.

              I could upload it somewhere, but no one will find it.

              The flu community may be small (and I'm not a member who attends meetings, writes papers,
              or is a paid professional), but this problem should apply to all sequencing in general.
              It's just my amateur pandemic concern, which started with H5N1 in 2005.

              They may have somehow "solved" it in the human community (?)
