SEQanswers

11-25-2015, 11:19 PM   #1
gsgs
Senior Member
 
Location: germany

Join Date: Oct 2009
Posts: 140
Where to get aligned datasets?

I usually download the data from GenBank,

but it's tedious to align the sequences and filter out the possible errors,
incorrect insertions, or distant strains that just don't match well.

Others must have done the same thing ...

It would be useful to provide the aligned data to others,
so they needn't redo it.
But I couldn't find any. GenBank doesn't seem interested
in providing it, or in storing it and making other people's uploads available.

11-26-2015, 03:52 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077

What datasets are you referring to?

If you are looking for gene level pre-compiled alignments then "Homologene" is the place you want to visit. Here is an example: http://www.ncbi.nlm.nih.gov/homologene/?term=brca2

UCSC provides alignments. Look in the alignments section: http://hgdownload.soe.ucsc.edu/downloads.html#human

Ensembl also has similar information available: http://www.ensembl.org/info/website/...s/compara.html

Genome level alignments are also at Ensembl: http://www.ensembl.org/info/genome/c.../analyses.html

11-26-2015, 04:06 AM   #3
gsgs
Senior Member
 
Location: germany

Join Date: Oct 2009
Posts: 140

I'm mainly doing influenza sequencing.

So I need aligned datasets of ~10000 sequences of length 838-2280 nucleotides
for each of the 8 segments of avian influenza, with 16 different subtypes for the HA and 9 for the
NA, each of these probably divided into a Eurasian and a North American lineage.

Earlier I had human mitochondrial DNA here: 15000 sequences of length 16680.
I have also (occasionally) done Dengue (the 4 groups), Ebola, etc.
Today I was trying Helicobacter pylori ...

It's always the same problem: it takes hours to generate suitable aligned datasets.

11-26-2015, 04:16 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077

A search brought this up. You must have seen this already: http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html

Then there is http://www.fludb.org/brc/home.spg?decorator=influenza

For Mitochondria: http://www.ncbi.nlm.nih.gov/genome/organelle/

As you know firsthand, it takes time and effort to create meaningful MSAs. I am going to speculate that NCBI creates those for genes of model organisms/common genomes using the limited resources they have.

You should consider making your own alignments available since that would save someone else some frustration.

11-26-2015, 04:50 AM   #5
gsgs
Senior Member
 
Location: germany

Join Date: Oct 2009
Posts: 140

For influenza, I think the best approach is to download all ~400000 unaligned GenBank sequences
in FASTA format, which they provide in one file of ~650 MB.
But then you must filter by segment and group, align, sort, etc.
I do this regularly, ~1-2 times per year, for the ~130000 avian sequences,
producing 5+2+9+16 aligned files. It takes 10-20 hours.
If even one other person in the world is doing the same, sharing the results would save a lot of time.
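
The filtering step could be sketched roughly as below. This is only an illustration, not the procedure actually used in this thread: it assumes a hypothetical pipe-delimited header layout (`accession|strain|segment|subtype`) and hypothetical bucket names, so the parsing in `bucket_key` would need adapting to whatever the downloaded FASTA headers really contain. Segment 4 (HA) and segment 6 (NA) are split by subtype; the other segments get one bucket each.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def bucket_key(header: str) -> str:
    """Name the output bucket for one record.

    ASSUMPTION: headers look like accession|strain|segment|subtype.
    """
    fields = [f.strip() for f in header.split("|")]
    if len(fields) < 4:
        return "unparsed"
    segment, subtype = fields[2], fields[3].upper()
    match = re.fullmatch(r"(H\d+)(N\d+)", subtype)
    if segment == "4" and match:  # HA: one bucket per H subtype
        return "HA_" + match.group(1)
    if segment == "6" and match:  # NA: one bucket per N subtype
        return "NA_" + match.group(2)
    return "seg" + segment  # other segments: one bucket each


def group_records(records: Iterable[Record]) -> Dict[str, List[Record]]:
    """Collect records into buckets that can be aligned separately."""
    buckets = defaultdict(list)
    for header, seq in records:
        buckets[bucket_key(header)].append((header, seq))
    return dict(buckets)
```

With a file handle `fh` opened on the downloaded FASTA file, `group_records(read_fasta(fh))` would return one list of records per bucket, ready to be written out and passed to an aligner.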

Ideally you would have ~100 files of aligned sequences for the strains, each with an index.
The files would be sorted by best neighbor match, so you can extract and filter whatever you want from them.
flugenome.org did something like this, but it is no longer being updated:
http://www.flugenome.org/show_subtypes.php
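
One possible reading of "sorted by best neighbor match" is a greedy ordering in which each sequence is followed by the not-yet-placed sequence that differs from it in the fewest aligned columns. The sketch below assumes that reading, assumes the sequences are already aligned to equal length, and counts gaps as ordinary characters; the function names are made up for illustration. It compares every remaining sequence at each step, so for ~10000 sequences a faster implementation would be needed.

```python
from typing import Dict, List


def hamming(a: str, b: str) -> int:
    """Count the columns in which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def neighbor_order(seqs: Dict[str, str], start: str) -> List[str]:
    """Greedy ordering: always append the closest unplaced sequence.

    Ties are broken alphabetically by name so the result is reproducible.
    """
    order = [start]
    remaining = set(seqs) - {start}
    while remaining:
        last = seqs[order[-1]]
        nearest = min(sorted(remaining), key=lambda n: hamming(last, seqs[n]))
        order.append(nearest)
        remaining.remove(nearest)
    return order
```

The choice of `start` matters: a different starting sequence can give a different ordering, so a fixed reference strain is probably the sensible choice.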

Flu comes from birds. Whenever
it jumps to a new host you want to know where it came from
(for the genome as a whole and for each of the 8 segments separately), how it evolved,
and whether/where there is pandemic danger.

And then there are the human and swine sequences for special types, which I do less regularly:
when the flu season starts and there are new variants or the like.

I assume it's similar for other organisms: the data should be provided
in filtered, sorted, aligned form.

I could easily make my files available from my hard drive, but where should I put them so others will find them?
Best might be to send them on micro-SD.



What's MSA?

Last edited by gsgs; 11-26-2015 at 04:55 AM.

11-26-2015, 06:10 AM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077

MSA = Multiple sequence alignment

Doesn't NCBI let you do something similar to flugenome here (it is limited to 1000 genomes)? http://www.ncbi.nlm.nih.gov/genomes/...i?go=alignment

That said, I agree with you that the analysis you are doing would be a useful resource for the flu community. But since the number of people working on flu must be relatively small, can't you propose internally (at a relevant meeting/working group) that a resource such as this be created and then hosted by the group?

Or you could write to NCBI and the group that manages the flu database and see if they would be interested in presenting the data the way you are proposing.

11-26-2015, 08:22 AM   #7
gsgs
Senior Member
 
Location: germany

Join Date: Oct 2009
Posts: 140

It's not just the MSA. You must also remove or separate out errors, non-matches,
single-nucleotide insertions (==> probably errors), pseudo-recombinations, wrong segments,
wrong or missing strain classifications, and so on.
And then sort the sequences. And these are typically 10000 sequences.
It can be done, but it takes some time (or tedious automation ...).
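
One of these checks, spotting likely erroneous insertions, could be automated roughly as follows. The sketch assumes an alignment held as a name-to-sequence mapping with `-` as the gap character, and treats any column where exactly one sequence has a base (and all others have a gap) as a suspect insertion. That rule is a simplification chosen for illustration: it only makes sense for alignments with many sequences, and a flagged column may still be a real insertion.

```python
from collections import defaultdict
from typing import Dict, List


def private_insertions(alignment: Dict[str, str], gap: str = "-") -> Dict[str, List[int]]:
    """Map sequence name -> 0-based columns where only that sequence has a base."""
    if not alignment:
        return {}
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    length = lengths.pop()

    flagged = defaultdict(list)
    for col in range(length):
        # Names of the sequences that carry a non-gap character in this column.
        carriers = [name for name, seq in alignment.items() if seq[col] != gap]
        if len(carriers) == 1:
            flagged[carriers[0]].append(col)
    return dict(flagged)
```

Sequences that show up in the result can then be set aside for manual inspection, and the all-gap columns they leave behind dropped from the alignment.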

I've been talking with the GenBank flu expert by email since 2006.
They are not interested. GenBank flu has improved since
2006, though: more features, and more uniform (= computer-friendly).

I could upload it somewhere, but no one would find it.

The flu community may be small (and I'm not a member who goes to meetings, writes papers,
or is a paid professional), but this problem should apply to all sequencing in general.
It's just my amateur pandemic concern, which started with H5N1 in 2005.

They may have somehow "solved" it in the human community (?)

All times are GMT -8.