SEQanswers

Old 12-06-2012, 05:37 PM   #21
rama
Member
 
Location: Boston, USA

Join Date: Jan 2011
Posts: 20
Default

Laura,

What should I specify if I don't have a particular region to look at and want to get all genome-wide variants?

Thanks so much for your kind help.
Old 12-06-2012, 11:27 PM   #22
laura
Senior Member
 
Location: Cambridge UK

Join Date: Sep 2008
Posts: 151
Default

You can give tabix a whole chromosome, but be aware that tabix cannot cope with lost network connectivity. When streaming large data volumes, a dropped connection can mean lost data, so you may need to download the whole file.

http://www.biostars.org/p/50752/
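
If you do download the whole file, one way to catch a partial download before pointing tabix at it is to test the gzip stream. This is only a minimal sketch: `check_gzip_intact` is a made-up helper name, and it assumes the file is already on local disk. bgzip output is valid gzip, so `gzip -t` can read it.

```shell
#!/usr/bin/env bash
# Sketch: report whether a downloaded .vcf.gz decompresses cleanly.
# check_gzip_intact is a hypothetical helper, not part of tabix or vcftools.
check_gzip_intact() {
    # gzip -t reads the whole file and fails on a truncated or corrupt stream.
    if gzip -t "$1" 2>/dev/null; then
        echo "ok: $1"
    else
        echo "incomplete or corrupt: $1" >&2
        return 1
    fi
}
```

A file that happens to be cut exactly at a compression block boundary could still pass this test, so it is worth comparing the file size against the FTP listing as well.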
Old 12-07-2012, 07:46 AM   #23
gsgs
Senior Member
 
Location: germany

Join Date: Oct 2009
Posts: 140
Default

There is another document of 113 pages, the "supplemental information":
http://www.nature.com/nature/journal...re11632-s1.pdf

with a reference:

...
38. Garrison, E. K. vcflib: a simple C++ library for parsing and manipulating VCF files, <https://github.com/ekg/vcflib> (2012).

pointing back to:

http://www.1000genomes.org/wiki/Anal...mat-version-41

which is 19 pages

Last edited by gsgs; 12-07-2012 at 08:15 AM.
Old 12-07-2012, 12:36 PM   #24
rama
Member
 
Location: Boston, USA

Join Date: Jan 2011
Posts: 20
Default

Laura,

I tried downloading both the vcf.gz and tbi files, but it did not work, and the error is difficult to interpret. Can you see what I am doing wrong here?

./tabix -h /Volumes/Macintosh\ HD\ 3/1000Genome/ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz | perl ~/othertools/vcftools_0.1.10/perl/vcf-subset -c NA10851

[tabix] the index file exists. Please use '-f' to overwrite.
Broken VCF header, no column names?
at /Users/molpathuser1/othertools/vcftools_0.1.10/perl//Vcf.pm line 177
Vcf::throw('Vcf4_1=HASH(0x7fa9d982f8d8)', 'Broken VCF header, no column names?') called at /Users/molpathuser1/othertools/vcftools_0.1.10/perl//Vcf.pm line 869
VcfReader::_read_column_names('Vcf4_1=HASH(0x7fa9d982f8d8)') called at /Users/molpathuser1/othertools/vcftools_0.1.10/perl//Vcf.pm line 604
VcfReader::parse_header('Vcf4_1=HASH(0x7fa9d982f8d8)') called at /Users/molpathuser1/othertools/vcftools_0.1.10/perl/vcf-subset line 122
main::vcf_subset('HASH(0x7fa9d98288f0)') called at /Users/molpathuser1/othertools/vcftools_0.1.10/perl/vcf-subset line 12

many thanks for your kind help
Old 12-07-2012, 12:49 PM   #25
laura
Senior Member
 
Location: Cambridge UK

Join Date: Sep 2008
Posts: 151
Default

You need to give tabix some sort of chromosome name, otherwise it doesn't know what to fetch.

If you just want to filter the whole file, you will need to use zcat.

That being said, you downloaded the sites file, which contains no genotype info and no columns with individual genotypes to filter.
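
A quick way to tell a sites file from a genotypes file is to list the sample columns on the #CHROM header line. This is a rough sketch, assuming a tab-delimited VCF on standard input; `vcf_sample_names` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print one sample name per line from a VCF read on stdin.
# Columns 1-8 are the fixed VCF fields and column 9 is FORMAT, so sample
# columns start at 10. A sites-only file prints nothing.
vcf_sample_names() {
    awk -F'\t' '
        /^#CHROM/ {
            for (i = 10; i <= NF; i++) print $i
            exit    # the header line is all we need
        }
    '
}
```

For example, `gzip -dc ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz | vcf_sample_names` should print nothing, because that file has no genotype columns. Filtering a whole genotypes file then looks like `gzip -dc <genotypes>.vcf.gz | perl ~/othertools/vcftools_0.1.10/perl/vcf-subset -c NA10851`. Here `gzip -dc` does the same job as zcat; on some systems (older macOS among them) the stock zcat only accepts files ending in .Z.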
Old 12-12-2012, 03:58 PM   #26
rama
Member
 
Location: Boston, USA

Join Date: Jan 2011
Posts: 20
Default

Hi Laura,

I am still having trouble with extracting the variant calls of a specific sample.

As you pointed out earlier, I had downloaded the sites file, which has no genotype columns, so I have now fetched ALL.2of4intersection.20100804.genotypes.vcf.gz and its tbi file from the FTP site (release/20100804).

I used the following command to subset the VCF file:

tabix -fh /Volumes/Macintosh_HD_3/1000Genome/ALL.2of4intersection.20100804.genotypes.vcf.gz 1 | perl ~/othertools/vcftools_0.1.10/perl/vcf-subset -c NA10851 > NA10851/NA10851_chr1

Strangely, the output file has all the genotype columns. I have been following the directions given on the 1000 Genomes site, except that I don't give a range for the chromosome because I want all variants. So I then tried giving the coordinates (see below), and the result file contains only the header.

Here is the command I used:
tabix -fh /Volumes/Macintosh_HD_3/1000Genome/ALL.2of4intersection.20100804.genotypes.vcf.gz 2:1-243199373 | perl ~/othertools/vcftools_0.1.10/perl/vcf-subset -c NA10851 > NA10851/NA10851_chr2

So now I really don't know what I am doing wrong in trying to subset the VCF file. I really appreciate your kind help so far and would be very grateful if you could help me solve this.

thank you so much.
Rama
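
As an independent cross-check of what vcf-subset writes, the same column selection can be sketched in plain awk. This assumes a tab-delimited VCF with a #CHROM header line on standard input; `vcf_keep_sample` is a hypothetical helper, and it only selects columns (it does not touch INFO or drop any rows).

```shell
#!/usr/bin/env bash
# Sketch: keep the nine fixed VCF columns plus one named sample column.
# Usage: vcf_keep_sample SAMPLE < in.vcf > out.vcf
vcf_keep_sample() {
    awk -F'\t' -v OFS='\t' -v want="$1" '
        /^##/ { print; next }                     # meta lines pass through
        /^#CHROM/ {
            # find which column holds the requested sample
            for (i = 10; i <= NF; i++) if ($i == want) col = i
            if (!col) {
                print "sample not found: " want > "/dev/stderr"
                exit 1
            }
        }
        # header line and data lines: fixed columns plus the chosen sample
        col { print $1, $2, $3, $4, $5, $6, $7, $8, $9, $col }
    '
}
```

Something like `tabix -h <genotypes>.vcf.gz 2 | vcf_keep_sample NA10851 | grep -vc '^#'` then counts the records that came through; a count of 0 means the output holds only the header, which points at the region fetch rather than at the subsetting step.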
Old 12-13-2012, 01:17 AM   #27
laura
Senior Member
 
Location: Cambridge UK

Join Date: Sep 2008
Posts: 151
Default

Hello Rama,

Unfortunately I cannot reproduce your issue.

Using your command I get a VCF file which contains genotypes for NA10851 only.
Old 12-13-2012, 08:42 AM   #28
rama
Member
 
Location: Boston, USA

Join Date: Jan 2011
Posts: 20
Default

Thanks Laura, for trying it out.
Old 12-19-2012, 12:53 AM   #29
papori
Senior Member
 
Location: berd

Join Date: Dec 2010
Posts: 179
Default

Hi,
Sorry if this has already been asked; I couldn't find it.

I am trying to figure out whether I can search by read length.
I am looking for reads of length 101.
Is there a way to know this information before downloading?
I looked in the sequence.index but I didn't find it there.

Thanks in advance,
Pap
Old 02-25-2013, 06:44 PM   #30
southan
Member
 
Location: Oceania

Join Date: May 2011
Posts: 11
Default

I'm going to download BAM files from the Project.
I see two links:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/
and
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/data/

There are some overlapping files between the two links.

Which one should I use?

Thanks,
Old 02-26-2013, 02:33 AM   #31
laura
Senior Member
 
Location: Cambridge UK

Join Date: Sep 2008
Posts: 151
Default

These two data sets represent our most recent set of alignments (data/) and the frozen alignments used for the phase1 analysis effort (phase1/data/).

There will be overlapping individuals between the two sets, but no BAM files should be the same, as an extended version of GRCh37 is being used for the post-phase1 mapping.

See http://www.1000genomes.org/faq/which...bly-do-you-use

You should be able to tell the difference between these files by the YYYYMMDD in their name, as this points to the sequence index they were based on.
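
Pulling that date out of a file name can be scripted. This is a small sketch, assuming the date appears as a dot-delimited run of exactly eight digits; `bam_index_date` is a hypothetical helper and the file name in the usage line is invented for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print the YYYYMMDD stamp embedded in a BAM file name.
# Prints the last dot-delimited run of exactly eight digits, or nothing
# if the name has none.
bam_index_date() {
    basename "$1" | sed -nE 's/.*\.([0-9]{8})\..*/\1/p'
}
```

For instance, `bam_index_date SAMPLE.mapped.20120522.bam` prints 20120522, which can then be compared between the two directories.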
Old 03-15-2013, 12:52 PM   #32
jgibbons1
Senior Member
 
Location: Worcester, MA

Join Date: Oct 2009
Posts: 130
Default

@papori I had the same issue. You can download the "sequence.index" file from the FTP site (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/). In Excel, I ended up making a new column where I divided BASE_COUNT by READ_COUNT. You can then filter on the read length you are looking for.
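
The same calculation can be done without Excel. The sketch below assumes sequence.index is tab-delimited with a header row that names the READ_COUNT and BASE_COUNT columns; `reads_of_length` is a hypothetical helper name. The ratio is the mean read length for each row, so rows whose reads vary in length will not match an exact value.

```shell
#!/usr/bin/env bash
# Sketch: print the header plus every row whose BASE_COUNT / READ_COUNT
# equals the requested read length.
# Usage: reads_of_length sequence.index 101
reads_of_length() {
    awk -F'\t' -v want="$2" '
        # first line naming both columns is treated as the header
        !rc && /READ_COUNT/ && /BASE_COUNT/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "READ_COUNT") rc = i
                if ($i == "BASE_COUNT") bc = i
            }
            print
            next
        }
        # skip rows with a zero or non-numeric read count to avoid dividing by zero
        rc && bc && ($rc + 0) > 0 && $bc / $rc == want { print }
    ' "$1"
}
```

Locating the columns by name rather than by position keeps the sketch working if the column order differs between index releases.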
Old 12-04-2013, 10:07 AM   #33
Mokhtar
Junior Member
 
Location: Tunisia

Join Date: Dec 2013
Posts: 4
Default

Please, can anyone help me? How can I BLAST one FASTA file with more than 3000 sequences?
Old 12-05-2013, 01:28 AM   #34
laura
Senior Member
 
Location: Cambridge UK

Join Date: Sep 2008
Posts: 151
Default

Mokhtar, you would be better off creating a new thread for your question; this isn't really related to the 1000 Genomes Project.

If you let people know what your sequences are (DNA, cDNA, protein?) and what species you are working in, they will probably be able to offer better advice.
Old 12-05-2013, 02:36 AM   #35
Mokhtar
Junior Member
 
Location: Tunisia

Join Date: Dec 2013
Posts: 4
Default

Please, can anyone help me? How can I BLAST one FASTA file with more than 3000 DNA sequences generated from a fungal community?
Old 12-05-2013, 02:37 AM   #36
laura
Senior Member
 
Location: Cambridge UK

Join Date: Sep 2008
Posts: 151
Default

Start a new thread; this is not the right place for this question.
Old 04-15-2017, 09:48 AM   #37
lucasrocha
Junior Member
 
Location: Brasil

Join Date: Apr 2017
Posts: 2
Default A Fast Algorithm for the Inexact Characteristic String Problem - Doubt

Hi Guys,

Fine?

Has anyone read this article?
Https://mediatum.ub.tum.de/doc/1094391/1094391.pdf

I have a question about the ring buffer in this article.

On pages 11-12 the article explains that when the algorithm reaches lines 14 and 15, posErr has the value k+1. When I tested it, however, the value was k, and the algorithm takes on negative values for the ring buffer posErr.

Can someone who knows this problem help me understand how to work with this ring buffer, or am I analyzing it wrong?

Thanks
All times are GMT -8. The time now is 06:14 AM.