#1
Senior Member
Location: USA | Join Date: Jan 2008
Posts: 482
How do people use the human reference genome?

Do you use each chromosome separately, or combine the FASTA sequences and run them through maq or your tool of choice? I ran into this issue earlier when I used contigs and the headers were not easily parseable by tools. Are there any standards on which chromosomes to use, and should their headers be edited before using them as a reference? Managing the reference is just taking too long and I am looking for a more efficient way to do it!
#2
Junior Member
Location: Bethesda, MD | Join Date: Jul 2008
Posts: 6
For Maq, I split the reference by chromosome and submit each Maq run for a given chromosome to a different node on my cluster. I do the same thing with Soap. Splitting the reference into 5 chunks for Soap runs also works for me. The cluster I was using couldn't handle Soap if I gave it all my reads and all the chromosomes at once; I ran into memory problems. It works fine if you split the reference into chunks, or split up your reads as well. I use the hg18 chromosomes from the UCSC Genome Browser; the headers are simply chr1, chr2, ...
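A minimal sketch of that per-chromosome workflow (assuming a multi-FASTA hg18 file, Maq on the PATH, and an SGE-style qsub scheduler; all file and job names here are illustrative):

Code:
#!/bin/sh
# Split a multi-FASTA reference into one file per chromosome
# (each record starts with a ">chrN" header).
awk '/^>/ { close(out); out = substr($1, 2) ".fa" } { print > out }' hg18.fa

# Build a Maq binary reference per chromosome and submit one mapping job each.
for chr in chr*.fa; do
    name=${chr%.fa}
    maq fasta2bfa "$chr" "$name.bfa"
    echo "maq map $name.map $name.bfa reads.bfq" | qsub -N "maq_$name"
done

Note that, as the next post points out, the per-chromosome maps cannot then be merged with mapmerge in any meaningful way.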
__________________
Wes Beckstead
Predoctoral Fellow in Bioinformatics
Boston University Partnership with NIH
becksteadw@mail.nih.gov
#3
Senior Member
Location: Boston | Join Date: Feb 2008
Posts: 693
For maq, one should not split the reference genome; otherwise the mapping quality will be useless and you cannot run subsequent analyses such as mapmerge and assemble.

If you feel maq is too slow, you can split the reads into 2-million-read chunks. It takes 5-7 CPU hours to align one chunk to the human reference, which is not fast but acceptable.
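A sketch of that split step (single-end reads; the prefix is illustrative, and the exact names of the output chunk files depend on your Maq version):

Code:
# Write the reads out as a series of 2-million-read BFQ chunks under
# the prefix "chunk"; each chunk is then mapped individually and the
# resulting .map files merged with maq mapmerge.
maq fastq2bfq -n 2000000 reads.fastq chunk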
#4
NGS specialist
Location: Malaysia | Join Date: Apr 2008
Posts: 249
I agree with lh3 about not splitting the reference genome. With any tool, if you do this you'll need to combine the results and sort them post-alignment.

I've been using maq, soap, eland and novoalign. Eland is quite fast if you want exact, 1- or 2-mismatch alignment and you don't care about quality. Building the index for eland is also very quick. There is a script in the maq package that will do the automated chunk mapping. If you use novoalign, do some tidying of the FASTA headers in your genome; it will save downstream perl/sed jobs, and indexing a full human genome on a server with at least 8 GB of RAM will take about 4-5 minutes.
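For the header tidying, a one-liner along these lines is usually enough (a sketch; it assumes the first whitespace-delimited token of each header is the name you want to keep):

Code:
# Truncate every FASTA header at the first whitespace, e.g.
# ">chr1 some description" becomes ">chr1".
sed '/^>/ s/[[:space:]].*$//' hg18.fa > hg18.clean.fa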
#5
Junior Member
Location: 38103 | Join Date: Jun 2008
Posts: 2
Hello,

I would like to know if any of you are able to view annotations along with the assemblies. Any information about which genome viewer would be best for such viewing would be appreciated. I am looking at data from a whole-genome sequencing effort using 454 FLX.

Cheers
Sm
#6
NGS specialist
Location: Malaysia | Join Date: Apr 2008
Posts: 249
Quote:

I would recommend the UCSC genome browser or GBrowse; GBrowse will be much easier to set up. I would suggest that you align all your reads and calculate the read density to see which regions are significantly higher than others. Then you can load the density as wiggle tracks (a format developed by UCSC) into either of these genome browsers along with your read annotations. For GBrowse you should use a MySQL/PostgreSQL back-end database, which performs better than the in-memory databases. Use DAS, or upload the public human gene, conservation, etc. annotation tracks from the public browsers if they are available. Hope this helps.
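As a sketch of the density step (assuming a Maq alignment in aln.map and the reference in hg18.bfa; the column layout of maq pileup may vary between versions, so check it before relying on this):

Code:
# Per-base read depth from the alignment, emitted as a variableStep
# wiggle track (position, depth) that UCSC or GBrowse can load.
maq pileup hg18.bfa aln.map | awk '
    $1 != chrom { chrom = $1; print "variableStep chrom=" chrom }
    $4 > 0     { print $2, $4 }'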
#7
Member
Location: Munich, Germany | Join Date: Aug 2008
Posts: 29
Quote:

If you have your tags clustered and no more than 10 million regions, you can upload them into ElDorado with a mouse click in RegionMiner. The downside is that it is not a public database. However, you can have unlimited free access for up to two weeks after registering, so think about the questions you want to ask beforehand and register then; after two weeks it will be gone... With more than 10 million regions, you would require the GGA on site, which delivers all data and programs locally.

Quote:

Pretty fast: 10 million SOLiD tags, perfect and unique, in less than 30 minutes. Increasing the number of allowed indels slows it down considerably, but it is usually done in a couple of hours. For RNA-seq we use an additional reference: all potential splice junctions, pre-calculated.

Cheers
Klaus
#8
NGS specialist
Location: Malaysia | Join Date: Apr 2008
Posts: 249
Hmmm, 10 million reads in 30 minutes, and you're doing that with 1 CPU? That's incredibly fast, about 5,555 reads/second. Whoa! I bet indexing the genome rather than the reads means that performance scales linearly.
#9
Member
Location: Munich, Germany | Join Date: Aug 2008
Posts: 29
Quote:

In fact, a "best unique match" search is now even faster; see the Helicos forum for some more detail on what that means. I will post some benchmarks in a couple of weeks. Yes, we do it on one CPU. In fact, my earlier posting was outdated: the GMS today comes with 64 GB of main memory, so... no desktop system. And yes, it scales linearly.

Cheers
Klaus
#10
Senior Member
Location: USA | Join Date: Jan 2008
Posts: 482
Quote:
#11
NGS specialist
Location: Malaysia | Join Date: Apr 2008
Posts: 249
Hi,

It's not that difficult actually. First you should convert your alignments to General Feature Format (GFF). You should also have all your reference genome sequence definitions in GFF. There are some good templates. You will need a MySQL/PostgreSQL setup with database create/write/read permissions, and a gbrowse.conf file for your database will also need to exist. A good example is http://ftp.hapmap.org/jimwatsonseque...wsequence.conf

1) Create your gbrowse database in MySQL.
2) Use bp_load_gff.pl or bp_bulk_load_gff.pl to load the GFF files of your read alignments, chromosome names, genes, etc. (a command sketch for these two steps follows below).
3) Make sure the source and method columns of each GFF file are specific enough to define each track; have a look at the fgroup table in the gbrowse schema.
4) Edit the gbrowse.conf file to reflect your track names (method:source).
5) Ensure your gbrowse.conf has the correct username/hostname/password to connect to the database containing your data.

If all the data was loaded correctly, it should be straightforward to see the tracks. I usually edit an existing configuration file. Some example files can be found at:
http://ftp.hapmap.org/jimwatsonseque...fo_hg18.gff.gz
http://ftp.hapmap.org/jimwatsonseque...ne_hg18.gff.gz

I have some scripts to do novoalign->GFF conversion. For maq you could probably take a look at maqview (from the maq page), because it's quite a good browser and reads the .map file directly.
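A sketch of steps 1 and 2 (assuming MySQL and the BioPerl GFF loaders that ship with GBrowse; the database name, credentials, and file names are illustrative, and the loader flags may vary by version):

Code:
# 1) Create the GBrowse database.
mysql -u gbuser -p -e 'CREATE DATABASE mygenome'

# 2) Bulk-load the annotation and alignment GFF files;
#    -c (re)creates the schema, -d names the target database.
bp_bulk_load_gff.pl -c -d mygenome \
    chromosomes.gff genes.gff read_alignments.gff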
#12
Member
Location: London | Join Date: Sep 2008
Posts: 58
Quote:

Secondly, are you able to refer me to the command that can do this splitting within maq?

Thank you
L
#13
Member
Location: Germany | Join Date: Jun 2009
Posts: 36
From the maq manual:

Code:
maq fastq2bfq [-n nreads] in.read.fastq out.read.bfq|out.prefix

Convert reads in FASTQ format to Maq's BFQ (binary FASTQ) format.

OPTIONS:
  -n INT   number of reads per file [not specified]

Follow-up: run the maq map step for all BFQ parts (using the PE-corresponding part files, of course), followed by

Code:
maq mapmerge out.aln.map in.aln1.map in.aln2.map [...]

See the maq documentation for this, too.

Best
-Jonathan
#14
NGS specialist
Location: Malaysia | Join Date: Apr 2008
Posts: 249
Layla,

You should have each member of the pair in a separate file. In other words, whatever splitting you do needs to be applied to each of the two files containing mate 1 and mate 2. The maq.pl easyrun command should take care of the whole process. I would recommend splitting into chunks of 2 million reads, so it should run 5 iterations. Assuming you have one side in reads1.fq and the other in reads2.fq, something like

Code:
maq.pl easyrun db.bfa reads1.fq reads2.fq

should work fine.
#15
Member
Location: London | Join Date: Sep 2008
Posts: 58
Thank you Jonathan and Zee,

I guess this should be good enough: split both files into 2-million-read chunks and then run the following command 5 times.

Code:
./maq match out.map all_chrom.bfa 2million_5prime.bfq 2million_3prime.bfq

Any idea how long this should take on a Mac with 32 GB of RAM? Running the above command for all 20 million reads in one go took 7 days!

While on this subject, is anyone able to guide me with a specific data-filtering issue before I use Batman?

Cheers again
L
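Spelled out as a loop, that plan looks something like this (a sketch: the chunk file names are illustrative and should match whatever maq fastq2bfq -n 2000000 produced from each mate file; maq map is the subcommand name used in the manual excerpt above):

Code:
# Map each pair of 2-million-read chunks, then merge the per-chunk maps.
for i in 1 2 3 4 5; do
    maq map chunk$i.map all_chrom.bfa 5prime_part$i.bfq 3prime_part$i.bfq
done
maq mapmerge all.map chunk1.map chunk2.map chunk3.map chunk4.map chunk5.map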