SEQanswers

Old 07-15-2008, 08:40 AM   #1
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482
Human reference sequence and tools like MAQ

How do people handle the human reference with these tools?

Do you align against each chromosome separately, or combine the FASTA sequences into one file and run that through maq (or your tool of choice)? I ran into this earlier when I used contigs and the headers were not easily parseable by the tools. Are there any standards on which chromosome files to use, or should I edit their headers before using them as a reference?

It is just taking me too long to manage the reference, and I am looking for a more efficient way to do it!
Old 07-15-2008, 09:32 AM   #2
wesb
Junior Member
 
Location: Bethesda, MD

Join Date: Jul 2008
Posts: 6
Reference

For Maq, I split the reference by chromosome and submit each Maq run for a given chromosome to a different node on my cluster. I do the same thing with Soap. Splitting the reference into 5 chunks also works for me for Soap runs. The cluster I was using couldn't handle Soap if I gave it all my reads and all the chromosomes at once; I ran into memory problems. It works fine if you split the reference into chunks, or split up your reads as well. I use the hg18 chromosomes from the UCSC Genome Browser; the headers are simply chr1, chr2, ...
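
In case it helps, this is roughly what my setup looks like as a script (a sketch only: it assumes the reads are already converted to reads.bfq, single-end, and an SGE-style qsub, so adjust for your scheduler):
Code:
# split the multi-FASTA reference into one file per chromosome (headers like ">chr1")
awk '/^>/ { f = substr($1, 2) ".fa" } { print > f }' hg18.fa

# one maq run per chromosome, each submitted to its own node
for c in chr*.fa; do
    p=${c%.fa}
    maq fasta2bfa "$c" "$p.bfa"                               # binary reference for maq
    echo "maq map $p.map $p.bfa reads.bfq" | qsub -cwd -N "maq_$p"
done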
__________________
Wes Beckstead
Predoctoral Fellow in Bioinformatics
Boston University Partnership with NIH
becksteadw@mail.nih.gov
Old 07-17-2008, 12:08 AM   #3
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693

For maq, one should not split the reference genome; otherwise the mapping qualities will be meaningless and you cannot run subsequent analyses such as mapmerge and assemble.

If you find maq too slow, you can split the reads into 2-million-read chunks. It takes 5-7 CPU hours to align one chunk to the human reference, which is not fast but acceptable.
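
A minimal sketch of that workflow (fastq2bfq, map and mapmerge are documented in the maq manual; the exact chunk file names that fastq2bfq writes depend on your maq version):
Code:
# convert the FASTQ to BFQ in 2-million-read chunks
maq fastq2bfq -n 2000000 reads.fastq reads        # e.g. reads@1.bfq, reads@2000001.bfq, ...

# align each chunk, then merge so the per-read mapping qualities stay usable
for b in reads@*.bfq; do
    maq map "aln_${b%.bfq}.map" hg18.bfa "$b"
done
maq mapmerge all.map aln_*.map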
Old 07-17-2008, 06:34 AM   #4
zee
NGS specialist
 
Location: Malaysia

Join Date: Apr 2008
Posts: 249

I agree with lh3 about not splitting the reference genome. With any tool, if you do this you'll need to combine the results and sort them post-alignment.

I've been using maq, soap, eland and novoalign. Eland is quite fast if you want exact, 1- or 2-mismatch alignments and you don't care about qualities. Building the index for eland is also very quick.

There is a script in the maq package that automates the chunked mapping. If you use novoalign, do some tidying of the FASTA headers in your genome first; it will save downstream perl/sed jobs. Indexing a full human genome on a server with at least 8 GB of RAM takes about 4-5 minutes.
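
For the header tidying, trimming each header down to its first whitespace-delimited token is usually enough (a sketch; adjust the pattern to your headers):
Code:
# e.g. ">chr1 some long description" becomes ">chr1"
sed 's/^>\([^[:space:]]*\).*/>\1/' hg18.fa > hg18.clean.fa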
Old 07-17-2008, 12:47 PM   #5
mukatira
Junior Member
 
Location: 38103

Join Date: Jun 2008
Posts: 2
Genome Viewer - Annotations

Hello,
I would like to know if any of you are able to view annotations along with the assemblies. Any information about which genome viewer would be best for such viewing would be appreciated.

I am looking at data from a whole-genome sequencing effort using 454 FLX.

Cheers

Sm
Old 07-21-2008, 12:26 AM   #6
zee
NGS specialist
 
Location: Malaysia

Join Date: Apr 2008
Posts: 249

Quote:
Originally Posted by mukatira View Post
Hello,
I would like to know if any of you are able to view annotations along with the assemblies. Any information about which genome viewer would be best for such viewing would be appreciated.

I am looking at data from a whole-genome sequencing effort using 454 FLX.

Cheers

Sm
Hi Mukatira,

I would recommend the UCSC genome browser or GBrowse; GBrowse will be much easier to set up. I would suggest that you align all your reads and calculate the read density to see which regions are significantly higher than others. Then you can load the density as wiggle tracks (a format developed by UCSC) into either of these genome browsers along with your read annotations.
For GBrowse you should use a mysql/postgresql back-end database, which performs better than the in-memory DBs.
Use DAS, or upload public annotation tracks (human genes, conservation, etc.) from the public browsers where available.
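
As a toy example of the density step (my own sketch, not from any package: it assumes a BED file of read alignments and counts read starts in 1 kb bins on chr1):
Code:
# emit a fixedStep wiggle track of read starts per 1 kb bin on chr1
awk -v bin=1000 '
    $1 == "chr1" { cnt[int($2 / bin)]++ }
    END {
        print "track type=wiggle_0 name=read_density"
        print "fixedStep chrom=chr1 start=1 step=" bin " span=" bin
        max = 0
        for (i in cnt) if (i + 0 > max) max = i + 0
        for (i = 0; i <= max; i++) print cnt[i] + 0   # zeros for empty bins
    }' reads.bed > chr1_density.wig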

Hope this helps.
Old 08-01-2008, 05:15 AM   #7
kmay
Member
 
Location: Munich, Germany

Join Date: Aug 2008
Posts: 29

Quote:
Originally Posted by mukatira View Post
Hello,
I would like to know if any of you are able to view annotations along with the assemblies. Any information about which genome viewer would be best for such viewing would be appreciated.

I am looking at data from a whole-genome sequencing effort using 454 FLX.

Cheers

Sm
Mukatira,

if you have your tags clustered and not more than 10 million regions, you can upload them into ElDorado with a mouse click in RegionMiner.

The downside is that it is not a public database. However, you can have unlimited free access for up to two weeks after registering, so think about the questions you want to ask beforehand and register then. After two weeks it will be gone...

With more than 10 million regions, you would require the GGA on site, which provides all data and programs locally.


Quote:
Split human reference genome
We developed our own mapping. We do not split the reference genome, for several good reasons. Most fast methods are hash-tree based, where a split would not make much sense. We pre-processed the genomes extensively, analysed them for the shortest unique information content, and load "the genome" into main memory in its entirety (32 GB of memory required). Then we map the reads: any tag length above 9 bp, with a configurable number of point mutations and indels allowed.

Pretty fast: 10 million SOLiD tags, perfect and unique matches, in less than 30 minutes. Increasing the number of allowed indels slows things down considerably, but it is usually done in a couple of hours.
For RNA-seq we use an additional reference: all potential splice junctions, pre-calculated.

Cheers

Klaus
Old 08-05-2008, 06:11 AM   #8
zee
NGS specialist
 
Location: Malaysia

Join Date: Apr 2008
Posts: 249

Hmmm, 10 million reads in 30 minutes. So you're doing that with one CPU? That's incredibly fast: about 5,555 reads/second. Whoa! I bet that indexing the genome rather than the reads means the performance scales linearly.

Quote:
Originally Posted by kmay View Post
We developed our own mapping. We do not split the reference genome, for several good reasons. Most fast methods are hash-tree based, where a split would not make much sense. We pre-processed the genomes extensively, analysed them for the shortest unique information content, and load "the genome" into main memory in its entirety (32 GB of memory required). Then we map the reads: any tag length above 9 bp, with a configurable number of point mutations and indels allowed.

Pretty fast: 10 million SOLiD tags, perfect and unique matches, in less than 30 minutes. Increasing the number of allowed indels slows things down considerably, but it is usually done in a couple of hours.
For RNA-seq we use an additional reference: all potential splice junctions, pre-calculated.
Old 08-05-2008, 07:16 AM   #9
kmay
Member
 
Location: Munich, Germany

Join Date: Aug 2008
Posts: 29

Quote:
Hmmm, 10 million reads in 30 minutes. So you're doing that with one CPU? That's incredibly fast: about 5,555 reads/second. Whoa! I bet that indexing the genome rather than the reads means the performance scales linearly.
Zee,

in fact a "best unique match" search is now even faster. See the Helicos forum for some more detail on what that means. I will post some benchmarks in a couple of weeks.

Yes, we do it on one CPU. In fact, my earlier posting was outdated: the GMS today comes with 64 GB of main memory. So... no desktop system.

Yes, it scales linearly.

Cheers

Klaus
Old 09-15-2008, 07:37 AM   #10
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482

Quote:
Originally Posted by zee View Post
I would recommend the UCSC genome browser or GBrowse; GBrowse will be much easier to set up. I would suggest that you align all your reads and calculate the read density to see which regions are significantly higher than others. Then you can load the density as wiggle tracks (a format developed by UCSC) into either of these genome browsers along with your read annotations.
Zee - is there more information on how one could go about setting up GBrowse as a visualization tool for aligned reads? Any conversion tools from an alignment, say a MAQ .map file, to .bed or wiggle that would be useful to look at?
Old 09-15-2008, 07:53 AM   #11
zee
NGS specialist
 
Location: Malaysia

Join Date: Apr 2008
Posts: 249

Hi,

It's not that difficult, actually.

First you should convert your alignments to General Feature Format (GFF). You should also have all your reference genome sequence definitions in GFF. There are some good templates around.

You will need a mysql/postgresql setup with database create/write/read permissions. A gbrowse.conf file for your database will also need to exist; a good example is http://ftp.hapmap.org/jimwatsonseque...wsequence.conf

1) Create your gbrowse database in mysql.
2) Use bp_load_gff.pl or bp_bulk_load_gff.pl to load the GFF files of your read alignments, chromosome names, genes, etc. (see the sketch below).
3) Make sure the source and method columns in each GFF file are specific enough to define each track. Have a look at the fgroup table in the gbrowse schema.
4) Edit the gbrowse.conf file to reflect your track names (method:source).
5) Ensure your gbrowse.conf has the correct username/hostname/password to connect to the database containing your data.

If all the data was loaded correctly, it should be straightforward to see the tracks. I usually edit an existing configuration file.
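
To make steps 1 and 2 concrete, the loading boils down to something like this (a rough sketch: the database name and GFF file names are placeholders, and your mysql setup may need --user/--password arguments):
Code:
# 1) create the gbrowse database
mysqladmin create gbrowse_hg18

# 2) bulk-load the GFF tracks (BioPerl loader; --create initialises the schema first)
bp_bulk_load_gff.pl -d gbrowse_hg18 --create chromosomes.gff genes.gff read_alignments.gff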

Some example files can be found at:

http://ftp.hapmap.org/jimwatsonseque...fo_hg18.gff.gz

http://ftp.hapmap.org/jimwatsonseque...ne_hg18.gff.gz


I have some scripts to do novoalign->gff conversion. For maq, you could take a look at maqview (from the maq page): it's quite a good browser and reads the .map file directly.

Quote:
Originally Posted by bioinfosm View Post
Zee - is there more information on how one could go about setting up GBrowse as a visualization tool for aligned reads? Any conversion tools from an alignment, say a MAQ .map file, to .bed or wiggle that would be useful to look at?
Old 06-09-2009, 01:41 AM   #12
Layla
Member
 
Location: London

Join Date: Sep 2008
Posts: 58
chunks

Quote:
Originally Posted by lh3 View Post
For maq, one should not split the reference genome; otherwise the mapping qualities will be meaningless and you cannot run subsequent analyses such as mapmerge and assemble.

If you find maq too slow, you can split the reads into 2-million-read chunks. It takes 5-7 CPU hours to align one chunk to the human reference, which is not fast but acceptable.
I know this thread is quite old, but it has been useful for me nonetheless. I have 20 million reads (10 million pairs) to align. If I split this file into 2-million-read chunks, do I need to ensure that the pairs stay together in each chunk, i.e. that each 2-million-read chunk holds 1 million pairs, or does this not matter?

Secondly, could you point me to the command within maq that does this splitting?

Thank you

L
Old 06-09-2009, 02:31 AM   #13
Jonathan
Member
 
Location: Germany

Join Date: Jun 2009
Posts: 36

From the maq manual:
Code:
maq fastq2bfq [-n nreads] in.read.fastq out.read.bfq|out.prefix

Convert reads in FASTQ format to Maq’s BFQ (binary FASTQ) format.

OPTIONS:
-n INT 	number of reads per file [not specified]
As the read files for both ends are converted separately, I guess you won't do anything wrong if you use the same -n value for both end files; the k-th chunk of end 1 will then pair with the k-th chunk of end 2.

Follow-ups: run a maq map step for each of the .bfq parts (using the PE-corresponding part files, of course), followed by

maq mapmerge out.aln.map in.aln1.map in.aln2.map [...]

See the maq documentation for this, too, and the sketch below.
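
For example (a sketch; it assumes both end files were converted with the same -n so the chunk names line up, and you should check the exact names your maq version writes):
Code:
for b1 in reads1@*.bfq; do
    b2=${b1/reads1/reads2}                    # the PE-corresponding mate-2 chunk
    maq map "aln_${b1%.bfq}.map" hg18.bfa "$b1" "$b2"
done
maq mapmerge all.map aln_*.map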

Best
-Jonathan
Old 06-09-2009, 02:32 AM   #14
zee
NGS specialist
 
Location: Malaysia

Join Date: Apr 2008
Posts: 249

Layla,

You should have each member of the pair in a separate file. In other words, whatever splitting you do needs to be applied to both of these files, one containing mate 1 and the other mate 2.
The maq.pl easyrun command should take care of the process. I would recommend splitting into chunks of 2 million reads, so it should take 5 iterations per file.

Assuming you have one side in reads1.fq and the other in reads2.fq, something like:

maq.pl easyrun db.bfa reads1.fq reads2.fq

should work fine.

Quote:
Originally Posted by Layla View Post
I know this thread is quite old, but it has been useful for me nonetheless. I have 20 million reads (10 million pairs) to align. If I split this file into 2-million-read chunks, do I need to ensure that the pairs stay together in each chunk, i.e. that each 2-million-read chunk holds 1 million pairs, or does this not matter?

Secondly, could you point me to the command within maq that does this splitting?
Old 06-12-2009, 07:01 AM   #15
Layla
Member
 
Location: London

Join Date: Sep 2008
Posts: 58

Thank you Jonathan and Zee,

I guess this should be good enough: split both files into 2-million-read chunks and then run the following command 5 times:
./maq map out.map all_chrom.bfa 2million_5prime.bfq 2million_3prime.bfq

Any idea how long this should take on a 32 GB Mac OS X machine? Running the above command for all 20 million reads in one go took 7 days!

Whilst on this subject, is anyone able to guide me with a specific data-filtering issue before I use Batman?

Cheers again

L