Seqanswers Leaderboard Ad

**GenoMax** · 04-29-2013, 11:01 AM

Not sure if you have looked at IGV for visualization. You should be able to use the FASTA file as a custom genome.

If you want to download the entire region that you have captured then look at BioMart (from Ensembl) or UCSC table browser. You may need to redo your alignments since the region co-ordinates/ID's will not match if you try to use this new file as the custom genome.

**aggp11** · 04-29-2013, 11:08 AM

Also, besides just aligning to the HLA region, you could also try aligning to the entire human genome (hg19). Hopefully this wouldn't lead to too many off-target alignment and you would be able to get annotations comparatively easily for the hg19. I agree with GenoMax in trying out IGV as it is really good for visualization. You could also try Golden Helix's GenomeBrowse (http://www.goldenhelix.com/GenomeBrowse/) for visualization as it has added functionality of having dbSNP, 1000 genomes and various other forms of annotated data.

**Jeremy** · 04-29-2013, 08:23 PM

I would recommend aligning to the whole human genome, if you don't want all the off-target alignments you can filter those off afterwards. Alignment programs try to align all reads to the given reference, when the read set covers much more than the given reference the alignment program can result in some pretty awful 'forced' alignments giving you rubbish data.

**ajthomas** · 05-03-2013, 12:52 PM

Thanks for the input so far. I've tried both IGV and GenomeBrowse and both seem to work fine once I figured out how to make a .bai file. Now I have another question. How do I calculate the coverage of the capture region and on-target percentage?

When I initially aligned just to a section of the chromosome that corresponded to the HLA region I had something like 70% coverage. Now, however, in IGV I loaded the .bed file that describes the capture baits, I can see that the baits don't cover all of the HLA, but there are lots of little gaps between the baits. The reads I have correlate very well with the baits, so I think my coverage should be much higher than that. Here's a screen shot from IGV showing the coverage and the .bed file tracks.

So, how can I get a coverage percentage that doesn't count all of those little spots that aren't covered by the capture baits?

**GenoMax** · 05-05-2013, 02:43 PM

Originally posted by ajthomas View Post

So, how can I get a coverage percentage that doesn't count all of those little spots that aren't covered by the capture baits?

Post#4 in this thread should allow you to do that: http://seqanswers.com/forums/showthread.php?t=29758

**mimi_lupton** · 05-21-2013, 11:56 PM

Originally posted by ajthomas View Post

As a collaboration project with a colleague I did a sequence capture/sequencing experiment. The goal was to sequence the HLA region of a particular haplotype associated with a particular disease. My Agilent rep gave me early access to a SureSelect HLA capture product that seemed to work pretty well. I got a library that sequenced well, and an initial alignment with the HLA region yielded decent coverage and on-target numbers. Now, however, I'm trying to do more analysis of it and need some help.

I'm well-versed in the bench-work of this type, but I'm still a novice when it comes to genomic data analysis. For the initial alignment I mentioned, I just mapped the reads to a FASTA file of the HLA region using Roche's Assembler software. That's not annotated or anything, though, so it's not very useful. I have a .bed file describing the baits, and I've used that at the UCSC Genome Browser site to visualize them, and found that the captured region covers a little more than what was in the FASTA used for the initial alignment, and that there are some fairly large gaps that aren't covered by the baits. Is there a way I can use the .bed file to download a better reference file that covers just the captured region and not the gaps?

Also, I loaded the BAM file from that alignment into GenomeBrowse to visualize the the data but that didn't work at all. I can't say I'm surprised, though, because it was only aligned to a FASTA file with a few Mb of data, not the whole genome. What is the right way to go about visualizing the data?

I use Picard tools CalculateHsMetrics.
This allows you to use a bed files with the target intervals, and the capture bait positions to get capture efficiency and coverage metrics on your targeted regions.

Encountered a 404 error

http://picard.sourceforge.net/command-line-overview.shtml#CalculateHsMetrics

**Zaag** · 05-22-2013, 06:44 AM

Is it 454data? If so, use the gsMapper, it can visualize, give % ontarget and has proper documentation so you can try out a few things to see what works for you without the need to learn a lot of different things.

I would use the whole genome as a reference and if you download the corect tables from UCSC and give the BED file it will annotate variants and more with gene information, In or ExtRegion etc.

**ajthomas** · 05-23-2013, 10:13 AM

Originally posted by mimi_lupton View Post

I use Picard tools CalculateHsMetrics.
This allows you to use a bed files with the target intervals, and the capture bait positions to get capture efficiency and coverage metrics on your targeted regions.

http://picard.sourceforge.net/comman...ulateHsMetrics

I looked at that, but it's not clear how to use it. According to the documentation, a BAIT_INTERVALS and TARGET_INTERVALS file are needed. It seems to me that this is something I would get from the .bed file describing the capture baits, but what's the difference between them? What is the format of these files?

**ajthomas** · 05-23-2013, 10:15 AM

Originally posted by GenoMax View Post

Post#4 in this thread should allow you to do that: http://seqanswers.com/forums/showthread.php?t=29758

Thanks, I've been looking at that and getting some information from it.

**ajthomas** · 05-23-2013, 10:19 AM

Originally posted by Zaag View Post

Is it 454data? If so, use the gsMapper, it can visualize, give % ontarget and has proper documentation so you can try out a few things to see what works for you without the need to learn a lot of different things.

I would use the whole genome as a reference and if you download the corect tables from UCSC and give the BED file it will annotate variants and more with gene information, In or ExtRegion etc.

I used gsMapper initially to map the data to the genome. A few questions for you: How do I use the BED file I have to download information from UCSC? I've tried to do that but can't seem to get it to give me only information from the regions specified int the BED file. Once I have that, how do I give it to gsMapper? I made a file with targeted regions from information in the BED file and specified that as the targeted regions file; is that what you mean?

**kmcarr** · 05-23-2013, 12:08 PM

Originally posted by ajthomas View Post

I looked at that, but it's not clear how to use it. According to the documentation, a BAIT_INTERVALS and TARGET_INTERVALS file are needed. It seems to me that this is something I would get from the .bed file describing the capture baits, but what's the difference between them? What is the format of these files?

These two files should be in GATK interval file format (scroll down to #3. Intervals). The interval format is very similar to a bed file except it also contains a header (a sequence .dict in GATK parlance which are basically SAM @SQ lines) followed by bed like interval lines. Each line contains chromsome name, start position, end position, strand (always +) and an ID.

The BAIT file lists the positions of the actual bait oligos used for the capture. The TARGET file lists the larger target regions intended to be pulled down by a set of baits. Here are a couple of snippets showing first the BAIT intervals followed by the corresponding TARGET intervals.

Sample BAIT_INTERVALS

Code:

@HD	VN:1.0	SO:unsorted
@SQ	SN:chr1	LN:249250621	UR:file:C:\Illumina\MiSeq Reporter\Genomes\Homo_sapiens\UCSC\hg19\Sequence\WholeGenomeFASTA\genome.fa	M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ	SN:chr2	LN:243199373	UR:file:C:\Illumina\MiSeq Reporter\Genomes\Homo_sapiens\UCSC\hg19\Sequence\WholeGenomeFASTA\genome.fa	M5:a0d9851da00400dec1098a9255ac712e
chr1	160007387	160007466	+	12242589
chr1	160007677	160007756	+	12242598
chr1	160007967	160008046	+	12242595
chr1	160008257	160008336	+	12242603
chr1	160008537	160008616	+	12242600
chr1	160008837	160008916	+	12242597
chr1	160009067	160009146	+	12242594
chr1	160009357	160009436	+	12242596
chr1	160009627	160009706	+	12242593
chr1	160009927	160010006	+	12242592
chr1	160010187	160010266	+	12242591
chr1	160010487	160010566	+	12242604
chr1	160010747	160010826	+	12242602
chr1	160011057	160011136	+	12242599
chr1	160011337	160011416	+	12242606
chr1	160011647	160011726	+	12242601
chr1	160011937	160012016	+	12242605
chr1	160012237	160012316	+	12242588
chr1	160039881	160039960	+	12242590
chr1	160041470	160041549	+	12242427
chr1	160041980	160042059	+	12242425
chr1	160042350	160042429	+	12242426
chr2	26680091	26680170	+	11653005
chr2	26680431	26680510	+	11653006
chr2	26680791	26680870	+	11653007
chr2	26681001	26681080	+	11653004
chr2	26682954	26683033	+	11653001
chr2	26683515	26683594	+	11652997
chr2	26683769	26683848	+	11652984
chr2	26684644	26684723	+	11652985
chr2	26684960	26685039	+	11653029
chr2	26686355	26686434	+	11653020
chr2	26686863	26686942	+	11653033
chr2	26688585	26688664	+	11653021
chr2	26688840	26688919	+	11652990
chr2	26689610	26689689	+	11653008
chr2	26689994	26690073	+	11652993
chr2	26690261	26690340	+	11652996
chr2	26691269	26691348	+	11652998
chr2	26693485	26693564	+	11653009
chr2	26693963	26694042	+	11653022
chr2	26695382	26695461	+	11652995
chr2	26695412	26695491	+	11652999
chr2	26696041	26696120	+	11653011
chr2	26696314	26696393	+	11653010
chr2	26696868	26696947	+	11653013
chr2	26697421	26697500	+	11653034
chr2	26698244	26698323	+	11652981
chr2	26698784	26698863	+	11652991
chr2	26699050	26699129	+	11653023
chr2	26699805	26699884	+	11653018
chr2	26700058	26700137	+	11653000
chr2	26700289	26700368	+	11653024
chr2	26700527	26700606	+	11653025
chr2	26700573	26700652	+	11652988
chr2	26700713	26700792	+	11652989
chr2	26700771	26700850	+	11652987
chr2	26702152	26702231	+	11653030
chr2	26702391	26702470	+	11653014
chr2	26703085	26703164	+	11653015
chr2	26703765	26703844	+	11653026
chr2	26705327	26705406	+	11653012
chr2	26706383	26706462	+	11652982
chr2	26707381	26707460	+	11653032
chr2	26712081	26712160	+	11652986
chr2	26712537	26712616	+	11652992
chr2	26717835	26717914	+	11652994
chr2	26724609	26724688	+	11653016
chr2	26725191	26725270	+	11653019
chr2	26726636	26726715	+	11653027
chr2	26739336	26739415	+	11653002
chr2	26741887	26741966	+	11652983
chr2	26750704	26750783	+	11653028
chr2	26760573	26760652	+	11653017

Sample TARGET_INTERVALS

Code:

@HD	VN:1.0	SO:unsorted
@SQ	SN:chr1	LN:249250621	UR:file:C:\Illumina\MiSeq Reporter\Genomes\Homo_sapiens\UCSC\hg19\Sequence\WholeGenomeFASTA\genome.fa	M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ	SN:chr2	LN:243199373	UR:file:C:\Illumina\MiSeq Reporter\Genomes\Homo_sapiens\UCSC\hg19\Sequence\WholeGenomeFASTA\genome.fa	M5:a0d9851da00400dec1098a9255ac712e
chr1	160007257	160012322	+	4059197_1_160007257_160012322
chr1	160039812	160040051	+	4059197_1_160039812_160040051
chr1	160041390	160042433	+	4059195_1_160041390_160042433
chr2	26680071	26681088	+	4021269_2_26680071_26681088
chr2	26682876	26683073	+	4021269_2_26682876_26683073
chr2	26683515	26683615	+	4021269_2_26683515_26683615
chr2	26683720	26683898	+	4021269_2_26683720_26683898
chr2	26684564	26684805	+	4021269_2_26684564_26684805
chr2	26684951	26685049	+	4021269_2_26684951_26685049
chr2	26686351	26686439	+	4021269_2_26686351_26686439
chr2	26686832	26686974	+	4021269_2_26686832_26686974
chr2	26687737	26687897	+	4021269_2_26687737_26687897
chr2	26688540	26688710	+	4021269_2_26688540_26688710
chr2	26688817	26688944	+	4021269_2_26688817_26688944
chr2	26689582	26689719	+	4021269_2_26689582_26689719
chr2	26689967	26690101	+	4021269_2_26689967_26690101
chr2	26690233	26690369	+	4021269_2_26690233_26690369
chr2	26691276	26691342	+	4021269_2_26691276_26691342
chr2	26693461	26693589	+	4021269_2_26693461_26693589
chr2	26693989	26694018	+	4021269_2_26693989_26694018
chr2	26695387	26695457	+	4021269_2_26695387_26695457
chr2	26695387	26695517	+	4021269_2_26695387_26695517
chr2	26696000	26696162	+	4021269_2_26696000_26696162
chr2	26696274	26696435	+	4021269_2_26696274_26696435
chr2	26696859	26696978	+	4021269_2_26696859_26696978
chr2	26697381	26697542	+	4021269_2_26697381_26697542
chr2	26698227	26698361	+	4021269_2_26698227_26698361
chr2	26698782	26698906	+	4021269_2_26698782_26698906
chr2	26698996	26699185	+	4021269_2_26698996_26699185
chr2	26699759	26699911	+	4021269_2_26699759_26699911
chr2	26700040	26700156	+	4021269_2_26700040_26700156
chr2	26700284	26700374	+	4021269_2_26700284_26700374
chr2	26700517	26700617	+	4021269_2_26700517_26700617
chr2	26700517	26700910	+	4021269_2_26700517_26700910
chr2	26700712	26700910	+	4021269_2_26700712_26700910
chr2	26702132	26702252	+	4021269_2_26702132_26702252
chr2	26702341	26702521	+	4021269_2_26702341_26702521
chr2	26703071	26703179	+	4021269_2_26703071_26703179
chr2	26703654	26703877	+	4021269_2_26703654_26703877
chr2	26705274	26705460	+	4021269_2_26705274_26705460
chr2	26706330	26706516	+	4021269_2_26706330_26706516
chr2	26707342	26707501	+	4021269_2_26707342_26707501
chr2	26712079	26712163	+	4021269_2_26712079_26712163
chr2	26712546	26712608	+	4021269_2_26712546_26712608
chr2	26717810	26717941	+	4021269_2_26717810_26717941
chr2	26724622	26724676	+	4021269_2_26724622_26724676
chr2	26725168	26725294	+	4021269_2_26725168_26725294
chr2	26726640	26726713	+	4021269_2_26726640_26726713
chr2	26739286	26739467	+	4021269_2_26739286_26739467
chr2	26741878	26741977	+	4021269_2_26741878_26741977
chr2	26750700	26750788	+	4021269_2_26750700_26750788
chr2	26760584	26760642	+	4021269_2_26760584_26760642
chr2	26781361	26781566	+	4021269_2_26781361_26781566

This shows that the 22 baits on chr1 were designed to cover 3 target regions. CalculateHsMetrics reports coverage for both baits and targets

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 18 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 22 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 17 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 49 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

now what? (sequence capture data analysis)

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News