Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • now what? (sequence capture data analysis)

    As a collaboration project with a colleague I did a sequence capture/sequencing experiment. The goal was to sequence the HLA region of a particular haplotype associated with a particular disease. My Agilent rep gave me early access to a SureSelect HLA capture product that seemed to work pretty well. I got a library that sequenced well, and an initial alignment with the HLA region yielded decent coverage and on-target numbers. Now, however, I'm trying to do more analysis of it and need some help.

    I'm well-versed in the bench-work of this type, but I'm still a novice when it comes to genomic data analysis. For the initial alignment I mentioned, I just mapped the reads to a FASTA file of the HLA region using Roche's Assembler software. That's not annotated or anything, though, so it's not very useful. I have a .bed file describing the baits, and I've used that at the UCSC Genome Browser site to visualize them, and found that the captured region covers a little more than what was in the FASTA used for the initial alignment, and that there are some fairly large gaps that aren't covered by the baits. Is there a way I can use the .bed file to download a better reference file that covers just the captured region and not the gaps?

    Also, I loaded the BAM file from that alignment into GenomeBrowse to visualize the the data but that didn't work at all. I can't say I'm surprised, though, because it was only aligned to a FASTA file with a few Mb of data, not the whole genome. What is the right way to go about visualizing the data?

  • #2
    Not sure if you have looked at IGV for visualization. You should be able to use the FASTA file as a custom genome.

    If you want to download the entire region that you have captured then look at BioMart (from Ensembl) or UCSC table browser. You may need to redo your alignments since the region co-ordinates/ID's will not match if you try to use this new file as the custom genome.

    Comment


    • #3
      Also, besides just aligning to the HLA region, you could also try aligning to the entire human genome (hg19). Hopefully this wouldn't lead to too many off-target alignment and you would be able to get annotations comparatively easily for the hg19. I agree with GenoMax in trying out IGV as it is really good for visualization. You could also try Golden Helix's GenomeBrowse (http://www.goldenhelix.com/GenomeBrowse/) for visualization as it has added functionality of having dbSNP, 1000 genomes and various other forms of annotated data.

      Comment


      • #4
        I would recommend aligning to the whole human genome, if you don't want all the off-target alignments you can filter those off afterwards. Alignment programs try to align all reads to the given reference, when the read set covers much more than the given reference the alignment program can result in some pretty awful 'forced' alignments giving you rubbish data.

        Comment


        • #5
          Thanks for the input so far. I've tried both IGV and GenomeBrowse and both seem to work fine once I figured out how to make a .bai file. Now I have another question. How do I calculate the coverage of the capture region and on-target percentage?

          When I initially aligned just to a section of the chromosome that corresponded to the HLA region I had something like 70% coverage. Now, however, in IGV I loaded the .bed file that describes the capture baits, I can see that the baits don't cover all of the HLA, but there are lots of little gaps between the baits. The reads I have correlate very well with the baits, so I think my coverage should be much higher than that. Here's a screen shot from IGV showing the coverage and the .bed file tracks.



          So, how can I get a coverage percentage that doesn't count all of those little spots that aren't covered by the capture baits?

          Comment


          • #6
            Originally posted by ajthomas View Post

            So, how can I get a coverage percentage that doesn't count all of those little spots that aren't covered by the capture baits?
            Post#4 in this thread should allow you to do that: http://seqanswers.com/forums/showthread.php?t=29758

            Comment


            • #7
              Originally posted by ajthomas View Post
              As a collaboration project with a colleague I did a sequence capture/sequencing experiment. The goal was to sequence the HLA region of a particular haplotype associated with a particular disease. My Agilent rep gave me early access to a SureSelect HLA capture product that seemed to work pretty well. I got a library that sequenced well, and an initial alignment with the HLA region yielded decent coverage and on-target numbers. Now, however, I'm trying to do more analysis of it and need some help.

              I'm well-versed in the bench-work of this type, but I'm still a novice when it comes to genomic data analysis. For the initial alignment I mentioned, I just mapped the reads to a FASTA file of the HLA region using Roche's Assembler software. That's not annotated or anything, though, so it's not very useful. I have a .bed file describing the baits, and I've used that at the UCSC Genome Browser site to visualize them, and found that the captured region covers a little more than what was in the FASTA used for the initial alignment, and that there are some fairly large gaps that aren't covered by the baits. Is there a way I can use the .bed file to download a better reference file that covers just the captured region and not the gaps?

              Also, I loaded the BAM file from that alignment into GenomeBrowse to visualize the the data but that didn't work at all. I can't say I'm surprised, though, because it was only aligned to a FASTA file with a few Mb of data, not the whole genome. What is the right way to go about visualizing the data?
              I use Picard tools CalculateHsMetrics.
              This allows you to use a bed files with the target intervals, and the capture bait positions to get capture efficiency and coverage metrics on your targeted regions.

              Comment


              • #8
                Is it 454data? If so, use the gsMapper, it can visualize, give % ontarget and has proper documentation so you can try out a few things to see what works for you without the need to learn a lot of different things.

                I would use the whole genome as a reference and if you download the corect tables from UCSC and give the BED file it will annotate variants and more with gene information, In or ExtRegion etc.

                Comment


                • #9
                  Originally posted by mimi_lupton View Post
                  I use Picard tools CalculateHsMetrics.
                  This allows you to use a bed files with the target intervals, and the capture bait positions to get capture efficiency and coverage metrics on your targeted regions.

                  http://picard.sourceforge.net/comman...ulateHsMetrics
                  I looked at that, but it's not clear how to use it. According to the documentation, a BAIT_INTERVALS and TARGET_INTERVALS file are needed. It seems to me that this is something I would get from the .bed file describing the capture baits, but what's the difference between them? What is the format of these files?

                  Comment


                  • #10
                    Originally posted by GenoMax View Post
                    Post#4 in this thread should allow you to do that: http://seqanswers.com/forums/showthread.php?t=29758
                    Thanks, I've been looking at that and getting some information from it.

                    Comment


                    • #11
                      Originally posted by Zaag View Post
                      Is it 454data? If so, use the gsMapper, it can visualize, give % ontarget and has proper documentation so you can try out a few things to see what works for you without the need to learn a lot of different things.

                      I would use the whole genome as a reference and if you download the corect tables from UCSC and give the BED file it will annotate variants and more with gene information, In or ExtRegion etc.
                      I used gsMapper initially to map the data to the genome. A few questions for you: How do I use the BED file I have to download information from UCSC? I've tried to do that but can't seem to get it to give me only information from the regions specified int the BED file. Once I have that, how do I give it to gsMapper? I made a file with targeted regions from information in the BED file and specified that as the targeted regions file; is that what you mean?

                      Comment


                      • #12
                        Originally posted by ajthomas View Post
                        I looked at that, but it's not clear how to use it. According to the documentation, a BAIT_INTERVALS and TARGET_INTERVALS file are needed. It seems to me that this is something I would get from the .bed file describing the capture baits, but what's the difference between them? What is the format of these files?
                        These two files should be in GATK interval file format (scroll down to #3. Intervals). The interval format is very similar to a bed file except it also contains a header (a sequence .dict in GATK parlance which are basically SAM @SQ lines) followed by bed like interval lines. Each line contains chromsome name, start position, end position, strand (always +) and an ID.

                        The BAIT file lists the positions of the actual bait oligos used for the capture. The TARGET file lists the larger target regions intended to be pulled down by a set of baits. Here are a couple of snippets showing first the BAIT intervals followed by the corresponding TARGET intervals.

                        Sample BAIT_INTERVALS
                        Code:
                        @HD	VN:1.0	SO:unsorted
                        @SQ	SN:chr1	LN:249250621	UR:file:C:\Illumina\MiSeq Reporter\Genomes\Homo_sapiens\UCSC\hg19\Sequence\WholeGenomeFASTA\genome.fa	M5:1b22b98cdeb4a9304cb5d48026a85128
                        @SQ	SN:chr2	LN:243199373	UR:file:C:\Illumina\MiSeq Reporter\Genomes\Homo_sapiens\UCSC\hg19\Sequence\WholeGenomeFASTA\genome.fa	M5:a0d9851da00400dec1098a9255ac712e
                        chr1	160007387	160007466	+	12242589
                        chr1	160007677	160007756	+	12242598
                        chr1	160007967	160008046	+	12242595
                        chr1	160008257	160008336	+	12242603
                        chr1	160008537	160008616	+	12242600
                        chr1	160008837	160008916	+	12242597
                        chr1	160009067	160009146	+	12242594
                        chr1	160009357	160009436	+	12242596
                        chr1	160009627	160009706	+	12242593
                        chr1	160009927	160010006	+	12242592
                        chr1	160010187	160010266	+	12242591
                        chr1	160010487	160010566	+	12242604
                        chr1	160010747	160010826	+	12242602
                        chr1	160011057	160011136	+	12242599
                        chr1	160011337	160011416	+	12242606
                        chr1	160011647	160011726	+	12242601
                        chr1	160011937	160012016	+	12242605
                        chr1	160012237	160012316	+	12242588
                        chr1	160039881	160039960	+	12242590
                        chr1	160041470	160041549	+	12242427
                        chr1	160041980	160042059	+	12242425
                        chr1	160042350	160042429	+	12242426
                        chr2	26680091	26680170	+	11653005
                        chr2	26680431	26680510	+	11653006
                        chr2	26680791	26680870	+	11653007
                        chr2	26681001	26681080	+	11653004
                        chr2	26682954	26683033	+	11653001
                        chr2	26683515	26683594	+	11652997
                        chr2	26683769	26683848	+	11652984
                        chr2	26684644	26684723	+	11652985
                        chr2	26684960	26685039	+	11653029
                        chr2	26686355	26686434	+	11653020
                        chr2	26686863	26686942	+	11653033
                        chr2	26688585	26688664	+	11653021
                        chr2	26688840	26688919	+	11652990
                        chr2	26689610	26689689	+	11653008
                        chr2	26689994	26690073	+	11652993
                        chr2	26690261	26690340	+	11652996
                        chr2	26691269	26691348	+	11652998
                        chr2	26693485	26693564	+	11653009
                        chr2	26693963	26694042	+	11653022
                        chr2	26695382	26695461	+	11652995
                        chr2	26695412	26695491	+	11652999
                        chr2	26696041	26696120	+	11653011
                        chr2	26696314	26696393	+	11653010
                        chr2	26696868	26696947	+	11653013
                        chr2	26697421	26697500	+	11653034
                        chr2	26698244	26698323	+	11652981
                        chr2	26698784	26698863	+	11652991
                        chr2	26699050	26699129	+	11653023
                        chr2	26699805	26699884	+	11653018
                        chr2	26700058	26700137	+	11653000
                        chr2	26700289	26700368	+	11653024
                        chr2	26700527	26700606	+	11653025
                        chr2	26700573	26700652	+	11652988
                        chr2	26700713	26700792	+	11652989
                        chr2	26700771	26700850	+	11652987
                        chr2	26702152	26702231	+	11653030
                        chr2	26702391	26702470	+	11653014
                        chr2	26703085	26703164	+	11653015
                        chr2	26703765	26703844	+	11653026
                        chr2	26705327	26705406	+	11653012
                        chr2	26706383	26706462	+	11652982
                        chr2	26707381	26707460	+	11653032
                        chr2	26712081	26712160	+	11652986
                        chr2	26712537	26712616	+	11652992
                        chr2	26717835	26717914	+	11652994
                        chr2	26724609	26724688	+	11653016
                        chr2	26725191	26725270	+	11653019
                        chr2	26726636	26726715	+	11653027
                        chr2	26739336	26739415	+	11653002
                        chr2	26741887	26741966	+	11652983
                        chr2	26750704	26750783	+	11653028
                        chr2	26760573	26760652	+	11653017

                        Sample TARGET_INTERVALS
                        Code:
                        @HD	VN:1.0	SO:unsorted
                        @SQ	SN:chr1	LN:249250621	UR:file:C:\Illumina\MiSeq Reporter\Genomes\Homo_sapiens\UCSC\hg19\Sequence\WholeGenomeFASTA\genome.fa	M5:1b22b98cdeb4a9304cb5d48026a85128
                        @SQ	SN:chr2	LN:243199373	UR:file:C:\Illumina\MiSeq Reporter\Genomes\Homo_sapiens\UCSC\hg19\Sequence\WholeGenomeFASTA\genome.fa	M5:a0d9851da00400dec1098a9255ac712e
                        chr1	160007257	160012322	+	4059197_1_160007257_160012322
                        chr1	160039812	160040051	+	4059197_1_160039812_160040051
                        chr1	160041390	160042433	+	4059195_1_160041390_160042433
                        chr2	26680071	26681088	+	4021269_2_26680071_26681088
                        chr2	26682876	26683073	+	4021269_2_26682876_26683073
                        chr2	26683515	26683615	+	4021269_2_26683515_26683615
                        chr2	26683720	26683898	+	4021269_2_26683720_26683898
                        chr2	26684564	26684805	+	4021269_2_26684564_26684805
                        chr2	26684951	26685049	+	4021269_2_26684951_26685049
                        chr2	26686351	26686439	+	4021269_2_26686351_26686439
                        chr2	26686832	26686974	+	4021269_2_26686832_26686974
                        chr2	26687737	26687897	+	4021269_2_26687737_26687897
                        chr2	26688540	26688710	+	4021269_2_26688540_26688710
                        chr2	26688817	26688944	+	4021269_2_26688817_26688944
                        chr2	26689582	26689719	+	4021269_2_26689582_26689719
                        chr2	26689967	26690101	+	4021269_2_26689967_26690101
                        chr2	26690233	26690369	+	4021269_2_26690233_26690369
                        chr2	26691276	26691342	+	4021269_2_26691276_26691342
                        chr2	26693461	26693589	+	4021269_2_26693461_26693589
                        chr2	26693989	26694018	+	4021269_2_26693989_26694018
                        chr2	26695387	26695457	+	4021269_2_26695387_26695457
                        chr2	26695387	26695517	+	4021269_2_26695387_26695517
                        chr2	26696000	26696162	+	4021269_2_26696000_26696162
                        chr2	26696274	26696435	+	4021269_2_26696274_26696435
                        chr2	26696859	26696978	+	4021269_2_26696859_26696978
                        chr2	26697381	26697542	+	4021269_2_26697381_26697542
                        chr2	26698227	26698361	+	4021269_2_26698227_26698361
                        chr2	26698782	26698906	+	4021269_2_26698782_26698906
                        chr2	26698996	26699185	+	4021269_2_26698996_26699185
                        chr2	26699759	26699911	+	4021269_2_26699759_26699911
                        chr2	26700040	26700156	+	4021269_2_26700040_26700156
                        chr2	26700284	26700374	+	4021269_2_26700284_26700374
                        chr2	26700517	26700617	+	4021269_2_26700517_26700617
                        chr2	26700517	26700910	+	4021269_2_26700517_26700910
                        chr2	26700712	26700910	+	4021269_2_26700712_26700910
                        chr2	26702132	26702252	+	4021269_2_26702132_26702252
                        chr2	26702341	26702521	+	4021269_2_26702341_26702521
                        chr2	26703071	26703179	+	4021269_2_26703071_26703179
                        chr2	26703654	26703877	+	4021269_2_26703654_26703877
                        chr2	26705274	26705460	+	4021269_2_26705274_26705460
                        chr2	26706330	26706516	+	4021269_2_26706330_26706516
                        chr2	26707342	26707501	+	4021269_2_26707342_26707501
                        chr2	26712079	26712163	+	4021269_2_26712079_26712163
                        chr2	26712546	26712608	+	4021269_2_26712546_26712608
                        chr2	26717810	26717941	+	4021269_2_26717810_26717941
                        chr2	26724622	26724676	+	4021269_2_26724622_26724676
                        chr2	26725168	26725294	+	4021269_2_26725168_26725294
                        chr2	26726640	26726713	+	4021269_2_26726640_26726713
                        chr2	26739286	26739467	+	4021269_2_26739286_26739467
                        chr2	26741878	26741977	+	4021269_2_26741878_26741977
                        chr2	26750700	26750788	+	4021269_2_26750700_26750788
                        chr2	26760584	26760642	+	4021269_2_26760584_26760642
                        chr2	26781361	26781566	+	4021269_2_26781361_26781566
                        This shows that the 22 baits on chr1 were designed to cover 3 target regions. CalculateHsMetrics reports coverage for both baits and targets

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Current Approaches to Protein Sequencing
                          by seqadmin


                          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                          04-04-2024, 04:25 PM
                        • seqadmin
                          Strategies for Sequencing Challenging Samples
                          by seqadmin


                          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                          03-22-2024, 06:39 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, 04-11-2024, 12:08 PM
                        0 responses
                        18 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 10:19 PM
                        0 responses
                        22 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 09:21 AM
                        0 responses
                        17 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-04-2024, 09:00 AM
                        0 responses
                        49 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X