Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Extract subalignment from SAM/BAM

    Hello everyone, I,ve been looking for an answer since a few day for a question that not seem so difficult...
    How can I extract subsequences from a BAM/SAM alignment?

    I have assembled Fastq paired end data from an Illumina run with BWA mem vs. a reference BaL_V3_extend (506 bp long).
    Fastq reads are 250 bp long overlapping on 70 bp.
    The SAM file was filtered and converted to BAM to retrieve only the mapped sequence on the reference :
    Code:
    samtools view -SbhF 4 raw.sam > mapped.bam
    I'm interested in recovering parts of this alignment (I mean the sequences) that cover region 197-301 (105 bp) of my reference.
    I tried bedtools :
    Code:
    intersectBed -abam mapped.bam -b V3.bed > V3_mapped.bam
    V3.bed contains :
    Code:
    BaL_V3_extend	196	301	V3_loop
    which is my region of interest, but V3_mapped.bam contains the ~430 bp long , not the 105 bp alignment :
    Code:
    HWI-M00185:92:000000000-AAE9F:1:1101:18230:1606	0	BaL_V3_extend	46	60	219M3D36M164S	*	0	0	ACAATGCACACATGGAATTAGGCCAGTGGTATCAACTCAATTGCTGTTAAATGGCAGTCTAGCAGAAGAAGATGTAATAATCAGATCTGACAATTTCACAGACAATGTTAAAACCATAATAGTACAGCTAAACGAAACTGTAAAAATTAATTGCACAAGACCCAACAACAATACAAGGAGAAGTATACATATGGGACCAGGGAGAGCATTTTATGGAACAGATATAATAGGAGATATAAGACAAGCTCATTGTAATGTTAGTGAAAAAGATTGGAAGAAAACTCTACGGAGGGTAGCTATAAAATTAAAGGAACAATTTAATAAAACAATAATTTTTAATCACTCCTCAGGAGGGGATCCAGAAATTGTAATGCACACCTTTAATTGCAGAGGGGAATTTTTCTACTGTAATACATCAC	CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGDGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGG@DGGGGG>FGGFGFEGGGDGGGGGGGFFGDDGGGEGGGFFGFGFFEFGFDFDGEFGFGDDGFGGGFFFFFFGGFGE9CFGGGGGGGGGGGGF?FAFGFBAGGGFFAGGFFFGGGGC?CEDFGGGGGGGGGGGGGGGGGGGGGGGGFFE9AGGGGGFGGGGGFFBGFGGGGGGGGGGCGGGGGGGGGGDGGGGGGGGGGGGGGGGGFFFFGGGGFFGGGFGGGGGGGGGGGGGGGGGGGGGGGGCCCBC	NM:i:29	MD:Z:6T8G11A12C31G3G4T5C2A6G1G7C21G2T3T5G10T23A1A12A8C8A3A0C3^AGG3A23A8	AS:i:116	XS:i:0	SA:Z:BaL_V3_extend,346,+,297S22M3D100M,29,18;
    HWI-M00185:92:000000000-AAE9F:1:1101:12377:1640	0	BaL_V3_extend	46	60	219M3D43M3D4M3I14M4D6M4I115M14S	*	0	0	ACAATGTACACATGGAATTAAGCCAGTAGTATCAACTCAATTGCTGTTAAATGGCAGTCTAGCAGAAGAAGACATAGTAATCAGATCTGAAAATTTCACAGACAATGCTAAAACCATAATAGTACAGCTGAATGAAACTGTACAAATTAAGTGCACAAGGCCCAACAACAATACAAGGAGAAGTATACATATGGGACCGGGGAGAGCATTTTATGGAACAGATATAATAGGAAATATAAGACAAGCTCATTGTAATGTCAGTGAAAAGGAATGGAAGAAAACTCTACGACGGGTAGCTATAAAATTAAGGGAACAATTTGAGAATAAAACAATAGTCTTTACTCACTCCTCAGGAGGGGATCCAGAAATTGTAATGCACACCTTTAATTGTAGAGGGGAATTTTTCTATTGCAATACATCAC	CCCCCFFFFGGGGGGGGGGGGGGGGGGGGGGGGGGGEFGGGFFGDFEGGGGFFCFFGFGGGGGGGGGGGGGGGGAFGFFGG<FFGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCFGFFFFCFGGCFFFGGGGGGFEFGGGGGGFGFGF9FGGGGDFC8FF?FEGGGGCEGGGGG:FGGGG,@DFGGFDDFG8BEGGGGDGCGGF;=(ADFGGG9;ECFGGFFEGGGDFFCFFFFGFGDGA@@6A9DGGGGGCFGGFFE8AFDFCFFEEECGFGGGGGCGFC@,BB,FGGDFFFF9FEBE?8FGF@A<?F<FGDGGEFFAFGGF<GGGGCGGGFGGFGGGFF8GGFGGGGGFFGFGGGGF9FCAFGGFGGGGGGGGFGGGGGCGGGGFFFF?FGGGGGGGFGGGGGGGGFGGFCCCCC	NM:i:63	MD:Z:15G4G19C31G0G7T5C9G1G36T5G7T2T5A17A1A12A5A2C8A3A0C3^AGG3A9G13A8C0C1T3^AGA1C9T0G1C3^TTAA0A2A5T12A10G20A0G2T14C11G1C5G0T9G16	AS:i:124	XS:i:0
    HWI-M00185:92:000000000-AAE9F:1:1101:8886:2382	0	BaL_V3_extend	46	60	219M3D43M3D4M3I14M4D6M4I129M	*	0	0	ACAATGCACACATGGAATTAGGCCAGTAGTATCAACTCAACTGCTGTTAAATGGCAGTCTAGCAGAAGGAGAGGTAGTAATCAGATCTGACAATTTCACAGACAATGTTAAAACCATAATAGTACAGCTAAACGAAACTGTAAAAATTAATTGCACAAGACCCAACAACAATACAAGGAGAAGTATACATATGGGACCAGGGAGAGCATTTTATGGAACAGATATAATAGGAAATATAAGACAAGCTCATTGTAATGTTAGTGAAAAGGAATGGAAGAAAACTCTACGACGGGTAGCTATAAAATTAAGGGAACAATTTGAGAATAAAACAATAGTCTTTACTCACTCCTCAGGAGGGGATCCAGAAATTGTAATGCACACCTTTAATTGTAGAGGGGAACTTTTCTATTGTAATACATCAC	CCCCCGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGFGGGFGGFGGGGGFGGGGGGGGGGGGGGGGGGGGGGGFGGGGGEGGCCGGGGGGGGGGGGGGGGGGGGGGFFGGGGGGGGGGGGFGGGGGG=FGGGGGGGGDFGGFFEDEEGGGG=CFEGFGFGFFFFGGGFGGGGFFFFGGGGFGGGFFFFFFGGFDFFGFE==+FCD9,FDGGFGGGGGGGGDGFFGGGFEE6F@EECCGFFGGGFEGGGGFGFFFF9FAGGGGGGGFEFEEECFFGFCFGGGGFEECGGGGGFGGGGGFEGGGGGGGGGFCGFGGGEFGGGGGGGGGGGGGGGGGFFGGGGGGGGGFEFGGGGFEFGGGGGGGGGGCCCCC	NM:i:65	MD:Z:6T8G52A12T5C2A6G1G7C21G2T3T5G10T23A1A12A8C8A3A0C3^AGG3A9G13A8C0C5^AGA1C9T0G1C3^TTAA0A2A5T12A10G20A0G2T14C11G1C5G0T9G8T7C6T2A3	AS:i:131	XS:i:0
    ...
    I also tried in BED format :
    Code:
    intersectBed -abam mapped.bam -b V3.bed -bed > V3_mapped.bed
    The V3_mapped.bed contains :
    Code:
    BaL_V3_extend	196	301	HWI-M00185:92:000000000-AAE9F:1:1101:18230:1606	60	+	45	303	0,0,0	1	258,	0,
    BaL_V3_extend	196	301	HWI-M00185:92:000000000-AAE9F:1:1101:12377:1640	60	+	45	456	0,0,0	1	411,	0,
    BaL_V3_extend	196	301	HWI-M00185:92:000000000-AAE9F:1:1101:8886:2382	60	+	45	470	0,0,0	1	425,	0,
    ...
    But how could I extract the subalignment sequences (105bp) from my mapped.bam file?

    Thanks for your advices.
    Last edited by Mesmer; 01-16-2015, 03:14 AM.

  • #2
    I'm not familiar with a tool that will subset alignments like you want (not that it's difficult to write one). Why exactly do you want to do that? There are likely better ways.

    Comment


    • #3
      I need that kind of data to be processed by another script (in JAVA), to determine specific features of that V3 region (105 pb).
      I could pass to my JAVA script the fastq sequences from my bam file and align them vs the V3 region but I'm afraid that will take a long computing time, especially if I have various samples to procced.
      As BWA have already aligned the reads, I thought I could use this alignment and cut it at specific positions to save computing time.

      Comment


      • #4
        You might need to just trim the alignments in your java program (the only non-simple part is dealing with InDels).

        Comment


        • #5
          yes I'm trying that way, if someone has an other idea, I'm still listening

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Current Approaches to Protein Sequencing
            by seqadmin


            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
            04-04-2024, 04:25 PM
          • seqadmin
            Strategies for Sequencing Challenging Samples
            by seqadmin


            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
            03-22-2024, 06:39 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, 04-11-2024, 12:08 PM
          0 responses
          30 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 10:19 PM
          0 responses
          32 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 09:21 AM
          0 responses
          28 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-04-2024, 09:00 AM
          0 responses
          53 views
          0 likes
          Last Post seqadmin  
          Working...
          X