Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • extract positions from alignment

    Hi all,

    I got some short (~220bp) pair-ended DNA sequences targeted in one region. The raw data are in fastq and I aligned them using bwa, which give me sam file. Now I want to extract some positions from each read. For example,
    The original aligned read sequence is:

    TTGAATGGGGGATGTTTTTGGGATATAGATTATGTTTTTATATC

    I want to extract position 3, 6, 8, so I want:
    GTG
    on each line.

    The reason that I can't simply pick up the position from sam file is that sometimes there are indels which will shift my positions.

    I googled this problem but most answers only give me the count summary of each position like using pileup, bamtobed... But in my case, the sequence of GTG is important and can not be split. Namely, I'm interested in how many GTGs rather than how many G, how many T and how many G.

    Any help is very appreciated!

  • #2
    Which computer languages can you program in?

    For example, if you said Python, I would suggest looking at the pysam library for working with SAM/BAM files from Python using the C samtools library.

    Comment


    • #3
      Extracting nucleotid at given position

      Hi,

      i am facing the same issue. I have sam-files resulting from an alignment with CASAVA and I have a list of positons, i.e chr1:63229714.
      I want to extract the bases in the aligned reads at these positions.

      Easy to do when CIGAR=50M (provided reads are 50bp long).
      But tricky when indels are present, i.e. CIGAR=34M1D16M, or CIGAR=10M1467N40M.

      I wrote a perl-script to do the job, but it's too slow.

      I hope there exist a tool which perform this job and would apreciate any help.
      Regards.

      Comment


      • #4
        Something like

        Code:
        samtools view file.bam | cut -f 10 | cut -c 3,6,8 > output.txt
        Will work, but not around indels. You could cut out the sequence and the cigar together, and maybe use awk to use the data from the one to know which letters to cut from the other.

        Comment


        • #5
          I also had a very similar (exact same?) problem: http://seqanswers.com/forums/showthread.php?t=27648

          I ended up doing something similar to vincentdemolombe, by writing a custom perl script.

          I used Bio::Perl and Bio:B::Sam, and parsed the CIGAR string and padded_alignment method to get the read bases from particular positions/regions on the reference.

          Comment


          • #6
            I think you might be better off generating a pileup, and parsing the desired line of that. Let the pileup program deal with shifting the reads around properly.

            Comment


            • #7
              Great point.. though in the 5 minutes I played around with samtools mpileup, it doesn't seem to keep the cluster ID in the output. In my case, I needed to know the cluster ID AND its bases at particular positions on the reference after alignment.

              This is as far as I got:
              Code:
              samtools mpileup -f ref.fasta read1.sorted.bam | less -S -N

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Essential Discoveries and Tools in Epitranscriptomics
                by seqadmin


                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
                Yesterday, 07:01 AM
              • seqadmin
                Current Approaches to Protein Sequencing
                by seqadmin


                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                04-04-2024, 04:25 PM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 04-11-2024, 12:08 PM
              0 responses
              37 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 10:19 PM
              0 responses
              41 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 09:21 AM
              0 responses
              35 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-04-2024, 09:00 AM
              0 responses
              55 views
              0 likes
              Last Post seqadmin  
              Working...
              X