Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • search CIGAR in SAM file

    Hi,

    how can I search a SAM file to find the alignments with N or D above a certain value in the CIGAR?

  • #2
    Using my tool samjs : https://github.com/lindenb/jvarkit/wiki/SamJS

    file filter.js

    Code:
    function accept(rec)
    	{
    	if( rec.getReadUnmappedFlag()) return false;
    	var cigar = rec.cigar;
    	if(cigar==null) return false;
    	for(var i = 0;i< cigar.numCigarElements();++i)
    		{
    		var ce = cigar.getCigarElement(i);
    		if( ce.getLength()< 8 ) continue;
    		var op =  ce.getOperator().name();
    		if( op=="N" || op=="D") return true;
    		}
    	return false;
    	}
    accept(record);

    Code:
    curl -s "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/data/HG00154/alignment/HG00154.mapped.ILLUMINA.bwa.GBR.low_coverage.20101123.bam" | java -jar dist-1.133/samjs.jar -f filter.js
    
    (...)
    ERR018420.635615	147	1	18295328	70	32M10D44M	=	18294913	-500	TTTGTTTTGTTTTATTATATTATATTATATTGTATTATATTATATTATATTATATTATATTATATTAATTATATTA	=@B9>>?B;>AAAA@AAAABAAAABA@A@BA>A@AA@A@AA@@@@A@A@AA@A@AA@A@BA@AABAB@BA@A@AA<	X0:i:1	X1:i:0	OC:Z:67M1D9M	MD:Z:32^TATTATATTA44	RG:Z:ERR018420	AM:i:37	NM:i:10	SM:i:37	BQ:Z:BED@@@@@@@@@@@@@@@@@@@@@@@@@@@@@`_``_`_``____`_`_``_`_``_`_a`_``a`a_a`_`_``[	MQ:i:60	XT:A:U
    ERR018419.19671936	163	1	18335432	35	50M14D20M6S	=	18335873	499	TCTAAAAAAAAAAAAAAAGAGAGAGAGAAATAAAGAGAGAAAGAGAAAGAAGAAAGAAGGAAAGAAAGAAAACAAA	;76>=AB<<>?;>=8>972753.&9:64;>87==7;9667B=8=39A>8=82525)%67<5;9)8A>.=969,A<@	OC:Z:62M2D8M6S	MD:Z:50^AGAAAGAAAGAAAG20	RG:Z:ERR018419	AM:i:25	NM:i:14	SM:i:25	BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@WQTQTHDUV[TZXHW`]M\X@@@@@@	MQ:i:25	XT:A:M
    ERR018418.3162191	163	1	18360887	70	43M16D33M	=	18361308	496	CATCTATCTATTATCTATCTATCTCATTTATCCATCTATCTATTATCTATCTATCTATCTATCTATCTATCTGCCT	<A@@BAA@B@ABAA?BAA@AA?@B@AABBA@??AA?BAA@BAABAA@BA@@BAA@BA@@BAA@BBB@BB@AC??BD	X0:i:1	X1:i:0	OC:Z:76M	MD:Z:43^TATCTATCTATCTATC33	RG:Z:ERR018418	AM:i:37	NM:i:16	SM:i:37	BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@a``_a`__a``_a`__a``_aaa_aa_`b^^ac	MQ:i:60	XT:A:U
    ERR018418.5325337	163	1	18360893	70	37M16D39M	=	18361325	507	TCTATTATCTATCTATCTCATTTATCCATCTATCTATTATCTATCTATCTATCTATCTATCTATCTGCCTATCCAT	<?AAABAA@AAA@BAA@B@AAABAA@@AA@AA@?BAABAA?AAA@BAA?BAA@AAA@BAA@BA@@B@AACBBAAC@	X0:i:1	X1:i:0	OC:Z:76M	MD:Z:37^TATCTATCTATCTATC39	RG:Z:ERR018418	AM:i:37	NM:i:16	SM:i:37	BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@a``^```_a``^a``_```_a``_a`__a_``baa``b_	MQ:i:60	XT:A:U

    Comment


    • #3
      This should do, using python with pysam 0.8.1

      Code:
      #!/usr/bin/env python
      
      import sys
      import pysam
      
      INBAM=sys.argv[1]
      MIN_GAP= int(sys.argv[2])
      
      insam= pysam.AlignmentFile(INBAM)
      outsam= pysam.AlignmentFile('-', 'wh', template= insam)
      for aln in insam:
          for x in aln.cigartuples:
              # See cigartuples in http://pysam.readthedocs.org/en/latest/api.html 
              if x[0] in [2, 3] and x[1] > MIN_GAP:
                  outsam.write(aln)
                  break
      insam.close()
      outsam.close()
      sys.exit()
      Save it as something like "longGaps.py" and run it as longGaps.py <inaln.bam> <min-gap>. It will print to stdout a sam file with header with the reads containing D or N operator longer than min-gap. (I haven't tested it carefully)

      Comment

      Latest Articles

      Collapse

      • seqadmin
        Current Approaches to Protein Sequencing
        by seqadmin


        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
        04-04-2024, 04:25 PM
      • seqadmin
        Strategies for Sequencing Challenging Samples
        by seqadmin


        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
        03-22-2024, 06:39 AM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by seqadmin, 04-11-2024, 12:08 PM
      0 responses
      24 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-10-2024, 10:19 PM
      0 responses
      25 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-10-2024, 09:21 AM
      0 responses
      21 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-04-2024, 09:00 AM
      0 responses
      52 views
      0 likes
      Last Post seqadmin  
      Working...
      X