Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • cleaning partial PE sam data

    Hello there,
    I obtained PE data from Illumina (chr21 subset of NA18507 - ftp://webdata:[email protected]..._100_chr21.bam).

    After a lot of misery and systematic LENIENT use of picard I could manage to use the data to extract fastQ paired reads from it and remap them to another reference build.

    BUT

    I discovered (among many other serious SAM compliancy problems) that not all reads are present in that file and many pairs have lost one end (which probably did not map on chr21 and was filtered out uncleanly!)

    My question is how to clean such SAM/BAM where FLAGS indicate paired reads but one read of the pair is not present anymore.
    • I cannot use the SAM flags to do it because they are erroneous
    • I could not fix the flags to reflect the true paired status


    Below is the head of the original name-sorted reads showing the absence of one of the 'EAS51_0210:7:33:5109:13959' reads (many 1000's like that)

    Thanks for your suggestions on which command to use and how to eliminate these reads to obtain a fully paired file on which bam2fastq will run smoothly. Preferentially using picard and not with some fancy perl code keeping only '*' in the 7th column .

    Thanks a lot for your lights,
    Stephane

    Code:
    EAS51_0210:3:6:3797:7459	165	chr21	9719702	255	*	*	0	0	AACCTTTGTTTGGATGGAGCAGTTTGTAAACAATCCTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATAT	GGGGGGGGGGGGGFGGFEGGGGEGGGEGGGFDFBGGEFEFGEEGEGFEGGEGEEED?EEEGEEGBEBDGEEEEED=DCCCEBEEEEEEEAAC@DDB:CCC	H0:i:0	H1:i:0	H2:i:2	SM:i:-1	AS:i:0
    EAS51_0210:3:6:3797:7459	89	chr21	9719702	73	100M	*	0	0	ATATTTGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATAAAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTC	DDC@BEEEEEEGEEBFGEGG@EEDDBEEGEGGGGFFFFGEECGFGGEGGGGGGGGDFGGGFFGGGGFGFEGGGFGGGGGGBGGEGGFGGGGGGGGGGGGG	XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN34	SM:i:73	AS:i:0
    EAS51_0210:7:33:5109:13959	145	chr21	9719707	254	100M	chrY	10653706	0	TGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATAAAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTC	FEGDGEGEEEEGFEGEEGEEDFEGGGGGGEFGFGFAFFFEGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGG	XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T2	SM:i:461	AS:i:0
    EAS25_0078:8:23:14907:11377	165	chr21	9719708	255	*	*	0	0	CTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATATCTTCCCATAAAAACTAGACAGAAGGTTTCTAAGAA	GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGEFGGGGGGGGGEGGGGGGGGGGFGFGDGGGGGFDFBFGCEBFFFEGFF	H0:i:1	H1:i:6	H2:i:60	SM:i:-1	AS:i:0
    EAS25_0078:8:23:14907:11377	89	chr21	9719708	254	100M	*	0	0	TGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATGAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTCA	EGEEEEGEBGFGGEEEEEGGDEBGFGGGGGGGGGGAGFGGGEGGGGGGGGGGGGGGGGGGGFGGGGGGGGEEGGGGGGGGGGGGGGGGGGGGGGGGGGGG	XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T3	SM:i:461	AS:i:0
    EAS25_0078:3:4:1830:9254	101	chr21	9719709	255	*	*	0	0	TTGAACCTTTGTTTGGATGGAGCAGTTTGTAAACAATCCTTTTGTAGAATCTGCAAAGGTATATTTTTGAGCCCATTGAGGCCTATGGTGAAATACGAAA	GGGGGGGGGGGGGGGGGGFGBGGGGGGGFEGGGGEGGGGGGGGGEGGGGGGEFGGEFGGEFFFFF/&8?@EEECCFGFGGFDFGFEGF?DEEDEFEEFEE	H0:i:0	H1:i:0	H2:i:3	SM:i:-1	AS:i:0
    EAS25_0078:3:4:1830:9254	153	chr21	9719709	254	100M	*	0	0	GGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATGAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTCAC	EEBE?EEEEEEEEGEEBEEEBGGGEGGGEFAFFECEEE=EDGGFGGGGDGFFGGGGGGGGGEEGGGGDGGGEGFGFGFGGGGGGGGGDGGGEGGFGGGGG	XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T4	SM:i:461	AS:i:0
    http://www.bits.vib.be/index.php

  • #2
    Do you really want to clean that file or do you just want the clean and synced fastq files? The latter is actually not terribly difficult (see here and here for suggestions).

    Comment


    • #3
      I would like to learn how to clean that file in order to be able to redo such operation with future data having similar issues.

      Thanks for the links anyway.
      http://www.bits.vib.be/index.php

      Comment


      • #4
        Fair enough. The follow isn't perl, which I generally loathe, but it's a simple python solution:
        Code:
        #!/usr/bin/env python
        import sys
        
        f = sys.stdin
        of = sys.stdout
        
        read1 = None
        name1 = None
        
        for read in f :
            #deal with the header
            if(read[0] == '@') :
                of.write("%s" % read)
            if(name1 == None) :
                read1 = read
                name1 = read1.split("\t")[0]
            else :
                name2 = read.split("\t")[0]
                if(name1 == name2) :
                    of.write("%s%s" % (read1, read))
                    read1 = None
                    name1 = None
                else :
                    read1 = read
                    name1 = read1.split("\t")[0]
        This assumes that both mates in a pair have the same name (so no /1 or /2 suffixes) and that the reads are name-sorted. If you saved that as "blah.py" and made it executable, then usage would be:

        Code:
        samtools view name_sorted.bam | blah.py | samtools view -bSo name_sorted.filtered.bam -
        I haven't tested that it prints the header correctly, so you may need to fix that! I should note that the generally better solution is that employed by HTSeq. There, the RNEXT and PNEXT of a read are compared to that of its proposed mate to ensure that they match. In your case, those are often not set, so I suspect that wouldn't work.

        Edit: A perl solution could be similar. You'd use while(<>) for the loop and then probably just chomp() that to split things, though perhaps there are more appropriate perl methods than those. The general work-flow could be the same, though.
        Last edited by dpryan; 10-16-2013, 02:14 AM.

        Comment


        • #5
          Thanks a LOT Devon,

          I added -h to the upstream samtools view cmd to forward the sam header and 'continue' to the code to process header lines and directly go to the next loop

          HTML Code:
          for read in f :
             #deal with the header
             if(read[0] == '@') :
                 of.write("%s" % read)
                 continue
          HTML Code:
          samtools view -h <name_sorted.bam> | \
          	bam_re-pair.py | \
          	samtools view -bSo <name_sorted.filtered.bam> -
          I also made a Perl version following your advice that additionally reports counts for all, passed, and failed read lines. Both codes run at identical speed.

          Thanks you really for this code, it helped me a lot

          S

          ### Perl translation of Devon python code
          HTML Code:
          #!/usr/bin/perl -w
          
          # filter unpaired reads from a - read-name sorted - BAM file
          # bam_re-pair.pl
          # author: Stephane Plaisance (translated from python version by Devon Ryan
          # http://seqanswers.com/forums/showthread.php?p=118936#post118936
          # usage:
          # samtools view -h <name_sorted.bam> | \
          #	bam_re-pair.pl | \
          #	samtools view -bSo <name_sorted.filtered.bam> -
          
          use warnings;
          use strict;
          
          # variables
          my $read = "";
          my $read1 = "none";
          my $read2 = "none";
          my $name1 = "none";
          my $name2 = "none";
          
          my ($ln,$ok,$no)=(0,0,0);
          
          while (my $read = <>) {
          
          # forward header lines
          if ($read =~ /^@/){
          	print STDOUT $read;
          	next;
          	}
          	
          # process data
          $ln++;
          if( $name1 eq "none" ){
          	$read1 = $read;
              $name1 = (split("\t", $read1))[0];
          	} else {
          		$name2 = (split("\t", $read))[0];
          		if( $name1 eq $name2 ){
          			# is paired
          			$ok++;
          			print STDOUT sprintf("%s%s", $read1, $read);
          			$read1 = "none";
          			$name1 = "none";
          			} else {
          				# is not paired
          				$no++;
          				$read1 = $read;
          				$name1 = (split("\t", $read1))[0];
          				}
          	}
          }
          
          # report counts
          print STDERR sprintf("\n########################\n# Results\n# processed:\t%8d\n# passed:\t%8d\n# rejected\t%8d\n", $ln, $ok, $no);
          exit 0;
          http://www.bits.vib.be/index.php

          Comment


          • #6
            Cool, glad that's working for you. I totally forgot the -h in my example and only added the header stuff after the fact, which is why I didn't notice the missing continue :P Glad you were able to fix that properly!

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Advancing Precision Medicine for Rare Diseases in Children
              by seqadmin




              Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
              12-16-2024, 07:57 AM
            • seqadmin
              Recent Advances in Sequencing Technologies
              by seqadmin



              Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

              Long-Read Sequencing
              Long-read sequencing has seen remarkable advancements,...
              12-02-2024, 01:49 PM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, 12-17-2024, 10:28 AM
            0 responses
            26 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 12-13-2024, 08:24 AM
            0 responses
            42 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 12-12-2024, 07:41 AM
            0 responses
            28 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 12-11-2024, 07:45 AM
            0 responses
            42 views
            0 likes
            Last Post seqadmin  
            Working...
            X