  • cleaning partial PE sam data

    Hello there,
    I obtained PE data from Illumina (chr21 subset of NA18507 - ftp://webdata:[email protected]..._100_chr21.bam).

    After a lot of misery and systematic use of picard with LENIENT validation stringency, I managed to extract paired FASTQ reads from the data and remap them to another reference build.

    BUT

    I discovered (among many other serious SAM compliance problems) that not all reads are present in that file and that many pairs have lost one end (which probably did not map to chr21 and was filtered out uncleanly!).

    My question is how to clean such a SAM/BAM file, where the FLAGs indicate paired reads but one read of the pair is no longer present.
    • I cannot use the SAM flags to do it because they are erroneous
    • I could not fix the flags to reflect the true paired status


    Below is the head of the original name-sorted reads, showing the absence of one of the 'EAS51_0210:7:33:5109:13959' reads (there are thousands like that).

    Thanks for your suggestions on which command to use and how to eliminate these reads, so as to obtain a fully paired file on which bam2fastq will run smoothly. Preferably with picard, and not with some fancy perl code keeping only '*' in the 7th column.

    Thanks a lot for your insights,
    Stephane

    Code:
    EAS51_0210:3:6:3797:7459	165	chr21	9719702	255	*	*	0	0	AACCTTTGTTTGGATGGAGCAGTTTGTAAACAATCCTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATAT	GGGGGGGGGGGGGFGGFEGGGGEGGGEGGGFDFBGGEFEFGEEGEGFEGGEGEEED?EEEGEEGBEBDGEEEEED=DCCCEBEEEEEEEAAC@DDB:CCC	H0:i:0	H1:i:0	H2:i:2	SM:i:-1	AS:i:0
    EAS51_0210:3:6:3797:7459	89	chr21	9719702	73	100M	*	0	0	ATATTTGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATAAAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTC	DDC@BEEEEEEGEEBFGEGG@EEDDBEEGEGGGGFFFFGEECGFGGEGGGGGGGGDFGGGFFGGGGFGFEGGGFGGGGGGBGGEGGFGGGGGGGGGGGGG	XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN34	SM:i:73	AS:i:0
    EAS51_0210:7:33:5109:13959	145	chr21	9719707	254	100M	chrY	10653706	0	TGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATAAAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTC	FEGDGEGEEEEGFEGEEGEEDFEGGGGGGEFGFGFAFFFEGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGG	XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T2	SM:i:461	AS:i:0
    EAS25_0078:8:23:14907:11377	165	chr21	9719708	255	*	*	0	0	CTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATATCTTCCCATAAAAACTAGACAGAAGGTTTCTAAGAA	GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGEFGGGGGGGGGEGGGGGGGGGGFGFGDGGGGGFDFBFGCEBFFFEGFF	H0:i:1	H1:i:6	H2:i:60	SM:i:-1	AS:i:0
    EAS25_0078:8:23:14907:11377	89	chr21	9719708	254	100M	*	0	0	TGGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATGAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTCA	EGEEEEGEBGFGGEEEEEGGDEBGFGGGGGGGGGGAGFGGGEGGGGGGGGGGGGGGGGGGGFGGGGGGGGEEGGGGGGGGGGGGGGGGGGGGGGGGGGGG	XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T3	SM:i:461	AS:i:0
    EAS25_0078:3:4:1830:9254	101	chr21	9719709	255	*	*	0	0	TTGAACCTTTGTTTGGATGGAGCAGTTTGTAAACAATCCTTTTGTAGAATCTGCAAAGGTATATTTTTGAGCCCATTGAGGCCTATGGTGAAATACGAAA	GGGGGGGGGGGGGGGGGGFGBGGGGGGGFEGGGGEGGGGGGGGGEGGGGGGEFGGEFGGEFFFFF/&8?@EEECCFGFGGFDFGFEGF?DEEDEFEEFEE	H0:i:0	H1:i:0	H2:i:3	SM:i:-1	AS:i:0
    EAS25_0078:3:4:1830:9254	153	chr21	9719709	254	100M	*	0	0	GGAGCGCTTTGAGGCCTATGGTAAAAAAGGAAATACCATCACATGAAATTCGATGGAAGAATTCTGAGAAACTTCTTTGTGAGGGTTGGATTCATCTCAC	EEBE?EEEEEEEEGEEBEEEBGGGEGGGEFAFFECEEE=EDGGFGGGGDGFFGGGGGGGGGEEGGGGDGGGEGFGFGFGGGGGGGGGDGGGEGGFGGGGG	XD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN36T4	SM:i:461	AS:i:0
    http://www.bits.vib.be/index.php
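
For readers following the listing above, the FLAG column can be decoded bit by bit. The sketch below uses the bit meanings from the SAM specification; the helper name `decode_flag` and the short labels are hypothetical and are not part of samtools or picard.

```python
# Bit values as defined in the SAM specification; the labels are informal.
SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value (hypothetical helper)."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


if __name__ == "__main__":
    # FLAG values that appear in the listing above.
    for flag in (165, 89, 145, 101, 153):
        print(flag, ",".join(decode_flag(flag)))
```

Flag 145, for instance, decodes to paired, reverse, second_in_pair: the record claims to be the second read of a pair, yet no first read with that name appears in the listing.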

  • #2
    Do you really want to clean that file or do you just want the clean and synced fastq files? The latter is actually not terribly difficult (see here and here for suggestions).


    • #3
      I would like to learn how to clean that file so that I can repeat the operation on future data with similar issues.

      Thanks for the links anyway.


      • #4
        Fair enough. The following isn't perl, which I generally loathe, but it's a simple python solution:
        Code:
        #!/usr/bin/env python
        import sys
        
        f = sys.stdin
        of = sys.stdout
        
        read1 = None
        name1 = None
        
        for read in f :
            #deal with the header
            if(read[0] == '@') :
                of.write("%s" % read)
            if(name1 == None) :
                read1 = read
                name1 = read1.split("\t")[0]
            else :
                name2 = read.split("\t")[0]
                if(name1 == name2) :
                    of.write("%s%s" % (read1, read))
                    read1 = None
                    name1 = None
                else :
                    read1 = read
                    name1 = read1.split("\t")[0]
        This assumes that both mates in a pair have the same name (so no /1 or /2 suffixes) and that the reads are name-sorted. If you saved that as "blah.py" and made it executable, then usage would be:

        Code:
        samtools view name_sorted.bam | blah.py | samtools view -bSo name_sorted.filtered.bam -
        I haven't tested that it prints the header correctly, so you may need to fix that! I should note that the generally better solution is that employed by HTSeq. There, the RNEXT and PNEXT of a read are compared to that of its proposed mate to ensure that they match. In your case, those are often not set, so I suspect that wouldn't work.
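
To illustrate the RNEXT/PNEXT comparison mentioned above, here is a minimal sketch of such a mate check. It is not HTSeq's actual code; the function names (`sam_fields`, `next_ref`, `are_mates`) are hypothetical, and it only assumes the standard SAM column order (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, ...).

```python
def sam_fields(line):
    """Split one SAM alignment line into its tab-separated fields."""
    return line.rstrip("\n").split("\t")


def next_ref(fields):
    """Resolve RNEXT, where '=' means 'same reference as RNAME'."""
    rname, rnext = fields[2], fields[6]
    return rname if rnext == "=" else rnext


def are_mates(line1, line2):
    """True if both records share a QNAME and each one points at the
    other's RNAME/POS through its own RNEXT/PNEXT fields."""
    a, b = sam_fields(line1), sam_fields(line2)
    return (
        a[0] == b[0]
        and next_ref(a) == b[2] and a[7] == b[3]
        and next_ref(b) == a[2] and b[7] == a[3]
    )
```

With records like those in the first post, where RNEXT is '*' and PNEXT is 0, this check returns False even for two reads with the same name, which is why the name-only comparison is used in this thread.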

        Edit: A perl solution could be similar. You'd use while(<>) for the loop and then probably split() each line on tabs to get the read name (chomp() would only strip the trailing newline), though perhaps there are more appropriate perl methods than those. The general workflow could be the same, though.
        Last edited by dpryan; 10-16-2013, 02:14 AM.


        • #5
          Thanks a LOT Devon,

          I added -h to the upstream samtools view command to forward the SAM header, and added 'continue' to the code so that header lines are written out and the loop moves straight on to the next line.

          Code:
          for read in f :
              #deal with the header
              if(read[0] == '@') :
                  of.write("%s" % read)
                  continue
          Code:
          samtools view -h name_sorted.bam | \
              bam_re-pair.py | \
              samtools view -bSo name_sorted.filtered.bam -
          I also made a Perl version following your advice that additionally reports counts for all, passed, and failed read lines. Both scripts run at about the same speed.

          Thank you very much for this code; it helped me a lot.

          S

          ### Perl translation of Devon's Python code
          Code:
          #!/usr/bin/perl

          # filter unpaired reads from a read-name-sorted BAM file
          # bam_re-pair.pl
          # author: Stephane Plaisance (translated from the python version by Devon Ryan)
          # http://seqanswers.com/forums/showthread.php?p=118936#post118936
          # usage:
          # samtools view -h <name_sorted.bam> | \
          #	bam_re-pair.pl | \
          #	samtools view -bSo <name_sorted.filtered.bam> -

          use warnings;
          use strict;

          # variables: the pending read (candidate first mate) and the read names
          my $read1 = "none";
          my $name1 = "none";
          my $name2 = "none";

          # counters: data lines read, lines kept (in pairs), lines dropped
          my ($ln, $ok, $no) = (0, 0, 0);

          while (my $read = <>) {

              # forward header lines
              if ($read =~ /^\@/) {
                  print STDOUT $read;
                  next;
              }

              # process data
              $ln++;
              if ($name1 eq "none") {
                  # nothing pending: hold this read until the next line is seen
                  $read1 = $read;
                  $name1 = (split("\t", $read1))[0];
              } else {
                  $name2 = (split("\t", $read))[0];
                  if ($name1 eq $name2) {
                      # is paired: print both lines
                      $ok += 2;
                      print STDOUT $read1, $read;
                      $read1 = "none";
                      $name1 = "none";
                  } else {
                      # is not paired: drop the pending read, hold the current one
                      $no++;
                      $read1 = $read;
                      $name1 = (split("\t", $read1))[0];
                  }
              }
          }

          # a read still pending at end of input has no mate
          $no++ if $name1 ne "none";

          # report counts (all three are counts of read lines)
          printf STDERR "\n########################\n# Results\n# processed:\t%8d\n# passed:\t%8d\n# rejected:\t%8d\n", $ln, $ok, $no;
          exit 0;


          • #6
            Cool, glad that's working for you. I totally forgot the -h in my example and only added the header stuff after the fact, which is why I didn't notice the missing continue :P Glad you were able to fix that properly!
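
For reference, the fixes discussed in this thread (forwarding header lines, skipping them with `continue`, and counting lines as in the Perl version) can be combined into one Python sketch. It rests on the same assumptions as the scripts above: name-sorted input and identical names for both mates. The function names `qname` and `filter_pairs` are hypothetical, and the demo data at the bottom is made up.

```python
#!/usr/bin/env python
import io
import sys


def qname(line):
    """Read name (first tab-separated field) of a SAM alignment line."""
    return line.split("\t", 1)[0]


def filter_pairs(lines, out):
    """Copy header lines and complete pairs from `lines` to `out`.

    Returns (processed, passed, rejected) counts of alignment lines.
    """
    processed = passed = 0
    pending = None  # read waiting for its mate
    for line in lines:
        if line.startswith("@"):
            # header line: forward it and move on
            out.write(line)
            continue
        processed += 1
        if pending is not None and qname(pending) == qname(line):
            # same name as the pending read: keep both
            out.write(pending + line)
            passed += 2
            pending = None
        else:
            # no match: any previously pending read is dropped
            pending = line
    return processed, passed, processed - passed


if __name__ == "__main__":
    # Small in-memory demo; read 'b' has lost its mate.
    demo = io.StringIO("@HD\tVN:1.0\na\t1\na\t2\nb\t3\nc\t4\nc\t5\n")
    counts = filter_pairs(demo, sys.stdout)
    sys.stderr.write("processed: %d\npassed: %d\nrejected: %d\n" % counts)
```

To use it in the pipeline above, the demo block would be replaced by a call such as `filter_pairs(sys.stdin, sys.stdout)`.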
