Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    @Pepe I am sure you can make a script to perform the same task, though it is conceptually easier to avoid dropping the unmapped reads while you are processing them with 'bwa bwasw' one by one rather than figuring out later what you dropped. I also wanted to mention that occasionally 'bwa bwasw' reports multiple alignments in the sam file for the same read. The only way to modify that behavior at the moment is to get into the source code.

    Comment


    • #17
      Unmapped reads bwasw + patch

      Originally posted by freeseek View Post
      This simple patch:
      Code:
      --- bwtsw2_aux.c
      +++ bwtsw2_aux.c
      @@ -520,8 +520,10 @@
       			bsw2_resolve_query_overlaps(b[0], opt.mask_level);
       		} else b[1] = 0;
       		// generate CIGAR and print SAM
      -		gen_cigar(&opt, l, seq, pac, b[0]);
      -		print_hits(bns, &opt, p, b[0]);
      +		if(b[0]->n > 0) {
      +		    gen_cigar(&opt, l, seq, pac, b[0]);
      +		    print_hits(bns, &opt, p, b[0]);
      +		}
       		// free
       		free(seq[0]);
       		bsw2_destroy(b[0]);
      @@ -583,6 +585,7 @@
       	for (i = 0; i < _seq->n; ++i) {
       		bsw2seq1_t *p = _seq->seq + i;
       		if (p->sam) printf("%s", p->sam);
      +		else printf("%s\t4\t*\t0\t0\t*\t*\t0\t0\t%s\t%s\n", p->name, p->seq, p->qual);
       		free(p->name); free(p->seq); free(p->qual); free(p->sam);
       		p->tid = -1; p->l = 0;
       		p->name = p->seq = p->qual = p->sam = 0;
      will make bwasw spit out also the unmapped reads, though it only works with .fastq files, so use it carefully. To use it, create a file contained the above code, like "bwtsw2_aux.diff", enter the bwa source directory, and issue the command "patch < bwtsw2_aux.diff". (PS You need to be careful to not make the tabs become spaces, or the patch will not work)
      Hi Freeseek,



      I'm trying to align my sample illumina bacterial genome to an already published reference bacterial genome. The aim is to find out what genes are present in my sample genome, but missing in the reference. The plan of action was to run bwasw, and then pull out the sequence in the sample genome that do not align to the reference, then look at what these sequences are.

      I aligned my query genome to the ref using bwasw.

      The output was in sam format. I converted that to bam format using

      samtools view -bS aln.sam > aln.bam

      Then I ran the command lines quoted above.

      When I headed my mapped.sam file, it appears to have worked, however, my unmapped.sam file is empty,the size is zero.
      This is consistent for two of my sample genomes. Reading this thread, it seems that bwasw does not give out unmapped reads, thus I was trying your patch.

      This is the message I'm getting:
      bgansw:~/bwa$ patch < bwtsw2_aux.diff
      can't find file to patch at input line 3
      Perhaps you should have used the -p or --strip option?
      The text leading up to this was:
      --------------------------
      |--- bwtsw2_aux.c
      |+++ bwtsw2_aux.c
      --------------------------
      File to patch:

      I'm probably going to ask a very silly question. I'm a newbie, so am not good with scripts. What should I enter in file to patch? Do I have to make changes or input a file name in the script? I've got paired end illumina hiseq data.

      Any help would be greatly appreciated.

      bgansw

      Comment


      • #18
        Hi Freeseek,

        I inputted my faq file in the file to patch. That gave me .fastq.orig file and .fastq.rej

        Copy pasting the .orig file

        bgansw:~/bwa$ head 6_combined_paired.faq.orig
        @GAPC_0027_FC:7:1:4477:1032#TGACCA/1
        AGTATGGGGGAAATGCAAGCAAGCATTNATGCTAGTGGTTGTCAAATAGTTACCGTAGCTGTGCGCA
        +
        hhhhhhhhhhdhhghhhhhhhheddeeBededaad^dfdfffhhhhhhfhhfhhfceffdfbhhhad
        @GAPC_0027_FC:7:1:4477:1032#TGACCA/2
        TGCTTCTAAGGTACCAATGGNNNCAGGTAACAGGTATTTAGGATCGGGTATGACCTNNNNTTTAACA
        +
        hhhhhhhhhhhefhfhf_]_BBB`]aa^acdcff`fffffffhghhgfafccY][ZBBBBW\\[[[[
        @GAPC_0027_FC:7:1:4986:1032#TGACCA/1
        GAAAATCCGAAAGAACCGGCAATGATTNTTTCTACGTTCCGTAAAGAAAGCAATGGTGATAAGCCTG


        Am copy pasting the .fastq.rej file:


        bgansw:~/bwa$ cat 6_combined_paired.faq.rej
        ***************
        *** 520,527 ****
        bsw2_resolve_query_overlaps(b[0], opt.mask_level);
        } else b[1] = 0;
        // generate CIGAR and print SAM
        - gen_cigar(&opt, l, seq, pac, b[0]);
        - print_hits(bns, &opt, p, b[0]);
        // free
        free(seq[0]);
        bsw2_destroy(b[0]);
        --- 520,529 ----
        bsw2_resolve_query_overlaps(b[0], opt.mask_level);
        } else b[1] = 0;
        // generate CIGAR and print SAM
        + if(b[0]->n > 0) {
        + gen_cigar(&opt, l, seq, pac, b[0]);
        + print_hits(bns, &opt, p, b[0]);
        + }
        // free
        free(seq[0]);
        bsw2_destroy(b[0]);
        ***************
        *** 583,588 ****
        for (i = 0; i < _seq->n; ++i) {
        bsw2seq1_t *p = _seq->seq + i;
        if (p->sam) printf("%s", p->sam);
        free(p->name); free(p->seq); free(p->qual); free(p->sam);
        p->tid = -1; p->l = 0;
        p->name = p->seq = p->qual = p->sam = 0;
        --- 585,591 ----
        for (i = 0; i < _seq->n; ++i) {
        bsw2seq1_t *p = _seq->seq + i;
        if (p->sam) printf("%s", p->sam);
        + else printf("%s\t4\t*\t0\t0\t*\t*\t0\t0\t%s\t%s\n", p->name, p->seq, p->qual);
        free(p->name); free(p->seq); free(p->qual); free(p->sam);
        p->tid = -1; p->l = 0;
        p->name = p->seq = p->qual = p->sam = 0;

        My apologies, since I dont know if I'm on the right track or not. Any help would be greatly appreciated.

        Regards

        bgansw

        Comment


        • #19
          If you have Illumina reads, just use bwa aln. It doesn't much matter how long your reads are, you are only aligning to a bacterial genome, how long can it take?

          The post you are citing is a year old, do you know that it's suppsoed to work for whatever version of bwa you have?

          Comment


          • #20
            Hi swbarnes2,

            I ran bwa using the aln parameter, instead of the bwasw, but am not sure that its worked. The output .sam file is only 4K, while the bwasw output file was 796K. Additionally, the aln.sam file looks nothing like the bwasw.sam file. It doesnt contain reads, looks more like a .bam file. The aln.sam file also doesnt contain @SQ headers, while the bwasw file contained them, so I'm unable to convert the sam file to .bam.

            I tried

            samtools faidx ref.fa (but I get an error - [fai_build_core] different line length in sequence 'NZ_A01000034'.
            Segmentation fault

            So am stuck...

            I would really appreciate any help.

            Comment


            • #21
              You realize that aln doesn't output a sam file, right? You put the outputs of aln through samse or sampe, and they make the .sam file. I align to bacteria all the time with 60-mers and use bwa aln and sampe.

              And your faidx error means exactly what it looks like it means. bwa doesn't mind that particular problem, but you can't index the fasta if the line lengths vary within a sequence. I don't see why you need to index the fasta at this point anyway. Samtools view does not require an indexed fasta to make a .bam file. You can pipe samtools samse or sampe straight into samtools view, and not have to deal with a bulky .sam file. I do it all the time.

              4 K may be too small, but maybe not.

              For bacteria, these are the commands you want:

              bwa index bacteria_ref.fasta; bwa aln bacteria_ref.fasta read1.fq > R1.sai; bwa aln bacteria_ref.fasta read2.fq > R2.sai; bwa sampe bacteria_ref.fasta R1.sai R2.sai read1.fq read2.fq | samtools view -bSho out.bam -; samtools sort out.bam sort; samtools view -bf 4 > unmapped.bam
              You could throw in a -r `@RG\tID:foo\tSM:bar' in to the sampe step, to put read groups in. You could add a -u to the first samtools view step, as it makes that step run a little faster by not compressiong the .bam. The sort command will complain that the .bam lacks the right EOF, but it will still run correctly. Once you have the sorted file, you don't need the unsorted one. There are a couple of ways to convert a .bam to a fastq. Picard will do it. Velvet will take the .bam as is, and since de novo assembly is what you want to do anyway, that works out fine.

              Comment


              • #22
                extract unmapped reads from bwa - convert to fastq

                Hi Swarnes2,

                Thank you very very much for your detailed post. Since I'm a newbie, the post was direct, easy to understand and very helpful. Thank you very very much. I'm very grateful!

                I ran bwa aln , then sampe, got an output_unmapped.bam file

                I also converted the output_unmapped.bam file to fasta format.

                This is what the unmapped reads look like:

                >GAPC_0027_FC:7:1:4477:1032#TGACCA
                TGTTAAANNNNAGGTCATACCCGATCCTAAATACCTGTTACCTGNNNCCATTGGTACCTTAGAAGCA
                >GAPC_0027_FC:7:1:4986:1032#TGACCA
                GAAAATCCGAAAGAACCGGCAATGATTNTTTCTACGTTCCGTAAAGAAAGCAATGGTGATAAGCCTG
                >GAPC_0027_FC:7:1:4986:1032#TGACCA
                GATAGGGGCTACTTACTCAGNNNCAGCAGCTTCACCTTTCTTCTCTGCATTGTGTANNNNAGCAGAA
                >GAPC_0027_FC:7:1:6525:1034#TGACCA

                There are over 100 of these little sequences in the unmapped file.

                I tried to put the .bam file into velvet

                velveth UNMAPPED_06_aln_sampe 55 -fastq -shortPaired out_06_aln_sampe.bam
                [0.000000] Reading FastQ file out_06_aln_sampe.bam
                velveth: out_06_aln_sampe.bam does not seem to be in FastQ format

                Am I doing something wrong?

                I've been trying to use picard - bam to fastq and other fasta to fastq convertors available online, but havent got them working yet. Once I generate a fastq file, would aligning these reads in velvet or soapdenovo be the next step, to finding reads present in my sample genome but not in the reference genome.

                Thank you for your advice and help. I really appreciate it.

                Regards

                bgansw

                Comment


                • #23
                  Hi Swarnes2,

                  I was able to convert the unmapped.bam files to fastq and then assemble them in velvet.

                  I'm annotating them now.

                  Thank you. I guess I'm sorted for now.

                  Again, thank you very very much for your help.

                  Regards

                  bgansw

                  Comment


                  • #24
                    Sure, but new versions of velvet will take .bams. Of course, you have to specify that in the velvet command line.

                    Bams have the virtue of being very nicely compressed. Of course, fastq.gz is also pretty compressed, and both bwa and velvet will take those too.

                    My guess is that it's not going to show you much. Looks like those reads failed to align because something happened to the instrument at due to quality issues, and not because there was no suitable reference for them to align to. The pattern of the N's suggest that the instrument had bubbles or something during certain cycles, and those letters just couldn't be called.

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Current Approaches to Protein Sequencing
                      by seqadmin


                      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                      04-04-2024, 04:25 PM
                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, 04-11-2024, 12:08 PM
                    0 responses
                    28 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 10:19 PM
                    0 responses
                    31 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 09:21 AM
                    0 responses
                    27 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-04-2024, 09:00 AM
                    0 responses
                    52 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X