Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • bsmooth aligner

    Hi,

    I tried to use bsmooth for bs-seq alignment for a while. However, still can't figure out the exact command used for alignment. Below is error mesg

    time perl ~tli/bin/bsmooth-align-0.8.1/bin/bswc_bowtie2_align.pl --bowtie2=/usr/bin/bowtie2 --bscpg --fastq --sam=bsm --fr --ff --rf --rr --no-unpaired --out=. -- ~tli/gc3018/genome/hg19bsmooth/hg19.idx -- ~tli/gc3018/genome/hg19/chr*fa -- simulated_1.fastq -- simulated_2.fastq -- --
    Original index name: /gscuser/tli/gc3018/genome/hg19bsmooth/hg19.idx
    Stripped index name: /gscuser/tli/gc3018/genome/hg19bsmooth/hg19.idx
    Reference files:
    /gscuser/tli/gc3018/genome/hg19/chr1.fa
    /gscuser/tli/gc3018/genome/hg19/chr10.fa
    /gscuser/tli/gc3018/genome/hg19/chr11.fa
    /gscuser/tli/gc3018/genome/hg19/chr12.fa
    /gscuser/tli/gc3018/genome/hg19/chr13.fa
    /gscuser/tli/gc3018/genome/hg19/chr14.fa
    /gscuser/tli/gc3018/genome/hg19/chr15.fa
    /gscuser/tli/gc3018/genome/hg19/chr16.fa
    /gscuser/tli/gc3018/genome/hg19/chr17.fa
    /gscuser/tli/gc3018/genome/hg19/chr18.fa
    /gscuser/tli/gc3018/genome/hg19/chr19.fa
    /gscuser/tli/gc3018/genome/hg19/chr2.fa
    /gscuser/tli/gc3018/genome/hg19/chr20.fa
    /gscuser/tli/gc3018/genome/hg19/chr21.fa
    /gscuser/tli/gc3018/genome/hg19/chr22.fa
    /gscuser/tli/gc3018/genome/hg19/chr3.fa
    /gscuser/tli/gc3018/genome/hg19/chr4.fa
    /gscuser/tli/gc3018/genome/hg19/chr5.fa
    /gscuser/tli/gc3018/genome/hg19/chr6.fa
    /gscuser/tli/gc3018/genome/hg19/chr7.fa
    /gscuser/tli/gc3018/genome/hg19/chr8.fa
    /gscuser/tli/gc3018/genome/hg19/chr9.fa
    /gscuser/tli/gc3018/genome/hg19/chrM.fa
    /gscuser/tli/gc3018/genome/hg19/chrX.fa
    /gscuser/tli/gc3018/genome/hg19/chrY.fa
    Name-map files:
    bowtie2 executable: /usr/bin/bowtie2
    bowtie2 arguments: simulated_1.fastq
    Methylation output directory: .
    Keep Watson/Crick .sam: bsm
    Keep Watson/Crick .bam: no
    Read-level measurements: just CpGs
    Four-strand?: no
    Paired-end?: no
    Unpaired inputs:
    simulated_2.fastq
    Ignore evidence from unparied alignment of paired mate: 1
    Input format: fastq
    Skip first # reads: (no skipping)
    Stop after # reads: (no limit)
    Temporary directory: /tmp/5PALCMIVoU.tmp
    Intermediate Watson file name: /tmp/5PALCMIVoU.tmp/bsreads_watson.tab6
    Intermediate Crick file name: /tmp/5PALCMIVoU.tmp/bsreads_crick.tab6
    Locating tools...
    Found bowtie2: /usr/bin/bowtie2
    Checking index...
    Checking reads...
    Parsing name-map files...
    Creating named pipes...
    Forked off paired feeder process (pid=6456)...
    Opening simultaneous alignment pipes:
    /usr/bin/bowtie2 -x /gscuser/tli/gc3018/genome/hg19bsmooth/hg19.idx.watson simulated_1.fastq --quiet --ff --norc --sam-no-qname-trunc --reorder --tab6 /tmp/5PALCMIVoU.tmp/bsreads_watson.tab6 |
    /usr/bin/bowtie2 -x /gscuser/tli/gc3018/genome/hg19bsmooth/hg19.idx.crick simulated_1.fastq --quiet --ff --norc --sam-no-qname-trunc --reorder --tab6 /tmp/5PALCMIVoU.tmp/bsreads_crick.tab6 |
    Parsing reference...
    Warning: Output file 'simulated_1.fastq' was specified without -S. This will not work in future Bowtie 2 versions. Please use -S instead.
    Warning: Output file 'simulated_1.fastq' was specified without -S. This will not work in future Bowtie 2 versions. Please use -S instead.
    Writing reads from file 'simulated_2.fastq' ...

    Looks like it used simulated_1.fastq as bowtie2 arguments.

    Thanks!

    litd

  • #2
    Try:
    Code:
    time perl ~tli/bin/bsmooth-align-0.8.1/bin/bswc_bowtie2_align.pl --sam=bsm --fr --no-unpaired --out=. -- ~tli/gc3018/genome/hg19bsmooth/hg19.idx -- ~tli/gc3018/genome/hg19/chr*fa -- -- simulated_1.fastq -- simulated_2.fastq
    I also removed the options that weren't needed (e.g., --bscpg is on by default). You had specified an all possible mate orientations, which doesn't work. You also forgot the "--" to indicate the <bowtie2-args> are finished, so it's taking the first fastq file as being options for bowtie2! BTW, you might want to specify "-p 3" or something in the <bowtie2-args> portion to have things finish a bit faster.

    Also, I hope you have better luck than me with bsmooth. I'm trying to get the darn thing to run just for comparison's sake and it's giving me other random errors (there should be a law against bioinformatics programs being written in perl!).

    Comment


    • #3
      Thanks. Finally I have some alignment results.

      Comment


      • #4
        perl ~tli/bin/bsmooth-align-0.8.1/bin/bswc_bowtie2_align.pl --bowtie2=/usr/bin/bowtie2 --sam=bsm --fr --no-unpaired --out=. -- ~tli/gc3018/genome/hg19bsmooth/hg19.idx -- ~tli/gc3018/genome/hg19/chr*fa -- -- simulated_1.fastq -- simulated_2.fastq

        I used the above command to have the first alignment result using simulated data. However the result (below) of crick strand totally didn't matched with the simulated seq data, while the watson strand looks OK. Why?

        $ sed '1,27d' bsm.crick.sam | head -5
        4_chr1:95788238-95788581_R1 67 chr1 153462041 40 100M = 153462284 343 TATAGGTAAATATAGGGAGAAATTATGTTTGGTAAATGTAATTTTTAGTTGTGTAGTTTTTAATTGATAAATGTTTTAATGATGGAAAAAATAAAGGATA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:16A83 YS:i:-18 YT:Z:CP YO:Z:CACAGGCAAATATAGGGAGAAACCATGTTTGGCAAATGTAATTTCTAGTTGTGTAGCTTTCAACTGATAAATGCCTTAATGATGGAAAAAACAAAGGACA XB:Z:C
        4_chr1:95788238-95788581_R2 131 chr1 153462284 40 100M = 153462041 -343 TATGTTATTTTATTATAGTTTATTTTATTTTATAAGTTTGTTTTGTTTATTAATAAGTTTTTGAATTATTTGAGGGTGTGTTTTAATTATTTTTTTTTTG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-18 XN:i:0 XM:i:3 XO:i:0 XG:i:0 NM:i:3 MD:Z:60G25A1T11 YS:i:-6 YT:Z:CP YO:Z:CATGTCATTTTATTACAGTCCATCTTATTCCATAAGTTTGCCTTGTTTATCAACAAGCTTTTGAATTACTTGAGGGTGTGTCTTAATTATTCTTTTTTTG XB:Z:CR
        6_chr2:30786811-30787218_R1 67 chr2 212412156 40 100M = 212412463 407 TATGATTTTATTTTAATTAGTTTTTATTTATTAAATAATTTTTAAAGTTGGTATTTTTTGGTGTATATTTTTGTATTTATAAATATAAATTTATTATTTT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-12 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:9T4T85 YS:i:-18 YT:Z:CP YO:Z:TATGACCTTACTTTAATTAGTTTTTATTTACCAAATAACTTTCAAAGTTGGCATTCTTTGGTGTACATTTTCGCATTCACAAATATAAACCCATCATTTC XB:Z:C
        6_chr2:30786811-30787218_R2 131 chr2 212412463 40 100M = 212412156 -407 AGTATTATTTTTTGTAATTTAATTGTGTTTAGAATTTTAATTGTAAAAAATTAAAAATATATTATATAGTTGGTATATATTTTAGAGATTTGATTAGTAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-18 XN:i:0 XM:i:3 XO:i:0 XG:i:0 NM:i:3 MD:Z:15G8A1T73 YS:i:-12 YT:Z:CP YO:Z:AGTATTATCCTCTGTAATCCAACCGCGTCTAGAATCTCAACTGTAAAAAATCAAAAATATACTACACAGTTGGCATACATTTTAGAGATCTGACTAGTAT XB:Z:CR


        $ sed '1,27d' bsm.watson.sam | head
        1_chr15:37586245-37586569_R1 67 chr15 37586245 40 100M = 37586469 324 TTGGAGTATTTAGATAGGTTATGTAGGAAATTTAAGATGTATAAAATTTAATTTAGTTGTTGTTAATAAATTTGTAATTGTTAGTATAAATAAGATATAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-18 XN:i:0 XM:i:3 XO:i:0 XG:i:0 NM:i:3 MD:Z:2T9T60A26 YS:i:-12 YT:Z:CP YO:Z:CTGGAGTACCTAGATAGGCCATGCAGGAAATTTAAGATGTATAAAATTCAATTTAGCTGCTGTCAACAAACCTGTAATTGCTAGTATAAACAAGACACAT XB:Z:W
        1_chr15:37586245-37586569_R2 131 chr15 37586469 40 100M = 37586245 -324 ATTTATATATGTATAGAGATATATATATATAAATGTATATTAATATTATATTTTTATTATATATATGTATATTTAAATTTGTATATATTTTGTAGTTTTT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-12 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:2A60G36 YS:i:-18 YT:Z:CP YO:Z:ACTTATATATGCACAGAGATATATACACATAAATGTATACCAATACCATATTTCTACCACACACACGCACACTCAAATTCGTATATATTTTGTAGTTCTT XB:Z:WR
        2_chr6:134928088-134928318_R1 67 chr6 134928088 40 100M = 134928218 230 AGTTATTTTATTAGGTTTTAGTGAAATTTTTTGAGTAAGTTTTTTAGGTATTTTGTTTATAGAGATTTGAATGATATTGTTTTAGTTTTTGGATATTTAG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-12 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:7G70T21 YS:i:-18 YT:Z:CP YO:Z:AGCCACTTCACCAGGCCTCAGTGAAATTTTCTGAGTAAGCTTCCTAGGTATCCTGTCCATAGAGATTTGAATGACACCGTTCTAGCCCCTGGACATTTAG XB:Z:W
        2_chr6:134928088-134928318_R2 131 chr6 134928218 40 100M = 134928088 -230 TTTAGTTTTTGTATTTAGTTTTATAATGTGTATTATTTTTTATTTTTTTTGTTGTAGATTTTAGATTTTTTTAGATTGTGTTTTATAGTTTTTTTTTTAA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-18 XN:i:0 XM:i:3 XO:i:0 XG:i:0 NM:i:3 MD:Z:4T5T39T49 YS:i:-12 YT:Z:CP YO:Z:CCTAGCTCTTGTATTTAGTCCCACAACGTGTACCACTCTTCATTTTCTCCGTTGCAGATCTTAGACCCCCTCAGACTGTGCCTTATAGCCCTCTTTTTAA XB:Z:WR


        $ ls ~tli/gc3018/genome/hg19bsmooth/
        hg19.idx.crick.1.bt2 hg19.idx.crick.3.bt2 hg19.idx.crick.rev.1.bt2 hg19.idx.watson.1.bt2 hg19.idx.watson.3.bt2 hg19.idx.watson.rev.1.bt2 wkshpseq.e8311173
        hg19.idx.crick.2.bt2 hg19.idx.crick.4.bt2 hg19.idx.crick.rev.2.bt2 hg19.idx.watson.2.bt2 hg19.idx.watson.4.bt2 hg19.idx.watson.rev.2.bt2 wkshpseq.o8311173


        $ ls ~tli/gc3018/genome/hg19/
        chr1.fa chr11.fa chr13.fa chr15.fa chr17.fa chr19.fa chr20.fa chr22.fa chr4.fa chr6.fa chr8.fa chrM.fa chrY.fa
        chr10.fa chr12.fa chr14.fa chr16.fa chr18.fa chr2.fa chr21.fa chr3.fa chr5.fa chr7.fa chr9.fa chrX.fa

        Comment


        • #5
          I imagine that bsmooth crashed at some point, then. It's a bit strange, but what it does is reverse the strands (i.e., have the chromosomes start at the end and end at the beginning, bisulfite convert that, and then perform the alignment). So, those coordinates are actually from the end of each chromosome, rather than from the beginning. There's a step in its processing where it then reverses that, so all coordinates are from the beginning of each chromosome, but presumably that was missed for some reason.

          Comment


          • #6
            The following is the commands used to index hg19 genome.

            index hg19 genome:

            cd ~/gc3018/genome/

            cat chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa chrM.fa chrX.fa chrY.fa > hg19.fa


            cd ~/gc3018/genome/hg19bsmooth


            bsub -q long -J bsmooth "perl ~/bin/bsmooth-align-0.8.1/bin/bswc_bowtie2_index.pl --name=hg19.idx --bowtie2-build=/usr/bin/bowtie2-build ../hg19.fa"


            index folder:
            $ ls ~/gc3018/genome/hg19bsmooth/
            hg19.idx.crick.1.bt2 hg19.idx.crick.3.bt2 hg19.idx.crick.rev.1.bt2 hg19.idx.watson.1.bt2 hg19.idx.watson.3.bt2 hg19.idx.watson.rev.1.bt2 wkshpseq.e8311173
            hg19.idx.crick.2.bt2 hg19.idx.crick.4.bt2 hg19.idx.crick.rev.2.bt2 hg19.idx.watson.2.bt2 hg19.idx.watson.4.bt2 hg19.idx.watson.rev.2.bt2 wkshpseq.o8311173


            $ ls ~/gc3018/genome/hg19/
            chr1.fa chr11.fa chr13.fa chr15.fa chr17.fa chr19.fa chr20.fa chr22.fa chr4.fa chr6.fa chr8.fa chrM.fa chrY.fa
            chr10.fa chr12.fa chr14.fa chr16.fa chr18.fa chr2.fa chr21.fa chr3.fa chr5.fa chr7.fa chr9.fa chrX.fa

            Comment


            • #7
              I should mention that looking back at my test dataset that was aligned with bsmooth, it seems that all of the alignments on the Crick ("-") strand still have their coordinates reversed. I ended up playing with the bsmooth code a little bit since it wasn't working for me (it turned out that the index was just corrupt, so that was unnecessary), so I'm not sure if this is intentional or a bug I introduced in my local copy (though, since you see the same thing, it's likely intentional). To convert to the actual bounds, I use the following code (which uses the samtools C API), which is just a snippet from the program I use to check alignment accuracy:

              Code:
              //XB is a char*
              //read1 is a bam1_t*
              //header is a bam_header_t*
              //match is an int that defaults to 0 (incorrect alignment)
              //cstart and cstop are the correct start/stop positions, which happens to some times
              //be offset by one (don't ask, I didn't write the read simulator that I used)
              
              //N.B., I left out the chromosome comparison bit
              
              XB = bam_aux2Z(bam_aux_get(read1, "XB"));
              start = read1->core.pos;
              stop = bam_calend(&read1->core, bam1_cigar(read1));
              
              if(read1->core.flag & BAM_FREAD1 && *XB == 'W') { //OT read #1
                  if(cstart == start+1) match=1;
              } else if(read1->core.flag & BAM_FREAD1) { //OB read #1
                  chromend = header->target_len[read1->core.tid];
                  if(cstop == chromend-start) match=1;
              } else if(*XB == 'W') { //OT read #2
                  if(cstop == stop+1) match=1;
              } else { //OB read #2
                  chromend = header->target_len[read1->core.tid];
                  if(cstart == chromend-stop) match=1;
              }
              The important parts are the "chromend = header->target_len[read1->core.tid];" and such, which recalculates things for Crick strand alignments.

              Comment


              • #8
                I contacted the author regarding my question, hopefully he can provide more info on how to do it.

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Current Approaches to Protein Sequencing
                  by seqadmin


                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                  04-04-2024, 04:25 PM
                • seqadmin
                  Strategies for Sequencing Challenging Samples
                  by seqadmin


                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                  03-22-2024, 06:39 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 04-11-2024, 12:08 PM
                0 responses
                30 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 10:19 PM
                0 responses
                32 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 09:21 AM
                0 responses
                28 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-04-2024, 09:00 AM
                0 responses
                53 views
                0 likes
                Last Post seqadmin  
                Working...
                X