Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • samtools converting sam>bam

    Hi, Im completely new to analysis of seq data. Additionally, am completely new to using code of any kind so its all a bit of a struggle....

    I've been provided with mapped sequencing files in .sop format from BGI (ChIP-seq). All I want to do for now is be able to view a file in IGV... So far I've used a perl script to convert the .sop to a .sam. I tried opening the .sam in IGV but it said it needs to be indexed and wouldn't open. Aside from this others have suggested converting to bam as this would be a more manageable file size. I've downloaded samtools and think I have this running. I've put my .sam file into the samtools folder and used the following command

    $ view -b -S file.sam > file.bam

    But keep getting

    Vim: Warning: Output is not to a terminal

    After which the terminal locks up and I have to close it.....

    Any help or other suggestions would be greatly appreciated.....

  • #2
    samtools converting sam>bam

    The command should be

    Code:
    $ samtools view -b -S file.sam > file.bam
    Then you will need to use

    Code:
    $ samtools sort file.bam file-sorted
    followed by

    Code:
    $ samtools index  file-sorted.bam
    in order to get an indexed file.

    If you just type

    Code:
    $ samtools
    or samtools followed by the name of one of the samtools commands, you will get a few lines of help giving the correct syntax for that command,

    Comment


    • #3
      Still not working

      Thank you so much for your help. I tried this but it comes up with the error:

      -bash: samtools: command not found

      I think I might have done something wrong when I installed samtools...... (I really am a bit out of my depth here). I downloaded the samtools-0.1.19.tar file, expanded it and managed to compile it using the 'make' command (I had to download xcode for mac before this would work..).

      My impression was I would be able to open terminal. Change directory to the directory with my .sam files in it (using cd command), and then use the code:

      $ samtools view -b -S file.sam > file.bam

      (with the word 'file' replaced with the actual name of my file...)

      However, when I do this I just get the error

      -bash: samtools: command not found

      SO instead what I did was put my .sam file into the samtools-0.1.19 folder then went to terminal opened that folder using the cd xxxx/xxx/xxx/samtools-0.1.19

      then just typed $ view -b -S file.sam > file.bam (again replacing the word 'file' with the actual name of my file...)

      This is where I get the error:

      Vim: Warning: Output is not to a terminal

      It seems I can run other commands from within the samtools-0.1.19 folder, for example I can just type "view" and as suggested by you, I get lines of info about that command....

      Honestly not sue what I've done wrong......

      Comment


      • #4
        When you type $view, you are getting a different program called VIM, not samtools.

        If you get
        -bash: samtools: command not found

        that means that samtools is not in your PATH (the directories where the computer looks for command files).
        Add the path to the samtools files to the PATH variable, which should be in a file in your home directory called something like .profile or .bashrc or .bash_profile.

        Or you can just specify the complete path to the samtools files when running samtools.


        More info on running samtools can be found on the webpages:

        Comment


        • #5
          Originally posted by Kienan View Post
          All I want to do for now is be able to view a file in IGV... So far I've used a perl script to convert the .sop to a .sam. I tried opening the .sam in IGV but it said it needs to be indexed and wouldn't open.
          One alternative tool to IGV is Tablet, which can actually load and display SAM files directly. This is very handy for small projects like bacteria, but is too slow for large SAM files. For large datasets you should convert SAM to a sorted and indexed BAM file, which is what Tablet most assembly/mapping viewers expect or recommend.
          Last edited by maubp; 07-12-2013, 06:46 AM. Reason: clarity

          Comment


          • #6
            To add to what others are saying, if you know the path to where you put samtools on the computer, you can do this:

            /home/user/pathToSamtools/samtools view -uS <input.sam> | samtools sort - <output>

            That will read in the .sam file (-S tells it that it is a .sam file), generate an uncompressed .bam file (that's what -u does), and the pipe command ("|") then sends that output directly into samtools sort. Even though you only type "output", it will actually output "output.bam"

            Then do:

            /home/user/pathToSamtools/samtools index output.bam

            And you'll get the index file as well.

            Comment


            • #7
              Another Problem

              Hi Guys,

              Thank you all so much for your help and putting up with my complete naivety, with all of this.

              I finally got this to work (though have to type the path for samtools every time as I cannot for the life of me figure out where the executable path is on my mac...) and now I am getting a new error:

              [samopen] no @SQ lines in the header.
              [sam_read1] missing header? Abort!

              From what I've been reading this suggests that there is something wrong with the .sam files I have generated, though not sure what to do about this as I generated them from aligned soap files (provided by BGI from a ChIP-seq experiment) using the per script soap2sam.pl.

              Heisman, I also tried your suggestion but I get a different error when using your suggested command (samtools view -uS <input.sam> | samtools sort - <output>)

              The error I get is:

              -bash: syntax error near unexpected token `|'

              Comment


              • #8
                I haven't worked on Macs; perhaps piping doesn't work?

                To convert the .sam file to a .bam file, there probably needs to be proper headers in the .sam file. Do you know the exact reference sequence the data was aligned too?

                The header lines can look like this (tab-delimited), if the alignment was done to hg19 and only to chromosomes 1-22 and X/Y:

                Code:
                @HD     VN:1.4  GO:none SO:coordinate
                @SQ     SN:chr1 LN:249250621
                @SQ     SN:chr2 LN:243199373
                @SQ     SN:chr3 LN:198022430
                @SQ     SN:chr4 LN:191154276
                @SQ     SN:chr5 LN:180915260
                @SQ     SN:chr6 LN:171115067
                @SQ     SN:chr7 LN:159138663
                @SQ     SN:chr8 LN:146364022
                @SQ     SN:chr9 LN:141213431
                @SQ     SN:chr10        LN:135534747
                @SQ     SN:chr11        LN:135006516
                @SQ     SN:chr12        LN:133851895
                @SQ     SN:chr13        LN:115169878
                @SQ     SN:chr14        LN:107349540
                @SQ     SN:chr15        LN:102531392
                @SQ     SN:chr16        LN:90354753
                @SQ     SN:chr17        LN:81195210
                @SQ     SN:chr18        LN:78077248
                @SQ     SN:chr19        LN:59128983
                @SQ     SN:chr20        LN:63025520
                @SQ     SN:chr21        LN:48129895
                @SQ     SN:chr22        LN:51304566
                @SQ     SN:chrX LN:155270560
                @SQ     SN:chrY LN:59373566
                You can generate this yourself using PicardTools if you have the reference sequence on your computer; it's explained here: http://gatkforums.broadinstitute.org...e-as-reference

                Then you can take the part that looks like what I copy and pasted above and attach it to the head of your sam files (using "cat header.lines sam.file > full_sam.file").

                If you don't know the reference sequence then you can take your sam file and do "cut -f3 sam.file | sort | uniq > temp.1" to see what chromosomes were used for aligning (or if you can't pipe do "cut -f3 sam.file > temp.1" followed by "sort temp.1 > temp.2" and then "uniq temp.2 > temp.3")

                That still won't tell you if it's hg18 or hg19 that aligning was done to (or other versions if there are a bunch of "weird" chromosome names in the sam file).

                So if you can figure out the exact reference sequence used for aligning (or get a copy of it), that would be best; if not we can do some trial and error to figure it out adequately.
                Last edited by Heisman; 07-14-2013, 08:33 AM.

                Comment


                • #9
                  Adding Header to.sam files

                  Hi Heisman,

                  Thanks for your help. It was definitely aligned to hg19. I've managed to open the sam files I've generated using BamSeq (just something I found on the net) and it appears as though they have no header lines as you've suggested but I have no idea how to add them to the sam files I have........

                  Comment


                  • #10
                    Like I said in the previous post, if you have a copy of the reference sequence on your computer and you have PicardTools installed, you can do what was explained in the link in the previous post to the gatkforums. Then you will want to concatenate the header lines to your sam file: "cat header_lines sam.file > reformatted.sam"

                    If you don't have a copy of the reference sequence, you can download it from the UCSC website (except it's down right now; otherwise I'd provide the link).

                    If you do "cut -f3 sam.file | sort | uniq > temp.1" like I said previously you can see all of the chromosomes that were used for alignment. You can do that and report back; and figure out if you have a copy of the reference sequence.

                    Comment


                    • #11
                      slowly making progress

                      OK I've run the "cut -f3 sam.file | sort | uniq > temp.1" command (I'd done it before but kept typing samtools before hand as for some reason I had it in my head that it was a samtools command and couldn't figure out why it wouldn't work).

                      Can I just check when you write "sam.file" in this command do you mean I should replace this with "filename.sam"?

                      This is what I did and I've generated the type.1 file but now of course have no idea how to open it/look at it (you really are dealing with a complete newby here......., sorry).

                      Comment


                      • #12
                        Ah, didn't realize you were that new. You should ASAP google for a linux tutorial and go through it. There are quite a few online and any should work.

                        "less type.1" will let you look through a file. Then hit "q" to quit.

                        "cat type.1" will output the file to the screen. Not necessary here though; you can use less and then copy and paste the output here if you'd like (just highlight everything and ctr+c ctr+v).

                        Comment


                        • #13
                          Type.1 contents

                          Sorry, I'll do the tutorial ASAP, Sorry for taking up so much screen space....

                          chr1
                          chr10
                          chr11
                          chr11_gl000202_random
                          chr12
                          chr13
                          chr14
                          chr15
                          chr16
                          chr17
                          chr17_ctg5_hap1
                          chr17_gl000203_random
                          chr17_gl000204_random
                          chr17_gl000205_random
                          chr17_gl000206_random
                          chr18
                          chr18_gl000207_random
                          chr19
                          chr19_gl000208_random
                          chr19_gl000209_random
                          chr1_gl000191_random
                          chr1_gl000192_random
                          chr2
                          chr20
                          chr21
                          chr21_gl000210_random
                          chr22
                          chr3
                          chr4
                          chr4_ctg9_hap1
                          chr4_gl000193_random
                          chr4_gl000194_random
                          chr5
                          chr6
                          chr6_apd_hap1
                          chr6_cox_hap2
                          chr6_dbb_hap3
                          chr6_mann_hap4
                          chr6_mcf_hap5
                          chr6_qbl_hap6
                          chr6_ssto_hap7
                          chr7
                          chr7_gl000195_random
                          chr8
                          chr8_gl000196_random
                          chr8_gl000197_random
                          chr9
                          chr9_gl000198_random
                          chr9_gl000199_random
                          chr9_gl000200_random
                          chr9_gl000201_random
                          chrM
                          chrUn_gl000211
                          chrUn_gl000212
                          chrUn_gl000213
                          chrUn_gl000214
                          chrUn_gl000215
                          chrUn_gl000216
                          chrUn_gl000217
                          chrUn_gl000218
                          chrUn_gl000219
                          chrUn_gl000220
                          chrUn_gl000221
                          chrUn_gl000222
                          chrUn_gl000223
                          chrUn_gl000224
                          chrUn_gl000225
                          chrUn_gl000226
                          chrUn_gl000227
                          chrUn_gl000228
                          chrUn_gl000229
                          chrUn_gl000230
                          chrUn_gl000231
                          chrUn_gl000232
                          chrUn_gl000233
                          chrUn_gl000234
                          chrUn_gl000235
                          chrUn_gl000236
                          chrUn_gl000237
                          chrUn_gl000238
                          chrUn_gl000239
                          chrUn_gl000240
                          chrUn_gl000241
                          chrUn_gl000242
                          chrUn_gl000243
                          chrUn_gl000244
                          chrUn_gl000245
                          chrUn_gl000246
                          chrUn_gl000247
                          chrUn_gl000248
                          chrUn_gl000249
                          chrX
                          chrY

                          Comment


                          • #14
                            The link is still down but you are going to want to download a copy of the reference sequence from here: http://hgdownload.soe.ucsc.edu/downloads.html

                            Try to get on there every hour or so until it works and see if you can figure it out. If I can get on I'll write out more detailed instructions. In the meantime you can start with a tutorial if you'd like.

                            Comment


                            • #15
                              Which reference file to download?

                              Thanks a million Heisman, will do the tutorial. BTW the link to the UCSC downloads page is working from my end but not sure which of the files below I need to download (I'm guessing hg19.2bit but not sure)......



                              chromAgp.tar.gz - Description of how the assembly was generated from
                              fragments, unpacking to one file per chromosome.

                              chromFa.tar.gz - The assembly sequence in one file per chromosome.
                              Repeats from RepeatMasker and Tandem Repeats Finder (with period
                              of 12 or less) are shown in lower case; non-repeating sequence is
                              shown in upper case.

                              chromFaMasked.tar.gz - The assembly sequence in one file per chromosome.
                              Repeats are masked by capital Ns; non-repeating sequence is shown in
                              upper case.

                              chromOut.tar.gz - RepeatMasker .out files (one file per chromosome).
                              RepeatMasker was run with the -s (sensitive) setting.
                              Using: Jan 29 2009 (open-3-2-7) version of RepeatMasker and
                              RELEASE 20090120 of library RepeatMaskerLib.embl

                              chromTrf.tar.gz - Tandem Repeats Finder locations, filtered to keep repeats
                              with period less than or equal to 12, and translated into UCSC's BED 5+
                              format (one file per chromosome).

                              hg19.2bit - contains the complete hg19 Human Genome
                              in the 2bit format. A utility program, twoBitToFa (available
                              from our src tree), can be used to extract .fa file(s) from
                              this file. See also:


                              est.fa.gz - Human ESTs in GenBank. This sequence data is updated once a
                              week via automatic GenBank updates.

                              md5sum.txt - checksums of files in this directory

                              mrna.fa.gz - Human mRNA from GenBank. This sequence data is updated
                              once a week via automatic GenBank updates.

                              refMrna.fa.gz - RefSeq mRNA from the same species as the genome.
                              This sequence data is updated once a week via automatic GenBank
                              updates.

                              upstream1000.fa.gz - Sequences 1000 bases upstream of annotated
                              transcription starts of RefSeq genes with annotated 5' UTRs.
                              This file is updated weekly so it might be slightly out of sync with
                              the RefSeq data which is updated daily for most assemblies.

                              upstream2000.fa.gz - Same as upstream1000, but 2000 bases.

                              upstream5000.fa.gz - Same as upstream1000, but 5000 bases.

                              xenoMrna.fa.gz - GenBank mRNAs from species other than that of
                              the genome. This sequence data is updated once a week via automatic
                              GenBank updates.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Essential Discoveries and Tools in Epitranscriptomics
                                by seqadmin




                                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                04-22-2024, 07:01 AM
                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 11:49 AM
                              0 responses
                              13 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-24-2024, 08:47 AM
                              0 responses
                              16 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              61 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              60 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X