  • GATK realignment

    Hi there,
    I'm trying to produce a list of possible indels using the GATK RealignerTargetCreator. I have the latest version of hg19.fasta as my reference and a db135.vcf file as my known indels file. I have removed all the chr tags so my chromosomes are integers (1, 2, 3...), but for some reason my reference FASTA file, after being indexed, is sorted 1 2 3 4 5 6 7 X 8 9 10 11 12 13 14 15 16 17 18 20 Y 19 22 21 M. I can't find any way of getting GATK to run with the data in this order, and I have no idea how to change the file; using 'sort' doesn't work.

    Any help would be much appreciated - Thanks H
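
    A rough sketch of one way to put FASTA records into a chosen order, since plain 'sort' works on lines and not on whole records. The function name reorder_fasta is hypothetical, and it assumes every record starts with a ">" header whose first word is the contig name. It splits the file into one temporary file per contig, then concatenates them in the order given.

```shell
# Hypothetical helper: reorder_fasta INPUT.fa NAME1 NAME2 ...
# Prints the records of INPUT.fa to stdout in the order the names are listed.
reorder_fasta() {
    rf_input=$1
    shift
    rf_tmp=$(mktemp -d) || return 1
    # Write each record to its own file, named after the first word of its header.
    awk -v dir="$rf_tmp" '
        /^>/ { if (out) close(out); out = dir "/" substr($1, 2) ".fa" }
        { print > out }
    ' "$rf_input" || { rm -rf "$rf_tmp"; return 1; }
    # Concatenate the per-contig files in the requested order.
    for rf_name in "$@"; do
        cat "$rf_tmp/$rf_name.fa" || { rm -rf "$rf_tmp"; return 1; }
    done
    rm -rf "$rf_tmp"
}

# Example (placeholder file names):
# reorder_fasta hg19.fasta 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y M > hg19.sorted.fasta
```

    If samtools is available, `samtools faidx hg19.fasta 1 2 3 ...` prints the named sequences in the order given and may be simpler. Either way, the .fai index and the sequence dictionary would need to be regenerated for the reordered file.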

  • #2
    I don't know the answer to your question but I am interested in how you renamed your contigs. Right now I have a ref with chr1, chr2 etc and the dbSNP file is 1,2 etc. I am also a novice so help would be appreciated.

    Comment


    • #3
      Originally posted by shawpa View Post
      I don't know the answer to your question but I am interested in how you renamed your contigs. Right now I have a ref with chr1, chr2 etc and the dbSNP file is 1,2 etc. I am also a novice so help would be appreciated.
      I just used the UNIX command:

      $ sed "s/chr//g" file_to.change > new.file

      Try that; it worked for me.
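
      A narrower variant, as a sketch only: the command above removes every occurrence of "chr" in the file, which in a VCF could also alter ID or INFO fields. The two functions below (the names are hypothetical) touch only FASTA header lines, or the ##contig header lines and the CHROM column of a VCF.

```shell
# Hypothetical helpers; both read stdin and write stdout.

# FASTA: drop a leading "chr" from header lines only (">chr1" -> ">1").
strip_chr_fasta() {
    sed 's/^>chr/>/'
}

# VCF: drop "chr" from ##contig=<ID=chr...> header lines and from the start
# of data lines (the CHROM column); all other fields are left untouched.
strip_chr_vcf() {
    sed -e 's/^##contig=<ID=chr/##contig=<ID=/' -e 's/^chr//'
}

# Example (placeholder file names):
# strip_chr_fasta < file_to.change > new.file
```

      Note that this turns chrM into M, while files that use numeric names (such as the dbSNP VCF) typically call the mitochondrial contig MT, so that contig may still not match.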

      Comment


      • #4
        Why don't you just change the name back to having the "chr"?

        Comment


        • #5
          Originally posted by Heisman View Post
          Why don't you just change the name back to having the "chr"?
          I'm relatively new to UNIX, so for me the easiest way to overcome the problem was to make everything into an integer. How would you convert chromosomal integers in a .vcf or .fasta file back into the chr1 format?

          Comment


          • #6
            Well, I don't know how to get it to work either way. I'm working on the CountCovariates step and I did what HGENETIC suggests (thanks, by the way), and I stopped having the issue with my known sites file and reference. Now it is giving an error because my bam input still has chr1, chr2 etc. I tried the "fix" from above and it didn't seem to work on the bam file.

            Comment


            • #7
              Wait, I'm being dumb. Why did you remove the "chr" tags in the first place?

              Anyways, if you wanted to go back, your headers in the fasta file are like ">1" and ">2", and nothing else is, correct? Then you can type sed 's/>/>chr/' input_file > output_file
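
              For the .vcf half of the question, a comparable sketch (the function name add_chr_vcf is hypothetical): it prefixes "chr" to the ##contig header IDs and to the CHROM column of every data line, leaving the other header lines alone. It is not idempotent, so running it twice would produce names like chrchr1.

```shell
# Hypothetical helper; reads a VCF on stdin and writes it to stdout.
add_chr_vcf() {
    # First expression: ##contig=<ID=1,...>  ->  ##contig=<ID=chr1,...>
    # Second expression: prefix non-empty lines that do not start with "#"
    # (the data lines, which begin with the CHROM column).
    sed -e 's/^##contig=<ID=/##contig=<ID=chr/' -e '/^[^#]/s/^/chr/'
}

# Example (placeholder file names):
# add_chr_vcf < input.vcf > output.vcf
```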

              Comment


              • #8
                Originally posted by shawpa View Post
                Well, I don't know how to get it to work either way. I'm working on the CountCovariates step and I did what HGENETIC suggests (thanks, by the way), and I stopped having the issue with my known sites file and reference. Now it is giving an error because my bam input still has chr1, chr2 etc. I tried the "fix" from above and it didn't seem to work on the bam file.
                Yeah, bam files are compressed so that wouldn't work.

                There is no need to rename your reference sequence for this purpose.
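
                If the names in a bam file ever do need changing, they live in its header, so one common route is to rewrite only the header with samtools. A sketch, assuming samtools is installed; the function name is hypothetical and the file names are placeholders.

```shell
# Hypothetical helper: edits SAM header text on stdin, removing the "chr"
# prefix from the SN: field of @SQ lines only (@PG, @RG etc. are untouched).
strip_chr_sam_header() {
    sed '/^@SQ/s/SN:chr/SN:/'
}

# Example (requires samtools; file names are placeholders):
# samtools view -H in.bam | strip_chr_sam_header > new_header.sam
# samtools reheader new_header.sam in.bam > out.bam
```

                This only renames contigs. It assumes the sequences behind the old and new names are otherwise identical and in the same order, which is worth checking before relying on it.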

                Comment


                • #9
                  I removed the chr from the file because GATK gave me an error saying "known site and reference have incompatible contigs: No overlapping contigs found". So I took out the chr from my reference file to match the other. Now I run it and it says "Input files reads and reference have incompatible contigs: No overlapping contigs found." I think it is talking about my bam file, and since I aligned my bam file with a reference that still had chr in it, I am having an issue. At least I think this is what the error meant.

                  Comment


                  • #10
                    That makes sense. I guess my question is, why don't you have "chr" in every file?

                    Comment


                    • #11
                      Originally posted by Heisman View Post
                      Wait, I'm being dumb. Why did you remove the "chr" tags in the first place?

                      Anyways, if you wanted to go back, your headers in the fasta file are like ">1" and ">2", and nothing else is, correct? Then you can type sed 's/>/>chr/' input_file > output_file
                      Thanks for that, I think that would work nicely. The reason I removed the chr tags was because I was trying to use the dbSNP135 known variant file, which only had integers, whereas my fasta file had the chr tags - I think? To make things easier I'm just going to download and use the data from the GATK bundle, as that should all be compatible.

                      Comment


                      • #12
                        Yes, just use the stuff in their data bundle. There are a lot of errors in dbSNP 135 anyway. I emailed the NCBI about this a while ago and to my knowledge they are still working on it.

                        Comment


                        • #13
                          Originally posted by Heisman View Post
                          Yes, just use the stuff in their data bundle. There are a lot of errors in dbSNP 135 anyway. I emailed the NCBI about this awhile ago and to my knowledge they are still working on it.
                          Out of curiosity, do you know the difference between the data in the GATK bundle for b37 and hg19? All the file names are the same except for this.

                          Comment


                          • #14
                            Originally posted by HGENETIC View Post
                            Out of curiosity do you know the difference between the data in the GATK bundle for b37 and hg19, all the file names are the same except for this?
                            I am curious about this too. If I did the alignment using hg19 but now switch to b37 for the CountCovariates step, will everything be screwed up?

                            Comment


                            • #15
                              This is one of those things I tend to ignore although I shouldn't. I think the vast majority of it is the same, but I could be completely wrong.

                              Comment
