
  • read groups

    This is a problem for which there are many posts already. Yet after two days of googling without resolving it, I am going to ask. I have 17 samples from two subspecies that I want to eventually analyze in vcftools for some basic population genetics. From what I can tell, this requires them all to be in one file with some sort of distinguishing header for the different subspecies. So I am trying to add headers to the .sam files and then merge them so that this will be possible.

    I have tried using sampe:

    ./bwa sampe -r "@RG\tID:3\tSM:3\tPL:illumina" reference.txt in.sai in.sai in.fastq in.fastq > header.sam

    This does add the header, but only as a single @RG line in the file. Adding it using samtools merge -rh and a text file does not seem to work at all.

    Using picard to merge the files seems to maintain the headers, again as one line.

    java -Xmx2g -jar MergeSamFiles.jar

    but when it is finalized as a vcf all distinguishing tags are gone.

    Perhaps I am mistaken as to the input that vcf requires? How else is it supposed to tell between populations?

    Are these headers not enough to tag the reads?
    Last edited by sasignor; 01-07-2014, 01:24 PM.

  • #2
    If you already have the alignments, just use Picard's AddOrReplaceReadGroups. Adding read groups should alter every alignment, so if you run "samtools view sample1.bam | head" before and after adding read groups, you should see a difference. We'd have to see an example from the VCF file to tell whether things are going amiss there.
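
The before/after check can also be done by counting rather than by eye. Below is a minimal sketch, assuming SAM text on standard input; the function name `count_missing_rg` is hypothetical and not part of samtools or Picard. It counts alignment records that carry no `RG:Z:` tag, so the number should drop to 0 once read groups have been added.

```shell
# Hypothetical helper: count alignment records lacking an RG:Z: tag.
# Reads SAM text on stdin; header lines (starting with @) are skipped.
count_missing_rg() {
    awk -F'\t' '
        /^@/ { next }                       # skip header lines
        {
            found = 0
            for (i = 12; i <= NF; i++)      # optional tags start at column 12
                if ($i ~ /^RG:Z:/) found = 1
            if (!found) missing++
        }
        END { print missing + 0 }           # prints 0 when every record is tagged
    '
}
```

Something like `samtools view sample1.bam | count_missing_rg` would then report how many reads are still untagged. A header `@RG` line on its own does not tag reads; each alignment needs its own `RG:Z:` field pointing at that line's ID.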


    • #3
      I use the -h option in samtools merge to specify a separate header file. For the header file, I print the header from one of the BAM files, then append the @RG lines for all of the samples to it.

      Something like:

      samtools view -H file1.bam > header.txt
      cat header.txt rg.txt > rg_header.txt
      samtools merge -h rg_header.txt out.bam file*.bam

      where rg.txt holds the @RG lines, one per sample.

      It is a weird operation IMO; if there is a simpler way to do it, I would like to know. I can post more details if it isn't clear how to do it this way.
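
For more detail, here is a minimal sketch of how the combined header could be generated rather than written by hand. The helper names `make_rg_lines` and `build_header` are hypothetical, the sample names are placeholders, and setting ID and SM to the same value is an assumption (PL:illumina is taken from the original post). Existing @RG lines in the input header are dropped so that IDs are not duplicated, which is also a choice made here, not something samtools requires.

```shell
# Hypothetical helper: print one @RG header line per sample name given as an argument.
# ID and SM are both set to the sample name (an assumption for this sketch).
make_rg_lines() {
    for sample in "$@"; do
        printf '@RG\tID:%s\tSM:%s\tPL:illumina\n' "$sample" "$sample"
    done
}

# Hypothetical helper: read an existing SAM header on stdin, drop any @RG lines
# it already contains, then append fresh @RG lines for the named samples.
# The original header comes first so that @HD stays on the first line.
build_header() {
    grep -v '^@RG' || true    # grep exits non-zero if nothing is left; ignore that
    make_rg_lines "$@"
}
```

Used as `samtools view -H file1.bam | build_header sample1 sample2 > rg_header.txt`, this would produce the file passed to `samtools merge -h`. Note that -h only copies header lines. To tag the reads themselves during the merge, `samtools merge -r` attaches an RG tag inferred from each input file name, so the @RG IDs would need to match those names.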
