Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Formatting a custom genome reference

    Hi All,

    I have a transgenic mouse whose genome I've sequenced and I'm interesting is answering a few questions concerning insertions, deletions, rearrangements, and transgene copy numbers, all of which I verified (except the copy numbers of course) via PCR before sequencing the genome.

    Since I'm interested in structural information, I created a 550bp insert paired end read library and sequenced the transgenic mouse genome on the Illumina Hi-Seq.

    I took those reads, and mapped them back to a reference using bowtie2 (and BWA for practice!). In summary; I downloaded the chromosome files from UCSC for the latest mus genome build, concatenated them, and I then performed the alignment. Great results all around. (yay)

    Next, since there are at least two copies of the transgene in the genome, I went back and inserted two copies of the transgene sequence in to the appropriate chr#.fa file, concatenated it with the rest of the chr#.fa files, and performed the alignment using bowtie2 again. Everything seemed normal.

    HOWEVER, (and this is the part I need help on...), when I went to visualize the alignment in a genome browser (after converting to .bam, creating indexes for everything, etc.), I ran in to a problem where my chromosome files were formatted so that each line was exactly 50 characters long.

    I understand this is normal, so I wrote a program to reformat all of the lines in the chr#.fa file. That program can be found here:
    https://github.com/pkstarstorm05/Ref...Formatter50.py

    Longer story now short: In spite of my attempts, I keep getting errors from the genome browsers saying my genome for this specific chromosome does not have even sized lines. I've tried fixing it, but it still seems to not work.

    Is there anything special we have to do to properly prepare a chromosome file for use as a reference? I'm working on a windows PC and I know there are differences with carriage returns vs. line breaks, etc. Would that cause a problem here? I've tried getting rid of the carriage returns...

    Thanks in advance for any help.

    Paul

    (P.s. I'm new here... so I apologize if this has been posted already. I wasn't able to find it when I searched.)

  • #2
    Paul,

    You can try my reformat program, available here:

    Download BBMap for free. BBMap short read aligner, and other bioinformatic tools. This package includes BBMap, a short read aligner, as well as various other bioinformatic tools. It is written in pure Java, can run on any platform, and has no dependencies other than Java being installed (compiled for Java 6 and higher).


    In Windows:

    java -ea -Xmx200m -cp path/to/bbmap/current/ jgi.ReformatReads in=chr.fa out=formatted.fa fastawrap=50

    ...where "fastawrap" specifies the length of fasta lines (there are many other options, documented in reformat.sh). I have never had a problem with it. Also, there's a neat tool called Notepad++ that I use in Windows, which has (under the 'edit' menu) an "EOL conversion" option, which lets you change between Unix (\n) and Windows (\n\r) format. It also supports search-and-replace for control characters. The limit is roughly 300MB files which should be fine for a single chromosome.

    Also, if you use WinSCP for transferring between Windows and non-Windows computers, be sure to set the text transfer mode to binary, or it will mess up the EOL characters.

    Lastly, if you are interested in features like long insertions and deletions, I suggest you give BBMap a try - it is much more sensitive to these than Bowtie2 or BWA.

    Comment


    • #3
      Wow, thanks for the speedy response Brian.

      I will admit that I'm a bit of a novice with bioinformatics. I'm working on my PhD in developmental and genetics - and despite being told that we shouldn't both much with getting too deep in to the bioinformatics, if I want to be able to have an intelligent conversation with a solid bioinformatic, I believe I'm going to need these skills. If I hadn't Thanks very much for you help!

      My set up is currently working from a Windows 8 laptop and ssh'ing in to a super computer. I use Kitty for the ssh, and I am indeed using WinSCP for my file transfers. Thanks for the tip about the binary transfer, I wasn't aware of that previously.

      The computer that is handling all of my data processing is a Unix machine that has about 256Gb RAM, 24 cores, and plenty of HD space. Would the commands be the same for that machine? I've run them, but I received the following message followed by what seemed like a rather long run time (which I canceled):

      ./current$ ls
      align2 chr#.fa cluster dna driver fileIO jgi kmer pacbio stream var

      ./current$ java -ea -Xmx200m -cp ./ jgi.ReformatReads in=chr8.fa out=formatted.fa fastawrap=50
      Executing jgi.ReformatReads [in=chr8.fa, out=formatted.fa, fastawrap=50]

      Input is being processed as unpaired
      Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
      at java.util.Arrays.copyOf(Arrays.java:2271)
      at stream.FastaReadInputStream.fillBuffer(FastaReadInputStream.java:441)
      at stream.FastaReadInputStream.nextHeader(FastaReadInputStream.java:309)
      at stream.FastaReadInputStream.fillList(FastaReadInputStream.java:205)
      at stream.FastaReadInputStream.hasMore(FastaReadInputStream.java:137)
      at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:739)
      at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:731)

      I didn't think that was supposed to happen... but I've literally no experience with java. What little skill I have is with python and command line. Also, the 'ReformatReads' program is appropriate for reformatting a genome reference chromosome file?

      I also used Notepad++ to make the amendment to the original chr.fa file and I didn't realize it could convert the EOL symbols to Unix! Good to know as well!

      A million thanks for your help!

      Paul

      Comment


      • #4
        Hi Brian -

        Immediately after submitting that response, I went back and ran the usage for the program you suggested. This line in particular was useful:

        java -ea -Xmx512m -cp <path> jgi.ReformatReads in=<infile> in2=<infile2> out=<outfile> out2=<outfile2>

        I figured that I ran out of memory, and that looks like the allocated memory for the program. I changed it from 200 to 512, and the program ran without any problems.

        I'll post back if there are any other problems!

        Cheers,
        Paul

        Comment


        • #5
          Paul,

          Thanks for the update - sorry, I forgot it would need more memory when dealing with whole chromosomes. Normally I use it with raw reads and de-novo assemblies with short contigs so 200 megabytes is plenty. But yes, as long as it gets enough memory, it should work with anything.

          -Brian

          P.S. If you are running stuff on a Linux computer in bash rather than Windows, you can just use the shellscripts which are easier as they automatically set the classpath; e.g.

          bash reformat.sh -Xmx512m in=<file> out=<file>
          Last edited by Brian Bushnell; 08-18-2014, 09:23 PM.

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Techniques and Challenges in Conservation Genomics
            by seqadmin



            The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

            Avian Conservation
            Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
            03-08-2024, 10:41 AM
          • seqadmin
            The Impact of AI in Genomic Medicine
            by seqadmin



            Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
            02-26-2024, 02:07 PM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, 03-14-2024, 06:13 AM
          0 responses
          34 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 03-08-2024, 08:03 AM
          0 responses
          72 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 03-07-2024, 08:13 AM
          0 responses
          81 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 03-06-2024, 09:51 AM
          0 responses
          68 views
          0 likes
          Last Post seqadmin  
          Working...
          X