  • Sample FASTA data and questions about paired end reads

    Hello,

    I have some basic questions I think.

    1. Where can I find some sample FASTA files? I mean something like a file for the complete sequence (the result) and a file containing the reads which would result in the complete sequence through proper assembly. It shouldn't be that big. I want to play with it.

    2. Is it common to represent paired end reads in the FASTA format? How?

    3. I'm not sure if I understand it correctly but are there only two paired end reads? The paired end reads are from the ends of a DNA molecule (http://seqanswers.com/forums/showthread.php?t=503). Therefore we have two paired end reads and a whole bunch of other "normal" reads? Am I correct?


    Btw I'm no bioinformatician so I apologize for the stupid questions in advance.

  • #2
    Fasta file format is meant for plain sequence files (without quality information). There may be extensions of Fasta format but the normal usage is for plain sequence.

    What you are looking for are files in Fastq format, which has become the de facto standard for NGS data.

    You can get Fastq data files from the NCBI Short Read Archive (SRA) (http://www.ncbi.nlm.nih.gov/sra); retrieving them requires a utility called the SRA Toolkit. Fastq files can be several gigabytes in size.



    • #3
      1. Probably JGI. They seem to have a LOT of bacterial datasets, which should be more tractable.
      2. No, that would be incredibly unusual. They're usually stored in fastq, typically in separate files.
      3. Each fragment sequenced produces two reads, one from each end. So if you sequence 100 million fragments, you'll have 200 million reads (100 million pairs). This is as opposed to single-end reads, where you just sequence one end of each fragment.
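
To make point 3 concrete, here is a minimal sketch of how one fragment yields two reads. It assumes the common convention that read 2 is sequenced from the opposite strand, so it appears as the reverse complement of the fragment's far end. The function names, the toy fragments, and the read length are made up for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def paired_end_reads(fragment, read_length):
    """Return (read1, read2) for a single fragment.

    read1 is the first `read_length` bases of the fragment. read2 is
    assumed to come from the other strand at the opposite end, so it is
    the reverse complement of the last `read_length` bases.
    """
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment[-read_length:])
    return read1, read2


# Hypothetical toy fragments; real fragments are hundreds of bases long.
fragments = ["ACGTTGCAAGGCTTAA", "GGGCCCAAATTTACGT"]
pairs = [paired_end_reads(fragment, read_length=5) for fragment in fragments]
reads = [read for pair in pairs for read in pair]

# Every fragment contributes exactly one pair, i.e. two reads.
print(len(fragments), "fragments,", len(pairs), "pairs,", len(reads), "reads")
```

With single-end sequencing only `read1` would exist for each fragment, which is why the read count doubles for paired-end data.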



      • #4
        2. fastq format



        • #5
          Thank you for your answers.
          So I have to look up fastq and write a parser. I had hoped I could avoid it.



          • #6
            2. Some assemblers require input in fasta format, e.g. IDBA wants pairs to be consecutive sequences in fasta format, and they have bundled a small script for converting from fastq to fasta.

            Also, don't reinvent the wheel. I'm sure there are many OSS fastq parsers available. A good place to start could be https://github.com/samtools/htslib
            savetherhino.org
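
For illustration, a rough sketch of that kind of conversion: two fastq streams (one per mate) are merged into a single fasta stream in which the two reads of each pair are consecutive. This is not the script bundled with IDBA, only a guess at the general idea. The function names and the `/1`, `/2` read-name suffixes in the toy data are assumptions, and the parser assumes exactly 4 lines per fastq record.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (name, sequence) from a FASTQ stream with 4 lines per record."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) < 4 or not record[0].startswith("@"):
            raise ValueError("malformed FASTQ record")
        # Drop the leading "@" from the header; quality lines are ignored.
        yield record[0][1:].strip(), record[1].strip()


def interleave_to_fasta(fastq1, fastq2, out):
    """Write mates as consecutive FASTA records: read 1, then read 2.

    Assumes both inputs list the pairs in the same order. zip() stops at
    the shorter input, so unequal files are silently truncated here.
    """
    for (name1, seq1), (name2, seq2) in zip(read_fastq(fastq1), read_fastq(fastq2)):
        out.write(f">{name1}\n{seq1}\n>{name2}\n{seq2}\n")


# Toy in-memory data standing in for two real files opened with open().
r1 = io.StringIO("@frag1/1\nACGT\n+\nIIII\n@frag2/1\nGGCC\n+\nIIII\n")
r2 = io.StringIO("@frag1/2\nTTAA\n+\nIIII\n@frag2/2\nCCGG\n+\nIIII\n")
out = io.StringIO()
interleave_to_fasta(r1, r2, out)
fasta_text = out.getvalue()
print(fasta_text, end="")
```

Note that the quality strings are lost in the conversion, since fasta has no place to store them.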



            • #7
              HTSlib doesn't have a fastq parser. Anyway, with any modern data it's fine to assume that fastq entries are always 4 lines, so a parser is then trivial to write.
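
As a minimal sketch of such a parser, under the 4-lines-per-entry assumption stated above (it will reject older multi-line fastq files): the names `FastqRecord` and `parse_fastq` are made up for this example, and the sanity checks are one possible choice, not a complete validator.

```python
import io
from itertools import islice
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming 4 lines per entry."""
    while True:
        lines = [line.rstrip("\r\n") for line in islice(handle, 4)]
        if not lines:
            return  # clean end of input
        if len(lines) != 4:
            raise ValueError("truncated FASTQ record at end of input")
        header, sequence, separator, quality = lines
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Toy in-memory input; for a real file, pass the object returned by open().
example = io.StringIO("@read1\nACGTN\n+\nIIII#\n@read2\nGGCC\n+read2\nFFFF\n")
records = list(parse_fastq(example))
for record in records:
    print(record.name, record.sequence, record.quality)
```

Reading four lines at a time keeps memory use constant, which matters when the input is several gigabytes.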



              • #8
                If you're using Python, there's a decent parser already at https://scipher.wordpress.com/2010/0...-fastq-parser/
                Scott Monsma
                Sr Scientist at Lucigen
