
  • Sample FASTA data and questions about paired end reads

    Hello,

    I have some basic questions I think.

    1. Where can I find some sample FASTA files? I mean something like one file with the complete sequence (the result) and another file containing the reads that would yield that complete sequence through proper assembly. It shouldn't be too big. I just want to play with it.

    2. Is it common to represent paired end reads in the FASTA format? How?

    3. I'm not sure if I understand it correctly but are there only two paired end reads? The paired end reads are from the ends of a DNA molecule (http://seqanswers.com/forums/showthread.php?t=503). Therefore we have two paired end reads and a whole bunch of other "normal" reads? Am I correct?


    Btw I'm no bioinformatician so I apologize for the stupid questions in advance.

  • #2
    Fasta file format is meant for plain sequence files (without quality information). There may be extensions of Fasta format but the normal usage is for plain sequence.

    What you are looking for are Fastq format files, which have become the de facto standard for NGS data.

    You can get fastq data files from the NCBI Sequence Read Archive (SRA, formerly the Short Read Archive) (http://www.ncbi.nlm.nih.gov/sra). A utility called the SRA Toolkit is needed to retrieve the data. Fastq files can be several gigabytes in size.


    • #3
      1. Probably JGI. They seem to have a LOT of bacteria datasets, that should be more tractable.
      2. No, that would be incredibly unusual. They're usually stored in fastq, typically in separate files.
      3. Each fragment sequenced produces two reads, one from each end. So if you sequence 100 million fragments, you'll have 200 million reads (100 million pairs). This is as opposed to single-end reads, where you just sequence one end of each fragment.


      • #4
        2. fastq format


        • #5
          Thank you for your answers.
          So I have to look up fastq and write a parser. I had hoped I could avoid that.


          • #6
            2. Some assemblers require input in fasta format. For example, IDBA wants the two reads of each pair to be consecutive sequences in a fasta file, and it comes bundled with a small script for converting from fastq to fasta.

            Also, don't reinvent the wheel. I'm sure there are many OSS fastq parsers available. A good place to start could be https://github.com/samtools/htslib
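
            To illustrate the paired fasta layout from point 2, here is a rough sketch that writes the two reads of each pair as consecutive fasta entries. It is not the script bundled with IDBA; the function names `fastq_records` and `paired_fastq_to_fasta` are made up for this example, and it assumes 4-line fastq records with both input files listing the pairs in the same order.

```python
import io
from itertools import zip_longest
from typing import Iterator, TextIO, Tuple


def fastq_records(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) tuples, assuming exactly 4 lines per record.

    Reading stops at end of file (or at the first empty header line).
    """
    while True:
        lines = [handle.readline().rstrip("\r\n") for _ in range(4)]
        if not lines[0]:
            return
        if not lines[0].startswith("@") or not lines[2].startswith("+"):
            raise ValueError(f"malformed FASTQ record near {lines[0]!r}")
        # Quality string (lines[3]) is dropped: fasta has no place for it.
        yield lines[0][1:], lines[1]


def paired_fastq_to_fasta(reads1: TextIO, reads2: TextIO, out: TextIO) -> int:
    """Write mates as consecutive fasta entries; return the number of pairs."""
    pairs = 0
    for rec1, rec2 in zip_longest(fastq_records(reads1), fastq_records(reads2)):
        if rec1 is None or rec2 is None:
            raise ValueError("input files contain different numbers of reads")
        for name, seq in (rec1, rec2):
            out.write(f">{name}\n{seq}\n")
        pairs += 1
    return pairs


if __name__ == "__main__":
    # Tiny made-up example: two fragments, one read from each end.
    r1 = io.StringIO("@frag1/1\nACGT\n+\nIIII\n@frag2/1\nTTTT\n+\nIIII\n")
    r2 = io.StringIO("@frag1/2\nGGCC\n+\nIIII\n@frag2/2\nAAAA\n+\nIIII\n")
    out = io.StringIO()
    paired_fastq_to_fasta(r1, r2, out)
    print(out.getvalue(), end="")
```

            With real data you would pass file handles from `open(...)` instead of the in-memory strings. Check your assembler's documentation for the exact layout it expects.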
            savetherhino.org


            • #7
              HTSlib doesn't have a fastq parser. Anyway, with any modern data it's fine to assume that fastq entries are always 4 lines, so a parser is then trivial to write.
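
              As a rough idea of how small such a parser can be, here is a minimal sketch in Python. It relies on the 4-lines-per-entry assumption above, so it will not handle older files with wrapped sequence or quality lines. The names `FastqRecord` and `read_fastq` are just made up for this example.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # header line without the leading '@'
    sequence: str
    quality: str   # one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming 4 lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # skip stray blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


if __name__ == "__main__":
    # Made-up two-record example held in memory instead of a file.
    example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGG\n+\n!!\n")
    for record in read_fastq(example):
        print(record.name, record.sequence, record.quality)
```

              Reading by position rather than by looking for '@' matters because '@' is also a valid quality character, so a quality line can start with it.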


              • #8
                If you're using Python, there's a decent parser already at https://scipher.wordpress.com/2010/0...-fastq-parser/
                Scott Monsma
                Sr Scientist at Lucigen
