
  • Illumina paired-end SRA data in three separate files - what next?

    Hi,

    I have used fastq-dump to split paired-end Illumina data. I get three files: one for each read of the pair and one with barcodes. This is transcriptome data, and I want to do a de novo assembly. I have two questions:

    First, the SRA website where I got the data mentions only one barcode, but the barcodes file contains several different ones. Should I use only the sequences with the barcode given on the website?

    Second, how can I split the files by barcode while keeping the pairs together? I looked at the FASTX-Toolkit and QIIME's split_libraries, but I don't think my Illumina barcodes are included in the sequences themselves.

    Examples of the files:

    Code:
    -bash-4.1$ head SRR343051_1.fastq 
    @SRR343051.1.1 B0A05ABXX110604:3:1101:18610:1087 length=101
    NTCTTCTTGCGTACGCATTTGGACTTAATCCTAATCTTGGATTTGTTTCTTCTAAATATGTACCAATCACAATGCTTGAATCTCTTATTATAATATATTTA
    +SRR343051.1.1 B0A05ABXX110604:3:1101:18610:1087 length=101
    #####################################################################################################
    @SRR343051.2.1 B0A05ABXX110604:3:1101:14471:1088 length=101
    NCGAAGGGCAATGTAATAAAGTTTATTATTATGTGTGTACAATGCAAAAAAAAGGGACTCGACTCTAATCCTGGTCGAAGCACAGGGCAAGACCACCAATG
    +SRR343051.2.1 B0A05ABXX110604:3:1101:14471:1088 length=101
    #####################################################################################################
    @SRR343051.3.1 B0A05ABXX110604:3:1101:20187:1088 length=101
    NATCATAATCTTCAATTTTCAAATTACTCTTGTTGCCTTTGGAAAGATCGTTAGTTTTCGGGTCTTTTATATTTTACTATTGCTTTATACTTGTTTTCACT
    
    -bash-4.1$ head SRR343051_2.fastq 
    @SRR343051.1.2 B0A05ABXX110604:3:1101:18610:1087 length=8
    TTGAGCCT
    +SRR343051.1.2 B0A05ABXX110604:3:1101:18610:1087 length=8
    CCCFFFFF
    @SRR343051.2.2 B0A05ABXX110604:3:1101:14471:1088 length=8
    TTGAGCCT
    +SRR343051.2.2 B0A05ABXX110604:3:1101:14471:1088 length=8
    CCCFFFFF
    @SRR343051.3.2 B0A05ABXX110604:3:1101:20187:1088 length=8
    TTGAGCCT
    
    -bash-4.1$ head SRR343051_3.fastq 
    @SRR343051.1.3 B0A05ABXX110604:3:1101:18610:1087 length=101
    GAGAAAATAAAATATGAGAAAATAGTAAAGAAGAAATTAACTGATATAATTACAGAAGAGAATGAATAATTGAAACAATTAAAAAATCATTAAATGAAGAT
    +SRR343051.1.3 B0A05ABXX110604:3:1101:18610:1087 length=101
    CCCFFFFFGHHHHJJJIJIJJIJJJHJIJJJJJJJJJJJJJJJJJJJJHIGIIIIGHHIJIJJJJJJIJJJJJEGIIJJJJGFHHFFCEEEECCDDDCCCC
    @SRR343051.2.3 B0A05ABXX110604:3:1101:14471:1088 length=101
    CTGATGGTGTACGTTGAACTTGGTCTGGTGGTGCTGATTCTGAGCAACAGTCTGCGTCGCGCCGCCTCCTTCTTCCTGATTCTCTCGCTGGCCGTGTCGCT
    +SRR343051.2.3 B0A05ABXX110604:3:1101:14471:1088 length=101
    BCCFFFFDHHHHHJJIIGIJJJJHIJJIIJJFHIJJIJJJJIIJJJJJJJJIIJJIGIJJHFFDDDBDDDDDDDDDDDCDDDDCDD<BD39??&09B?9A<
    @SRR343051.3.3 B0A05ABXX110604:3:1101:20187:1088 length=101
    AGGTGATTCATCATCTTCAAAATATTAATAAAAAGTATATTAATATAAAGACAATTATATATCGAAAGTGAATAGTACTGTGAAGGAAAGTAGGAAATATT
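
A quick way to see how many distinct barcodes the index file really holds is to tally its sequence lines. The following is only a minimal sketch, assuming a plain four-line-per-record FASTQ like the listing above; `count_barcodes` is a hypothetical helper name, not part of any toolkit.

```python
from collections import Counter
from itertools import islice


def count_barcodes(lines):
    """Tally the sequence line (line 2) of every 4-line FASTQ record."""
    counts = Counter()
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:  # end of input (or a truncated final record)
            break
        counts[record[1].strip()] += 1
    return counts


# Two records copied from the SRR343051_2.fastq listing above.
example = """\
@SRR343051.1.2 B0A05ABXX110604:3:1101:18610:1087 length=8
TTGAGCCT
+SRR343051.1.2 B0A05ABXX110604:3:1101:18610:1087 length=8
CCCFFFFF
@SRR343051.2.2 B0A05ABXX110604:3:1101:14471:1088 length=8
TTGAGCCT
+SRR343051.2.2 B0A05ABXX110604:3:1101:14471:1088 length=8
CCCFFFFF
""".splitlines()

barcode_counts = count_barcodes(example)
for barcode, n in barcode_counts.most_common():
    print(barcode, n)
```

On the real file, pass an open file handle instead of the example list, e.g. `count_barcodes(open("SRR343051_2.fastq"))`. If one barcode dominates and the others differ from it by only a base or two, the extra barcodes are often just sequencing errors in the index read, though that is an interpretation to check against the run's metadata.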

  • #2
    Hopefully you have the information about barcode <--> sample.

    Try this script for demultiplexing: http://qiime.org/scripts/split_libraries_fastq.html



    • #3
      @Jon B: You have not used the option

      -F | --origfmt Defline contains only original sequence name.

      with fastq-dump, so you have the SRR* prefix in the read names. Just keep that in mind.
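
For files that were already dumped without that option, the original instrument name can be recovered from each defline afterwards. This is a rough sketch, assuming the defline layout shown in this thread (`@<accession>.<spot>.<read> <original name> length=<n>`); `original_name` is a hypothetical helper, and it is not claimed to reproduce the `-F` output exactly.

```python
def original_name(defline):
    """Strip the accession prefix and length suffix from a fastq-dump defline.

    Assumes the layout seen in this thread:
    '@<accession>.<spot>.<read> <original name> length=<n>'
    """
    marker = defline[0]  # '@' for the header line, '+' for the separator line
    fields = defline.rstrip("\n").split(" ")
    if len(fields) < 2:
        raise ValueError("no original name in defline: " + defline)
    return marker + fields[1]


# Header copied from the SRR343051_1.fastq listing in the first post.
header = "@SRR343051.1.1 B0A05ABXX110604:3:1101:18610:1087 length=101"
print(original_name(header))
```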



      • #4
        Originally posted by GenoMax
        @Jon B: You have not used the option

        -F | --origfmt Defline contains only original sequence name.

        with fastq-dump, so you have the SRR* prefix in the read names. Just keep that in mind.

        Thanks! I didn't see that option.



        • #5
          Originally posted by GenoMax
          Hopefully you have the information about barcode <--> sample.

          Try this script for demultiplexing: http://qiime.org/scripts/split_libraries_fastq.html

          GenoMax, do you mind telling me how I could use this script? I was looking at it before, but I don't understand how it assigns my reads to files based on the barcodes or how it deals with the two reads of each pair. Can I still use it on my data with the pairs in separate files?

          Thanks
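
On the general question of how barcode-based splitting can keep pairs together when the barcode sits in its own file, the idea can be sketched as reading the three files in lockstep and binning each read pair under the barcode from the same spot. This is an illustration only, not QIIME's implementation; it assumes all three files list the spots in the same order, and `records`, `demultiplex`, and the toy reads below are made up for the example.

```python
from collections import defaultdict
from itertools import islice


def records(lines):
    """Yield FASTQ records as lists of four stripped lines."""
    it = iter(lines)
    while True:
        rec = [line.rstrip("\n") for line in islice(it, 4)]
        if len(rec) < 4:  # end of input (or a truncated final record)
            return
        yield rec


def demultiplex(r1_lines, barcode_lines, r2_lines):
    """Group (read 1, read 2) record pairs by the barcode read of the same spot."""
    bins = defaultdict(list)
    triples = zip(records(r1_lines), records(barcode_lines), records(r2_lines))
    for r1, bc, r2 in triples:
        bins[bc[1]].append((r1, r2))  # bc[1] is the barcode sequence
    return dict(bins)


# Hypothetical toy data: two spots, each with a different barcode.
r1 = ["@spot1/1", "ACGT", "+", "IIII", "@spot2/1", "TTTT", "+", "IIII"]
bc = ["@spot1/2", "TTGAGCCT", "+", "IIIIIIII",
      "@spot2/2", "GGCTACAG", "+", "IIIIIIII"]
r2 = ["@spot1/3", "CCCC", "+", "IIII", "@spot2/3", "GGGG", "+", "IIII"]

bins = demultiplex(r1, bc, r2)
for barcode, pairs in bins.items():
    print(barcode, len(pairs))
```

For real data one would write each bin to its own pair of output files rather than hold everything in memory; the dictionary is used here only to keep the sketch short.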



          • #6
            Jon: This appears to be a single sample even though the barcode read is included as a separate file in the SRA archive. See the corresponding ENA record (http://www.ebi.ac.uk/ena/data/view/SRR343051).

            In short, demultiplexing is not needed for this sample. You can use the _1 and _3 files as the R1/R2 read pair.
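
Before handing the _1 and _3 files to an assembler, it may be worth confirming that they are still in sync. The sketch below compares the spot identifiers record by record, assuming the `@<accession>.<spot>.<read>` defline layout shown in the first post; `spot_id` and `count_matched_pairs` are hypothetical helper names, and the sequences in the example are shortened for brevity.

```python
from itertools import islice, zip_longest


def records(lines):
    """Yield FASTQ records as lists of four stripped lines."""
    it = iter(lines)
    while True:
        rec = [line.rstrip("\n") for line in islice(it, 4)]
        if not rec:
            return
        if len(rec) < 4:
            raise ValueError("truncated FASTQ record")
        yield rec


def spot_id(defline):
    """'@SRR343051.1.1 B0A05...' -> 'SRR343051.1' (drops the read number)."""
    name = defline[1:].split(" ", 1)[0]
    return name.rsplit(".", 1)[0]


def count_matched_pairs(r1_lines, r2_lines):
    """Return the number of pairs, or raise if the files are out of sync."""
    n = 0
    for r1, r2 in zip_longest(records(r1_lines), records(r2_lines)):
        if r1 is None or r2 is None:
            raise ValueError("files contain different numbers of reads")
        if spot_id(r1[0]) != spot_id(r2[0]):
            raise ValueError("mismatch: %s vs %s" % (r1[0], r2[0]))
        n += 1
    return n


# Headers from the listings in the first post; sequences cut to 10 bases.
read1 = ["@SRR343051.1.1 B0A05ABXX110604:3:1101:18610:1087 length=101",
         "NTCTTCTTGC", "+", "##########"]
read3 = ["@SRR343051.1.3 B0A05ABXX110604:3:1101:18610:1087 length=101",
         "GAGAAAATAA", "+", "CCCFFFFFGH"]

n_pairs = count_matched_pairs(read1, read3)
print(n_pairs, "matched pairs")
```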



            • #7
              Originally posted by GenoMax
              Jon: This appears to be a single sample even though the barcode read is included as a separate file in the SRA archive. See the corresponding ENA record (http://www.ebi.ac.uk/ena/data/view/SRR343051).

              In short, demultiplexing is not needed for this sample. You can use the _1 and _3 files as the R1/R2 read pair.

              Thank you!
