  • Help with Illumina Paired-End Data

    I am a beginner at bioinformatics but have some experience with Python and software development.

    I am trying to take some Illumina sequence data (mRNA-level complementary DNA I think) and prepare it for BLAST alignment. It is supposed to be paired-end. However, I'm trying to make sure this is true.

    For example, I have the following data files:
    J06643_NoIndex_L002_R1_001.fastq
    J06643_NoIndex_L002_R1_002.fastq
    J06643_NoIndex_L002_R1_003.fastq
    J06643_NoIndex_L002_R1_004.fastq
    J06643_NoIndex_L002_R1_005.fastq
    J06643_NoIndex_L002_R1_006.fastq
    J06643_NoIndex_L002_R1_007.fastq
    J06643_NoIndex_L002_R1_008.fastq
    J06643_NoIndex_L002_R1_009.fastq
    J06643_NoIndex_L002_R1_010.fastq
    J06643_NoIndex_L002_R1_011.fastq
    J06643_NoIndex_L002_R1_012.fastq
    J06643_NoIndex_L002_R1_013.fastq
    J06643_NoIndex_L002_R1_014.fastq
    J06643_NoIndex_L002_R2_001.fastq
    J06643_NoIndex_L002_R2_002.fastq
    J06643_NoIndex_L002_R2_003.fastq
    J06643_NoIndex_L002_R2_004.fastq
    J06643_NoIndex_L002_R2_005.fastq
    J06643_NoIndex_L002_R2_006.fastq
    J06643_NoIndex_L002_R2_007.fastq
    J06643_NoIndex_L002_R2_008.fastq
    J06643_NoIndex_L002_R2_009.fastq
    J06643_NoIndex_L002_R2_010.fastq
    J06643_NoIndex_L002_R2_011.fastq
    J06643_NoIndex_L002_R2_012.fastq
    J06643_NoIndex_L002_R2_013.fastq
    J06643_NoIndex_L002_R2_014.fastq

    It would seem logical that R1 is one end of the pair, and that R2 is the other. However, when I look at each set of files, I do not see the "/1" and "/2" designations. (according to this site, they should be there: http://loblolly.ucdavis.edu/bipod/ft...al_RNA-Seq.pdf)

    R1_001:

    <#0@@#############################################
    @D3NH4HQ1:710G1KACXX:2:1101:1488:2217 1:N:0:
    GTAAGGGCAAGGGCACTGAGCTATGTCATCTGGGCTCAAATTCTGCTACC
    +
    B@@FFFFFHHHHHJJJIJJJJJIJJIIGIIIJIJJGIGGIIIGJIEIIIH
    @D3NH4HQ1:710G1KACXX:2:1101:1279:2224 1:Y:0:
    GGCTTATTTGATACTCATGGTACAGAAGCGACGATCAAATAGATTGAGAA

    R2_001:

    ###4##22ADFHG#####################################
    @D3NH4HQ1:710G1KACXX:2:1101:2135:2174 2:N:0:
    NNGATGCAGGTGGCNNGGANNNNNNNNCGCCATNNTGCCTNNNNNNNNNN
    +
    ##14A?DBD<CACB##42<########11??FE##00?B@##########
    @D3NH4HQ1:710G1KACXX:2:1101:2088:2176 2:N:0:
    NNTGTTGTCACTTTNNAGANNNNNNNNTTGCTATNAAGCTNNNNNNNNNN

    Does this mean the data are not paired end?

  • #2
    I'm not sure what the exact specifics of the new format are, but the 1:N:0 or 2:N:0 in the header denotes what /1 and /2 used to. This Wikipedia page is helpful:



    • #3

      It would seem logical that R1 is one end of the pair, and that R2 is the other. However, when I look at each set of files, I do not see the "/1" and "/2" designations. (according to this site, they should be there: http://loblolly.ucdavis.edu/bipod/ft...al_RNA-Seq.pdf)
      What did your parents (or teachers) tell you about not trusting everything you read on the internet?

      The Illumina specs have changed back and forth a couple of times in the last several months. It looks like you received files from the time that they decided to remove the '/1' and '/2' designations. Instead look at the first number after the white space:

      @D3NH4HQ1:710G1KACXX:2:1101:2088:2176 2:N:0:
      The above is an R2 read.
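A minimal Python sketch of that check, assuming the header layout shown above (read ID, a space, then colon-separated fields beginning with the mate number). The function name `parse_illumina_header` is made up for illustration. The second field is taken here to be the filter flag, with Y meaning the read failed Illumina's quality filter; treat that reading as an assumption and confirm it against the documentation for your pipeline version.

```python
def parse_illumina_header(header):
    """Split a FASTQ header like '@ID 2:N:0:' into (read_id, mate, is_filtered)."""
    if not header.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    # Everything before the first space identifies the cluster; it should be
    # identical for the R1 and R2 reads of one pair.
    read_id, _, comment = header[1:].strip().partition(" ")
    fields = comment.split(":")
    mate = int(fields[0])            # 1 for R1, 2 for R2
    is_filtered = fields[1] == "Y"   # assumed: Y = failed the quality filter
    return read_id, mate, is_filtered


read_id, mate, is_filtered = parse_illumina_header(
    "@D3NH4HQ1:710G1KACXX:2:1101:2088:2176 2:N:0:"
)
print(read_id, mate, is_filtered)
```

If the files really are paired, the Nth record of an R1 file and the Nth record of the matching R2 file should give the same `read_id` with `mate` values 1 and 2.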



      • #4
        Great. That makes total sense.

        The first thing I would like to do is subtract all human sequences from the data. We are only interested in viruses. I have attempted this with the following process. Does this look correct?

        2. Each set of R1 and R2 files was concatenated using the following command, producing one R1 fastq file and one R2 fastq file.
        a. cat J06643_NoIndex_L002_R1_001.fastq J06643_NoIndex_L002_R1_002.fastq J06643_NoIndex_L002_R1_003.fastq J06643_NoIndex_L002_R1_004.fastq J06643_NoIndex_L002_R1_005.fastq J06643_NoIndex_L002_R1_006.fastq J06643_NoIndex_L002_R1_007.fastq J06643_NoIndex_L002_R1_008.fastq J06643_NoIndex_L002_R1_009.fastq J06643_NoIndex_L002_R1_010.fastq J06643_NoIndex_L002_R1_011.fastq J06643_NoIndex_L002_R1_012.fastq J06643_NoIndex_L002_R1_013.fastq J06643_NoIndex_L002_R1_014.fastq > J06_R1.fastq

        3. Illumina adapters and low-quality bases were trimmed using cutadapt.
        a. cutadapt -f fastq -q 20 -a AGATCGGAAGAGC J06_R1.fastq > ./J06_R1_trimmed.fastq

        4. Bowtie against hg19 to subtract out all human sequences
        a. bowtie --un J06_subtracted.fastq -p 8 --chunkmbs 512 hg19 -1 J06_R1_trimmed.fastq -2 J06_R2_trimmed.fastq J06.sam



        • #5
          cat *R1*.fastq > J06_R1.fq
          Probably would have worked just as well, with a lot less typing.
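The glob relies on the shell expanding file names in sorted order, and for paired data the R1 and R2 chunks have to be concatenated in the same order so that the Nth read of each combined file still belongs to the same pair. Below is a rough Python sketch of the same step that makes the ordering explicit. The helper names `pair_chunks` and `concatenate` are hypothetical, and it assumes the chunk names differ only in `_R1_` versus `_R2_`, as in the file list from the first post.

```python
import shutil


def pair_chunks(names):
    """Return sorted (R1, R2) chunk pairs; fail if an R1 chunk has no R2 partner."""
    available = set(names)
    pairs = []
    for r1 in sorted(n for n in names if "_R1_" in n):
        r2 = r1.replace("_R1_", "_R2_")
        if r2 not in available:
            raise ValueError(f"no R2 partner for {r1}")
        pairs.append((r1, r2))
    return pairs


def concatenate(paths, out_path):
    """Write the files in paths, in the order given, to a new file out_path."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as chunk:
                shutil.copyfileobj(chunk, out)
```

A possible use, with the output names as placeholders:

```python
import glob

pairs = pair_chunks(glob.glob("J06643_NoIndex_L002_R*_*.fastq"))
concatenate([r1 for r1, _ in pairs], "J06_R1.fq")
concatenate([r2 for _, r2 in pairs], "J06_R2.fq")
```

Writing the output with a different extension than the inputs (as the `.fq` in the command above does) keeps a second run of the glob from picking up the combined file as an input.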

          If you know the virus you expect to see, it might work slightly better if you align against a genome that has human sequence and virus sequence together. You'll have to make the index for that yourself, rather than downloading the pre-made one. You can then filter the .bam for the lines that aligned to virus.

          But that won't make a very big difference.
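As a sketch of that filtering step, here is one way to keep only the alignments that hit the virus sequences, working on SAM text rather than on the .bam directly (a BAM can be converted to SAM text with `samtools view -h`). The reference name `virus_genome` is a placeholder for whatever sequence names end up in the combined index, and `filter_sam_by_reference` is a made-up helper name.

```python
def filter_sam_by_reference(lines, wanted_refs):
    """Yield SAM header lines plus alignments whose reference name is wanted."""
    wanted = set(wanted_refs)
    for line in lines:
        if line.startswith("@"):
            yield line  # header lines (@HD, @SQ, ...) pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 3 of a SAM alignment line is RNAME, the reference the read hit.
        if len(fields) >= 3 and fields[2] in wanted:
            yield line


sam = [
    "@SQ\tSN:chr1\tLN:1000\n",
    "@SQ\tSN:virus_genome\tLN:500\n",
    "readA\t0\tchr1\t10\t255\t50M\t*\t0\t0\tACGT\tIIII\n",
    "readB\t0\tvirus_genome\t20\t255\t50M\t*\t0\t0\tACGT\tIIII\n",
]
for kept in filter_sam_by_reference(sam, ["virus_genome"]):
    print(kept, end="")
```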



          • #6
            Hah, thanks. That would've saved me some time.

            How about the cutadapt and bowtie commands?

            For cutadapt, is -q 20 appropriate? Did I select the right adapter sequence, and is there a way to make sure of this?
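One rough way to sanity-check an adapter choice is to count how many raw reads contain the start of the adapter sequence. The sketch below does this for any iterable of FASTQ lines. The function name `adapter_fraction` and the 10-base probe length are arbitrary choices for illustration, and it only finds exact matches, so it will undercount adapters that are truncated at the read end or contain sequencing errors.

```python
def adapter_fraction(fastq_lines, adapter, min_match=10):
    """Return the fraction of reads containing the first min_match adapter bases."""
    probe = adapter[:min_match]
    total = 0
    hits = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the second line of each 4-line record is the sequence
            total += 1
            if probe in line:
                hits += 1
    return hits / total if total else 0.0


# Hypothetical usage on the untrimmed combined file:
# with open("J06_R1.fastq") as handle:
#     print(adapter_fraction(handle, "AGATCGGAAGAGC"))
```

A fraction near zero for one candidate sequence and clearly higher for another would suggest the second is the adapter actually present in the library.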

            For Bowtie, do I need to alter the "maxins" parameter? My reads are 50bp, and the default maxins parameter is 250.

            Right now, Bowtie is outputting some blank and incomplete reads. Is that normal, and will it screw up the assembly step?

            For example, here are the first few lines of the R1 bowtie output:

            @D3NH4HQ1:710G1KACXX:2:1101:1233:2172 1:Y:0:
            A
            +
            <
            @D3NH4HQ1:710G1KACXX:2:1101:1406:2044 1:Y:0:
            AAAA
            +
            <<<@
            @D3NH4HQ1:710G1KACXX:2:1101:1317:2025 1:Y:0:
            AGCT
            +
            <<<?
            @D3NH4HQ1:710G1KACXX:2:1101:15237:2000 1:Y:0:

            +

            @D3NH4HQ1:710G1KACXX:2:1101:15197:2000 1:Y:0:

            +

            @D3NH4HQ1:710G1KACXX:2:1101:15556:2000 1:Y:0:

            +
