
  • Help with Illumina Paired-End Data

    I am a beginner at bioinformatics but have some experience with Python and software development.

    I am trying to take some Illumina sequence data (mRNA-level complementary DNA I think) and prepare it for BLAST alignment. It is supposed to be paired-end. However, I'm trying to make sure this is true.

    For example, I have the following data files:
    J06643_NoIndex_L002_R1_001.fastq
    J06643_NoIndex_L002_R1_002.fastq
    J06643_NoIndex_L002_R1_003.fastq
    J06643_NoIndex_L002_R1_004.fastq
    J06643_NoIndex_L002_R1_005.fastq
    J06643_NoIndex_L002_R1_006.fastq
    J06643_NoIndex_L002_R1_007.fastq
    J06643_NoIndex_L002_R1_008.fastq
    J06643_NoIndex_L002_R1_009.fastq
    J06643_NoIndex_L002_R1_010.fastq
    J06643_NoIndex_L002_R1_011.fastq
    J06643_NoIndex_L002_R1_012.fastq
    J06643_NoIndex_L002_R1_013.fastq
    J06643_NoIndex_L002_R1_014.fastq
    J06643_NoIndex_L002_R2_001.fastq
    J06643_NoIndex_L002_R2_002.fastq
    J06643_NoIndex_L002_R2_003.fastq
    J06643_NoIndex_L002_R2_004.fastq
    J06643_NoIndex_L002_R2_005.fastq
    J06643_NoIndex_L002_R2_006.fastq
    J06643_NoIndex_L002_R2_007.fastq
    J06643_NoIndex_L002_R2_008.fastq
    J06643_NoIndex_L002_R2_009.fastq
    J06643_NoIndex_L002_R2_010.fastq
    J06643_NoIndex_L002_R2_011.fastq
    J06643_NoIndex_L002_R2_012.fastq
    J06643_NoIndex_L002_R2_013.fastq
    J06643_NoIndex_L002_R2_014.fastq

    It would seem logical that R1 is one end of the pair, and that R2 is the other. However, when I look at each set of files, I do not see the "/1" and "/2" designations. (according to this site, they should be there: http://loblolly.ucdavis.edu/bipod/ft...al_RNA-Seq.pdf)

    R1_001:

    <#0@@#############################################
    @D3NH4HQ1:710G1KACXX:2:1101:1488:2217 1:N:0:
    GTAAGGGCAAGGGCACTGAGCTATGTCATCTGGGCTCAAATTCTGCTACC
    +
    B@@FFFFFHHHHHJJJIJJJJJIJJIIGIIIJIJJGIGGIIIGJIEIIIH
    @D3NH4HQ1:710G1KACXX:2:1101:1279:2224 1:Y:0:
    GGCTTATTTGATACTCATGGTACAGAAGCGACGATCAAATAGATTGAGAA

    R2_001:

    ###4##22ADFHG#####################################
    @D3NH4HQ1:710G1KACXX:2:1101:2135:2174 2:N:0:
    NNGATGCAGGTGGCNNGGANNNNNNNNCGCCATNNTGCCTNNNNNNNNNN
    +
    ##14A?DBD<CACB##42<########11??FE##00?B@##########
    @D3NH4HQ1:710G1KACXX:2:1101:2088:2176 2:N:0:
    NNTGTTGTCACTTTNNAGANNNNNNNNTTGCTATNAAGCTNNNNNNNNNN

    Does this mean the data are not paired end?

  • #2
    I'm not sure what the exact specifics of the new format are, but the 1:N:0 or 2:N:0 in the header denotes what /1 and /2 used to. This Wikipedia page is helpful:



    • #3

      It would seem logical that R1 is one end of the pair, and that R2 is the other. However, when I look at each set of files, I do not see the "/1" and "/2" designations. (according to this site, they should be there: http://loblolly.ucdavis.edu/bipod/ft...al_RNA-Seq.pdf)
      What did your parents (or teachers) tell you about not trusting everything you read on the internet?

      The Illumina specs have changed back and forth a couple of times in the last several months. It looks like you received files from the time that they decided to remove the '/1' and '/2' designations. Instead look at the first number after the white space:

      @D3NH4HQ1:710G1KACXX:2:1101:2088:2176 2:N:0:
      The above is an R2 read.
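
A minimal sketch of reading that field programmatically, assuming headers laid out like the ones quoted in this thread (an ID, white space, then colon-separated fields starting with the read number). The names `ReadHeader`, `parse_header` and `are_mates` are made up for illustration, and treating "Y" in the second field as "read was filtered" follows Illumina's description of this header style as I understand it.

```python
from typing import NamedTuple


class ReadHeader(NamedTuple):
    read_id: str       # everything before the white space, shared by both mates
    read_number: int   # 1 for an R1 read, 2 for an R2 read
    is_filtered: bool  # True when the second field is "Y" (assumed meaning)


def parse_header(line: str) -> ReadHeader:
    """Split a header such as '@ID 2:N:0:' into its parts."""
    read_id, _, comment = line.strip().lstrip("@").partition(" ")
    fields = comment.split(":")
    return ReadHeader(read_id, int(fields[0]), fields[1] == "Y")


def are_mates(header_a: str, header_b: str) -> bool:
    """True when two headers share an ID and are read 1 and read 2."""
    a, b = parse_header(header_a), parse_header(header_b)
    return a.read_id == b.read_id and {a.read_number, b.read_number} == {1, 2}
```

Calling `parse_header` on the header above gives a read number of 2. Running `are_mates` on the Nth header of an R1 file and the Nth header of the matching R2 file is a quick way to confirm the two files really are paired and in the same order.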



      • #4
        Great. That makes total sense.

        The first thing I would like to do is subtract all human sequences from the data. We are only interested in viruses. I have attempted this with the following process. Does this look correct?

        2. Each set of R1 and R2 files was concatenated using the following command, producing one R1 FASTQ file and one R2 FASTQ file.
        a. cat J06643_NoIndex_L002_R1_001.fastq J06643_NoIndex_L002_R1_002.fastq J06643_NoIndex_L002_R1_003.fastq J06643_NoIndex_L002_R1_004.fastq J06643_NoIndex_L002_R1_005.fastq J06643_NoIndex_L002_R1_006.fastq J06643_NoIndex_L002_R1_007.fastq J06643_NoIndex_L002_R1_008.fastq J06643_NoIndex_L002_R1_009.fastq J06643_NoIndex_L002_R1_010.fastq J06643_NoIndex_L002_R1_011.fastq J06643_NoIndex_L002_R1_012.fastq J06643_NoIndex_L002_R1_013.fastq J06643_NoIndex_L002_R1_014.fastq > J06_R1.fastq

        3. Illumina adapters and low-quality bases were trimmed using cutadapt.
        a. cutadapt -f fastq -q 20 -a AGATCGGAAGAGC J06_R1.fastq > ./J06_R1_trimmed.fastq

        4. The trimmed reads were aligned with Bowtie against hg19 to subtract all human sequences.
        a. bowtie --un J06_subtracted.fastq -p 8 --chunkmbs 512 hg19 -1 J06_R1_trimmed.fastq -2 J06_R2_trimmed.fastq J06.sam



        • #5
          cat *R1*.fastq > J06_R1.fq
          Probably would have worked just as well, with a lot less typing.
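
Since the original poster mentioned Python, here is a rough equivalent of the glob approach, as a sketch only. The function name `concatenate` is made up for illustration. It sorts the matching names so the zero-padded chunks (001, 002, ...) are joined in the same order for R1 and R2, and it skips the output file in case it happens to match the pattern.

```python
import glob
import os
import shutil
from typing import List


def concatenate(pattern: str, destination: str) -> List[str]:
    """Append every file matching pattern to destination, in sorted name order."""
    dest = os.path.abspath(destination)
    # Sort so that R1 and R2 chunks are joined in the same order,
    # and leave out the destination if the pattern happens to match it.
    sources = sorted(
        path for path in glob.glob(pattern) if os.path.abspath(path) != dest
    )
    with open(destination, "wb") as out:
        for path in sources:
            with open(path, "rb") as chunk:
                shutil.copyfileobj(chunk, out)
    return sources
```

For example, `concatenate("*R1*.fastq", "J06_R1.fq")` followed by the same call with `R2` would build the two combined files. Keeping the order identical matters because paired-end aligners expect the Nth read in each file to be mates.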

          If you know the virus you expect to see, it might work slightly better if you align against a genome that has human sequence and virus sequence together. You'll have to make the index for that yourself, rather than downloading the pre-made one. You can then filter the .bam for the lines that aligned to virus.

          But that won't make a very big difference.
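
The filtering step could be sketched like this, assuming the alignments are available as SAM text (a .bam would first need converting, for instance with `samtools view -h`). The function name and the reference names passed in are hypothetical; the only format fact relied on is that the third tab-separated column of a SAM alignment line is the reference name and that header lines start with "@".

```python
from typing import Iterable, Iterator, Set


def keep_virus_alignments(sam_lines: Iterable[str], virus_names: Set[str]) -> Iterator[str]:
    """Yield SAM header lines plus alignments whose reference is in virus_names."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header lines pass through unchanged
            continue
        fields = line.split("\t")
        # Column 3 (index 2) of an alignment line is the reference sequence name.
        if len(fields) > 2 and fields[2] in virus_names:
            yield line
```

With a combined human-plus-virus index, `virus_names` would hold whatever sequence names were given to the virus genomes when the index was built.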



          • #6
            Hah, thanks. That would've saved me some time.

            How about the cutadapt and bowtie commands?

            For cutadapt, is -q 20 appropriate? Did I select the right adapter sequence, and is there a way to make sure of this?
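
One rough way to sanity-check the adapter choice is to count how many untrimmed reads contain the sequence at all. This is only a sketch: `adapter_fraction` is a made-up helper, it assumes plain four-line FASTQ records, and it only finds complete, error-free copies of the adapter, so partial adapters at the 3' end of a read are not counted.

```python
from typing import Iterable, Tuple


def adapter_fraction(fastq_lines: Iterable[str], adapter: str) -> Tuple[int, int]:
    """Return (reads containing adapter, total reads) for 4-line FASTQ records."""
    hits = total = 0
    for index, line in enumerate(fastq_lines):
        if index % 4 == 1:  # the second line of each record is the sequence
            total += 1
            if adapter in line:
                hits += 1
    return hits, total
```

For instance, opening `J06_R1.fastq` and calling `adapter_fraction(handle, "AGATCGGAAGAGC")` gives the two counts. A count of zero across millions of reads would suggest the wrong sequence; with 50bp reads and typical insert sizes a small nonzero fraction would not be surprising.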

            For Bowtie, do I need to alter the "maxins" parameter? My reads are 50bp, and the default maxins parameter is 250.

            Right now, Bowtie is outputting some blank and incomplete reads. Is that normal, and will it screw up the assembly step?

            For example, here are the first few lines of the R1 bowtie output:

            @D3NH4HQ1:710G1KACXX:2:1101:1233:2172 1:Y:0:
            A
            +
            <
            @D3NH4HQ1:710G1KACXX:2:1101:1406:2044 1:Y:0:
            AAAA
            +
            <<<@
            @D3NH4HQ1:710G1KACXX:2:1101:1317:2025 1:Y:0:
            AGCT
            +
            <<<?
            @D3NH4HQ1:710G1KACXX:2:1101:15237:2000 1:Y:0:

            +

            @D3NH4HQ1:710G1KACXX:2:1101:15197:2000 1:Y:0:

            +

            @D3NH4HQ1:710G1KACXX:2:1101:15556:2000 1:Y:0:

            +

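
If the blank records turn out to be reads that were trimmed down to little or nothing (an assumption; the thread does not confirm the cause), one option is to drop too-short reads from both files together so the pairs stay in step. The sketch below assumes the two files contain the same reads in the same order and use four-line FASTQ records. The names `read_records` and `filter_pairs` and the cutoff of 20 bases are illustrative choices, not recommendations from the thread.

```python
from itertools import islice
from typing import Iterator, List, TextIO, Tuple


def read_records(handle: TextIO) -> Iterator[List[str]]:
    """Yield FASTQ records as lists of four lines; stop at an incomplete record."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def filter_pairs(r1: TextIO, r2: TextIO, out1: TextIO, out2: TextIO,
                 min_length: int = 20) -> Tuple[int, int]:
    """Write pairs where both mates have at least min_length bases.

    Returns (pairs kept, pairs dropped). A pair is dropped as a whole so the
    two output files stay in the same order.
    """
    kept = dropped = 0
    for rec1, rec2 in zip(read_records(r1), read_records(r2)):
        shortest = min(len(rec1[1].strip()), len(rec2[1].strip()))
        if shortest >= min_length:
            out1.writelines(rec1)
            out2.writelines(rec2)
            kept += 1
        else:
            dropped += 1
    return kept, dropped
```

The returned counts give a quick sense of how much data the trimming step is costing.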