Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Data Analysis Suggestion for a Newbie?

    Hi all, I am a newbie in Bioinformatics. I've got data for paired-end DNA sequencing on Illumina platform and I am wondering if anyone is kind enough to give me some suggestions on how to start with the analyzing part. It is in fastq format, with average of 2 millions reads (~50bp) each for read 1 and read 2.

    For preprocessing, I will not trim any of the reads (for downstream analysis purposes) but I am concerned about the low quality. It's either I trim all the reads with a fixed length, say...5bp(?), or I just leave them that way. The next I try to filter reads with at least 70% Q20 (any comment?)

    Then I will have to map all these processed reads with a reference genome. How do I combine read 1 and read 2 and map them as a single read? I was told to use <cat> but is it the right way? How do I specify the gap between read 1 and read 2 then?

    And for mapping, someone suggested me to use Tophat (any others?) and I did a prerun with the reads(after command cat) and map it to a reference (built with bowtie previously with reference.fasta) and I've got a bam file.

    Here's the important part. How do I merge all the same repeated reads into one? After this, is there a way to calculate how many unique reads are there per gene? By unique I mean reads which start/mapped at different site, in one gene. Kinda similar to determining expression profile but maybe a little different.


    ABCDEFGHIJKLMNOPQRSTUVWXYZ (reference gene)
    CDEFGHIJ (1)
    1234ABCDEFG (2)
    KLMNOPQR (3)
    UVWXYZ (4)
    XYZ1234567 (5)
    MNOPQRST (6)

    For example, unique reads = 5.
    Read 2: 1234ABCDEFG is not counted because it doesn't start from that gene.
    Read 5: XYZ1234567 is counted because it starts from the gene region, although the sequence continues to span the adjacent region.

    Is there a way to do so and how?

    Sorry if this is getting too long. I just need some ideas on what kind of tools should I use, because by far I only know the names of a few like fastx toolkit and bowtie and tophat, I know there are many more but I am not familiar with their functions and etc. It will be great if anyone can give me a brief guideline, very much appreciated!

    Have a nice day!

  • #2
    For starters, don't worry about quality filtering your reads.

    Software like bwa and bowtie are designed to take two separate fastqs for paired end data. If you cat them together, that's like submitting a single end file. Which is fine, but you lose the benefit of paired end reads. As long as the order of the reads in the two separate files is undisturbed, bwa and bowtie will understand that they are paired.

    You can get rid of exact duplicate pairs with samtools rmdup, or Picard's MarkDuplicates.

    samtools idxstats will tell you how many reads there are per sequence in your reference.

    Comment


    • #3
      Thanks for the reply!

      But would you mind further explain what do you mean by "lose the benefit of paired end reads"? Does it mean it will be better if I don't cat the two files into one and just let bwa or bowtie to take them separately since bwa/bowtie can recognize them as pair?

      Comment


      • #4
        Yes, this is what he means.

        Comment


        • #5
          Hi Krispy,

          The Rsubread package (http://www.bioconductor.org/packages.../Rsubread.html) seems to be able to answer most of your questions, if you know how to program in R. Have a look at its vignette that describes that this package can do.

          Cheers,
          Wei

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Strategies for Sequencing Challenging Samples
            by seqadmin


            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
            03-22-2024, 06:39 AM
          • seqadmin
            Techniques and Challenges in Conservation Genomics
            by seqadmin



            The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

            Avian Conservation
            Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
            03-08-2024, 10:41 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, Yesterday, 06:37 PM
          0 responses
          11 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, Yesterday, 06:07 PM
          0 responses
          10 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 03-22-2024, 10:03 AM
          0 responses
          51 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 03-21-2024, 07:32 AM
          0 responses
          68 views
          0 likes
          Last Post seqadmin  
          Working...
          X