SEQanswers

Old 06-08-2018, 01:48 AM   #1
eadonyo
Junior Member
 
Location: South Africa

Join Date: Jun 2018
Posts: 4
Post NGS seq. Analysis

Hello,
I have a question I need help with. I have these sequences in a FASTA file.
Code:
>HG2FEE201A723Q	SAMPLE=USERID-19_JOBID-10_HG2FEE201_166281C_MID21	GENE=PR	STRAND=-	NOTRIM_LEN=512	Mean:33	Len:497	Trimmedat5':0	Trimmedat3':5 	AlignmentScore: 21630	AmpliconCoverage: 402	FullCoverage: Y
---CTTGTCTCAAT-AAGGTAGGGGGCCA---GATAAGGGAGGCTCTCTTAGACACAGGAGCAGATGATACAGTATTAGAAGAAATAAGTTTGCCAGGAAAATGGAAACCAAAAATGATAGGGGGAATTGGAGGTTTTATCAAAGTAAGACAGTATGATCAAGTACCTATAGAAATTTGTGGAAAAAAGGCTATAGGCACAGTATTAATAGGACCTACACCTATCAACATAATTGGAAGGAATATGTTGACTCAACTTGGATGCACACTAAATTTTCCAATTAGTCCCATTGAAACTGTACCAGTAAAATTAAAGCCAGGAATGGATGGCCCAAAGGTCAAACAATGGCCATTGACAGAAGAGAAAATAAAAGCATTAACAGC---A---ATTTGTGAAGA---AATGGAGAAGGAA
>HG2FEE201B2MWP	SAMPLE=USERID-19_JOBID-10_HG2FEE201_166281C_MID21	GENE=PR	STRAND=+	NOTRIM_LEN=544	Mean:31	Len:450	Trimmedat5':0	Trimmedat3':61 	AlignmentScore: 19950	AmpliconCoverage: 402	FullCoverage: Y
Questions

1) How many genes are represented in this data, and how many sequences are there for each sequenced gene?

2) What is the average read length before and after trimming (denoted by NOTRIM_LEN and Len respectively)?

3) Are any of the DNA sequences in the file identical to each other, and if so what is the highest number of identical sequences? (Hint: sort isn’t just for numbers!)

Last edited by GenoMax; 06-08-2018 at 04:18 AM.
Old 06-08-2018, 04:17 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,795
Default

Do you know how this file was generated?

It looks like this may actually be an aligned FASTA file, so it would not be straightforward to identify how many duplicate sequences there are.
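
One possible reason, sketched here under assumptions: in an aligned file the same read can be padded with "-" gap characters in different places, so a plain text comparison would treat the copies as different. The file name toy_aligned.fasta and its two records are made up for illustration. Whether gaps should be stripped at all depends on how the file was generated.

```shell
# Toy aligned FASTA (hypothetical data): the same bases, padded differently
printf '>read-a\n--ACGT\n>read-b\nACGT--\n' > toy_aligned.fasta

# Remove "-" on sequence lines only; header lines (which may also
# contain "-", e.g. STRAND=-) are left untouched
degapped=$(sed '/^>/!s/-//g' toy_aligned.fasta)

printf '%s\n' "$degapped"
```

After this step both toy records carry the same sequence text, while the headers keep their dashes.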
Old 06-08-2018, 04:30 AM   #3
eadonyo
Junior Member
 
Location: South Africa

Join Date: Jun 2018
Posts: 4
Default

I have the whole FASTA file; I just copied the first three lines using the Bash command head -3.
We were asked to use Bash commands to find the genes and sequences.
Old 06-08-2018, 05:09 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,795
Default

Is this an assignment?
Old 06-08-2018, 05:15 AM   #5
r.rosati
Member
 
Location: Brazil

Join Date: Aug 2015
Posts: 69
Default

Paraphrasing Stack Overflow, "What have you tried so far?"
Old 06-08-2018, 05:15 AM   #6
eadonyo
Junior Member
 
Location: South Africa

Join Date: Jun 2018
Posts: 4
Default

Yes, it is a class assignment. That is why I posted only the first three lines of a file of about 1.6 MB. What I am looking for is how to identify and count genes within sequences of nucleotides.
Old 06-08-2018, 05:21 AM   #7
eadonyo
Junior Member
 
Location: South Africa

Join Date: Jun 2018
Posts: 4
Default

This is what I have done so far:


Put all the sequences on one line:
awk '{printf /^>/ ? $0 :$0}' BigData1.fasta

and break the lines using the ">" separator:

awk '{printf /^>/ ? "\n" $0 :$0}' BigData1.fasta

Now I can count word occurrences using wc -w and wc -l in Bash.
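
For comparison, here is a minimal sketch of one common way to put each record on exactly two lines (header, then the joined sequence). It assumes the sequences might be wrapped across several lines; the file names toy_wrapped.fasta and toy_linear.fasta and their contents are made up for illustration.

```shell
# Toy FASTA (hypothetical data) whose first sequence is wrapped over two lines
printf '>r1\tGENE=PR\nACGT\nTTGA\n>r2\tGENE=PR\nGGCC\n' > toy_wrapped.fasta

# Print each header on its own line and join the sequence lines that follow it.
# "%s" is used as the printf format so the data is never treated as a format string.
awk '/^>/ { if (NR > 1) print ""; print; next }
     { printf "%s", $0 }
     END { print "" }' toy_wrapped.fasta > toy_linear.fasta

cat toy_linear.fasta
```

Keeping the header and the sequence on separate lines means line-based tools such as grep and wc -l can still tell them apart.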
Old 06-13-2018, 04:40 AM   #8
finswimmer
Member
 
Location: Europe

Join Date: Oct 2016
Posts: 54
Default

Hello,

Before you can start to answer your questions, you have to get familiar with the file format. Let's analyse the format you showed us.

In a FASTA file, each sequence record consists of a header line that begins with ">", followed by one or more lines containing the sequence itself. In your case it seems that each sequence is on a single line.

The header line of each sequence carries several pieces of information, arranged in columns delimited by tabs. It seems that the same kind of information is always in the same column.

So whenever we want to extract information from the header, we have to look for lines that start with ">". If we are interested in the sequence, we need the lines without ">".
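
A minimal sketch of that split, using a made-up two-record file (toy.fasta, its read names and the gene name RT are assumptions for illustration, not taken from the real data):

```shell
# Toy FASTA (hypothetical data) with tab-delimited headers like the one shown in the thread
printf '>read1\tSAMPLE=s1\tGENE=PR\n--ACGT\n>read2\tSAMPLE=s1\tGENE=RT\nACGT--\n' > toy.fasta

# Header lines: everything that starts with ">"
headers=$(grep '^>' toy.fasta)

# Sequence lines: everything that does not start with ">"
sequences=$(grep -v '^>' toy.fasta)

printf '%s\n' "$headers"
printf '%s\n' "$sequences"
```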

Let's have a look at your first question:

1) How many genes are represented in this data and how many sequences are there for each sequenced gene.

The gene name is
  1. in the header line
  2. in the third column
  3. prefixed with "GENE="
  4. not unique: the same gene name can occur multiple times

One way to get the list of distinct names is this:

Code:
grep "^>" your.fasta|cut -f3|sed 's/GENE=//'|sort -u > genes.txt
grep finds all lines starting with ">", cut selects the third column, sed removes the "GENE=" prefix, leaving just the gene name, and sort -u sorts the names and removes duplicates.

With this list of gene names we can answer the second part of the question. We need to iterate over the list and count the lines that contain each gene name.

Code:
for gene in $(cat genes.txt); do echo $gene; grep -wc "GENE=$gene" your.fasta; done|paste - -
paste is used to show each gene name and its count on one row.
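
As an alternative sketch (a different technique from the loop, not a replacement for it), the per-gene counts can also be produced in a single pass with sort and uniq -c. The file toy_genes.fasta and the gene name RT are made up for illustration.

```shell
# Toy FASTA (hypothetical data): two PR reads and one RT read
printf '>r1\tS=a\tGENE=PR\nACGT\n>r2\tS=a\tGENE=RT\nACGT\n>r3\tS=a\tGENE=PR\nGGCC\n' > toy_genes.fasta

# Take the third header column, strip the prefix, then count each distinct name.
# uniq -c only merges adjacent duplicates, hence the sort before it.
counts=$(grep '^>' toy_genes.fasta | cut -f3 | sed 's/^GENE=//' | sort | uniq -c)

printf '%s\n' "$counts"
```

This reads the file once instead of once per gene, which only matters when there are many distinct gene names.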

2) What is the average read length before and after trimming (denoted by NOTRIM_LEN and Len respectively)

I showed you above how to extract values for each read, so I will not post a full solution here. The extracted read lengths can be piped to awk, which can calculate the average read length.

Code:
[extracted_read_length]|awk '{ total += $1; } END { print total/NR }'
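
To see the averaging step on its own, here is a small sketch fed with three made-up numbers instead of real read lengths (the values 10, 20 and 60 are assumptions for illustration; the extraction step is still left to you). The NR check is an addition that avoids a division by zero on empty input.

```shell
# Three hypothetical read lengths, one per line, piped into the same awk idea
avg=$(printf '10\n20\n60\n' | awk '{ total += $1 } END { if (NR > 0) print total / NR }')

echo "$avg"
```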
3) Are any of the DNA sequences in the file identical to each other, and if so what is the highest number of identical sequences? (Hint: sort isn’t just for numbers!)

As this is an assignment, I am giving you just some hints. Check the man pages for grep, sort and uniq for helpful options.
fin swimmer

Last edited by finswimmer; 06-13-2018 at 05:35 AM.
All times are GMT -8.