Consensus and Profile

Problem
A matrix is a rectangular table of values divided into rows and columns. An m×n matrix has m rows and n columns. Given a matrix A, we write A_i,j to indicate the value found at the intersection of row i and column j.

Say that we have a collection of DNA strings, all having the same length n. Their profile matrix is a 4×n matrix P in which P_1,j represents the number of times that 'A' occurs in the jth position of one of the strings, P_2,j represents the number of times that 'C' occurs in the jth position, and so on (see below).

A consensus string c is a string of length n formed from our collection by taking the most common symbol at each position; the jth symbol of c therefore corresponds to the symbol having the maximum value in the jth column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.

		A T C C A G C T
		G G G C A A C T
		A T G G A T C T
DNA Strings	A A G C A A C C
		T T G G A A C T
		A T G C C A T T
		A T G G C A C T

	    A   5 1 0 0 5 5 0 0
Profile	    C   0 0 1 4 2 0 6 1
	    G   1 1 6 3 0 1 0 0
	    T   1 5 0 0 0 1 1 6

Consensus	A T G C A A C T

Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.
Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)

Sample Dataset
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT

Sample Output
ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6
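
Before any file parsing is involved, the definitions above can be checked on in-memory strings. The following is only a minimal sketch: the function name profile_and_consensus and its parameters are made up for illustration, the profile is stored as a flat array (row r, column c at index r * n + c), and ties are broken in favour of the first symbol in "ACGT" order, which is an arbitrary choice that the problem statement permits.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the original post.
   strings: count DNA strings, each of length n.
   profile: caller-provided array of 4 * n ints (rows A, C, G, T).
   cons:    caller-provided buffer with room for n + 1 chars. */
void profile_and_consensus(const char *strings[], size_t count, size_t n,
                           int *profile, char *cons) {
  static const char NT[] = "ACGT";
  size_t s, c, r, best;

  /* start from an all-zero profile */
  memset(profile, 0, 4 * n * sizeof(int));

  /* count each symbol at each position; other characters are ignored */
  for (s = 0; s < count; s++) {
    for (c = 0; c < n; c++) {
      const char *hit = strchr(NT, strings[s][c]);
      if (hit != NULL && *hit != '\0') {
        profile[(size_t)(hit - NT) * n + c]++;
      }
    }
  }

  /* the consensus symbol is the row with the largest count in each column;
     the strict '>' means ties go to the earlier row */
  for (c = 0; c < n; c++) {
    best = 0;
    for (r = 1; r < 4; r++) {
      if (profile[r * n + c] > profile[best * n + c]) {
        best = r;
      }
    }
    cons[c] = NT[best];
  }
  cons[n] = '\0';
}
```

Called with the seven sample strings, count = 7 and n = 8, this fills the same counts as the profile shown in the sample output.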

FASTA format was introduced previously in Computing GC Content, where there was no need to store the sequences. This problem is a little more complicated because every sequence has to be kept in memory. Storage is easy to handle in high-level languages such as R, which allocate memory dynamically.

Here, I will use C to implement a readFASTA function. The FASTA file for this problem is easy to parse because all its description lines have the same length, as do all its sequence lines. That is not true in general: description and sequence lines commonly have unequal lengths, and a sequence can be split across several lines by newline characters. A general-purpose function should handle all of these cases.

If we store the description and sequence lines in character arrays of a predefined size, the drawback is obvious: if the array is too small, the function loses generality, and if it is large, the function wastes a lot of memory.

To solve the storage issue, I use a linked list to store the characters of each description and sequence, copy them into a character array once the length is known, and pack the arrays into a SEQ structure. The number of sequences in a FASTA file is also unknown in advance, so the SEQ structures are organized as a linked list too. It is easy to implement a wrapper function that returns the SEQ structures as a pointer array if necessary.
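
The wrapper mentioned above could look roughly like the following. This is a minimal sketch rather than code from the original post: the name SEQ_list_to_array and its count parameter are made up for illustration, and it assumes that the list starts with a dummy head node (as readFASTA returns it) and that the next pointer of the last record is NULL.

```c
#include <stdlib.h>

typedef struct SEQ {
  char * desc;
  char * seq;
  int width;
  struct SEQ * next;
} SEQ;

/* Hypothetical wrapper: collects the records that follow the dummy head
   into a malloc'ed array of pointers and stores their number in *count.
   The records are shared with the list, not copied. Returns NULL when
   the list is empty or the allocation fails. */
SEQ ** SEQ_list_to_array(SEQ *head, int *count) {
  int n = 0, i;
  SEQ *curr;
  SEQ **array;

  /* first pass: count the records */
  for (curr = head->next; curr != NULL; curr = curr->next) {
    n++;
  }
  *count = n;
  if (n == 0) {
    return NULL;
  }

  array = malloc(sizeof(SEQ *) * n);
  if (array == NULL) {
    *count = 0;
    return NULL;
  }

  /* second pass: store one pointer per record, in file order */
  for (i = 0, curr = head->next; curr != NULL; curr = curr->next, i++) {
    array[i] = curr;
  }
  return array;
}
```

The caller owns the returned array and should free() it, while the records themselves remain owned by the list.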

Here is readFASTA.h:

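#include <stdio.h>
#include <stdlib.h>

typedef struct SEQ {
  char * desc;        // description line without the leading '>'
  char * seq;         // sequence with line breaks removed
  int width;          // sequence length
  struct SEQ * next;  // next record, NULL for the last one
} SEQ;

typedef struct CHLL {
  // CHaracter Link-List
  char ch;
  struct CHLL * next;
} CHLL;


SEQ * readFASTA(char *filename);

// append ch after the node tail and return the new tail
static CHLL * CHLL_append(CHLL *tail, char ch) {
  CHLL *CHLL_NODE = malloc(sizeof(CHLL));
  CHLL_NODE->ch = ch;
  CHLL_NODE->next = NULL;
  tail->next = CHLL_NODE;
  return CHLL_NODE;
}

// copy the n characters stored after the dummy head into a
// NUL-terminated string and free the list nodes
static char * CHLL_to_string(CHLL *CHLL_head, int n) {
  char *str = malloc(sizeof(char) * (n + 1));
  CHLL *CHLL_curr = CHLL_head->next;
  int i;
  for (i=0; i < n; i++) {
    CHLL *CHLL_next = CHLL_curr->next;
    str[i] = CHLL_curr->ch;
    free(CHLL_curr);
    CHLL_curr = CHLL_next;
  }
  str[n] = '\0';
  CHLL_head->next = NULL;
  return str;
}

// returns a dummy head node; the records start at head->next
SEQ * readFASTA(char *filename) {
  FILE *INFILE;
  INFILE = fopen(filename, "r");
  if (INFILE == NULL) {
    perror(filename);
    return NULL;
  }

  SEQ *head = malloc(sizeof(SEQ));
  head->desc = NULL;
  head->seq = NULL;
  head->width = 0;
  head->next = NULL;
  SEQ *curr;
  curr = head;

  CHLL CHLL_head = {0, NULL};  // dummy head, reused for every line
  CHLL *CHLL_curr;

  int ch;  // int rather than char, so that EOF is distinct from any data
  int i;

  // skip anything that precedes the first description line
  while( (ch = fgetc(INFILE)) != '>' && ch != EOF) {
    continue;
  }

  // each pass starts just after a '>' and reads one record
  while(ch == '>') {
    SEQ *SEQ_NODE = malloc(sizeof(SEQ));

    // description line
    CHLL_curr = &CHLL_head;
    i = 0;
    while( (ch = fgetc(INFILE)) != '\n' && ch != EOF) {
      if (ch == '\r') {
        continue;
      }
      CHLL_curr = CHLL_append(CHLL_curr, (char) ch);
      i++;
    }
    SEQ_NODE->desc = CHLL_to_string(&CHLL_head, i);

    // sequence lines
    // re-initialize the CHLL link list to store sequence characters
    CHLL_curr = &CHLL_head;
    i = 0;
    while( (ch = fgetc(INFILE)) != '>' && ch != EOF) {
      if (ch == '\n' || ch == '\r') {
        continue;
      }
      CHLL_curr = CHLL_append(CHLL_curr, (char) ch);
      i++;
    }
    SEQ_NODE->seq = CHLL_to_string(&CHLL_head, i);
    SEQ_NODE->width = i;
    SEQ_NODE->next = NULL;

    curr->next = SEQ_NODE;
    curr = SEQ_NODE;
  }

  fclose(INFILE);
  return head;
}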

This problem is very easy once the readFASTA function has been implemented: read the file, count the nucleotides at each position, and print the result.

Although the sequences here all have the same length, the main function does not assume so: it sizes the profile by the longest sequence.

#include <stdio.h>
#include <stdlib.h>
#include "include/readFASTA.h"

int main() {
  SEQ * head, *curr;
  head = readFASTA("../DATA/rosalind_cons.txt");
  if (head == NULL) {
    fprintf(stderr, "cannot read the FASTA file\n");
    return 1;
  }

  // the profile needs as many columns as the longest sequence
  curr = head;
  int width=0;
  while((curr = curr->next) != NULL) {
    if (width < curr->width) {
      width = curr->width;
    }
  }
  if (width == 0) {
    fprintf(stderr, "no sequence data found\n");
    return 1;
  }

  int NT_type = 4; // ACGT
  int cons[NT_type][width]; // profile matrix, rows A, C, G, T
  int r, c;
  for (r=0; r < NT_type; r++) {
    for (c=0; c < width; c++) {
      cons[r][c] = 0;
    }
  }

  // count each nucleotide at each position
  curr=head;
  int i;
  while((curr = curr->next) != NULL) {
    for (i=0; i < curr->width; i++) {
      switch((curr->seq)[i]) {
      case 'A':
        cons[0][i]++;
        break;
      case 'C':
        cons[1][i]++;
        break;
      case 'G':
        cons[2][i]++;
        break;
      case 'T':
        cons[3][i]++;
        break;
      }
    }
  }

  // consensus: the most common nucleotide in each column;
  // ties go to the first one in ACGT order
  const char NT[] = "ACGT";
  int max_r_idx, max_r;
  for (c=0; c < width; c++) {
    max_r = 0;
    max_r_idx = 0;
    for (r=0; r < NT_type; r++) {
      if (max_r < cons[r][c]) {
        max_r = cons[r][c];
        max_r_idx = r;
      }
    }
    printf("%c", NT[max_r_idx]);
  }
  printf("\n");

  // profile matrix, one row per nucleotide
  for (r=0; r < NT_type; r++) {
    printf("%c: ", NT[r]);
    for (c=0; c < width; c++) {
      printf("%d ", cons[r][c]);
    }
    printf("\n");
  }
  return 0;
}
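
The main function leaves the list allocated until the program exits, which is harmless in a program this short. If readFASTA were reused in something longer-running, a cleanup helper along these lines might be added. This is only a sketch: freeFASTA is a hypothetical name, and it assumes that desc and seq were allocated with malloc, that the list starts with a dummy head node, and that the next pointer of the last record is NULL.

```c
#include <stdlib.h>

typedef struct SEQ {
  char * desc;
  char * seq;
  int width;
  struct SEQ * next;
} SEQ;

/* Hypothetical cleanup helper: frees the dummy head and every record
   after it, including the desc and seq strings. Returns the number of
   records freed. */
int freeFASTA(SEQ *head) {
  int n = 0;
  SEQ *curr, *next;

  if (head == NULL) {
    return 0;
  }

  /* the dummy head owns no strings, so only the node itself is freed */
  curr = head->next;
  free(head);

  while (curr != NULL) {
    next = curr->next;  /* remember the successor before freeing the node */
    free(curr->desc);
    free(curr->seq);
    free(curr);
    curr = next;
    n++;
  }
  return n;
}
```

After the call, neither the head pointer nor any pointer taken from the list may be used again.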
