Problem

A matrix is a rectangular table of values divided into rows and columns. An m×n matrix has m rows and n columns. Given a matrix A, we write A_{i,j} to indicate the value found at the intersection of row i and column j.

Say that we have a collection of DNA strings, all having the same length n. Their profile matrix is a 4×n matrix P in which P_{1,j} represents the number of times that 'A' occurs in the jth position of one of the strings, P_{2,j} represents the number of times that 'C' occurs in the jth position, and so on (see below).

A consensus string c is a string of length n formed from our collection by taking the most common symbol at each position; the jth symbol of c therefore corresponds to the symbol having the maximum value in the jth column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.

                      A T C C A G C T
                      G G G C A A C T
                      A T G G A T C T
    DNA Strings       A A G C A A C C
                      T T G G A A C T
                      A T G C C A T T
                      A T G G C A C T

                  A   5 1 0 0 5 5 0 0
    Profile       C   0 0 1 4 2 0 6 1
                  G   1 1 6 3 0 1 0 0
                  T   1 5 0 0 0 1 1 6

    Consensus         A T G C A A C T

Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)

Sample Dataset

    >Rosalind_1
    ATCCAGCT
    >Rosalind_2
    GGGCAACT
    >Rosalind_3
    ATGGATCT
    >Rosalind_4
    AAGCAACC
    >Rosalind_5
    TTGGAACT
    >Rosalind_6
    ATGCCATT
    >Rosalind_7
    ATGGCACT

Sample Output

    ATGCAACT
    A: 5 1 0 0 5 5 0 0
    C: 0 0 1 4 2 0 6 1
    G: 1 1 6 3 0 1 0 0
    T: 1 5 0 0 0 1 1 6
The FASTA format was introduced previously in Computing GC content, where there was no need to store the sequences. This problem is a little more complicated, because the sequences have to be kept in memory. The storage issue is easy to solve in a high-level language such as R, which handles dynamic memory allocation for you.
Here I will use C to implement a readFASTA function. The FASTA file for this problem is easy to parse, because all its description lines have the same length and so do all its sequences. That is not true of real data: description lines and sequences commonly differ in length, and a sequence may be wrapped over several lines, so it contains newline characters. A general-purpose function has to handle all of these cases.
If we store each description and sequence in a character array of predefined size, the drawback is obvious. If the array is too small, the function loses generality; if it is large, the function wastes a lot of memory.
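The trade-off can be seen in a minimal sketch of the fixed-buffer approach. The function name read_line_fixed and the capacity LINE_CAP are hypothetical, chosen only for illustration; the capacity is deliberately small so that an ordinary sequence line no longer fits.

```c
#include <stdio.h>
#include <string.h>

#define LINE_CAP 16   /* hypothetical fixed capacity, kept small on purpose */

/* Read one line into a fixed buffer of cap bytes.
 * Returns 1 if the whole line fit (the newline is removed),
 *         0 if the line was cut off after cap-1 characters,
 *        -1 if nothing could be read (end of file or read error). */
int read_line_fixed(FILE *fp, char *buf, size_t cap)
{
    if (fgets(buf, (int)cap, fp) == NULL) {
        return -1;
    }

    size_t len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[len - 1] = '\0';          /* complete line: drop the newline */
        return 1;
    }

    /* No newline in the buffer. Peek at the next character to tell a line
     * that ended exactly at the buffer limit from one that was cut off. */
    int c = fgetc(fp);
    if (c == EOF || c == '\n') {
        return 1;
    }
    ungetc(c, fp);                    /* the rest of the line is still unread */
    return 0;
}
```

With LINE_CAP set to 16, a description such as >Rosalind_1 fits, but a 24-character sequence line is split after 15 characters and the caller has to stitch the pieces together. Raising LINE_CAP to cover the longest line anyone might supply reserves that much memory for every line, however short.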
To solve the storage issue, I read the characters of each description and sequence into a linked list, copy them into a character array once the length is known, and pack the arrays into a SEQ structure. The number of sequences in a FASTA file is also unknown in advance, so the SEQ structures are organized as a linked list too. If necessary, it is easy to implement a wrapper function that returns the SEQ structures as an array of pointers.
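Such a wrapper might look like the sketch below. The name SEQ_list_to_array is hypothetical, and the sketch assumes two things about the list: it starts with a placeholder node, so the first record is head->next, and the last record's next pointer is NULL. The SEQ layout is repeated here only so that the sketch compiles on its own.

```c
#include <stdlib.h>

/* Same fields as the SEQ structure in readFASTA.h. */
typedef struct SEQ {
    char *desc;
    char *seq;
    int width;
    struct SEQ *next;
} SEQ;

/* Hypothetical wrapper: collect the records that follow the placeholder
 * head node into a malloc'ed array of pointers. The records themselves are
 * not copied, so the array is only valid while the list exists.
 * The number of records is stored in *count. Returns NULL when the list
 * is empty or the allocation fails; the caller frees the array. */
SEQ **SEQ_list_to_array(SEQ *head, int *count)
{
    SEQ *curr;
    int n = 0, i = 0;

    /* first pass: count the records */
    for (curr = head->next; curr != NULL; curr = curr->next) {
        n++;
    }
    *count = n;
    if (n == 0) {
        return NULL;
    }

    SEQ **arr = malloc(sizeof(SEQ *) * n);
    if (arr == NULL) {
        *count = 0;
        return NULL;
    }

    /* second pass: store a pointer to each record */
    for (curr = head->next; curr != NULL; curr = curr->next) {
        arr[i++] = curr;
    }
    return arr;
}
```

The array gives direct access to the k-th record as arr[k], which the list alone cannot offer without walking it from the start.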
Here is readFASTA.h:
```c
#ifndef READFASTA_H
#define READFASTA_H

#include <stdio.h>
#include <stdlib.h>

/* Note: malloc failures are not checked, to keep the listing short. */

typedef struct SEQ {
    char *desc;            // description line, without the leading '>'
    char *seq;             // sequence, with line breaks removed
    int width;             // number of characters in seq
    struct SEQ *next;
} SEQ;

typedef struct CHLL {      // CHaracter Link-List
    char ch;
    struct CHLL *next;
} CHLL;

SEQ *readFASTA(const char *filename);

/* Append one character after tail and return the new tail node. */
static CHLL *CHLL_append(CHLL *tail, int ch)
{
    CHLL *CHLL_NODE = malloc(sizeof(CHLL));
    CHLL_NODE->ch = (char)ch;
    CHLL_NODE->next = NULL;
    tail->next = CHLL_NODE;
    return CHLL_NODE;
}

/* Copy n list characters into a NUL-terminated array, freeing each node. */
static char *CHLL_to_string(CHLL *node, int n)
{
    char *str = malloc(sizeof(char) * (n + 1));
    int i;
    for (i = 0; i < n; i++) {
        CHLL *next = node->next;
        str[i] = node->ch;
        free(node);
        node = next;
    }
    str[n] = '\0';
    return str;
}

/* Returns a list whose first node is a placeholder: the records start at
 * head->next and the last record has next == NULL.
 * Returns NULL if the file cannot be opened. */
SEQ *readFASTA(const char *filename)
{
    FILE *INFILE = fopen(filename, "r");
    if (INFILE == NULL) {
        return NULL;
    }

    SEQ *head = malloc(sizeof(SEQ));
    head->desc = NULL;
    head->seq = NULL;
    head->width = 0;
    head->next = NULL;
    SEQ *curr = head;

    int ch;                // int rather than char, so EOF is detected reliably
    int i;

    // skip anything before the first description line
    while ((ch = fgetc(INFILE)) != EOF && ch != '>') {
        ;
    }

    // each pass starts just after a '>' and reads one record
    while (ch == '>') {
        SEQ *SEQ_NODE = malloc(sizeof(SEQ));
        CHLL CHLL_head;    // placeholder in front of the first character
        CHLL *CHLL_curr;

        // description line
        CHLL_head.next = NULL;
        CHLL_curr = &CHLL_head;
        i = 0;
        while ((ch = fgetc(INFILE)) != '\n' && ch != EOF) {
            if (ch == '\r') {
                continue;
            }
            CHLL_curr = CHLL_append(CHLL_curr, ch);
            i++;
        }
        SEQ_NODE->desc = CHLL_to_string(CHLL_head.next, i);

        // sequence lines
        // re-initialize the CHLL linked list to store sequence characters
        CHLL_head.next = NULL;
        CHLL_curr = &CHLL_head;
        i = 0;
        while ((ch = fgetc(INFILE)) != '>' && ch != EOF) {
            if (ch == '\n' || ch == '\r') {
                continue;
            }
            CHLL_curr = CHLL_append(CHLL_curr, ch);
            i++;
        }
        SEQ_NODE->seq = CHLL_to_string(CHLL_head.next, i);
        SEQ_NODE->width = i;
        SEQ_NODE->next = NULL;

        curr->next = SEQ_NODE;
        curr = SEQ_NODE;
    }

    fclose(INFILE);
    return head;
}

#endif
```
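The list returned by readFASTA is never released in this post, which is harmless for a short program that exits right away. For longer-running code, a cleanup helper along the following lines could be added. The name freeFASTA is hypothetical, and the sketch assumes that the first node is a placeholder whose desc and seq were never allocated, that every other node holds malloc'ed desc and seq strings, and that the last node's next pointer is NULL. The SEQ layout is repeated so that the sketch compiles on its own.

```c
#include <stdlib.h>

/* Same fields as the SEQ structure in readFASTA.h. */
typedef struct SEQ {
    char *desc;
    char *seq;
    int width;
    struct SEQ *next;
} SEQ;

/* Hypothetical helper: release a list built by a readFASTA-style reader.
 * Returns the number of records freed, not counting the placeholder. */
int freeFASTA(SEQ *head)
{
    int n = 0;
    if (head == NULL) {
        return 0;
    }

    SEQ *curr = head->next;
    free(head);                 /* placeholder node: only the node itself */

    while (curr != NULL) {
        SEQ *next = curr->next; /* save the link before freeing the node */
        free(curr->desc);
        free(curr->seq);
        free(curr);
        curr = next;
        n++;
    }
    return n;
}
```

Saving the next pointer before calling free matters: reading curr->next after the node has been freed is undefined behaviour.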
Once the readFASTA function is implemented, the problem itself is very easy: read the file, count the nucleotides at each position, and print the result.
Although the sequences in this problem have equal length, the main function does not assume so: it sizes the profile matrix by the longest sequence.
```c
#include <stdio.h>
#include <stdlib.h>
#include "include/readFASTA.h"

int main(void) {
    SEQ *head, *curr;
    head = readFASTA("../DATA/rosalind_cons.txt");
    if (head == NULL) {
        fprintf(stderr, "cannot read the input file\n");
        return 1;
    }

    // the profile is as wide as the longest sequence
    int width = 0;
    curr = head;
    while ((curr = curr->next) != NULL) {
        if (width < curr->width) {
            width = curr->width;
        }
    }
    if (width == 0) {
        fprintf(stderr, "no sequences found\n");
        return 1;
    }

    int NT_type = 4; // ACGT
    int cons[NT_type][width];
    int r, c;
    for (r = 0; r < NT_type; r++) {
        for (c = 0; c < width; c++) {
            cons[r][c] = 0;
        }
    }

    // count each nucleotide at each position
    int i;
    curr = head;
    while ((curr = curr->next) != NULL) {
        for (i = 0; i < curr->width; i++) {
            switch ((curr->seq)[i]) {
                case 'A':
                    cons[0][i]++;
                    break;
                case 'C':
                    cons[1][i]++;
                    break;
                case 'G':
                    cons[2][i]++;
                    break;
                case 'T':
                    cons[3][i]++;
                    break;
            }
        }
    }

    // consensus: the most common symbol in each column;
    // ties go to the first symbol in ACGT order
    const char NT[] = "ACGT";
    for (c = 0; c < width; c++) {
        int max_r_idx = 0;
        for (r = 1; r < NT_type; r++) {
            if (cons[max_r_idx][c] < cons[r][c]) {
                max_r_idx = r;
            }
        }
        printf("%c", NT[max_r_idx]);
    }
    printf("\n");

    // profile matrix, one row per nucleotide
    for (r = 0; r < NT_type; r++) {
        printf("%c:", NT[r]);
        for (c = 0; c < width; c++) {
            printf(" %d", cons[r][c]);
        }
        printf("\n");
    }

    return 0;
}
```