Nuclear magnetic resonance (NMR) spectroscopy is widely used to determine protein structure [4]. Determining a protein structure by NMR takes three steps. The data generated by an NMR experiment are called resonance peaks. A spin system is a group of resonance peaks that correspond to the same amino acid, the basic unit of a protein sequence. One interesting computational problem is NMR spectral peak assignment, i.e., how to relate spin systems to amino acids.
NMR spectral peak assignment can be viewed as a Constrained Bipartite Matching (CBM) problem. We view each spin system as a vertex in set V and each amino acid as a vertex in set U. Edges between U and V represent potential assignments. The edges may be weighted to express the preference for a particular assignment. The constraint in CBM comes from a biological fact: spin systems from one NMR experiment are known to belong to consecutive amino acids. The goal is to find a maximum matching among all feasible matchings. In a feasible matching, the vertices in V that correspond to spin systems from the same NMR experiment must be matched to vertices in U that correspond to consecutive amino acids.
2-string unweighted CBM
Consider a simplified version of the NMR spectral peak assignment problem. We are given a bipartite graph G = (U, V, E) whose vertices in V are of two types. Any edge incident to a type-1 vertex can be chosen into a matching. Type-2 vertices are adjacent in V, e.g. vj and vj+1, and the edge (ui, vj) can be chosen into a matching iff (ui+1, vj+1) is chosen as well. We call such a group of type-2 vertices a string, because its edges must be selected as a whole. The maximum length of a string is denoted by B, and the problem is called B-string constrained bipartite matching. The traditional maximum matching problem is the 1-string unweighted CBM problem, which can be solved in polynomial time. When B is 2 or larger, the problem becomes hard. From now on, we consider only the unweighted 2-string CBM problem, in which the maximum length of a type-2 string is 2. Fig1 shows an instance of the 2-string unweighted CBM problem.
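The feasibility rule above can be sketched as a small checker. This is only an illustration under assumed conventions that are not in the original text: vertices are integer indices, `matching` maps a V-index to a U-index, `edges` is a set of (u, v) tuples, and `pairs` lists the type-2 strings as (vj, vj+1).

```python
from typing import Dict, Iterable, Set, Tuple


def is_feasible(matching: Dict[int, int],
                edges: Set[Tuple[int, int]],
                pairs: Iterable[Tuple[int, int]]) -> bool:
    """Check a candidate matching against the 2-string constraint.

    matching: v-index -> u-index (hypothetical representation)
    edges:    set of allowed (u, v) edges
    pairs:    type-2 strings, each given as (v_j, v_j+1)
    """
    # Every chosen edge must exist in the graph.
    if any((u, v) not in edges for v, u in matching.items()):
        return False
    # No vertex of U may be used twice.
    if len(set(matching.values())) != len(matching):
        return False
    # A type-2 pair is matched as a whole, to consecutive vertices of U.
    for vj, vk in pairs:
        if (vj in matching) != (vk in matching):
            return False
        if vj in matching and matching[vk] != matching[vj] + 1:
            return False
    return True


edges = {(0, 0), (1, 1), (2, 2), (0, 2)}
pairs = [(0, 1)]
ok = is_feasible({0: 0, 1: 1, 2: 2}, edges, pairs)
half_string = is_feasible({0: 0, 2: 2}, edges, pairs)
```

In this toy instance `ok` is feasible, while `half_string` is rejected because v0 is matched without its partner v1.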
2-string unweighted CBM is MAX SNP-hard
*MAX SNP-hard class and L-reduction
In 1991, Papadimitriou and Yannakakis [5] introduced the class MAX SNP to study how hard optimization problems are to approximate. Every problem in MAX SNP has a constant-factor approximation algorithm, and a MAX SNP-hard problem has no PTAS (Polynomial Time Approximation Scheme) unless P = NP. The same paper introduced the L-reduction (linear reduction), which preserves approximability.
*The L-reduction is defined as follows:
Given two optimization problems A and B, A L-reduces to B if there are polynomial-time algorithms f and g and constants α, β > 0 such that: for any instance a of A, f generates an instance b of B with OPT(b) ≤ α × OPT(a); and for any feasible solution cb of b with cost(cb), g generates a feasible solution ca of a with cost(ca) such that |cost(ca) - OPT(a)| ≤ β × |cost(cb) - OPT(b)|. Consequently, the relative error of ca is at most αβ times the relative error of cb.
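A short derivation shows why an L-reduction carries a PTAS from B back to A. It assumes the standard form of the definition with constants α, β > 0, that is, OPT(b) ≤ α·OPT(a) and |cost(c_a) − OPT(a)| ≤ β·|cost(c_b) − OPT(b)|; the symbols c_a and c_b stand for the solutions ca and cb in the text.

```latex
\[
\frac{\lvert \mathrm{cost}(c_a)-\mathrm{OPT}(a)\rvert}{\mathrm{OPT}(a)}
\;\le\;
\frac{\beta\,\lvert \mathrm{cost}(c_b)-\mathrm{OPT}(b)\rvert}{\mathrm{OPT}(a)}
\;\le\;
\alpha\beta\,
\frac{\lvert \mathrm{cost}(c_b)-\mathrm{OPT}(b)\rvert}{\mathrm{OPT}(b)}
\]
```

The second step uses 1/OPT(a) ≤ α/OPT(b). A solution of b with relative error ε therefore yields a solution of a with relative error at most αβε.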
*Theorem: 2-string unweighted CBM is MAX SNP-hard
To prove that a problem is MAX SNP-hard, one needs to construct an L-reduction from a known MAX SNP-complete problem. Here we construct the L-reduction from MAXIMUM BOUNDED 3-DIMENSIONAL MATCHING (MB3DM), which is MAX SNP-complete. MB3DM is defined as follows: given a universal set H = {1, 2, 3, ..., m} and subsets S1, S2, ..., Sn of H, where each subset Si contains exactly 3 elements of H and each element of H appears in at most 3 subsets, find the maximum number of pairwise disjoint subsets.
*L-reduction from an instance of MB3DM to an instance of CBM:
For a given instance of MB3DM, assume m = 3q and n ≥ q. We construct an instance of 2-string CBM, i.e., a bipartite graph G = (U, V, E), such that approximate solutions of the two instances are related closely enough that a PTAS for CBM yields a PTAS for MB3DM. The construction defines U, V and E as follows:
#Set U consists of 7n vertices: each subset Si of the MB3DM instance corresponds to 7 vertices ai1, ai2, ai3, ai4, ai5, ai6, ai7 in U.
#Set V consists of 3 kinds of vertices.
## Construct q vertices f1, f2, ..., fq in V. We call them f-type vertices.
## For each element i of the universal set H, construct 2 vertices bi1, bi2 in V. We call them b-type vertices.
## For each subset Si, construct 6 vertices ci1, ci2, ci3, ci4, ci5, ci6 in V. We call them c-type vertices.
#Set E is defined according to the type of the vertex in V.
## Connect each f-type vertex fi with aj1 for j = 1 .. n, i.e., with the first vertex of every 7-vertex group in U.
## For each b-type vertex pair bi1bi2 and each subset Sj that contains element i, connect the pair with aj2aj3, aj4aj5 or aj6aj7, depending on whether i is the first, second or third element of Sj.
## Connect each c-type group ci1, ci2, ci3, ci4, ci5, ci6 with aj1, aj2, aj3, aj4, aj5, aj6 respectively for j = 1 .. n, i.e., each group of 6 c-type vertices has edges to the first 6 vertices of every 7-vertex group in U.
The construction gives an instance of CBM with |U| = 7n and |V| = q + 2 × m + 6n = 7q + 6n. The b-type and c-type vertices are type-2 vertices, which carry the string constraint, whereas the f-type vertices are type-1 vertices.
Here is an example of the L-reduction. The instance of MB3DM is given as H = {1,2,3,4}, S1 = {1,2,3} and S2 = {2,3,4}. The corresponding instance of CBM is shown in Fig2.
*Why does the reduction work?
To prove the correctness of the L-reduction, we first relate a feasible solution of MB3DM to a feasible solution of the CBM instance. Assume a feasible solution of MB3DM has size p (p ≤ q), i.e., it consists of p pairwise disjoint subsets. Then the corresponding CBM instance has a feasible matching of size 7p + 6(n-p) = 6n + p. Here is the explanation:
To maximize the matching, one wants to match as many vertices as possible in each 7-vertex group of U. For every subset in the disjoint family, match all 7 vertices of its group: the first with an f-type vertex and the other 6 with the 3 b-type pairs of the elements it contains. For every other subset, use one c-type group to match 6 of the 7 vertices of its group. These remaining subsets cannot reach 7 matches, because the only way to match all 7 vertices of a group is 1 f-type vertex plus 6 b-type vertices, and such a subset may share an element with another subset whose b-type vertices are already used. The f-type and c-type vertices also conflict with each other, because both use the first vertex of a 7-vertex group in U.
For example, in Fig2, pick S1 as the feasible solution of MB3DM. In set U, a11, a12, a13, a14, a15, a16, a17 can be matched with f1, b11, b12, b21, b22, b31, b32 respectively. As for S2, since b21, b22, b31, b32 have already been used by S1, only a21, a22, a23, a24, a25, a26 of its group can be matched, with c21, c22, c23, c24, c25, c26.
Now we need to prove that if CBM has a PTAS, so does MB3DM.
Proof:
Assume MB3DM has a feasible solution of cost p, so that CBM has a feasible solution of cost 6n + p. Denote the OPT of MB3DM by p*; the OPT of CBM is then 6n + p*.
Assume CBM has a PTAS, i.e., for any ε > 0 a feasible solution of cost 6n + p can be found in polynomial time such that the following inequality holds (CBM is a maximization problem):
6n + p ≥ (1-ε) (6n + p*)
p ≥ (1-ε)p* - 6nε
Recall that in an MB3DM instance each element appears in at most 3 subsets. Thus, for a given subset, at most 6 other subsets conflict with it, since each of its 3 elements appears in at most 2 other subsets. So we have:
p* ≥ n/7
because picking a subset and discarding the at most 6 subsets that overlap it keeps at least one subset out of every seven.
Combining n ≤ 7p* with the previous inequality, we get:
p ≥ (1-ε)p* - 42εp* = (1-43ε)p*
Replacing 43ε with ε' gives a PTAS for the MB3DM problem.
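The construction of U, V and E described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `build_cbm`, the string labels such as `a1_4`, and the use of sorted order to decide which element of a subset is "first, second or third" are all assumptions. The parameter q is passed explicitly because the small example from the text (m = 4) does not satisfy m = 3q.

```python
from typing import List, Set, Tuple


def build_cbm(m: int, subsets: List[Tuple[int, int, int]], q: int):
    """Build (U, V, E) of the CBM instance from an MB3DM instance.

    m:       size of the universal set H = {1..m}
    subsets: the 3-element subsets S1..Sn
    q:       number of f-type vertices (hypothetical explicit parameter)
    """
    n = len(subsets)
    # U: one group of 7 vertices a{j}_1 .. a{j}_7 per subset Sj.
    U = [f"a{j}_{k}" for j in range(1, n + 1) for k in range(1, 8)]
    f = [f"f{i}" for i in range(1, q + 1)]
    b = [f"b{i}_{k}" for i in range(1, m + 1) for k in (1, 2)]
    c = [f"c{i}_{k}" for i in range(1, n + 1) for k in range(1, 7)]

    E: Set[Tuple[str, str]] = set()
    # f-type: connect to the first vertex of every group.
    for fi in f:
        for j in range(1, n + 1):
            E.add((f"a{j}_1", fi))
    # b-type: the pair of the pos-th element of Sj goes to
    # (a_j2, a_j3), (a_j4, a_j5) or (a_j6, a_j7).
    for j, S in enumerate(subsets, start=1):
        for pos, i in enumerate(sorted(S)):
            E.add((f"a{j}_{2 + 2 * pos}", f"b{i}_1"))
            E.add((f"a{j}_{3 + 2 * pos}", f"b{i}_2"))
    # c-type: every c group connects to the first 6 vertices of every group.
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            for k in range(1, 7):
                E.add((f"a{j}_{k}", f"c{i}_{k}"))
    return U, f + b + c, E


U, V, E = build_cbm(4, [(1, 2, 3), (2, 3, 4)], q=1)
```

For the example instance this yields 14 vertices in U (7n with n = 2) and 21 vertices in V (1 f-type, 8 b-type, 12 c-type).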
*The 5/3 approximation algorithm
Let C be the better of the outputs of algorithms A and B. The approximation guarantee of C is cost(C) ≥ 3/5 × OPT.
Proof:
Let a be a real number in [0,1], the weight given to algorithm A. We estimate the approximation ratio b of Max(A,B):
Max(A, B) ≥ a × cost(A) + (1-a) × cost(B) ≥ a × (m1* + m2*) + (1-a) × (m1*/3 + 4m2*/3) = (1/3 + 2a/3) × m1* + (4/3 - a/3) × m2*
We want the largest b such that Max(A,B) ≥ b × OPT, where OPT = m1* + 2m2*:
(1/3 + 2a/3) × m1* + (4/3 - a/3) × m2* ≥ b × (m1* + 2m2*)
Choosing a = 2/5 makes the two coefficients 3/5 and 6/5, which gives b = 3/5. Thus the maximum of A and B achieves a 5/3 approximation.
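The choice of the weight a can be checked numerically. The sketch below assumes the two bounds used in the proof, cost(A) ≥ m1* + m2* and cost(B) ≥ m1*/3 + 4m2*/3, and OPT = m1* + 2m2*; the function name `ratio` and the grid search over multiples of 1/100 are illustrative choices only.

```python
from fractions import Fraction


def ratio(a: Fraction) -> Fraction:
    """Guaranteed ratio b for a given weight a of algorithm A.

    The bound is coef_m1 * m1 + coef_m2 * m2 and OPT = m1 + 2 * m2,
    so b is limited by coef_m1 (for m1) and coef_m2 / 2 (for m2).
    """
    coef_m1 = Fraction(1, 3) + Fraction(2, 3) * a
    coef_m2 = Fraction(4, 3) - a / 3
    return min(coef_m1, coef_m2 / 2)


# Search a in {0, 1/100, ..., 1} for the weight with the best guarantee.
best_a = max((Fraction(k, 100) for k in range(101)), key=ratio)
best_b = ratio(best_a)
```

The first coefficient grows with a and the second shrinks, so the best guarantee is reached where they balance, which is at a = 2/5 with b = 3/5.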
Another way to understand the unweighted 2-string CBM problem is to treat it as an interval scheduling problem [1], in which each vertex in U is viewed as a time slot and each vertex in V as a job. More precisely, a type-1 vertex in V is a job that needs 1 time slot, whereas a type-2 vertex pair (a 2-string) is a job that needs 2 consecutive time slots. The goal is to maximize the number of jobs that can be executed without conflicts, where a type-1 job counts as 1 and a type-2 job counts as 2.
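The scheduling view can be made concrete with a brute-force solver for tiny instances. This is an exponential-time sketch for illustration only, not one of the approximation algorithms discussed here; the representation of a job as (length, allowed start slots) and the name `max_scheduled_slots` are assumptions.

```python
from typing import FrozenSet, Sequence, Tuple


def max_scheduled_slots(num_slots: int,
                        jobs: Sequence[Tuple[int, Sequence[int]]]) -> int:
    """Return the maximum number of time slots filled by compatible jobs.

    jobs[k] = (length, allowed_starts), where length is 1 (type-1 job)
    or 2 (type-2 job) and allowed_starts lists permitted first slots.
    """
    def solve(k: int, used: FrozenSet[int]) -> int:
        if k == len(jobs):
            return 0
        length, starts = jobs[k]
        best = solve(k + 1, used)  # option: leave job k unscheduled
        for s in starts:
            slots = set(range(s, s + length))
            # The job must fit in the timeline and avoid occupied slots.
            if s + length <= num_slots and not (slots & used):
                best = max(best, length + solve(k + 1, used | slots))
        return best

    return solve(0, frozenset())


# One 2-slot job and two 1-slot jobs competing for 3 slots.
value = max_scheduled_slots(3, [(2, [0, 1]), (1, [1]), (1, [0])])
```

Here the best schedule places the 2-slot job on slots 1 and 2 and the last 1-slot job on slot 0, so all 3 slots are used.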
Denote the size of the constrained bipartite matching produced by an algorithm A as cost(A) and the maximum constrained bipartite matching as M*. Let m1* be the number of type-1 edges in M* and m2* the number of type-2 vertex pairs (2-strings) in M*. Then |M*| = m1* + 2m2*.
Analysis: Based on the construction, we have |M| ≥ |M'| ≥ m1* + m2*.
Analysis
We need to show that |M| ≥ m1*/3 + 4m2*/3. Denote M* as the OPT of CBM. Define Mi* as follows:
Starts from M*, in the Gi
Each edge in M* that is incident to a type-1 vertex belongs to exactly 1 of M0*, M1*, M2*, and each edge in M* that is incident to a type-2 vertex pair belongs to exactly 2 of M0*, M1*, M2*. The sizes of the three sets therefore add up to m1* + 2 × 2m2* = m1* + 4m2*, so Max(|M0*|, |M1*|, |M2*|) ≥ m1*/3 + 4m2*/3.
A matching Mi' in Gi can be obtained by modifying Mi* in reverse, with the same weight. Again, since Mi is a maximum-weight matching in Gi, the following inequality holds:
|M| ≥ Max(|Mi'|) ≥ m1*/3 + 4m2*/3
[1] Chen ZZ, Jiang T, Lin GH, et al. More reliable protein NMR peak assignment via improved 2-interval scheduling LECTURE NOTES IN COMPUTER SCIENCE 2832: 580-592 2003
[2] Chen ZZ, Jiang T, Lin GH, et al. Approximation algorithms for NMR spectral peak assignment THEOR COMPUT SCI 299 (1-3): 211-229 APR 18 2003
[3] Chen ZZ, Jiang T, Lin GH, et al. Improved approximation algorithms for NMR spectral peak assignment LECT NOTES COMPUT SCI 2452: 82-96 2002
[4] Xu Y, Xu D, Kim D, et al. Automated assignment of backbone NMR peaks using constrained bipartite matching COMPUT SCI ENG 4 (1): 50-62 JAN-FEB 2002
[5] Papadimitriou CH, Yannakakis M. Optimization, Approximation, and Complexity Classes JOURNAL OF COMPUTER AND SYSTEM SCIENCES 43: 425-440 1991