
Approximation algorithms for NMR spectral peak assignment

Problem Origin:

Nuclear magnetic resonance (NMR) spectroscopy is widely used to determine protein structures[4]. Determining a protein structure by NMR takes three steps. The data generated by an NMR experiment are called resonance peaks. A spin system is a group of resonance peaks corresponding to the same amino acid, the basic unit of a protein sequence. One interesting computational problem is NMR spectral peak assignment, i.e. how to map spin systems to amino acids.

NMR spectral peak assignment can be viewed as a Constrained Bipartite Matching (CBM) problem. We view each spin system as a vertex in set V and each amino acid as a vertex in set U. Edges between U and V represent potential assignments. The edges may be weighted to specify the preference for a particular assignment. The constraint of the CBM problem comes from a biological fact: spin systems from one NMR experiment are known to belong to consecutive amino acids. The goal is to find a maximum matching among all feasible matchings. In a feasible matching, the vertices in V corresponding to spin systems from the same NMR experiment must be matched to vertices in U corresponding to consecutive amino acids.

Problem Definition:

Let's consider a simplified version of the NMR spectral peak assignment problem. We are given a bipartite graph G = (U, V, E) in which the vertices of V are of two types. Any edge incident to a type-1 vertex can be chosen into a matching. Type-2 vertices are adjacent in V, e.g. v_j and v_j+1, and the edge (u_i, v_j) can be chosen into a matching iff (u_i+1, v_j+1) is chosen as well. We call a group of type-2 vertices a string, since its edges must be selected as a whole. The maximum length of a string is denoted by B, and the problem is called B-string constrained bipartite matching. The traditional maximum matching problem is the 1-string unweighted CBM problem, which can be solved in polynomial time. But when B is at least 2, the problem becomes hard to tackle. From now on, we only consider the unweighted 2-string CBM problem, in which the maximum length of a type-2 string is 2. Fig1 is an instance of the 2-string unweighted CBM problem.
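
The feasibility condition can be made concrete with a small check. The sketch below is only an illustration: the function name is_feasible, the 0-based indexing and the encoding (a set of (u, v) index pairs for E, a set pair_firsts of indices j such that v_j, v_j+1 form a 2-string, and a dict from V indices to U indices for the matching) are assumptions made for this example, not part of the original formulation.

```python
def is_feasible(edges, pair_firsts, matching):
    """Check a candidate matching for 2-string CBM.

    edges       -- set of (u, v) index pairs (hypothetical encoding of E)
    pair_firsts -- set of j such that (v_j, v_{j+1}) is a 2-string
    matching    -- dict mapping a V index to the U index it is matched with
    """
    # Every chosen edge must exist in E.
    if any((u, v) not in edges for v, u in matching.items()):
        return False
    # No vertex of U may be used twice.
    if len(set(matching.values())) != len(matching):
        return False
    for j in pair_firsts:
        first_used = j in matching
        second_used = (j + 1) in matching
        # A 2-string is matched as a whole or not at all ...
        if first_used != second_used:
            return False
        # ... and it must land on consecutive vertices of U.
        if first_used and matching[j + 1] != matching[j] + 1:
            return False
    return True


edges = {(0, 0), (1, 1), (2, 2), (1, 2)}
print(is_feasible(edges, {0}, {0: 0, 1: 1, 2: 2}))
print(is_feasible(edges, {0}, {0: 0, 2: 2}))
```

In the second call v_0 is matched without its partner v_1, so the constraint is violated.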

Another way to understand the unweighted 2-string CBM problem is to treat it as an interval scheduling problem[1], in which each vertex in U is viewed as a time slot and each vertex in V is viewed as a job. More precisely, type-1 vertices in V are jobs that need 1 time slot, whereas a type-2 vertex pair (2-string) is a job that needs 2 consecutive time slots. The goal is to maximize the number of jobs executed without conflicts, where a type-1 job counts as 1 and a type-2 job counts as 2.
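
For very small instances the scheduling view can be explored by exhaustive search. The following is a minimal sketch under stated assumptions: the name best_schedule and the encoding of a job as (length, allowed_starts) with 0-based slots are hypothetical, and the search is exponential, so it only serves as a reference value for toy inputs.

```python
def best_schedule(num_slots, jobs):
    """Return the maximum total length of jobs that can be scheduled.

    jobs -- list of (length, allowed_starts); length is 1 for a type-1 job
            and 2 for a 2-string, allowed_starts lists permitted first slots.
    """
    def solve(k, used):
        if k == len(jobs):
            return 0
        length, starts = jobs[k]
        best = solve(k + 1, used)  # option: leave job k unscheduled
        for s in starts:
            slots = frozenset(range(s, s + length))
            if s + length <= num_slots and not (slots & used):
                best = max(best, length + solve(k + 1, used | slots))
        return best

    return solve(0, frozenset())


# One 2-slot job and two 1-slot jobs competing for three slots.
print(best_schedule(3, [(1, [0, 1]), (2, [0, 1]), (1, [2])]))
```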

Hardness of Constrained Bipartite Matching:

MAX SNP-HARD Class and L-reduction

In 1991, Papadimitriou and Yannakakis[5] introduced the class MAX SNP to study how hard optimization problems are to approximate. Every problem in MAX SNP has a constant-factor approximation algorithm, and a MAX SNP-hard problem has no PTAS (Polynomial Time Approximation Scheme) unless P = NP. They also introduced the L-reduction (linear reduction), which preserves approximability.

The L-reduction is defined as follows:

Given two optimization problems A and B, A is L-reducible to B if there are constants α, β > 0 and two polynomial-time algorithms f and g such that: for any instance a of A, f generates an instance b of B with OPT(b) ≤ α × OPT(a); and given any feasible solution c_b for b with cost(c_b), g generates a feasible solution c_a for a with cost(c_a) such that the error of c_a is at most a constant multiple of the error of c_b, i.e. |cost(c_a) - OPT(a)| ≤ β × |cost(c_b) - OPT(b)|.
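
To see why an L-reduction carries a PTAS from B back to A, it helps to write the two conditions with explicit constants. The sketch below assumes the standard form of the definition with constants α, β > 0 (symbols introduced here for illustration) and assumes OPT(a), OPT(b) > 0:

```latex
\begin{align*}
\mathrm{OPT}(b) &\le \alpha\,\mathrm{OPT}(a), \\
|\mathrm{cost}(c_a) - \mathrm{OPT}(a)| &\le \beta\,|\mathrm{cost}(c_b) - \mathrm{OPT}(b)|, \\
\frac{|\mathrm{cost}(c_a) - \mathrm{OPT}(a)|}{\mathrm{OPT}(a)}
  &\le \alpha\beta\,\frac{|\mathrm{cost}(c_b) - \mathrm{OPT}(b)|}{\mathrm{OPT}(b)}.
\end{align*}
```

The third line follows by dividing the second by OPT(a) and using 1/OPT(a) ≤ α/OPT(b) from the first. So a solution for b with relative error ε yields a solution for a with relative error at most αβε.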

Theorem: 2-string unweighted CBM is MAX SNP-hard.

To prove the MAX SNP-hardness of a problem, one needs to construct an L-reduction from a known MAX SNP-hard problem. Here we construct an L-reduction from MAXIMUM BOUNDED 3-DIMENSIONAL MATCHING (MB3DM), which is MAX SNP-complete. MB3DM is defined as follows: given a universal set H = {1, 2, 3, ..., m} and subsets S1, S2, ..., Sn of H, where each subset Si contains exactly 3 elements of H and each element of H appears in at most 3 subsets, find the maximum number of pairwise disjoint subsets.

L-reduction from an instance of MB3DM to an instance of CBM:

For a given instance of MB3DM, assume m = 3q and n ≥ q. We construct an instance of 2-string CBM, i.e. a bipartite graph G = (U, V, E), such that the solutions of the two instances are related closely enough that a PTAS for CBM would give a PTAS for MB3DM. Here are the construction details:
To construct an instance of CBM, one needs to define U, V and E of the bipartite graph G.

Set U consists of 7n vertices: each subset Si of the MB3DM instance corresponds to 7 vertices a_i1, a_i2, a_i3, a_i4, a_i5, a_i6, a_i7 in U.
Set V consists of 3 different kinds of vertices.
1. Construct q vertices f₁, f₂, ..., f_q in V. We call them f-type vertices.
2. For each element i in the universal set H of the MB3DM instance, construct 2 vertices b_i1, b_i2 in V. We call them b-type vertices.
3. For each subset Si, construct 6 vertices c_i1, c_i2, c_i3, c_i4, c_i5, c_i6 in V. We call them c-type vertices.
Set E is defined separately for each type of vertex in V.
1. Connect each f-type vertex f_i to a_j1 for j = 1 .. n, i.e. to the first vertex of every 7-vertex group in U.
2. For each element i of H and each subset Sj that contains i, connect the b-type pair b_i1b_i2 to a_j2a_j3, a_j4a_j5 or a_j6a_j7, according to whether i is the first, second or third element of Sj.
3. Connect each c-type group c_i1, c_i2, c_i3, c_i4, c_i5, c_i6 to a_j1, a_j2, a_j3, a_j4, a_j5, a_j6 respectively for j = 1 .. n, i.e. the 6 c-type vertices have edges to the first 6 vertices of every 7-vertex group in U.

The above construction gives an instance of CBM with |U| = 7n and |V| = q + 2m + 6n = 7q + 6n. From the definition of E, the b-type and c-type vertices in V are type-2 vertices, which carry the constraint, whereas the f-type vertices are type-1 vertices. Here is an example of the L-reduction. The instance of MB3DM is given as H = {1,2,3,4}, S1 = {1,2,3} and S2 = {2,3,4}. The corresponding instance of CBM is shown in Fig2:
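
The construction can be sketched in a few lines of code. This is an illustrative sketch only: the function name build_cbm and the tuple labels ("a", j, k), ("f", i), ("b", i, k), ("c", i, k) are hypothetical, the element order inside a subset is assumed to be increasing order, and each c-type group is connected to every 7-vertex group, following the "for j = 1 .. n" wording of the text.

```python
def build_cbm(m, subsets):
    """Build U, V, E of the CBM instance from an MB3DM instance.

    m       -- size of the universal set H = {1..m}, assumed to be 3q
    subsets -- list of 3-element tuples S_1..S_n over H
    """
    n, q = len(subsets), m // 3
    U = [("a", j, k) for j in range(1, n + 1) for k in range(1, 8)]
    V = ([("f", i) for i in range(1, q + 1)]
         + [("b", i, k) for i in range(1, m + 1) for k in (1, 2)]
         + [("c", i, k) for i in range(1, n + 1) for k in range(1, 7)])
    E = set()
    # f-type: every f_i sees the first vertex of every 7-vertex group.
    for i in range(1, q + 1):
        for j in range(1, n + 1):
            E.add((("a", j, 1), ("f", i)))
    # b-type: the pair of the 1st/2nd/3rd element of S_j goes to
    # a_j2a_j3 / a_j4a_j5 / a_j6a_j7 (assumed: elements in sorted order).
    for j, subset in enumerate(subsets, start=1):
        for pos, elem in enumerate(sorted(subset)):
            E.add((("a", j, 2 + 2 * pos), ("b", elem, 1)))
            E.add((("a", j, 3 + 2 * pos), ("b", elem, 2)))
    # c-type: c_ik sees a_jk for k = 1..6 in every group j.
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            for k in range(1, 7):
                E.add((("a", j, k), ("c", i, k)))
    return U, V, E


U, V, E = build_cbm(6, [(1, 2, 3), (2, 3, 4), (4, 5, 6)])
print(len(U), len(V))  # sizes should equal 7n and 7q + 6n
```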

Why does the reduction work?

To prove the correctness of the L-reduction, let's first analyze the relationship between a feasible solution to MB3DM and a feasible solution to the CBM problem.

Assume a feasible solution to MB3DM has size p (p ≤ q), i.e. there are p pairwise disjoint subsets. Then the corresponding CBM instance has a feasible matching of size 7p + 6(n-p) = 6n + p. Here is the explanation:

To maximize the matching, one wants to match as many vertices as possible in each 7-vertex group of U. For every subset in the pairwise disjoint collection, match all 7 of its vertices: the first one with an f-type vertex and the other 6 with the 3 b-type pairs corresponding to the elements it contains. For each remaining subset, use c-type vertices to match 6 of its 7 vertices in U. A remaining subset cannot reach 7 matches, because the only way to get 7 is to use 1 f-type vertex together with 6 b-type vertices, and the subset may share an element with another subset whose b-type pair is then already used. f-type and c-type vertices also conflict with each other, since both use the first vertex of a 7-vertex group of U.

For example, in Fig2, suppose we pick S1 as the feasible solution to MB3DM. In set U, a₁₁, a₁₂, a₁₃, a₁₄, a₁₅, a₁₆, a₁₇ can be matched with f₁, b₁₁, b₁₂, b₂₁, b₂₂, b₃₁, b₃₂ respectively. As for S2, since b₂₁, b₂₂, b₃₁, b₃₂ have already been used by S1, only a₂₁, a₂₂, a₂₃, a₂₄, a₂₅, a₂₆ among its vertices a₂₁, a₂₂, a₂₃, a₂₄, a₂₅, a₂₆, a₂₇ can be matched, with c₂₁, c₂₂, c₂₃, c₂₄, c₂₅, c₂₆ respectively.

Now we need to prove that if CBM has a PTAS, so does MB3DM. Proof:
As shown above, a feasible solution of cost p for MB3DM corresponds to a feasible solution of cost 6n + p for CBM. Denote the OPT of MB3DM as p^*; then the OPT of CBM is 6n + p^*. Assume CBM has a PTAS, i.e. for any ε > 0 we can find a feasible solution of cost 6n + p. Since both problems are maximization problems, the guarantee is a lower bound:
6n + p ≥ (1-ε)(6n + p^*)
p ≥ (1-ε)p^* - 6nε

Recall that in an MB3DM instance each element appears in no more than 3 subsets. Thus, for a given subset, at most 6 other subsets can conflict with it, since each of its 3 elements appears in at most 2 other subsets. So we have:
p^* ≥ n/7, because each subset we pick rules out at most 6 others. Combined with the previous inequality, where 6nε ≤ 42εp^*, we get:
p ≥ (1-43ε)p^*. Replacing 43ε with ε^', we get a PTAS for the MB3DM problem.
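
The bound p^* ≥ n/7 can be illustrated by a greedy selection. This is a minimal sketch, with the hypothetical name greedy_disjoint, that only demonstrates the counting argument; it is not claimed to be part of the original proof.

```python
def greedy_disjoint(subsets):
    """Pick subsets greedily, skipping any that overlaps an earlier pick.

    If every element occurs in at most 3 subsets, each picked subset blocks
    at most 6 others, so at least n/7 subsets are returned.
    """
    chosen, used = [], set()
    for s in subsets:
        if used.isdisjoint(s):
            chosen.append(s)
            used.update(s)
    return chosen


subsets = [(1, 2, 3), (2, 3, 4), (4, 5, 6)]
picked = greedy_disjoint(subsets)
print(picked, 7 * len(picked) >= len(subsets))
```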

The 5/3 approximation algorithm
Since the 2-string CBM problem has no PTAS unless P = NP, we look for a constant-factor approximation algorithm instead. Here we introduce the 5/3 approximation algorithm, which guarantees cost(AppAlg) ≥ 3/5 × OPT. To help in understanding the algorithm, Fig1 is used as an example CBM instance. Fig3 shows the optimal matching of Fig1.

Denote the size of the constrained bipartite matching produced by an algorithm A as cost(A), the maximum constrained bipartite matching as M^*, the number of type-1 edges in M^* as m₁^*, and the number of type-2 vertex-pairs (2-strings) matched in M^* as m₂^*. Obviously |M^*| = m₁^* + 2m₂^*.

Algorithm A with cost(A) ≥ m₁^* + m₂^*

Modify the CBM instance G = (U, V, E) into a new bipartite graph G' = (U, V, E'), where E' is obtained by removing all edges incident to the second vertex of each 2-string vertex pair. Find a maximum matching M' of G'. We have |M'| ≥ m₁^* + m₂^*, because the type-1 edges of M^* together with the first edge of each matched 2-string form a matching in G'.
Obtain a feasible matching M of G from M' by the following steps:
1. Expand M' into TEMP_M: copy edge e into TEMP_M if e is incident to a type-1 vertex; if e = (u_i, v_j) is incident to the first vertex of a 2-string vertex pair v_j, v_{j+1}, expand it into the two edges e = (u_i, v_j) and e' = (u_{i+1}, v_{j+1}).
2. Copy every edge of TEMP_M that conflicts with no other edge of TEMP_M into M.
3. Edges of TEMP_M that conflict with each other come from a run of consecutive edges of M', say (u_i, v_{j_1}), (u_{i+1}, v_{j_2}), ..., (u_{i+h-1}, v_{j_h}), in which (u_i, v_{j_1}) is the first of the two edges incident to a 2-string vertex pair. There are three cases for adding edges into M:
  1. If h is even, add (u_i, v_{j_1})(u_{i+1}, v_{j_1+1}), (u_{i+2}, v_{j_3})(u_{i+3}, v_{j_3+1}), ..., (u_{i+h-2}, v_{j_{h-1}})(u_{i+h-1}, v_{j_{h-1}+1}) to M.
  2. If h is odd and v_{j_h} is a type-1 vertex in V: add (u_i, v_{j_1})(u_{i+1}, v_{j_1+1}), (u_{i+2}, v_{j_3})(u_{i+3}, v_{j_3+1}), ..., (u_{i+h-3}, v_{j_{h-2}})(u_{i+h-2}, v_{j_{h-2}+1}) and (u_{i+h-1}, v_{j_h}) to M.
  3. If h is odd and v_{j_h} is the first vertex of a type-2 vertex-pair in V: add (u_i, v_{j_1})(u_{i+1}, v_{j_1+1}), (u_{i+2}, v_{j_3})(u_{i+3}, v_{j_3+1}), ..., (u_{i+h-1}, v_{j_h})(u_{i+h}, v_{j_h+1}) to M.

Fig 4 shows the modified graph G', the matching M' of G' and the matching M of G.

Analysis: By construction, every run of h conflicting edges of M' contributes at least h edges to M, so |M| ≥ |M'| ≥ m₁^* + m₂^*.
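
Algorithm A can be sketched as follows. This is a hedged illustration, not the authors' implementation. It uses the interval-scheduling encoding (a job is a type-1 vertex or a 2-string, and its start slot is the U vertex matched to its first vertex), so the names algorithm_a, jobs and slot_owner are hypothetical. Kuhn's augmenting-path method is swapped in for the unspecified maximum-matching routine, and a left-to-right scan replaces the explicit three-case analysis; it is intended to select the same edges.

```python
def algorithm_a(num_slots, jobs):
    """Return a feasible schedule {job index: start slot}.

    jobs[k] = (length, starts), length 1 (type-1) or 2 (2-string);
    starts lists the slots where the job's first vertex may be matched.
    """
    slot_owner = {}  # M': start slot -> job matched to it in G'

    def try_assign(k, seen):
        # Kuhn's augmenting-path step for job k.
        length, starts = jobs[k]
        for s in starts:
            if s < 0 or s + length > num_slots or s in seen:
                continue
            seen.add(s)
            if s not in slot_owner or try_assign(slot_owner[s], seen):
                slot_owner[s] = k
                return True
        return False

    for k in range(len(jobs)):
        try_assign(k, set())

    # Expand M' and resolve conflicts: scanning left to right, a chosen
    # 2-string also takes the next slot, so the M' edge starting there
    # (if any) is dropped. Each chosen 2-string drops at most one edge.
    schedule, s = {}, 0
    while s < num_slots:
        if s in slot_owner:
            k = slot_owner[s]
            schedule[k] = s
            s += jobs[k][0]
        else:
            s += 1
    return schedule


# Three 2-strings whose first edges sit on consecutive slots (h = 3).
print(algorithm_a(5, [(2, [0]), (2, [1]), (2, [2])]))
```

In the example M' has 3 edges, and the returned schedule keeps the first and third 2-string, i.e. 4 edges of G.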

Algorithm B with cost(B) ≥ m₁^*/3 + 4m₂^*/3

For the given CBM instance G = (U, V, E), construct 3 different edge-weighted bipartite graphs G₀, G₁, G₂, with G_i = (U_i, V_i, E_i), as follows:
1. V_i is defined as: in set V, merge every type-2 vertex pair v_jv_j+1 into a single super-vertex s_j,j+1.
2. U_i is defined as: in set U, regroup consecutive vertices into triples, where the triple consisting of u_j, u_j+1, u_j+2 is renamed t_j,j+1,j+2. For G_i the triples start from u_i+1. If neither u₁ nor u₂ is grouped, group them as t_1,2. If neither u_n-1 nor u_n is grouped, group them as t_n-1,n.
3. E_i is defined as:
  1. For a type-1 vertex v_h in V_i: if there is an edge between v_h and u_j+1, add a weight-1 edge between v_h and t_j,j+1,j+2 to E_i, provided super-vertex t_j,j+1,j+2 is in U_i. If there is an edge between v_h and u₁ or between v_h and u_n, add a weight-1 edge between v_h and t_1,2 or t_n-1,n respectively, provided that super-vertex is in U_i.
  2. For a type-2 super-vertex s_h,h+1 in V_i: add a weight-2 edge between s_h,h+1 and t_j,j+1,j+2 to E_i if the original graph G has an edge pair either between v_hv_h+1 and u_ju_j+1 or between v_hv_h+1 and u_j+1u_j+2; add a weight-2 edge between s_h,h+1 and t_1,2 or t_n-1,n if G has an edge pair between v_hv_h+1 and u₁u₂ or between v_hv_h+1 and u_n-1u_n respectively.
For each graph G_i, find a maximum-weight matching M_i of G_i.
Expand each M_i into a feasible matching of G, also denoted M_i, by reversing the construction steps.
Take the largest of M₀, M₁ and M₂ as M, the feasible solution of G.

Fig 5 shows the modified graphs G₀, G₁, G₂, the maximum-weight matching M_i of each G_i and the corresponding constrained matching M_i of G.
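
The regrouping of U is the step that drives the analysis, so a small sketch may help. It assumes three groupings indexed by an offset 0, 1, 2, with triples starting at u_{offset+1}, and it keeps a single leftover vertex at either end as its own group. The names regroup and pair_coverage are hypothetical.

```python
def regroup(n, offset):
    """Group u_1..u_n into consecutive triples starting at u_{offset+1}.

    Leftover vertices before the first triple or after the last one form
    a group of size 1 or 2.
    """
    groups = []
    lead = tuple(range(1, min(offset, n) + 1))
    if lead:
        groups.append(lead)
    j = offset + 1
    while j + 2 <= n:
        groups.append((j, j + 1, j + 2))
        j += 3
    if j <= n:
        groups.append(tuple(range(j, n + 1)))
    return groups


def pair_coverage(n):
    """For each consecutive pair (u_j, u_{j+1}), count in how many of the
    three groupings both vertices fall into the same group."""
    return {
        j: sum(any(j in g and j + 1 in g for g in regroup(n, off))
               for off in range(3))
        for j in range(1, n)
    }


for off in range(3):
    print(off, regroup(7, off))
print(pair_coverage(7))
```

Every consecutive pair is kept together in two of the three groupings, which is where the factor 4/3 on m₂^* comes from.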

Analysis
We need to show that |M| ≥ m₁^*/3 + 4m₂^*/3. Denote by M^* the OPT of CBM. For each G_i, define M_i^* starting from M^* as follows:

For each edge of M^* incident to a type-1 vertex in G, say (u_j, v_h):
1. if G_i has the vertex t_j-1,j,j+1, add the weight-1 edge (t_j-1,j,j+1, v_h) to M_i^*.
2. if G_i has the vertex t_j-1,j or t_j,j+1 or u_j, add the weight-1 edge between that vertex and v_h to M_i^*.
For each pair of edges of M^* incident to a type-2 vertex pair in G, say (u_j,v_h),(u_j+1,v_h+1), add the corresponding weight-2 edge to M_i^* if u_j and u_j+1 belong to the same super-vertex in G_i.

Each edge in M^* that is incident to a type-1 vertex belongs to exactly 1 of M₀^*, M₁^*, M₂^*, and each edge pair in M^* that is incident to a type-2 vertex pair belongs to exactly 2 of them. The total weight of M₀^*, M₁^*, M₂^* is therefore m₁^* + 2 × 2m₂^* = m₁^* + 4m₂^*, so Max(M₀^*, M₁^*, M₂^*) ≥ m₁^*/3 + 4m₂^*/3.

A matching M_i' can be obtained from M_i^* by reversing the construction, with the same weight. Since M_i is a maximum-weight matching in G_i, its weight is at least that of M_i^*, so the following inequality holds:

|M| ≥ MAX(M_i') ≥ m₁^*/3 + 4m₂^*/3

The 5/3 approximation algorithm

Denote C as the algorithm that returns the larger of the outputs of A and B. Its approximation guarantee is cost(C) ≥ 3/5 × OPT.
Proof:
Let a be a real number in [0,1], the weight given to algorithm A. We estimate the approximation ratio b of Max(A, B):
Max(A, B) ≥ a × cost(A) + (1-a) × cost(B) ≥ a × (m₁^* + m₂^*) + (1-a) × (m₁^*/3 + 4m₂^*/3) = (1/3 + 2a/3) × m₁^* + (4/3 - a/3) × m₂^*
Recall that we want the approximation ratio b, i.e. a value b such that Max(A, B) ≥ b × OPT, where OPT = m₁^* + 2m₂^*.
Choosing a = 2/5 gives 1/3 + 2a/3 = 3/5 and 4/3 - a/3 = 6/5, so Max(A, B) ≥ 3/5 × m₁^* + 6/5 × m₂^* = 3/5 × (m₁^* + 2m₂^*) = 3/5 × OPT. Thus b = 3/5, and the maximum of A and B achieves a 5/3 approximation.
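
The choice of the weight a can be checked numerically with exact fractions. This is a small sketch; the helper name guaranteed_ratio and the grid of candidate values are assumptions made for the example.

```python
from fractions import Fraction


def guaranteed_ratio(a):
    """Ratio b guaranteed by weight a, using OPT = m1* + 2 m2*."""
    c1 = Fraction(1, 3) + Fraction(2, 3) * a  # coefficient of m1*
    c2 = Fraction(4, 3) - Fraction(1, 3) * a  # coefficient of m2*
    # b must satisfy b <= c1 and 2b <= c2 for all m1*, m2* >= 0.
    return min(c1, c2 / 2)


# Try a = 0, 1/100, ..., 1 and keep the weight with the best guarantee.
best_ratio, best_a = max(
    (guaranteed_ratio(Fraction(k, 100)), Fraction(k, 100)) for k in range(101)
)
print(best_ratio, best_a)
```

The two bounds cross at a = 2/5, which is why neither algorithm alone gives the ratio 3/5.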

References:

[1] Chen ZZ, Jiang T, Lin GH, et al. More reliable protein NMR peak assignment via improved 2-interval scheduling. Lecture Notes in Computer Science 2832: 580-592, 2003.
[2] Chen ZZ, Jiang T, Lin GH, et al. Approximation algorithms for NMR spectral peak assignment. Theoretical Computer Science 299 (1-3): 211-229, Apr 18 2003.
[3] Chen ZZ, Jiang T, Lin GH, et al. Improved approximation algorithms for NMR spectral peak assignment. Lecture Notes in Computer Science 2452: 82-96, 2002.
[4] Xu Y, Xu D, Kim D, et al. Automated assignment of backbone NMR peaks using constrained bipartite matching. Computing in Science & Engineering 4 (1): 50-62, Jan-Feb 2002.
[5] Papadimitriou CH, Yannakakis M. Optimization, approximation, and complexity classes. Journal of Computer and System Sciences 43: 425-440, 1991.