How to Encode Constraints to the Output of Neural Networks
If one shot doesn’t work, try multiple shots
Perhaps… Let’s learn the constraints?
One more interesting perspective: Solving a convex optimization problem
For non-convex problems: Which gradient approximation do you prefer?
Self-promotion time: Projection without solving optimization
A quick example
How does LinSAT work?
Afterword

A summary of available approaches

Image generated by ChatGPT based on this article’s content.

Neural networks are indeed powerful. However, as the application scope of neural networks moves from “standard” classification and regression tasks to more complex decision-making and AI for Science, one drawback is becoming increasingly apparent: the output of neural networks is usually unconstrained, or more precisely, constrained only by simple 0–1 bounds (the Sigmoid activation function), non-negativity constraints (the ReLU activation function), or constraints that sum to 1 (the Softmax activation function). These “standard” activation layers have been used to handle classification and regression problems and have witnessed the vigorous development of deep learning. However, as neural networks began to be widely used for decision-making, optimization solving, and other complex scientific problems, these “standard” activation layers are clearly no longer sufficient. This article briefly discusses the methodologies currently available for adding constraints to the output of neural networks, with some personal insights included. Feel free to critique and discuss any related topics.

[Chinese version (Zhihu)]

If one shot doesn’t work, try multiple shots

If you are familiar with reinforcement learning, you may already know what I am talking about. Applying constraints to an n-dimensional vector seems difficult, but you can break an n-dimensional vector into n outputs. Each time an output is generated, you can manually write code to restrict the action space for the next variable, ensuring its value stays within a feasible domain. This so-called “autoregressive” method has obvious benefits: it is straightforward and can handle a rich variety of constraints (as long as you can write the code). However, its disadvantages are also clear: an n-dimensional vector requires n calls to the network’s forward computation, which is inefficient; furthermore, this method usually needs to be modeled as a Markov Decision Process (MDP) and trained through reinforcement learning, so common challenges in reinforcement learning, such as large action spaces, sparse reward functions, and long training times, are also unavoidable.
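As a toy illustration, the sketch below decodes a selection of exactly k distinct items out of n, one network call per decision; policy_net and the masking scheme are hypothetical placeholders, not from any specific paper:

import torch

def autoregressive_select(policy_net, n, k):
    # Pick k distinct items out of n; feasibility (no repeats) is enforced
    # by masking already-chosen items before each sampling step.
    mask = torch.zeros(n)
    chosen = []
    for _ in range(k):
        scores = policy_net(mask)  # one forward computation per decision
        scores = scores.masked_fill(mask.bool(), float('-inf'))  # restrict the action space
        idx = torch.distributions.Categorical(logits=scores).sample()
        chosen.append(idx.item())
        mask[idx] = 1.0
    return chosen

Note how a k-step decision costs k (up to n) forward passes, which is exactly the inefficiency mentioned above.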

In the domain of solving combinatorial optimization problems with neural networks, the autoregressive method coupled with reinforcement learning was once mainstream, but it is currently being replaced by more efficient methods.

Perhaps… Let’s learn the constraints?

During training, a penalty term can be added to the objective function, representing the degree to which the current neural network output violates the constraints. In the traditional optimization field, the Lagrangian dual method offers a similar trick. Unfortunately, when applied to neural networks, these methods have so far only been proven on some simple constraints, and it is still unclear whether they are applicable to more complex ones. One shortcoming is that, inevitably, some of the model’s capacity is spent learning how to satisfy the constraints, thereby limiting the model’s ability in other directions (such as optimization solving).
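A minimal sketch of the idea, assuming hypothetical linear constraints Ax ≤ b and a penalty weight lam:

import torch

def penalized_loss(x, task_loss, A, b, lam=10.0):
    # Penalize only the positive residuals of A @ x <= b, i.e. the
    # actual constraint violations of the network output x.
    violation = torch.clamp(A @ x - b, min=0)
    return task_loss + lam * violation.sum()

The penalty vanishes on feasible outputs, so the network is merely encouraged, never forced, to satisfy the constraints.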

For instance, Karalias and Loukas, NeurIPS’21 “Erdős Goes Neural: an Unsupervised Learning Framework for Combinatorial Optimization on Graphs” demonstrated that so-called “box constraints”, where variable values lie within [a, b], can be learned through a penalty term, and the network can solve some relatively simple combinatorial optimization problems. However, our further study found that this technique lacks generalization ability: on the training set, the neural network maintains the constraints well; on the test set, the constraints are almost completely lost. Furthermore, although adding a penalty term can in principle apply to any constraint, it cannot handle harder constraints in practice. Our paper Wang et al., ICLR’23 “Towards One-Shot Neural Combinatorial Optimization Solvers: Theoretical and Empirical Notes on the Cardinality-Constrained Case” discusses the above phenomena and presents a theoretical analysis.

On the other hand, the design philosophy of generative models, where outputs must fit a specific distribution, seems better suited to the “learning constraints” approach. Sun and Yang, NeurIPS’23 “DIFUSCO: Graph-based Diffusion Solvers for Combinatorial Optimization” showed that diffusion models can output solutions that satisfy the constraints of the Traveling Salesman Problem (i.e., can output a complete circuit). We further presented Li et al., NeurIPS’23 “T2T: From Distribution Learning in Training to Gradient Search in Testing for Combinatorial Optimization”, where the generative model (diffusion) is responsible for meeting the constraints, while another optimizer provides optimization guidance during the gradual denoising process of diffusion. This strategy performed quite well in experiments, surpassing all previous neural network solvers.

Maybe you are concerned that autoregression is too inefficient, and generative models may not solve your problem. You might be thinking of a neural network that does only one forward pass whose output satisfies the given constraints. Is that possible?

One more interesting perspective: Solving a convex optimization problem

The answer is yes. We can solve a convex optimization problem to project the neural network’s output into a feasible region bounded by convex constraints. This technique exploits the property that a convex optimization problem is differentiable at its KKT conditions, so that the projection step can be regarded as an activation layer, embeddable in an end-to-end neural network. This technique was proposed and promoted by Zico Kolter’s group at CMU, and they currently offer the cvxpylayers package to ease the implementation. The corresponding convex optimization problem is

$$\min_{x} \; -y^\top x \quad \text{s.t.} \quad Ax \le b,$$

where y is the unconstrained neural network output and x is the constrained neural network output. Because the purpose of this step is only a projection, a linear objective function suffices (adding an entropy regularizer is also reasonable). Ax ≤ b represents the linear constraints you need to apply, which could also be quadratic or other convex constraints.
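For illustration, here is a minimal cvxpylayers sketch along these lines; the concrete constraints (each coordinate at most 0.5, coordinates summing to 1) and the entropy weight 0.1 are hypothetical choices:

import cvxpy as cp
import numpy as np
import torch
from cvxpylayers.torch import CvxpyLayer

n = 5
A_np = np.eye(n)             # hypothetical constraints: each x_i <= 0.5
b_np = 0.5 * np.ones(n)

x = cp.Variable(n)
y = cp.Parameter(n)
# Linear projection objective plus a small entropy regularizer;
# cp.entr(x) = -x*log(x) is concave, so subtracting it stays convex.
objective = cp.Minimize(-y @ x - 0.1 * cp.sum(cp.entr(x)))
constraints = [A_np @ x <= b_np, cp.sum(x) == 1]
layer = CvxpyLayer(cp.Problem(objective, constraints),
                   parameters=[y], variables=[x])

y_torch = torch.randn(n, requires_grad=True)  # raw network output
x_proj, = layer(y_torch)                      # constrained output
x_proj.sum().backward()                       # gradients flow through the KKT conditions

The entropy term makes the problem strictly convex, so the projection is unique and its gradients are well behaved.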

It’s a private note: there appear to be some known issues, and evidently this repository has not been updated/maintained for a very long time (04/2024). I would really appreciate it if anyone is willing to research what is happening.

For non-convex problems: Which gradient approximation do you prefer?

Deriving gradients from KKT conditions is theoretically sound, but it cannot tackle non-convex or non-continuous problems. In fact, for non-continuous problems, when changes in the problem parameters cause the solution to jump, the true gradient becomes a delta function (i.e., infinite at the jump), which obviously cannot be used in training neural networks. Fortunately, there are gradient approximation methods that can tackle this issue.

The Georg Martius group at the Max Planck Institute introduced a black-box approximation method, Vlastelica et al., ICLR’20 “Differentiation of Blackbox Combinatorial Solvers”, which views the solver as a black box. It first calls the solver once, then perturbs the problem parameters in a specific direction, and then calls the solver again. The residual between the outputs of the two solver calls serves as the approximate gradient. If this technique is applied to the output of a neural network to enforce constraints, we can define an optimization problem with a linear objective function:

$$\min_{x \in \Omega} \; -y^\top x,$$

where y is the unconstrained neural network output, x is the constrained neural network output, and Ω is the feasible region defined by your constraints. The next step is to implement an algorithm to solve the above problem (not necessarily optimally), after which it can be integrated into the black-box approximation framework. A drawback of the black-box approximation method is that it can only handle linear objective functions; however, a linear objective happens to be all you need if you are just looking for a way to enforce constraints. Furthermore, since it is only a gradient approximation, it may encounter sparse gradients and convergence issues if the hyperparameters are not well tuned.
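Here is a sketch of how this trick can be wrapped as a PyTorch autograd function; solver and lam are placeholders, and the details (e.g., the sign convention) follow my reading of the residual rule above rather than the authors’ official code:

import torch

class BlackboxSolverLayer(torch.autograd.Function):
    # solver: any non-differentiable routine mapping a score vector to a
    # feasible solution of the same shape; lam: perturbation strength.

    @staticmethod
    def forward(ctx, y, solver, lam):
        x = solver(y.detach())          # first solver call
        ctx.save_for_backward(y, x)
        ctx.solver, ctx.lam = solver, lam
        return x

    @staticmethod
    def backward(ctx, grad_output):
        y, x = ctx.saved_tensors
        # Perturb the input along the incoming gradient, call the solver
        # again, and use the residual of the two solutions as the gradient.
        x_perturbed = ctx.solver(y.detach() + ctx.lam * grad_output)
        grad_y = -(x - x_perturbed) / ctx.lam
        return grad_y, None, None

# usage: x = BlackboxSolverLayer.apply(y, my_solver, 20.0)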

Another method for approximating gradients involves adding a large amount of random noise perturbation and repeatedly calling the solver to estimate a gradient, as discussed in Berthet et al., NeurIPS’20 “Learning with Differentiable Perturbed Optimizers”. Theoretically, the gradient obtained this way should be similar to the gradient obtained through the LinSAT method (which will be discussed in the next section), being the gradient of an entropy-regularized linear objective; however, in practice, this method requires a large number of random samples, which is quite impractical (at least for my use cases).
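A Monte Carlo sketch of this idea under Gaussian noise; the function names and the Jacobian estimator below are my illustration, not the paper’s reference code:

import torch

def perturbed_solution(y, solver, n_samples=1000, sigma=0.1):
    # Smooth the solver by averaging its outputs under Gaussian
    # perturbations; for Gaussian noise, d E[x] / dy ~ E[x z^T] / sigma.
    xs = []
    jac = torch.zeros(y.numel(), y.numel())
    for _ in range(n_samples):
        z = torch.randn_like(y)
        x = solver(y + sigma * z)
        xs.append(x)
        jac += torch.outer(x.flatten(), z.flatten()) / (sigma * n_samples)
    return torch.stack(xs).mean(dim=0), jac

The sample count needed for a stable estimate is exactly the practical bottleneck mentioned above.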

Whether it’s deriving gradients from KKT conditions for convex problems or approximating gradients for non-convex methods, each require calling/writing a solver, whereby the CPU-GPU communication could possibly be a bottleneck because most solvers are often designed and implemented for CPUs. Is there a solution to project specific constraints directly on the GPU like an activation layer, without solving optimization problems explicitly?

Self-promotion time: Projection without solving optimization

The answer is yes, and our Wang et al., ICML’23 “LinSATNet: The Positive Linear Satisfiability Neural Networks” presents a viable path and derives the convergence properties of the algorithm. LinSAT stands for Linear SATisfiability Network.

LinSAT can be seen as an activation layer, allowing you to apply general positive linear constraints to the output of a neural network.

Image by author

The LinSAT layer is fully differentiable, and the true gradients are computed by autograd, just like other activation layers. Our implementation currently supports PyTorch.

You can install it by

pip install linsatnet

And start with

from LinSATNet import linsat_layer

A quick example

If you download and run the source code, you will find a simple example. In this example, we apply doubly stochastic constraints to a 3×3 matrix.

To run the example, first clone the repo:

git clone https://github.com/Thinklab-SJTU/LinSATNet.git

Go into the repo and run the example code:

cd LinSATNet
python LinSATNet/linsat.py

In this example, we enforce doubly stochastic constraints on a 3×3 matrix. The doubly stochastic constraint means that all rows and columns of the matrix must sum to 1.

The 3×3 matrix is flattened into a vector, and the following positive linear constraints are considered (in the form Ex = f):

E = torch.tensor(
    [[1, 1, 1, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 1, 1, 1, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, 1, 1, 1],
     [1, 0, 0, 1, 0, 0, 1, 0, 0],
     [0, 1, 0, 0, 1, 0, 0, 1, 0],
     [0, 0, 1, 0, 0, 1, 0, 0, 1]], dtype=torch.float32
)
f = torch.tensor([1, 1, 1, 1, 1, 1], dtype=torch.float32)

We randomly initialize w and regard it as the output of some neural network:

w = torch.rand(9)  # w could be the output of a neural network
w = w.requires_grad_(True)

We also have a “ground-truth target” for the output of linsat_layer, which is the identity matrix in this example:

x_gt = torch.tensor(
    [1, 0, 0,
     0, 1, 0,
     0, 0, 1], dtype=torch.float32
)

The forward/backward passes of LinSAT follow the usual PyTorch style and are readily integrated into existing deep learning pipelines.

The forward pass:

linsat_outp = linsat_layer(w, E=E, f=f, tau=0.1, max_iter=10, dummy_val=0)

The backward pass:

loss = ((linsat_outp - x_gt) ** 2).sum()
loss.backward()

You can also set E as a sparse matrix to improve time and memory efficiency (especially for large-sized input). Here is a quick example (remember to build E in sparse form for the best efficiency):

linsat_outp = linsat_layer(w, E=E.to_sparse(), f=f, tau=0.1, max_iter=10, dummy_val=0)

We can also do gradient-based optimization over w to make the output of linsat_layer closer to x_gt. This is what happens when you train a neural network.

niters = 10
opt = torch.optim.SGD([w], lr=0.1, momentum=0.9)
for i in range(niters):
    x = linsat_layer(w, E=E, f=f, tau=0.1, max_iter=10, dummy_val=0)
    cv = torch.matmul(E, x.t()).t() - f.unsqueeze(0)
    loss = ((x - x_gt) ** 2).sum()
    loss.backward()
    opt.step()
    opt.zero_grad()
    print(f'{i}/{niters}\n'
          f'  underlying obj={torch.sum(w * x)},\n'
          f'  loss={loss},\n'
          f'  sum(constraint violation)={torch.sum(cv[cv > 0])},\n'
          f'  x={x},\n'
          f'  constraint violation={cv}')

And you are likely to see the loss decrease during the training steps.

For full API references, please check out the GitHub repository.

How does LinSAT work?

Warning: tons of math ahead! You can safely skip this part if you are only using LinSAT.

If you want to learn more details and proofs, please refer to the main paper.

Here we introduce the mechanism inside LinSAT. It works by extending the Sinkhorn algorithm to multiple sets of marginals (to the best of our knowledge, we are the first to study Sinkhorn with multiple sets of marginals). The positive linear constraints are then enforced by transforming the constraints into marginals.

Classic Sinkhorn with single-set marginals

Let’s start with the classic Sinkhorn algorithm. Given non-negative rating matrix S with size m×n, and a set of marginal distributions on rows (non-negative vector v with size m) and columns (non-negative vector u with size n), where

the Sinkhorn algorithm outputs a normalized matrix Γ of size m×n with values in [0,1] so that

$$\sum_{j=1}^{n} \Gamma_{ij} u_j = v_i, \qquad \sum_{i=1}^{m} \Gamma_{ij} = 1.$$

Conceptually, Γᵢⱼ means the proportion of uⱼ moved to vᵢ.

The algorithm steps are:
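The iteration was shown as a figure in the original article; below is a compact PyTorch sketch of the alternating normalization under the formulation above (my reconstruction, not the official LinSATNet code):

import torch

def sinkhorn_single(S, u, v, tau=0.1, n_iter=20):
    # S: (m, n) score matrix; u: (n,) column marginals; v: (m,) row
    # marginals, with sum(u) == sum(v). At convergence, the columns of
    # Gamma sum to 1 and the u-weighted row sums approach v.
    Gamma = torch.exp(S / tau)
    for _ in range(n_iter):
        Gamma = Gamma * (v / (Gamma @ u)).unsqueeze(1)  # fit weighted row sums to v
        Gamma = Gamma / Gamma.sum(dim=0, keepdim=True)  # normalize columns to 1
    return Gamma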

Note that the above formulation is modified from the conventional Sinkhorn formulation: Γᵢⱼ uⱼ is equivalent to an element of the “transport” matrix in papers such as (Cuturi 2013). We prefer this new formulation because it generalizes smoothly to Sinkhorn with multi-set marginals in the following.

To make a clearer comparison, the transportation matrix in (Cuturi 2013) is P of size m×n, and the constraints are

$$\sum_{j=1}^{n} P_{ij} = v_i, \qquad \sum_{i=1}^{m} P_{ij} = u_j.$$

Pᵢⱼ means the exact mass moved from uⱼ to vᵢ.

The algorithm steps are the familiar alternating normalizations of the rows and columns of P (see Cuturi 2013 for details).

Extended Sinkhorn with multi-set marginals

We find that the Sinkhorn algorithm can generalize to multiple sets of marginals. Recall that Γᵢⱼ ∈ [0,1] means the proportion of uⱼ moved to vᵢ. Interestingly, it yields the same formulation if we simply replace u, v with another set of marginal distributions, suggesting the potential to extend the Sinkhorn algorithm to multiple sets of marginal distributions. Suppose there are k sets of marginal distributions that are jointly enforced to fit more complicated real-world scenarios. The sets of marginal distributions are

$$(u^{(1)}, v^{(1)}),\ (u^{(2)}, v^{(2)}),\ \dots,\ (u^{(k)}, v^{(k)}),$$

and we’ve got:

It assumes the existence of a normalized Z ∈ [0,1] of size m×n, s.t.

$$\forall s \in \{1, \dots, k\}: \quad \sum_{j=1}^{n} Z_{ij} u_j^{(s)} = v_i^{(s)}, \qquad \sum_{i=1}^{m} Z_{ij} = 1,$$

i.e., the multiple sets of marginal distributions have a non-empty feasible region (you may understand the meaning of “non-empty feasible region” after reading the next section about how to handle positive linear constraints). Multiple sets of marginal distributions can be jointly enforced by traversing the Sinkhorn iterations over the k sets of marginal distributions. The algorithm steps are sketched below.
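A minimal sketch, extending the single-set version above by traversing the k marginal pairs inside each iteration (again my reconstruction; refer to the paper and official code for the exact algorithm):

import torch

def sinkhorn_multi(S, us, vs, tau=0.1, n_iter=20):
    # us, vs: lists of k marginal pairs (u^(s), v^(s)), each pair
    # satisfying sum(u^(s)) == sum(v^(s)). The k sets are enforced
    # jointly by traversing them inside every Sinkhorn iteration.
    Gamma = torch.exp(S / tau)
    for _ in range(n_iter):
        for u, v in zip(us, vs):
            Gamma = Gamma * (v / (Gamma @ u)).unsqueeze(1)
            Gamma = Gamma / Gamma.sum(dim=0, keepdim=True)
    return Gamma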

In our paper, we prove that the Sinkhorn algorithm for multi-set marginals shares the same convergence pattern as the classic Sinkhorn, and that its underlying formulation is also similar to the classic one.

Transforming positive linear constraints into marginals

Next we show how to transform the positive linear constraints into marginals, which are handled by our proposed multi-set Sinkhorn.

Encoding neural network’s output
For an l-length vector denoted as y (which can be the output of a neural network; it is also the input of linsat_layer), the following matrix is built:

$$W = \begin{bmatrix} y_1 & y_2 & \cdots & y_l & \beta \\ 0 & 0 & \cdots & 0 & 0 \end{bmatrix},$$

where W is of size 2 × (l + 1) and β is a dummy variable (the default is β = 0); y is put at the upper-left region of W. An entropic regularizer is then enforced to control discreteness and handle potential negative inputs:

$$S_{ij} = \exp\left(W_{ij} / \tau\right).$$

The score matrix S is taken as the input of Sinkhorn for multi-set marginals.

From linear constraints to marginals

1) Packing constraint Ax ≤ b. Assuming there is only one constraint, we rewrite it as

$$a_1 x_1 + a_2 x_2 + \cdots + a_l x_l \le b.$$

Following the “transportation” view of Sinkhorn, the output x moves at most b units of mass from a₁, a₂, …, aₗ, and the dummy dimension allows the inequality by moving mass from the dummy dimension. It is also ensured that the sum of u equals the sum of v. The marginal distributions are defined as

$$u_a = [a_1, a_2, \dots, a_l, b], \qquad v_a = \left[b,\ \textstyle\sum_{j=1}^{l} a_j\right].$$

2) Covering constraint Cx ≥ d. Assuming there is only one constraint, we rewrite it as

$$c_1 x_1 + c_2 x_2 + \cdots + c_l x_l \ge d.$$

We introduce a multiplier (its exact form is given in the paper) because we always have

$$\sum_{j=1}^{l} c_j \ge d$$

(else the constraint is infeasible), and we cannot reach a feasible solution where all elements in x are 1s without this multiplier. Our formulation ensures that at least d units of mass are moved from c₁, c₂, …, cₗ by x, thus representing the covering constraint of “greater than”. It is also ensured that the sum of u_c equals the sum of v_c; the corresponding marginal distributions are defined analogously in the paper.

3) Equality constraint Ex = f. Representing the equality constraint is more straightforward. Assuming there is only one constraint, we rewrite it as

$$e_1 x_1 + e_2 x_2 + \cdots + e_l x_l = f.$$

The output x moves e₁, e₂, …, eₗ to f, and we need no dummy element in uₑ because it is an equality constraint. It is also ensured that the sum of uₑ equals the sum of vₑ. The marginal distributions are defined as

$$u_e = [e_1, e_2, \dots, e_l, 0], \qquad v_e = \left[f,\ \textstyle\sum_{j=1}^{l} e_j - f\right].$$

After encoding all constraints and stacking them as multiple sets of marginals, we can call the Sinkhorn algorithm for multi-set marginals to enforce the constraints.
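To make the mapping concrete, here is a sketch of how the packing and equality cases above could be turned into (u, v) pairs, following the reconstruction in this section (the helper names are mine, and the official LinSATNet implementation should be preferred in practice):

import torch

def packing_to_marginals(a, b):
    # a^T x <= b (a >= 0): the dummy column receives b, so the first row
    # (the output x) absorbs at most b units of mass from a_1..a_l.
    b = torch.as_tensor(b, dtype=a.dtype)
    u = torch.cat([a, b.unsqueeze(0)])
    v = torch.stack([b, a.sum()])
    return u, v

def equality_to_marginals(e, f):
    # e^T x = f (e >= 0): no dummy mass is needed for an equality.
    f = torch.as_tensor(f, dtype=e.dtype)
    u = torch.cat([e, torch.zeros(1, dtype=e.dtype)])
    v = torch.stack([f, e.sum() - f])
    return u, v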

Experimental Validation of LinSAT

In our ICML paper, we validated the LinSATNet method on routing constraints beyond the general case (used for solving variants of the Traveling Salesman Problem), partial graph matching constraints (used in graph matching where only subsets of graphs match each other), and general linear constraints (used in portfolio optimization with specific preferences). All these problems can be represented with positive linear constraints and handled using the LinSATNet method. In experiments, neural networks are capable of learning how to solve all three problems.

It should be noted that the LinSATNet method can only handle positive linear constraints, meaning that it is unable to handle constraints like x₁ − x₂ ≤ 0 which contain negative terms. Nevertheless, positive linear constraints already cover a huge array of scenarios. For each specific problem, the mathematical modeling is often not unique, and in many cases a reasonable positive linear formulation can be found. In addition to the examples mentioned above, letting the network output organic molecules (represented as graphs, ignoring hydrogen atoms and considering only the skeleton) can incorporate constraints such as C atoms having no more than 4 bonds and O atoms having no more than 2 bonds.

Afterword

Adding constraints to the output of neural networks has a wide range of application scenarios, and several methods are available so far. It is important to note that there is no golden standard to judge their superiority over one another; the best method is usually the one that fits your scenario.

Of course, I recommend trying out LinSATNet! Anyway, it is as simple as an activation layer in your network.

If you found this article helpful, please feel free to cite:

@inproceedings{WangICML23,
  title={LinSATNet: The Positive Linear Satisfiability Neural Networks},
  author={Wang, Runzhong and Zhang, Yunhao and Guo, Ziao and Chen, Tianyi and Yang, Xiaokang and Yan, Junchi},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2023}
}

All the aforementioned content is discussed in this paper.
