Aggregation
A mapping \(\mathcal A: \mathbb R^{m\times n} \to \mathbb R^n\) reducing any matrix \(J \in \mathbb R^{m\times n}\) into its aggregation \(\mathcal A(J) \in \mathbb R^n\) is called an aggregator.
In the context of JD, the matrix to aggregate is a Jacobian whose rows are the gradients of the individual objectives. The aggregator is used to reduce this matrix into an update vector for the parameters of the model.
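To make the definition concrete, here is a minimal sketch of one possible aggregator, the row-wise mean, written with plain Python lists rather than tensors so that it has no dependencies. The function name `mean_aggregator` and the example Jacobian are illustrative assumptions, not part of TorchJD.

```python
def mean_aggregator(jacobian):
    """Average the rows of an m x n matrix, returning a vector of length n."""
    m = len(jacobian)
    # zip(*jacobian) iterates over the columns of the matrix.
    return [sum(column) / m for column in zip(*jacobian)]


# Hypothetical Jacobian: 2 objectives (rows), 3 parameters (columns).
J = [
    [1.0, 2.0, 0.0],   # gradient of objective 1
    [3.0, 0.0, -2.0],  # gradient of objective 2
]

update = mean_aggregator(J)
print(update)  # [2.0, 1.0, -1.0]
```

The result has one entry per parameter, so it can be used as an update vector in the same way a single gradient would be.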
In TorchJD, an aggregator is a class that inherits from the abstract class Aggregator. We provide the following list of aggregators from the literature:
UPGrad (recommended)
Hint
This table is an adaptation of the one available in Jacobian Descent For Multi-Objective Optimization. The paper provides precise justification of the properties in Section 2.2 as well as proofs in Appendix B.
Non-conflicting
An aggregator \(\mathcal A: \mathbb R^{m\times n} \to \mathbb R^n\) is said to be non-conflicting if for any \(J\in\mathbb R^{m\times n}\), \(J\cdot\mathcal A(J)\) is a vector with only non-negative elements.
In other words, \(\mathcal A\) is non-conflicting whenever the aggregation of any matrix has non-negative inner product with all rows of that matrix. In the context of JD, this ensures that, to first order, a small enough update step does not increase any objective.
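The condition can be checked numerically for a given matrix by computing \(J\cdot\mathcal A(J)\) and inspecting its signs. The sketch below does this for a simple row-wise mean, used here only as an illustration; all function names and matrices are hypothetical. Passing the check on one matrix does not prove the property, but a single failing matrix is enough to disprove it.

```python
def mean_aggregator(jacobian):
    """Average the rows of an m x n matrix."""
    m = len(jacobian)
    return [sum(column) / m for column in zip(*jacobian)]


def matvec(matrix, vector):
    """Compute the matrix-vector product, one inner product per row."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def is_non_conflicting_on(aggregator, jacobian, tol=1e-12):
    """Check that J . A(J) has only non-negative entries for this one J."""
    products = matvec(jacobian, aggregator(jacobian))
    return all(p >= -tol for p in products)


# The two gradients point in conflicting directions along the first parameter.
J_conflict = [[1.0, 0.0], [-3.0, 1.0]]
# The two gradients roughly agree.
J_aligned = [[1.0, 0.0], [1.0, 1.0]]

print(matvec(J_conflict, mean_aggregator(J_conflict)))  # [-1.0, 3.5]
print(is_non_conflicting_on(mean_aggregator, J_conflict))  # False
print(is_non_conflicting_on(mean_aggregator, J_aligned))   # True
```

In the first case the mean has a negative inner product with the first gradient, so a step along it would, to first order, increase the first objective. This shows that the plain mean is not a non-conflicting aggregator.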
Linear under scaling
An aggregator \(\mathcal A: \mathbb R^{m\times n} \to \mathbb R^n\) is said to be linear under scaling if for any \(J\in\mathbb R^{m\times n}\), the mapping from any positive \(c\in\mathbb R^{m}\) to \(\mathcal A(\operatorname{diag}(c)\cdot J)\) is linear in \(c\).
In other words, \(\mathcal A\) is linear under scaling whenever scaling a row of the matrix to aggregate scales its influence proportionally. In the context of JD, this ensures that even when the gradient norms are imbalanced, each gradient will contribute to the update proportionally to its norm.
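As an illustration, the row-wise mean satisfies this property, since \(\mathcal A(\operatorname{diag}(c)\cdot J) = \frac{1}{m} J^\top c\). The sketch below spot-checks additivity and homogeneity in \(c\) on one hypothetical matrix; the helper names are assumptions made for this example, and a spot check on a few vectors is a sanity test rather than a proof.

```python
def mean_aggregator(jacobian):
    """Average the rows of an m x n matrix."""
    m = len(jacobian)
    return [sum(column) / m for column in zip(*jacobian)]


def scale_rows(c, jacobian):
    """Compute diag(c) . J, i.e. multiply row i of J by c[i]."""
    return [[ci * x for x in row] for ci, row in zip(c, jacobian)]


J = [[1.0, 2.0], [3.0, -4.0]]
c1 = [1.0, 2.0]  # one positive scaling per row of J, so c has length m
c2 = [3.0, 0.5]

# Additivity: A(diag(c1 + c2) J) == A(diag(c1) J) + A(diag(c2) J)
c_sum = [a + b for a, b in zip(c1, c2)]
lhs_add = mean_aggregator(scale_rows(c_sum, J))
rhs_add = [
    x + y
    for x, y in zip(
        mean_aggregator(scale_rows(c1, J)),
        mean_aggregator(scale_rows(c2, J)),
    )
]

# Homogeneity: A(diag(2 * c1) J) == 2 * A(diag(c1) J)
lhs_hom = mean_aggregator(scale_rows([2.0 * a for a in c1], J))
rhs_hom = [2.0 * x for x in mean_aggregator(scale_rows(c1, J))]

print(lhs_add, rhs_add)  # [5.75, -1.0] [5.75, -1.0]
print(lhs_hom, rhs_hom)  # [7.0, -6.0] [7.0, -6.0]
```

The values were chosen so that all intermediate results are exactly representable in floating point; with arbitrary inputs, a tolerance-based comparison would be more appropriate.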
Weighted
An aggregator \(\mathcal A: \mathbb R^{m\times n} \to \mathbb R^n\) is said to be weighted if for any \(J\in\mathbb R^{m\times n}\), there exists a weight vector \(w\in\mathbb R^m\) such that \(\mathcal A(J)=J^\top w\).
In other words, \(\mathcal A\) is weighted whenever the aggregation of any matrix is always in the span of the rows of that matrix. This ensures a higher precision of the Taylor approximation that JD relies on.
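For example, the row-wise mean is weighted, with the constant weight vector \(w = (\frac{1}{m}, \dots, \frac{1}{m})\). The following sketch verifies \(\mathcal A(J) = J^\top w\) on a hypothetical matrix; the function names are illustrative assumptions.

```python
def mean_aggregator(jacobian):
    """Average the rows of an m x n matrix."""
    m = len(jacobian)
    return [sum(column) / m for column in zip(*jacobian)]


def transpose_matvec(jacobian, weights):
    """Compute J^T w, i.e. the combination of the rows of J weighted by w."""
    return [
        sum(w * x for w, x in zip(weights, column))
        for column in zip(*jacobian)
    ]


J = [
    [1.0, 2.0, 0.0],
    [3.0, 0.0, -2.0],
]
m = len(J)
w = [1.0 / m] * m  # uniform weights, one per row of J

combined = transpose_matvec(J, w)
print(combined)            # [2.0, 1.0, -1.0]
print(mean_aggregator(J))  # [2.0, 1.0, -1.0]
```

Because the output is a linear combination of the rows of \(J\), it lies in their span by construction.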