Representation Change in Model-Agnostic Meta-Learning

Last year, an exciting adaptation of one of the most popular optimization-based meta-learning approaches, model-agnostic meta-learning (MAML) [Finn et al., 2017], was proposed in

   ▶  Jaehoon Oh, Hyungjun Yoo, ChangHwan Kim, Se-Young Yun (ICLR, 2021) BOIL: Towards Representation Change for Few-shot Learning

The authors adapt MAML by freezing the last layer to force body-only inner-loop learning (BOIL). Interestingly, this is complementary to ANIL (almost no inner loop), proposed in

   ▶  Aniruddh Raghu, Maithra Raghu, Samy Bengio, Oriol Vinyals (ICLR, 2020) Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML

Both papers attempt to understand the success of MAML and to improve it. Oh et al. [2021] compare BOIL, ANIL, and MAML and show that both variants improve on MAML's performance. However, BOIL outperforms ANIL, especially when the task distribution shifts between training and testing.

MAML

Before studying BOIL and ANIL, it is worth recalling how MAML works, as it forms the basis of both algorithms. MAML uses second-order gradient information to learn an initialization across tasks drawn from the same distribution. The optimization consists of two nested loops (bi-level optimization), with the meta-optimization happening in the outer loop. The overall objective can be expressed as:

\[\begin{equation}\label{equ:outer} \theta^* := \underset{\theta \in \Theta}{\mathrm{argmin}} \frac{1}{M} \sum_{i=1}^M \mathcal{L}(in(\theta, \mathcal{D}_i^{tr}), \mathcal{D}_i^{test}), \end{equation}\]

where $M$ is the number of tasks in a batch, and $\mathcal{D}_i^{tr}$ and $\mathcal{D}_i^{test}$ are the training and test sets of task $i$. The function $\mathcal{L}$ is the task loss, and the function $in(\theta, \mathcal{D}_i^{tr})$ represents the inner loop. For every task $i$ in a batch, the neural network is initialized with $\theta$. In the inner loop, this value is optimized for one or a few steps of gradient descent on the training set $\mathcal{D}_i^{tr}$ to obtain the fine-tuned task parameters $\phi_i$. Taking only one training step in the inner loop, for example, the task parameters are given by

\[\begin{equation} \phi_i \equiv in(\theta, \mathcal{D}_i^{tr}) = \theta - \alpha \nabla_{\theta} \mathcal{L}(\theta, \mathcal{D}_i^{tr}). \end{equation}\]
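
To make this update concrete, here is a minimal PyTorch sketch of the inner loop; the names `model_fn`, `loss_fn`, and `alpha` are illustrative assumptions, not from the original papers. It performs exactly the single gradient step from the equation above, keeping the computation graph so the outer loop can later differentiate through the update.

```python
import torch

def inner_loop(theta, train_x, train_y, model_fn, loss_fn, alpha=0.01):
    """One inner-loop step: phi_i = theta - alpha * grad_theta L(theta, D_i^tr).

    theta    -- list of parameter tensors (the meta-initialization)
    model_fn -- functional forward pass: model_fn(params, inputs) -> outputs
    """
    loss = loss_fn(model_fn(theta, train_x), train_y)
    # create_graph=True retains the graph of this update, so the outer
    # loop can take second-order gradients through it, as MAML requires.
    grads = torch.autograd.grad(loss, theta, create_graph=True)
    return [t - alpha * g for t, g in zip(theta, grads)]
```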

The meta-parameters $\theta$ are then updated with respect to the average loss of each task's fine-tuned parameters $\phi_i$ on the test set $\mathcal{D}_i^{test}$. Thus, MAML optimizes with respect to the loss after fine-tuning, which yields far superior performance compared to simple pre-training, as outlined in the original publication. Many adaptations of MAML improve its learning speed or performance, or extend it to new tasks and task distributions. A deeper introduction and an interactive comparison of some variants can be found in Müller et al. [2021].
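
The outer loop can then be sketched as follows, reusing the hypothetical `inner_loop` from above; `meta_opt` is assumed to be a standard optimizer (e.g. SGD or Adam) over the meta-parameters `theta`.

```python
def maml_step(theta, task_batch, model_fn, loss_fn, meta_opt, alpha=0.01):
    """One meta-update: minimize the average test loss of the fine-tuned
    task parameters phi_i, as in the objective above."""
    meta_opt.zero_grad()
    meta_loss = 0.0
    for train_x, train_y, test_x, test_y in task_batch:
        # Fine-tune on the task's training set to obtain phi_i ...
        phi = inner_loop(theta, train_x, train_y, model_fn, loss_fn, alpha)
        # ... and evaluate the fine-tuned parameters on the task's test set.
        meta_loss = meta_loss + loss_fn(model_fn(phi, test_x), test_y)
    meta_loss = meta_loss / len(task_batch)
    meta_loss.backward()  # backpropagates through the inner-loop updates
    meta_opt.step()       # update the meta-parameters theta
    return meta_loss.item()
```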

Freezing layers

In the standard version of MAML and most of its variants, all parameters of the meta-optimized model are updated in the inner loop when the model is applied to a few-shot learning task. However, Raghu et al. [2020] discovered that during fine-tuning in the inner loop, the representations of the network body (the convolutional layers before the fully connected head) hardly change (see the section on representation similarity analysis). The authors therefore propose to skip updating the network body altogether, saving a significant amount of computation, as the expensive second-order updates are then only required for the network head. Additionally, they observe regularization effects that further improve the model's performance. Their empirical results confirm a slight increase in performance, along with a speed-up by a factor of $1.7$ during training and $4.1$ during inference. They conclude that MAML reuses features rather than learning them rapidly. Here, feature reuse is attributed to layers whose performance does not rely on a change of representation in the inner loop (which, according to the authors, goes along with small changes in the layers' weights). Rapid learning can therefore be found only in the head, where substantial change happens during fine-tuning.
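
In code, both ANIL and BOIL amount to restricting which parameters receive the inner-loop update, while the outer loop still meta-learns everything. A minimal sketch in the same style as the hypothetical `inner_loop` above (again, all names are illustrative assumptions):

```python
def partial_inner_loop(body, head, train_x, train_y, model_fn, loss_fn,
                       alpha=0.01, mode="anil"):
    """Inner-loop step that adapts only part of the network.

    mode='anil': only the head is updated (body frozen in the inner loop),
    mode='boil': only the body is updated (head frozen in the inner loop).
    All parameters are still updated by the outer loop either way.
    """
    params = body + head
    loss = loss_fn(model_fn(params, train_x), train_y)
    adapted = head if mode == "anil" else body
    grads = torch.autograd.grad(loss, adapted, create_graph=True)
    updated = [p - alpha * g for p, g in zip(adapted, grads)]
    if mode == "anil":
        return body + updated   # frozen body, fine-tuned head
    return updated + head       # fine-tuned body, frozen head
```

Note that in the ANIL case the inner loop differentiates only with respect to the head parameters, which is where the reported speed-up comes from: the second-order terms for the (much larger) body are never computed.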