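Tutorial 1: Optimization techniques

Week 1, Day 4: Optimization

By Neuromatch Academy

Content creators: Jose Gallego-Posada, Ioannis Mitliagkas

Content reviewers: Piyush Chauhan, Vladimir Haltakov, Siwei Bai, Kelson Shilling-Scrivo

Content editors: Charles J Edelson, Gagana B, Spiros Chavlis

Production editors: Arush Tagade, R. Krishnakumaran, Gagana B, Spiros Chavlis

Tutorial Objectives

Objectives:

Necessity and importance of optimization
Introduction to commonly used optimization techniques
Optimization in non-convex loss landscapes
'Adaptive' hyperparameter tuning
Ethical concerns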

# @title Tutorial slides
from IPython.display import IFrame
link_id = "ft2sz"
print(f"If you want to download the slides: https://osf.io/download/{link_id}/")
IFrame(src=f"https://mfr.ca-1.osf.io/render?url=https://osf.io/{link_id}/?direct%26mode=render%26action=download%26mode=render", width=854, height=480)

Setup

# @title Install and import feedback gadget


from vibecheck import DatatopsContentReviewContainer
def content_review(notebook_section: str):
    return DatatopsContentReviewContainer(
        "",  # No text prompt
        notebook_section,
        {
            "url": "https://pmyvdlilci.execute-api.us-east-1.amazonaws.com/klab",
            "name": "neuromatch_dl",
            "user_key": "f379rz8y",
        },
    ).render()


feedback_prefix = "W1D4_T1"

# Imports
import copy

import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np

import time
import torch
import torchvision
import torchvision.datasets as datasets
import torch.nn.functional as F
import torch.nn as nn
import torch.optim as optim
from tqdm.auto import tqdm

# @title Figure settings
import logging
logging.getLogger('matplotlib.font_manager').disabled = True

import ipywidgets as widgets  # interactive display
%config InlineBackend.figure_format = 'retina'
plt.style.use("https://raw.githubusercontent.com/NeuromatchAcademy/content-creation/main/nma.mplstyle")
plt.rc('axes', unicode_minus=False)

# @title Helper functions
def print_params(model):
  """
  Lists the name and current value of the model's
  named parameters

  Args:
    model: an nn.Module inherited model
      Represents the ML/DL model

  Returns:
    Nothing
  """
  for name, param in model.named_parameters():
    if param.requires_grad:
      print(name, param.data)

# @title Set random seed

# @markdown Executing `set_seed(seed=seed)` you are setting the seed

# For DL it's critical to set the random seed so that students can have a
# baseline to compare their results to expected results.
# Read more here: https://pytorch.org/docs/stable/notes/randomness.html

# Call the `set_seed` function in the exercises to ensure reproducibility.
import random
import torch

def set_seed(seed=None, seed_torch=True):
  """
  Handles variability by controlling sources of randomness
  through set seed values

  Args:
    seed: Integer
      Set the seed value to given integer.
      If no seed, set seed value to random integer in the range 2^32
    seed_torch: Bool
      Seeds the random number generator for all devices to
      offer some guarantees on reproducibility

  Returns:
    Nothing
  """
  if seed is None:
    seed = np.random.choice(2 ** 32)
  random.seed(seed)
  np.random.seed(seed)
  if seed_torch:
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
  print(f'Random seed {seed} has been set.')


# In case that `DataLoader` is used
def seed_worker(worker_id):
  """
  DataLoader will reseed workers following randomness in
  multi-process data loading algorithm.

  Args:
    worker_id: integer
      ID of subprocess to seed. 0 means that
      the data will be loaded in the main process
      Refer: https://pytorch.org/docs/stable/data.html#data-loading-randomness for more details

  Returns:
    Nothing
  """
  worker_seed = torch.initial_seed() % 2**32
  np.random.seed(worker_seed)
  random.seed(worker_seed)

# @title Set device (GPU or CPU). Execute `set_device()`
# especially if torch modules are used.

# inform the user if the notebook uses GPU or CPU.

def set_device():
  """
  Set the device. CUDA if available, CPU otherwise

  Args:
    None

  Returns:
    Nothing
  """
  device = "cuda" if torch.cuda.is_available() else "cpu"
  if device != "cuda":
    print("WARNING: For this notebook to perform best, "
        "if possible, in the menu under `Runtime` -> "
        "`Change runtime type.`  select `GPU` ")
  else:
    print("GPU is enabled in this notebook.")

  return device

SEED = 2021
set_seed(seed=SEED)
DEVICE = set_device()

Section 1. Introduction

Time estimate: ~15 mins

# @title Video 1: Introduction
from ipywidgets import widgets
from IPython.display import YouTubeVideo
from IPython.display import IFrame
from IPython.display import display


class PlayVideo(IFrame):
  def __init__(self, id, source, page=1, width=400, height=300, **kwargs):
    self.id = id
    if source == 'Bilibili':
      src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'
    elif source == 'Osf':
      src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'
    super(PlayVideo, self).__init__(src, width, height, **kwargs)


def display_videos(video_ids, W=400, H=300, fs=1):
  tab_contents = []
  for i, video_id in enumerate(video_ids):
    out = widgets.Output()
    with out:
      if video_ids[i][0] == 'Youtube':
        video = YouTubeVideo(id=video_ids[i][1], width=W,
                             height=H, fs=fs, rel=0)
        print(f'Video available at https://youtube.com/watch?v={video.id}')
      else:
        video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,
                          height=H, fs=fs, autoplay=False)
        if video_ids[i][0] == 'Bilibili':
          print(f'Video available at https://www.bilibili.com/video/{video.id}')
        elif video_ids[i][0] == 'Osf':
          print(f'Video available at https://osf.io/{video.id}')
      display(video)
    tab_contents.append(out)
  return tab_contents


video_ids = [('Youtube', 'zm9oekdkJbQ'), ('Bilibili', 'BV1VB4y1K7Vr')]
tab_contents = display_videos(video_ids, W=854, H=480)
tabs = widgets.Tab()
tabs.children = tab_contents
for i in range(len(tab_contents)):
  tabs.set_title(i, video_ids[i][0])
display(tabs)

# @title Submit your feedback
content_review(f"{feedback_prefix}_Introduction_Video")

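Discussion: Unexpected consequences

Can you think of examples from your own experience or life where poorly chosen incentives or objectives have led to unexpected consequences?

Click for solution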

# @title Submit your feedback
content_review(f"{feedback_prefix}_Unexpected_consequences_Discussion")

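Section 2: Case study: successfully training an MLP for image classification

Time estimate: ~40 mins

Many of the core ideas (and tricks) in modern optimization for deep learning can be illustrated in the simple setting of training an MLP to solve an image classification task. In this tutorial we will guide you through the key challenges that arise when optimizing high-dimensional, non-convex $^\dagger$ problems. We will use these challenges to motivate and explain some commonly used solutions.

Disclaimer: Some of the functions you will code in this tutorial are already implemented in PyTorch and many other libraries. For pedagogical reasons, we decided to bring these simple coding tasks into the spotlight and to place a relatively higher emphasis on your understanding of the algorithms than on the use of a specific library.

In your day-to-day research projects you will likely rely on community-vetted, optimized libraries rather than the 'manual implementations' you will write today. In Section 8 you will have a chance to use the full power of PyTorch to tune the parameters of an MLP that classifies handwritten digits.

$^\dagger$ : For a strictly convex function, any local minimum is also the global minimum. This is a very nice property for optimization, since the algorithm cannot get stuck in a local minimum that is not a global one (e.g., $f(x)=x^2 + 2x + 1$ ). A non-convex function, by contrast, is wavy: it has several 'valleys' (local minima) that are shallower than the overall deepest valley (the global minimum). As a result, optimization algorithms can get stuck in a local minimum, and it can be hard to tell when this has happened (e.g., $f(x) = x^4 + x^3 - 2x^2 - 2x$ ). See also Section 5 for more details.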
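To make the footnote concrete, here is a minimal sketch of plain gradient descent on the non-convex example $f(x) = x^4 + x^3 - 2x^2 - 2x$. The learning rate, the number of steps and the two starting points are assumptions chosen for illustration; they are not part of the tutorial.

```python
# Minimal sketch: gradient descent on the non-convex footnote example.
# lr, steps and the starting points are illustrative assumptions.

def f(x):
  return x**4 + x**3 - 2 * x**2 - 2 * x

def df(x):
  # Analytical derivative of f
  return 4 * x**3 + 3 * x**2 - 4 * x - 2

def descend(x0, lr=0.01, steps=500):
  # Repeatedly step against the derivative, starting from x0
  x = x0
  for _ in range(steps):
    x -= lr * df(x)
  return x

x_left = descend(-2.0)   # starts to the left of the central 'bump'
x_right = descend(2.0)   # starts to the right of the central 'bump'

print(f"start -2.0 -> x = {x_left:.3f}, f(x) = {f(x_left):.3f}")
print(f"start +2.0 -> x = {x_right:.3f}, f(x) = {f(x_right):.3f}")
```

Both runs stop at a point where the derivative vanishes, but the run started on the left settles in the shallower valley (near $x \approx -1.2$) while the run started on the right reaches the deeper one (near $x \approx 0.9$). Nothing in the local gradient information tells the first run that a better minimum exists.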

# @title Video 2: Case Study - MLP Classification
from ipywidgets import widgets
from IPython.display import YouTubeVideo
from IPython.display import IFrame
from IPython.display import display


class PlayVideo(IFrame):
  def __init__(self, id, source, page=1, width=400, height=300, **kwargs):
    self.id = id
    if source == 'Bilibili':
      src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'
    elif source == 'Osf':
      src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'
    super(PlayVideo, self).__init__(src, width, height, **kwargs)


def display_videos(video_ids, W=400, H=300, fs=1):
  tab_contents = []
  for i, video_id in enumerate(video_ids):
    out = widgets.Output()
    with out:
      if video_ids[i][0] == 'Youtube':
        video = YouTubeVideo(id=video_ids[i][1], width=W,
                             height=H, fs=fs, rel=0)
        print(f'Video available at https://youtube.com/watch?v={video.id}')
      else:
        video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,
                          height=H, fs=fs, autoplay=False)
        if video_ids[i][0] == 'Bilibili':
          print(f'Video available at https://www.bilibili.com/video/{video.id}')
        elif video_ids[i][0] == 'Osf':
          print(f'Video available at https://osf.io/{video.id}')
      display(video)
    tab_contents.append(out)
  return tab_contents


video_ids = [('Youtube', 'pJc2ENhYbqA'), ('Bilibili', 'BV1GB4y1K7Ha')]
tab_contents = display_videos(video_ids, W=854, H=480)
tabs = widgets.Tab()
tabs.children = tab_contents
for i in range(len(tab_contents)):
  tabs.set_title(i, video_ids[i][0])
display(tabs)

# @title Submit your feedback
content_review(f"{feedback_prefix}_Case_study_MLP_classification_Video")

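Section 2.1: Data

We will use the MNIST dataset of handwritten digits. We load the data via the PyTorch datasets module, as you learned in W1D1.

Note: Although we can download the MNIST dataset directly from datasets using the optional argument download=True, we will download it from the NMA directory on OSF to ensure network reliability.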

# @title Download MNIST dataset
import tarfile, requests, os

fname = 'MNIST.tar.gz'
name = 'MNIST'
url = 'https://osf.io/y2fj6/download'

if not os.path.exists(name):
  print('\nDownloading MNIST dataset...')
  r = requests.get(url, allow_redirects=True)
  with open(fname, 'wb') as fh:
    fh.write(r.content)
  print('\nDownloading MNIST completed.')

if not os.path.exists(name):
  with tarfile.open(fname) as tar:
    tar.extractall()
  # Remove the archive only after the file handle has been closed
  os.remove(fname)
else:
  print('MNIST dataset has been downloaded.')

def load_mnist_data(change_tensors=False, download=False):
  """
  Load training and test examples for the MNIST handwritten digits dataset
  with every image: 28*28 x 1 channel (greyscale image)

  Args:
    change_tensors: Bool
      Argument to check if tensors need to be normalised
    download: Bool
      Argument to check if dataset needs to be downloaded/already exists

  Returns:
    train_set:
      train_data: Tensor
        training input tensor of size (train_size x 784)
      train_target: Tensor
        training 0-9 integer label tensor of size (train_size)
    test_set:
      test_data: Tensor
        test input tensor of size (test_size x 784)
      test_target: Tensor
        test 0-9 integer label tensor of size (test_size)
  """
  # Load train and test sets
  train_set = datasets.MNIST(root='.', train=True, download=download,
                             transform=torchvision.transforms.ToTensor())
  test_set = datasets.MNIST(root='.', train=False, download=download,
                            transform=torchvision.transforms.ToTensor())

  # Original data is in range [0, 255]. We normalize the data wrt its mean and std_dev.
  # Note that we only used *training set* information to compute mean and std
  mean = train_set.data.float().mean()
  std = train_set.data.float().std()

  if change_tensors:
    # Apply normalization directly to the tensors containing the dataset
    train_set.data = (train_set.data.float() - mean) / std
    test_set.data = (test_set.data.float() - mean) / std
  else:
    tform = torchvision.transforms.Compose([torchvision.transforms.ToTensor(),
                                            torchvision.transforms.Normalize(mean=[mean / 255.], std=[std / 255.])
                                            ])
    train_set = datasets.MNIST(root='.', train=True, download=download,
                               transform=tform)
    test_set = datasets.MNIST(root='.', train=False, download=download,
                              transform=tform)

  return train_set, test_set


train_set, test_set = load_mnist_data(change_tensors=True)

As we are just getting started, we will concentrate on a small subset of only 500 examples out of the total 60,000 training examples.

# Sample a random subset of 500 indices
subset_index = np.random.choice(len(train_set.data), 500)

# We will use these symbols to represent the training data and labels, to stay
# as close to the mathematical expressions as possible.
X, y = train_set.data[subset_index, :], train_set.targets[subset_index]

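Run the following cell to visualize the content of three examples from our training set. Note how the preprocessing we applied changes the range of pixel values after normalization.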

# @title Run me!

# Exploratory data analysis and visualisation

num_figures = 3
fig, axs = plt.subplots(1, num_figures, figsize=(5 * num_figures, 5))

for sample_id, ax in enumerate(axs):
  # Plot the pixel values for each image
  ax.matshow(X[sample_id, :], cmap='gray_r')
  # 'Write' the pixel value in the corresponding location
  for (i, j), z in np.ndenumerate(X[sample_id, :]):
    text = '{:.1f}'.format(z)
    ax.text(j, i, text, ha='center',
            va='center', fontsize=6, c='steelblue')

  ax.set_title('Label: ' + str(y[sample_id].item()))
  ax.axis('off')

plt.show()

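Section 2.2: Model

As you will see next week, there are specific model architectures that are better suited to image-like data, such as Convolutional Neural Networks (CNNs). For simplicity, in this tutorial we focus exclusively on Multi-Layer Perceptron (MLP) models, since they let us highlight many important optimization challenges that are shared with more advanced neural network designs.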

class MLP(nn.Module):
  """
  This class implements MLPs in Pytorch of an arbitrary number of hidden
  layers of potentially different sizes. Since we concentrate on classification
  tasks in this tutorial, we have a log_softmax layer at prediction time.
  """

  def __init__(self, in_dim=784, out_dim=10, hidden_dims=[], use_bias=True):
    """
    Constructs a MultiLayerPerceptron

    Args:
      in_dim: Integer
        dimensionality of input data (784)
      out_dim: Integer
        number of classes (10)
      hidden_dims: List
        containing the dimensions of the hidden layers,
        empty list corresponds to a linear model (in_dim, out_dim)

    Returns:
      Nothing
    """

    super(MLP, self).__init__()

    self.in_dim = in_dim
    self.out_dim = out_dim

    # If we have no hidden layer, just initialize a linear model (e.g. in logistic regression)
    if len(hidden_dims) == 0:
      layers = [nn.Linear(in_dim, out_dim, bias=use_bias)]
    else:
      # 'Actual' MLP with dimensions in_dim - num_hidden_layers*[hidden_dim] - out_dim
      layers = [nn.Linear(in_dim, hidden_dims[0], bias=use_bias), nn.ReLU()]

      # Loop until before the last layer
      for i, hidden_dim in enumerate(hidden_dims[:-1]):
        layers += [nn.Linear(hidden_dim, hidden_dims[i + 1], bias=use_bias),
                   nn.ReLU()]

      # Add final layer to the number of classes
      layers += [nn.Linear(hidden_dims[-1], out_dim, bias=use_bias)]

    self.main = nn.Sequential(*layers)

  def forward(self, x):
    """
    Defines the network structure and flow from input to output

    Args:
      x: Tensor
        Image to be processed by the network

    Returns:
      output: Tensor
        log-probabilities over the classes, of shape (batch_size, out_dim)

    """
    # Flatten each image into a 'vector'
    transformed_x = x.view(-1, self.in_dim)
    hidden_output = self.main(transformed_x)
    output = F.log_softmax(hidden_output, dim=1)
    return output

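A linear model is a very special kind of MLP: it is equivalent to an MLP with zero hidden layers. It is simply an affine transformation, in other words a 'linear' map $W x$ with an 'offset' $b$, followed by a softmax function.

$$ f(x) = \text{softmax}(W x + b) $$

Here $x \in \mathbb{R}^{784}$ , $W \in \mathbb{R}^{10 \times 784}$ and $b \in \mathbb{R}^{10}$ . The weight matrix has dimensions $10 \times 784$ because the input tensors are flattened images ( $28 \times 28 = 784$ dimensions) and the output layer consists of 10 nodes. Note also that the implementation bundles $b$ with $W$ in a single linear layer and maps the rows of the input rather than its columns: the $i$-th row of the output is the $i$-th row of the input mapped by $W$, plus the bias term. For more on affine maps, see https://pytorch.org/tutorials/beginner/nlp/deep_learning_tutorial.html#affine-maps.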
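The row-wise behavior described above can be sketched in plain Python on a toy problem. The sizes (3 inputs, 2 classes) and all numeric values below are assumptions made for illustration, and the helper names `softmax` and `affine` are hypothetical; the tutorial itself uses `nn.Linear` and `F.log_softmax` for this.

```python
import math

def softmax(z):
  # Subtract the max for numerical stability; this does not change the result
  m = max(z)
  exps = [math.exp(v - m) for v in z]
  total = sum(exps)
  return [e / total for e in exps]

def affine(x, W, b):
  # Computes W x + b for a single input vector x
  return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
          for row, b_i in zip(W, b)]

# Toy sizes (3 inputs, 2 classes) instead of 784 inputs and 10 classes
W = [[0.5, -1.0, 0.0],
     [0.1, 0.2, 0.3]]
b = [0.0, 0.5]
batch = [[1.0, 2.0, 3.0],
         [0.0, 0.0, 0.0]]

# Each row of the batch is mapped independently to one row of probabilities
probs = [softmax(affine(x, W, b)) for x in batch]
for row in probs:
  print([round(p, 3) for p in row])
```

Each output row is a probability distribution over the classes, and the all-zero input in the second row shows the effect of the bias alone, since $W x$ vanishes there.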

# Empty hidden_dims means we take a model with zero hidden layers.
model = MLP(in_dim=784, out_dim=10, hidden_dims=[])

# We print the model structure with 784 inputs and 10 outputs
print(model)

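Section 2.3: Loss

While we care about the accuracy of the model, the 'discrete' nature of the 0-1 loss makes it challenging to optimize. To learn good parameters, we will minimize the cross-entropy loss (negative log-likelihood), which you saw in the last lecture, as a surrogate objective.

This particular choice of model and optimization objective leads to a convex optimization problem with respect to the parameters $W$ and $b$.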

loss_fn = F.nll_loss
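As a rough illustration of what the negative log-likelihood measures, the sketch below recomputes it by hand in plain Python. The helper names, the batch of four targets and the logit values are assumptions for illustration only; in the tutorial this computation is done by `F.log_softmax` followed by `F.nll_loss`.

```python
import math

def log_softmax(z):
  # log(softmax(z)) computed stably via the log-sum-exp trick
  m = max(z)
  log_norm = m + math.log(sum(math.exp(v - m) for v in z))
  return [v - log_norm for v in z]

def nll_loss(log_probs, targets):
  # Mean of -log p(correct class) over the batch
  return -sum(lp[t] for lp, t in zip(log_probs, targets)) / len(targets)

num_classes = 10
targets = [3, 1, 4, 1]  # hypothetical labels

# A model that outputs identical logits predicts a uniform distribution
uniform_logits = [[0.0] * num_classes for _ in targets]
loss_uniform = nll_loss([log_softmax(z) for z in uniform_logits], targets)

# A model that puts a large logit on the correct class
confident_logits = [[10.0 if k == t else 0.0 for k in range(num_classes)]
                    for t in targets]
loss_confident = nll_loss([log_softmax(z) for z in confident_logits], targets)

print(f"uniform prediction:   {loss_uniform:.4f}  (log(10) = {math.log(10):.4f})")
print(f"confident prediction: {loss_confident:.6f}")
```

A uniform prediction over 10 classes gives a loss of exactly $\log 10$, which is why a freshly initialized classifier is expected to start close to that value, while confident and correct predictions push the loss towards zero.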

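Section 2.4: Interpretability

In the last lecture, you saw that inspecting the weights of a model can provide insights into which 'concepts' the model has learned. Here we show the weights of a partially trained model. The weights corresponding to each class 'learn' to _fire_ when an input of that class is detected.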

#@markdown Run _this cell_ to train the model. If you are curious about how the training
#@markdown takes place, double-click this cell to find out. At the end of this tutorial
#@markdown you will have the opportunity to train a more complex model on your own.

cell_verbose = False
partial_trained_model = MLP(in_dim=784, out_dim=10, hidden_dims=[])

if cell_verbose:
  print('Init loss', loss_fn(partial_trained_model(X), y).item())  # Should be close to np.log(10), where 10 = number of classes

# Invoke an optimizer using Adaptive gradient and Momentum (more about this in Section 7)
optimizer = optim.Adam(partial_trained_model.parameters(), lr=7e-4)
for _ in range(200):
  loss = loss_fn(partial_trained_model(X), y)
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()

if cell_verbose:
  print('End loss', loss_fn(partial_trained_model(X), y).item()) # This should be less than 1e-2

# Show class filters of a trained model
W = partial_trained_model.main[0].weight.data.numpy()

fig, axs = plt.subplots(1, 10, figsize=(15, 4))
for class_id in range(10):
  axs[class_id].imshow(W[class_id, :].reshape(28, 28), cmap='gray_r')
  axs[class_id].axis('off')
  axs[class_id].set_title('Class ' + str(class_id) )

plt.show()

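Section 3: High dimensional search

Time estimate: ~25 mins

We now have a model with its corresponding trainable parameters, as well as an objective to optimize. Where do we go next? How do we find a 'good' configuration of parameters?

One idea is to choose a random direction and move only if the objective is reduced. However, this is inefficient in high dimensions, and you will see how gradient descent (with a suitable step size) can guarantee consistent improvement of the objective function.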
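The gap between random directions and the gradient direction can already be seen on a toy problem. The sketch below uses a simple bowl-shaped objective in 1000 dimensions; the objective, the dimension, the step size and the random seed are all assumptions for illustration and are unrelated to the MNIST model used in this tutorial.

```python
import random

random.seed(0)

dim = 1000  # number of parameters (illustrative choice)
w = [1.0] * dim

def loss(w):
  # Simple convex bowl: J(w) = 0.5 * ||w||^2, whose gradient is w itself
  return 0.5 * sum(w_i * w_i for w_i in w)

base_loss = loss(w)
step = 1e-2

# One gradient descent step: w <- w - step * grad J(w) = w - step * w
w_gd = [w_i - step * w_i for w_i in w]
gd_delta = loss(w_gd) - base_loss

# 100 random-direction steps with Gaussian noise of the same scale
random_deltas = []
for _ in range(100):
  w_rand = [w_i + step * random.gauss(0.0, 1.0) for w_i in w]
  random_deltas.append(loss(w_rand) - base_loss)

num_improving = sum(d < 0 for d in random_deltas)
print(f"gradient step change in loss: {gd_delta:.3f}")
print(f"best random change in loss:   {min(random_deltas):.3f}")
print(f"random directions that improved: {num_improving} / 100")
```

In this setting a random step is roughly as likely to increase the loss as to decrease it, and even the best of the 100 random trials improves the loss far less than the single gradient step.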

# @title Video 3: Optimization of an Objective Function
from ipywidgets import widgets
from IPython.display import YouTubeVideo
from IPython.display import IFrame
from IPython.display import display


class PlayVideo(IFrame):
  def __init__(self, id, source, page=1, width=400, height=300, **kwargs):
    self.id = id
    if source == 'Bilibili':
      src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'
    elif source == 'Osf':
      src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'
    super(PlayVideo, self).__init__(src, width, height, **kwargs)


def display_videos(video_ids, W=400, H=300, fs=1):
  tab_contents = []
  for i, video_id in enumerate(video_ids):
    out = widgets.Output()
    with out:
      if video_ids[i][0] == 'Youtube':
        video = YouTubeVideo(id=video_ids[i][1], width=W,
                             height=H, fs=fs, rel=0)
        print(f'Video available at https://youtube.com/watch?v={video.id}')
      else:
        video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,
                          height=H, fs=fs, autoplay=False)
        if video_ids[i][0] == 'Bilibili':
          print(f'Video available at https://www.bilibili.com/video/{video.id}')
        elif video_ids[i][0] == 'Osf':
          print(f'Video available at https://osf.io/{video.id}')
      display(video)
    tab_contents.append(out)
  return tab_contents


video_ids = [('Youtube', 'aSJTRdjRvvw'), ('Bilibili', 'BV1aL411H7Ce')]
tab_contents = display_videos(video_ids, W=854, H=480)
tabs = widgets.Tab()
tabs.children = tab_contents
for i in range(len(tab_contents)):
  tabs.set_title(i, video_ids[i][0])
display(tabs)

# @title Submit your feedback
content_review(f"{feedback_prefix}_Optimization_of_an_Objective_Function_Video")

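Coding Exercise 3: Implement gradient descent

In this exercise you will use PyTorch's automatic differentiation capabilities to compute the gradient of the loss with respect to the parameters of the model. You will then use these gradients to implement the parameter update performed by gradient descent.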

def zero_grad(params):
  """
  Clear gradients as they accumulate on successive backward calls

  Args:
    params: an iterator over tensors
      i.e., updating the Weights and biases

  Returns:
    Nothing
  """
  for par in params:
    if par.grad is not None:
      par.grad.data.zero_()


def random_update(model, noise_scale=0.1, normalized=False):
  """
  Performs a random update on the parameters of the model to help
  understand the effectiveness of updating random directions
  for the problem of optimizing the parameters of a high-dimensional linear model.

  Args:
    model: nn.Module derived class
      The model whose parameters are to be updated

    noise_scale: float
      Scale of the random perturbation added to each parameter

    normalized: Bool
      If True, rescale the noise for each parameter tensor to unit norm

  Returns:
    Nothing
  """
  for par in model.parameters():
    noise = torch.randn_like(par)
    if normalized:
      noise /= torch.norm(noise)
    par.data +=  noise_scale * noise

Let's implement gradient descent!

def gradient_update(loss, params, lr=1e-3):
  """
  Perform a gradient descent update on a given loss over a collection of parameters

  Args:
    loss: Tensor
      A scalar tensor containing the loss through which the gradient will be computed
    params: List of iterables
      Collection of parameters with respect to which we compute gradients
    lr: Float
      Scalar specifying the learning rate or step-size for the update

  Returns:
    Nothing
  """
  # Clear up gradients as Pytorch automatically accumulates gradients from
  # successive backward calls
  zero_grad(params)

  # Compute gradients on given objective
  loss.backward()

  with torch.no_grad():
    for par in params:
      #################################################
      ## TODO for students: update the value of the parameter ##
      raise NotImplementedError("Student exercise: implement gradient update")
      #################################################
      # Here we work with the 'data' attribute of the parameter rather than the
      # parameter itself.
      # Hence - use the learning rate and the parameter's .grad.data attribute to perform an update
      par.data -= ...


set_seed(seed=SEED)
model1 = MLP(in_dim=784, out_dim=10, hidden_dims=[])
print('\n The model1 parameters before the update are: \n')
print_params(model1)
loss = loss_fn(model1(X), y)

## Uncomment below to test your function
# gradient_update(loss, list(model1.parameters()), lr=1e-1)
# print('\n The model1 parameters after the update are: \n')
# print_params(model1)

 The model1 parameters after the update are:

main.0.weight tensor([[-0.0263,  0.0010,  0.0174,  ...,  0.0298,  0.0278, -0.0220],
        [-0.0047, -0.0302, -0.0093,  ..., -0.0077,  0.0248, -0.0240],
        [ 0.0234, -0.0237,  0.0335,  ...,  0.0117,  0.0263, -0.0187],
        ...,
        [-0.0006,  0.0156,  0.0110,  ...,  0.0143, -0.0302, -0.0145],
        [ 0.0164,  0.0286,  0.0238,  ..., -0.0127, -0.0191,  0.0188],
        [ 0.0206, -0.0354, -0.0184,  ..., -0.0272,  0.0098,  0.0002]])
main.0.bias tensor([-0.0292, -0.0018,  0.0115, -0.0370,  0.0054,  0.0155,  0.0317,  0.0246,
         0.0198, -0.0061])

Click for solution

# @title Submit your feedback
content_review(f"{feedback_prefix}_Implement_Gradient_descent_Exercise")

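Comparing updates

These plots compare the effectiveness of updating along random directions for the problem of optimizing the parameters of a high-dimensional linear model. We contrast the behavior at initialization and at an intermediate stage of training by showing a histogram of the change in loss over 100 different random directions against the change in loss produced by gradient descent.

Remember: since we are trying to minimize the loss, more negative is better!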

# @markdown _Run this cell_ to visualize the results
fig, axs = plt.subplots(1, 2, figsize=(10, 4))

for id, (model_name, my_model) in enumerate([('Initialization', model),
                                              ('Partially trained', partial_trained_model)]):
  # Compute the loss we will be comparing to
  base_loss = loss_fn(my_model(X), y)

  # Compute the improvement via gradient descent
  dummy_model = copy.deepcopy(my_model)
  loss1 = loss_fn(dummy_model(X), y)
  gradient_update(loss1, list(dummy_model.parameters()), lr=1e-2)
  gd_delta = loss_fn(dummy_model(X), y) - base_loss

  deltas = []
  for trial_id in range(100):
    # Compute the improvement obtained with a random direction
    dummy_model = copy.deepcopy(my_model)
    random_update(dummy_model, noise_scale=1e-2)
    deltas.append((loss_fn(dummy_model(X), y) - base_loss).item())

  # Plot histogram for random direction and vertical line for gradient descent
  axs[id].hist(deltas, label='Random Directions', bins=20)
  axs[id].set_title(model_name)
  axs[id].set_xlabel('Change in loss')
  axs[id].set_ylabel('% samples')
  axs[id].axvline(0, c='green', alpha=0.5)
  axs[id].axvline(gd_delta.item(), linestyle='--', c='red', alpha=1,
                  label='Gradient Descent')


handles, labels = axs[id].get_legend_handles_labels()
fig.legend(handles, labels, loc='upper center',
           bbox_to_anchor=(0.5, 1.05),
           fancybox=False, shadow=False, ncol=2)

plt.show()

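Think! 3: Gradient descent vs. random search

Based on the histograms above, compare the behavior of gradient descent and random search. Which method is more reliable? How can you explain the difference in behavior between initialization and partway through training?

Click for solution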

# @title Submit your feedback
content_review(f"{feedback_prefix}_Gradient_descent_vs_random_search_Discussion")

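Section 4: Poor conditioning

Time estimate: ~30 mins

Already in this 'simple' logistic regression problem, the issue of poor conditioning is haunting us. Not all parameters are created equal, and the sensitivity of the network to changes in its parameters has a big impact on the dynamics of the optimization.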
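A two-parameter quadratic is enough to sketch what poor conditioning does to gradient descent. In the example below the curvatures (1 and 100), the learning rates and the number of steps are assumptions picked for illustration; the function names are hypothetical and do not appear elsewhere in the tutorial.

```python
def grad(u, v, a=1.0, b=100.0):
  # Gradient of J(u, v) = 0.5 * (a * u^2 + b * v^2):
  # the loss is 100 times more sensitive to v than to u
  return a * u, b * v

def run_gd(u, v, lr, steps):
  # Plain gradient descent with one shared learning rate
  for _ in range(steps):
    gu, gv = grad(u, v)
    u, v = u - lr * gu, v - lr * gv
  return u, v

# A learning rate that is safe for the steep direction v ...
u_ok, v_ok = run_gd(1.0, 1.0, lr=0.015, steps=50)
# ... and a slightly larger one that is still tiny for the flat direction u
u_bad, v_bad = run_gd(1.0, 1.0, lr=0.025, steps=50)

print(f"lr=0.015: u = {u_ok:.3f}, v = {v_ok:.3e}")
print(f"lr=0.025: u = {u_bad:.3f}, v = {v_bad:.3e}")
```

With the smaller learning rate the steep coordinate `v` converges almost immediately while the flat coordinate `u` has barely moved after 50 steps. Raising the learning rate to speed up `u` makes `v` diverge, so a single step size cannot serve both directions well.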

# @title Video 4: Momentum
from ipywidgets import widgets
from IPython.display import YouTubeVideo
from IPython.display import IFrame
from IPython.display import display


class PlayVideo(IFrame):
  def __init__(self, id, source, page=1, width=400, height=300, **kwargs):
    self.id = id
    if source == 'Bilibili':
      src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'
    elif source == 'Osf':
      src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'
    super(PlayVideo, self).__init__(src, width, height, **kwargs)


def display_videos(video_ids, W=400, H=300, fs=1):
  tab_contents = []
  for i, video_id in enumerate(video_ids):
    out = widgets.Output()
    with out:
      if video_ids[i][0] == 'Youtube':
        video = YouTubeVideo(id=video_ids[i][1], width=W,
                             height=H, fs=fs, rel=0)
        print(f'Video available at https://youtube.com/watch?v={video.id}')
      else:
        video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,
                          height=H, fs=fs, autoplay=False)
        if video_ids[i][0] == 'Bilibili':
          print(f'Video available at https://www.bilibili.com/video/{video.id}')
        elif video_ids[i][0] == 'Osf':
          print(f'Video available at https://osf.io/{video.id}')
      display(video)
    tab_contents.append(out)
  return tab_contents


video_ids = [('Youtube', '3ES5O58Y_2M'), ('Bilibili', 'BV1NL411H71t')]
tab_contents = display_videos(video_ids, W=854, H=480)
tabs = widgets.Tab()
tabs.children = tab_contents
for i in range(len(tab_contents)):
  tabs.set_title(i, video_ids[i][0])
display(tabs)

# @title Submit your feedback
content_review(f"{feedback_prefix}_Momentum_Video")

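We illustrate this issue in a 2-dimensional setting. We vary only two of the network's parameters and freeze all the others: one is an element of the weight matrix (filter) for class 0, and the other is the bias for class 7. This results in an optimization problem in two variables.

Think! 4: How does momentum work?

How different is the behavior of these two parameters under gradient descent? How does momentum help to close that gap?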

# to remove solution
"""
The landscapes of the two parameters appear to be
flatter under gradient descent as can be seen in interactive demo 4 below.

As randomly-initialised models exhibit chaos, we use the Newton's approach
by tweaking the learning rate i.e., taking smaller steps in the indicated
direction and recomputing gradients to find an optimal solution on a
varied surface. Momentum helps reduce the chaos by maintaining a consistent
direction for exploration (linear combination of the previous heading vector,
and the newly-computed gradient vector).
""";

# @title Submit your feedback
content_review(f"{feedback_prefix}_How_Momentum_works_Discussion")

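Coding Exercise 4: Implement momentum

In this exercise you will implement the following momentum update:

$$ w_{t+1} = w_t - \eta \nabla J(w_t) + \beta (w_t - w_{t-1}) $$

It is convenient to re-express this update rule in a recursive form. To do so, we define the 'velocity' as the quantity:

$$ v_{t-1} := w_{t} - w_{t-1} $$

This leads to the two-step update rule:

$$ v_t = - \eta \nabla J(w_t) + \beta (\underbrace{w_t - w_{t-1}}_{v_{t-1}}) $$

$$ w_{t+1} \leftarrow w_t + v_{t} $$

Pay attention to the positive sign of the update in the last equation. It follows from the definition of $v_t$ above.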
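As a sanity check on the algebra, the sketch below runs both forms of the update on a scalar toy objective and confirms that they trace the same trajectory. The objective $J(w) = \frac{1}{2} w^2$, the values of $\eta$ and $\beta$, and the zero initial velocity are assumptions for illustration; this is not the PyTorch implementation the exercise asks for.

```python
def grad_J(w):
  # Gradient of the toy objective J(w) = 0.5 * w^2
  return w

eta, beta, steps = 0.1, 0.8, 20

# One-line form: w_{t+1} = w_t - eta * grad J(w_t) + beta * (w_t - w_{t-1})
w_prev, w = 1.0, 1.0  # w_0 = w_{-1}, i.e. zero initial velocity
traj_one_line = [w]
for _ in range(steps):
  w_next = w - eta * grad_J(w) + beta * (w - w_prev)
  w_prev, w = w, w_next
  traj_one_line.append(w)

# Two-step form: v_t = -eta * grad J(w_t) + beta * v_{t-1}; w_{t+1} = w_t + v_t
w, v = 1.0, 0.0
traj_two_step = [w]
for _ in range(steps):
  v = -eta * grad_J(w) + beta * v
  w = w + v
  traj_two_step.append(w)

max_gap = max(abs(a - b) for a, b in zip(traj_one_line, traj_two_step))
print(f"largest difference between the two forms: {max_gap:.2e}")
```

The two trajectories agree up to floating-point rounding, which is why an implementation only needs to store one velocity per parameter rather than the previous parameter values.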

# @title Run this cell to setup some helper functions!

def loss_2d(model, u, v, mask_idx=(0, 378), bias_id=7):
  """
  Defines a 2-dim function by freezing all
  but two parameters of a linear model.

  Args:
    model: nn.Module
      a pytorch linear model
    u: Scalar
      first free parameter
    v: Scalar
      second free parameter
    mask_idx: Tuple
      selects parameter in weight matrix replaced by u
    bias_id: Integer
      selects parameter in bias vector replaced by v

  Returns:
    loss: Scalar
      loss of the 'new' model
      over inputs X, y (defined externally)
  """

  # We zero out the element of the weight tensor that will be
  # replaced by u
  mask = torch.ones_like(model.main[0].weight)
  mask[mask_idx[0], mask_idx[1]] = 0.
  masked_weights = model.main[0].weight * mask

  # u is replacing an element of the weight matrix
  masked_weights[mask_idx[0], mask_idx[1]] = u

  res = X.reshape(-1, 784) @ masked_weights.T + model.main[0].bias

  # v is replacing the bias for class `bias_id` (class 7 by default)
  res[:, bias_id] += v - model.main[0].bias[bias_id]
  res =  F.log_softmax(res, dim=1)

  return loss_fn(res, y)


def plot_surface(U, V, Z, fig):
  """
  Plot a 3D loss surface given
  meshed inputs U, V and values Z

  Args:
    U: nd.array()
      Meshgrid of values for the first parameter (weight axis)
    V: nd.array()
      Meshgrid of values for the second parameter (bias axis)
    Z: nd.array()
      Loss values evaluated on the (U, V) grid
    fig: matplotlib.figure.Figure instance
      Helps create a new figure, or activate an existing figure.

  Returns:
    ax: matplotlib.axes._subplots.AxesSubplot instance
      Plotted subplot data
  """
  ax = fig.add_subplot(1, 2, 2, projection='3d')
  ax.view_init(45, -130)

  surf = ax.plot_surface(U, V, Z, cmap=plt.cm.coolwarm,
                      linewidth=0, antialiased=True, alpha=0.5)

  # Select certain level contours to plot
  # levels = Z.min() * np.array([1.005, 1.1, 1.3, 1.5, 2.])
  # plt.contour(U, V, Z)# levels=levels, alpha=0.5)

  ax.set_xlabel('Weight')
  ax.set_ylabel('Bias')
  ax.set_zlabel('Loss', rotation=90)

  return ax


def plot_param_distance(best_u, best_v, trajs, fig, styles, labels,
                        use_log=False, y_min_v=-12.0, y_max_v=1.5):
  """
  Plot the distance to each of the
  two parameters for a collection of 'trajectories'

  Args:
    best_u: float
      Optimal value of the first parameter (weight)
    best_v: float
      Optimal value of the second parameter (bias)
    trajs: List
      Trajectories to plot, one per optimizer run
    fig: matplotlib.figure.Figure instance
      Helps create a new figure, or activate an existing figure.
    styles: List
      Matplotlib line style string for each trajectory
    labels: List
      Legend label for each trajectory
    use_log: Bool
      Specifies if log distance should be plotted; else, absolute distance
    y_min_v: float
      Lower y-axis limit (used when use_log is True)
    y_max_v: float
      Upper y-axis limit (used when use_log is True)

  Returns:
    ax: matplotlib.axes._subplots.AxesSubplot instance
      Plotted subplot data
  """
  ax = fig.add_subplot(1, 1, 1)

  for traj, style, label in zip(trajs, styles, labels):
    d0 = np.array([np.abs(_[0] - best_u) for _ in traj])
    d1 = np.array([np.abs(_[1] - best_v) for _ in traj])
    if use_log:
      d0 = np.log(1e-16 + d0)
      d1 = np.log(1e-16 + d1)
    ax.plot(range(len(traj)), d0, style, label='weight - ' + label)
    ax.plot(range(len(traj)), d1, style, label='bias - ' + label)
  ax.set_xlabel('Iteration')
  if use_log:
    ax.set_ylabel('Log distance to optimum (per dimension)')
    ax.set_ylim(y_min_v, y_max_v)
  else:
    ax.set_ylabel('Abs distance to optimum (per dimension)')
  ax.legend(loc='right', bbox_to_anchor=(1.5, 0.5),
            fancybox=False, shadow=False, ncol=1)

  return ax


def run_optimizer(inits, eval_fn, update_fn, max_steps=500,
                  optim_kwargs={'lr':1e-2}, log_traj=True):
  """
  Runs an optimizer on a given
  objective and logs parameter trajectory

  Args:
      inits: List of scalars
        initialization of parameters
      eval_fn: Callable
        function computing the objective to be minimized
      update_fn: Callable
        function executing parameter update
      max_steps: Integer
        number of iterations to run
      optim_kwargs: Dictionary
        customizable dictionary containing appropriate hyperparameters for the chosen optimizer
      log_traj: Bool
        Specifies if log distance should be calculated; else, absolute distance

  Returns:
      list: List
        trajectory information [*params, loss] for each optimization step
  """

  # Initialize parameters and optimizer
  params = [nn.Parameter(torch.tensor(init)) for init in inits]
  # Methods like momentum and rmsprop keep an auxiliary vector of parameters
  aux_tensors = [torch.zeros_like(par) for par in params]
  if log_traj:
    traj = np.zeros((max_steps, len(params)+1))
  for step in range(max_steps):
    # Evaluate loss
    loss = eval_fn(*params)
    # Store 'trajectory' information
    if log_traj:
      traj[step, :] = [par.item() for par in params] + [loss.item()]
    # Perform update
    if update_fn == gradient_update:
      gradient_update(loss, params, **optim_kwargs)
    else:
      update_fn(loss, params, aux_tensors, **optim_kwargs)
  if log_traj:
    return traj


L = 4.
xs = np.linspace(-L, L, 30)
ys = np.linspace(-L, L, 30)
U, V = np.meshgrid(xs, ys)

def momentum_update(loss, params, grad_vel, lr=1e-3, beta=0.8):
  """
  Perform a momentum update over a collection of parameters given a loss and velocities

  Args:
    loss: Tensor
      A scalar tensor containing the loss through which gradient will be computed
    params: Iterable
      Collection of parameters with respect to which we compute gradients
    grad_vel: Iterable
      Collection containing the 'velocity' v_t for each parameter
    lr: Float
      Scalar specifying the learning rate or step-size for the update
    beta: Float
      Scalar 'momentum' parameter

  Returns:
    Nothing
  """
  # Clear up gradients as Pytorch automatically accumulates gradients from
  # successive backward calls
  zero_grad(params)
  # Compute gradients on given objective
  loss.backward()

  with torch.no_grad():
    for (par, vel) in zip(params, grad_vel):
      #################################################
      ## TODO for students: update the value of the parameter ##
      raise NotImplementedError("Student exercise: implement momentum update")
      #################################################
      # Update 'velocity'
      vel.data = ...
      # Update parameters
      par.data += ...


set_seed(seed=SEED)
model2 = MLP(in_dim=784, out_dim=10, hidden_dims=[])
print('\n The model2 parameters before the update are: \n')
print_params(model2)
loss = loss_fn(model2(X), y)
initial_vel = [torch.randn_like(p) for p in model2.parameters()]

## Uncomment below to test your function
# momentum_update(loss, list(model2.parameters()), grad_vel=initial_vel, lr=1e-1, beta=0.9)
# print('\n The model2 parameters after the update are: \n')
# print_params(model2)

 The model2 parameters after the update are:

main.0.weight tensor([[ 1.5898,  0.0116, -2.0239,  ..., -1.0871,  0.4030, -0.9577],
        [ 0.4653,  0.6022, -0.7363,  ...,  0.5485, -0.2747, -0.6539],
        [-1.4117, -1.1045,  0.6492,  ..., -1.0201,  0.6503,  0.1310],
        ...,
        [-0.5098,  0.5075, -0.0718,  ...,  1.1192,  0.2900, -0.9657],
        [-0.4405, -0.1174,  0.7542,  ...,  0.0792, -0.1857,  0.3537],
        [-1.0824,  1.0080, -0.4254,  ..., -0.3760, -1.7491,  0.6025]])
main.0.bias tensor([ 0.4147, -1.0440,  0.8720, -1.6201, -0.9632,  0.9430, -0.5180,  1.3417,
         0.6574,  0.3677])

Click for solution

# @title Submit your feedback
content_review(f"{feedback_prefix}_Implement_momentum_Exercise")

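Interactive Demo 4: Momentum vs. GD

The plots below show, for both methods, the distance to the optimum for each of the two variables, as well as the parameter trajectories over the loss surface.

Tune the learning rate and momentum parameters to achieve a loss below $10^{-6}$ for both variables within 100 iterations.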
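To build intuition before moving the sliders, here is a minimal, dependency-free sketch of the same comparison on a toy problem. It makes two assumptions: an ill-conditioned quadratic $f(u, v) = \frac{1}{2}(u^2 + 25 v^2)$ stands in for the demo's `loss_2d`, and momentum takes the heavy-ball form (velocity = beta * velocity - lr * gradient). The names `quad_loss`, `quad_grad`, `run_gd`, and `run_momentum` are hypothetical and not part of the tutorial's helpers.

```python
# Toy stand-in for the demo: f(u, v) = 0.5 * (u**2 + 25 * v**2).
# The curvature is 25x larger along v than along u (poor conditioning).
CURVATURES = (1.0, 25.0)


def quad_loss(w):
  """Loss of the toy quadratic at point w = (u, v)."""
  return 0.5 * sum(h * x * x for h, x in zip(CURVATURES, w))


def quad_grad(w):
  """Gradient of the toy quadratic at point w = (u, v)."""
  return [h * x for h, x in zip(CURVATURES, w)]


def run_gd(w0, lr, steps):
  """Plain gradient descent: w <- w - lr * grad."""
  w = list(w0)
  for _ in range(steps):
    g = quad_grad(w)
    w = [x - lr * gx for x, gx in zip(w, g)]
  return w


def run_momentum(w0, lr, beta, steps):
  """Heavy-ball momentum (assumed form): vel <- beta * vel - lr * grad; w <- w + vel."""
  w = list(w0)
  vel = [0.0] * len(w)
  for _ in range(steps):
    g = quad_grad(w)
    vel = [beta * v - lr * gx for v, gx in zip(vel, g)]
    w = [x + v for x, v in zip(w, vel)]
  return w


inits = [2.5, 3.7]  # same starting point as INITS in the demo
gd_loss = quad_loss(run_gd(inits, lr=0.03, steps=100))
mom_loss = quad_loss(run_momentum(inits, lr=0.03, beta=0.8, steps=100))
print(f"GD loss after 100 steps:       {gd_loss:.3e}")
print(f"Momentum loss after 100 steps: {mom_loss:.3e}")
```

With a single learning rate, gradient descent must keep its step small enough for the steep direction, which makes progress along the flat direction slow. On this toy problem the velocity term lets momentum keep moving along the flat direction at the same learning rate.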

# @markdown Run this cell to enable the widget!
from matplotlib.lines import Line2D

def run_newton(func, init_list=[0., 0.], max_iter=200):
  """
  Find the optimum of this 2D problem using Newton's method

  Args:
    func: Callable
      Objective function of the two parameters
    init_list: List
      Initial values of the two parameters
    max_iter: Integer
      Number of Newton iterations to run

  Returns:
    par_tensor.data.numpy(): ndarray
      Parameter values after the final Newton update
  """

  par_tensor = torch.tensor(init_list, requires_grad=True)
  t_g = lambda par_tensor: func(par_tensor[0], par_tensor[1])

  for _ in tqdm(range(max_iter)):
    eval_loss = t_g(par_tensor)
    eval_grad = torch.autograd.grad(eval_loss, [par_tensor])[0]
    eval_hess = torch.autograd.functional.hessian(t_g, par_tensor)
    # Newton's update is:  - inverse(Hessian) x gradient
    par_tensor.data -= torch.inverse(eval_hess) @ eval_grad

  return par_tensor.data.numpy()


set_seed(2021)
model = MLP(in_dim=784, out_dim=10, hidden_dims=[])
# Define 2d loss objectives and surface values
g = lambda u, v: loss_2d(copy.deepcopy(model), u, v)
Z = np.fromiter(map(g, U.ravel(), V.ravel()), U.dtype).reshape(V.shape)

best_u, best_v  = run_newton(func=g)

# Initialization of the variables
INITS = [2.5, 3.7]

# Used for plotting
LABELS = ['GD', 'Momentum']
COLORS = ['black', 'red']
LSTYLES = ['-', '--']


@widgets.interact_manual
def momentum_experiment(max_steps=widgets.IntSlider(300, 50, 500, 5),
                        lr=widgets.FloatLogSlider(value=1e-1, min=-3, max=0.7, step=0.1),
                        beta=widgets.FloatSlider(value=9e-1, min=0, max=1., step=0.01)
                        ):
  """
  Displays the momentum experiment as a widget

  Args:
    max_steps: widget integer slider
      Maximum number of steps on the slider with default = 300
    lr: widget float slider
      Scalar specifying the learning rate or step-size for the update with default = 1e-1
    beta: widget float slider
      Scalar 'momentum' parameter with default = 9e-1

  Returns:
    Nothing
  """
  # Execute both optimizers
  sgd_traj = run_optimizer(INITS, eval_fn=g, update_fn=gradient_update,
                           max_steps=max_steps, optim_kwargs={'lr': lr})
  mom_traj = run_optimizer(INITS, eval_fn=g, update_fn=momentum_update,
                           max_steps=max_steps, optim_kwargs={'lr': lr, 'beta':beta})

  TRAJS = [sgd_traj, mom_traj]

  # Plot distances
  fig = plt.figure(figsize=(9,4))
  plot_param_distance(best_u, best_v, TRAJS, fig,
                      LSTYLES, LABELS, use_log=True, y_min_v=-12.0, y_max_v=1.5)

  # Plot trajectories
  fig = plt.figure(figsize=(12, 5))
  ax = plot_surface(U, V, Z, fig)
  for traj, c, label in zip(TRAJS, COLORS, LABELS):
    ax.plot3D(*traj.T, c, linewidth=0.3, label=label)
    # Note: no format string here; the 4th positional argument of scatter3D is zdir
    ax.scatter3D(*traj.T, s=1, c=c)

  # Plot optimum point
  ax.scatter(best_u, best_v, Z.min(), marker='*', s=80, c='lime', label='Opt.')
  lines = [Line2D([0], [0],
                  color=c,
                  linewidth=3,
                  linestyle='--') for c in COLORS]
  lines.append(Line2D([0], [0], color='lime', linewidth=0, marker='*'))
  ax.legend(lines, LABELS + ['Optimum'], loc='right',
            bbox_to_anchor=(.8, -0.1), ncol=len(LABELS) + 1)

# @title Submit your feedback
content_review(f"{feedback_prefix}_Momentum_vs_GD_Interactive_Demo")

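Think! 4: Momentum and oscillations

How does this specific example illustrate the issue of poor conditioning in optimization? How does momentum help resolve these difficulties?
Do you see oscillations for any of these methods? Why do they occur?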

Click for solution

# @title Submit your feedback
content_review(f"{feedback_prefix}_Momentum_and_oscillations_Discussion")

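Section 5: Non-convexity

Time estimate: ~30 mins

Introducing even a single hidden layer in the neural network turns the previously convex optimization problem into a non-convex one. And with great non-convexity comes great responsibility... (Sorry, we couldn't help it!)

Note: From this section onwards, the remainder of the tutorial deals with non-convex optimization problems.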
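A tiny worked example can make the loss of convexity concrete. The sketch below assumes a hypothetical one-input, one-hidden-unit linear 'network' $\hat{y} = w_2 w_1 x$ fit to the single data point $(x, y) = (1, 1)$; it is not the tutorial's MLP. A convex function never rises above the chord between two points, so it is enough to exhibit one midpoint where the loss exceeds the average of the endpoint losses.

```python
def two_layer_loss(w1, w2, x=1.0, y=1.0):
  """Squared error of the toy network y_hat = w2 * (w1 * x)."""
  return (w2 * w1 * x - y) ** 2


# Two different weight settings that both fit the data point perfectly
point_a = (1.0, 1.0)
point_b = (-1.0, -1.0)
midpoint = tuple((a + b) / 2 for a, b in zip(point_a, point_b))

loss_a = two_layer_loss(*point_a)
loss_b = two_layer_loss(*point_b)
loss_mid = two_layer_loss(*midpoint)
chord_mid = (loss_a + loss_b) / 2  # value a convex function could not exceed

print(f"loss at A: {loss_a}, loss at B: {loss_b}")
print(f"loss at midpoint: {loss_mid} (chord value: {chord_mid})")
print("convexity violated:", loss_mid > chord_mid)
```

With a single linear layer ($\hat{y} = w x$) the same check never fails, since the squared error is then a convex quadratic in $w$.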

# @title Video 5: Overparameterization
from ipywidgets import widgets
from IPython.display import YouTubeVideo
from IPython.display import IFrame
from IPython.display import display


class PlayVideo(IFrame):
  def __init__(self, id, source, page=1, width=400, height=300, **kwargs):
    self.id = id
    if source == 'Bilibili':
      src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'
    elif source == 'Osf':
      src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'
    super(PlayVideo, self).__init__(src, width, height, **kwargs)


def display_videos(video_ids, W=400, H=300, fs=1):
  tab_contents = []
  for i, video_id in enumerate(video_ids):
    out = widgets.Output()
    with out:
      if video_ids[i][0] == 'Youtube':
        video = YouTubeVideo(id=video_ids[i][1], width=W,
                             height=H, fs=fs, rel=0)
        print(f'Video available at https://youtube.com/watch?v={video.id}')
      else:
        video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,
                          height=H, fs=fs, autoplay=False)
        if video_ids[i][0] == 'Bilibili':
          print(f'Video available at https://www.bilibili.com/video/{video.id}')
        elif video_ids[i][0] == 'Osf':
          print(f'Video available at https://osf.io/{video.id}')
      display(video)
    tab_contents.append(out)
  return tab_contents


video_ids = [('Youtube', '7vUpUEKKl5o'), ('Bilibili', 'BV16h41167Jr')]
tab_contents = display_videos(video_ids, W=854, H=480)
tabs = widgets.Tab()
tabs.children = tab_contents
for i in range(len(tab_contents)):
  tabs.set_title(i, video_ids[i][0])
display(tabs)

# @title Submit your feedback
content_review(f"{feedback_prefix}_Overparameterization_Video")

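Take a couple of minutes to play with a more complex 3D visualization of the loss landscape of a neural network on a non-convex problem. Visit https://losslandscape.com/explorer.

Explore the features in the bottom-left corner. Clicking the ( i ) button in the top-right corner shows an explanation of each icon.
Use the 'gradient descent' feature to perform a thought experiment:
- Choose an initialization
- Choose the learning rate
- Formulate a hypothesis about what kind of trajectory you expect to observe
Run the experiment and contrast your intuition with the observed behavior.
Repeat this experiment for several initialization/learning rate configurations.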

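Interactive Demo 5: Overparameterization to the rescue!

As you have seen, the non-convex nature of the surface can cause the optimization process to get stuck in undesirable local optima. There is ample empirical evidence that 'overparameterized' models are easier to train.

We will test this claim in the context of MLP training. We initialize a fixed model and construct several models by adding a small random perturbation to the original initialized weights. We then train each of these perturbed models and observe how the loss evolves. If the problem were convex, we would expect very similar objective values upon convergence, since all the models start very close to each other and, in convex problems, local optima are also global optima.

Use the interactive plot below to visualize the loss progression for these perturbed models:

Select different settings from the hidden_dims drop-down menu
Explore the effect of the number of steps and the learning rate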
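The logic of this experiment can be sketched on one-dimensional toy objectives before running the widget. The code is an illustration under stated assumptions, not the tutorial's `random_update`: the objectives `convex_loss` and `nonconvex_loss`, the fixed list of perturbations (standing in for random noise), and the helper `train` are all hypothetical. It trains from several nearby initializations and reports the spread of the final loss values.

```python
def convex_loss(w):
  return w ** 2


def convex_grad(w):
  return 2 * w


def nonconvex_loss(w):
  # Double-well objective with two local minima of different depth
  return (w ** 2 - 1) ** 2 + 0.3 * w


def nonconvex_grad(w):
  return 4 * w * (w ** 2 - 1) + 0.3


def train(w, grad_fn, lr=0.05, steps=200):
  """Run plain gradient descent from w and return the final parameter."""
  for _ in range(steps):
    w = w - lr * grad_fn(w)
  return w


base_init = 0.0
perturbations = [-0.4, -0.2, 0.2, 0.4]  # fixed stand-ins for random noise

spreads = {}
for name, objective, grad_fn in [("convex", convex_loss, convex_grad),
                                 ("non-convex", nonconvex_loss, nonconvex_grad)]:
  finals = [objective(train(base_init + eps, grad_fn)) for eps in perturbations]
  spreads[name] = max(finals) - min(finals)
  print(f"{name:>10}: spread of final losses = {spreads[name]:.4f}")
```

In the convex case every start reaches the same minimum, so the spread is essentially zero. In the double-well case, starts on either side of the central bump settle in different minima, and the final losses differ noticeably.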

# @markdown Execute this cell to enable the widget!

@widgets.interact_manual
def overparam(max_steps=widgets.IntSlider(150, 50, 500, 5),
              hidden_dims=widgets.Dropdown(options=["10", "20, 20", "100, 100"],
                                           value="10"),
              lr=widgets.FloatLogSlider(value=5e-2, min=-3, max=0, step=0.1),
              num_inits=widgets.IntSlider(7, 5, 10, 1)):
  """
  Displays the overparameterization phenomenon as a widget

  Args:
    max_steps: widget integer slider
      Maximum number of steps on the slider with default = 150
    hidden_dims: widget dropdown menu instance
      Comma-separated sizes of the hidden layers with default = "10"
    lr: widget float slider
      Scalar specifying the learning rate or step-size for the update with default = 5e-2
    num_inits: widget integer slider
      Number of perturbed initializations to train with default = 7

  Returns:
    Nothing
  """

  X, y = train_set.data[subset_index, :], train_set.targets[subset_index]

  hdims = [int(s) for s in hidden_dims.split(',')]
  base_model = MLP(in_dim=784, out_dim=10, hidden_dims=hdims)

  fig, axs = plt.subplots(1, 1, figsize=(5, 4))

  for _ in tqdm(range(num_inits)):
    model = copy.deepcopy(base_model)
    random_update(model, noise_scale=2e-1)
    loss_hist = np.zeros((max_steps, 2))
    for step in range(max_steps):
      loss = loss_fn(model(X), y)
      gradient_update(loss, list(model.parameters()), lr=lr)
      loss_hist[step] = np.array([step, loss.item()])

    plt.plot(loss_hist[:, 0], loss_hist[:, 1])

  plt.xlabel('Iteration')
  plt.ylabel('Loss')
  plt.ylim(0, 3)
  plt.show()

  num_params = sum([np.prod(_.shape) for _ in model.parameters()])
  print('Number of parameters in model:  ' + str(num_params))

# @title Submit your feedback
content_review(f"{feedback_prefix}_Overparameterization_Interactive_Demo")

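Think! 5.1: Width and depth of the network

We see that as we increase the width or depth of the network, training becomes faster and more consistent across different initializations. What might be the reasons for this behavior?
What are some potential downsides of this approach to dealing with non-convexity?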

Click for solution

# @title Submit your feedback
content_review(f"{feedback_prefix}_Width_and_depth_of_the_network_Discussion")

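Section 6: Full gradients are expensive

Time estimate: ~25 mins

So far, we have used only a small, fixed subset of 500 training examples to update the model parameters. But what if we used the whole training set? Would our current approach scale to datasets with tens of thousands, or millions, of data points?

In this section, we explore an efficient alternative that avoids having to perform computations on all the training examples before each parameter update.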

# @title Video 6: Mini-batches
video_ids = [('Youtube', 'hbqUxpNBUGk'), ('Bilibili', 'BV1ty4y1T7Uh')]
tab_contents = display_videos(video_ids, W=854, H=480)
tabs = widgets.Tab()
tabs.children = tab_contents
for i in range(len(tab_contents)):
  tabs.set_title(i, video_ids[i][0])
display(tabs)

# @title Submit your feedback
content_review(f"{feedback_prefix}_Mini_batches_Video")

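Interactive Demo 6.1: Cost of computation

Evaluating a neural network is relatively fast. However, when repeated millions of times, the computational cost of the forward and backward passes becomes significant.

The visualization below shows the time (averaged over 5 runs) taken by a forward and a backward pass as the number of input examples changes. Choose different options from the drop-down box and note how the vertical scale changes with the size of the network.

Remarks: The computational cost of a forward pass shows a clear linear relationship with the number of input examples, and the cost of the corresponding backward pass exhibits a similar computational complexity.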
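The linear trend has a simple counting explanation: every example passes through the same layers, so the work per example is fixed and the total grows in proportion to the number of examples. The sketch below counts multiply-accumulate operations for the linear layers of an MLP. `forward_macs` is a hypothetical helper (not part of the tutorial), and it ignores biases, activations, and hardware effects such as parallelism and caching, which is why measured times are only approximately linear.

```python
def forward_macs(num_points, in_dim, hidden_dims, out_dim):
  """Count multiply-accumulates in the linear layers of an MLP forward pass."""
  dims = [in_dim] + list(hidden_dims) + [out_dim]
  per_example = sum(d_in * d_out for d_in, d_out in zip(dims[:-1], dims[1:]))
  return num_points * per_example


for n in [1, 100, 10000]:
  macs = forward_macs(n, in_dim=784, hidden_dims=[100], out_dim=10)
  print(f"{n:>6} examples -> {macs:,} multiply-accumulates")
```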

# @markdown Execute this cell to enable the widget!

def gradient_update(loss, params, lr=1e-3):
  """
  Perform a gradient descent update on a given loss over a collection of parameters

  Args:
    loss: Tensor
      A scalar tensor containing the loss through which the gradient will be computed
    params: Iterable
      Collection of parameters with respect to which we compute gradients
    lr: Float
      Scalar specifying the learning rate or step-size for the update

  Returns:
    Nothing
  """
  # Clear up gradients as Pytorch automatically accumulates gradients from
  # successive backward calls
  zero_grad(params)

  # Compute gradients on given objective
  loss.backward()

  with torch.no_grad():
    for par in params:
      par.data -= lr * par.grad.data


def measure_update_time(model, num_points):
  """
  Measuring the time for update

  Args:
    model: an nn.Module inherited model
      Represents the ML/DL model
    num_points: integer
      The number of data points in the train_set

  Returns:
    tuple of loss time and time for calculation of gradient
  """
  X, y = train_set.data[:num_points], train_set.targets[:num_points]
  start_time = time.time()
  loss = loss_fn(model(X), y)
  loss_time = time.time()
  gradient_update(loss, list(model.parameters()), lr=0)
  gradient_time = time.time()
  return loss_time - start_time, gradient_time - loss_time


@widgets.interact
def computation_time(hidden_dims=widgets.Dropdown(options=["1", "100", "50, 50"],
                                                  value="100")):
  """
  Demonstrating time taken for computation as a widget

  Args:
    hidden_dims: widgets dropdown
      The number of hidden dimensions with default = 100

  Returns:
    Nothing
  """
  hdims = [int(s) for s in hidden_dims.split(',')]
  model = MLP(in_dim=784, out_dim=10, hidden_dims=hdims)

  NUM_POINTS = [1, 5, 10, 100, 200, 500, 1000, 5000, 10000, 20000, 30000, 50000]
  times_list = []
  for _ in range(5):
    times_list.append(np.array([measure_update_time(model, n) for n in NUM_POINTS]))

  times = np.array(times_list).mean(axis=0)

  fig, axs = plt.subplots(1, 1, figsize=(5,4))
  plt.plot(NUM_POINTS, times[:, 0], label='Forward')
  plt.plot(NUM_POINTS, times[:, 1], label='Backward')
  plt.xlabel('Number of data points')
  plt.ylabel('Seconds')
  plt.legend()

# @title Submit your feedback
content_review(f"{feedback_prefix}_Cost_of_computation_Interactive_Demo")

Coding Exercise 6: Implement minibatch sampling

Complete the code in sample_minibatch so that it produces IID subsets of the training set of the desired size. (This is not a trick question.)

def sample_minibatch(input_data, target_data, num_points=100):
  """
  Sample a minibatch of size num_points from the provided input-target data

  Args:
    input_data: Tensor
      Multi-dimensional tensor containing the input data
    target_data: Tensor
      1D tensor containing the class labels
    num_points: Integer
      Number of elements to be included in minibatch with default=100

  Returns:
    batch_inputs: Tensor
      Minibatch inputs
    batch_targets: Tensor
      Minibatch targets
  """
  #################################################
  ## TODO for students: sample minibatch of data ##
  raise NotImplementedError("Student exercise: implement minibatch sampling")
  #################################################
  # Sample a collection of IID indices from the existing data
  batch_indices = ...
  # Use batch_indices to extract entries from the input and target data tensors
  batch_inputs = input_data[...]
  batch_targets = target_data[...]

  return batch_inputs, batch_targets



## Uncomment to test your function
# x_batch, y_batch = sample_minibatch(X, y, num_points=100)
# print(f"The input shape is {x_batch.shape} and the target shape is: {y_batch.shape}")

The input shape is torch.Size([100, 28, 28]) and the target shape is: torch.Size([100])

Click for solution

# @title Submit your feedback
content_review(f"{feedback_prefix}_Implement_mini_batch_sampling_Exercise")

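Interactive Demo 6.2: Compare different minibatch sizes

What trade-offs does the choice of minibatch size induce? The interactive plot below shows the training evolution of an MLP with 2 hidden layers of 100 units each. Each plot corresponds to a different minibatch size. All cases share a fixed time budget, which is reflected in the horizontal axes.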

# @markdown Execute this cell to enable the widget!

@widgets.interact_manual
def minibatch_experiment(batch_sizes='20, 250, 1000',
                         lrs='5e-3, 5e-3, 5e-3',
                         time_budget=widgets.Dropdown(options=["2.5", "5", "10"],
                                                      value="2.5")):
  """
  Demonstration of minibatch experiment

  Args:
    batch_sizes: String
      Size of minibatches
    lrs: String
      Different learning rates
    time_budget: widget dropdown instance
      Different time budgets with default=2.5s

  Returns:
    Nothing
  """
  batch_sizes = [int(s) for s in batch_sizes.split(',')]
  lrs = [float(s) for s in lrs.split(',')]

  LOSS_HIST = {_:[] for _ in batch_sizes}

  X, y = train_set.data, train_set.targets
  base_model = MLP(in_dim=784, out_dim=10, hidden_dims=[100, 100])

  for idx, batch_size in enumerate(tqdm(batch_sizes)):
    start_time = time.time()
    # Create a new copy of the model for each batch size
    model = copy.deepcopy(base_model)
    params = list(model.parameters())
    lr = lrs[idx]
    # Fixed budget per choice of batch size
    while (time.time() - start_time) < float(time_budget):
      data, labels = sample_minibatch(X, y, batch_size)
      loss = loss_fn(model(data), labels)
      gradient_update(loss, params, lr=lr)
      LOSS_HIST[batch_size].append([time.time() - start_time,
                                    loss.item()])

  fig, axs = plt.subplots(1, len(batch_sizes), figsize=(10, 3))
  for ax, batch_size in zip(axs, batch_sizes):
    plot_data = np.array(LOSS_HIST[batch_size])
    ax.plot(plot_data[:, 0], plot_data[:, 1], label=batch_size,
            alpha=0.8)
    ax.set_title('Batch size: ' + str(batch_size))
    ax.set_xlabel('Seconds')
    ax.set_ylabel('Loss')
  plt.show()

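Remarks: SGD works! With due precautions, we have an algorithm that can be applied to datasets of arbitrary size.

However, note the difference in the vertical scale across the plots above. With a larger minibatch, we can perform fewer parameter updates within the same time budget, because each forward and backward pass is more expensive.

This highlights the interplay between the minibatch size and the learning rate: with a larger minibatch, we have a more confident estimate of the direction to move in, and can therefore afford a larger learning rate. Extremely small minibatches, on the other hand, are fast to compute but are not representative of the data distribution, and they yield gradient estimates with high variance.

We encourage you to tune the learning rate for each minibatch size in the previous demo so that the training loss stays steadily below 0.5 within 5 seconds.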
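The variance claim in the remarks above can be checked empirically on a toy problem. The sketch makes several assumptions that are not in the tutorial: a synthetic one-parameter least-squares problem, a fixed seed, and the hypothetical helpers `minibatch_grad` and `grad_variance`. It draws many IID minibatches (sampling with replacement) and measures how much the gradient estimate fluctuates for a small and a large batch size.

```python
import random
import statistics

rng = random.Random(0)

# Synthetic data for the model y_hat = w * x, with true slope 3 plus noise
xs = [rng.uniform(-1.0, 1.0) for _ in range(5000)]
ys = [3.0 * x + rng.gauss(0.0, 0.5) for x in xs]


def minibatch_grad(w, batch_size):
  """Gradient of the mean squared error, estimated on one IID minibatch."""
  idx = rng.choices(range(len(xs)), k=batch_size)  # sample with replacement
  return sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in idx) / batch_size


def grad_variance(w, batch_size, num_trials=300):
  """Empirical variance of the minibatch gradient estimate at w."""
  estimates = [minibatch_grad(w, batch_size) for _ in range(num_trials)]
  return statistics.variance(estimates)


var_small = grad_variance(w=0.0, batch_size=5)
var_large = grad_variance(w=0.0, batch_size=500)
print(f"variance with batch size   5: {var_small:.4f}")
print(f"variance with batch size 500: {var_large:.6f}")
```

For IID samples, the variance of the averaged gradient scales as one over the batch size, so the larger batch gives a much steadier estimate. Each such estimate also costs 100 times more examples to compute, which is the trade-off the demo exposes.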

# @title Submit your feedback
content_review(f"{feedback_prefix}_Compare_different_minibatch_sizes_Interactive_Demo")

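Section 7: Adaptive methods

Time estimate: ~25 mins

By now, you should be aware that there are many knobs to turn when working on a machine learning problem. Some relate to the optimization algorithm, some to the choice of model, and some to the objective to be minimized. Here are some typical examples:

Problem: loss function, regularization coefficients (Week 1, Day 5)
Model: architecture, activation function
Optimizer: learning rate, batch size, momentum coefficient

Here we concentrate on the choices that relate directly to optimization. In particular, we explore methods for setting the learning rate automatically in a way that addresses poor conditioning and is robust across different problems.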

# @title Video 7: Adaptive Methods
video_ids = [('Youtube', 'Zr6r2kfmQUM'), ('Bilibili', 'BV1eq4y1W7JG')]
tab_contents = display_videos(video_ids, W=854, H=480)
tabs = widgets.Tab()
tabs.children = tab_contents
for i in range(len(tab_contents)):
  tabs.set_title(i, video_ids[i][0])
display(tabs)

# @title Submit your feedback
content_review(f"{feedback_prefix}_Adaptive_Methods_Video")

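Coding Exercise 7: Implement RMSprop

In this exercise you will implement the update of the RMSprop optimizer:

\begin{align}
v_{t} &= \alpha v_{t-1} + (1 - \alpha) \nabla J(w_t)^2 \\
w_{t+1} &= w_t - \eta \frac{\nabla J(w_t)}{\sqrt{v_t + \epsilon}}
\end{align}

Here, the non-standard operations (dividing two vectors, squaring a vector, etc.) are to be interpreted as element-wise operations, i.e., real-number operations performed separately on each entry of the vectors.

The $\epsilon$ hyperparameter provides numerical stability by preventing the effective step size from becoming too large when $v_t$ is small. Typically, $\epsilon$ is set to a small default value such as $10^{-8}$.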

def rmsprop_update(loss, params, grad_sq, lr=1e-3, alpha=0.8, epsilon=1e-8):
  """
  Perform an RMSprop update on a collection of parameters

  Args:
    loss: Tensor
      A scalar tensor containing the loss whose gradient will be computed
    params: Iterable
      Collection of parameters with respect to which we compute gradients
    grad_sq: Iterable
      Moving average of squared gradients
    lr: Float
      Scalar specifying the learning rate or step-size for the update
    alpha: Float
      Moving average parameter
    epsilon: Float
      Small constant added inside the square root for numerical stability

  Returns:
    Nothing
  """
  # Clear up gradients as Pytorch automatically accumulates gradients from
  # successive backward calls
  zero_grad(params)
  # Compute gradients on given objective
  loss.backward()

  with torch.no_grad():
    for (par, gsq) in zip(params, grad_sq):
      #################################################
      ## TODO for students: update the value of the parameter ##
      # Use gsq.data and par.grad
      raise NotImplementedError("Student exercise: implement RMSprop update")
      #################################################
      # Update the moving average of squared gradients
      gsq.data = ...
      # Update parameters
      par.data -=  ...




set_seed(seed=SEED)
model3 = MLP(in_dim=784, out_dim=10, hidden_dims=[])
print('\n The model3 parameters before the update are: \n')
print_params(model3)
loss = loss_fn(model3(X), y)
# Initialize the moving average of squared gradients
grad_sq = [1e-6*i for i in list(model3.parameters())]



## Uncomment below to test your function
# rmsprop_update(loss, list(model3.parameters()), grad_sq=grad_sq, lr=1e-3)
# print('\n The model3 parameters after the update are: \n')
# print_params(model3)

 The model3 parameters after the update are:

main.0.weight tensor([[-0.0240,  0.0031,  0.0193,  ...,  0.0316,  0.0297, -0.0198],
        [-0.0063, -0.0318, -0.0109,  ..., -0.0093,  0.0232, -0.0255],
        [ 0.0218, -0.0253,  0.0320,  ...,  0.0102,  0.0248, -0.0203],
        ...,
        [-0.0027,  0.0136,  0.0089,  ...,  0.0123, -0.0324, -0.0166],
        [ 0.0159,  0.0281,  0.0233,  ..., -0.0133, -0.0197,  0.0182],
        [ 0.0186, -0.0376, -0.0205,  ..., -0.0293,  0.0077, -0.0019]])
main.0.bias tensor([-0.0313, -0.0011,  0.0122, -0.0342,  0.0045,  0.0199,  0.0329,  0.0265,
         0.0182, -0.0041])

Click for solution

# @title Submit your feedback
content_review(f"{feedback_prefix}_Implement_RMSProp_Exercise")

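Interactive Demo 7: Compare optimizers

Below, we compare your implementations of SGD, Momentum, and RMSprop. If you have successfully coded all the exercises so far: congrats!

You now understand some of the most commonly used and most powerful optimization tools in deep learning.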

# @markdown Execute this cell to enable the widget!
X, y = train_set.data, train_set.targets

@widgets.interact_manual
def compare_optimizers(
    batch_size=(25, 250, 5),
    lr=widgets.FloatLogSlider(value=2e-3, min=-5, max=0),
    max_steps=(50, 500, 5)):
  """
  Demonstration to compare optimisers - stochastic gradient descent, momentum, RMSprop

  Args:
    batch_size: Tuple
      (min, max, step) range of the slider for the minibatch size
    lr: Float log slider instance
      Scalar specifying the learning rate or step-size for the update
    max_steps: Tuple
      (min, max, step) range of the slider for the number of optimization steps

  Returns:
    Nothing
  """
  SGD_DICT = [gradient_update, 'SGD', 'black', '-', {'lr': lr}]
  MOM_DICT = [momentum_update, 'Momentum', 'red', '--', {'lr': lr, 'beta': 0.9}]
  RMS_DICT = [rmsprop_update, 'RMSprop', 'fuchsia', '-', {'lr': lr, 'alpha': 0.8}]

  ALL_DICTS = [SGD_DICT, MOM_DICT, RMS_DICT]

  base_model = MLP(in_dim=784, out_dim=10, hidden_dims=[100, 100])

  LOSS_HIST = {}

  for opt_dict in tqdm(ALL_DICTS):
    update_fn, opt_name, color, lstyle, kwargs = opt_dict
    LOSS_HIST[opt_name] = []

    model = copy.deepcopy(base_model)
    params = list(model.parameters())

    if opt_name != 'SGD':
      aux_tensors = [torch.zeros_like(_) for _ in params]

    for step in range(max_steps):
      data, labels = sample_minibatch(X, y, batch_size)
      loss = loss_fn(model(data), labels)
      if opt_name == 'SGD':
        update_fn(loss, params, **kwargs)
      else:
        update_fn(loss, params, aux_tensors, **kwargs)
      LOSS_HIST[opt_name].append(loss.item())

  fig, axs = plt.subplots(1, len(ALL_DICTS), figsize=(9, 3))
  for ax, optim_dict in zip(axs, ALL_DICTS):
    opt_name = optim_dict[1]
    ax.plot(range(max_steps), LOSS_HIST[opt_name], alpha=0.8)
    ax.set_title(opt_name)
    ax.set_xlabel('Iteration')
    ax.set_ylabel('Loss')
    ax.set_ylim(0, 2.5)
  plt.show()

# @title Submit your feedback
content_review(f"{feedback_prefix}_Compare_optimizers_Interactive_Demo")

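Think! 7.1: Compare optimizers

Tune the three methods above (SGD, Momentum, and RMSprop) so that each one excels, and discuss your findings. How do the methods compare in robustness to small changes in the hyperparameters? How easy was it to find a good hyperparameter configuration?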

Click for solution

# @title Submit your feedback
content_review(f"{feedback_prefix}_Compare_optimizers_Discussion")

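Remarks: RMSprop lets us use a 'per-dimension' learning rate without having to tune one learning rate for each dimension ourselves. The method collects information about the variance of the gradients throughout training and uses it to adapt the step size for each parameter automatically. The savings in tuning effort of RMSprop over SGD or plain momentum are clear on this task.

Moreover, adaptive optimization methods remain a highly active research area, with many related algorithms such as Adam, AMSgrad, and Adagrad being used in practice and studied theoretically.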
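The 'per-dimension' effect is easiest to see on a toy problem whose gradient components differ greatly in scale. The following sketch is written in plain Python rather than PyTorch and only illustrates the update equations from Coding Exercise 7. The quadratic objective, the zero initialization of $v_0$, and the helper names `quad_loss`, `quad_grad`, and `rmsprop_step` are assumptions made for the example.

```python
import math

CURVATURES = (1.0, 25.0)  # toy loss: 0.5 * (u**2 + 25 * v**2)


def quad_loss(w):
  return 0.5 * sum(h * x * x for h, x in zip(CURVATURES, w))


def quad_grad(w):
  return [h * x for h, x in zip(CURVATURES, w)]


def rmsprop_step(w, v, lr=1e-2, alpha=0.8, epsilon=1e-8):
  """One element-wise RMSprop update; returns the new (w, v)."""
  g = quad_grad(w)
  # Moving average of squared gradients, one entry per dimension
  v = [alpha * vi + (1 - alpha) * gi ** 2 for vi, gi in zip(v, g)]
  # Each dimension is rescaled by its own sqrt(v_i + epsilon)
  w = [wi - lr * gi / math.sqrt(vi + epsilon) for wi, gi, vi in zip(w, g, v)]
  return w, v


w0 = [2.5, 3.7]
v0 = [0.0, 0.0]
w1, v1 = rmsprop_step(w0, v0)
first_steps = [abs(a - b) for a, b in zip(w0, w1)]
print("gradient magnitudes:", [abs(g) for g in quad_grad(w0)])
print("first-step sizes:   ", first_steps)

w, v = w0, v0
for _ in range(300):
  w, v = rmsprop_step(w, v)
print(f"loss: {quad_loss(w0):.2f} -> {quad_loss(w):.5f}")
```

Both coordinates take a first step of the same size (about $\eta / \sqrt{1 - \alpha}$) even though their gradients differ by a factor of 37. Plain gradient descent with a single learning rate would instead move them in proportion to the raw gradients.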

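Locality of Gradients

As we have seen throughout this tutorial, poor conditioning can be a significant burden on convergence in gradient-based optimization. Of the methods we have seen for dealing with this issue, notice that both momentum and adaptive learning rates incorporate past values of the gradient into their update rules. Why do we use past values of the loss function's gradient when updating the current MLP weights?

Recall from W1D2 that the gradient of a function, $\nabla f(w_t)$, is a local property: it gives the direction of maximum change of $f$ at the point $w_t$. When we train an MLP, however, we hope to find the global optimum of the training loss. By incorporating past gradient values into our optimization schemes, we use more information about the overall shape of the function than a single gradient can provide.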
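To make 'incorporating past gradient values' concrete, the sketch below unrolls a momentum velocity into an explicit weighted sum of all earlier gradients. It assumes the heavy-ball form $v_t = \beta v_{t-1} - \eta g_t$ with $v_0 = 0$, and the gradient history is made up for illustration; `velocity_recursive` and `velocity_unrolled` are hypothetical names.

```python
def velocity_recursive(grads, lr, beta):
  """Velocity after processing grads with v <- beta * v - lr * g, from v = 0."""
  vel = 0.0
  for g in grads:
    vel = beta * vel - lr * g
  return vel


def velocity_unrolled(grads, lr, beta):
  """Same velocity written as an exponentially weighted sum of past gradients."""
  t = len(grads)
  return -lr * sum(beta ** (t - 1 - k) * g for k, g in enumerate(grads))


grads = [4.0, -1.5, 2.0, 0.5, -3.0]  # made-up gradient history
lr, beta = 0.1, 0.9
rec = velocity_recursive(grads, lr, beta)
unrolled = velocity_unrolled(grads, lr, beta)
print("recursive:", rec)
print("unrolled: ", unrolled)
print("weights on past gradients:",
      [round(beta ** (len(grads) - 1 - k), 4) for k in range(len(grads))])
```

The most recent gradient gets weight 1 and each earlier one is discounted by a further factor of $\beta$, so the update direction summarizes a window of the recent trajectory instead of a single local measurement.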

Think! 7.2: Loss function and optimization

Can you think of other ways to incorporate more information about the loss function into our optimization schemes?

Click for solution

# @title Submit your feedback
content_review(f"{feedback_prefix}_Loss_function_and_optimization_Discussion")

Section 8: Ethical concerns

Time estimate: ~15 mins

# @title Video 8: Ethical concerns
video_ids = [('Youtube', '0EthSI0cknI'), ('Bilibili', 'BV1TU4y1G7Je')]
tab_contents = display_videos(video_ids, W=854, H=480)
tabs = widgets.Tab()
tabs.children = tab_contents
for i in range(len(tab_contents)):
  tabs.set_title(i, video_ids[i][0])
display(tabs)

# @title Submit your feedback
content_review(f"{feedback_prefix}_Ethical_concerns_Video")

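Summary

Optimization is necessary to create Deep Learning models that are guaranteed to converge
Stochastic Gradient Descent (SGD) and Momentum are two commonly used optimization techniques
RMSprop is a form of adaptive hyperparameter tuning that uses a per-dimension learning rate
A poor choice of optimization objective can lead to unforeseen, undesirable consequences

If you have time left, you can read the Bonus material, where we put everything together and compare our model with a benchmark model.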

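Bonus: Putting it all together

Time estimate: ~40 mins

We have progressively built a sophisticated optimization algorithm that can deal with a non-convex, poorly conditioned problem involving tens of thousands of training examples. Now we present you with a small challenge: beat us! :P

Your mission is to train an MLP model that can compete with a benchmark model we have pre-trained for you. In this section you can use the full power of PyTorch: loading the data, defining the model, sampling minibatches, and PyTorch's optimizer implementations.

There is a large engineering component behind the design of optimizers, and their implementation can sometimes become tricky. So, unless you are directly doing research in optimization, we recommend using an implementation provided by a widely reviewed open-source library.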

# @title Video 9: Putting it all together
video_ids = [('Youtube', 'DP9c13vLiOM'), ('Bilibili', 'BV1MK4y1u7u2')]
tab_contents = display_videos(video_ids, W=854, H=480)
tabs = widgets.Tab()
tabs.children = tab_contents
for i in range(len(tab_contents)):
  tabs.set_title(i, video_ids[i][0])
display(tabs)

# @title Submit your feedback
content_review(f"{feedback_prefix}_Putting_it_all_together_Bonus_Video")

# @title Download parameters of the benchmark model
import requests

fname = 'benchmark_model.pt'
url = "https://osf.io/sj4e8/download"
r = requests.get(url, allow_redirects=True)
with open(fname, 'wb') as fh:
  fh.write(r.content)

# Load the benchmark model's parameters
DEVICE = set_device()
if DEVICE == "cuda":
  benchmark_state_dict = torch.load(fname)
else:
  benchmark_state_dict = torch.load(fname, map_location=torch.device('cpu'))

# Create MLP object and update weights with those of saved model
benchmark_model = MLP(in_dim=784, out_dim=10,
                      hidden_dims=[200, 100, 50]).to(DEVICE)
benchmark_model.load_state_dict(benchmark_state_dict)


# Define helper function to evaluate models
def eval_model(model, data_loader, num_batches=np.inf, device='cpu'):
  """
  To evaluate a given model

  Args:
    model: nn.Module derived class
      The model which is to be evaluated
    data_loader: Iterable
      A configured dataloading utility
    num_batches: Integer
      Maximum number of minibatches to evaluate (default: all of them)
    device: String
      Sets the device. CUDA if available, CPU otherwise

  Returns:
    Mean of the per-batch losses and mean of the per-batch accuracies
  """

  loss_log, acc_log = [], []
  model.to(device=device)

  # We are just evaluating the model, no need to compute gradients
  with torch.no_grad():
    for batch_id, batch in enumerate(data_loader):
      # If we only evaluate a number of batches, stop after we reach that number
      # (batch_id starts at 0, so use >= to evaluate exactly num_batches batches)
      if batch_id >= num_batches:
        break
      # Extract minibatch data
      data, labels = batch[0].to(device), batch[1].to(device)
      # Evaluate model and loss on minibatch
      preds = model(data)
      loss_log.append(loss_fn(preds, labels).item())
      acc_log.append(torch.mean(1. * (preds.argmax(dim=1) == labels)).item())

  return np.mean(loss_log), np.mean(acc_log)

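We define an optimizer in the following steps:

Load the class that implements the parameter updates and other internal bookkeeping, including:
- creating auxiliary variables,
- updating moving averages,
- adjusting the learning rate.
Pass the parameters of the PyTorch model that the optimizer controls. Different optimizers can control different parameter groups.
Specify hyperparameters such as the learning rate, momentum, and moving-average factors.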
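
The three steps above can be sketched without PyTorch. The class below is a hypothetical toy (`ToyMomentumSGD` is not a PyTorch class, and this is not how `torch.optim` is implemented internally); it only shows where the bookkeeping, the controlled parameters, and the hyperparameters live in an optimizer object.

```python
class ToyMomentumSGD:
    """Hypothetical optimizer sketch in plain Python (illustrative only)."""

    def __init__(self, params, lr=0.1, momentum=0.9):
        # The parameters this optimizer controls (a list that is updated in place)
        self.params = params
        # Hyperparameters chosen by the user
        self.lr = lr
        self.momentum = momentum
        # Internal bookkeeping: one auxiliary "velocity" variable per parameter
        self.velocity = [0.0 for _ in params]

    def step(self, grads):
        for i, g in enumerate(grads):
            # Update the accumulated velocity, then the parameter itself
            self.velocity[i] = self.momentum * self.velocity[i] + g
            self.params[i] -= self.lr * self.velocity[i]


# Minimize f(w) = w0^2 + 10 * w1^2, whose gradient is (2 * w0, 20 * w1)
w = [1.0, 1.0]
opt = ToyMomentumSGD(w, lr=0.02, momentum=0.9)
for _ in range(200):
    opt.step([2 * w[0], 20 * w[1]])
print(w)
```

With a library optimizer the same pattern applies: the training loop only calls a `step` method, and the update rule and its auxiliary state stay inside the class.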

Bonus Exercise: Train your own model

Now train the model with your preferred optimizer and find a good combination of hyperparameter settings.
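
One simple way to organize the search is a small grid over candidate settings. The sketch below is a toy under stated assumptions: `final_loss` is a hypothetical stand-in that runs gradient descent with momentum on f(w) = w^2, and the candidate values are arbitrary. In the exercise itself you would replace it with a run of the training cell and a metric read from `metrics`.

```python
import itertools


def final_loss(lr, momentum, steps=50):
    """Toy stand-in for 'train with these settings, then measure the loss'."""
    w, v = 5.0, 0.0
    for _ in range(steps):
        grad = 2 * w              # gradient of f(w) = w^2
        v = momentum * v + grad   # momentum accumulator
        w -= lr * v
    return w * w


# Try every combination of the candidate learning rates and momentum values
grid = itertools.product([1e-3, 1e-2, 1e-1], [0.0, 0.9])
results = {(lr, m): final_loss(lr, m) for lr, m in grid}
best = min(results, key=results.get)
print(f"Best (lr, momentum): {best} with loss {results[best]:.2e}")
```

A grid grows quickly with the number of hyperparameters, so in practice it helps to start with a coarse grid over the learning rate and refine around the best value.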

#################################################
## TODO for students: adjust training settings ##

# The three parameters below are in your full control
MAX_EPOCHS = 2  # select number of epochs to train
LR = 1e-5  # choose the step size
BATCH_SIZE = 64  # number of examples per minibatch

# Define the model and associated optimizer -- you may change its architecture!
my_model = MLP(in_dim=784, out_dim=10, hidden_dims=[200, 100, 50]).to(DEVICE)

# You can take your pick from many different optimizers
# Check the optimizer documentation and hyperparameter meaning before using!
# More details on Pytorch optimizers: https://pytorch.org/docs/stable/optim.html
# optimizer = torch.optim.SGD(my_model.parameters(), lr=LR, momentum=0.9)
# optimizer = torch.optim.RMSprop(my_model.parameters(), lr=LR, alpha=0.99)
# optimizer = torch.optim.Adagrad(my_model.parameters(), lr=LR)
optimizer = torch.optim.Adam(my_model.parameters(), lr=LR)
#################################################

set_seed(seed=SEED)
# Print training stats every LOG_FREQ minibatches
LOG_FREQ = 200
# Frequency for evaluating the validation metrics
VAL_FREQ = 200
# Load data using a Pytorch Dataset
train_set_orig, test_set_orig = load_mnist_data(change_tensors=False)

# We separate 10,000 training samples to create a validation set
train_set_orig, val_set_orig = torch.utils.data.random_split(train_set_orig, [50000, 10000])

# Create the corresponding DataLoaders for training and test
g_seed = torch.Generator()
g_seed.manual_seed(SEED)

train_loader = torch.utils.data.DataLoader(train_set_orig,
                                           shuffle=True,
                                           batch_size=BATCH_SIZE,
                                           num_workers=2,
                                           worker_init_fn=seed_worker,
                                           generator=g_seed)
val_loader = torch.utils.data.DataLoader(val_set_orig,
                                         shuffle=True,
                                         batch_size=256,
                                         num_workers=2,
                                         worker_init_fn=seed_worker,
                                         generator=g_seed)
test_loader = torch.utils.data.DataLoader(test_set_orig,
                                          batch_size=256,
                                          num_workers=2,
                                          worker_init_fn=seed_worker,
                                          generator=g_seed)

# Run training
metrics = {'train_loss':[],
           'train_acc':[],
           'val_loss':[],
           'val_acc':[],
           'val_idx':[]}

step_idx = 0
for epoch in tqdm(range(MAX_EPOCHS)):

  running_loss, running_acc = 0., 0.

  for batch_id, batch in enumerate(train_loader):
    step_idx += 1
    # Extract minibatch data and labels
    data, labels = batch[0].to(DEVICE), batch[1].to(DEVICE)
    # Just like before, refresh gradient accumulators.
    # Note that this is now a method of the optimizer.
    optimizer.zero_grad()
    # Evaluate model and loss on minibatch
    preds = my_model(data)
    loss = loss_fn(preds, labels)
    acc = torch.mean(1.0 * (preds.argmax(dim=1) == labels))
    # Compute gradients
    loss.backward()
    # Update parameters
    # Note how all the magic in the update of the parameters is encapsulated by
    # the optimizer class.
    optimizer.step()
    # Log metrics for plotting
    metrics['train_loss'].append(loss.cpu().item())
    metrics['train_acc'].append(acc.cpu().item())

    if batch_id % VAL_FREQ == (VAL_FREQ - 1):
      # Get an estimate of the validation accuracy with 100 batches
      val_loss, val_acc = eval_model(my_model, val_loader,
                                     num_batches=100,
                                     device=DEVICE)
      metrics['val_idx'].append(step_idx)
      metrics['val_loss'].append(val_loss)
      metrics['val_acc'].append(val_acc)

      print(f"[VALID] Epoch {epoch + 1} - Batch {batch_id + 1} - "
            f"Loss: {val_loss:.3f} - Acc: {100*val_acc:.3f}%")

    # print statistics
    running_loss += loss.cpu().item()
    running_acc += acc.cpu().item()
    # Print every LOG_FREQ minibatches
    if batch_id % LOG_FREQ == (LOG_FREQ-1):
      print(f"[TRAIN] Epoch {epoch + 1} - Batch {batch_id + 1} - "
            f"Loss: {running_loss / LOG_FREQ:.3f} - "
            f"Acc: {100 * running_acc / LOG_FREQ:.3f}%")

      running_loss, running_acc = 0., 0.

fig, ax = plt.subplots(1, 2, figsize=(10, 4))

ax[0].plot(range(len(metrics['train_loss'])), metrics['train_loss'],
           alpha=0.8, label='Train')
ax[0].plot(metrics['val_idx'], metrics['val_loss'], label='Valid')
ax[0].set_xlabel('Iteration')
ax[0].set_ylabel('Loss')
ax[0].legend()

ax[1].plot(range(len(metrics['train_acc'])), metrics['train_acc'],
           alpha=0.8, label='Train')
ax[1].plot(metrics['val_idx'], metrics['val_acc'], label='Valid')
ax[1].set_xlabel('Iteration')
ax[1].set_ylabel('Accuracy')
ax[1].legend()
plt.tight_layout()
plt.show()

# @title Submit your feedback
content_review(f"{feedback_prefix}_Train_your_own_model_Bonus_Exercise")

Bonus Think!: Metrics

Which metric did you optimize when searching for the right configuration? The training set loss? Accuracy? Validation or test set metrics? Why? Discuss!


# @title Submit your feedback
content_review(f"{feedback_prefix}_Metrics_Bonus_Discussion")

Evaluation

We can now, finally, evaluate and compare the performance of the models on previously unseen examples.

Which model would you keep? (*drum roll*)

print('Your model...')
train_loss, train_accuracy = eval_model(my_model, train_loader, device=DEVICE)
test_loss, test_accuracy = eval_model(my_model, test_loader, device=DEVICE)
print(f'Train Loss {train_loss:.3f} / Test Loss {test_loss:.3f}')
print(f'Train Accuracy {100*train_accuracy:.3f}% / Test Accuracy {100*test_accuracy:.3f}%')

print('\nBenchmark model')
train_loss, train_accuracy = eval_model(benchmark_model, train_loader, device=DEVICE)
test_loss, test_accuracy = eval_model(benchmark_model, test_loader, device=DEVICE)
print(f'Train Loss {train_loss:.3f} / Test Loss {test_loss:.3f}')
print(f'Train Accuracy {100*train_accuracy:.3f}% / Test Accuracy {100*test_accuracy:.3f}%')
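
One caveat when reading these numbers: `eval_model` averages per-batch means, so a smaller final batch counts as much as a full one. The effect is usually tiny, but a size-weighted average removes it. The sketch below assumes a hypothetical helper `weighted_mean` and made-up per-batch accuracies. Only the batch sizes follow from the setup above, where 10,000 test examples are split into batches of 256.

```python
def weighted_mean(batch_means, batch_sizes):
    """Average per-batch metrics, weighting each batch by its size."""
    total = sum(batch_sizes)
    return sum(m * n for m, n in zip(batch_means, batch_sizes)) / total


# 10,000 examples with batch_size=256: 39 full batches plus a last batch of 16
batch_sizes = [256] * 39 + [16]
# Hypothetical per-batch accuracies, with the small last batch scoring higher
batch_accs = [0.9] * 39 + [1.0]

unweighted = sum(batch_accs) / len(batch_accs)
weighted = weighted_mean(batch_accs, batch_sizes)
print(f"Unweighted: {unweighted:.5f} / Weighted: {weighted:.5f}")
```

The weighted value equals the accuracy computed over all examples at once, which makes it the safer number to report when comparing models evaluated with different batch sizes.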
