Tutorial 2: Autoencoder extensions

Bonus Day: Autoencoders

By Neuromatch Academy

Content creators: Marco Brigham and the CCNSS team (2014-2018)

Content reviewers: Itzel Olivos, Karen Schroeder, Karolina Stosio, Kshitij Dwivedi, Spiros Chavlis, Michael Waskom

Production editor: Spiros Chavlis

Tutorial Objectives

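Architecture

How can we improve the internal representation of shallow autoencoders with a 2D bottleneck layer?

We can try the following architectural changes:

Introducing additional hidden layers
Wrapping the latent space as a sphere

Deep ANN autoencoder

Adding hidden layers increases the number of learnable parameters, which lets the network make better use of nonlinear operations in encoding and decoding. The spherical geometry of the latent space encourages the network to use these additional degrees of freedom more efficiently.

Let's dive deeper into the technical aspects of autoencoders and improve their internal representations to reach the level required for the MNIST cognitive task.

In this tutorial, we will:

Increase the capacity of the network by introducing additional hidden layers
Understand the effect of geometric constraints on the latent space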

Setup

# @title Install dependencies

# @title Install and import feedback gadget


from vibecheck import DatatopsContentReviewContainer
def content_review(notebook_section: str):
    return DatatopsContentReviewContainer(
        "",  # No text prompt
        notebook_section,
        {
            "url": "https://pmyvdlilci.execute-api.us-east-1.amazonaws.com/klab",
            "name": "neuromatch_cn",
            "user_key": "y1x3mpx5",
        },
    ).render()


feedback_prefix = "Bonus_Autoencoders_T2"

# Imports
import numpy as np
import matplotlib.pyplot as plt

import torch
from torch import nn, optim

from sklearn.datasets import fetch_openml

# @title Figure settings
import logging
logging.getLogger('matplotlib.font_manager').disabled = True
import plotly.graph_objects as go
from plotly.colors import qualitative
%config InlineBackend.figure_format = 'retina'
plt.style.use("https://raw.githubusercontent.com/NeuromatchAcademy/course-content/NMA2020/nma.mplstyle")

# @title Helper functions


def downloadMNIST():
  """
  Download MNIST dataset and transform it to torch.Tensor

  Args:
    None

  Returns:
    x_train : training images (torch.Tensor) (60000, 28, 28)
    y_train : training labels (torch.Tensor) (60000, )
    x_test  : test images (torch.Tensor) (10000, 28, 28)
    y_test  : test labels (torch.Tensor) (10000, )
  """
  X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
  # Truncate the data to the standard 60000/10000 train/test split
  n_train = 60000
  n_test = 10000

  train_idx = np.arange(0, n_train)
  test_idx = np.arange(n_train, n_train + n_test)

  x_train, y_train = X[train_idx], y[train_idx]
  x_test, y_test = X[test_idx], y[test_idx]

  # Transform np.ndarrays to torch.Tensor
  x_train = torch.from_numpy(np.reshape(x_train,
                                        (len(x_train),
                                         28, 28)).astype(np.float32))
  x_test = torch.from_numpy(np.reshape(x_test,
                                       (len(x_test),
                                        28, 28)).astype(np.float32))

  y_train = torch.from_numpy(y_train.astype(int))
  y_test = torch.from_numpy(y_test.astype(int))

  return (x_train, y_train, x_test, y_test)


def init_weights_kaiming_uniform(layer):
  """
  Initializes weights from linear PyTorch layer
  with kaiming uniform distribution.

  Args:
    layer (nn.Module)
        Pytorch layer

  Returns:
    Nothing.
  """
  # check for linear PyTorch layer
  if isinstance(layer, nn.Linear):
    # initialize weights with kaiming uniform distribution
    nn.init.kaiming_uniform_(layer.weight.data)


def init_weights_kaiming_normal(layer):
  """
  Initializes weights from linear PyTorch layer
  with kaiming normal distribution.

  Args:
    layer (nn.Module)
        Pytorch layer

  Returns:
    Nothing.
  """
  # check for linear PyTorch layer
  if isinstance(layer, nn.Linear):
    # initialize weights with kaiming normal distribution
    nn.init.kaiming_normal_(layer.weight.data)


def get_layer_weights(layer):
  """
  Retrieves learnable parameters from PyTorch layer.

  Args:
    layer (nn.Module)
        Pytorch layer

  Returns:
    list with learnable parameters
  """
  # initialize output list
  weights = []

  # copy numpy array representation of each set of learnable parameters
  # (layers without parameters, e.g. Sigmoid, simply yield an empty list)
  for item in layer.parameters():
    weights.append(item.detach().numpy())

  return weights


def print_parameter_count(net):
  """
  Prints count of learnable parameters per layer from PyTorch network.

  Args:
    net (nn.Sequential)
        Pytorch network

  Returns:
    Nothing.
  """

  params_n = 0

  # loop all layers in network
  for layer_idx, layer in enumerate(net):

    # retrieve learnable parameters
    weights = get_layer_weights(layer)
    params_layer_n = 0

    # loop list of learnable parameters and count them
    for params in weights:
      params_layer_n += params.size

    params_n += params_layer_n
    print(f'{layer_idx}\t {params_layer_n}\t {layer}')

  print(f'\nTotal:\t {params_n}')


def eval_mse(y_pred, y_true):
  """
  Evaluates mean squared error (MSE) between y_pred and y_true

  Args:
    y_pred (torch.Tensor)
        prediction samples

    y_true (torch.Tensor)
        ground truth samples

  Returns:
    MSE(y_pred, y_true)
  """

  with torch.no_grad():
    criterion = nn.MSELoss()
    loss = criterion(y_pred, y_true)

  return float(loss)


def eval_bce(y_pred, y_true):
  """
  Evaluates binary cross-entropy (BCE) between y_pred and y_true

  Args:
    y_pred (torch.Tensor)
        prediction samples

    y_true (torch.Tensor)
        ground truth samples

  Returns:
    BCE(y_pred, y_true)
  """

  with torch.no_grad():
    criterion = nn.BCELoss()
    loss = criterion(y_pred, y_true)

  return float(loss)


def plot_row(images, show_n=10, image_shape=None):
  """
  Plots rows of images from list of iterables (iterables: list, numpy array
  or torch.Tensor). Also accepts single iterable.
  Randomly selects images in each list element if item count > show_n.

  Args:
    images (iterable or list of iterables)
        single iterable with images, or list of iterables

    show_n (integer)
        maximum number of images per row

    image_shape (tuple or list)
        original shape of image if vectorized form

  Returns:
    Nothing.
  """

  if not isinstance(images, (list, tuple)):
    images = [images]

  for items_idx, items in enumerate(images):

    items = np.array(items)
    if items.ndim == 1:
      items = np.expand_dims(items, axis=0)

    if len(items) > show_n:
      selected = np.random.choice(len(items), show_n, replace=False)
      items = items[selected]

    if image_shape is not None:
      items = items.reshape([-1]+list(image_shape))

    plt.figure(figsize=(len(items) * 1.5, 2))
    for image_idx, image in enumerate(items):

      plt.subplot(1, len(items), image_idx + 1)
      plt.imshow(image, cmap='gray', vmin=image.min(), vmax=image.max())
      plt.axis('off')

    plt.tight_layout()


def to_s2(u):
  """
  Projects 3D coordinates to spherical coordinates (theta, phi) surface of
  unit sphere S2.
  theta: [0, pi]
  phi: [-pi, pi]

  Args:
    u (list, numpy array or torch.Tensor of floats)
        3D coordinates

  Returns:
    Spherical coordinates (theta, phi) on surface of unit sphere S2.
  """

  x, y, z = (u[:, 0], u[:, 1], u[:, 2])
  r = np.sqrt(x**2 + y**2 + z**2)
  theta = np.arccos(z / r)
  phi = np.arctan2(x, y)

  return np.array([theta, phi]).T


def to_u3(s):
  """
  Converts 2D spherical coordinates (theta, phi) on the surface of the unit
  sphere S2 to 3D Cartesian coordinates (x, y, z) with radius r = 1.

  Args:
    s (list, numpy array or torch.Tensor of floats)
        2D coordinates on unit sphere S_2

  Returns:
    3D coordinates on surface of unit sphere S_2
  """

  theta, phi = (s[:, 0], s[:, 1])
  x = np.sin(theta) * np.sin(phi)
  y = np.sin(theta) * np.cos(phi)
  z = np.cos(theta)

  return np.array([x, y, z]).T


def xy_lim(x):
  """
  Return arguments for plt.xlim and plt.ylim calculated from minimum
  and maximum of x.

  Args:
    x (list, numpy array or torch.Tensor of floats)
        data to be plotted

  Returns:
    xlim, ylim: [min, max] ranges (with a 5% margin) for the first and
    second columns of x.
  """

  x_min = np.min(x, axis=0)
  x_max = np.max(x, axis=0)

  x_min = x_min - np.abs(x_max - x_min) * 0.05 - np.finfo(float).eps
  x_max = x_max + np.abs(x_max - x_min) * 0.05 + np.finfo(float).eps

  return [x_min[0], x_max[0]], [x_min[1], x_max[1]]


def plot_generative(x, decoder_fn, image_shape, n_row=16, s2=False):
  """
  Plots images reconstructed by decoder_fn from a 2D grid in
  latent space that is determined by minimum and maximum values in x.

  Args:
    x (list, numpy array or torch.Tensor of floats)
        2D or 3D coordinates in latent space

    decoder_fn (callable)
        function returning vectorized images from 2D latent space coordinates

    image_shape (tuple or list)
        original shape of image

    n_row (integer)
        number of rows in grid

    s2 (boolean)
        convert 3D coordinates (x, y, z) to spherical coordinates (theta, phi)

  Returns:
    Nothing.
  """

  if s2:
    x = to_s2(np.array(x))

  xlim, ylim = xy_lim(np.array(x))

  dx = (xlim[1] - xlim[0]) / n_row
  dy = (ylim[1] - ylim[0]) / n_row
  grid = [np.linspace(ylim[0] + dy / 2, ylim[1] - dy / 2, n_row),
          np.linspace(xlim[0] + dx / 2, xlim[1] - dx / 2, n_row)]

  canvas = np.zeros((image_shape[0] * n_row, image_shape[1] * n_row))

  cmap = plt.get_cmap('gray')

  for j, latent_y in enumerate(grid[0][::-1]):
    for i, latent_x in enumerate(grid[1]):

      latent = np.array([[latent_x, latent_y]], dtype=np.float32)

      if s2:
        latent = to_u3(latent)

      with torch.no_grad():
        x_decoded = decoder_fn(torch.from_numpy(latent))

      x_decoded = x_decoded.reshape(image_shape)

      canvas[j * image_shape[0]: (j + 1) * image_shape[0],
             i * image_shape[1]: (i + 1) * image_shape[1]] = x_decoded

  plt.imshow(canvas, cmap=cmap, vmin=canvas.min(), vmax=canvas.max())
  plt.axis('off')


def plot_latent(x, y, show_n=500, s2=False, fontdict=None, xy_labels=None):
  """
  Plots digit class of each sample in 2D latent space coordinates.

  Args:
    x (list, numpy array or torch.Tensor of floats)
        2D coordinates in latent space

    y (list, numpy array or torch.Tensor of floats)
        digit class of each sample

    show_n (integer)
        maximum number of samples to plot

    s2 (boolean)
        convert 3D coordinates (x, y, z) to spherical coordinates (theta, phi)

    fontdict (dictionary)
        style option for plt.text

    xy_labels (list)
        optional list with [xlabel, ylabel]

  Returns:
    Nothing.
  """

  if fontdict is None:
    fontdict = {'weight': 'bold', 'size': 12}

  if s2:
    x = to_s2(np.array(x))

  cmap = plt.get_cmap('tab10')

  if len(x) > show_n:
    selected = np.random.choice(len(x), show_n, replace=False)
    x = x[selected]
    y = y[selected]

  for my_x, my_y in zip(x, y):
    plt.text(my_x[0], my_x[1], str(int(my_y)),
             color=cmap(int(my_y) / 10.),
             fontdict=fontdict,
             horizontalalignment='center',
             verticalalignment='center',
             alpha=0.8)

  xlim, ylim = xy_lim(np.array(x))
  plt.xlim(xlim)
  plt.ylim(ylim)

  if s2:
    if xy_labels is None:
      # to_s2 returns (theta, phi): theta in [0, pi] on x, phi in [-pi, pi] on y
      xy_labels = [r'$\theta$', r'$\varphi$']

    plt.xticks(np.arange(0, np.pi + np.pi / 6, np.pi / 6),
               ['0', r'$\pi/6$', r'$\pi/3$', r'$\pi/2$',
                r'$2\pi/3$', r'$5\pi/6$', r'$\pi$'])
    plt.yticks(np.arange(-np.pi, np.pi + np.pi / 3, np.pi / 3),
               [r'$-\pi$', r'$-2\pi/3$', r'$-\pi/3$', '0',
                r'$\pi/3$', r'$2\pi/3$', r'$\pi$'])

  if xy_labels is None:
    xy_labels = ['$Z_1$', '$Z_2$']

  plt.xlabel(xy_labels[0])
  plt.ylabel(xy_labels[1])


def plot_latent_generative(x, y, decoder_fn, image_shape, s2=False,
                           title=None, xy_labels=None):
  """
  Two horizontal subplots generated with encoder map and decoder grid.

  Args:
    x (list, numpy array or torch.Tensor of floats)
        2D coordinates in latent space

    y (list, numpy array or torch.Tensor of floats)
        digit class of each sample

    decoder_fn (callable)
        function returning vectorized images from 2D latent space coordinates

    image_shape (tuple or list)
        original shape of image

    s2 (boolean)
        convert 3D coordinates (x, y, z) to spherical coordinates (theta, phi)

    title (string)
        plot title

    xy_labels (list)
        optional list with [xlabel, ylabel]

  Returns:
    Nothing.
  """

  fig = plt.figure(figsize=(12, 6))

  if title is not None:
    fig.suptitle(title, y=1.05)

  ax = fig.add_subplot(121)
  ax.set_title('Encoder map', y=1.05)
  plot_latent(x, y, s2=s2, xy_labels=xy_labels)

  ax = fig.add_subplot(122)
  ax.set_title('Decoder grid', y=1.05)
  plot_generative(x, decoder_fn, image_shape, s2=s2)

  plt.tight_layout()
  plt.show()


def plot_latent_3d(my_x, my_y, show_text=True, show_n=500):
  """
  Plot digit class or marker in 3D latent space coordinates.

  Args:
    my_x (list, numpy array or torch.Tensor of floats)
        3D coordinates in latent space

    my_y (list, numpy array or torch.Tensor of floats)
        digit class of each sample

    show_text (boolean)
        whether to show digit labels as text (True) or as markers (False)

    show_n (integer)
        number of randomly selected samples to plot

  Returns:
    Nothing.
  """

  layout = {'margin': {'l': 0, 'r': 0, 'b': 0, 't': 0},
            'scene': {'xaxis': {'showspikes': False,
                                'title': 'z1'},
                      'yaxis': {'showspikes': False,
                                'title': 'z2'},
                      'zaxis': {'showspikes': False,
                                'title': 'z3'}}
            }

  selected_idx = np.random.choice(len(my_x), show_n, replace=False)

  colors = [qualitative.T10[idx] for idx in my_y[selected_idx]]

  x = my_x[selected_idx, 0]
  y = my_x[selected_idx, 1]
  z = my_x[selected_idx, 2]

  text = my_y[selected_idx]

  if show_text:

    trace = go.Scatter3d(x=x, y=y, z=z, text=text,
                         mode='text',
                         textfont={'color': colors, 'size': 12}
                         )

    layout['hovermode'] = False

  else:

    trace = go.Scatter3d(x=x, y=y, z=z, text=text,
                         hoverinfo='text', mode='markers',
                         marker={'size': 5, 'color': colors, 'opacity': 0.8}
                         )

  fig = go.Figure(data=trace, layout=layout)

  fig.show()


def runSGD(net, input_train, input_test, criterion='bce',
           n_epochs=10, batch_size=32, verbose=False):
  """
  Trains autoencoder network with minibatch stochastic gradient descent using
  the Adam optimizer and the chosen loss criterion. Train samples are shuffled
  every epoch, and train/test loss is displayed at the end of each epoch
  (MSE and BCE are also printed at the end if verbose). Plots training loss
  at each minibatch (subsampled to at most ~500 evenly spaced values).

  Args:
    net (torch network)
        ANN object (nn.Module)

    input_train (torch.Tensor)
        vectorized input images from train set

    input_test (torch.Tensor)
        vectorized input images from test set

    criterion (string)
        train loss: 'bce' or 'mse'

    n_epochs (integer)
        number of full iterations of training data

    batch_size (integer)
        number of elements in each minibatch

    verbose (boolean)
        print final loss

  Returns:
    Nothing.
  """

  # Initialize loss function
  if criterion == 'mse':
    loss_fn = nn.MSELoss()
  elif criterion == 'bce':
    loss_fn = nn.BCELoss()
  else:
    raise ValueError('Please specify either "mse" or "bce" for loss criterion')

  # Initialize SGD optimizer
  optimizer = optim.Adam(net.parameters())

  # Placeholder for loss
  track_loss = []

  print('Epoch', '\t', 'Loss train', '\t', 'Loss test')
  for i in range(n_epochs):

    shuffle_idx = np.random.permutation(len(input_train))
    batches = torch.split(input_train[shuffle_idx], batch_size)

    for batch in batches:

      output_train = net(batch)
      loss = loss_fn(output_train, batch)
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

      # Keep track of loss at each minibatch
      track_loss += [float(loss)]

    loss_epoch = f'{i+1}/{n_epochs}'
    with torch.no_grad():
      output_train = net(input_train)
      loss_train = loss_fn(output_train, input_train)
      loss_epoch += f'\t {loss_train:.4f}'

      output_test = net(input_test)
      loss_test = loss_fn(output_test, input_test)
      loss_epoch += f'\t\t {loss_test:.4f}'

    print(loss_epoch)

  if verbose:
    # Print loss
    loss_mse = f'\nMSE\t {eval_mse(output_train, input_train):0.4f}'
    loss_mse += f'\t\t {eval_mse(output_test, input_test):0.4f}'
    print(loss_mse)

    loss_bce = f'BCE\t {eval_bce(output_train, input_train):0.4f}'
    loss_bce += f'\t\t {eval_bce(output_test, input_test):0.4f}'
    print(loss_bce)

  # Plot loss
  step = int(np.ceil(len(track_loss) / 500))
  x_range = np.arange(0, len(track_loss), step)
  plt.figure()
  plt.plot(x_range, track_loss[::step], 'C0')
  plt.xlabel('Iterations')
  plt.ylabel('Loss')
  plt.xlim([0, None])
  plt.ylim([0, None])
  plt.show()


class NormalizeLayer(nn.Module):
  """
  pyTorch layer (nn.Module) that normalizes activations by their L2 norm.

  Args:
      None.

  Returns:
      Object inherited from nn.Module class.
  """

  def __init__(self):
    super().__init__()

  def forward(self, x):
    return nn.functional.normalize(x, p=2, dim=1)

Section 0: Introduction

# @title Video 1: Extensions
from ipywidgets import widgets
from IPython.display import YouTubeVideo
from IPython.display import IFrame
from IPython.display import display


class PlayVideo(IFrame):
  def __init__(self, id, source, page=1, width=400, height=300, **kwargs):
    self.id = id
    if source == 'Bilibili':
      src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'
    elif source == 'Osf':
      src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'
    super(PlayVideo, self).__init__(src, width, height, **kwargs)


def display_videos(video_ids, W=400, H=300, fs=1):
  tab_contents = []
  for i, video_id in enumerate(video_ids):
    out = widgets.Output()
    with out:
      if video_ids[i][0] == 'Youtube':
        video = YouTubeVideo(id=video_ids[i][1], width=W,
                             height=H, fs=fs, rel=0)
        print(f'Video available at https://youtube.com/watch?v={video.id}')
      else:
        video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,
                          height=H, fs=fs, autoplay=False)
        if video_ids[i][0] == 'Bilibili':
          print(f'Video available at https://www.bilibili.com/video/{video.id}')
        elif video_ids[i][0] == 'Osf':
          print(f'Video available at https://osf.io/{video.id}')
      display(video)
    tab_contents.append(out)
  return tab_contents


video_ids = [('Youtube', 'pgkrU9UqXiU'), ('Bilibili', 'BV175411a7j9')]
tab_contents = display_videos(video_ids, W=854, H=480)
tabs = widgets.Tab()
tabs.children = tab_contents
for i in range(len(tab_contents)):
  tabs.set_title(i, video_ids[i][0])
display(tabs)

# @title Submit your feedback
content_review(f"{feedback_prefix}_Extensions_Video")

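Section 1: Download and prepare the MNIST dataset

We use the helper function downloadMNIST to download the dataset and convert it to torch.Tensor, and assign the train and test sets to (x_train, y_train) and (x_test, y_test). Pixel values are then scaled to [0, 1] by dividing by 255, as required by the binary cross-entropy loss used for training.

The variable input_size stores the length of the vectorized versions of the train and test images.

Instructions:

Please execute the cell below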

# Download MNIST
x_train, y_train, x_test, y_test = downloadMNIST()

x_train = x_train / 255
x_test = x_test / 255

image_shape = x_train.shape[1:]

input_size = np.prod(image_shape)

input_train = x_train.reshape([-1, input_size])
input_test = x_test.reshape([-1, input_size])

test_selected_idx = np.random.choice(len(x_test), 10, replace=False)
train_selected_idx = np.random.choice(len(x_train), 10, replace=False)

print(f'shape image \t \t {image_shape}')
print(f'shape input_train \t {input_train.shape}')
print(f'shape input_test \t {input_test.shape}')

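Section 2: Deeper autoencoder (2D)

The internal representation of a shallow autoencoder with a 2D latent space is similar to that of PCA, which suggests that the autoencoder is not fully exploiting its nonlinear capabilities. Increasing the number of learnable parameters lets the network exploit nonlinear operations in encoding and decoding and capture nonlinear patterns in the data.

Adding hidden layers introduces additional parameters, either layer-wise or depth-wise. The same number $N$ of additional parameters can be added in a single layer or distributed among several layers. Adding several hidden layers also reduces the compression/decompression ratio of each individual layer.

Coding Exercise 1: Build a deeper autoencoder (2D)

Implement this deeper version of the ANN autoencoder by adding four hidden layers. The number of units per layer in the encoder is:

784 -> 392 -> 64 -> 2

The shallow autoencoder has a compression ratio of 784:2 = 392:1. The first additional hidden layer has a compression ratio of 2:1, and the following hidden layer sets the compression ratio of the bottleneck to 32:1 (64:2).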
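The ratios above can be checked with a few lines of Python. This minimal sketch only does arithmetic on the encoder sizes 784 -> 392 -> 64 -> 2 from the exercise (the variable names are illustrative, not part of the tutorial code): the overall input-to-bottleneck ratio stays at 392:1, but it is now spread over three layers instead of one.

```python
# Encoder layer sizes from the exercise (784 -> 392 -> 64 -> 2)
encoder_sizes = [784, 392, 64, 2]

# Per-layer compression: size of each layer divided by the size of the next one
per_layer = [n_in / n_out
             for n_in, n_out in zip(encoder_sizes[:-1], encoder_sizes[1:])]

# Overall input-to-bottleneck ratio, identical to the shallow 784 -> 2 autoencoder
overall = encoder_sizes[0] / encoder_sizes[-1]

print(per_layer)  # [2.0, 6.125, 32.0]
print(overall)    # 392.0
```

The product of the per-layer ratios (2 x 6.125 x 32) recovers the overall 392:1.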

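The choice of hidden layer sizes aims to reduce the compression ratio of the bottleneck layer while increasing the number of learnable parameters. For example, if the compression ratio of the first hidden layer doubles from 2:1 to 4:1, the number of learnable parameters halves from about 667K to about 334K.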
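As a rough check of the parameter counts quoted above, the sketch below counts parameters analytically, assuming the exercise layout: a mirrored encoder/decoder, an nn.PReLU (one learnable slope) after every nn.Linear except the last, and a Sigmoid output. The helper names linear_params and autoencoder_params are hypothetical; the tutorial's own print_parameter_count inspects an actual model instead.

```python
def linear_params(n_in, n_out):
    # nn.Linear stores an (n_out x n_in) weight matrix plus n_out biases
    return n_in * n_out + n_out


def autoencoder_params(encoder_sizes):
    # Mirror the encoder to get the decoder: [784, 392, 64, 2] -> [..., 2, 64, 392, 784]
    sizes = encoder_sizes + encoder_sizes[-2::-1]
    n_linear = sum(linear_params(a, b) for a, b in zip(sizes[:-1], sizes[1:]))
    # One PReLU per Linear layer except the last one (which feeds the Sigmoid)
    n_prelu = len(sizes) - 2
    return n_linear + n_prelu


print(autoencoder_params([784, 392, 64, 2]))  # 666791
print(autoencoder_params([784, 196, 64, 2]))  # 333983
```

The total roughly halves (666,791 vs 333,983, i.e. about 667K vs 334K) because the two largest weight matrices, input to first hidden layer and its mirror in the decoder, dominate the count.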

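The performance of this deep autoencoder may improve further by adding more hidden layers and increasing the number of learnable parameters per layer. However, the gains diminish as parameters and depth grow, because the network becomes harder to train. The bonus section explores adding a first hidden layer with 2x to 3x the input size, which brings the parameter count into the millions at the cost of longer training.

Weight initialization is particularly important in deep networks. The availability of large datasets and good weight initialization are believed to have driven the deep learning revolution of 2010. We implement Kaiming normal initialization as follows: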

model[:-2].apply(init_weights_kaiming_normal)
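To see what init_weights_kaiming_normal does to a single layer, this sketch applies nn.init.kaiming_normal_ with its default arguments (a=0, mode='fan_in', nonlinearity='leaky_relu') to a 784 -> 392 layer and compares the sample standard deviation of the weights with the theoretical value sqrt(2 / fan_in). The layer size is just an example taken from the exercise; biases keep PyTorch's default initialization, as in the helper.

```python
import math

import torch
from torch import nn

torch.manual_seed(0)

# Example layer with the size of the first encoder layer in the exercise
layer = nn.Linear(784, 392)

# Same call as init_weights_kaiming_normal; with the defaults a=0,
# mode='fan_in', nonlinearity='leaky_relu' the gain is sqrt(2 / (1 + a**2)) = sqrt(2)
nn.init.kaiming_normal_(layer.weight.data)

fan_in = layer.weight.shape[1]                      # 784 inputs feed each output unit
expected_std = math.sqrt(2.0) / math.sqrt(fan_in)   # about 0.0505
measured_std = layer.weight.std().item()

print(f'expected std {expected_std:.4f}, measured std {measured_std:.4f}')
```

Note that model[:-2] leaves out the final nn.Linear and the nn.Sigmoid, so the output layer keeps PyTorch's default initialization.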

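Instructions:

Add four hidden layers with their activation functions to the network
Adjust the definitions of encoder and decoder
Check the learnable parameter count of this autoencoder by running the last cell (e.g. with print_parameter_count(model))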

encoding_size = 2
#####################################################################
## TODO for students: add layers to build deeper autoencoder
raise NotImplementedError("Complete the sequential model")
#####################################################################
model = nn.Sequential(
    nn.Linear(input_size, int(input_size / 2)),
    nn.PReLU(),
    nn.Linear(int(input_size / 2), encoding_size * 32),
    # Add activation function
    # ...,
    # Add another layer
    # nn.Linear(..., ...),
    # Add activation function
    # ...,
    # Add another layer
    # nn.Linear(..., ...),
    # Add activation function
    # ...,
    # Add another layer
    # nn.Linear(..., ...),
    # Add activation function
    # ...,
    # Add another layer
    # nn.Linear(..., ...),
    # Add activation function
    # ....
    )

model[:-2].apply(init_weights_kaiming_normal)

print(f'Autoencoder \n\n {model}\n')

# Adjust the value n_l to split your model correctly
n_l = ...

# split the model into encoder and decoder once n_l is set
encoder = model[:n_l]
decoder = model[n_l:]
print(f'Encoder \n\n {encoder}\n')
print(f'Decoder \n\n {decoder}')
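The exercise splits the network with model[:n_l] and model[n_l:] before training, yet later cells use encoder and decoder after model has been trained. This works because slicing an nn.Sequential returns a new container that holds the same layer objects. The sketch below illustrates this on a hypothetical 4 -> 2 -> 4 toy model (not the exercise solution).

```python
import torch
from torch import nn

# Hypothetical toy autoencoder, only for illustration: 4 -> 2 -> 4
toy = nn.Sequential(
    nn.Linear(4, 2),   # index 0: encoder layer
    nn.PReLU(),        # index 1: encoder activation
    nn.Linear(2, 4),   # index 2: decoder layer
    nn.Sigmoid(),      # index 3: decoder output
)

split = 2  # first index that belongs to the decoder
toy_encoder = toy[:split]
toy_decoder = toy[split:]

# The slices reference the same modules, so training `toy` updates both halves
print(toy_encoder[0] is toy[0])  # True

x = torch.rand(5, 4)
with torch.no_grad():
    # Encoder followed by decoder reproduces the full model's output
    same = torch.allclose(toy_decoder(toy_encoder(x)), toy(x))
print(same)  # True
```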


# @title Submit your feedback
content_review(f"{feedback_prefix}_Build_deeper_autoencoder_Exercise")

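Helper function: print_parameter_count

Please uncomment the line below to inspect this function.

# help(print_parameter_count)

Train the autoencoder

Train the network for n_epochs=10 epochs with batch_size=128, and observe how the internal representation now successfully captures additional digit classes.

The encoder map shows well-separated clusters that correspond to the associated digits in the decoder grid. The decoder grid also shows that the network is robust to digit slant, i.e., digits leaning to the left or to the right.

Instructions:

Please execute the cell below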

n_epochs = 10
batch_size = 128

runSGD(model, input_train, input_test, n_epochs=n_epochs,
       batch_size=batch_size)

with torch.no_grad():
  output_test = model(input_test)
  latent_test = encoder(input_test)

plot_row([input_test[test_selected_idx], output_test[test_selected_idx]],
         image_shape=image_shape)

plot_latent_generative(latent_test, y_test, decoder, image_shape=image_shape)

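Section 3: Spherical latent space

In the previous architecture, the representations typically spread out in different directions from the coordinate $(z_1, z_2)=(0,0)$. This happens because the weights are initialized randomly around 0.

Adding a third unit to the bottleneck layer defines a coordinate $(z_1, z_2, z_3)$ in 3D space. The latent space of such a network also spreads out from $(z_1, z_2, z_3)=(0, 0, 0)$.

Collapsing the latent space onto the surface of a sphere removes the possibility of spreading indefinitely away from the origin $(0, 0, 0)$, because on a sphere, moving far enough in any direction eventually leads back to the starting point. This constraint produces a representation that fills the surface of the sphere.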

Unit sphere $S_2$

The projection onto the sphere is implemented by dividing the coordinates $(z_1, z_2, z_3)$ by their $L_2$ norm.

\begin{equation}
(z_1, z_2, z_3)\longmapsto (s_1, s_2, s_3)=\frac{(z_1, z_2, z_3)}{\|(z_1, z_2, z_3)\|_2}=\frac{(z_1, z_2, z_3)}{\sqrt{z_1^2+z_2^2+z_3^2}}
\end{equation}

This mapping projects onto the surface of the sphere $S_2$ with unit radius. (Why?)
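One way to answer the question: the $L_2$ norm is absolutely homogeneous, $\|\lambda z\|_2 = |\lambda|\,\|z\|_2$, so choosing $\lambda = 1/\|z\|_2$ gives $\|s\|_2 = 1$ for every nonzero $z$. The sketch below checks this numerically with nn.functional.normalize, the call used inside NormalizeLayer, on random 3D activations of very different magnitudes (the random data is purely illustrative). By default PyTorch divides by max(norm, eps) with eps=1e-12, so a vector extremely close to the origin would not land exactly on the sphere.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Illustrative 3D "bottleneck activations" with magnitudes from 0.01 to 100
scales = torch.logspace(-2, 2, 1000).unsqueeze(1)
z = torch.randn(1000, 3) * scales

# Same operation as NormalizeLayer.forward: divide each row by its L2 norm
s = nn.functional.normalize(z, p=2, dim=1)

norms = s.norm(dim=1)                                 # close to 1 regardless of scale
cos = nn.functional.cosine_similarity(s, z, dim=1)    # close to 1: direction unchanged

print(f'norm range: [{norms.min().item():.6f}, {norms.max().item():.6f}]')
print(f'min cosine similarity: {cos.min().item():.6f}')
```

Only the direction of each latent vector survives the projection, which is why the representation can no longer drift away from the origin.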

Section 3.1: Build and train an autoencoder (3D)

We add one unit to the bottleneck layer and visualize the latent space in 3D.

Please execute the cell below.

encoding_size = 3

model = nn.Sequential(
    nn.Linear(input_size, int(input_size / 2)),
    nn.PReLU(),
    nn.Linear(int(input_size / 2), encoding_size * 32),
    nn.PReLU(),
    nn.Linear(encoding_size * 32, encoding_size),
    nn.PReLU(),
    nn.Linear(encoding_size, encoding_size * 32),
    nn.PReLU(),
    nn.Linear(encoding_size * 32, int(input_size / 2)),
    nn.PReLU(),
    nn.Linear(int(input_size / 2), input_size),
    nn.Sigmoid()
    )

model[:-2].apply(init_weights_kaiming_normal)

encoder = model[:6]
decoder = model[6:]

print(f'Autoencoder \n\n {model}')

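Section 3.2: Train the autoencoder

Train the network for n_epochs=10 epochs with batch_size=128. Observe how the internal representation spreads out from the origin, and how the loss drops substantially thanks to the additional degree of freedom in the bottleneck layer.

Instructions:

Please execute the cell below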

n_epochs = 10
batch_size = 128

runSGD(model, input_train, input_test, n_epochs=n_epochs,
       batch_size=batch_size)

Section 3.3: Visualize the latent space in 3D

Helper function: plot_latent_3d

Please uncomment the line below to inspect this function.

# help(plot_latent_3d)

with torch.no_grad():
  latent_test = encoder(input_test)

plot_latent_3d(latent_test, y_test)

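Coding Exercise 2: Build a deep autoencoder (2D) with a latent spherical space

We now constrain the latent space to the surface of the sphere $S_2$.

Instructions:

Add the custom layer NormalizeLayer after the bottleneck layer
Adjust the definitions of encoder and decoder
Experiment with the keyword argument show_text=False of plot_latent_3d

Helper function: NormalizeLayer

Please uncomment the line below to inspect this function.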

# help(NormalizeLayer)

encoding_size = 3
#####################################################################
## TODO for students: add custom normalize layer
raise NotImplementedError("Complete the sequential model")
#####################################################################
model = nn.Sequential(
    nn.Linear(input_size, int(input_size / 2)),
    nn.PReLU(),
    nn.Linear(int(input_size / 2), encoding_size * 32),
    nn.PReLU(),
    nn.Linear(encoding_size * 32, encoding_size),
    nn.PReLU(),
    # add the normalization layer
    # ...,
    nn.Linear(encoding_size, encoding_size * 32),
    nn.PReLU(),
    nn.Linear(encoding_size * 32, int(input_size / 2)),
    nn.PReLU(),
    nn.Linear(int(input_size / 2), input_size),
    nn.Sigmoid()
    )

model[:-2].apply(init_weights_kaiming_normal)

print(f'Autoencoder \n\n {model}\n')

# Adjust the value n_l to split your model correctly
n_l = ...

encoder = model[:n_l]
decoder = model[n_l:]
print(f'Encoder \n\n {encoder}\n')
print(f'Decoder \n\n {decoder}')


# @title Submit your feedback
content_review(f"{feedback_prefix}_Deep_autoencoder_with_latent_spherical_space_Exercise")

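Section 3.4: Train the autoencoder

Train the network for n_epochs=10 epochs with batch_size=128, and observe how the loss rises again to a level comparable to the model with a 2D latent space.

Instructions:

Please execute the cell below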

n_epochs = 10
batch_size = 128

runSGD(model, input_train, input_test, n_epochs=n_epochs,
       batch_size=batch_size)

with torch.no_grad():
  latent_test = encoder(input_test)

plot_latent_3d(latent_test, y_test)

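Section 3.5: Visualize the latent space on the surface of $S_2$

The 3D coordinates $(s_1, s_2, s_3)$ on the surface of the unit sphere $S_2$ can be converted to spherical coordinates $(r, \theta, \phi)$ as follows: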

\begin{align}
r &= \sqrt{s_1^2 + s_2^2 + s_3^2} \\
\phi &= \arctan \frac{s_2}{s_1} \\
\theta &= \arccos\frac{s_3}{r}
\end{align}

Spherical coordinates

What is the numerical domain of the spherical coordinates $(\theta, \phi)$?
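A numerical sketch of the domains, assuming uniformly sampled points on $S_2$: $\theta = \arccos(s_3/r)$ lies in $[0, \pi]$, and once the quadrant is taken into account with arctan2, $\phi$ lies in $(-\pi, \pi]$. The plain $\arctan(s_2/s_1)$ from the formula above only returns values in $(-\pi/2, \pi/2)$ because it cannot distinguish $(s_1, s_2)$ from $(-s_1, -s_2)$. Note that the tutorial helper to_s2 uses np.arctan2(x, y), which measures $\phi$ from the $s_2$ axis instead; the range is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: normalized Gaussian samples are uniform on the unit sphere S2
u = rng.normal(size=(10000, 3))
s = u / np.linalg.norm(u, axis=1, keepdims=True)
s1, s2, s3 = s[:, 0], s[:, 1], s[:, 2]

r = np.sqrt(s1**2 + s2**2 + s3**2)             # equals 1 up to rounding
theta = np.arccos(np.clip(s3 / r, -1.0, 1.0))  # polar angle, clipped against rounding
phi = np.arctan2(s2, s1)                       # azimuth using the quadrant of (s1, s2)

# Plain arctan of the ratio folds opposite directions onto each other
phi_naive = np.arctan(s2 / s1)

print(f'theta in [{theta.min():.3f}, {theta.max():.3f}]')
print(f'phi in [{phi.min():.3f}, {phi.max():.3f}]')
print(f'arctan(s2/s1) in [{phi_naive.min():.3f}, {phi_naive.max():.3f}]')
```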

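Since the angles $(\theta, \phi)$ only capture the degrees of freedom on the surface of the sphere, we are back to a 2D representation. Adding the keyword argument s2=True to plot_latent_generative unfolds the sphere and displays it like a map.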
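The map view relies on the helpers to_s2 and to_u3 being inverses on the unit sphere: plot_latent converts encoder outputs to $(\theta, \phi)$, and plot_generative builds a grid in $(\theta, \phi)$ and folds it back onto the sphere with to_u3 before calling the decoder. This sketch re-implements both helpers in condensed form and checks the round trip on random stand-in latent vectors (not actual encoder outputs).

```python
import numpy as np


def to_s2(u):
    # Condensed copy of the tutorial helper: 3D point -> (theta, phi)
    x, y, z = u[:, 0], u[:, 1], u[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)
    return np.array([np.arccos(z / r), np.arctan2(x, y)]).T


def to_u3(s):
    # Condensed copy of the tutorial helper: (theta, phi) -> 3D point on S2
    theta, phi = s[:, 0], s[:, 1]
    return np.array([np.sin(theta) * np.sin(phi),
                     np.sin(theta) * np.cos(phi),
                     np.cos(theta)]).T


rng = np.random.default_rng(1)
latent = rng.normal(size=(500, 3))                      # stand-in for encoder outputs
unit = latent / np.linalg.norm(latent, axis=1, keepdims=True)

angles = to_s2(latent)   # 2D "map" coordinates
back = to_u3(angles)     # fold the map back onto the sphere
print(np.allclose(back, unit))  # True

# Any (theta, phi) grid, like the decoder grid, lands on the unit sphere
grid = np.array([[t, p] for t in np.linspace(0, np.pi, 4)
                 for p in np.linspace(-np.pi, np.pi, 4)])
print(np.allclose(np.linalg.norm(to_u3(grid), axis=1), 1.0))  # True
```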

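Exercise: Check the numerical ranges of the plot axes to identify $\theta$ and $\phi$, and visualize the unfolding of the 3D plot from the previous exercise.

Instructions:

Please execute the cell below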

with torch.no_grad():
  output_test = model(input_test)

plot_row([input_test[test_selected_idx], output_test[test_selected_idx]],
         image_shape=image_shape)

plot_latent_generative(latent_test, y_test, decoder,
                       image_shape=image_shape, s2=True)

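Summary

We learned two techniques to improve representation capacity: adding a few hidden layers and projecting the latent space onto the sphere $S_2$.

The expressive power of the autoencoder improves with additional hidden layers. Projecting the latent space onto the surface of $S_2$ spreads the digit classes out more visually, but does not necessarily yield a lower loss.

The deep autoencoder architecture has rich internal representations that can handle sophisticated tasks such as the MNIST cognitive task.

We now have powerful tools to explore how simple algorithms build robust models of the world by capturing relevant data patterns.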

# @title Video 2: Wrap-up

video_ids = [('Youtube', 'GnkmzCqEK3E'), ('Bilibili', 'BV1ED4y1U7DK')]
tab_contents = display_videos(video_ids, W=854, H=480)
tabs = widgets.Tab()
tabs.children = tab_contents
for i in range(len(tab_contents)):
  tabs.set_title(i, video_ids[i][0])
display(tabs)

# @title Submit your feedback
content_review(f"{feedback_prefix}_WrapUp_Video")

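Bonus

Deep and wide autoencoder

In this exercise, the first hidden layer expands to twice the input size and is followed by a compression to half the input size, leading to about 3.8M parameters. Please do not train this network during the tutorial because of its long training time.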
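The ~3.8M figure can be checked without training by building the same stack of layers and summing parameter.numel(). This sketch assumes the bonus layout (784 -> 1568 -> 392 -> 96 -> 3 and the mirrored decoder, PReLU activations, Sigmoid output) and leaves out NormalizeLayer, which has no learnable parameters; it only allocates the weights, roughly 15 MB in float32.

```python
from torch import nn

input_size, encoding_size = 784, 3

# Layer sizes of the bonus network: encoder, then the mirrored decoder
sizes = [input_size, input_size * 2, input_size // 2,
         encoding_size * 32, encoding_size]
sizes = sizes + sizes[-2::-1]

layers = []
for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    layers += [nn.Linear(n_in, n_out), nn.PReLU()]
layers[-1] = nn.Sigmoid()  # output activation replaces the last PReLU
# NormalizeLayer has no learnable parameters, so it is omitted from this count

net = nn.Sequential(*layers)
n_params = sum(p.numel() for p in net.parameters())
print(n_params)  # 3768682
```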

Instructions:

Please uncomment the cell below and run it

# encoding_size = 3

# model = nn.Sequential(
#     nn.Linear(input_size, int(input_size * 2)),
#     nn.PReLU(),
#     nn.Linear(int(input_size * 2), int(input_size / 2)),
#     nn.PReLU(),
#     nn.Linear(int(input_size / 2), encoding_size * 32),
#     nn.PReLU(),
#     nn.Linear(encoding_size * 32, encoding_size),
#     nn.PReLU(),
#     NormalizeLayer(),
#     nn.Linear(encoding_size, encoding_size * 32),
#     nn.PReLU(),
#     nn.Linear(encoding_size * 32, int(input_size / 2)),
#     nn.PReLU(),
#     nn.Linear(int(input_size / 2), int(input_size * 2)),
#     nn.PReLU(),
#     nn.Linear(int(input_size * 2), input_size),
#     nn.Sigmoid()
#     )

# model[:-2].apply(init_weights_kaiming_normal)

# encoder = model[:9]
# decoder = model[9:]

# print_parameter_count(model)

# n_epochs = 5
# batch_size = 128

# runSGD(model, input_train, input_test, n_epochs=n_epochs,
#        batch_size=batch_size)

# Visualization
# with torch.no_grad():
#   output_test = model(input_test)
#   latent_test = encoder(input_test)

# plot_row([input_test[test_selected_idx], output_test[test_selected_idx]],
#          image_shape=image_shape)

# plot_latent_generative(latent_test, y_test, decoder,
#                        image_shape=image_shape, s2=True)
