チュートリアル 2: MLEによる線形回帰

第1週、第2日目: モデルフィッティング

Neuromatch Academyによる

コンテンツ作成者: Pierre-Étienne Fiquet, Anqi Wu, Alex Hyafil （Byron Galbraithの協力を得て）

コンテンツレビュアー: Lina Teichmann, Madineh Sarvestani, Patrick Mineault, Ella Batty, Michael Waskomlis

制作編集者: Spiros Chavlis

チュートリアルの目的

チュートリアルの推定所要時間: 30分

これはデータにモデルをフィットさせるシリーズのチュートリアル2です。まずは単純な線形回帰を、最小二乗法（チュートリアル1）と最尤推定（MLE、チュートリアル2）を使って学びます。推定した線形モデルのパラメータの信頼区間を構築するためにブートストラップ法を使います（チュートリアル3）。回帰モデルの探求は、多変量線形回帰や多項式回帰への一般化（チュートリアル4）で締めくくります。最後に、これらのモデルの中から選択する方法を学びます。バイアス・バリアンストレードオフ（チュートリアル5）やモデル選択のための交差検証（チュートリアル6）について議論します。

このチュートリアルでは、データのランダムな「ノイズ」を組み込んだ別のアプローチで線形モデルをフィットさせます。

確率分布と確率モデルについて学ぶ
モデルパラメータの尤度を計算する方法を学ぶ
最尤推定量を実装し、尤度が最大となるモデルパラメータを見つける方法を学ぶ

# @title Tutorial slides
# @markdown These are the slides for the videos in all tutorials today
from IPython.display import IFrame
link_id = "2mkq4"
print(f"If you want to download the slides: https://osf.io/download/{link_id}/")
IFrame(src=f"https://mfr.ca-1.osf.io/render?url=https://osf.io/{link_id}/?direct%26mode=render%26action=download%26mode=render", width=854, height=480)

セットアップ

# @title Install and import feedback gadget


from vibecheck import DatatopsContentReviewContainer
def content_review(notebook_section: str):
    return DatatopsContentReviewContainer(
        "",  # No text prompt
        notebook_section,
        {
            "url": "https://pmyvdlilci.execute-api.us-east-1.amazonaws.com/klab",
            "name": "neuromatch_cn",
            "user_key": "y1x3mpx5",
        },
    ).render()


feedback_prefix = "W1D2_T2"

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# @title Figure Settings
import logging
logging.getLogger('matplotlib.font_manager').disabled = True

import ipywidgets as widgets  # interactive display
%config InlineBackend.figure_format = 'retina'
plt.style.use("https://raw.githubusercontent.com/NeuromatchAcademy/course-content/main/nma.mplstyle")

# @title Plotting Functions
def plot_density_image(x, y, theta, sigma=1, ax=None):
  """ Plots probability distribution of y given x, theta, and sigma

  Args:

    x (ndarray): An array of shape (samples,) that contains the input values.
    y (ndarray): An array of shape (samples,) that contains the corresponding
      measurement values to the inputs.
    theta (float): Slope parameter
    sigma (float): standard deviation of Gaussian noise

  """

  # plot the probability density of p(y|x,theta)
  if ax is None:
    fig, ax = plt.subplots()

  xmin, xmax = np.floor(np.min(x)), np.ceil(np.max(x))
  ymin, ymax = np.floor(np.min(y)), np.ceil(np.max(y))
  xx = np.linspace(xmin, xmax, 50)
  yy = np.linspace(ymin, ymax, 50)

  surface = np.zeros((len(yy), len(xx)))
  for i, x_i in enumerate(xx):
    surface[:, i] = stats.norm(theta * x_i, sigma).pdf(yy)

  ax.set(xlabel='x', ylabel='y')

  return ax.imshow(surface, origin='lower', aspect='auto', vmin=0, vmax=None,
            cmap=plt.get_cmap('Wistia'),
            extent=[xmin, xmax, ymin, ymax])

セクション1: 最尤推定（MLE）

# @title Video 1: Maximum Likelihood Estimation
from ipywidgets import widgets
from IPython.display import YouTubeVideo
from IPython.display import IFrame
from IPython.display import display


class PlayVideo(IFrame):
  def __init__(self, id, source, page=1, width=400, height=300, **kwargs):
    self.id = id
    if source == 'Bilibili':
      src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'
    elif source == 'Osf':
      src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'
    super(PlayVideo, self).__init__(src, width, height, **kwargs)


def display_videos(video_ids, W=400, H=300, fs=1):
  tab_contents = []
  for i, video_id in enumerate(video_ids):
    out = widgets.Output()
    with out:
      if video_ids[i][0] == 'Youtube':
        video = YouTubeVideo(id=video_ids[i][1], width=W,
                             height=H, fs=fs, rel=0)
        print(f'Video available at https://youtube.com/watch?v={video.id}')
      else:
        video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,
                          height=H, fs=fs, autoplay=False)
        if video_ids[i][0] == 'Bilibili':
          print(f'Video available at https://www.bilibili.com/video/{video.id}')
        elif video_ids[i][0] == 'Osf':
          print(f'Video available at https://osf.io/{video.id}')
      display(video)
    tab_contents.append(out)
  return tab_contents


video_ids = [('Youtube', '8mpNmzLKNfU'), ('Bilibili', 'BV1Mg4y1i7WH')]
tab_contents = display_videos(video_ids, W=854, H=480)
tabs = widgets.Tab()
tabs.children = tab_contents
for i in range(len(tab_contents)):
  tabs.set_title(i, video_ids[i][0])
display(tabs)

# @title Submit your feedback
content_review(f"{feedback_prefix}_Maximum_Likelihood_Estimation_Video")

このビデオでは、1次元線形回帰の文脈での最尤推定（MLE）について説明します。

セクション1.1: ガウスノイズ

ビデオの関連部分のテキスト要約はこちらをクリック

前回のチュートリアルでは、データがノイズを含む線形関係から生成されていると仮定し、平均二乗誤差を最小化することでモデルパラメータを推定する効果的な方法を見つけました。

その際、ノイズは単なる邪魔者として扱いましたが、もしノイズをモデルに直接組み込んだらどうなるでしょうか？

線形モデルを思い出してください:

\begin{align}
y = $\theta x + \epsilon$ .
\end{align}

ノイズ成分 $\epsilon$ はしばしばガウス分布（正規分布）からのランダム変数としてモデル化されます。

ガウス分布はその確率密度関数（pdf）で表されます。
\begin{align}
$\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2\sigma^2}(x-\mu)^2} \end{align}$

この分布は平均 $\mu$ と分散 $\sigma^2$ の2つのパラメータに依存します。ノイズ信号は平均0、分散1のガウス「ホワイトノイズ」とみなすことが多いです：

\begin{align}
$\epsilon \sim \mathcal{N}(0, 1)$ .
\end{align}

インタラクティブデモ 1.1: ガウス分布エクスプローラー

以下のエクスプローラーウィジェットを使って、 $\mu$ と $\sigma$ のパラメータを変化させるとサンプルの位置や形状がどう変わるかを見てみましょう。

$\mu$ を変えるとpdfにどんな影響がありますか？
$\sigma$ を変えるとpdfにどんな影響がありますか？

# @markdown Make sure you execute this cell to enable the widget!

@widgets.interact(mu=widgets.FloatSlider(0.0, min=-2.0, max=2.0),
                  sigma=widgets.FloatSlider(1.0, min=0.5, max=2.0))
def plot_normal_dist(mu=0, sigma=1):

  # Generate pdf & samples from normal distribution with mu/sigma
  rv = stats.norm(mu, sigma)
  x = np.linspace(-5, 5, 100)
  y = rv.pdf(x)
  samples = rv.rvs(1000)

  # Plot
  fig, ax = plt.subplots()
  ax.hist(samples, 20, density=True, color='g', histtype='stepfilled',
          alpha=0.8, label='histogram')
  ax.plot(x, y, color='orange', linewidth=3, label='pdf')
  ax.vlines(mu, 0, rv.pdf(mu), color='y', linewidth=3, label='$\mu$')
  ax.vlines([mu-sigma, mu+sigma], 0, rv.pdf([mu-sigma, mu+sigma]), colors='red',
            color='b', linewidth=3, label='$\sigma$')
  ax.set(xlabel='x', ylabel='probability density', xlim=[-5, 5], ylim=[0, 1.0])
  ax.legend()
  plt.show()

解答を見る$

# @title Submit your feedback
content_review(f"{feedback_prefix}_Gaussian_Distribution_Explorer_Interactive_Demo_and_Discussion")

セクション1.2: 確率モデル

チュートリアル開始からここまでの推定所要時間: 11分

ノイズ成分 $\epsilon$ をランダム変数としてモデル化したので、これを元の線形モデルにどう組み込むか考えましょう。ノイズは平均0、分散1のガウス分布に従うとします： $\epsilon \sim \mathcal{N}(0, 1)$ 。すると、 $y$ も平均 $\mu = \theta x$ 、分散 $\sigma^2 = 1$ のガウス分布からのランダム変数として扱えます：

y \sim \mathcal{N}(\theta x, 1)

これは、パラメータ $\theta$ と入力 $x$ が与えられたときに、 $y$ が観測される確率が

p(y|x,\theta) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(y-\theta x)^2}

であることを意味します。

この節および次の節では、単一のデータ点（ $x$ と $y$ のペア）に注目するため、添字 $i$ は省略しています（つまり $x_i$ ではなく $x$ を使います）。

さて、真の基礎モデルが $\theta = 1.2$ である元のサンプルデータセットを再度見てみましょう。

# @markdown Execute this cell to generate some simulated data

# setting a fixed seed to our random number generator ensures we will always
# get the same psuedorandom number sequence

np.random.seed(121)
theta = 1.2
n_samples = 30
x = 10 * np.random.rand(n_samples) # sample from a uniform distribution over [0,10)
noise = np.random.randn(n_samples) # sample from a standard normal distribution
y = theta * x + noise

今回は、 $p(y|x,\theta=1.2)$ の密度をプロットし、異なる $x$ の値で $p(y)$ がどう変わるかを見てみます。

# @markdown Execute this cell to visualize p(y|x, theta=1.2)

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 5))

# Invokes helper function to generate density image plots from data and parameters
im = plot_density_image(x, y, 1.2, ax=ax1)
plt.colorbar(im, ax=ax1)
ax1.axvline(8, color='k')
ax1.set(title=r'p(y | x, $\theta$=1.2)')

# Plot pdf for given x
ylim = ax1.get_ylim()
yy = np.linspace(ylim[0], ylim[1], 50)
ax2.plot(yy, stats.norm(theta * 8, 1).pdf(yy), color='orange', linewidth=2)
ax2.set(
    title=r'p(y|x=8, $\theta$=1.2)',
    xlabel='y',
    ylabel='probability density')
plt.show()

セクション1.3: 尤度推定

チュートリアル開始からここまでの推定所要時間: 15分

確率モデルができたので、次はデータに合う良い $\theta$ の推定値を見つける課題に戻ります。確率の不確実性を考慮して、推定値 $\hat{\theta}$ がデータに適合する尤度を考えます。尤度関数 $\mathcal{L}(\theta)$ は、その $\theta$ でパラメータ化された確率密度関数に等しいです：

\mathcal{L}(\theta|x,y) = p(y|x,\theta) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2\sigma^2}(y-\theta x)^2}

コーディング演習 1.3: 尤度関数

この演習では、 $\sigma = 1$ の線形モデルに対する尤度関数 $\mathcal{L}(\theta|x, y)$ を実装します。

この関数を実装したら、推定値 $\hat{\theta}$ が与えられた観測値を生成した確率を計算できます。データセットのサンプルの一つで試してみましょう。

ヒント: 指数関数と平方根にはそれぞれ np.exp と np.sqrt を使いましょう。

def likelihood(theta_hat, x, y):
  """The likelihood function for a linear model with noise sampled from a
    Gaussian distribution with zero mean and unit variance.

  Args:
    theta_hat (float): An estimate of the slope parameter.
    x (ndarray): An array of shape (samples,) that contains the input values.
    y (ndarray): An array of shape (samples,) that contains the corresponding
      measurement values to the inputs.

  Returns:
    ndarray: the likelihood values for the theta_hat estimate
  """
  sigma = 1
  ##############################################################################
  ## TODO for students: implement the likelihood function
  # Fill out function and remove
  raise NotImplementedError("Student exercise: implement the likelihood function")
  ##############################################################################

  # Compute Gaussian likelihood
  pdf = ...

  return pdf


print(likelihood(1.0, x[1], y[1]))

サンプル出力

0.11344443599846923

解答を見る$

# @title Submit your feedback
content_review(f"{feedback_prefix}_Likelihood_function_Exercise")

$\mathcal{L}(\theta=1.0|x=2.1,y=3.7) \approx 0.11$ となるはずです。ここまでは良いですが、これが他の推定値より良いことをどう示せるでしょうか？

データ点の集合を扱う場合、我々はそれらの結合確率、すなわち全てのデータ点がパラメータで説明される尤度を考えます。ノイズが各出力に独立に影響すると仮定したので、尤度は分解可能で、次のように書けます：

\mathcal{L}(\theta|\mathbf{x}, \mathbf{y}) = \prod_{i=1}^N \mathcal{L}(\theta|x_i,y_i),

ここで、 $N$ 個のデータ点 $\mathbf{x} = [x_1,...,x_N]$ と $\mathbf{y} = [y_1,...,y_N]$ があります。

実際には、この積は数値的に不安定になることがあります。小さな値を掛け合わせるとアンダーフローが起きるためです。この問題は尤度の対数を取ることで回避できます。対数は積を和に変換するためです：

\log\mathcal{L}(\theta|\mathbf{x}, \mathbf{y}) = \sum_{i=1}^N \log\mathcal{L}(\theta|x_i,y_i)

likelihood メソッドの出力の対数をデータセット全体に適用して和を取ることで、異なる $\hat{\theta}$ の比較がしやすくなります。また、異なる分布密度をデータセット上でプロットして、定性的に比較することもできます。

# @markdown Execute this cell to visualize different distribution densities
theta_hats = [0.5, 1.0, 2.2]
fig, axes = plt.subplots(ncols=3, figsize=(15, 5))
for theta_hat, ax in zip(theta_hats, axes):
  ll = np.sum(np.log(likelihood(theta_hat, x, y)))  # log likelihood
  im = plot_density_image(x, y, theta_hat, ax=ax)
  ax.scatter(x, y)
  ax.set(title=fr'$\hat{{\theta}}$ = {theta_hat}, log likelihood: {ll:.2f}')
plt.colorbar(im, ax=ax)
plt.show()

対数尤度の計算を使うと、 $\mathcal{L}(\theta=1.0) > \mathcal{L}(\theta=0.5) > \mathcal{L}(\theta=2.2)$ となります。

これは良いニュースです。これで尤度に基づいて推定器を比較する方法ができました。しかし、MSEアプローチと同様に、最良の推定器を解析的に求めたいです。この場合は尤度を最大化する推定器を見つけます。

セクション1.4: 最尤推定量の求め方

チュートリアル開始からここまでの推定所要時間: 23分

ビデオの関連部分のテキスト要約はこちらをクリック

データセットを最も尤もらしくするパラメータ値 $\hat\theta$ を見つけたい：

\begin{align}
$\hat{\theta}_{\textrm{MLE}} = \underset{\theta}{\operatorname{argmax}} \mathcal{L}(\theta|X,Y) \end{align}$

対数尤度を取ることで数値安定性が向上しますが、最大化するパラメータ値は変わりません。なぜなら、 $\log()$ 関数は単調増加関数であり、入力の順序を保つからです。したがって：

\begin{align}
$\hat{\theta}_{\textrm{MLE}} = \underset{\theta}{\operatorname{argmax}} \sum_{i=1}^m \textrm{log} \mathcal{L}(\theta|x_i,y_i) \end{align}$

具体的な尤度関数を代入し対数を取ると：
\begin{align}
$\hat{\theta}_{\textrm{MLE}} = \underset{\theta}{\operatorname{argmax}}$ [- $\frac{N}{2} \operatorname{log} 2\pi\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^N (y_i-\theta x_i)^2$ ].
\end{align}

対数尤度の最大化は負の対数尤度の最小化と同じです（実際の最適化ルーチンは最小化問題を解くことが多いため）。この目的関数は凸関数なので、負の対数尤度の微分を0にして解くことができます。これはMSE最小化の解と非常によく似ています。

\begin{align}
$\frac{\partial\operatorname{log}\mathcal{L}(\theta|x,y)}{\partial\theta}=\frac{1}{\sigma^2}\sum_{i=1}^N(y_i-\theta x_i)x_i = 0 \end{align}$

これはMSE推定量の最適解とまったく同じ式であり、実際に同じ解に到達します！

\begin{align}
$\hat{\theta}_{\textrm{MLE}} = \hat{\theta}_{\textrm{MSE}} = \frac{\sum_{i=1}^N x_i y_i}{\sum_{i=1}^N x_i^2} \end{align}$

# Compute theta_hat_MLE
theta_hat_mle = (x @ y) / (x @ x)

# @markdown Execute this cell to visualize density with theta_hat_mle

# Plot the resulting distribution density
fig, ax = plt.subplots()
ll = np.sum(np.log(likelihood(theta_hat_mle, x, y))) # log likelihood
im = plot_density_image(x, y, theta_hat_mle, ax=ax)
plt.colorbar(im, ax=ax)
ax.scatter(x, y)
ax.set(title=fr'$\hat{{\theta}}$ = {theta_hat_mle:.2f}, log likelihood: {ll:.2f}')
plt.show()

まとめ

チュートリアルの推定所要時間: 30分

尤度と確率の違い：

$\mathcal{L}(\theta|x, y) = p(y|x, \theta)$
$p(y|x, \theta) \rightarrow$ 「パラメータ $\theta$ と入力 $x$ が与えられたときに応答 $y$ が観測される確率」
$\mathcal{L}(\theta|x, y) \rightarrow$ 「パラメータ $\theta$ が入力 $x$ から応答 $y$ を生成した尤度モデル」

対数尤度の最大化：

計算上の都合で尤度関数の対数を取る
$\log\mathcal{L}(\theta|x, y)$ を最大化するパラメータ $\theta$ が、データを最もよく説明するモデルパラメータ

重要ポイント: 対数尤度は柔軟なコスト関数であり、データに最も適合するモデルパラメータを見つけるためによく使われる。

記法

\begin{align}
x & $\quad \text{入力、独立変数}$ \
y & $\quad \text{応答測定値、従属変数}$ \
$\mathbf{x}$ & $\quad \text{入力値のベクトル}$ \
$\mathbf{y}$ & $\quad \text{測定値のベクトル}$ \
$\epsilon$ & $\quad \text{測定誤差、ノイズ成分}$ \
$\epsilon \sim \mathcal{N}(\mu, \sigma^2)$ & $\quad \text{確率変数 } \epsilon \text{ は平均 } \mu \text{、分散 } \sigma^2 \text{ のガウス分布に従う}$ \
$\mu$ & $\quad \text{平均}$ \
$\sigma^2$ & $\quad \text{分散}$ \
$\sigma$ & $\quad \text{標準偏差}$ \
$\theta$ & $\quad \text{パラメータ}$ \
$\hat{\theta}$ & $\quad \text{パラメータの推定値}$ \
$\mathcal{L}(\theta|x, y)$ & $\quad \text{パラメータ } \theta \text{ が入力 } x \text{ から応答 } y \text{ を生成する尤度}$ \
p(y|x, $\theta)$ & $\quad \text{入力 } x \text{ とパラメータ } \theta \text{ が与えられたときの応答 } y \text{ の観測確率}$ \
\end{align}

ボーナス

$\mathrm{p}(\mathrm{y} | \mathrm{x}, \theta)$ を $x$ の関数として見ることもできます。これは刺激の尤度関数であり、観測された応答 $y$ から入力 $x$ をデコードしたい場合に有用です。これは外界にアクセスできず、他のニューロンの応答から外の世界を推測しようとするニューロンの視点から重要な考え方です！