ボーナスチュートリアル：多言語埋め込み

第3週、第1日目：時系列と自然言語処理

Neuromatch Academyによる

コンテンツ作成者： Alish Dipani, Kelson Shilling-Scrivo, Lyle Ungar

コンテンツレビュアー： Kelson Shilling-Scrivo

コンテンツ編集者： Kelson Shilling-Scrivo

制作編集者： Gagana B, Spiros Chavlis

元コンテンツ提供者：Anushree Hede, Pooja Consul, Ann-Katrin Reuel

チュートリアルの目的

RNNがシーケンスのモデリングに優れている理由を探る前に、シーケンスをモデリングし、テキストをエンコードし、そのようなエンコーディングや埋め込みを使って意味のある測定を行う他の方法をいくつか見ていきます。

# @title Tutorial slides
from IPython.display import IFrame
link_id = "n263c"
print(f"If you want to download the slides: https://osf.io/download/{link_id}/")
IFrame(src=f"https://mfr.ca-1.osf.io/render?url=https://osf.io/{link_id}/?direct%26mode=render%26action=download%26mode=render", width=854, height=480)

セットアップ

# @title Install dependencies
# @markdown There may be *errors* and/or *warnings* reported during the installation. However, they are to be ignored.

# @title Install and import feedback gadget


from vibecheck import DatatopsContentReviewContainer
def content_review(notebook_section: str):
    return DatatopsContentReviewContainer(
        "",  # No text prompt
        notebook_section,
        {
            "url": "https://pmyvdlilci.execute-api.us-east-1.amazonaws.com/klab",
            "name": "neuromatch_dl",
            "user_key": "f379rz8y",
        },
    ).render()


feedback_prefix = "W3D1_T3_Bonus"

# @title Install fastText
# @markdown If you want to see the original code, go to repo: https://github.com/facebookresearch/fastText.git


import os, zipfile, requests

url = "https://osf.io/vkuz7/download"
fname = "fastText-main.zip"

print('Downloading Started...')
# Downloading the file by sending the request to the URL
r = requests.get(url, stream=True)

# Writing the file to the local file system
with open(fname, 'wb') as f:
  f.write(r.content)
print('Downloading Completed.')

# opening the zip file in READ mode
with zipfile.ZipFile(fname, 'r') as zipObj:
  # extracting all the files
  print('Extracting all the files now...')
  zipObj.extractall()
  print('Done!')
  os.remove(fname)

# Install the package

# Imports
import fasttext
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

# @title Figure Settings
import logging
logging.getLogger('matplotlib.font_manager').disabled = True

import ipywidgets as widgets
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("https://raw.githubusercontent.com/NeuromatchAcademy/content-creation/main/nma.mplstyle")

# @title Helper functions
def cosine_similarity(vec_a, vec_b):
  """Compute cosine similarity between vec_a and vec_b"""
  return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))


def getSimilarity(word1, word2):
  v1 = ft_en_vectors.get_word_vector(word1)
  v2 = ft_en_vectors.get_word_vector(word2)
  return cosine_similarity(v1, v2)

# @title Set random seed

# @markdown Executing `set_seed(seed=seed)` you are setting the seed

# For DL its critical to set the random seed so that students can have a
# baseline to compare their results to expected results.
# Read more here: https://pytorch.org/docs/stable/notes/randomness.html

# Call `set_seed` function in the exercises to ensure reproducibility.
import random
import torch

def set_seed(seed=None, seed_torch=True):
  """
  Function that controls randomness.
  NumPy and random modules must be imported.

  Args:
    seed : Integer
      A non-negative integer that defines the random state. Default is `None`.
    seed_torch : Boolean
      If `True` sets the random seed for pytorch tensors, so pytorch module
      must be imported. Default is `True`.

  Returns:
    Nothing.
  """
  if seed is None:
    seed = np.random.choice(2 ** 32)
  random.seed(seed)
  np.random.seed(seed)
  if seed_torch:
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

  print(f'Random seed {seed} has been set.')

# In case that `DataLoader` is used
def seed_worker(worker_id):
  """
  DataLoader will reseed workers following randomness in
  multi-process data loading algorithm.

  Args:
    worker_id: integer
      ID of subprocess to seed. 0 means that
      the data will be loaded in the main process
      Refer: https://pytorch.org/docs/stable/data.html#data-loading-randomness for more details

  Returns:
    Nothing
  """
  worker_seed = torch.initial_seed() % 2**32
  np.random.seed(worker_seed)
  random.seed(worker_seed)

# @title Set device (GPU or CPU). Execute `set_device()`

# Inform the user if the notebook uses GPU or CPU.

def set_device():
  """
  Set the device. CUDA if available, CPU otherwise

  Args:
    None

  Returns:
    Nothing
  """
  device = "cuda" if torch.cuda.is_available() else "cpu"
  if device != "cuda":
    print("WARNING: For this notebook to perform best, "
        "if possible, in the menu under `Runtime` -> "
        "`Change runtime type.`  select `GPU` ")
  else:
    print("GPU is enabled in this notebook.")

  return device

DEVICE = set_device()
SEED = 2021
set_seed(seed=SEED)

セクション1：多言語埋め込み

従来、単語埋め込みは言語ごとに特化しており、各言語の埋め込みは別々に学習され、全く異なるベクトル空間に存在していました。しかし、もし異なる言語間で単語を比較したい場合はどうでしょうか？例えば、英語とスペイン語のコーパスを使ってテキスト分類器を作成したいとします。

ここでは、fastTextが提供する多言語単語埋め込みを使用します。詳細はこちらをご覧ください。

多言語埋め込みの学習

まず、fastTextとFacebookおよびWikipediaのデータを組み合わせて、各言語ごとに別々の埋め込みを学習します。次に、両言語間で共通する単語の辞書を見つけます。この辞書は並列データ（同じ意味を持つ2言語の文のペアからなるデータセット）から自動的に生成されます。

その後、与えられた言語間で埋め込みを共通の空間に射影する行列を見つけます。この行列は、単語 $x_i$ とその射影 $y_i$ の距離を最小化するように設計されています。辞書がペア $(x_i, y_i)$ からなる場合、射影行列 $M$ は次のようになります：

M = \underset{W}{\operatorname{argmax}} \sum_i ||x_i - Wy_i||^2

また、射影行列 $W$ は直交行列に制約されているため、単語埋め込みベクトル間の実際の距離が保たれます。多言語モデルは、DeepTextの基盤表現として多言語単語埋め込みを用い、それらを「固定」または学習中に変更しないことで訓練されます。

これを理解したら、異なる言語で上記の演習を再現してみてください！

# @markdown ### Download FastText English Embeddings of dimension 100
# @markdown This will take 1-2 minutes to run

import os, zipfile, requests

url = "https://osf.io/2frqg/download"
fname = "cc.en.100.bin.gz"

print('Downloading Started...')
# Downloading the file by sending the request to the URL
r = requests.get(url, stream=True)

# Writing the file to the local file system
with open(fname, 'wb') as f:
  f.write(r.content)
print('Downloading Completed.')

# opening the zip file in READ mode
with zipfile.ZipFile(fname, 'r') as zipObj:
  # extracting all the files
  print('Extracting all the files now...')
  zipObj.extractall()
  print('Done!')
  os.remove(fname)

# Load 100 dimension FastText Vectors using FastText library
ft_en_vectors = fasttext.load_model('cc.en.100.bin')

# @markdown ### Download FastText French Embeddings of dimension 100

# @markdown **Note:** This cell might take 2-4 minutes to run

import os, zipfile, requests

url = "https://osf.io/rqadk/download"
fname = "cc.fr.100.bin.gz"

print('Downloading Started...')
# Downloading the file by sending the request to the URL
r = requests.get(url, stream=True)

# Writing the file to the local file system
with open(fname, 'wb') as f:
  f.write(r.content)
print('Downloading Completed.')

# opening the zip file in READ mode
with zipfile.ZipFile(fname, 'r') as zipObj:
  # extracting all the files
  print('Extracting all the files now...')
  zipObj.extractall()
  print('Done!')
  os.remove(fname)

# Load 100 dimension FastText Vectors using FastText library
french = fasttext.load_model('cc.fr.100.bin')

まず、異なる言語間で同じベクトル空間に射影せずにコサイン類似度を見てみましょう。ご覧の通り、同じ単語でも他の言語ではコサイン類似度が0付近に近く、類似も非類似も示していません。

hello = ft_en_vectors.get_word_vector('hello')
hi = ft_en_vectors.get_word_vector('hi')
bonjour = french.get_word_vector('bonjour')

print(f"Cosine Similarity between HI and HELLO: {cosine_similarity(hello, hi)}")
print(f"Cosine Similarity between BONJOUR and HELLO: {cosine_similarity(hello, bonjour)}")

cat = ft_en_vectors.get_word_vector('cat')
chatte = french.get_word_vector('chatte')
chat = french.get_word_vector('chat')

print(f"Cosine Similarity between cat and chatte: {cosine_similarity(cat, chatte)}")
print(f"Cosine Similarity between cat and chat: {cosine_similarity(cat, chat)}")
print(f"Cosine Similarity between chatte and chat: {cosine_similarity(chatte, chat)}")

まず、英語とフランス語で共通する単語のリストを定義しましょう。これを使って学習用の行列を作成します。

en_words = set(ft_en_vectors.words)
fr_words = set(french.words)
overlap = list(en_words & fr_words)
bilingual_dictionary = [(entry, entry) for entry in overlap]

いくつかの関数を定義して作業を簡単にします。make_training_matricesは、ソース言語の単語、ターゲット言語の単語、共通単語のセットを受け取り、両言語間の共通単語すべての単語埋め込みの行列を作成します。これが学習用行列です。

次に、learn_transformation関数はこれらの行列を受け取り、正規化し、SVDを実行してソース言語をターゲット言語に整列させ、変換行列を返します。

def make_training_matrices(source_dictionary, target_dictionary,
                           bilingual_dictionary):
  source_matrix = []
  target_matrix = []
  for (source, target) in tqdm(bilingual_dictionary):
    # if source in source_dictionary.words and target in target_dictionary.words:
    source_matrix.append(source_dictionary.get_word_vector(source))
    target_matrix.append(target_dictionary.get_word_vector(target))
  # return training matrices
  return np.array(source_matrix), np.array(target_matrix)


# from https://stackoverflow.com/questions/21030391/how-to-normalize-array-numpy
def normalized(a, axis=-1, order=2):
  """Utility function to normalize the rows of a numpy array."""
  l2 = np.atleast_1d(np.linalg.norm(a, order, axis))
  l2[l2==0] = 1
  return a / np.expand_dims(l2, axis)


def learn_transformation(source_matrix, target_matrix, normalize_vectors=True):
  """
  Source and target matrices are numpy arrays, shape
  (dictionary_length, embedding_dimension). These contain paired
  word vectors from the bilingual dictionary.
  """
  # optionally normalize the training vectors
  if normalize_vectors:
    source_matrix = normalized(source_matrix)
    target_matrix = normalized(target_matrix)
  # perform the SVD
  product = np.matmul(source_matrix.transpose(), target_matrix)
  U, s, V = np.linalg.svd(product)
  # return orthogonal transformation which aligns source language to the target
  return np.matmul(U, V)

では、これらをすべて組み合わせてみましょう！

source_training_matrix, target_training_matrix = make_training_matrices(ft_en_vectors, french, bilingual_dictionary)

transform = learn_transformation(source_training_matrix, target_training_matrix)

上記と同じ例を実行しますが、今回はフランス語の単語を使う際に、埋め込みに変換行列の転置を掛けます。これでずっと良くなります！

hello = ft_en_vectors.get_word_vector('hello')
hi = ft_en_vectors.get_word_vector('hi')
bonjour = np.matmul(french.get_word_vector('bonjour'), transform.T)

print(f"Cosine Similarity between HI and HELLO: {cosine_similarity(hello, hi)}")
print(f"Cosine Similarity between BONJOUR and HELLO: {cosine_similarity(hello, bonjour)}")

cat = ft_en_vectors.get_word_vector('cat')
chatte = np.matmul(french.get_word_vector('chatte'), transform.T)
chat = np.matmul(french.get_word_vector('chat'), transform.T)

print(f"Cosine Similarity between cat and chatte: {cosine_similarity(cat, chatte)}")
print(f"Cosine Similarity between cat and chat: {cosine_similarity(cat, chat)}")
print(f"Cosine Similarity between chatte and chat: {cosine_similarity(chatte, chat)}")

今度はあなたの例をいくつか試してみてください。チュートリアル1のセクション2.1で見た例を英語とフランス語で試してみましょう。期待通りに動作しますか？

# @title Submit your feedback
content_review(f"{feedback_prefix}_Multilingual_Embeddings_Bonus_Activity")