2. Text simplification
• Rewrite difficult text in simpler language while preserving its meaning
English Wikipedia: Alfonso Perez
Alfonso Perez Munoz, usually referred to as Alfonso, is a
former Spanish footballer, in the striker position.
Simple English Wikipedia: Alfonso Perez
Alfonso Perez is a former Spanish football player.
• Sentence compression + paraphrasing
• Text simplification supports reading comprehension for many readers,
including language learners and children
4. Building a Monolingual Parallel Corpus for Text Simplification
Using Sentence Similarity Based on Alignment between Word Embeddings
Complex corpus
1. Lennon was born in war-time England, on 9 October 1940 at
Liverpool Maternity Hospital, to Julia and Alfred Lennon, a
merchant seaman of Irish descent, who was away at the time
of his son's birth.
2. His parents named him John Winston Lennon after his
paternal grandfather, John “Jack” Lennon, and then-Prime
Minister Winston Churchill. …
Simple corpus
1. Lennon started the Beatles in his hometown of Liverpool, with
Paul McCartney and George Harrison.
2. After Ringo Starr joined the band, they started to be very
successful.
3. People were excited by their music, and their live
performances always pleased audiences. …
[Figure: sentence similarity matrix between the sentences of the complex corpus and those of the simple corpus; cell values shown include 0.27, 0.10, 0.05, 0.19, 0.01, 0.07]
Parallel corpus
"Lucy in the Sky with Diamonds" is a song written primarily by John
Lennon and credited to Lennon–McCartney, for the Beatles' 1967
album Sgt. Pepper's Lonely Hearts Club Band.
"Lucy in the Sky with Diamonds" is a song written by John Lennon
and Paul McCartney for The Beatles' 1967 album Sgt. Pepper's
Lonely Hearts Club Band. (0.91)
After his marriage to Yoko Ono in 1969, he changed his name to
John Ono Lennon.
Lennon loved his wife so much that he added her surname Ono to
his own name, since she became Yoko Ono Lennon when she
married him. (0.53)
Statistical machine translation model
John Lennon was an English singer,
songwriter and artist who rose to worldwide
fame as the founder of the rock band the
Beatles.
John Lennon was an English singer and songwriter who
rose to worldwide fame as a co-founder of the Beatles,
the most commercially successful band in the history of
popular music.
① Compute sentence similarity based on the alignment of word embeddings
② Extract sentence pairs whose similarity is at or above a threshold to build a parallel corpus
③ Train a statistical machine translation model on the parallel corpus
④ Use the model to generate a simpler paraphrase of an input sentence
5. Step ①: sentence similarity based on word alignment
• For a complex sentence and a simple sentence, consider a many-to-one
word alignment based on word embeddings, and compute the sentence
similarity as the average of the aligned word similarities
6. Intrinsic evaluation
• Use sentence similarity for binary classification of parallel vs.
non-parallel sentence pairs, and compare the F-measure
• F-measure improved by 3.1 points over previous work (0.607 → 0.638)
7. Extrinsic evaluation
• Train a statistical machine translation model on the parallel corpus
and compare BLEU
• BLEU improved by 3.2 points over previous work (44.3 → 47.5)
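8. English text simplification corpora
• Zhu et al. (2010)
  • Represent each sentence as a TF-IDF vector
  • Extract sentence pairs whose cosine similarity between vectors exceeds a threshold
• Coster and Kauchak (2011)
  • Extend Zhu et al. (2010) to take the order of sentences into account
• Hwang et al. (2015)
  • Account for similarity between different words using the co-occurrence
  of Wiktionary headwords and the words in their definitions
• This work
  • Account for similarity between different words using word embeddings
Goal: account for similarity between different words (complex vs. simple)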
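10. 1. Average Alignment
• Compute the word similarity for every word pair between sentence x and sentence y
• Average the |x||y| word similarities to obtain the sentence similarity
• x_i: the i-th word of sentence x
• y_j: the j-th word of sentence y
• φ(x_i, y_j): word similarity between word x_i and word y_j;
this work uses cosine similarity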
$$ S_{ave}(x, y) = \frac{1}{|x||y|} \sum_{i=1}^{|x|} \sum_{j=1}^{|y|} \phi(x_i, y_j) $$
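The average over all word pairs can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the embedding table `EMB` and its 2-dimensional vectors are hypothetical toy values, and the helper names `cosine` and `s_ave` are introduced here only for the example.

```python
import math

# Toy 2-dimensional word vectors; hypothetical values for illustration only.
EMB = {
    "purchased": [0.9, 0.1],
    "bought": [0.8, 0.2],
    "station": [0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def s_ave(x, y, emb=EMB):
    """Average of phi(x_i, y_j) over all |x||y| word pairs."""
    total = sum(cosine(emb[xi], emb[yj]) for xi in x for yj in y)
    return total / (len(x) * len(y))

x = ["purchased", "station"]
y = ["bought", "station"]
print(round(s_ave(x, y), 3))
```

Even though the two toy sentences are near paraphrases, the unrelated pairs (for example "purchased" with "station") pull the average well below 1.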
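11. 2. Maximum Alignment
• Average Alignment is intuitive, but most word similarities are
close to zero and act as noise
• Therefore, for each word x_i, use only the most similar word y_j
to compute the sentence similarity
• Average S_asym(x, y) and S_asym(y, x) to obtain a symmetric similarity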
$$ S_{asym}(x, y) = \frac{1}{|x|} \sum_{i=1}^{|x|} \max_{j} \phi(x_i, y_j) $$
$$ S_{max}(x, y) = \frac{1}{2} \left( S_{asym}(x, y) + S_{asym}(y, x) \right) $$
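A minimal Python sketch of the two formulas follows. The embedding table `EMB` holds hypothetical toy vectors, and the function names `cosine`, `s_asym`, and `s_max` are chosen here for illustration; the original work does not prescribe them.

```python
import math

# Toy 2-dimensional word vectors; hypothetical values for illustration only.
EMB = {
    "purchased": [0.9, 0.1],
    "bought": [0.8, 0.2],
    "station": [0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def s_asym(x, y, emb=EMB):
    """For each word of x keep only its best match in y, then average over x."""
    return sum(max(cosine(emb[xi], emb[yj]) for yj in y) for xi in x) / len(x)

def s_max(x, y, emb=EMB):
    """Symmetric similarity: mean of the two asymmetric directions."""
    return 0.5 * (s_asym(x, y, emb) + s_asym(y, x, emb))

x = ["purchased", "station"]
y = ["bought", "station"]
print(round(s_max(x, y), 3))
```

The asymmetry is easy to see with sentences of different length: `s_asym(["purchased"], y)` is high because the single word finds a good match, while `s_asym(y, ["purchased"])` is lower because "station" has no good match. Averaging both directions penalizes that one-sided coverage.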
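12. 3. Hungarian Alignment
• Next, compute a sentence similarity based on one-to-one word alignment
  • Average Alignment: many-to-many word alignment
  • Maximum Alignment: many-to-one word alignment
• Regard sentences x and y as a weighted complete bipartite graph
with words as nodes and word similarities as edge weights
• The maximum-weight matching of this graph gives the one-to-one word
alignment that maximizes the total word similarity
• The maximum matching problem on a bipartite graph can be solved with the Hungarian method
• h(x_i): the word of y aligned to x_i by this matching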
$$ S_{hun}(x, y) = \frac{1}{\min(|x|, |y|)} \sum_{i=1}^{\min(|x|, |y|)} \phi(x_i, h(x_i)) $$
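To make the one-to-one constraint concrete, the sketch below finds the best matching by exhaustive search over all one-to-one alignments. This is deliberately not the Hungarian method: it returns the same optimum but only scales to very short sentences, whereas a real implementation would use an assignment solver such as `scipy.optimize.linear_sum_assignment`. The toy vectors in `EMB` and the function names are assumptions made for this example.

```python
import math
from itertools import permutations

# Toy 2-dimensional word vectors; hypothetical values for illustration only.
EMB = {
    "purchased": [0.9, 0.1],
    "bought": [0.8, 0.2],
    "station": [0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def s_hun(x, y, emb=EMB):
    """Best one-to-one alignment score, normalized by min(|x|, |y|)."""
    # Align every word of the shorter sentence to a distinct word of the longer one.
    short, long_ = (x, y) if len(x) <= len(y) else (y, x)
    best = max(
        sum(cosine(emb[w], emb[long_[k]]) for w, k in zip(short, perm))
        for perm in permutations(range(len(long_)), len(short))
    )
    return best / len(short)

# A repeated word cannot reuse the same partner under one-to-one alignment.
print(round(s_hun(["purchased", "purchased"], ["purchased", "station"]), 3))
```

In the printed example the second "purchased" is forced to align with "station", which is the kind of case where a one-to-one alignment scores lower than a many-to-one alignment.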
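13. 4. Word Mover's Distance
• Word Mover's Distance (WMD), a special case of the Earth Mover's Distance,
solves a transportation problem that moves words from sentence x to
sentence y; it can also be used to compute a sentence similarity based on
many-to-many word alignment
• ψ(x_u, y_v): dissimilarity (distance) between word x_u and word y_v
• freq(x_u): frequency of word x_u in sentence x
• n: vocabulary size; A_uv: matrix giving the amount of word x_u transported to word y_v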
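$$ S_{wmd}(x, y) = 1 - \mathrm{WMD}(x, y) $$
$$ \mathrm{WMD}(x, y) = \min \sum_{u=1}^{n} \sum_{v=1}^{n} A_{uv}\, \psi(x_u, y_v) $$
$$ \text{subject to} \quad \sum_{v=1}^{n} A_{uv} = \frac{1}{|x|} \mathrm{freq}(x_u), \qquad \sum_{u=1}^{n} A_{uv} = \frac{1}{|y|} \mathrm{freq}(y_v) $$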
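Solving the transportation problem exactly needs a linear-programming solver, so the sketch below computes only a relaxation: it drops one of the two constraints, which lets every word send all of its mass to its nearest word in the other sentence. The result is a lower bound on WMD (and therefore an upper bound on S_wmd), not the exact value. Using Euclidean distance for ψ is an assumption, as are the toy vectors in `EMB` and all function names.

```python
import math

# Toy 2-dimensional word vectors; hypothetical values for illustration only.
EMB = {
    "purchased": [0.9, 0.1],
    "bought": [0.8, 0.2],
    "station": [0.1, 0.9],
}

def dist(u, v):
    """Euclidean distance, used here as the word dissimilarity psi (an assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def relaxed_wmd(x, y, emb=EMB):
    """Lower bound on WMD(x, y) obtained by keeping one constraint at a time."""
    def one_way(a, b):
        # Each token of a carries mass 1/|a| and moves entirely to its nearest word in b.
        return sum(min(dist(emb[w], emb[v]) for v in b) for w in a) / len(a)
    # Both directions are valid lower bounds; the larger one is the tighter bound.
    return max(one_way(x, y), one_way(y, x))

def s_wmd_upper(x, y, emb=EMB):
    """1 - relaxed WMD, an upper bound on S_wmd(x, y)."""
    return 1.0 - relaxed_wmd(x, y, emb)

print(round(s_wmd_upper(["purchased", "station"], ["bought", "station"]), 3))
```

Summing over tokens rather than over vocabulary entries is equivalent to weighting each distinct word by freq(x_u)/|x|, since a word that occurs twice contributes twice.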
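14. Experiments: building a text simplification corpus
• Intrinsic evaluation
  • For sentence pairs extracted from English Wikipedia and Simple English
  Wikipedia, compare the F-measure of binary classification (parallel vs.
  non-parallel) across several sentence similarity measures
• Extrinsic evaluation
  • Train statistical machine translation models on the corpus built in this
  work and on existing text simplification corpora, and compare BLEU when
  translating English Wikipedia sentences into Simple English Wikipedia sentences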
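15. Experimental setup: intrinsic evaluation
• Binary classification of parallel vs. non-parallel pairs using sentence similarity
• Evaluation dataset: Hwang et al. (2015)
  • 67,853 sentence pairs extracted from English Wikipedia and Simple English
  Wikipedia, manually annotated with four labels
  • Good (G): bidirectional entailment, 277 pairs
  • Good Partial (GP): unidirectional entailment, 281 pairs
  • Partial (P): related, 117 pairs
  • Bad (B): unrelated, 67,178 pairs
• Evaluated by the maximum F1 (MaxF1) and the area under the precision-recall curve (AUC)
• Evaluated in two settings: G vs. Others, and G+GP vs. Others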
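MaxF1 can be read as "the best F1 obtainable by any similarity threshold". A minimal sketch under that reading is shown below; the function name `max_f1` and the five toy scores and labels are made up for illustration and are not taken from the evaluation data.

```python
def max_f1(scores, labels):
    """Return (best F1, threshold); pairs with score >= threshold are predicted parallel."""
    best_f1, best_t = 0.0, None
    for t in sorted(set(scores)):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and not l)
        fn = sum(1 for s, l in zip(scores, labels) if s < t and l)
        if tp == 0:
            continue  # precision and recall are both zero; F1 is undefined or 0
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_f1, best_t = f1, t
    return best_f1, best_t

# Hypothetical similarity scores and gold labels (True = parallel).
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False]
f1, threshold = max_f1(scores, labels)
print(round(f1, 3), threshold)
```

Switching between the two settings (G vs. Others, G+GP vs. Others) only changes which gold labels count as `True`.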
16. About the labels
• Good (G): bidirectional entailment, 277 pairs
– Apple sauce or applesauce is a puree made of apples.
– Apple sauce (or applesauce) is a sauce that is made from stewed or mashed
apples.
• Good Partial (GP): unidirectional entailment, 281 pairs
– Commercial versions of applesauce are really available in supermarkets.
– It is easy to make at home, and it is also sold already made in supermarkets as
a common food.
• Partial (P): related, 117 pairs
– Applesauce is a sauce that is made from stewed and mashed apples.
– Applesauce is made by cooking down apples with water or apple cider to the
desired level.
• Bad (B): unrelated, 67,178 pairs
– Commercial versions of applesauce are really available in supermarkets.
– Peeled or unpeeled apples can be used and different spices or additives like
cinnamon can be used.
17. Maximum Alignment performs well
                             G vs. Others     G+GP vs. Others
Method                       MaxF1   AUC      MaxF1   AUC
Zhu et al. (2010)            0.550   0.509    0.431   0.391
Coster and Kauchak (2011)    0.564   0.495    0.415   0.387
Hwang et al. (2015)          0.712   0.694    0.607   0.529
Additive Embeddings          0.691   0.695    0.518   0.487
1. Average Alignment         0.419   0.312    0.391   0.297
2. Maximum Alignment         0.717   0.730    0.638   0.618
3. Hungarian Alignment       0.524   0.414    0.354   0.275
4. Word Mover's Distance     0.724   0.738    0.531   0.499
* Additive Embeddings: a baseline that does not use word alignment; the cosine
similarity between sentence vectors obtained by summing word embeddings
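The Additive Embeddings baseline can be sketched as follows: sum the word vectors of each sentence and compare the two sums with cosine similarity. The toy vectors in `EMB` and the function names are assumptions for illustration.

```python
import math

# Toy 2-dimensional word vectors; hypothetical values for illustration only.
EMB = {
    "purchased": [0.9, 0.1],
    "bought": [0.8, 0.2],
    "station": [0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def sentence_vector(words, emb=EMB):
    """Element-wise sum of the word vectors of a sentence."""
    return [sum(components) for components in zip(*(emb[w] for w in words))]

def s_add(x, y, emb=EMB):
    """Cosine similarity between the summed sentence vectors; no word alignment involved."""
    return cosine(sentence_vector(x, emb), sentence_vector(y, emb))

print(round(s_add(["purchased", "station"], ["bought", "station"]), 3))
```

Because the words are collapsed into one vector before comparison, this baseline cannot tell which word of one sentence corresponds to which word of the other, which is the information the alignment-based measures use.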
19. Building the text simplification corpus
Example sentence pairs (sentence similarity, complex sentence, simple sentence):
0.9
  Complex: Woody Bay Station was purchased by the Lynton …
  Simple: Woody Bay Station was bought by the Lynton …
0.7
  Complex: Miró has been a significant influence on late 20th-century art,
  in particular the American abstract expressionist artists such as
  Motherwell, … and others.
  Simple: Miró was a significant influence on late 20th-century art,
  in particular the American abstract expressionist artists.
0.6
  Complex: The couple has four children:
  Simple: She has two daughters and two sons.
• Collected 126,725 document pairs with matching titles from
English Wikipedia and Simple English Wikipedia
• Collected 492,993 sentence pairs using Maximum Alignment
  • Word alignment threshold: word similarity of 0.49 or higher
  • Sentence alignment threshold: sentence similarity of 0.53 or higher
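The extraction with the two thresholds can be sketched as below. How the word-level threshold enters the computation is not spelled out on the slide; this sketch assumes that word similarities below 0.49 are treated as zero before Maximum Alignment is applied. The toy vectors in `EMB`, the constant names, and the function names are likewise hypothetical.

```python
import math

WORD_THRESHOLD = 0.49  # word similarities below this are ignored (assumed: set to 0)
SENT_THRESHOLD = 0.53  # sentence pairs at or above this are kept

# Toy 2-dimensional word vectors; hypothetical values for illustration only.
EMB = {
    "purchased": [0.9, 0.1],
    "bought": [0.8, 0.2],
    "station": [0.1, 0.9],
    "children": [-0.7, 0.1],
}

def phi(a, b, emb=EMB):
    """Thresholded cosine similarity between two words."""
    u, v = emb[a], emb[b]
    dot = sum(p * q for p, q in zip(u, v))
    norm = math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v))
    sim = dot / norm if norm else 0.0
    return sim if sim >= WORD_THRESHOLD else 0.0

def s_max(x, y, emb=EMB):
    """Maximum Alignment similarity, symmetrized over both directions."""
    def asym(a, b):
        return sum(max(phi(w, v, emb) for v in b) for w in a) / len(a)
    return 0.5 * (asym(x, y) + asym(y, x))

def extract_pairs(complex_sents, simple_sents, emb=EMB):
    """Keep every (complex, simple) pair of one document pair that passes the threshold."""
    pairs = []
    for x in complex_sents:
        for y in simple_sents:
            score = s_max(x, y, emb)
            if score >= SENT_THRESHOLD:
                pairs.append((x, y, score))
    return pairs

complex_doc = [["purchased", "station"]]
simple_doc = [["bought", "station"], ["children"]]
pairs = extract_pairs(complex_doc, simple_doc)
print(len(pairs))
```

Comparing sentences only within a pair of documents that share a title keeps the number of candidate pairs manageable compared with scoring every sentence of one corpus against every sentence of the other.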
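20. Experimental setup: extrinsic evaluation
• Text simplification in the statistical machine translation framework
• Training: text simplification corpus
• Tuning: MERT, on 500 randomly sampled sentence pairs
• Test: BLEU, on the manually labeled data of Hwang et al.
  • Good (bidirectional entailment): 277 pairs
  • Good Partial (unidirectional entailment): 281 pairs
• SMT toolkit: Moses (phrase-based SMT)
• Language model: KenLM (5-gram), Simple English Wikipedia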
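21. BLEU improved by 3.2 points
                                              Avg. sentence length   BLEU
Corpus                       Sentence pairs   Complex   Simple       G      G+GP
Baseline (None)              -                -         -            42.1   22.3
Zhu et al. (2010)            107,516          21.2      17.4         42.0   22.1
Coster and Kauchak (2011)    136,862          23.6      21.1         44.3   23.8
Hwang et al. (2015)          284,238          26.0      19.8         43.9   23.1
Ours                         492,493          25.3      17.9         47.5   26.3
• The model trained on the corpus built in this work improves BLEU substantially
• In our corpus, the gap in average sentence length between complex and simple sentences is large
  • Average sentence length over all of English Wikipedia: 25.1
  • Average sentence length over all of Simple English Wikipedia: 16.9
• Maximum Alignment can compute sentence similarity appropriately regardless of sentence length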
23. Examples of text simplification
Input:
Mozart's Clarinet Concerto and Clarinet Quintet are both in A major,
and generally Mozart was more likely to use clarinets in A major than
in any other key besides E-flat major.
Reference:
Mozart used clarinets in A major often.
Zhu et al.:
Mozart's Clarinet Concerto and Clarinet Quintet are both in A major,
and generally Mozart which he more likely to use clarinets in A major
than in any other key besides E-flat major.
Coster and Kauchak:
Mozart was Clarinet Concerto and Clarinet Quintet are both in A
major, and Mozart used clarinets in A major often.
Hwang et al.:
Mozart's Clarinet Concerto and Clarinet Quintet are both in A major,
and generally Mozart was more likely to use clarinets in A major than
in any other key besides E-flat major.
Ours:
Mozart's Clarinet Concerto and Clarinet Quintet are both in A major,
and Mozart used clarinets in A major often.