
PivotMT

PivotAlign: Structural Alignment for Bridging Parallel Corpora with a Pivot Language

Pivot translation is an important method for extending parallel corpora to cover new language pairs. However, current methods are not designed to handle differences in corpus domain or in the syntax of the languages involved. To address these problems, we propose PivotAlign: a system that produces phrasal alignments from parallel corpora that share a common pivot language, even if the corpora come from different domains or the languages are grammatically divergent. PivotAlign uses structural information from dependency parsers and monolingual alignment of the pivot-language text to search for more likely multilingual alignments. We will use PivotAlign to construct Japanese-Spanish and Japanese-Italian phrasal SMT systems and use them to refine the alignment process. To evaluate our systems, we will construct a Japanese-English-Italian-Spanish evaluation corpus and release it, along with PivotAlign, to the research community. (full proposal)
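
As a rough illustration of the bridging idea only (not PivotAlign itself, which is proposed to use dependency structure to search for better alignments), the sketch below composes word alignments through the pivot. The function name `compose` and the toy index pairs are hypothetical. It assumes three alignments are already available: Japanese to English from one corpus, a monolingual English-to-English alignment between the two corpora, and English to Spanish from the other corpus.

```python
from collections import defaultdict

def compose(ab, bc):
    """Compose two alignments given as sets of (left_index, right_index) pairs.

    A pair (a, c) is produced whenever some b links a to b in `ab`
    and b to c in `bc`. Unlinked tokens simply drop out.
    """
    by_b = defaultdict(set)
    for b, c in bc:
        by_b[b].add(c)
    return {(a, c) for a, b in ab for c in by_b.get(b, ())}

# Hypothetical toy alignments over token indices (not real corpus data).
ja_en = {(0, 0), (1, 2)}   # Japanese -> English (pivot text of corpus 1)
en_en = {(0, 1), (2, 3)}   # English (corpus 1) -> English (corpus 2), monolingual
en_es = {(1, 0), (3, 2)}   # English (corpus 2) -> Spanish

ja_es = compose(compose(ja_en, en_en), en_es)
print(sorted(ja_es))
```

Plain composition like this loses every link whose pivot token is unaligned on either side, which is one reason the proposal looks at structural information rather than relying on index chaining alone.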

Tools

  • Japanese
    • POS Tagger: MeCab (0.97)
    • Dictionary: NAIST Jdic (0.4.3-20080917)
    • Dependency Parser: CaboCha (0.60pre4)
  • English
    • POS Tagger: FreeLing (2.1beta1)
    • Dependency Parser: FreeLing (2.1beta1 -- Malt Parser model?)
  • Spanish
    • POS Tagger: FreeLing (2.1beta1)
    • Dependency Parser: FreeLing (2.1beta1)
  • Italian
    • POS Tagger: FreeLing (2.1beta1)
    • Dependency Parser: FreeLing (2.1beta1)

Data

  • Europarl Corpus (ES-EN, IT-EN, ...)
    • pine5:/work/mt/corpora/europarl/es-en
    • pine5:/work/mt/corpora/europarl/it-en
  • CRL Yomiuri Shimbun (JA-EN)
    • pine5:/work/mt/corpora/yomiuri-data

Examples

  • /work/mt/corpora/europarl/es-en/en/ep-00-01-17.txt
 <CHAPTER ID=1> 
 Resumption of the session
 <SPEAKER ID=1 NAME="President">
 I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.
 <P>
 Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.
 You have requested a debate on this subject in the course of the next few days, during this part-session.
 In the meantime, I should like to observe a minute' s silence, as a number of Members have requested, on behalf of all the victims concerned, particularly those of the terrible storms, in the various countries of the European Union.
 Please rise, then, for this minute' s silence.
 <P>
  • /work/mt/corpora/europarl/es-en/en/ep-00-01-17.tag
 < < Fz 1
 CHAPTER_ID chapter_id NNP 1
 = = Fz 1
 1 1 Z 1
 > > Fz 1
 Resumption resumption NNP 1
 of of IN 0.999907
 the the DT 1
 session session NN 1
 < < Fz 1
  • /work/mt/corpora/europarl/es-en/en/ep-00-01-17.dep
 oth-mk/top/(< < Fz -) [
   n-chunk/modnorule/(CHAPTER_ID chapter_id NNP -)
   oth-mk/modnorule/(= = Fz -)
   sn-chunk/modnorule/(1 1 Z -)
   oth-mk/modnorule/(> > Fz -)
   n-chunk/modnorule/(Resumption resumption NNP -) [
     sp-chunk/ncmod/(of of IN -) [
       sn-chunk/dobj/(session session NN -) [
         DT/det/(the the DT -)
       ]
  • /work/mt/corpora/yomiuri-data/crl-yomiuri.ja.000.txt
 欧州は、エディンバラにおいて合意され、コペンハーゲンにおいて強化された成長イニシアチブを精力的に実行しつつある。
 我々は、ロシアの経済発展にとって、改善された市場アクセスが重要であることを認識する。
 法人レベルでのパートナーシップ及びマネージメント支援は、特に効果的であり得る。
 違憲の問題については、連邦憲法裁判所が決定する。
 統合の実際のプロセスは、経済分野から始めねばならない。
 関連する安全保障理事会決議の諸条件が満たされるまで、制裁は維持されるべきである。
 国際テロは、世界の平和と安全に対する重大な脅威だ。
 貧困、人口政策、教育、保健、女性の役割、及び児童の福祉は、特別の注意に値する。
 国際市場に対するロシア産品のアクセス改善は、ロシアの構造改革を大いに強化する。
 2 日本国政府及びロシア連邦政府は、市場経済への移行を目指すロシア連邦の努力に対する支援に当たって、戦後の日本国の経済発展の経験が有効となり得ることにつき、共通の理解を有する。
  • /work/mt/corpora/yomiuri-data/crl-yomiuri.ja.000.cabo
 * 0 7D 0/1 0.000000
 欧州	名詞,固有名詞,地域,一般,*,*,欧州,オウシュウ,オーシュー	B-LOCATION
 は	助詞,係助詞,*,*,*,*,は,ハ,ワ	O
 、	記号,読点,*,*,*,*,、,、,、	O
 * 1 2D 0/1 1.754737
 エディンバラ	名詞,固有名詞,地域,一般,*,*,エディンバラ,エディンバラ,エディンバラ	B-LOCATION
 において	助詞,格助詞,連語,*,*,*,において,ニオイテ,ニオイテ	O
 * 2 4D 1/2 0.642229
 合意	名詞,サ変接続,*,*,*,*,合意,ゴウイ,ゴーイ	O
 さ	動詞,自立,*,*,サ変・スル,未然レル接続,する,サ,サ	O
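
The `.cabo` sample above can be read with a few lines of Python. This is a minimal sketch under some assumptions: the file is in the lattice layout shown (chunk header lines starting with `* `, tab-separated token lines, and an `EOS` line closing each sentence), and the leading space in the sample is only an artifact of this page. The `Chunk` class and `parse_cabocha` function are hypothetical names, not part of CaboCha.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    index: int
    head: int      # index of the chunk this one depends on; -1 marks the root
    score: float
    tokens: list = field(default_factory=list)  # (surface, features, ne_tag)

def parse_cabocha(lines):
    """Yield one list of Chunk objects per sentence."""
    chunks = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "EOS":
            yield chunks
            chunks = []
        elif line.startswith("* "):
            # Header: "* <chunk id> <head id>D <head/func positions> <score>"
            _, idx, head, _positions, score = line.split(" ")
            chunks.append(Chunk(int(idx), int(head.rstrip("D")), float(score)))
        elif line and chunks:
            cols = line.split("\t")
            features = cols[1].split(",") if len(cols) > 1 else []
            ne_tag = cols[2] if len(cols) > 2 else "O"
            chunks[-1].tokens.append((cols[0], features, ne_tag))
    if chunks:  # tolerate a final sentence with no EOS line
        yield chunks

sample = """\
* 0 7D 0/1 0.000000
欧州\t名詞,固有名詞,地域,一般,*,*,欧州,オウシュウ,オーシュー\tB-LOCATION
は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\tO
* 1 2D 0/1 1.754737
エディンバラ\t名詞,固有名詞,地域,一般,*,*,エディンバラ,エディンバラ,エディンバラ\tB-LOCATION
において\t助詞,格助詞,連語,*,*,*,において,ニオイテ,ニオイテ\tO
EOS
"""

sentences = list(parse_cabocha(sample.splitlines()))
for chunk in sentences[0]:
    print(chunk.index, "->", chunk.head, "".join(t[0] for t in chunk.tokens))
```

Keeping the head index on each chunk makes it straightforward to walk the dependency tree later, for example when comparing Japanese chunk structure against the pivot-language parse.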

Issues

  • FreeLing
    • No proper UTF-8 support (uses iconv to convert to and from iso-8859-1)
    • Dependency tree format is completely unparsable >_<
  • MST Parser
    • Need to train parsing models for each language
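
On the FreeLing dependency format issue: for the fragment shown in the `.dep` example, a line-based reader with a stack appears to be enough. The sketch below assumes the layout is exactly what that sample shows, namely `label/function/(form lemma tag -)` per node, a trailing `[` when children follow, and a lone `]` to close them. This is inferred from the sample only, not from FreeLing documentation, and `Node` and `parse_dep` are hypothetical names.

```python
import re
from dataclasses import dataclass, field

# label/function/(form lemma tag -), optionally followed by " ["
NODE = re.compile(r"^([^/\s]+)/([^/\s]+)/\((.*)\)(\s*\[)?$")

@dataclass
class Node:
    label: str
    function: str
    form: str
    lemma: str
    tag: str
    children: list = field(default_factory=list)

def parse_dep(lines):
    """Return the list of root nodes found in the given lines."""
    roots, stack = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "]":
            if stack:
                stack.pop()
            continue
        m = NODE.match(line)
        if m is None:
            raise ValueError(f"unrecognised line: {raw!r}")
        label, function, body, opens = m.groups()
        # The trailing field ("-" in the sample) is ignored here.
        form, lemma, tag = body.split(" ")[:3]
        node = Node(label, function, form, lemma, tag)
        (stack[-1].children if stack else roots).append(node)
        if opens:
            stack.append(node)
    return roots

sample = """\
oth-mk/top/(< < Fz -) [
  n-chunk/modnorule/(CHAPTER_ID chapter_id NNP -)
  oth-mk/modnorule/(= = Fz -)
  sn-chunk/modnorule/(1 1 Z -)
  oth-mk/modnorule/(> > Fz -)
  n-chunk/modnorule/(Resumption resumption NNP -) [
    sp-chunk/ncmod/(of of IN -) [
      sn-chunk/dobj/(session session NN -) [
        DT/det/(the the DT -)
      ]
"""

roots = parse_dep(sample.splitlines())
resumption = roots[0].children[-1]
print(resumption.form, "->", resumption.children[0].form)
```

The reader tolerates unclosed brackets, so it also accepts truncated excerpts like the one above. Real output may contain cases the sample does not show, so unrecognised lines raise an error instead of being skipped silently.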