

PivotAlign: Structural Alignment for Bridging Parallel Corpora with a Pivot Language

Pivot translation is an important method for extending parallel corpora to cover new language pairs. However, current methods are not designed to handle differences in corpus domain or in the syntax of the languages involved. To address these problems, we propose PivotAlign: a system that produces phrasal alignments from parallel corpora that share a common pivot language, even if the corpora come from different domains or the languages are grammatically divergent. PivotAlign uses structural information from dependency parsers and monolingual alignment of the pivot-language text to search for more likely multilingual alignments. We will use PivotAlign to construct Japanese-Spanish and Japanese-Italian phrasal SMT systems and use them to refine the alignment process. To evaluate our systems, we will construct a Japanese-English-Italian-Spanish evaluation corpus and release it, along with PivotAlign, to the research community. (full proposal)
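
For orientation, the simplest form of pivoting can be sketched as composing two sets of word-alignment links through a shared pivot sentence. This is only an illustrative baseline, not PivotAlign itself: it assumes both corpora contain the same English sentence, which is exactly what does not hold when the corpora come from different domains. The function name and the toy link sets are hypothetical.

```python
from collections import defaultdict


def compose_alignments(src_to_pivot, pivot_to_tgt):
    """Compose (src, pivot) and (pivot, tgt) links into (src, tgt) links.

    Each link is a pair of token positions. A source token is linked to
    every target token that shares at least one pivot position with it.
    """
    targets_by_pivot = defaultdict(set)
    for pivot_pos, tgt_pos in pivot_to_tgt:
        targets_by_pivot[pivot_pos].add(tgt_pos)
    return sorted(
        {
            (src_pos, tgt_pos)
            for src_pos, pivot_pos in src_to_pivot
            for tgt_pos in targets_by_pivot.get(pivot_pos, ())
        }
    )


# Toy links (hypothetical): JA-EN and EN-ES alignments over one shared EN sentence.
ja_en = {(0, 0), (1, 2), (2, 1)}
en_es = {(0, 0), (1, 1), (2, 2), (2, 3)}
ja_es = compose_alignments(ja_en, en_es)
```

When the pivot sentences differ, as they do between the corpora used here, this direct composition has nothing to join on, which is the gap the monolingual alignment of the pivot text is meant to fill.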


  • Japanese
    • POS Tagger: MeCab (0.97)
    • Dictionary: NAIST Jdic (0.4.3-20080917)
    • Dependency Parser: CaboCha (0.60pre4)
  • English
    • POS Tagger: FreeLing (2.1beta1)
    • Dependency Parser: FreeLing (2.1beta1 -- Malt Parser model?)
  • Spanish
    • POS Tagger: FreeLing (2.1beta1)
    • Dependency Parser: FreeLing (2.1beta1)
  • Italian
    • POS Tagger: FreeLing (2.1beta1)
    • Dependency Parser: FreeLing (2.1beta1)


  • Europarl Corpus (ES-EN, IT-EN, ...)
    • pine5:/work/mt/corpora/europarl/es-en
    • pine5:/work/mt/corpora/europarl/it-en
  • CRL Yomiuri Shimbun (JA-EN)
    • pine5:/work/mt/corpora/europarl/yomiuri-data


  • /work/mt/corpora/europarl/es-en/en/ep-00-01-17.txt
 Resumption of the session
 <SPEAKER ID=1 NAME="President">
 I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.
 Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.
 You have requested a debate on this subject in the course of the next few days, during this part-session.
 In the meantime, I should like to observe a minute' s silence, as a number of Members have requested, on behalf of all the victims concerned, particularly those of the terrible storms, in the various countries of the European Union.
 Please rise, then, for this minute' s silence.
  • /work/mt/corpora/europarl/es-en/en/ep-00-01-17.tag
 < < Fz 1
 CHAPTER_ID chapter_id NNP 1
 = = Fz 1
 1 1 Z 1
 > > Fz 1
 Resumption resumption NNP 1
 of of IN 0.999907
 the the DT 1
 session session NN 1
 < < Fz 1
  • /work/mt/corpora/europarl/es-en/en/ep-00-01-17.dep
 oth-mk/top/(< < Fz -) [
   n-chunk/modnorule/(CHAPTER_ID chapter_id NNP -)
   oth-mk/modnorule/(= = Fz -)
   sn-chunk/modnorule/(1 1 Z -)
   oth-mk/modnorule/(> > Fz -)
   n-chunk/modnorule/(Resumption resumption NNP -) [
     sp-chunk/ncmod/(of of IN -) [
       sn-chunk/dobj/(session session NN -) [
         DT/det/(the the DT -)
  • /work/mt/corpora/yomiuri-data/crl-yomiuri.ja.000.txt
 2 日本国政府及びロシア連邦政府は、市場経済への移行を目指すロシア連邦の努力に対する支援に当たって、戦後の日本国の経済発展の経験が有効となり得ることにつき、共通の理解を有する。
  • /work/mt/corpora/yomiuri-data/crl-yomiuri.ja.000.cabo
 * 0 7D 0/1 0.000000
 欧州	名詞,固有名詞,地域,一般,*,*,欧州,オウシュウ,オーシュー	B-LOCATION
 は	助詞,係助詞,*,*,*,*,は,ハ,ワ	O
 、	記号,読点,*,*,*,*,、,、,、	O
 * 1 2D 0/1 1.754737
 エディンバラ	名詞,固有名詞,地域,一般,*,*,エディンバラ,エディンバラ,エディンバラ	B-LOCATION
 において	助詞,格助詞,連語,*,*,*,において,ニオイテ,ニオイテ	O
 * 2 4D 1/2 0.642229
 合意	名詞,サ変接続,*,*,*,*,合意,ゴウイ,ゴーイ	O
 さ	動詞,自立,*,*,サ変・スル,未然レル接続,する,サ,サ	O
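
The CaboCha output shown above can be read with a small parser. The following is a minimal sketch, assuming the standard lattice layout: chunk header lines of the form `* index headD head/func score`, tab-separated token lines, and an `EOS` line closing each sentence (the excerpt above is cut off before `EOS`). The `Chunk` class and `parse_cabocha` function are hypothetical names, and the leading space used for display on this page is assumed to be absent in the actual file.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    index: int
    head: int      # index of the chunk this one depends on (-1 for the root)
    score: float
    tokens: list = field(default_factory=list)  # (surface, features, ne_tag)


def parse_cabocha(lines):
    """Yield one list of Chunk objects per sentence."""
    chunks = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "EOS":
            yield chunks
            chunks = []
        elif line.startswith("* "):
            # Header: "* <index> <head>D <head/func> <score>"
            _, idx, dep, _head_func, score = line.split(" ")
            chunks.append(Chunk(int(idx), int(dep[:-1]), float(score)))
        elif line:
            cols = line.split("\t")
            surface, features = cols[0], cols[1].split(",")
            ne_tag = cols[2] if len(cols) > 2 else "O"
            chunks[-1].tokens.append((surface, features, ne_tag))
    if chunks:  # tolerate input that is cut off before EOS
        yield chunks


sample = [
    "* 0 7D 0/1 0.000000",
    "欧州\t名詞,固有名詞,地域,一般,*,*,欧州,オウシュウ,オーシュー\tB-LOCATION",
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\tO",
    "* 1 2D 0/1 1.754737",
    "エディンバラ\t名詞,固有名詞,地域,一般,*,*,エディンバラ,エディンバラ,エディンバラ\tB-LOCATION",
    "EOS",
]
sentences = list(parse_cabocha(sample))
```

Each chunk keeps its head index, so the dependency structure can be rebuilt by following `head` from chunk to chunk until it reaches -1.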


  • FreeLing
    • No proper UTF-8 support (uses iconv to convert to and from ISO-8859-1)
    • Dependency tree format is completely unparsable >_<
  • MST Parser
    • Need to train parsing models for each language
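
Although the FreeLing dependency output is awkward to work with, the `.dep` sample above looks regular enough to read with a stack: each node is `label/function/(form lemma tag -)`, and a trailing `[` opens a list of children. The following is a rough sketch under that reading. It assumes each child list is closed by a `]` on its own line, which the truncated sample does not show, and `parse_dep_tree` is a hypothetical helper, not part of FreeLing.

```python
import re

# One node per line: label/function/(form lemma tag extra), optional " [".
NODE = re.compile(
    r"^(?P<label>[^/\s]+)/(?P<func>[^/\s]+)/"
    r"\((?P<form>\S+) (?P<lemma>\S+) (?P<tag>\S+) \S+\)"
    r"(?P<open>\s*\[)?$"
)


def parse_dep_tree(lines):
    """Return a list of root nodes, each a nested dict with a 'children' list."""
    roots, stack = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "]":  # assumed closing marker for a child list
            if stack:
                stack.pop()
            continue
        match = NODE.match(line)
        if match is None:
            raise ValueError(f"unrecognized line: {raw!r}")
        node = {k: match.group(k) for k in ("label", "func", "form", "lemma", "tag")}
        node["children"] = []
        # Attach to the innermost open node, or start a new tree.
        (stack[-1]["children"] if stack else roots).append(node)
        if match.group("open"):
            stack.append(node)
    return roots


sample = """\
oth-mk/top/(< < Fz -) [
  n-chunk/modnorule/(CHAPTER_ID chapter_id NNP -)
  n-chunk/modnorule/(Resumption resumption NNP -) [
    sp-chunk/ncmod/(of of IN -) [
      sn-chunk/dobj/(session session NN -) [
        DT/det/(the the DT -)
      ]
    ]
  ]
]
""".splitlines()
roots = parse_dep_tree(sample)
```

Because unbalanced brackets are tolerated, the sketch also accepts output that is cut off mid-tree, as in the sample on this page.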