This page summarizes some of my main research interests. Please refer to the publications page for details.

Novel Training Algorithms for Machine Translation

The premise of using statistical methods for machine translation is that regularities of the translation process are learnable from a corpus of parallel text. Although statistical machine translation systems have already proven useful in practice (e.g. Google Translate and Language Weaver), it is still unclear whether current training methods are optimal. My research efforts in machine translation thus involve exploring and devising new training algorithms, such as boosting [ACL08], sparse-feature learning [WMT10], domain adaptation [IWSLT10, IJCNLP11, ACL13], and multi-objective optimization [ACL12, NAACL13]. The training pipeline in machine translation is relatively complex, so there are many interesting open research problems from the machine learning perspective.
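To make the idea of "training" more concrete, below is a minimal Python sketch of the log-linear framework that such training algorithms typically tune: each candidate translation is scored by a weighted sum of features, and training adjusts the weights so that better translations score higher. The feature names, values, and the simple perceptron-style update are illustrative assumptions, not any of the specific algorithms cited above.

    # Illustrative sketch of log-linear scoring and a simple perceptron-style
    # weight update for MT tuning. Feature names and the update rule are
    # hypothetical; the cited papers use more sophisticated algorithms.

    def score(weights, features):
        """Log-linear model: score = sum_k w_k * f_k(hypothesis)."""
        return sum(weights.get(k, 0.0) * v for k, v in features.items())

    def perceptron_update(weights, best_by_metric, best_by_model, lr=0.1):
        """Move weights toward the metric-best hypothesis and away from
        the current model-best one (a simplified structured-perceptron step)."""
        for k, v in best_by_metric.items():
            weights[k] = weights.get(k, 0.0) + lr * v
        for k, v in best_by_model.items():
            weights[k] = weights.get(k, 0.0) - lr * v
        return weights

    # Toy usage: two candidate translations described by (assumed) features.
    weights = {"lm": 0.5, "tm": 0.5, "length_penalty": -0.1}
    hyp_a = {"lm": -2.3, "tm": -1.1, "length_penalty": 1.0}  # better by the metric (assumed)
    hyp_b = {"lm": -1.0, "tm": -1.2, "length_penalty": 1.0}  # currently preferred by the model
    if score(weights, hyp_b) > score(weights, hyp_a):
        weights = perceptron_update(weights, hyp_a, hyp_b)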

I should also note that training is only part of the system. Issues such as modeling, decoding, and evaluation are also important in machine translation research. While at NTT CS Labs, I had the good fortune to work with colleagues who are experts in different parts of the system. Integrating all these technologies, our overall system [NTCIR11] placed first in the NTCIR Patent Translation shared task. At NAIST, I continue to explore various ways to build better machine translation systems; topics include speech interpretation [IWSLT13], methods beneficial to SVO-SOV language pairs [WMT13], and bilingual dictionary extraction [CoNLL13].

Natural Language Processing for "Highly-Productive" Languages

My early work in graduate school focused on how one can build core NLP tools (e.g. part-of-speech taggers, parsers) for morphologically-rich languages, such as Arabic. These languages exhibit a highly productive way of generating words, leading to extreme data sparseness problems. One way to tackle this issue is to incorporate more linguistic insight into our machine learning methods. For example, Factored Language Models [cf. COLING04] provide a way to model morphology, enabling estimation of probabilities for unknown words. Another example is our transductive approach that uses partial seed lexicons to build a full part-of-speech tagger [EMNLP06].
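As a rough illustration of the factored idea (not the exact model in [COLING04]), each word can be decomposed into factors such as lemma and part-of-speech tag, and an estimator can back off to those factors when the surface form is unseen. The Arabic-like forms, factors, and counts below are invented for the example.

    # Toy sketch of factor-based backoff: if a surface word is unseen,
    # fall back to counts over its (lemma, POS) factors.
    # Words, factors, and counts here are invented for illustration.

    word_counts = {"kitaab": 10}             # surface-form counts
    factor_counts = {("ktb", "NOUN"): 25}    # (lemma, POS) counts
    total_words, total_factors = 100, 100

    def prob(word, lemma, pos, alpha=0.4):
        if word in word_counts:
            return word_counts[word] / total_words
        # Unseen surface form: back off to its morphological factors.
        return alpha * factor_counts.get((lemma, pos), 0) / total_factors

    print(prob("kitaabun", "ktb", "NOUN"))   # unseen inflected form still gets nonzero probability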

More recently, I have become interested in tackling the issue of high productivity in web and social media text, in particular using Deep Learning and Continuous Word Embeddings. In [EMNLP13], we show how better word embeddings can be learned by considering compositional semantics. New vocabulary and conventions are being adopted every day on the web. To support novel applications for these new services, it is imperative to develop an NLP toolchain that is robust and adaptable to these highly-productive languages.
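For readers unfamiliar with word embeddings, the sketch below shows the basic objects involved: each word is a dense vector, a phrase can be composed from its word vectors (here by simple averaging, which is far cruder than the compositional model in [EMNLP13]), and similarity is measured by cosine. The vocabulary and random vectors are placeholders, not trained embeddings.

    # Minimal sketch of continuous word embeddings and additive composition.
    # Vectors are random stand-ins, not trained embeddings.
    import numpy as np

    rng = np.random.default_rng(0)
    emb = {w: rng.normal(size=50) for w in ["new", "york", "city", "apple"]}

    def compose(words):
        """Represent a phrase as the average of its word vectors."""
        return np.mean([emb[w] for w in words], axis=0)

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    print(cosine(compose(["new", "york"]), emb["city"]))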

Semi-supervised and Transfer Learning of Ranking Problems

Ranking is a key problem in many applications. In web search, for instance, webpages are ranked such that the most relevant ones are presented to the user first. In machine translation, ranking and choosing from a large set of hypothesized translations often leads to improved results. In computational biology problems like protein structure prediction, rankings of predicted structures help scientists filter out unlikely candidate structures, speeding up the experimental process. Abstractly, the problem of ranking is to predict an ordering over a set of objects. The field of "Learning to Rank" has emerged as an active research area, with a focus on using machine learning approaches to automatically construct ranking functions.
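A common formulation, sketched below, is pairwise learning to rank: learn a linear scoring function so that, for every labeled pair, the more relevant item receives the higher score. The perceptron-style update and toy features are illustrative assumptions, not a specific published method.

    # Minimal pairwise learning-to-rank sketch with toy data.
    import numpy as np

    def train_pairwise(pairs, dim, epochs=50, lr=0.1):
        """pairs: list of (x_better, x_worse) feature vectors."""
        w = np.zeros(dim)
        for _ in range(epochs):
            for x_pos, x_neg in pairs:
                # Perceptron-style update whenever a pair is mis-ordered.
                if w @ x_pos <= w @ x_neg:
                    w += lr * (x_pos - x_neg)
        return w

    # Toy usage: 2-dimensional features; the first vector in each pair is more relevant.
    pairs = [(np.array([1.0, 0.2]), np.array([0.3, 0.9]))]
    w = train_pairwise(pairs, dim=2)
    print(w @ pairs[0][0] > w @ pairs[0][1])  # True: the pair is now correctly ordered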

However, training data for ranking is often expensive to annotate. It is therefore conceivable that not all applications of interest will have large amounts of training data. I am interested in exploring how one can still train good ranking functions in this case. One solution is to exploit unlabeled data: starting with [SIGIR08, CSL11] and culminating in my PhD thesis [UW09], I investigated how semi-supervised classification algorithms can be extended to ranking. Another solution is to exploit related data from different domains, i.e. domain adaptation or transfer learning [IPM11]. (Note: As of 2012, I have stopped active work on ranking in order to focus on other research areas; that said, ranking algorithms continue to serve as valuable tools for many of my problems.)
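One generic way to exploit unlabeled data, in the spirit of self-training (an illustration only, not the specific algorithms of [SIGIR08, CSL11, UW09]), is to let the current ranker label the item pairs it is confident about and feed those pseudo-labeled pairs back into training. The scorer, margin, and toy vectors below are assumptions.

    # Illustrative self-training step for semi-supervised ranking:
    # turn unlabeled item pairs into pseudo-labeled pairs when the
    # current ranker orders them with a confident margin.
    import numpy as np

    def confident_pseudo_pairs(score, unlabeled_pairs, margin=1.0):
        pseudo = []
        for x_i, x_j in unlabeled_pairs:
            gap = score(x_i) - score(x_j)
            if abs(gap) > margin:                 # keep only confident orderings
                pseudo.append((x_i, x_j) if gap > 0 else (x_j, x_i))
        return pseudo

    # Toy usage with a hand-set linear scorer (weights are assumptions).
    w = np.array([1.0, -0.5])
    extra = confident_pseudo_pairs(lambda x: float(w @ x),
                                   [(np.array([2.0, 0.0]), np.array([0.0, 2.0]))])
    # 'extra' can now be appended to the labeled pairs and the ranker retrained.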