Hierarchical Recurrent Aggregative Generation for Few-Shot NLG
Large pretrained models enable transfer learning to low-resource domains for language generation tasks. However, previous end-to-end approaches do not account for the fact that some generation sub-tasks, specifically aggregation and lexicalisation, can benefit from transfer learning to different extents. To exploit these varying potentials for transfer learning, we propose a new hierarchical approach for few-shot and zero-shot generation. Our approach consists of a jointly trained architecture with three modules: the first module independently lexicalises the distinct units of information in the input as sentence sub-units (e.g. phrases), the second module recurrently aggregates these sub-units to generate a unified intermediate output, and the third module then post-edits it to generate a coherent and fluent final text. We perform extensive empirical analysis and ablation studies on few-shot and zero-shot settings across 4 datasets. Automatic and human evaluations show that the proposed hierarchical approach is consistently capable of achieving state-of-the-art results when compared to previous work.
CrossAligner & Co: Zero-Shot Transfer Methods for Task-Oriented Cross-lingual Natural Language Understanding
Task-oriented personal assistants enable people to interact with a host of devices and services using natural language. One of the challenges of making neural dialogue systems available to more users is the lack of training data for all but a few languages. Zero-shot methods try to solve this issue by acquiring task knowledge in a high-resource language such as English with the aim of transferring it to the low-resource language(s). To this end, we introduce CrossAligner, the principal method of a variety of effective approaches for zero-shot cross-lingual transfer based on learning alignment from unlabelled parallel data. We present a quantitative analysis of individual methods as well as their weighted combinations, several of which exceed state-of-the-art (SOTA) scores as evaluated across nine languages, fifteen test sets and three benchmark multilingual datasets. A detailed qualitative error analysis of the best methods shows that our fine-tuned language models can zero-shot transfer the task knowledge better than anticipated.
XeroAlign: Zero-shot cross-lingual transformer alignment
The introduction of transformer-based cross-lingual language models brought decisive improvements to multilingual NLP tasks. However, the lack of labelled data has necessitated a variety of methods that aim to close the gap to high-resource languages. Zero-shot methods, in particular, often use translated task data as a training signal to bridge the performance gap between the source and target language(s). We introduce XeroAlign, a simple method for task-specific alignment of cross-lingual pretrained transformers such as XLM-R. XeroAlign uses translated task data to encourage the model to generate similar sentence embeddings for different languages. The XeroAligned XLM-R, called XLM-RA, shows strong improvements over the baseline models to achieve state-of-the-art zero-shot results on three multilingual natural language understanding tasks. XLM-RA performs on par with state-of-the-art models on a cross-lingual adversarial paraphrasing task and its text classification accuracy exceeds that of XLM-R trained with labelled data.
[paper] [sup] [repo]
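The alignment idea described in the XeroAlign abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a mean-squared-error distance between the sentence embedding of a source sentence and that of its translation, added to the task loss with a weight. The function names, the weight parameter and the toy vectors are hypothetical.

```python
from typing import Sequence


def alignment_loss(src_emb: Sequence[float], tgt_emb: Sequence[float]) -> float:
    """Mean squared error between two sentence embeddings (assumed distance)."""
    if len(src_emb) != len(tgt_emb):
        raise ValueError("embeddings must have the same dimensionality")
    return sum((s - t) ** 2 for s, t in zip(src_emb, tgt_emb)) / len(src_emb)


def total_loss(task_loss: float,
               src_emb: Sequence[float],
               tgt_emb: Sequence[float],
               weight: float = 1.0) -> float:
    """Task loss on the source sentence plus a weighted alignment term."""
    return task_loss + weight * alignment_loss(src_emb, tgt_emb)


if __name__ == "__main__":
    english = [0.0, 0.0]   # toy embedding of the source sentence
    german = [1.0, 1.0]    # toy embedding of its translation
    print(total_loss(0.5, english, german, weight=2.0))
```

Minimising the second term pulls the two embeddings together, so a classifier trained on the source language sees similar inputs for the target language.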
Enhancing Transformers with Gradient Boosted Decision Trees for NLI Fine-Tuning
Transfer learning has become the dominant paradigm for many natural language processing tasks. In addition to models being pretrained on large datasets, they can be further trained on intermediate (supervised) tasks that are similar to the target task. For small Natural Language Inference (NLI) datasets, language modelling is typically followed by pretraining on a large (labelled) NLI dataset before fine-tuning with each NLI subtask. In this work, we explore Gradient Boosted Decision Trees (GBDTs) as an alternative to the commonly used Multi-Layer Perceptron (MLP) classification head. GBDTs have desirable properties such as good performance on dense, numerical features and are effective where the ratio of the number of samples to the number of features is low. We then introduce FreeGBDT, a method of fitting a GBDT head on the features computed during fine-tuning to increase performance without additional computation by the neural network. We demonstrate the effectiveness of our method on several NLI datasets using a strong baseline model (RoBERTa-large with MNLI pretraining). The FreeGBDT shows a consistent improvement over the MLP classification head.
[paper] [repo]
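The idea of fitting a boosted-tree head on features that were already computed during fine-tuning can be sketched as follows. This is a toy illustration under stated assumptions, not the FreeGBDT implementation: the cached features are made-up two-dimensional vectors, the labels are binary, and hand-written boosted decision stumps with squared-error residuals stand in for a real GBDT library. All function names are hypothetical.

```python
def fit_stump(features, residuals):
    """Find the single-feature threshold split that best fits the residuals."""
    best = None
    for j in range(len(features[0])):
        values = sorted({row[j] for row in features})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [r for row, r in zip(features, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(features, residuals) if row[j] > threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, j, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("all features are constant; nothing to split on")
    return best[1:]


def stump_predict(stump, row):
    """Return the leaf value of a stump for one feature vector."""
    j, threshold, left_mean, right_mean = stump
    return left_mean if row[j] <= threshold else right_mean


def fit_boosted_head(features, labels, rounds=10, learning_rate=0.5):
    """Gradient boosting with squared loss: each stump fits the current residuals."""
    base = sum(labels) / len(labels)
    scores = [base] * len(labels)
    stumps = []
    for _ in range(rounds):
        residuals = [y - s for y, s in zip(labels, scores)]
        stump = fit_stump(features, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump_predict(stump, row)
                  for s, row in zip(scores, features)]
    return base, learning_rate, stumps


def predict(head, row):
    """Classify one feature vector with the boosted head (binary labels 0/1)."""
    base, learning_rate, stumps = head
    score = base + sum(learning_rate * stump_predict(s, row) for s in stumps)
    return 1 if score > 0.5 else 0


if __name__ == "__main__":
    # Stand-in for features cached from the network's forward passes.
    cached_features = [[0.1, 1.0], [0.2, 0.8], [0.9, 0.1], [0.8, 0.3]]
    cached_labels = [0, 0, 1, 1]
    head = fit_boosted_head(cached_features, cached_labels)
    print([predict(head, row) for row in cached_features])
```

Because the head is trained on features the network has already produced, the only extra cost in this sketch is fitting the stumps.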
Conversation Graph: Data Augmentation, Training, and Evaluation for Non-Deterministic Dialogue Management
Task-oriented dialogue systems typically rely on large amounts of high-quality training data or require complex handcrafted rules. However, existing datasets are often limited in size considering the complexity of the dialogues. Additionally, conventional training signal inference is not suitable for non-deterministic agent behavior, namely, considering multiple actions as valid in identical dialogue states. We propose the Conversation Graph (ConvGraph), a graph-based representation of dialogues that can be exploited for data augmentation, multi-reference training and evaluation of non-deterministic agents. ConvGraph generates novel dialogue paths to augment data volume and diversity. Intrinsic and extrinsic evaluation across three datasets shows that data augmentation and/or multi-reference training with ConvGraph can improve dialogue success rates by up to 6.4%.
[paper] [repo]
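A graph-based view of dialogues, as described in the ConvGraph abstract, can be sketched with plain dictionaries. This is a simplified illustration, not the released ConvGraph code: it assumes each dialogue is a list of dialogue-act labels, that identical labels map to the same node, and that only acyclic paths are generated. The labels and function names are hypothetical.

```python
from collections import defaultdict


def build_graph(dialogues):
    """Map each node to the set of nodes observed directly after it."""
    graph = defaultdict(set)
    for dialogue in dialogues:
        for current, following in zip(dialogue, dialogue[1:]):
            graph[current].add(following)
    return graph


def enumerate_paths(graph, start, end, max_len=10):
    """Depth-first enumeration of acyclic paths from start to end."""
    paths = []

    def walk(path):
        node = path[-1]
        if node == end:
            paths.append(tuple(path))
            return
        if len(path) >= max_len:
            return
        for following in sorted(graph.get(node, ())):
            if following not in path:
                walk(path + [following])

    walk([start])
    return paths


def augment(dialogues, start, end):
    """Return graph paths that do not occur in the original data."""
    seen = {tuple(d) for d in dialogues}
    graph = build_graph(dialogues)
    return [p for p in enumerate_paths(graph, start, end) if p not in seen]


if __name__ == "__main__":
    data = [
        ["greet", "ask_area", "inform_area", "offer", "bye"],
        ["greet", "ask_food", "inform_area", "book", "bye"],
    ]
    for path in augment(data, "greet", "bye"):
        print(" -> ".join(path))
    # Multi-reference targets: every action observed after the same node is valid.
    print(sorted(build_graph(data)["inform_area"]))
```

The successor sets double as multi-reference targets: an agent that picks any action in the set for a given node is treated as correct, rather than being compared with a single reference action.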
Improving Commonsense Causal Reasoning by Adversarial Training and Data Augmentation
Determining the plausibility of causal relations between clauses is a commonsense reasoning task that requires complex inference ability. The general approach to this task is to train a large pretrained language model on a specific dataset. However, the available training data for the task is often scarce, which leads to instability of model training or reliance on the shallow features of the dataset. This paper presents a number of techniques for making models more robust in the domain of causal reasoning. Firstly, we perform adversarial training by generating perturbed inputs through synonym substitution. Secondly, based on a linguistic theory of discourse connectives, we perform data augmentation using a discourse parser for detecting causally linked clauses in large text, and a generative language model for generating distractors. Both methods boost model performance on the Choice of Plausible Alternatives (COPA) dataset, as well as on a Balanced COPA dataset, which is a modified version of the original data that has been developed to avoid superficial cues, leading to a more challenging benchmark. We show a statistically significant improvement in performance and robustness on both datasets, even with only a small number of additionally generated data points.
[paper]
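Generating perturbed inputs through synonym substitution can be sketched as below. This is an assumption-laden toy, not the paper's pipeline: the synonym table is hand-written, each variant replaces a single token, and the "hardest" variant is chosen by a caller-supplied loss function (here a placeholder). All names are hypothetical.

```python
def synonym_perturbations(sentence, synonyms):
    """Return every variant of the sentence with exactly one token replaced."""
    tokens = sentence.split()
    variants = []
    for i, token in enumerate(tokens):
        for synonym in synonyms.get(token.lower(), []):
            variants.append(" ".join(tokens[:i] + [synonym] + tokens[i + 1:]))
    return variants


def hardest_variant(sentence, synonyms, loss_fn):
    """Pick the perturbed input with the highest loss; fall back to the original."""
    variants = synonym_perturbations(sentence, synonyms)
    if not variants:
        return sentence
    return max(variants, key=loss_fn)


if __name__ == "__main__":
    table = {"man": ["guy"], "broke": ["fractured", "cracked"]}  # toy synonym table
    premise = "the man broke his toe"
    for variant in synonym_perturbations(premise, table):
        print(variant)
    # len is only a placeholder for a real model loss.
    print(hardest_variant(premise, table, loss_fn=len))
```

In an adversarial training loop, the selected variant would be added to the batch with the label of the original example.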
Compositional and Lexical Semantics in RoBERTa, BERT and DistilBERT: A Case Study on CoQA
Many NLP tasks have benefited from transferring knowledge from contextualized word embeddings; however, the picture of what type of knowledge is transferred is incomplete. This paper studies the types of linguistic phenomena accounted for by language models in the context of a Conversational Question Answering (CoQA) task. We identify the problematic areas for the fine-tuned RoBERTa, BERT and DistilBERT models through systematic error analysis – basic arithmetic (counting phrases), compositional semantics (negation and Semantic Role Labeling), and lexical semantics (surprisal and antonymy). When enhanced with the relevant linguistic knowledge through multitask learning, the models improve in performance. Ensembles of the enhanced models yield a boost between 2.2 and 2.7 points in F1 score overall, and up to 42.1 points in F1 on the hardest question classes. The results show differences in ability to represent compositional and lexical information between RoBERTa, BERT and DistilBERT.
Auxiliary Capsules for Natural Language Understanding
Lately, joint training of Intent detection and Slot filling has become the best-performing approach in the field of Natural Language Understanding (NLU). In this work, we extend the newly introduced application of Capsule Networks for NLU to a multi-task learning environment, using relevant auxiliary tasks. Specifically, our models perform joint Intent classification and Slot filling with the aid of Named Entity Recognition (NER) and Part of Speech (POS) tagging tasks. This allows us to exploit the hierarchical relationships between the Intents of the utterances and the different features of input text, not only Slots but also Named Entity mentions, Parts of Speech, quantity indications, etc. The models developed in this work are evaluated on standard benchmarks, achieving state-of-the-art results on the SNIPS dataset while outperforming the best commercial systems on several low-resource datasets.
Show Us the Way: Learning to Manage Dialog from Demonstrations
We present our submission to the End-to-End Multi-Domain Dialog Challenge Track of the Eighth Dialog System Technology Challenge. Our proposed dialog system adopts a pipeline architecture, with distinct components for Natural Language Understanding, Dialog State Tracking, Dialog Management and Natural Language Generation. At the core of our system is a reinforcement learning algorithm which uses Deep Q-learning from Demonstrations to learn a dialog policy with the help of expert examples. We find that demonstrations are essential to training an accurate dialog policy where both state and action spaces are large. Evaluation of our Dialog Management component shows that our approach is effective – beating supervised and reinforcement learning baselines.
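The role of expert examples in Deep Q-learning from Demonstrations can be illustrated with its large-margin supervised term, which pushes the Q-value of the demonstrated action above all others by at least a margin. The sketch below is a minimal illustration of that term alone, not the submitted system: the Q-values are toy numbers, and the margin, the loss weight and the function names are assumptions.

```python
def large_margin_loss(q_values, expert_action, margin=0.8):
    """max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E), with l = 0 for the expert action."""
    augmented = [
        q + (0.0 if action == expert_action else margin)
        for action, q in enumerate(q_values)
    ]
    return max(augmented) - q_values[expert_action]


def combined_loss(td_loss, q_values, expert_action, supervised_weight=1.0, margin=0.8):
    """Temporal-difference loss plus the weighted demonstration term."""
    return td_loss + supervised_weight * large_margin_loss(q_values, expert_action, margin)


if __name__ == "__main__":
    q = [1.0, 2.0, 0.5]  # toy Q-values for three dialog actions
    print(large_margin_loss(q, expert_action=1))  # expert action already leads
    print(large_margin_loss(q, expert_action=0))  # expert action is under-valued
```

The term is zero only when the expert action leads every other action by at least the margin, which gives the policy a useful signal even for actions it has never tried in a large action space.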
Audio Visual Scene-Aware Dialog System Using Dynamic Memory Networks
The audio visual scene-aware dialog (AVSD) task, proposed as one of the tracks in the Eighth Dialog System Technology Challenge (DSTC8), is a multimodal dialog task which aims to automatically generate a response to an input question about the content of a video clip in the context of a given dialog. In this paper, we propose for this task a number of models that are based on dynamic memory networks (DMNs). Compared to the baseline model released by the AVSD organizers, our DMN-based AVSD model with single modality achieves performance improvements of more than 4.2% in the BLEU-4 score and 18.1% in the CIDEr score, demonstrating the effectiveness of DMNs for encoding long-term context information in dialog tasks. We also present a multimodal variant of the DMN-based model which incorporates all modalities.