Latest Advances of Neural Language Models
杨植麟 (Zhilin Yang), Recurrent AI

Part I: About Me

Self-introduction
- Co-founder of Recurrent AI; previously worked at Google Brain and Facebook AI Research
- Co-authored papers with several Turing Award winners
- Achieved state-of-the-art results on more than 30 datasets in natural language understanding, semi-supervised learning, and related areas
- B.S. from Tsinghua University (2015); Ph.D. from Carnegie Mellon University (2019), advised by Ruslan Salakhutdinov (head of AI research at Apple)

Main research achievements
- XLNet (NeurIPS 2019): outperforms Google BERT on 20 datasets; the best pretrained model to date given standard FLOPs; NeurIPS oral presentation (top 0.5%); covered by dozens of AI media outlets
- Transformer-XL (ACL 2019): set new state of the art on all mainstream language modeling benchmarks; the first attention model to surpass LSTMs at both the word level and the character level; can coherently generate text thousands of words long
- HotpotQA (EMNLP 2018): a multi-hop reasoning dataset used for model evaluation by Stanford, the University of Washington, UT Austin, Tsinghua University, ByteDance, JD, Microsoft, and other institutions
- Semi-supervised graph learning (ICML 2016): 400+ citations; popularized the standard datasets of the graph learning field; adopted as a standard baseline by hundreds of follow-up works
Part II: XLNet

Learning from Unlabeled Data
- Unlabeled data: abundant (1000x more), accessible
- Labeled data: scarce, expensive

Unsupervised Pretraining
- Unlabeled data + labeled data + algorithms/models → improve over supervised learning

Related Work
- RBMs (Salakhutdinov et al 2007), Autoencoders (Vincent et al 2008), Jigsaw (Noroozi and Favaro 2016), GANs (Donahue and Simonyan 2019)
- word2vec (Mikolov et al 2013), GloVe (Pennington et al 2014)
- Semi-supervised sequence learning (Dai and Le 2015), ELMo (Peters et al 2017), CoVe (McCann et al 2017), GPT (Radford et al 2018), BERT (Devlin et al 2018)
Two Objectives for Pretraining
- Auto-regressive (AR) language modeling: a unidirectional Transformer predicts each token from its left context, e.g. generating "New York is a city" token by token. Drawback: not able to model bidirectional context.
- (Denoising) auto-encoding (AE): a bidirectional Transformer reconstructs masked tokens, e.g. predicting "New York" from "[MASK] [MASK] is a city". Drawbacks: the predicted tokens are independent of each other, and [MASK] is not used during finetuning.
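The contrast can be made concrete with a small illustration. The snippet below is only a toy sketch (the function names and the example sentence are ours, not from any library): it lists, for each predicted token, the context it is conditioned on under an AR objective versus a BERT-style denoising AE objective.

```python
# Toy sketch: what each objective conditions on when predicting a token.
# Purely illustrative; no model is involved.

def ar_prediction_contexts(tokens):
    """Auto-regressive LM: token t is predicted from its left context only."""
    return [(tokens[:t], tokens[t]) for t in range(len(tokens))]

def ae_prediction_contexts(tokens, masked_positions):
    """Denoising AE (BERT-style): masked tokens are predicted from the corrupted
    sequence, and the predictions are independent of each other."""
    corrupted = ["[MASK]" if i in masked_positions else tok
                 for i, tok in enumerate(tokens)]
    return [(corrupted, tokens[i]) for i in sorted(masked_positions)]

tokens = ["New", "York", "is", "a", "city"]
print(ar_prediction_contexts(tokens))          # unidirectional contexts
print(ae_prediction_contexts(tokens, {0, 1}))  # bidirectional context, independent targets
```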
New Objective: Permutation Language Modeling
- Sample a factorization order
- Determine the attention masks based on the order
- Optimize a standard language modeling objective
- Benefits: autoregressive, avoiding the disadvantages of AE; able to model bidirectional context

Examples
- Factorization order "New York is a city":
  P(New York is a city) = P(New) * P(York | New) * P(is | New York) * P(a | New York is) * P(city | New York is a)
- Factorization order "city a is New York":
  P(New York is a city) = P(city) * P(a | city) * P(is | city a) * P(New | city a is) * P(York | city a is New)
- The sequence order is not shuffled; only the attention masks are changed to reflect the factorization order.

[Figure: attention masks (and memory) over tokens x1..x4 for the factorization orders 3 2 4 1, 1 4 2 3, 2 4 3 1, and 4 3 1 2.]
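As a rough sketch of the "sample an order, change only the masks" idea above, the snippet below builds content-stream and query-stream masks from a random factorization order. It is a toy, not the actual XLNet code: segment-level memory and relative positions are omitted.

```python
import numpy as np

def permutation_masks(seq_len, rng):
    """Sample a factorization order and derive attention masks from it.
    The token sequence itself is never reordered; only the masks change."""
    order = rng.permutation(seq_len)          # e.g. [2, 1, 3, 0] ~ order "3 2 4 1"
    step = np.empty(seq_len, dtype=int)
    step[order] = np.arange(seq_len)          # step[i] = when position i is predicted
    # content mask: position i may attend to j if j is predicted no later than i
    content_mask = step[None, :] <= step[:, None]
    # query mask: strictly earlier only (a position must not see its own content)
    query_mask = step[None, :] < step[:, None]
    return order, content_mask, query_mask

order, content_mask, query_mask = permutation_masks(4, np.random.default_rng(0))
print("factorization order (1-indexed):", order + 1)
print(content_mask.astype(int))
print(query_mask.astype(int))
```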
Comparing XLNet and BERT objectives
- BERT objective (auto-encoding): "New" and "York" are predicted independently of each other.
- XLNet objective (auto-regressive): able to model the dependency between "New" and "York"; able to model bidirectional context; factorizes the joint probability using a product rule that holds universally.
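In the notation of the XLNet paper, the two objectives can be written roughly as follows, where \hat{\mathbf{x}} is the corrupted input, m_t marks the masked positions, and \mathcal{Z}_T is the set of permutations of length T:

```latex
% BERT (denoising auto-encoding): masked tokens are reconstructed independently
\max_{\theta}\;\sum_{t=1}^{T} m_t \,\log p_{\theta}\!\left(x_t \mid \hat{\mathbf{x}}\right)

% XLNet (permutation language modeling): autoregressive under a sampled order z
\max_{\theta}\;\mathbb{E}_{\mathbf{z}\sim\mathcal{Z}_T}
\left[\sum_{t=1}^{T}\log p_{\theta}\!\left(x_{z_t}\mid \mathbf{x}_{\mathbf{z}_{<t}}\right)\right]
```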
Standard Parameterization
- The hidden state h does not contain the position of the target: the model "stands at" a context and predicts a token without knowing which position it is predicting.
- The objective is thereby reduced to predicting a bag of words.

Reparameterization
- Solution: condition the predictive distribution on the target position.
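Following the XLNet paper's notation, the point can be written out explicitly: the standard softmax uses a representation h_θ of the context only, whereas the reparameterized version feeds the target position z_t into a new representation g_θ:

```latex
% standard parameterization: the distribution does not depend on the target position z_t
p_{\theta}\!\left(X_{z_t}=x \mid \mathbf{x}_{\mathbf{z}_{<t}}\right)
 = \frac{\exp\!\left(e(x)^{\top} h_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}})\right)}
        {\sum_{x'} \exp\!\left(e(x')^{\top} h_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}})\right)}

% reparameterization: additionally condition on the target position z_t
p_{\theta}\!\left(X_{z_t}=x \mid \mathbf{x}_{\mathbf{z}_{<t}}\right)
 = \frac{\exp\!\left(e(x)^{\top} g_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}}, z_t)\right)}
        {\sum_{x'} \exp\!\left(e(x')^{\top} g_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}}, z_t)\right)}
```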
How to Formulate Features
- Let h_i^(l) denote the feature of the i-th token on layer l, and suppose the factorization order is 3 2 4 1.
- When predicting token 2, its feature cannot see token 2 itself, otherwise the prediction is trivial: the feature should not encode the content of x_2, only its position.
- When predicting token 4, however, the feature should encode the content of x_2, and token 4 only has access to token 2 in the first layer!
- A single set of features cannot satisfy both requirements at once.

[Figure: layer-by-layer features h_1^(1) ... h_4^(2) under the factorization order 3 2 4 1, showing the conflicting requirements when predicting tokens 2 and 4.]
Two-Stream Attention (factorization order: 3, 2, 4, 1)
- Content stream h: queries, keys, and values all come from h, and each position can see itself.
- Query stream g: queries come from g, while keys and values come from h, and each position cannot see its own content.
- At the first layer, h is the word embedding and g is a trainable parameter.
- Only h is used during finetuning; the last-layer g is used for optimizing the LM loss.

[Figure: content-stream and query-stream attention (Q vs K, V) across layers for the factorization order 3, 2, 4, 1.]
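A minimal single-layer sketch of the two streams, using the factorization order 3 2 4 1 from the slide. This is a NumPy toy, not the actual XLNet implementation: relative positional encodings, multiple heads, and the segment-level memory (which the query stream would attend to at the first step of the order) are all omitted.

```python
import numpy as np

def attention(q, kv, mask):
    """Single-head scaled dot-product attention; mask[i, j] says whether
    query position i may attend to key position j."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ kv

def two_stream_layer(h, g, content_mask, query_mask):
    """Content stream: Q, K, V all from h; a position can see itself.
    Query stream: Q from g, K and V from h; a position cannot see itself."""
    return attention(h, h, content_mask), attention(g, h, query_mask)

# masks for the factorization order 3 -> 2 -> 4 -> 1 (1-indexed positions)
order = np.array([3, 2, 4, 1]) - 1
step = np.empty(4, dtype=int); step[order] = np.arange(4)
content_mask = step[None, :] <= step[:, None]
query_mask = step[None, :] < step[:, None]

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))                    # first layer: word embeddings
g = np.tile(rng.normal(size=(1, 8)), (4, 1))   # first layer: shared trainable vector
h, g = two_stream_layer(h, g, content_mask, query_mask)
print(h.shape, g.shape)  # (4, 8) (4, 8)
```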
Summarizing XLNet: challenges and solutions
- Challenge: independence assumption and distribution discrepancy in BERT → Solution: permutation language modeling
- Challenge: the standard parameterization is reduced to bag-of-words → Solution: reparameterization with positions
- Challenge: contradiction between predicting self and predicting others → Solution: two-stream attention

Experiment 1: Comparison with BERT
- Same training data as BERT: Wikipedia + BooksCorpus
- Same pretraining hyperparameters as BERT: model size L=24, H=1024, A=16; batch size 256; 1M steps
- Same hyperparameter search space for finetuning as BERT
- Result: XLNet outperforms BERT on 20 tasks (we report the best of 3 BERT variants), with almost identical training recipes.

Experiment 2: Comparison with RoBERTa
- Less training data for XLNet: 126GB vs 160GB
- Same pretraining hyperparameters as RoBERTa: model size L=24, H=1024, A=16; batch size 8192; 500K steps
- Same hyperparameter search space for finetuning as RoBERTa
- Result: XLNet outperforms RoBERTa on all considered tasks, with almost identical training recipes.

XLNet is the best pretrained model today given standard FLOPs.

[Figure: accuracy vs. FLOPs for BERT-Large, RoBERTa, XLNet, ALBERT, and T5 at 1x, 4x, and 16x the standard FLOPs.]
Part III: Research Plan

Research Proposal
- Challenge: XLNet and similar methods still rely on a large amount of labeled data for target tasks
- Goal: improve the data efficiency of the pretraining-finetuning paradigm
- Directions: pretraining + meta learning; pretraining + multi-view integration

Meta Learning: Background (Chen et al 2019)

Pretraining + Meta Learning
- Main idea: a meta learning paradigm for finetuning (see the sketch after this list)
- Why it might work: learning to compare a novel instance against memory
- Goal: reduce sample complexity and improve data efficiency
- Technical novelties and challenges: a meta learning algorithm that works with dozens/hundreds of examples and a pretrained model
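One way the "compare against memory" idea could be instantiated is a prototypical-network-style head on top of pretrained features. Everything below is hypothetical: `encode` is a random stand-in for a pooled XLNet representation, and the example texts and class names are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(texts, dim=16):
    """Placeholder for a pretrained sentence encoder (e.g. pooled XLNet features)."""
    return rng.normal(size=(len(texts), dim))

def build_memory(support_texts, support_labels):
    """One 'memory' vector (prototype) per class: the mean of its support embeddings."""
    emb = encode(support_texts)
    classes = sorted(set(support_labels))
    protos = np.stack([emb[np.array(support_labels) == c].mean(axis=0) for c in classes])
    return classes, protos

def classify(query_texts, classes, protos):
    """Label each query with the class of its nearest prototype."""
    q = encode(query_texts)
    dists = ((q[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return [classes[i] for i in dists.argmin(axis=-1)]

classes, protos = build_memory(
    ["I want my money back", "how much is the premium plan", "hello there"],
    ["complaint", "sales", "other"])
print(classify(["what does it cost per month"], classes, protos))
```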
Data often have multiple views (features)
- Example 1: we can use XLNet to learn to classify sales calls/texts; meanwhile, there are also structured data stored in databases. Question: how to combine the two views?
- Example 2: we can use XLNet to learn to classify medical texts; meanwhile, there is another black-box model trained on medical imaging data. Question: how to combine the two views?

Naive Approach: Shallow Mixing
- Extra views do not change the text representations
- Not expressive enough to capture the dependency between views
Proposed Approach: Deep Integration
- Deep integration better models the dependency among views! (a sketch follows this list)
- Challenge: a pretrained model normally only takes text as input
- Solution 1: turn extra views into text-like representations
- Solution 2: add additional structures to XLNet to incorporate extra views
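The snippet below sketches the difference between the two approaches in the spirit of Solution 1. It is hypothetical: `embed` and `encoder` are random stand-ins for the pretrained embedding layer and Transformer, and `proj` is an assumed learned projection from a 3-field structured view into the embedding space.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16
proj = rng.normal(size=(3, DIM))                      # learned view-to-embedding projection

def embed(tokens):
    """Placeholder word embeddings."""
    return rng.normal(size=(len(tokens), DIM))

def encoder(x):
    """Placeholder for the pretrained Transformer (e.g. XLNet) layers."""
    return np.tanh(x @ rng.normal(size=(DIM, DIM)))

def shallow_mixing(tokens, extra_view):
    """Naive approach: the extra view is appended after encoding,
    so it can never change the text representation itself."""
    text_repr = encoder(embed(tokens)).mean(axis=0)
    return np.concatenate([text_repr, extra_view])    # fed to a shallow classifier

def deep_integration(tokens, extra_view):
    """Solution 1: turn the extra view into a token-like embedding and let the
    encoder attend over text tokens and the view token jointly."""
    view_token = (extra_view @ proj)[None, :]
    joint = np.concatenate([embed(tokens), view_token], axis=0)
    return encoder(joint).mean(axis=0)

view = np.array([0.2, 1.0, 3.5])                      # e.g. three structured CRM fields
print(shallow_mixing(["customer", "asked", "about", "pricing"], view).shape)    # (19,)
print(deep_integration(["customer", "asked", "about", "pricing"], view).shape)  # (16,)
```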
Research Plan Highlights
- Goal: improve the data efficiency of XLNet-like methods
- Proposed methods:
  - Pretraining + meta learning
  - Pretraining + multi-view learning: turn extra views into text-like representations; add additional structures to XLNet to incorporate extra views
- Datasets and experimental settings:
  - In-house text classification datasets extracted from sales calls
  - Billions of unlabeled sentences; a few thousand labeled sentences per class
  - Multiple domains
  - Evaluation metric: F1 score

Thanks!
杨植麟 (Zhilin Yang), Recurrent AI