Latest Advances of Neural Language Models
杨植麟 (Zhilin Yang), Recurrent AI

Part I: About Me

Self-introduction
- Co-founder of Recurrent AI; previously worked at Google Brain and Facebook AI Research
- Co-authored papers with several Turing Award winners
- Achieved state-of-the-art results on more than 30 datasets in natural language understanding, semi-supervised learning, and related areas
- B.S. from Tsinghua University (2015); Ph.D. from Carnegie Mellon University (2019), advised by Ruslan Salakhutdinov (head of AI research at Apple)

Main research achievements
- XLNet (NeurIPS 2019): outperforms Google BERT on 20 datasets; the best pretrained model to date given standard FLOPs; NeurIPS oral presentation (top 0.5%); covered by dozens of AI media outlets
- Transformer-XL (ACL 2019): set new state of the art on all mainstream language modeling benchmarks; the first attention model to surpass LSTMs at both the word level and the character level; can coherently generate text thousands of words long
- HotpotQA (EMNLP 2018): a multi-hop reasoning dataset used for model evaluation by Stanford, the University of Washington, UT Austin, Tsinghua University, ByteDance, JD, Microsoft, and other institutions
- Semi-supervised graph learning (ICML 2016): 400+ citations; popularized the standard datasets of the graph learning field; adopted as a standard baseline by hundreds of follow-up works
Part II: XLNet

Learning from Unlabeled Data
- Unlabeled data: abundant (1000x more), accessible
- Labeled data: scarce, expensive

Unsupervised Pretraining
- Unlabeled data + labeled data + algorithms/models → improve over supervised learning

Related Work
- RBMs (Salakhutdinov et al 2007), Autoencoders (Vincent et al 2008), Jigsaw (Noroozi and Favaro 2016), GANs (Donahue and Simonyan 2019)
- word2vec (Mikolov et al 2013), GloVe (Pennington et al 2014)
- Semi-supervised sequence learning (Dai and Le 2015), ELMo (Peters et al 2017), CoVe (McCann et al 2017), GPT (Radford et al 2018), BERT (Devlin et al 2018)
Two Objectives for Pretraining
- Auto-regressive (AR) language modeling: a unidirectional Transformer predicts each token from its left context, e.g. generating "New York is a city" token by token. Drawback: not able to model bidirectional context.
- (Denoising) auto-encoding (AE): a bidirectional Transformer reconstructs masked tokens, e.g. predicting "New York" from "[MASK] [MASK] is a city". Drawbacks: the predicted tokens are independent of each other, and [MASK] is not used during finetuning.
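The contrast can be made concrete with a small illustration. The snippet below is only a toy sketch (the function names and the example sentence are ours, not from any library): it lists, for each predicted token, the context it is conditioned on under an AR objective versus a BERT-style denoising AE objective.

```python
# Toy sketch: what each objective conditions on when predicting a token.
# Purely illustrative; no model is involved.

def ar_prediction_contexts(tokens):
    """Auto-regressive LM: token t is predicted from its left context only."""
    return [(tokens[:t], tokens[t]) for t in range(len(tokens))]

def ae_prediction_contexts(tokens, masked_positions):
    """Denoising AE (BERT-style): masked tokens are predicted from the corrupted
    sequence, and the predictions are independent of each other."""
    corrupted = ["[MASK]" if i in masked_positions else tok
                 for i, tok in enumerate(tokens)]
    return [(corrupted, tokens[i]) for i in sorted(masked_positions)]

tokens = ["New", "York", "is", "a", "city"]
print(ar_prediction_contexts(tokens))          # unidirectional contexts
print(ae_prediction_contexts(tokens, {0, 1}))  # bidirectional context, independent targets
```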
New Objective: Permutation Language Modeling
- Sample a factorization order
- Determine the attention masks based on the order
- Optimize a standard language modeling objective
- Benefits: autoregressive, avoiding the disadvantages of AE; able to model bidirectional context

Examples
- Factorization order "New York is a city":
  P(New York is a city) = P(New) * P(York | New) * P(is | New York) * P(a | New York is) * P(city | New York is a)
- Factorization order "city a is New York":
  P(New York is a city) = P(city) * P(a | city) * P(is | city a) * P(New | city a is) * P(York | city a is New)
- The sequence order is not shuffled; only the attention masks are changed to reflect the factorization order.

[Figure: attention masks (and memory) over tokens x1..x4 for the factorization orders 3 2 4 1, 1 4 2 3, 2 4 3 1, and 4 3 1 2.]
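As a rough sketch of the "sample an order, change only the masks" idea above, the snippet below builds content-stream and query-stream masks from a random factorization order. It is a toy, not the actual XLNet code: segment-level memory and relative positions are omitted.

```python
import numpy as np

def permutation_masks(seq_len, rng):
    """Sample a factorization order and derive attention masks from it.
    The token sequence itself is never reordered; only the masks change."""
    order = rng.permutation(seq_len)          # e.g. [2, 1, 3, 0] ~ order "3 2 4 1"
    step = np.empty(seq_len, dtype=int)
    step[order] = np.arange(seq_len)          # step[i] = when position i is predicted
    # content mask: position i may attend to j if j is predicted no later than i
    content_mask = step[None, :] <= step[:, None]
    # query mask: strictly earlier only (a position must not see its own content)
    query_mask = step[None, :] < step[:, None]
    return order, content_mask, query_mask

order, content_mask, query_mask = permutation_masks(4, np.random.default_rng(0))
print("factorization order (1-indexed):", order + 1)
print(content_mask.astype(int))
print(query_mask.astype(int))
```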
Comparing XLNet and BERT objectives
- BERT objective (auto-encoding): "New" and "York" are predicted independently of each other.
- XLNet objective (auto-regressive): able to model the dependency between "New" and "York"; able to model bidirectional context; factorizes the joint probability using a product rule that holds universally.
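In the notation of the XLNet paper, the two objectives can be written roughly as follows, where \hat{\mathbf{x}} is the corrupted input, m_t marks the masked positions, and \mathcal{Z}_T is the set of permutations of length T:

```latex
% BERT (denoising auto-encoding): masked tokens are reconstructed independently
\max_{\theta}\;\sum_{t=1}^{T} m_t \,\log p_{\theta}\!\left(x_t \mid \hat{\mathbf{x}}\right)

% XLNet (permutation language modeling): autoregressive under a sampled order z
\max_{\theta}\;\mathbb{E}_{\mathbf{z}\sim\mathcal{Z}_T}
\left[\sum_{t=1}^{T}\log p_{\theta}\!\left(x_{z_t}\mid \mathbf{x}_{\mathbf{z}_{<t}}\right)\right]
```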
Standard Parameterization
- The hidden state h does not contain the position of the target: the model "stands at" a context and predicts a token without knowing which position it is predicting.
- The objective is thereby reduced to predicting a bag of words.

Reparameterization
- Solution: condition the predictive distribution on the target position.
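Following the XLNet paper's notation, the point can be written out explicitly: the standard softmax uses a representation h_θ of the context only, whereas the reparameterized version feeds the target position z_t into a new representation g_θ:

```latex
% standard parameterization: the distribution does not depend on the target position z_t
p_{\theta}\!\left(X_{z_t}=x \mid \mathbf{x}_{\mathbf{z}_{<t}}\right)
 = \frac{\exp\!\left(e(x)^{\top} h_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}})\right)}
        {\sum_{x'} \exp\!\left(e(x')^{\top} h_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}})\right)}

% reparameterization: additionally condition on the target position z_t
p_{\theta}\!\left(X_{z_t}=x \mid \mathbf{x}_{\mathbf{z}_{<t}}\right)
 = \frac{\exp\!\left(e(x)^{\top} g_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}}, z_t)\right)}
        {\sum_{x'} \exp\!\left(e(x')^{\top} g_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}}, z_t)\right)}
```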
How to Formulate Features
- Let h_i^(l) denote the feature of the i-th token on layer l, and suppose the factorization order is 3 2 4 1.
- When predicting token 2, its feature cannot see token 2 itself, otherwise the prediction is trivial: the feature should not encode the content of x_2, only its position.
- When predicting token 4, however, the feature should encode the content of x_2, and token 4 only has access to token 2 in the first layer!
- A single set of features cannot satisfy both requirements at once.

[Figure: layer-by-layer features h_1^(1) ... h_4^(2) under the factorization order 3 2 4 1, showing the conflicting requirements when predicting tokens 2 and 4.]
Two-Stream Attention (factorization order: 3, 2, 4, 1)
- Content stream h: queries, keys, and values all come from h, and each position can see itself.
- Query stream g: queries come from g, while keys and values come from h, and each position cannot see its own content.
- At the first layer, h is the word embedding and g is a trainable parameter.
- Only h is used during finetuning; the last-layer g is used for optimizing the LM loss.

[Figure: content-stream and query-stream attention (Q vs K, V) across layers for the factorization order 3, 2, 4, 1.]
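A minimal single-layer sketch of the two streams, using the factorization order 3 2 4 1 from the slide. This is a NumPy toy, not the actual XLNet implementation: relative positional encodings, multiple heads, and the segment-level memory (which the query stream would attend to at the first step of the order) are all omitted.

```python
import numpy as np

def attention(q, kv, mask):
    """Single-head scaled dot-product attention; mask[i, j] says whether
    query position i may attend to key position j."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ kv

def two_stream_layer(h, g, content_mask, query_mask):
    """Content stream: Q, K, V all from h; a position can see itself.
    Query stream: Q from g, K and V from h; a position cannot see itself."""
    return attention(h, h, content_mask), attention(g, h, query_mask)

# masks for the factorization order 3 -> 2 -> 4 -> 1 (1-indexed positions)
order = np.array([3, 2, 4, 1]) - 1
step = np.empty(4, dtype=int); step[order] = np.arange(4)
content_mask = step[None, :] <= step[:, None]
query_mask = step[None, :] < step[:, None]

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))                    # first layer: word embeddings
g = np.tile(rng.normal(size=(1, 8)), (4, 1))   # first layer: shared trainable vector
h, g = two_stream_layer(h, g, content_mask, query_mask)
print(h.shape, g.shape)  # (4, 8) (4, 8)
```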
Summarizing XLNet: challenges and solutions
- Challenge: independence assumption and distribution discrepancy in BERT → Solution: permutation language modeling
- Challenge: the standard parameterization is reduced to bag-of-words → Solution: reparameterization with positions
- Challenge: contradiction between predicting self and predicting others → Solution: two-stream attention

Experiment 1: Comparison with BERT
- Same training data as BERT: Wikipedia + BooksCorpus
- Same pretraining hyperparameters as BERT: model size L=24, H=1024, A=16; batch size 256; 1M steps
- Same hyperparameter search space for finetuning as BERT
- Result: XLNet outperforms BERT on 20 tasks (we report the best of 3 BERT variants), with almost identical training recipes.

Experiment 2: Comparison with RoBERTa
- Less training data for XLNet: 126GB vs 160GB
- Same pretraining hyperparameters as RoBERTa: model size L=24, H=1024, A=16; batch size 8192; 500K steps
- Same hyperparameter search space for finetuning as RoBERTa
- Result: XLNet outperforms RoBERTa on all considered tasks, with almost identical training recipes.

XLNet is the best pretrained model today given standard FLOPs.

[Figure: accuracy vs. FLOPs for BERT-Large, RoBERTa, XLNet, ALBERT, and T5 at 1x, 4x, and 16x the standard FLOPs.]
Part III: Research Plan

Research Proposal
- Challenge: XLNet and similar methods still rely on a large amount of labeled data for target tasks
- Goal: improve the data efficiency of the pretraining-finetuning paradigm
- Directions: pretraining + meta learning; pretraining + multi-view integration

Meta Learning: Background (Chen et al 2019)

Pretraining + Meta Learning
- Main idea: a meta learning paradigm for finetuning (see the sketch after this list)
- Why it might work: learning to compare a novel instance against memory
- Goal: reduce sample complexity and improve data efficiency
- Technical novelties and challenges: a meta learning algorithm that works with dozens/hundreds of examples and a pretrained model
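One way the "compare against memory" idea could be instantiated is a prototypical-network-style head on top of pretrained features. Everything below is hypothetical: `encode` is a random stand-in for a pooled XLNet representation, and the example texts and class names are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(texts, dim=16):
    """Placeholder for a pretrained sentence encoder (e.g. pooled XLNet features)."""
    return rng.normal(size=(len(texts), dim))

def build_memory(support_texts, support_labels):
    """One 'memory' vector (prototype) per class: the mean of its support embeddings."""
    emb = encode(support_texts)
    classes = sorted(set(support_labels))
    protos = np.stack([emb[np.array(support_labels) == c].mean(axis=0) for c in classes])
    return classes, protos

def classify(query_texts, classes, protos):
    """Label each query with the class of its nearest prototype."""
    q = encode(query_texts)
    dists = ((q[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return [classes[i] for i in dists.argmin(axis=-1)]

classes, protos = build_memory(
    ["I want my money back", "how much is the premium plan", "hello there"],
    ["complaint", "sales", "other"])
print(classify(["what does it cost per month"], classes, protos))
```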
Data often have multiple views (features)
- Example 1: we can use XLNet to learn to classify sales calls/texts; meanwhile, there are also structured data stored in databases. Question: how to combine the two views?
- Example 2: we can use XLNet to learn to classify medical texts; meanwhile, there is another black-box model trained on medical imaging data. Question: how to combine the two views?

Naive Approach: Shallow Mixing
- Extra views do not change the text representations
- Not expressive enough to capture the dependency between views
Proposed Approach: Deep Integration
- Deep integration better models the dependency among views! (a sketch follows this list)
- Challenge: a pretrained model normally only takes text as input
- Solution 1: turn extra views into text-like representations
- Solution 2: add additional structures to XLNet to incorporate extra views
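The snippet below sketches the difference between the two approaches in the spirit of Solution 1. It is hypothetical: `embed` and `encoder` are random stand-ins for the pretrained embedding layer and Transformer, and `proj` is an assumed learned projection from a 3-field structured view into the embedding space.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16
proj = rng.normal(size=(3, DIM))                      # learned view-to-embedding projection

def embed(tokens):
    """Placeholder word embeddings."""
    return rng.normal(size=(len(tokens), DIM))

def encoder(x):
    """Placeholder for the pretrained Transformer (e.g. XLNet) layers."""
    return np.tanh(x @ rng.normal(size=(DIM, DIM)))

def shallow_mixing(tokens, extra_view):
    """Naive approach: the extra view is appended after encoding,
    so it can never change the text representation itself."""
    text_repr = encoder(embed(tokens)).mean(axis=0)
    return np.concatenate([text_repr, extra_view])    # fed to a shallow classifier

def deep_integration(tokens, extra_view):
    """Solution 1: turn the extra view into a token-like embedding and let the
    encoder attend over text tokens and the view token jointly."""
    view_token = (extra_view @ proj)[None, :]
    joint = np.concatenate([embed(tokens), view_token], axis=0)
    return encoder(joint).mean(axis=0)

view = np.array([0.2, 1.0, 3.5])                      # e.g. three structured CRM fields
print(shallow_mixing(["customer", "asked", "about", "pricing"], view).shape)    # (19,)
print(deep_integration(["customer", "asked", "about", "pricing"], view).shape)  # (16,)
```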
Research Plan Highlights
- Goal: improve the data efficiency of XLNet-like methods
- Proposed methods:
  - Pretraining + meta learning
  - Pretraining + multi-view learning: turn extra views into text-like representations; add additional structures to XLNet to incorporate extra views
- Datasets and experimental settings:
  - In-house text classification datasets extracted from sales calls
  - Billions of unlabeled sentences; a few thousand labeled sentences per class
  - Multiple domains
  - Evaluation metric: F1 score

Thanks!
杨植麟 (Zhilin Yang), Recurrent AI