杨植麟, Recurrent AI
Latest Advances of Neural Language Models

Part I: About Me

- Co-founder of Recurrent AI; previously worked at Google Brain and Facebook AI Research.
- Co-authored papers with several Turing Award laureates.
- Achieved state-of-the-art results on more than 30 datasets in natural language understanding, semi-supervised learning, and other areas.
- B.S. from Tsinghua University (2015); Ph.D. from Carnegie Mellon University (2019), advised by Ruslan Salakhutdinov, head of AI research at Apple.

Main research results:

- XLNet (NeurIPS 2019): outperforms Google BERT on 20 datasets; currently the best pretrained model given standard FLOPs; NeurIPS Oral (top 0.5%); covered by dozens of AI media outlets.
- Transformer-XL (ACL 2019): set new state-of-the-art results on all mainstream language modeling benchmarks; the first attention model to surpass LSTMs at both the word level and the character level; can coherently generate text thousands of words long.
- HotpotQA (EMNLP 2018): a multi-step reasoning dataset; used for model evaluation by Stanford, the University of Washington, UT Austin, Tsinghua University, ByteDance, JD, Microsoft, and other institutions.
- Semi-supervised graph learning (ICML 2016): 400+ citations; popularized the standard datasets of the graph learning field; adopted as a standard baseline by hundreds of follow-up works.

Part II: XLNet

Learning from Unlabeled Data
- Unlabeled data: abundant (1000x more), accessible.
- Labeled data: scarce, expensive.

Unsupervised Pretraining
- Pretrain algorithms/models on unlabeled data, then finetune on labeled data.
- Improves over purely supervised learning.

Related Work
- RBMs (Salakhutdinov et al. 2007), Autoencoders (Vincent et al. 2008), Jigsaw (Noroozi and Favaro 2016), GANs (Donahue and Simonyan 2019)
- word2vec (Mikolov et al. 2013), GloVe (Pennington et al. 2014)
- Semi-supervised sequence learning (Dai and Le 2015), ELMo (Peters et al. 2018), CoVe (McCann et al. 2017), GPT (Radford et al. 2018), BERT (Devlin et al. 2018)

Two Objectives for Pretraining
- Auto-regressive (AR) language modeling: a unidirectional Transformer predicts "New York is a city" token by token. Limitation: not able to model bidirectional context.
- (Denoising) auto-encoding (AE): a bidirectional Transformer reconstructs "New York" from "[mask] [mask] is a city". Limitations: the predicted tokens are independent of each other, and [mask] is not used during finetuning.
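For reference, the two objectives can be written out explicitly (standard formulations in the XLNet paper's notation, where x̂ is the corrupted sequence and m_t = 1 marks a masked token; these equations are added here and are not slide text):

```latex
% Auto-regressive (AR) language modeling: a forward factorization of the joint probability.
\max_{\theta}\; \log p_{\theta}(\mathbf{x}) \;=\; \sum_{t=1}^{T} \log p_{\theta}(x_t \mid \mathbf{x}_{<t})

% Denoising auto-encoding (BERT): reconstruct the masked tokens \bar{\mathbf{x}} from the corrupted
% input \hat{\mathbf{x}}, implicitly assuming the masked tokens are conditionally independent.
\max_{\theta}\; \log p_{\theta}(\bar{\mathbf{x}} \mid \hat{\mathbf{x}}) \;\approx\; \sum_{t=1}^{T} m_t \, \log p_{\theta}(x_t \mid \hat{\mathbf{x}})
```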

New Objective: Permutation Language Modeling
- Sample a factorization order.
- Determine the attention masks based on the order.
- Optimize a standard language modeling objective.
- Benefits: autoregressive, avoiding the disadvantages of AE, while still able to model bidirectional context.
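Formally, the permutation language modeling objective maximizes the expected log-likelihood over factorization orders, where Z_T denotes the set of all permutations of length T (again the standard formulation from the XLNet paper, added here for reference):

```latex
\max_{\theta}\; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
\left[ \sum_{t=1}^{T} \log p_{\theta}\!\left(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\right) \right]
```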

Examples
- Factorization order "New York is a city": P(New York is a city) = P(New) * P(York | New) * P(is | New York) * P(a | New York is) * P(city | New York is a)
- Factorization order "city a is New York": P(New York is a city) = P(city) * P(a | city) * P(is | city a) * P(New | city a is) * P(York | city a is New)
- The sequence order is not shuffled; the attention masks are changed to reflect the factorization order.
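As an illustration of how a factorization order turns into attention masks without shuffling the input, here is a small self-contained sketch (plain Python with hypothetical helper names, not code from the talk): each position may attend only to positions that come earlier in the sampled order.

```python
import random

def permutation_mask(order):
    """Given a factorization order (a permutation of positions), return
    mask[i][j] = True iff position i may attend to position j, i.e. j comes
    strictly before i in the sampled order."""
    rank = {pos: r for r, pos in enumerate(order)}
    n = len(order)
    return [[rank[j] < rank[i] for j in range(n)] for i in range(n)]

tokens = ["New", "York", "is", "a", "city"]

# Factorization order "city a is New York", expressed as position indices.
order = [4, 3, 2, 0, 1]
mask = permutation_mask(order)

# The input sequence itself is never shuffled; only the attention mask changes.
for i, tok in enumerate(tokens):
    context = [tokens[j] for j in range(len(tokens)) if mask[i][j]]
    print(f"P({tok} | {', '.join(context) if context else '<empty>'})")

# During training, a fresh order would be sampled for every sequence:
sampled = random.sample(range(len(tokens)), len(tokens))
print("sampled factorization order:", [tokens[p] for p in sampled])
```

Running this prints, for example, P(New | is, a, city) and P(York | New, is, a, city), matching the second factorization above.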

[Figure: attention patterns over hidden states h (with cached memory mem) of a four-token sequence x1-x4, shown for factorization orders 3 2 4 1, 1 4 2 3, 2 4 3 1, and 4 3 1 2]

Comparing XLNet and BERT objectives
- BERT objective (auto-encoding): "New" and "York" are independent.
- XLNet objective (auto-regressive): able to model the dependency between "New" and "York"; able to model bidirectional context; factorizes the joint probability using a product rule that holds universally.
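To make the comparison concrete, suppose [New, York] are the prediction targets in "New York is a city". The two objectives then decompose as follows (this worked example mirrors the one in the XLNet paper, with XLNet shown for an order that happens to predict New before York):

```latex
\mathcal{J}_{\text{BERT}}  = \log p(\text{New} \mid \text{is a city}) + \log p(\text{York} \mid \text{is a city})
\mathcal{J}_{\text{XLNet}} = \log p(\text{New} \mid \text{is a city}) + \log p(\text{York} \mid \text{New, is a city})
```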

Standard Parameterization
- "Stand at" a target position and predict the token itself: the feature h does not contain the position of the target, so the objective is reduced to predicting a bag of words.

Reparameterization
- Solution: condition the distribution on the position of the target.
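Written out (standard notation from the XLNet paper, added here for reference): the first, naive parameterization does not depend on which position z_t is being predicted, so every target position shares one distribution; the reparameterized version feeds the target position into a new feature g.

```latex
% Naive: h_\theta depends only on the context, not on the target position z_t.
p_{\theta}(X_{z_t} = x \mid \mathbf{x}_{\mathbf{z}_{<t}})
  = \frac{\exp\!\big(e(x)^{\top} h_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}})\big)}
         {\sum_{x'} \exp\!\big(e(x')^{\top} h_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}})\big)}

% Reparameterized: g_\theta additionally takes the target position z_t as input.
p_{\theta}(X_{z_t} = x \mid \mathbf{x}_{\mathbf{z}_{<t}})
  = \frac{\exp\!\big(e(x)^{\top} g_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}}, z_t)\big)}
         {\sum_{x'} \exp\!\big(e(x')^{\top} g_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}}, z_t)\big)}
```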

How to Formulate Features
- Let h_i^(l) denote the feature of the i-th token on layer l, and suppose the factorization order is 3 2 4 1.
- [Figure: two-layer features h_1^(1) ... h_4^(2) over tokens x1-x4, shown once for predicting x2 and once for predicting x4 under this order]
- The feature used to predict a token cannot see the token itself, otherwise the prediction is trivial.
- But a single feature creates a contradiction: h_2 should not encode x2 when predicting x2, yet it should encode x2 when serving as context for predicting x4. With one stream, position 4 only has access to x2 in the first layer!

Two-Stream Attention
- Factorization order: 3, 2, 4, 1.
- Content stream h: encodes the token itself together with its permitted context. Query stream g: encodes the target position and the context, but not the token itself.
- [Figure: content-stream and query-stream updates, each computed as Attention(Q; K, V) over the h and g states]
- At the first layer, h is the word embedding and g is a trainable parameter.
- Only h is used during finetuning; the last-layer g is used for optimizing the LM loss.
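The sketch below (NumPy, single head, single layer) shows how the two streams differ only in their masks and in where the queries come from. It is a simplification for illustration, not the actual XLNet implementation, which also uses multi-head projections, relative positional encodings, and Transformer-XL memory; all names here are made up.

```python
import numpy as np

def attention(q, kv, mask):
    """Single-head dot-product attention with a boolean mask.
    mask[i, j] = True iff query i may attend to position j; a row that may
    attend to nothing returns a zero vector."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True)) * mask
    denom = weights.sum(axis=-1, keepdims=True)
    weights = np.divide(weights, denom, out=np.zeros_like(weights), where=denom > 0)
    return weights @ kv

rng = np.random.default_rng(0)
n, d = 4, 8                                  # four tokens x1..x4
order = [2, 1, 3, 0]                         # factorization order 3 2 4 1 (0-indexed)
rank = {pos: r for r, pos in enumerate(order)}

# Content mask: position i sees j iff j is at or before i in the order (may see itself).
# Query mask:   position i sees j iff j is strictly before i (may not see itself).
content_mask = np.array([[rank[j] <= rank[i] for j in range(n)] for i in range(n)])
query_mask   = np.array([[rank[j] <  rank[i] for j in range(n)] for i in range(n)])

h = rng.normal(size=(n, d))                          # first-layer h: the word embeddings
g = np.tile(rng.normal(size=(1, d)), (n, 1))         # first-layer g: a shared trainable vector

h_next = attention(h, h, content_mask)               # content stream: Q, K, V from h
g_next = attention(g, h, query_mask)                 # query stream: Q from g, K and V from h

# g_next[i] never contains x_i itself, so it can be used to predict x_i;
# h_next[i] does contain x_i, so it can serve as context for later targets and for finetuning.
print(h_next.shape, g_next.shape)
```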

Summarizing XLNet (challenges and solutions)
- Independence assumption and distribution discrepancy in BERT → permutation language modeling.
- Standard parameterization is reduced to bag-of-words → reparameterization with positions.
- Contradiction between predicting self and predicting others → two-stream attention.

Experiment 1: Comparison with BERT
- Same training data as BERT: Wikipedia + BooksCorpus.
- Same pretraining hyperparameters as BERT: model size L=24, H=1024, A=16; batch size 256; 1M steps.
- Same hyperparameter search space for finetuning as BERT.
- Result: XLNet outperforms BERT on 20 tasks, reporting the best of 3 BERT variants, with almost identical training recipes.

Experiment 2: Comparison with RoBERTa
- Less training data for XLNet: 126GB vs. 160GB.
- Same pretraining hyperparameters as RoBERTa: model size L=24, H=1024, A=16; batch size 8192; 500K steps.
- Same hyperparameter search space for finetuning as RoBERTa.
- Result: XLNet outperforms RoBERTa on all considered tasks, with almost identical training recipes.

XLNet is the best pretrained model today given standard FLOPs.
[Figure: accuracy vs. FLOPs for BERT-Large, RoBERTa, XLNet, ALBERT, and T5, at roughly 1x, 4x, and 16x FLOPs]

Part III: Research Plan

Research Proposal
- Challenge: XLNet and similar methods still rely on a large amount of labeled data for target tasks.
- Goal: improve the data efficiency of the pretraining-finetuning paradigm.
- Directions: pretraining + meta learning; pretraining + multi-view integration.

Meta Learning: Background (Chen et al. 2019)

Pretraining + Meta Learning
- Main idea: a meta learning paradigm for finetuning.
- Why it might work: learning to compare a novel instance against memory (see the sketch below).
- Goal: reduce sample complexity and improve data efficiency.
- Technical novelties and challenges: a meta learning algorithm that works with dozens/hundreds of examples and a pretrained model.
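A minimal sketch of what "comparing a novel instance against memory" could look like on top of a pretrained encoder. This is one hypothetical instantiation (a prototype classifier over encoder features, with made-up function names and random stand-in features), not the method proposed in the talk.

```python
import numpy as np

def encode(texts, dim=16, seed=0):
    # Stand-in for a pretrained encoder such as XLNet; here just fixed random features.
    rng = np.random.default_rng(seed)
    return {t: rng.normal(size=dim) for t in texts}

def build_memory(features, labeled):
    """labeled: list of (text, label). Memory = one prototype (mean feature) per label."""
    protos = {}
    for text, label in labeled:
        protos.setdefault(label, []).append(features[text])
    return {label: np.mean(vecs, axis=0) for label, vecs in protos.items()}

def classify(features, memory, text):
    """Compare the query feature against each prototype by cosine similarity."""
    q = features[text]
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(memory, key=lambda label: cos(q, memory[label]))

labeled = [("call_a", "interested"), ("call_b", "interested"),
           ("call_c", "not_interested"), ("call_d", "not_interested")]
features = encode([t for t, _ in labeled] + ["new_call"])
memory = build_memory(features, labeled)
print(classify(features, memory, "new_call"))
```

The point of the design is that only the comparison against a small labeled memory happens per task, which is why it could work with dozens or hundreds of examples.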

Data often have multiple views (features)
- Example 1: we can use XLNet to learn to classify sales calls/texts; meanwhile, there are also structured data stored in databases. Question: how to combine the two views?
- Example 2: we can use XLNet to learn to classify medical texts; meanwhile, there is another black-box model trained on medical imaging data. Question: how to combine the two views?

Naive Approach: Shallow Mixing
- Extra views do not change the text representations.
- Not expressive enough to capture the dependency between views.

Proposed Approach: Deep Integration
- Deep integration better models the dependency among views (contrasted with shallow mixing in the sketch below).
- Challenge: a pretrained model normally only takes text as inputs.
- Solution 1: turn extra views into text-like representations.
- Solution 2: add additional structures to XLNet to incorporate extra views.
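To make the two options concrete, here is a toy, hypothetical contrast (NumPy; `encode` stands in for a pretrained text encoder and the projection matrices are untrained random placeholders): shallow mixing only concatenates the extra view at the classifier, while deep integration maps the extra view into a pseudo-token that is encoded together with the text, so the encoder itself can model cross-view dependencies.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16
W = rng.normal(size=(DIM, DIM))              # fixed random "encoder" weights

def encode(token_vectors):
    # Stand-in for a pretrained text encoder such as XLNet: mean pooling after
    # one fixed random mixing layer, just to keep the sketch runnable.
    return np.tanh(token_vectors @ W).mean(axis=0)

text_tokens = rng.normal(size=(5, DIM))      # embeddings of a 5-token utterance
extra_view = rng.normal(size=3)              # e.g. 3 structured CRM fields

# Shallow mixing: the text representation is computed without the extra view,
# which is only concatenated right before the classifier.
shallow_input = np.concatenate([encode(text_tokens), extra_view])

# Deep integration (in the spirit of Solution 1): project the extra view into
# the token embedding space and encode it together with the text tokens.
project = rng.normal(size=(3, DIM))
pseudo_token = (extra_view @ project)[None, :]
deep_input = encode(np.vstack([text_tokens, pseudo_token]))

print(shallow_input.shape)                   # (19,): views meet only at the classifier
print(deep_input.shape)                      # (16,): views interact inside the encoder
```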

Research Plan Highlight
- Goal: improve the data efficiency of XLNet-like methods.
- Proposed methods: pretraining + meta learning; pretraining + multi-view learning (turn extra views into text-like representations; add additional structures to XLNet to incorporate extra views).
- Datasets and experimental settings: in-house text classification datasets extracted from sales calls; billions of unlabeled sentences; a few thousand labeled sentences per class; multiple domains.
- Evaluation metric: F1 score.

杨植麟, Recurrent AI
Thanks!
