NVIDIA Merlin HugeCTR Recommender System Framework

Related Sessions in GTC 2022

Merlin HugeCTR: GPU-Accelerated Recommender System Training and Inference (S41352) - Minseok Lee, Senior Manager, NVIDIA Merlin Recommender Systems Team
Merlin HugeCTR: Distributed Hierarchical Inference Parameter Server Using GPU Embedding Cache (S41126) - Yingcan Wei, Fan Yu, Matthias Langer, NVIDIA
Building and Deploying Recommender Systems Quickly and Easily with NVIDIA Merlin (S41119) - Even Oldridge, Senior Manager, Merlin Recommender Systems Team, NVIDIA

Getting started

HugeCTR on NVIDIA.com: https:/
HugeCTR on GitHub: https:/

Customer stories

Leading Design and Development of the Advertising Recommender System at Tencent: An Interview with Xiangting Kong
Meituan: Optimizing Meituan's Machine Learning Platform: An Interview with Jun Huang

Learn more about HugeCTR

Accelerating Embedding with the HugeCTR TensorFlow Embedding Plugin
HugeCTR Series Part 1: Scaling and Accelerating Large Deep Learning Recommender Systems (also available in Chinese)
HugeCTR Series Part 2: Training Large Deep Learning Recommender Models with Merlin HugeCTR's Python APIs (also available in Chinese)
HugeCTR Parameter Server Series Part 1: Introduction to Hierarchical Parameter Server

We are Hiring (Full Time & Intern)

C++ Engineer, CUDA Engineer, Recommendation System Algorithm Researcher
Please email your resume to: sh-HugeCTR

Resources

MERLIN HUGECTR: GPU-ACCELERATED RECOMMENDER SYSTEM TRAINING AND INFERENCE
Jerry Shi

RECOMMENDERS: THE PERSONALIZATION ENGINE OF THE INTERNET
Social media, digital advertising, e-commerce, digital content
4.3B active users; 4.3B watch videos online; 3.7B shop online; 4.7B internet users
"Already, 35 percent of what consumers purchase on Amazon and 75 percent of what they watch on Netflix come from product recommendations based on such algorithms." Source: McKinsey

NVIDIA MERLIN ADDRESSES RECOMMENDER SYSTEM CHALLENGES
NVIDIA Merlin is an open-source library to deploy recommender systems end-to-end, covering ETL (NVTabular), data loading, training (HugeCTR), and inference (Triton).

Challenges by stage:
ETL: pipelines are slow and complex.
Data loading: common item-by-item loading can be slow.
Training: embedding tables of large deep learning recommender systems can exceed GPU memory.
Inference: high throughput to rank more items is difficult while maintaining low latency.

Solutions by stage:
ETL: GPU-accelerated and easy-to-use ETL pipelines prepare datasets in minutes.
Data loading: asynchronous, GPU-accelerated data loaders for PyTorch and TensorFlow/Keras.
Training: easy data- and model-parallel training allows scaling to TB-size embeddings.
Inference: high-throughput, low-latency production deployment.

AGENDA
HugeCTR Overview
HugeCTR Inference
HugeCTR Sparse Operation Kit

HUGECTR OVERVIEW

HUGECTR: SCALABLE, ACCELERATED RECSYS TRAINING
https:/
An open-source framework to accelerate the training of CTR estimation models on NVIDIA GPUs. Written in CUDA C++, it heavily exploits GPU-accelerated libraries such as cuBLAS, cuDNN, and NCCL.
Provides a high-level, Keras-like Python API.
Continues to power NVIDIA's MLPerf DLRM submissions.
Designed for training recommender models at scale, where gigantic embedding tables are included.
Supports common models and their variants, such as the Deep Learning Recommendation Model (DLRM), Wide and Deep, Deep Cross Network, and DeepFM.
Supports GPU-accelerated, concurrent model inference as a Triton backend, based on an efficient embedding cache and hierarchical parameter server (HPS).
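
To make the "Keras-like Python API" point concrete, below is a minimal training sketch modeled on the publicly available HugeCTR samples. It is an illustration only: the file paths, slot counts, layer sizes, and iteration counts are placeholder assumptions, and the exact keyword arguments can differ between HugeCTR releases.

    import hugectr

    # Solver, data reader, and optimizer (placeholder paths, sizes, and GPU list)
    solver = hugectr.CreateSolver(batchsize=2048,
                                  batchsize_eval=2048,
                                  max_eval_batches=300,
                                  lr=0.001,
                                  vvgpu=[[0]],
                                  repeat_dataset=True)
    reader = hugectr.DataReaderParams(data_reader_type=hugectr.DataReaderType_t.Norm,
                                      source=["./train/file_list.txt"],
                                      eval_source="./val/file_list.txt",
                                      check_type=hugectr.Check_t.Sum)
    optimizer = hugectr.CreateOptimizer(optimizer_type=hugectr.Optimizer_t.Adam)

    # Keras-like, layer-by-layer model definition: input, sparse embedding, dense layers
    model = hugectr.Model(solver, reader, optimizer)
    model.add(hugectr.Input(label_dim=1, label_name="label",
                            dense_dim=13, dense_name="dense",
                            data_reader_sparse_param_array=[
                                hugectr.DataReaderSparseParam("data1", 1, True, 26)]))
    model.add(hugectr.SparseEmbedding(
        embedding_type=hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash,
        workspace_size_per_gpu_in_mb=75, embedding_vec_size=16, combiner="sum",
        sparse_embedding_name="sparse_embedding1", bottom_name="data1",
        optimizer=optimizer))
    model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.Reshape,
                                 bottom_names=["sparse_embedding1"],
                                 top_names=["reshape1"], leading_dim=416))
    model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.Concat,
                                 bottom_names=["reshape1", "dense"], top_names=["concat1"]))
    model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.InnerProduct,
                                 bottom_names=["concat1"], top_names=["fc1"], num_output=1024))
    model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU,
                                 bottom_names=["fc1"], top_names=["relu1"]))
    model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.InnerProduct,
                                 bottom_names=["relu1"], top_names=["fc2"], num_output=1))
    model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.BinaryCrossEntropyLoss,
                                 bottom_names=["fc2", "label"], top_names=["loss"]))
    model.compile()
    model.summary()
    model.fit(max_iter=10000, display=200, eval_interval=1000)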

CHALLENGES OF EMBEDDING LAYER
Memory Demands and Communication Overhead
Challenge 1: Embedding tables may not fit in a single GPU's memory (e.g., 100 GB, 1 TB, etc.). Solution: store the tables across multiple GPUs (and multiple nodes), i.e., model parallelism.
Challenge 2: Embedding primitives such as table lookup and the pooling/reduction operator are memory bound. Solution: fuse such CUDA kernels.
Challenge 3: Model parallelism requires inter-GPU/inter-node communication. Solution: use NCCL and/or exploit NVLink (or NVSwitch).
Common solution for Challenges 1, 2, and 3: use lower precision such as FP16.
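
A quick back-of-envelope calculation shows why Challenge 1 arises; the vocabulary size and embedding dimension below are assumed, illustrative values rather than numbers from this deck:

    # Rough sizing of an FP32 embedding table (illustrative numbers)
    vocab_size = 800_000_000      # total categorical IDs across all slots (assumed)
    embedding_dim = 128           # embedding vector size (assumed)
    bytes_per_value = 4           # FP32

    table_bytes = vocab_size * embedding_dim * bytes_per_value
    print(f"Embedding table: {table_bytes / 1e9:.0f} GB")   # ~410 GB

    gpu_memory_gb = 80            # e.g., a single 80 GB GPU
    print(f"GPUs needed just to hold the table: {table_bytes / (gpu_memory_gb * 1e9):.1f}")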

RECOMMENDER MODEL AT SCALE
A HugeCTR embedding table can span multiple nodes, beyond multiple GPUs.
[Figure: the embedding table is distributed (model parallel) across GPU0-GPU7 of Node 0 and Node 1, while every GPU runs its own data-parallel copy of the dense neural network.]

COMMON EMBEDDING PRIMITIVES
All Memory Bound
Categorical features (fea0, fea1 (multi-hot), fea2, fea3) are processed by the embedding tables in three steps: (1) table lookup, (2) pooling/reduction, (3) concatenation.

HUGECTR EMBEDDING LAYER
GPU Hash Table + CUDA Kernel Fusion
HugeCTR performs (1) a GPU hash table lookup and (2) in-register reduction and concatenation (one fused kernel).
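
For readers new to these primitives, here is a minimal NumPy sketch of the unfused sequence (lookup, pooling, concatenation) for a single sample; the table shapes and feature IDs are made-up illustrative values, not anything from this deck:

    import numpy as np

    # One embedding table per categorical feature (made-up sizes: vocab x dim)
    tables = {
        "fea0": np.random.rand(1000, 8).astype(np.float32),
        "fea1": np.random.rand(5000, 8).astype(np.float32),   # multi-hot feature
        "fea2": np.random.rand(300, 8).astype(np.float32),
    }
    sample = {"fea0": [17], "fea1": [3, 42, 965], "fea2": [5]}  # category IDs per feature

    pooled = []
    for name, ids in sample.items():
        vectors = tables[name][ids]          # (1) table lookup: gather rows, memory bound
        pooled.append(vectors.sum(axis=0))   # (2) pooling/reduction ("sum" combiner)
    output = np.concatenate(pooled)          # (3) concatenation -> input to the dense network

    print(output.shape)   # (24,) = 3 features x 8-dim embeddings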

WEIGHT CONVERSION FOR MIXED PRECISION TRAINING
From FP32 Master Weights to FP16 Weights
In each training iteration, the optimizer computes the master weights in FP32.
To support mixed precision training, the FP32 weights must then be cast to FP16.
Currently, the optimizer weight update and the weight conversion happen in two different kernels, so memory bandwidth utilization can be further improved.
[Figure: N x fprop and N x bprop feed the optimizer, which produces new FP32 master weights; a separate weight-conversion step (FP32 to FP16) produces the FP16 weights used by the next iteration.]
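
Conceptually, the per-iteration flow looks like the following NumPy sketch, with a simplified SGD update standing in for the real optimizer. HugeCTR implements this as CUDA kernels; the slide's point is that the update and the cast currently touch the weights in two separate passes:

    import numpy as np

    master_weights = np.random.rand(1_000_000).astype(np.float32)  # FP32 master copy
    grads = np.random.rand(1_000_000).astype(np.float32)
    lr = np.float32(0.001)

    # Kernel 1: optimizer update on the FP32 master weights (simplified SGD)
    master_weights -= lr * grads

    # Kernel 2: cast to FP16 for the next fprop/bprop
    fp16_weights = master_weights.astype(np.float16)

    # Fusing both steps into one kernel would read and write the weights only once,
    # improving memory bandwidth utilization.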

DATA PARALLEL NEURAL NETWORK
Multithreaded Kernel Launch to Keep GPUs Busy
One CPU thread per GPU launches every per-layer kernel (L0.K0, L1.K0, L1.K1, L2.K0, ...) of the forward (fprop) and backward (bprop) passes individually; the accumulated kernel launch latency may not be trivial!
[Figure: timelines for GPU0-GPU3, each fed by its own CPU thread (CPU Thread 0-3) issuing the per-layer kernels one launch at a time.]

DATA PARALLEL NEURAL NETWORK
CUDA Graph to Minimize CUDA Kernel Launch Latency
The same per-layer kernel sequence is issued with one CUDA Graph launch instead of many individual kernel launches.
[Figure: the same per-GPU timelines, with the kernels grouped into single CUDA Graph launches.]
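
HugeCTR does this inside its CUDA C++ training loop. Purely as an analogy (not HugeCTR code), the same capture-and-replay pattern is exposed by PyTorch's CUDA Graph API, which may help illustrate the idea:

    import torch

    # A tiny stand-in network; requires a CUDA-capable GPU.
    model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(),
                                torch.nn.Linear(64, 1)).cuda().eval()
    static_input = torch.randn(2048, 64, device="cuda")

    with torch.no_grad():
        # Warm up on a side stream, as recommended before graph capture
        s = torch.cuda.Stream()
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            for _ in range(3):
                model(static_input)
        torch.cuda.current_stream().wait_stream(s)

        # Capture the whole sequence of per-layer kernels into one graph ...
        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph):
            static_output = model(static_input)

    # ... and replay it with a single launch per iteration
    static_input.copy_(torch.randn(2048, 64, device="cuda"))
    graph.replay()
    print(static_output.shape)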

HUGECTR TRAINING PERFORMANCE
Wide & Deep on DGX A100
[Charts: Wide & Deep training performance on DGX A100; higher is better for one chart, lower is better for the other.]

HUGECTR MLPERF DLRM TRAINING PERFORMANCE
In the MLPerf Training v1.0/v1.1 Benchmark
Model: DLRM. Dataset: Criteo 1TB.
HugeCTR is a key driver behind the recent NVIDIA MLPerf win.
In this v1.1 edition, optimization over the entire hardware and software stack sees continuing improvement.
[Chart: relative DLRM training performance; higher is better.]
NVIDIA MLPerf DLRM submission details: 4x4S CPU (v1.0): 1.0-1045 | 1x DGX A100 (v1.0): 1.0-1058 | 14x DGX A100 (v1.0): 1.0-1067 | 14x DGX A100 (v1.1): 1.1-2073. For more information, see www.mlperf.org.

EMBEDDING TRAINING CACHE (ETC)
Train models (TB+) larger than the memory capacity of your GPUs by iteratively prefetching portions of the embedding table into GPU memory at the granularity of a pass. This is especially useful for continuous training.
Workflow (repeated for every pass):
Step 1) Load the embeddings for the pass from storage into GPU memory.
Step 2) Run the training pass.
Step 3) Update the model on storage.
[Figure: GPU memory (100s of GB) holds the embedding-table subset keyed for pass N, then pass N+1, while the full table lives in storage (CPU memory or SSD, TBs); steps 1-3 move data between them over time.]
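
The pass-based workflow can be summarized with a small, self-contained Python sketch. This is conceptual pseudocode for the three steps above, not the HugeCTR ETC API; a plain dict stands in for storage and another for GPU memory:

    import numpy as np

    # Conceptual sketch of the Embedding Training Cache loop (not the HugeCTR ETC API).
    # The full table lives in "storage" (CPU memory / SSD); only the keys needed by the
    # current pass are resident in "GPU memory" at a time.
    embedding_dim = 8
    storage = {key: np.random.rand(embedding_dim).astype(np.float32) for key in range(10_000)}
    passes = [list(range(0, 3_000)), list(range(2_000, 6_000)), list(range(5_000, 10_000))]

    for pass_id, keyset in enumerate(passes):
        gpu_cache = {k: storage[k].copy() for k in keyset}   # Step 1: prefetch to GPU memory
        for k in keyset:                                      # Step 2: training pass (dummy update)
            gpu_cache[k] -= 0.001 * np.random.rand(embedding_dim).astype(np.float32)
        storage.update(gpu_cache)                             # Step 3: write updates back to storage
        print(f"pass {pass_id}: trained {len(keyset)} keys")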

INTEROPERABILITY WITH OPEN NEURAL NETWORK EXCHANGE (ONNX)
Compatible with HugeCTR Training APIs
Open Neural Network Exchange (ONNX) is a common open-source format for AI models.
hugectr2onnx is a Python package that converts HugeCTR models to the ONNX format.
Input: the graph configuration JSON, the dense model, and, optionally, the sparse models.
Output: an ONNX model, which can then be used with ONNX Runtime or another framework's native inference engine, hence better interoperability.

Example: converting a trained Wide & Deep model (wdl_model: wdl0_sparse_2000.model, wdl1_sparse_2000.model, wdl_dense_2000.model, wdl.json):

    import hugectr2onnx.converter

    hugectr2onnx.converter.convert(onnx_model_path="wdl.onnx",
                                   graph_config="wdl.json",
                                   dense_model="wdl_dense_2000.model",
                                   convert_embedding=True,
                                   sparse_models=["wdl0_sparse_2000.model",
                                                  "wdl1_sparse_2000.model"])

https:/
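
Once converted, the resulting ONNX model can be loaded and inspected with ONNX Runtime. The snippet below is generic ONNX Runtime usage (assuming the wdl.onnx file produced above exists); the actual input names, shapes, and dtypes come from the converted graph itself:

    import onnxruntime as ort

    # Load the converted model; fall back to CPU if no GPU provider is available
    sess = ort.InferenceSession("wdl.onnx",
                                providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

    # The converted graph defines the input names, shapes, and dtypes to feed
    for inp in sess.get_inputs():
        print(inp.name, inp.shape, inp.type)

    # outputs = sess.run(None, {name: batch_array, ...})  # run once the feeds match the specs above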

OPTIMIZATIONS & FEATURES
https:/
MLPerf training v1.0/1.1 optimizations. v1.0 optimizations: https:/ We are actively working on their full generalization. Stay tuned!
HugeCTR release notes: https:/
HugeCTR samples & notebooks. Model samples: https:/ Notebooks: https:/

HUGECTR INFERENCE

CHALLENGES OF INFERENCE
Challenge 1: Embedding tables may not fit in a single GPU's memory (e.g., 100 GB, 1 TB, etc.).
Challenge 2: The target latency must be met whilst maintaining high throughput/concurrency.
Challenge 3: Loading an updated incremental model from training into inference must be streamlined (fast deployment).
Challenge 4: GPU compute and memory resources must be well isolated.

GPU-CENTRIC MEMORY HIERARCHY
High GPU Memory Bandwidth and Fast GPU Processors for Low-Latency Inference
GPU streaming multiprocessors (SMs) provide much better computational throughput than the CPU.
GPU memory bandwidth is 10x faster than CPU memory bandwidth.
GPU memory is sufficient for holding the "hot embedding vectors" of a gigantic embedding table: only a few embedding vectors are required for the majority of requests.
Inference can therefore be optimized by leveraging the GPU in production.
[Figure: memory hierarchy - GPU memory (10s of GB, 100s of GB/s to TBs/s) next to the GPU SMs; CPU memory (100s of GB to TBs, 10s of GB/s); storage/RAID (10s to 100s of TB, GBs/s to 10s of GB/s).]

HUGECTR INFERENCE ARCHITECTURE
At a High Level
Designed to use the GPU memory effectively to accelerate inference.
Supports concurrent model inference execution on the same GPU or across multiple GPUs.
Embedding tables are cached hierarchically.
The parameter server is in charge of managing the whole embedding tables of different models on CPU memory and SSDs.
The embedding cache manages the hot portions of the embedding tables while providing GPU-accelerated lookups.
Instances of the same model share its embedding cache to do their predictions.
[Figure: Model 1 and Model 2 instances on the GPU share per-model embedding caches (one table cache per embedding table), backed by a parameter server on the CPU that holds the full embedding tables of both models.]

DATABASE BACKEND
Persistent vs. Volatile
Persistent database: HugeCTR will deploy, "persist", and maintain a full copy of all parameters in this database. "Persistent" databases are regarded as having virtually unlimited storage space. E.g., RocksDB.
Volatile database: fast! It resides in RAM (either remote or local) and can be considered a cache for the persistent database. It is assumed to operate with limited space; limited space means that additional handlers are implemented to keep it functioning when it runs out of storage space. E.g., Redis.
The embedding cache (EC), the persistent database, and the volatile database are EVENTUALLY CONSISTENT!
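
The lookup order implied by this design (GPU embedding cache first, then the volatile database, then the persistent database) can be sketched conceptually as below. Plain dicts stand in for the GPU cache, Redis, and RocksDB; this is illustrative logic, not the HPS implementation:

    # Conceptual fall-through lookup across the tiers (not HugeCTR code).
    gpu_embedding_cache = {}            # hot vectors in GPU memory (Level 1)
    volatile_db = {}                    # e.g. Redis: fast RAM cache, limited space (Level 2)
    persistent_db = {k: [0.1] * 16 for k in range(100_000)}   # e.g. RocksDB: full copy (Level 3)

    def lookup(key):
        if key in gpu_embedding_cache:             # Level 1 hit: served from GPU memory
            return gpu_embedding_cache[key]
        if key in volatile_db:                     # Level 2 hit: served from RAM
            vector = volatile_db[key]
        else:                                      # Level 3: always holds every parameter
            vector = persistent_db[key]
            volatile_db[key] = vector              # optionally cache the missed embedding
        gpu_embedding_cache[key] = vector          # promote the now-hot key to the GPU cache
        return vector

    print(lookup(42)[:4])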

HUGECTR HIERARCHICAL PARAMETER SERVER
Multi-Level Caching across Different Types of Memory
[Figure: Level 1 is the embedding cache in GPU memory; Level 2 holds local parameters, parameter shards, meta parameters, and a parameter buffer pool in CPU memory and across the network (with temporary memory for the network); Level 3 is parameter replication on local SSDs. Query, insertion, and incremental parameters flow between the levels, and updates propagate through an event sink.]

INCREMENTAL UPDATES
Basically, a lazy queue: the optimizer publishes updates via a sink to Kafka.
Each HugeCTR instance monitors the queues of its related models.
Highly automated process: automatic workload distribution; can handle a certain degree of node failure; asynchronous injection of updates in the background.
[Figure: training (e.g., HugeCTR native, TensorFlow SOK) is the message source, a message buffer (e.g., Kafka, RabbitMQ, ...) sits in between, and inference is the message sink.]

HPS E2E SETUP EXAMPLE
From Training to Deployment

CONFIGURING HUGECTR
e.g.: Inference with Triton
Just add to your Triton config:

    "persistent_db": {
        "type": "rocks_db",
        "path": "/models/wide_and_deep",
        "num_threads": 16,
        "read_only": false,
        "max_get_batch_size": 8192,
        "max_set_batch_size": 8192
    },
    "volatile_db": {
        "type": "redis_cluster",
        "address": "host_name:port,...",
        "user_name": "default",
        "password": "123456",
        "num_partitions": 16,
        "overflow_margin": 1000000,
        "overflow_policy": "evict_oldest",
        "overflow_resolution_target": 800000,
        "max_get_batch_size": 8192,
        "max_set_batch_size": 8192,
        "cache_missed_embeddings": true
    }

STREAM MODEL UPDATES LIVE INTO YOUR INFERENCE SYSTEM?
HugeCTR will automatically discover and ingest model updates.
If multiple HugeCTR instances connect to shared databases, the related workload is distributed roughly evenly among them.
Fast, reliable, automatic, via Apache Kafka!

In your model training code:

    import hugectr

    solver = hugectr.CreateSolver(max_eval_batches=300,
                                  batchsize_eval=2048,
                                  batchsize=2048,
                                  lr=0.001,
                                  vvgpu=[[2]],
                                  repeat_dataset=True,
                                  i64_input_key=True,
                                  kafka_brokers="host_name:port,...")

Just add to your Triton config:

    "update_source": {
        "type": "kafka_message_queue",
        "brokers": "host_name:port,...",
        "poll_timeout_ms": 50,
        "max_receive_buffer_size": 4096,
        "max_batch_size": 1024,
        "failure_backoff_ms": 300
    }

HPS PERFORMANCE
We continue to actively improve it!
[Chart: average query latency improvement of HugeCTR 3.4 over HugeCTR 3.1, with the 3.1 baseline at 100% for each backend: roughly 1.3x for the RocksDB backend, 4.2x for the HashMap backend, and 80.0x for the Redis backend. Higher is better.]
Model: Wide & Deep. Dataset: Criteo. Evaluation performed across 2000 consecutive random queries. Cluster node configuration: DGX-1V (https:/ ). Network configuration: EDR InfiniBand (https:/ ).

MORE RESOURCES ON HUGECTR INFERENCE
Documentation & Deep Dive
Documentation: HugeCTR Inference Architecture: https:/ HugeCTR Hierarchical Parameter Server: https:/ HugeCTR Triton Backend README: https:/
Notebooks: Hierarchical Deployment: https:/ Continuous Training and Inference, Part 1: https:/ Part 2: https:/
GTC sessions: GPU Embedding Cache Performance Deep Dive at GTC 2020: https:/ HugeCTR: Distributed Hierarchical Inference Parameter Server Using GPU Embedding Cache (S41126) at GTC 2022

HUGECTR SPARSE OPERATION KIT

SPARSE OPERATION KIT (SOK): TENSORFLOW EMBEDDING PLUGIN
https:/
Motivation: enable HugeCTR features and optimizations, e.g., the HugeCTR embedding operators, for general DL frameworks such as TensorFlow.
Sparse Operation Kit (SOK) is a Python package that provides GPU-accelerated operations for sparse training and inference (e.g., recommenders) for popular frameworks such as TensorFlow.
Its major operators are extracted from HugeCTR.

FEATURES
Good Compatibility with TensorFlow
Compatible with TensorFlow 1.15 and 2.4+.
No need to recompile or reinstall TensorFlow.
Compatible with distributed training frameworks such as Horovod.
Works with MirroredStrategy and MultiWorkerMirroredStrategy.

FEATURES
Model-Parallel GPU Embedding Layer for RecSys Training at Scale
[Figure: the dense layers run data-parallel (DP) on each GPU while the SOK embedding layer is model-parallel (MP) across GPUs. DP stands for Data-Parallel; MP stands for Model-Parallel.]

SPARSE OPERATION KIT USAGE (HOROVOD)
Define the training loop with Horovod.

Plain TensorFlow model:

    model = TFModel()

    @tf.function
    def train_step(inputs, labels):
        with tf.GradientTape() as tape:
            logits = model(inputs)
            loss = _loss_fn(labels, logits)
        trainable_variables = model.trainable_variables
        grads = tape.gradient(loss, trainable_variables)
        grads = [hvd.allreduce(grad) for grad in grads]
        optimizer.apply_gradients(zip(grads, trainable_variables))
        return loss

Model with SOK embedding layers (the embedding variables are split out and updated without the all-reduce):

    sok.Init()
    model = SOKModel()

    @tf.function
    def train_step(inputs, labels):
        with tf.GradientTape() as tape:
            logits = model(inputs)
            loss = _loss_fn(labels, logits)
        emb_variables, other_variables = sok.split_embedding_variable_from_others(model.trainable_variables)
        emb_grads, other_grads = tape.gradient(loss, [emb_variables, other_variables])
        other_grads = [hvd.allreduce(grad) for grad in other_grads]
        optimizer.apply_gradients(zip(emb_grads, emb_variables))
        optimizer.apply_gradients(zip(other_grads, other_variables))
        return loss
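
For completeness, a generic Horovod driver for a train_step like the ones above could look as follows. This is a standard Horovod-with-TensorFlow bootstrap sketch, not code from this deck; the dataset, model, and broadcast steps are indicated only as comments:

    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()

    # Pin each process to one GPU, the usual Horovod setup
    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

    # dataset is assumed to be a tf.data.Dataset of (inputs, labels), sharded per worker,
    # and train_step is one of the functions defined above.
    # for step, (inputs, labels) in enumerate(dataset):
    #     loss = train_step(inputs, labels)
    #     if step == 0:
    #         hvd.broadcast_variables(model.variables, root_rank=0)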

MORE RESOURCES ON SPARSE OPERATION KIT
Documentation & Deep Dive
Documentation: https:/ https://nvidia-merlin.github.io/HugeCTR/sparse_operation_kit/master/index.html
Notebooks: SOK
