RESOURCES

Related sessions in GTC 2022:
Merlin HugeCTR: GPU-Accelerated Recommender System Training and Inference [S41352] - Minseok Lee | Senior Manager, NVIDIA Merlin Recommender Systems Team
Merlin HugeCTR: Distributed Hierarchical Inference Parameter Server Using GPU Embedding Cache [S41126] - Yingcan Wei, Fan Yu, Matthias Langer | NVIDIA
Building and Deploying Recommender Systems Quickly and Easily with NVIDIA Merlin [S41119] - Even Oldridge, Senior Manager, Merlin Recommender Systems Team, NVIDIA

Getting started with HugeCTR:
NVIDIA.com: https:/
GitHub: https:/

Customer stories:
Leading Design and Development of the Advertising Recommender System at Tencent: An Interview with Xiangting Kong
Optimizing Meituan's Machine Learning Platform: An Interview with Jun Huang

Learn more about HugeCTR:
Accelerating Embedding with the HugeCTR TensorFlow Embedding Plugin
HugeCTR Series Part 1: Scaling and Accelerating Large Deep Learning Recommender Systems (also available in Chinese)
HugeCTR Series Part 2: Training Large Deep Learning Recommender Models with Merlin HugeCTR's Python APIs (also available in Chinese)
HugeCTR Parameter Server Series Part 1: Introduction to Hierarchical Parameter Server

We are hiring (full time & intern): C++ Engineer, CUDA Engineer, Recommendation System Algorithm Researcher. Please email your resume to: sh-HugeCTR

MERLIN HUGECTR: GPU-ACCELERATED RECOMMENDER SYSTEM TRAINING AND INFERENCE
Jerry Shi

RECOMMENDERS: THE PERSONALIZATION ENGINE OF THE INTERNET
Social media, digital advertising, e-commerce, digital content: 4.3B active users, 4.3B watch videos online, 3.7B shop online, 4.7B internet users.
"Already, 35 percent of what consumers purchase on Amazon and 75 percent of what they watch on Netflix come from product recommendations based on such algorithms." (Source: McKinsey)
NVIDIA MERLIN ADDRESSES RECOMMENDER SYSTEM CHALLENGES
NVIDIA Merlin is an open-source library to deploy recommender systems end-to-end.
Data loading. Challenge: common item-by-item loading can be slow. Solution: an asynchronous, GPU-accelerated data loader for PyTorch and TensorFlow/Keras.
ETL. Challenge: pipelines are slow and complex. Solution: GPU-accelerated, easy-to-use ETL pipelines (NVTabular) prepare datasets in minutes.
Training. Challenge: embedding tables of large deep learning recommender systems can exceed GPU memory. Solution: easy data- and model-parallel training (HugeCTR) scales to TB-size embeddings.
Inference. Challenge: ranking more items at high throughput while maintaining low latency is difficult. Solution: high-throughput, low-latency production deployment (Triton).

AGENDA
HugeCTR Overview
HugeCTR Inference
HugeCTR Sparse Operation Kit
HUGECTR OVERVIEW

HUGECTR: SCALABLE, ACCELERATED RECSYS TRAINING
https:/
An open-source framework to accelerate the training of CTR estimation models on NVIDIA GPUs.
Written in CUDA C++; heavily exploits GPU-accelerated libraries such as cuBLAS, cuDNN, and NCCL.
Provides a high-level, Keras-like Python API.
Continues to power NVIDIA's MLPerf DLRM submissions.
Designed for training recommender models at scale, where gigantic embedding tables are included.
Supports common models and their variants, such as the Deep Learning Recommendation Model (DLRM), Wide and Deep, Deep Cross Network, and DeepFM.
Supports GPU-accelerated, concurrent model inference as a Triton backend, based on an efficient embedding cache and hierarchical parameter server (HPS).
CHALLENGES OF EMBEDDING LAYER
Memory Demands and Communication Overhead
Challenge 1. Embedding tables may not fit in a single GPU's memory (e.g., 100 GB, 1 TB, etc.). Solution: store the tables across multiple GPUs (and multiple nodes), i.e., model parallelism.
Challenge 2. Embedding primitives such as table lookup and the pooling/reduction operator are memory bound. Solution: fuse these CUDA kernels.
Challenge 3. Model parallelism requires inter-GPU/inter-node communication. Solution: use NCCL and/or exploit NVLink (or NVSwitch).
Common solution for Challenges 1, 2, and 3: use lower precision such as FP16.

RECOMMENDER MODEL AT SCALE
A HugeCTR embedding table can span multiple nodes, beyond multiple GPUs.
[Figure: data-parallel copies of the neural network run on GPU0 through GPU7 of Node 0 and Node 1, while the embeddings are partitioned across all GPUs of both nodes.]
COMMON EMBEDDING PRIMITIVES
All Memory Bound
[Figure: categorical features fea0, fea1 (multi-hot), fea2, and fea3 are processed against the embedding tables by (1) table lookup, (2) pooling/reduction, and (3) concatenation.]

HUGECTR EMBEDDING LAYER
GPU Hash Table + CUDA Kernel Fusion
[Figure: the same categorical features are processed by (1) GPU hash table lookup and (2) in-register reduction and concatenation, executed as one fused kernel.]
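To make the three primitives concrete, here is a minimal sketch in plain TensorFlow (not HugeCTR's fused CUDA implementation) of what one embedding layer computes for a batch; the table sizes, hotness, and slot names are illustrative only.

    import tensorflow as tf

    batch = 4
    emb_dim = 16

    # One embedding table per categorical slot; sizes are made up for illustration.
    tables = {
        "fea0": tf.Variable(tf.random.normal([1000, emb_dim])),
        "fea1": tf.Variable(tf.random.normal([5000, emb_dim])),   # multi-hot slot
    }

    fea0_ids = tf.random.uniform([batch], maxval=1000, dtype=tf.int32)     # one id per sample
    fea1_ids = tf.random.uniform([batch, 3], maxval=5000, dtype=tf.int32)  # three ids per sample

    # (1) Table lookup
    fea0_vec = tf.nn.embedding_lookup(tables["fea0"], fea0_ids)    # [batch, emb_dim]
    fea1_vecs = tf.nn.embedding_lookup(tables["fea1"], fea1_ids)   # [batch, 3, emb_dim]

    # (2) Pooling/reduction of the multi-hot slot
    fea1_pooled = tf.reduce_sum(fea1_vecs, axis=1)                 # [batch, emb_dim]

    # (3) Concatenation into the input of the dense network
    dense_input = tf.concat([fea0_vec, fea1_pooled], axis=1)       # [batch, 2 * emb_dim]

Each step streams embedding vectors through memory while doing very little arithmetic, which is why HugeCTR fuses lookup, reduction, and concatenation into a single kernel.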
WEIGHT CONVERSION FOR MIXED PRECISION TRAINING
From FP32 Master Weights to FP16 Weights
In each training iteration, the optimizer computes the master weights in FP32.
To support mixed precision training, the FP32 weights must then be cast to FP16.
Currently, the optimizer weight update and the weight conversion happen in two different kernels, so memory bandwidth utilization can be further improved.
[Figure: N x fprop and N x bprop run on the FP16 weights; the optimizer then produces new FP32 weights, followed by a separate FP32-to-FP16 weight conversion step.]
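To spell out the two kernels involved, here is a minimal NumPy sketch (illustrative only, not HugeCTR's CUDA code) of an SGD-style update on the FP32 master weights followed by the FP32-to-FP16 cast; a fused kernel could perform the cast while the freshly updated weights are still on hand, avoiding an extra pass over the weight memory.

    import numpy as np

    lr = 0.01
    master_w = np.random.randn(1 << 20).astype(np.float32)  # FP32 master weights
    grad = np.random.randn(1 << 20).astype(np.float32)      # gradients from bprop

    # Kernel 1: optimizer update on the FP32 master copy.
    master_w -= lr * grad

    # Kernel 2: conversion to the FP16 weights consumed by the next fprop/bprop.
    w_fp16 = master_w.astype(np.float16)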
DATA PARALLEL NEURAL NETWORK
Multithreaded Kernel Launch to Keep GPUs Busy
[Figure: one CPU thread per GPU (CPU threads 0-3 for GPU0-GPU3) launches the per-layer fprop and bprop kernels (L0.K0, L1.K0, L1.K1, L2.K0, ...) one by one on each GPU's timeline.]
Accumulated kernel launch latency may not be trivial!

DATA PARALLEL NEURAL NETWORK
CUDA Graph to Minimize CUDA Kernel Launch Latency
[Figure: the same per-GPU kernel timelines, but the entire fprop/bprop kernel sequence on each GPU is replayed with one CUDA Graph launch.]
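HugeCTR applies this idea inside its CUDA C++ core. Purely to illustrate the mechanism, here is a hedged PyTorch sketch (PyTorch 1.10+ exposes CUDA Graphs to Python) that captures one training iteration of a small data-parallel replica and then replays the whole kernel sequence with a single graph launch per step; the model, shapes, and warm-up count are arbitrary.

    import torch

    assert torch.cuda.is_available()
    model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU(),
                                torch.nn.Linear(256, 1)).cuda()
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    static_x = torch.randn(1024, 256, device="cuda")
    static_y = torch.randn(1024, 1, device="cuda")

    # Warm-up iterations on a side stream before capture (required by CUDA Graphs).
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            opt.zero_grad(set_to_none=True)
            loss = torch.nn.functional.mse_loss(model(static_x), static_y)
            loss.backward()
            opt.step()
    torch.cuda.current_stream().wait_stream(s)

    # Capture one full training iteration into a graph.
    g = torch.cuda.CUDAGraph()
    opt.zero_grad(set_to_none=True)
    with torch.cuda.graph(g):
        static_loss = torch.nn.functional.mse_loss(model(static_x), static_y)
        static_loss.backward()
        opt.step()

    # Replay: one graph launch per iteration instead of one launch per kernel.
    for step in range(10):
        static_x.copy_(torch.randn(1024, 256, device="cuda"))
        static_y.copy_(torch.randn(1024, 1, device="cuda"))
        g.replay()

Replaying the captured graph removes the per-kernel launch latency that accumulates in the multithreaded-launch scheme above.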
HUGECTR TRAINING PERFORMANCE
Wide & Deep on DGX A100
[Charts: Wide & Deep training performance on DGX A100; one metric where higher is better and one where lower is better.]

HUGECTR MLPERF DLRM TRAINING PERFORMANCE
In the MLPerf Training v1.0/v1.1 Benchmark
Model: DLRM. Dataset: Criteo 1TB.
HugeCTR is a key driver behind the recent NVIDIA MLPerf wins.
In the v1.1 edition, optimization across the entire hardware and software stack continues to improve results.
[Chart: higher is better.]
NVIDIA MLPerf DLRM submission details: 4x4S CPU (v1.0): 1.0-1045 | 1x DGX A100 (v1.0): 1.0-1058 | 14x DGX A100 (v1.0): 1.0-1067 | 14x DGX A100 (v1.1): 1.1-2073. For more information, see www.mlperf.org.
EMBEDDING TRAINING CACHE (ETC)
Train models (TB+) larger than the memory capacity of your GPUs by iteratively prefetching portions of the embedding table into GPU memory at the granularity of a pass. This is especially useful for continuous training.
Workflow (see the sketch below):
Step 1) Load embeddings from storage into GPU memory.
Step 2) Run the training pass.
Step 3) Update the model on storage.
Repeat for the next pass.
[Figure: for pass N the embeddings for key set A, and for pass N+1 the embeddings for key set B, are staged from storage (CPU memory or SSD, TBs) into GPU memory (100s of GBs) and written back after the pass.]
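The loop below is a self-contained toy sketch of this workflow; plain Python dictionaries and NumPy arrays stand in for the TB-scale storage, the GPU-resident slice, and the training pass, and none of these names are HugeCTR APIs.

    import numpy as np

    # Toy stand-ins for illustration only: "storage" plays the role of the TB-scale
    # host/SSD copy, and "gpu_cache" the pass-sized slice held in GPU memory.
    emb_dim = 16
    storage = {key: np.random.randn(emb_dim).astype(np.float32) for key in range(100_000)}

    def train_one_pass(gpu_cache, keys):
        # Placeholder update: in real training this is a full fprop/bprop over the pass.
        for k in keys:
            gpu_cache[k] -= 0.01 * np.random.randn(emb_dim).astype(np.float32)

    passes = [np.random.choice(len(storage), size=1_000, replace=False) for _ in range(3)]

    for pass_keys in passes:
        # Step 1: prefetch only the embeddings needed by this pass into "GPU memory".
        gpu_cache = {k: storage[k].copy() for k in pass_keys}
        # Step 2: run the training pass against the cached slice.
        train_one_pass(gpu_cache, pass_keys)
        # Step 3: write the updated vectors back to storage, then move to the next pass.
        storage.update(gpu_cache)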
INTEROPERABILITY WITH OPEN NEURAL NETWORK EXCHANGE (ONNX)
Compatible with HugeCTR Training APIs
Open Neural Network Exchange (ONNX) is a common open-source format for AI models.
hugectr2onnx is a Python package that converts HugeCTR models to the ONNX format.
Input: the graph configuration JSON, the dense model, and, optionally, the sparse models.
Output: an ONNX model, which can then be used with ONNX Runtime or another framework's native inference engine, for better interoperability.

Example: converting a trained Wide & Deep model whose artifacts (wdl0_sparse_2000.model, wdl1_sparse_2000.model, wdl_dense_2000.model, wdl.json) live under wdl_model:

    hugectr2onnx.converter.convert(onnx_model_path="wdl.onnx",
                                   graph_config="wdl.json",
                                   dense_model="wdl_dense_2000.model",
                                   convert_embedding=True,
                                   sparse_models=["wdl0_sparse_2000.model",
                                                  "wdl1_sparse_2000.model"])

https:/
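Once converted, the model can be loaded with ONNX Runtime. The short sketch below assumes the wdl.onnx file produced above; inspect the exported graph's actual input names, shapes, and dtypes before feeding data.

    import onnxruntime as ort

    # Load the ONNX model exported by hugectr2onnx.
    sess = ort.InferenceSession("wdl.onnx")

    # Inspect the graph to learn the inputs expected at inference time.
    for inp in sess.get_inputs():
        print(inp.name, inp.shape, inp.type)

    # outputs = sess.run(None, {name: batch_array, ...})  # feed NumPy arrays keyed by input name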
MORE OPTIMIZATIONS & FEATURES
https:/
MLPerf Training v1.0/v1.1 optimizations (v1.0 optimizations: https:/ ); we are actively working on their full generalization. Stay tuned!
HugeCTR release notes: https:/
HugeCTR samples & notebooks: model samples: https:/ | notebooks: https:/

HUGECTR INFERENCE

CHALLENGES OF INFERENCE
Challenge 1. Embedding tables may not fit in a single GPU's memory (e.g., 100 GB, 1 TB, etc.).
Challenge 2. The target latency must be met whilst maintaining high throughput/concurrency.
Challenge 3. Loading an updated incremental model from training into inference must be streamlined (fast deployment).
Challenge 4. GPU compute and memory resources must be well isolated.
GPU-CENTRIC MEMORY HIERARCHY
High GPU Memory Bandwidth and a Fast GPU Processor for Low-Latency Inference
GPU streaming multiprocessors (SMs) provide much better computational throughput than CPUs.
GPU memory bandwidth is roughly 10x higher than CPU memory bandwidth.
GPU memory is large enough to hold the "hot" embedding vectors of a gigantic embedding table: only a few embedding vectors are required for the majority of requests.
Inference can therefore be optimized by leveraging GPUs in production.
[Figure: memory hierarchy with GPU SMs on top of GPU memory (10s of GB, 100s of GB/s to TB/s), CPU memory (100s of GB to TBs, 10s of GB/s), and storage/RAID (10s to 100s of TB, GB/s to 10s of GB/s).]
HUGECTR INFERENCE ARCHITECTURE
At a High Level
Designed to use GPU memory effectively to accelerate inference.
Supports concurrent model inference execution on the same GPU or across multiple GPUs.
Embedding tables are cached hierarchically: the parameter server manages the complete embedding tables of all models on CPU memory and SSDs, while each embedding cache holds the hot portions of a model's embedding tables and provides GPU-accelerated lookups.
Instances of the same model share an embedding cache for their predictions.
[Figure: two Model 1 instances and two Model 2 instances on the GPU share the Model 1 embedding cache (one cached table) and the Model 2 embedding cache (two cached tables), backed by a parameter server on the CPU that holds the full embedding tables of both models.]
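Deployed this way, a model is queried like any other Triton model over HTTP/gRPC. The skeleton below uses the standard tritonclient package; the model name, input names ("DES", "CATCOLUMN", "ROWINDEX" for dense features, categorical keys, and slot offsets), shapes, and dtypes are assumptions based on typical HugeCTR backend deployments and must be matched to your model's config.pbtxt.

    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Assumed tensor names/shapes/dtypes -- check the deployed model's config.pbtxt.
    dense = np.random.rand(1, 13).astype(np.float32)              # dense features, one sample
    cat_keys = np.arange(26, dtype=np.int64).reshape(1, 26)       # categorical keys
    row_index = np.arange(27, dtype=np.int32).reshape(1, 27)      # per-slot offsets into cat_keys

    inputs = [
        httpclient.InferInput("DES", list(dense.shape), "FP32"),
        httpclient.InferInput("CATCOLUMN", list(cat_keys.shape), "INT64"),
        httpclient.InferInput("ROWINDEX", list(row_index.shape), "INT32"),
    ]
    inputs[0].set_data_from_numpy(dense)
    inputs[1].set_data_from_numpy(cat_keys)
    inputs[2].set_data_from_numpy(row_index)

    result = client.infer(model_name="wdl", inputs=inputs,
                          outputs=[httpclient.InferRequestedOutput("OUTPUT0")])
    print(result.as_numpy("OUTPUT0"))   # predicted CTR for the batch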
DATABASE BACKEND
Persistent vs. Volatile
Persistent database: HugeCTR deploys, "persists", and maintains a full copy of all parameters in this database. Persistent databases are regarded as having virtually unlimited storage space. E.g., RocksDB.
Volatile database: fast! Resides in RAM (either remote or local) and can be considered a cache for the persistent database. It is assumed to operate with limited space, which means additional handlers are implemented to keep it functioning when it runs out of storage space. E.g., Redis.
The EC (embedding cache), persistent DB, and volatile DB are EVENTUALLY CONSISTENT!
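As a toy illustration of the lookup path (not HugeCTR's implementation), the sketch below fronts a "persistent" full copy with a small, bounded "volatile" cache that evicts its oldest entries when it runs out of space, which is roughly the division of labor between RocksDB and Redis described above.

    from collections import OrderedDict

    class ToyHierarchicalStore:
        """Illustrative only: a bounded in-memory cache in front of a full parameter copy."""

        def __init__(self, persistent, capacity):
            self.persistent = persistent           # full copy of all parameters (RocksDB's role)
            self.volatile = OrderedDict()          # bounded RAM cache (Redis' role)
            self.capacity = capacity

        def lookup(self, key):
            if key in self.volatile:               # fast path: hit in the volatile cache
                self.volatile.move_to_end(key)
                return self.volatile[key]
            value = self.persistent[key]           # miss: fall back to the persistent database
            self.volatile[key] = value             # cache it for subsequent queries
            if len(self.volatile) > self.capacity:
                self.volatile.popitem(last=False)  # evict the oldest entry when out of space
            return value

    store = ToyHierarchicalStore(persistent={k: [0.1] * 8 for k in range(10_000)}, capacity=100)
    vec = store.lookup(42)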
HUGECTR HIERARCHICAL PARAMETER SERVER
Multi-Level Caching across Different Types of Memory
[Figure: Level 1 is the embedding cache in GPU memory, fed through a parameter buffer pool by query, insertion, and incremental parameters; Level 2 keeps local parameters, parameter shards, and meta parameters in CPU memory, with temporary memory for the network; Level 3 replicates parameters on local SSDs, with an event sink receiving local data.]

INCREMENTAL UPDATES
Basically, a lazy queue: the optimizer publishes updates via a message sink to Kafka, and each HugeCTR instance monitors the queues of its related models.
Highly automated process: automatic workload distribution, tolerance of a certain degree of node failure, and asynchronous injection of updates in the background (see the conceptual sketch below).
[Figure: training (e.g., HugeCTR native, TensorFlow SOK) writes to a message sink, through a message buffer (e.g., Kafka, RabbitMQ, ...), to the message source on the inference side.]
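This ingestion is handled by HugeCTR itself; purely to illustrate the message flow, here is a conceptual consumer-side sketch using the third-party kafka-python package, with a hypothetical topic name and payload handler (the real wire format and topics are internal to HugeCTR).

    from kafka import KafkaConsumer   # pip install kafka-python

    def apply_incremental_update(payload: bytes) -> None:
        # Placeholder: a real consumer would decode (key, vector) pairs and merge them
        # into the hierarchical parameter server and the GPU embedding caches.
        print(f"received {len(payload)} bytes of embedding updates")

    # Hypothetical topic name, for illustration only.
    consumer = KafkaConsumer("hugectr_model_updates",
                             bootstrap_servers=["host_name:9092"],
                             auto_offset_reset="latest")

    for message in consumer:
        apply_incremental_update(message.value)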
HPS E2E SETUP EXAMPLE
From Training to Deployment

CONFIGURING HUGECTR
e.g., Inference with Triton
Just add to your Triton config:

    "persistent_db": {
        "type": "rocks_db",
        "path": "/models/wide_and_deep",
        "num_threads": 16,
        "read_only": false,
        "max_get_batch_size": 8192,
        "max_set_batch_size": 8192
    },
    "volatile_db": {
        "type": "redis_cluster",
        "address": "host_name:port,...",
        "user_name": "default",
        "password": "123456",
        "num_partitions": 16,
        "overflow_margin": 1000000,
        "overflow_policy": "evict_oldest",
        "overflow_resolution_target": 800000,
        "max_get_batch_size": 8192,
        "max_set_batch_size": 8192,
        "cache_missed_embeddings": true
    }
STREAM MODEL UPDATES LIVE INTO YOUR INFERENCE SYSTEM?
HugeCTR will automatically discover and ingest model updates. If multiple HugeCTR instances connect to shared databases, the related workload is distributed roughly evenly among them. Fast, reliable, automatic, via Apache Kafka!

In your model training code:

    import hugectr
    solver = hugectr.CreateSolver(max_eval_batches=300,
                                  batchsize_eval=2048,
                                  batchsize=2048,
                                  lr=0.001,
                                  vvgpu=[[2]],
                                  repeat_dataset=True,
                                  i64_input_key=True,
                                  kafka_brokers="host_name:port,...")

Just add to your Triton config:

    "update_source": {
        "type": "kafka_message_queue",
        "brokers": "host_name:port,...",
        "poll_timeout_ms": 50,
        "max_receive_buffer_size": 4096,
        "max_batch_size": 1024,
        "failure_backoff_ms": 300
    }
HPS PERFORMANCE
We continue to actively improve it!
[Chart: average query latency improvement of HugeCTR 3.4 over the HugeCTR 3.1 baseline (100%): RocksDB backend 1.3x, HashMap backend 4.2x, Redis backend 80.0x. Higher is better.]
Model: Wide & Deep; dataset: Criteo; evaluation performed across 2000 consecutive random queries. Cluster node configuration: DGX-1V (https:/ ); network configuration: EDR InfiniBand (https:/ ).

MORE RESOURCES ON HUGECTR INFERENCE
Documentation & Deep Dives
Documentation:
HugeCTR Inference Architecture: https:/
HugeCTR Hierarchical Parameter Server: https:/
HugeCTR Triton Backend README: https:/
Notebooks:
Hierarchical Deployment: https:/
Continuous Training and Inference, Part 1: https:/
Continuous Training and Inference, Part 2: https:/
GTC sessions:
GPU Embedding Cache Performance Deep Dive at GTC 2020: https:/
HugeCTR: Distributed Hierarchical Inference Parameter Server Using GPU Embedding Cache (S41126) at GTC 2022
HUGECTR SPARSE OPERATION KIT

SPARSE OPERATION KIT (SOK): TENSORFLOW EMBEDDING PLUGIN
https:/
Motivation: enable HugeCTR features and optimizations, e.g., the HugeCTR embedding operators, for general DL frameworks such as TensorFlow.
Sparse Operation Kit (SOK) is a Python package that provides GPU-accelerated operations for sparse training and inference (e.g., recommenders) for popular frameworks such as TensorFlow.
Its major operators are extracted from HugeCTR.

FEATURES
Good Compatibility with TensorFlow
Compatible with TensorFlow 1.15 and 2.4+.
No need to recompile or reinstall TensorFlow.
Compatible with distributed training frameworks such as Horovod.
Works with MirroredStrategy and MultiWorkerMirroredStrategy.

FEATURES
Model-Parallel GPU Embedding Layer for RecSys Training at Scale
[Figure: SOK's model-parallel (MP) embedding layer spans the GPUs, while the dense layers remain data-parallel (DP). DP stands for Data-Parallel; MP stands for Model-Parallel.]
SPARSE OPERATION KIT USAGE (HOROVOD)
Define the Training Loop with Horovod

Plain TensorFlow with Horovod:

    model = TFModel()

    @tf.function
    def train_step(inputs, labels):
        with tf.GradientTape() as tape:
            logits = model(inputs)
            loss = _loss_fn(labels, logits)
        trainable_variables = model.trainable_variables
        grads = tape.gradient(loss, trainable_variables)
        grads = [hvd.allreduce(grad) for grad in grads]
        optimizer.apply_gradients(zip(grads, trainable_variables))
        return loss

With SOK:

    sok.Init()
    model = SOKModel()

    @tf.function
    def train_step(inputs, labels):
        with tf.GradientTape() as tape:
            logits = model(inputs)
            loss = _loss_fn(labels, logits)
        emb_variables, other_variables = \
            sok.split_embedding_variable_from_others(model.trainable_variables)
        emb_grads, other_grads = tape.gradient(loss, [emb_variables, other_variables])
        other_grads = [hvd.allreduce(grad) for grad in other_grads]
        optimizer.apply_gradients(zip(emb_grads, emb_variables))
        optimizer.apply_gradients(zip(other_grads, other_variables))
        return loss

Only the dense (non-embedding) gradients are all-reduced; the gradients of the model-parallel embedding variables are applied directly.

MORE RESOURCES ON SPARSE OPERATION KIT
Documentation & Deep Dives
Documentation: https:/
https://nvidia-merlin.github.io/HugeCTR/sparse_operation_kit/master/index.html
Notebooks: SOK