Search Science

Pre-trained Language Model for Web-Scale Retrieval & Ranking in Baidu Search
Shuaiqiang WANG
http:/

Outline
1. Background
2. Retrieval
3. Ranking
4. Summary

Background
- Retrieval and ranking are two crucial stages in a web-scale search engine:
  Query → Retrieval (over web-scale documents) → Ranking (over a few hundred or thousand candidates) → Results

Baidu ERNIE
1. Sun, Y. et al., 2019. ERNIE: Enhanced representation through knowledge integration. arXiv:1904.09223.
2. Sun, Y. et al., 2020. ERNIE 2.0: A continual pre-training framework for language understanding. In AAAI.

Background
- Beyond text matching: semantic retrieval & ranking
  - Representation-based methods
    - Representation: document semantics as latent vectors
    - Retrieval: nearest-neighbor search in the latent space (query → semantically-related candidates)
  - Interaction-based models
    - Ranking: matching over the local interactions
  *Picture from: Dai, Andrew M., Christopher Olah, and Quoc V. Le. Document embedding with paragraph vectors. arXiv:1507.07998 (2015).

Challenges
- Semantic retrieval
  - Effectively understand the semantics of queries and documents
  - Large number of low-frequency queries

  - Web-scale retrieval system
- Semantic ranking
  - Expensive computations
  - Ranking-agnostic pre-training
(Figure: ERNIE pre-training, jointly encoding masked sentence A and masked sentence B with [CLS]/[SEP] tokens)

Our contribution: one of the largest applications of PLM for web-scale retrieval & ranking
1. Zou, L. et al. Pre-trained Language Model based Ranking in Baidu Search. In KDD 2021.
2. Liu, Y. et al. Pre-trained Language Model for Web-scale Retrieval in Baidu Search. In KDD 2021.

Retrieval

Methodology: Retrieval Model
- Goal: learning query-document semantic relatedness
- Backbone: a bi-encoder (i.e., two-tower) architecture*, with
  - Query & doc encoders: transformers over the tokenized query and doc
  - CLS-pooling on each tower, yielding a query embedding and a doc embedding whose match gives the retrieval score (a minimal sketch follows)
*Chang, Wei-Cheng, et al. Pre-training tasks for embedding-based large-scale retrieval. arXiv:2002.03932 (2020).
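A minimal sketch of this two-tower scoring path, with toy transformer encoders standing in for ERNIE; the dimensions, vocabulary size, and inner-product scorer are illustrative assumptions rather than the deployed configuration:

```python
import torch
import torch.nn as nn

class TowerEncoder(nn.Module):
    """One tower: token embeddings -> transformer -> CLS-pooling."""
    def __init__(self, vocab_size=30000, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, token_ids):              # (batch, seq_len)
        h = self.encoder(self.embed(token_ids))
        return h[:, 0]                         # CLS-pooling: first-token state as the stand-in [CLS]

query_enc, doc_enc = TowerEncoder(), TowerEncoder()
q = query_enc(torch.randint(0, 30000, (8, 16)))   # query embeddings
d = doc_enc(torch.randint(0, 30000, (8, 128)))    # doc embeddings
score = (q * d).sum(-1)                           # inner-product retrieval score per pair
```

Because the two towers never attend to each other, doc embeddings can be precomputed offline and matched with nearest-neighbor search at query time.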

Methodology: Retrieval Model
- Goal: learning query-document semantic relatedness
- Poly-attention: bi-encoders with more query-document interaction*
  - Produce multiple embeddings on the query side: m learned attention codes attend over the query token states, yielding query embeddings P1, ..., Pm
  - Each Pk is matched against the doc embedding to give scores s1, ..., sm; the final retrieval score is the maximum over the m scores (a sketch follows)
*Humeau, Samuel, et al. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. arXiv:1905.01969 (2019).
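A sketch of the poly-attention step; the number of codes, dimensions, and random inputs are illustrative, and the max over per-code scores follows the slide's final scoring rule:

```python
import torch
import torch.nn as nn

class PolyAttentionQuery(nn.Module):
    """m learned codes attend over query token states -> m query embeddings."""
    def __init__(self, d_model=256, num_codes=4):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(num_codes, d_model))

    def forward(self, token_states):                          # (batch, seq, d)
        attn = torch.softmax(self.codes @ token_states.transpose(1, 2), dim=-1)
        return attn @ token_states                            # (batch, num_codes, d)

poly = PolyAttentionQuery()
states = torch.randn(8, 16, 256)      # query token states from the query encoder
doc = torch.randn(8, 256)             # doc embedding from the doc encoder
scores = torch.einsum('bkd,bd->bk', poly(states), doc)   # s_1 ... s_m per pair
final = scores.max(dim=-1).values                        # retrieval score = max_k s_k
```

The extra interaction stays on the query side only, so documents still reduce to a single precomputable vector.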

Methodology: Retrieval Model
- Positive & negative data mining for different data sources
  - Search log: positives are user-clicked documents; negatives are non-clicked documents
  - Manually labeled data: positives are high-scored documents; negatives are low-scored documents
- In-batch negative mining: introducing random negatives
  - Within a batch, each query forms a relevant (q, d) pair with its own document; irrelevant (q, d) pairs come both from the mined strong negatives and from the other queries' documents (random negatives)
  - Benefits: more aligned with the retrieval task; efficiently scales up the number of negatives (a sketch follows)
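The slides do not spell out the training objective; a common formulation consistent with in-batch negative mining is a softmax cross-entropy in which each query's positive document sits on the diagonal of the batch score matrix. A hedged sketch:

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(q_emb, d_emb):
    """Softmax cross-entropy over in-batch documents: query i's positive doc
    is row i of d_emb; every other doc in the batch (other queries' positives
    and mined strong negatives) acts as a random negative for free."""
    logits = q_emb @ d_emb.t()                  # (B, B) pairwise score matrix
    labels = torch.arange(q_emb.size(0))        # positives on the diagonal
    return F.cross_entropy(logits, labels)

# Toy batch: 8 query embeddings paired with their 8 positive doc embeddings.
loss = in_batch_negative_loss(torch.randn(8, 128), torch.randn(8, 128))
```

This is why in-batch mining scales negatives efficiently: a batch of B pairs yields B - 1 negatives per query without encoding any extra documents.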

Methodology: Training Paradigm
- Multi-stage training: unsupervised → supervised, moving from a general corpus to task-specific data

Methodology: Embedding Compression
- Model deployment: compression and quantization
  - Compression of the doc embedding with an additional FC layer
  - Quantization of the compressed doc embedding (a sketch follows)
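A minimal numpy sketch of the deployment-side compression and quantization; the 768→256 shapes, the random stand-in for the learned FC weights, and symmetric int8 quantization are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(768, 256)).astype(np.float32)    # learned FC weights (stand-in)
docs = rng.normal(size=(10000, 768)).astype(np.float32)

compressed = docs @ W                                 # FC-layer compression
scale = np.abs(compressed).max() / 127.0              # symmetric linear quantization
quantized = np.round(compressed / scale).astype(np.int8)   # stored in the index
dequantized = quantized.astype(np.float32) * scale          # recovered at search time
```

Together the two steps cut index storage by roughly an order of magnitude (here 768 float32 values down to 256 int8 values per doc) at a small cost in score fidelity.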

Methodology: System Workflow
- Deployment
  - Integrating term-based (text-matching) & ERNIE-based retrieval
  - Unifying results with post-retrieval filtering

Evaluation
- Online evaluation metrics: ΔDCG & ΔGSB, where #Good = the number of queries on which the new system performs better (the full ΔGSB definition follows)
- Results (shown as a table in the original slides)
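The slide spells out only #Good; the full side-by-side metric, as the cited KDD 2021 papers define it, also counts #Bad (queries where the new system performs worse) and #Same (ties):

\[
\Delta \mathrm{GSB} = \frac{\#\mathrm{Good} - \#\mathrm{Bad}}{\#\mathrm{Good} + \#\mathrm{Same} + \#\mathrm{Bad}}
\]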

Ranking

Content-aware Pre-trained Language Model: Pyramid-ERNIE

Method           Time complexity
Original ERNIE   O(L · h · (N_q + N_t + N_s)^2)
Pyramid-ERNIE    O(L_low · h · (N_q + N_t)^2 + L_low · h · N_s^2 + L_high · h · (N_q + N_t + N_s)^2)

(L: encoder layers, split into L_low lower and L_high higher layers; h: attention heads; N_q, N_t, N_s: query, title, and summary lengths. The lower layers encode the query-title pair and the summary separately; only the higher layers attend over the full concatenation.)

QUery-WeIghted Summary ExTraction (QUIET)
- STEP 1: Assign each query term a weight (Term1: W1, Term2: W2, Term3: W3).
- STEP 2: Score each candidate sentence by the total weight of the query terms it contains, e.g., Sentence1 covers Term1 and Term3 (score W1 + W3) while Sentence2 covers only Term1 (score W1).
- STEP 3: Choose the sentence with the max score and remove the selected sentence from the candidates; repeat to build the query-dependent summary (a sketch follows).
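A sketch of QUIET's greedy loop under the slide's description; tie-breaking and whether term weights are discounted after a selection are unspecified, so this version keeps the weights fixed:

```python
def quiet_summary(sentences, term_weights, max_sents=2):
    """Greedy query-weighted extraction: score each candidate sentence by the
    total weight of the query terms it contains, pick the max-scoring
    sentence, remove it from the candidates, and repeat."""
    selected, candidates = [], list(sentences)
    for _ in range(max_sents):
        if not candidates:
            break
        # (score, sentence) pairs; max() breaks score ties lexicographically.
        scored = [(sum(w for term, w in term_weights.items() if term in s), s)
                  for s in candidates]
        _, best = max(scored)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy example mirroring the slide: Sentence1 covers Term1 and Term3 (score W1 + W3),
# Sentence2 covers only Term1 (score W1), so Sentence1 is selected first.
weights = {"term1": 0.5, "term2": 0.3, "term3": 0.4}
sentences = ["a term1 b term3", "c term1 d"]
print(quiet_summary(sentences, weights))
```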

Finetune with Web-Scale Calibrated Clicks
- Raw clicks: noisy and inconsistent with relevance
- Calibrated clicks: aligning clicks with human labels; a label generator trained on human labels maps raw clicks to calibrated clicks (a sketch follows)
- Training pipeline: General ERNIE pretrained with general data → post-pretrained with the search log → fine-tuned with calibrated clicks → fine-tuned with human labels (ERNIE for Search)
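The slide leaves the label generator's form open; a hedged sketch assuming a gradient-boosted classifier over hypothetical click features (CTR, skip rate, dwell time), trained on the human-labeled subset and then applied to the raw log:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical per-(query, doc) click features: [CTR, skip rate, dwell time (s)].
X_human = np.array([[0.8, 0.1, 120.0], [0.7, 0.2, 90.0], [0.1, 0.7, 5.0],
                    [0.2, 0.6, 8.0], [0.5, 0.3, 40.0], [0.4, 0.4, 35.0]])
y_human = np.array([2, 2, 0, 0, 1, 1])   # human relevance grades

# Label generator: learns to map click statistics to human-style grades.
label_generator = GradientBoostingClassifier(n_estimators=50).fit(X_human, y_human)

# Applied to the web-scale click log, it emits calibrated-click labels
# that the ranking model is then fine-tuned on.
X_log = np.array([[0.6, 0.2, 80.0], [0.05, 0.9, 2.0]])
calibrated_labels = label_generator.predict(X_log)
print(calibrated_labels)
```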

Finetune with Human Labels
We manually labeled millions of query-document pairs and train the Pyramid-ERNIE with a mixture of pairwise and pointwise loss:

\[
\ell(Y, F(q, D)) = \sum_{y_i \prec y_j} \max\bigl(0,\; f(q, d_i) - f(q, d_j) + \tau\bigr) + \lambda \bigl(\delta(f(q, d_i), y_i) + \delta(f(q, d_j), y_j)\bigr)
\]

where f(q, d) is the predicted ranking score, the sum runs over pairs in which d_j is human-preferred over d_i, \tau is the pairwise margin, \delta is a pointwise loss anchoring scores to the human labels, and \lambda balances the two terms.
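A PyTorch sketch of the mixed objective above, for a batch of preference pairs ordered so that d_j is preferred over d_i; the margin τ, the weight λ, and MSE as the pointwise δ are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def mixed_ranking_loss(s_i, s_j, y_i, y_j, tau=0.1, lam=0.5):
    """Pairwise hinge + pointwise anchoring for pairs with y_i < y_j."""
    # Pairwise term: the preferred doc d_j should outscore d_i by margin tau.
    pairwise = torch.clamp(s_i - s_j + tau, min=0.0).mean()
    # Pointwise term (delta): anchor raw scores to the human grades; MSE here.
    pointwise = F.mse_loss(s_i, y_i) + F.mse_loss(s_j, y_j)
    return pairwise + lam * pointwise

# Toy batch of score pairs and their human grades.
s_i, s_j = torch.tensor([0.2, 0.9]), torch.tensor([0.8, 0.7])
y_i, y_j = torch.tensor([0.0, 0.5]), torch.tensor([1.0, 1.0])
print(mixed_ranking_loss(s_i, s_j, y_i, y_j))
```

The pointwise term is what makes the fine-tuning "human-anchored": scores stay calibrated to absolute grades instead of only preserving pairwise order.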

Evaluation
- Base: a basic ERNIE-based ranking policy, fine-tuned with a pairwise loss using human-labeled query-document pairs.
- Content-aware Pyramid-ERNIE (CAP): a Pyramid-ERNIE architecture, incorporating the query-dependent document summary into the deep contextualization to better capture the relevance between the query and document.
- Relevance-oriented Pre-training (REP): pre-training the Pyramid-ERNIE model with refined large-scale user-behavioral data before fine-tuning it on the task data.
- Human-anchored Fine-tuning (HINT): anchors the ranking model with human-preferred relevance scores.

Online results (relative improvements over Base):

Model           ΔDCG2   ΔDCG4   ΔAB (Random)  ΔAB (Long-Tail)  ΔGSB (Random)  ΔGSB (Long-Tail)
Base            -       -       -             -                -              -
+CAP            0.65%   0.76%   0.15%         0.35%            3.50%          6.00%
+CAP+REP        2.78%   1.37%   0.58%         0.41%            5.50%          7.00%
+CAP+REP+HINT   2.85%   1.58%   0.14%         0.45%            6.00%          7.50%

Improvements are statistically significant (t-test with p < 0.05 over the baseline).

Summary

Conclusion
- PLM-based retrieval and ranking models
  - ERNIE-based models
  - Multi-stage training paradigm
  - Fully deployed online
- A simple search pipeline: Query → Retrieval (over web-scale documents) → Ranking (over hundreds or thousands of candidates) → Results

We are hiring! Please drop a message if interested.

Thank You
Shuaiqiang WANG
http:/
