Journal of the China Society for Scientific and Technical Information (情报学报), Feb. 2023, 42(2): 189-202
DOI: 10.3772/j.issn.1000-0135.2023.02.006

Study on Identification of Potential “Treasures” in Massive Papers Based on Machine Learning Models

Hu Zewen, Ren Ping and Cui Jingjing
(School of Management Science and Engineering, Nanjing University of Information Science & Technology, Nanjing 210044)

Abstract: Combining a feature vector space of scientific literature with machine learning models to identify and recommend potential “treasures” in massive literature automatically can enhance the scientific influence of that literature and its contribution to advances in science and technology. This study designs and implements machine-learning classifiers and a model framework for identifying potential “treasures” among scientific and technological papers. As samples, we collected papers and their citation data from international high-impact journals indexed in Web of Science and from domestic journals in library, information and archives management. We measured the full-text and citation features of these papers and used feature engineering to construct a feature vector space. Traditional machine learning models, such as support vector machine and naive Bayes, and deep learning models, such as deep belief networks and multilayer perceptron, were then used to identify potential “treasures” automatically, and an index system for evaluating the recognition effect was built on the receiver operating characteristic (ROC) curve and the confusion matrix. The results show that deep learning models perform poorly at identifying potential “treasures”, whereas traditional machine learning models perform better: random forest and support vector machine achieve the best recognition effect, decision tree ranks next, and naive Bayes performs poorly and is unstable. The higher a journal's impact factor, the better the recognition effect. For both international high-impact journals in the natural sciences and domestic social-science journals in library, information and archives management, all identified “treasure” papers have relatively high citation counts and few are review papers; only one “treasure” paper from the domestic journals is a review. Compared with the overall sample, the bibliometric features of “treasure” papers differ markedly: they have a shorter first-response (first-citation) time, are supported by funding, have more references, more keywords and more citations, have longer abstracts and longer full texts, and tend to be multi-author papers. The empirical results show that machine learning models can accurately identify potential “treasures” in scientific and technological literature and increase the degree of automation of that identification, providing theoretical reference and methodological support for the automatic recognition, dissemination, and utilization of potential “treasures” in massive literature.

Key words: machine learning; deep learning; excellent literature; feature engineering; random forest; support vector machine; naive Bayes model; deep belief networks

Received: 2021-12-16; Revised: 2022-10-25
Funding: National Social Science Fund of China project “Research on Methods and Applications of Potential Treasure Identification for Massive Scientific and Technological Literature” (20CTQ031).
About the authors: Hu Zewen, male, born 1985, PhD, associate professor, master's supervisor; main research areas: scientometrics and science and technology evaluation, data mining and information analysis. Ren Ping, female, born 1995, master's degree; main research areas: data mining and information analysis. Cui Jingjing, female, born 1998, master's degree; main research areas: data mining and information analysis.

0 Introduction

General Secretary Xi Jinping has stressed establishing a categorized evaluation system oriented toward the quality, contribution, and performance of scientific and technological innovation; correctly evaluating the scientific, technological, economic, social, and cultural value of scientific and technological achievements; insisting on offering “treasures” to the people; and explaining well the Chinese spirit and Chinese values [1-2]. As scientific literature keeps emerging and its volume surges, what characteristics “treasure” literature has, and how to identify high-impact or high-quality “treasures” in massive literature so that they can be widely read and recommended, are scientific questions worth attention with considerable practical significance [3-4]. The scientific value of massive literature is distributed like a pyramid: highly cited or high-quality literature at the top is scarce, less than 20% of the total, while zero-cited or low-cited literature at the bottom is plentiful, about 80% of the total, a typical instance of the “80/20 rule”. At present, high-impact literature such as highly cited papers or sleeping beauties can already be identified by citation statistics; however, such literature accounts for a very small share and can be identified only after a certain citation window has passed. In fact, massive literature contains a large number of potential “treasures” that have the characteristics of high-impact or high-quality literature. If these can be identified quickly and accurately through feature matching and machine learning and then promoted, the delayed use of a paper's value caused by the citation window can be avoided.

Machine learning models include long short-term memory networks, random forest, support vector machine, naive Bayes, probabilistic neural networks, deep belief networks, multilayer perceptron, and others; they are commonly used classic models in artificial intelligence, document classification, and knowledge mining, and in experiential learning, memory learning,