Policy Text Topic Mining Based on Improved BERTopic Model

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
Issue:: 2025, No. 05

Pages:: 90-96

Column:: Artificial Intelligence

Publication date:: 2025-05-10

Article Info

Title:: Policy Text Topic Mining Based on Improved BERTopic Model

Article ID:: 1673-629X(2025)05-0090-07

Author(s):: WANG Yu-qi1; LIU Chen1; LIU Jian-wei2*; CAI Hong-min3; 1. Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data (North China University of Technology), Beijing 100144, China;
2. Fujian Preschool Education College, Fuzhou 350007, China;
3. South China University of Technology, Guangzhou 510640, China

Keywords:: natural language processing; topic model; policy texts; BERTopic; popularity bias

CLC number:: TP393.1

DOI:: 10.20165/j.cnki.ISSN1673-629X.2024.0410

Abstract:: Applying natural language processing to text analysis has significantly improved the efficiency of extracting key information from massive data, and topic analysis methods based on natural language processing have achieved some success in this field. However, policy texts pose particular challenges, including complex scenarios, long documents, and popularity bias, so the topics produced by existing topic mining methods still leave considerable room for improvement. To address these challenges, we build on BERTopic and introduce a dynamic document embedding optimizer and a popularity-debiasing regularization term. The former addresses the limited generality caused by BERTopic mining topics only in a fixed embedding dimension; the latter addresses the homogeneous topic results caused by word-level popularity bias. Together they enable automatic selection of the optimal vector dimension for topic clustering and effective correction for popular words. Experiments on policy texts show that the improved BERTopic significantly outperforms the original BERTopic model and state-of-the-art neural topic models in topic coherence, topic diversity, and overall quality metrics. In the visualization results, the generated topics are also of significantly higher quality than those of the original model.
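
The abstract does not give the form of the popularity-debiasing term, so the following is only a minimal sketch of the general idea, not the authors' method. It computes c-TF-IDF-style topic word weights (term frequency in a topic times log(1 + A / f), where A is the average number of words per topic and f is the term's corpus frequency), then applies a hypothetical multiplicative penalty that shrinks words appearing in many topics. The function names, the penalty form, the strength parameter `lam`, and the toy data are all assumptions made for illustration. Topic diversity is measured here as the share of unique words among the top-k words of all topics.

```python
import math
from collections import Counter


def c_tf_idf(topic_tokens):
    """c-TF-IDF-style weights: tf within a topic * log(1 + A / corpus frequency)."""
    counts = {t: Counter(toks) for t, toks in topic_tokens.items()}
    corpus = Counter()
    for c in counts.values():
        corpus.update(c)
    avg_len = sum(corpus.values()) / len(counts)  # A: average words per topic
    return {
        t: {w: tf * math.log(1 + avg_len / corpus[w]) for w, tf in c.items()}
        for t, c in counts.items()
    }


def debias(weights, lam=0.5):
    """Hypothetical penalty: shrink a word by the fraction of topics it occurs in."""
    n_topics = len(weights)
    spread = Counter(w for ws in weights.values() for w in ws)  # topics per word
    return {
        t: {w: s * (1 - lam * spread[w] / n_topics) for w, s in ws.items()}
        for t, ws in weights.items()
    }


def top_words(weights, k=2):
    """Top-k words per topic; ties are broken alphabetically for determinism."""
    return {t: sorted(ws, key=lambda w: (-ws[w], w))[:k] for t, ws in weights.items()}


def topic_diversity(tops):
    """Share of unique words among the top-k words of all topics."""
    words = [w for ws in tops.values() for w in ws]
    return len(set(words)) / len(words)


# Toy corpus (made up): "policy" dominates every topic, mimicking a popular word.
topic_tokens = {
    "education": ["policy"] * 6 + ["school"] * 2 + ["teacher", "student"],
    "health": ["policy"] * 6 + ["hospital"] * 2 + ["doctor", "patient"],
    "transport": ["policy"] * 6 + ["road"] * 2 + ["bus", "metro"],
}

weights = c_tf_idf(topic_tokens)
baseline = topic_diversity(top_words(weights))
debiased = topic_diversity(top_words(debias(weights, lam=0.5)))
print("diversity without penalty:", baseline)
print("diversity with penalty:", debiased)
```

On this toy data the shared word "policy" sits in every topic's top-2 list before the penalty and drops out afterwards, so the diversity score rises. A real setup would tune `lam` against coherence as well, since an overly strong penalty can also remove words that are genuinely informative.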
