[1] WANG Yu-qi, LIU Chen, LIU Jian-wei*, et al. Policy Text Topic Mining Based on Improved BERTopic Model[J]. Computer Technology and Development, 2025, (05): 90-96. [doi:10.20165/j.cnki.ISSN1673-629X.2024.0410]

Policy Text Topic Mining Based on Improved BERTopic Model

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
Issue:
2025, No. 05
Pages:
90-96
Column:
Artificial Intelligence
Publication date:
2025-05-10

Article Info

Title:
Policy Text Topic Mining Based on Improved BERTopic Model
Article ID:
1673-629X(2025)05-0090-07
Author(s):
WANG Yu-qi 1, LIU Chen 1, LIU Jian-wei 2*, CAI Hong-min 3
1. Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data (North China University of Technology), Beijing 100144, China;
2. Fujian Preschool Education College, Fuzhou 350007, China;
3. South China University of Technology, Guangzhou 510640, China
Keywords:
natural language processing; topic model; policy texts; BERTopic; popularity bias
Classification number:
TP393.1
DOI:
10.20165/j.cnki.ISSN1673-629X.2024.0410
Abstract:
Natural language processing has significantly improved the efficiency of extracting key information from massive text data, and topic analysis methods built on it have achieved some success in text analysis. However, policy texts pose particular challenges: complex scenarios, long documents, and a head effect (popularity bias). As a result, the topics produced by existing topic mining methods still leave considerable room for improvement. To address these challenges, this paper builds on BERTopic and introduces a dynamic document embedding optimizer and a popularity-debiasing regularization term. The former addresses the limited generality caused by BERTopic mining topics only in a fixed embedding dimension, by automatically selecting the best vector dimension for topic clustering. The latter addresses the homogeneous topics caused by word-level popularity bias, by effectively correcting for popular words. Experiments on policy texts show that the improved BERTopic significantly outperforms both the original BERTopic model and state-of-the-art neural topic models in topic coherence, topic diversity, and overall quality metrics. In visualizations, the generated topics are also of significantly higher quality than those of the original model.
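To make the word-level popularity bias concrete, the following is a minimal sketch and not the paper's method. It scores topic words with a c-TF-IDF-style weight in the spirit of BERTopic, then applies a hypothetical penalty (`penalize_popular`, with an assumed strength parameter `lam`) that down-weights words appearing in many topics. The function names, the exponential penalty form, and the toy word counts are all assumptions made for illustration; the abstract does not specify the actual regularization term. Topic diversity is computed in the usual way, as the fraction of unique words among the top-k words of all topics.

```python
import math
from collections import Counter


def c_tf_idf(class_counts):
    """c-TF-IDF-style score per (topic, word): tf * log(1 + A / f_w).

    A is the average number of words per topic and f_w is the total
    count of word w across all topics.
    """
    total = Counter()
    for counts in class_counts.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(class_counts)
    scores = {}
    for topic, counts in class_counts.items():
        n = sum(counts.values())
        scores[topic] = {
            w: (c / n) * math.log(1 + avg_words / total[w])
            for w, c in counts.items()
        }
    return scores


def penalize_popular(scores, lam=1.0):
    """Hypothetical penalty (an assumption, not the paper's regularizer):
    multiply each score by exp(-lam * share), where share is the fraction
    of topics in which the word occurs."""
    n_topics = len(scores)
    spread = Counter(w for words in scores.values() for w in words)
    return {
        t: {w: s * math.exp(-lam * spread[w] / n_topics) for w, s in words.items()}
        for t, words in scores.items()
    }


def top_words(scores, k):
    """Top-k words per topic, highest score first."""
    return [sorted(words, key=words.get, reverse=True)[:k] for words in scores.values()]


def topic_diversity(topics):
    """Fraction of unique words among all top words of all topics."""
    flat = [w for topic in topics for w in topic]
    return len(set(flat)) / len(flat)


# Toy word counts (invented for illustration): "policy" is a popular
# word that dominates both topics.
class_counts = {
    "economy": Counter({"policy": 8, "tax": 3, "subsidy": 2}),
    "education": Counter({"policy": 8, "school": 3, "teacher": 2}),
}

base = c_tf_idf(class_counts)
debiased = penalize_popular(base, lam=1.0)

print(top_words(base, 2), topic_diversity(top_words(base, 2)))
print(top_words(debiased, 2), topic_diversity(top_words(debiased, 2)))
```

On this toy input the shared word "policy" appears in the top two words of both topics before the penalty and in neither afterwards, so topic diversity rises from 0.75 to 1.0. A real regularizer would need to balance this gain against topic coherence, which this sketch does not measure.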


Last Update: 2025-05-10