[1] SHI Yan, LI Chao-feng. A Method of Text Keyword Calculation by Combining Statistics with Relationship Between Words[J]. Computer Technology and Development, 2015, 25(12): 22-27.

 A Method of Text Keyword Calculation by Combining Statistics with Relationship Between Words

Computer Technology and Development [ISSN: 1006-6977 / CN: 61-1281/TN]

Volume:
25
Issue:
No. 12, 2015
Pages:
22-27
Section:
Intelligence, Algorithms, and Systems Engineering
Publication date:
2015-12-10

Article Information

Title:
 A Method of Text Keyword Calculation by Combining Statistics with Relationship Between Words
Article ID:
1673-629X(2015)12-0022-06
Author(s):
 SHI Yan, LI Chao-feng
 School of Internet of Things Engineering, Jiangnan University
Keywords:
 text feature; similarity calculation; mutual information; SimHash; feature extraction; text de-duplication
CLC number:
TP301
Document code:
A
Abstract:
 In the keyword calculation and extraction stage of Chinese text similarity de-duplication, word segmentation yields terms that are high-dimensional, sparse, and lacking in semantics. Most of these terms carry no practical meaning and add noise to the calculation, which hinders de-duplication. It is therefore necessary to extract text features that represent the main content of the text. To address this problem, an improved keyword extraction method is proposed that combines word frequency, the mutual-information correlation between terms, and their semantic similarity. The method jointly considers the statistical characteristics of candidate words and the relevance and similarity between terms, and it is applied to the SimHash text similarity computing model. Experimental results show that feature extraction based on this model achieves high precision, recall, and F1 in similar-text de-duplication and outperforms traditional methods.
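
As a rough illustration of the SimHash step mentioned in the abstract, the Python sketch below builds a 64-bit fingerprint from weighted terms and compares two fingerprints by Hamming distance. It is not the authors' implementation. The function names (`tf_weights`, `simhash`, `hamming_distance`) are hypothetical, plain term frequency stands in for the paper's combined weight (word frequency, mutual information, and semantic similarity), whose exact formula is not given on this page, and MD5 is only one convenient choice of stable hash.

```python
import hashlib
from collections import Counter


def tf_weights(tokens):
    """Placeholder weighting: relative term frequency.

    Assumption: the paper replaces this with a weight that also uses
    mutual information and semantic similarity between terms.
    """
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def simhash(weights, bits=64):
    """Build a SimHash fingerprint from a {term: weight} mapping (bits <= 64)."""
    totals = [0.0] * bits
    for term, weight in weights.items():
        # Stable 64-bit hash of the term: first 8 bytes of its MD5 digest.
        h = int.from_bytes(hashlib.md5(term.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            # Add the weight where the hash bit is 1, subtract it where it is 0.
            if (h >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    doc_a = ["text", "keyword", "extraction", "simhash", "text"]
    doc_b = ["text", "keyword", "extraction", "hashing", "text"]
    fp_a = simhash(tf_weights(doc_a))
    fp_b = simhash(tf_weights(doc_b))
    print(hamming_distance(fp_a, fp_b))
```

Two documents are then treated as near-duplicates when the Hamming distance between their fingerprints falls below a threshold. The threshold is a tuning choice and is not specified on this page.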


Last Update: 2016-01-28