Scientific Information Research

Research on Word Segmentation of Ancient Books Based on Domain Large Language Model

Danhao ZHU, Department of Criminal Science and Technology, Jiangsu Police Institute, Nanjing 210031
Zhixiao ZHAO, School of Information Management, Nanjing Agricultural University, Nanjing 210095
Na WU, School of Information Management, Nanjing Agricultural University, Nanjing 210095
Xiyu WANG, School of Information Management, Nanjing Agricultural University, Nanjing 210095

Keywords

"Xunzi" large language model; Zuozhuan; segmentation; instruction tuning

Abstract

[Purpose/significance] Taking automatic word segmentation of ancient books as an entry point, this paper introduces the "Xunzi" series of large language models and evaluates how well large language models perform on word segmentation of ancient texts. [Method/process] An instruction dataset is constructed from the Zuozhuan after data cleaning and organisation. From this dataset, 1,000 items are held out as test data; 500, 1,000, 2,000, and 5,000 items are then used in turn as training data for instruction tuning, and the performance of each fine-tuned model is tested. [Result/conclusion] The experimental results show that a large language model needs only a relatively small amount of data to achieve good performance, and that the Xunzi-Qwen-7B model performs best, reaching an F1 value of 84.54% when the amount of fine-tuning data reaches 5,000 items.
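
The abstract does not spell out the instruction format or the scoring procedure, so the following is only a minimal sketch of how a segmentation instruction record and a word-level F1 score might be computed. The function names, the prompt wording, the "/" delimiter, and the sample sentence with its segmentation are all illustrative assumptions, not details taken from the paper or its dataset.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_prf(gold_sents, pred_sents):
    """Word-level precision, recall, and F1 over parallel lists of segmented sentences.

    A predicted word counts as correct only if both of its boundaries
    match a word in the gold segmentation.
    """
    correct = n_gold = n_pred = 0
    for gold, pred in zip(gold_sents, pred_sents):
        g, p = word_spans(gold), word_spans(pred)
        correct += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def build_instruction_record(words, delimiter="/"):
    """Build one instruction-tuning record (hypothetical schema and prompt text)."""
    return {
        "instruction": "Segment the following classical Chinese text into words.",
        "input": "".join(words),
        "output": delimiter.join(words),
    }


# Illustrative sentence and segmentation (an assumption, not from the paper's data).
gold = ["鄭伯", "克", "段", "于", "鄢"]
pred = ["鄭", "伯", "克", "段", "于", "鄢"]  # a hypothetical model output

record = build_instruction_record(gold)
precision, recall, f1 = segmentation_prf([gold], [pred])
print(record)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

In this toy case four of the six predicted words match the five gold words, so precision is 4/6, recall is 4/5, and F1 is 8/11. A corpus-level score like the one reported in the abstract would accumulate the counts over all test items in the same way.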

Recommended Citation

ZHU, Danhao; ZHAO, Zhixiao; WU, Na; and WANG, Xiyu (2024) "Research on Word Segmentation of Ancient Books Based on Domain Large Language Model," Scientific Information Research: Vol. 6: Iss. 2, Article 2.
Available at: https://eng.kjqbyj.com/journal/vol6/iss2/2

Included in

Scholarly Communication Commons
