Scientific Information Research

Research on the Application of Part-of-speech Tagging of Ancient Books under the Domain Large Language Model

Danhao ZHU, Department of Criminal Science and Technology, Jiangsu Police Institute, Nanjing 210031
Zhixiao ZHAO, School of Information Management, Nanjing Agricultural University, Nanjing 210095
Die HU, School of Information Management, Nanjing Agricultural University, Nanjing 210095
Wenhua ZHAO, School of Information Management, Nanjing Agricultural University, Nanjing 210095

Keywords

Large language model; "Xunzi" large language model; Zuozhuan; part-of-speech tagging; instruction tuning

Abstract

[Purpose/significance] The development of large language models has opened new approaches to mining ancient texts, and combining large language models with the digitisation and intelligent processing of ancient books is a necessary path for ancient-book work in the new era. [Methods/process] This paper uses the part-of-speech-tagged corpus of Zuozhuan to construct a set of high-quality part-of-speech tagging instruction data through data cleaning and preprocessing. On this basis, 500, 1,000, 2,000, and 5,000 instruction samples are used in turn to instruction-tune the large language models, and each resulting model is tested on a separate set of 1,000 samples. [Results/conclusions] The experimental results show that the "Xunzi" series of models outperforms the general-domain models on the part-of-speech tagging task for ancient texts, and that the Xunzi-Baichuan2-7B model performs best, with an F1 value of 81.67% when the amount of fine-tuning data reaches 5,000 samples.
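The pipeline summarised in the abstract (tagged corpus, instruction data, F1 evaluation) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the `word/tag` input format, the wording of `INSTRUCTION`, the record field names, the tag labels in the usage note, and the span-level definition of F1.

```python
# Hypothetical prompt wording; the paper's actual instruction text is not given here.
INSTRUCTION = (
    "Segment the following Classical Chinese sentence and "
    "tag each word with its part of speech."
)


def parse_tagged(line):
    """Split a 'word/tag word/tag' string into a list of (word, tag) pairs."""
    pairs = []
    for token in line.split():
        if "/" in token:
            word, _, tag = token.rpartition("/")
        else:
            word, tag = token, ""  # token without a tag: keep the word, empty tag
        pairs.append((word, tag))
    return pairs


def to_instruction_record(line):
    """Build one instruction-tuning record (assumed field names) from a tagged sentence."""
    pairs = parse_tagged(line)
    return {
        "instruction": INSTRUCTION,
        "input": "".join(word for word, _ in pairs),
        "output": " ".join(f"{word}/{tag}" for word, tag in pairs),
    }


def spans(pairs):
    """Convert (word, tag) pairs into a set of (start, end, tag) character spans."""
    result, pos = set(), 0
    for word, tag in pairs:
        result.add((pos, pos + len(word), tag))
        pos += len(word)
    return result


def f1_score(gold_line, pred_line):
    """Span-level F1: a predicted word counts only if boundaries and tag both match."""
    gold = spans(parse_tagged(gold_line))
    pred = spans(parse_tagged(pred_line))
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)
```

As a usage example with made-up tags, `to_instruction_record("晉侯/n 伐/v 鄭/ns")` yields a record whose `input` is the raw sentence `晉侯伐鄭` and whose `output` is the tagged form. Scoring the prediction `晉/ns 侯/n 伐/v 鄭/ns` against that gold sentence gives precision 2/4 and recall 2/3, so the F1 is 4/7.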

Recommended Citation

ZHU, Danhao; ZHAO, Zhixiao; HU, Die; and ZHAO, Wenhua (2024) "Research on the Application of Part-of-speech Tagging of Ancient Books under the Domain Large Language Model," Scientific Information Research: Vol. 6: Iss. 2, Article 3.
Available at: https://eng.kjqbyj.com/journal/vol6/iss2/3
