
Scientific Information Research

Keywords

Large language model; "Xunzi" large language model; Zuozhuan; lexical annotation; instruction tuning

Abstract

[Purpose/significance] Large language models open new approaches to mining ancient texts, and combining them with the digitisation and intelligent processing of ancient books is a necessary path for work on ancient books in the new era. [Methods/process] Starting from the lexically annotated corpus of the Zuozhuan, this paper builds a set of high-quality lexical annotation instruction data through data cleaning and preprocessing. Large language models are then instruction-tuned on 500, 1,000, 2,000, and 5,000 examples, and each resulting model is evaluated on a separate set of 1,000 examples. [Results/conclusions] The experiments show that the "Xunzi" series of models outperforms general-domain models on the lexical annotation task for ancient texts. Xunzi-Baichuan2-7B performs best, reaching an F1 score of 81.67% when fine-tuned on 5,000 examples.
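
The pipeline summarised in the abstract (annotated corpus, instruction records, fixed-size training subsets, held-out test set, F1) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `word/tag` corpus format, the prompt wording, the tag names `n`, `v`, and `ns`, the helper names, and the span-based definition of F1.

```python
import random

# Hypothetical prompt wording; the paper's actual instruction text is not given.
PROMPT = "Segment the following ancient Chinese sentence and annotate each word with its lexical tag."

def to_instruction(tagged_sentence: str) -> dict:
    """Turn an assumed 'word/tag word/tag ...' line into an instruction record."""
    words = [tok.rsplit("/", 1)[0] for tok in tagged_sentence.split()]
    return {"instruction": PROMPT, "input": "".join(words), "output": tagged_sentence}

def split(records: list, train_size: int, test_size: int = 1000, seed: int = 0):
    """Shuffle once, hold out a test set, and take a disjoint training subset."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:test_size + train_size], shuffled[:test_size]

def spans(tagged: str) -> set:
    """Map a tagged sentence to a set of (start, end, tag) character spans."""
    out, pos = set(), 0
    for tok in tagged.split():
        # A token without '/' (malformed model output) gets an empty tag.
        word, tag = tok.rsplit("/", 1) if "/" in tok else (tok, "")
        out.add((pos, pos + len(word), tag))
        pos += len(word)
    return out

def f1(gold: list, pred: list) -> float:
    """Micro-averaged F1 over (span, tag) matches; one possible definition."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        gs, ps = spans(g), spans(p)
        tp += len(gs & ps)
        n_gold += len(gs)
        n_pred += len(ps)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)

record = to_instruction("晉侯/n 伐/v 秦/ns")
score = f1(["晉侯/n 伐/v 秦/ns"], ["晉/ns 侯/n 伐/v 秦/ns"])
```

In this toy case the prediction splits the first word incorrectly, so only two of its four spans match the three gold spans, giving a precision of 0.5 and a recall of 2/3. Calling `split(records, train_size)` with 500, 1,000, 2,000, and 5,000 and a fixed seed would yield nested training subsets evaluated against the same 1,000 held-out records, which is one plausible reading of the setup described above.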

First Page

21

