Journal of Scientific Information Research
Keywords
vector database; multimodal data fusion; vector data retrieval; vector data indexing; AI application ecosystem
Abstract
[Purpose/significance] This article examines the theoretical, technological, and application systems of vector databases, aiming to promote innovation in the research and practice of multimodal AI theories, technologies, and applications. [Method/process] Through literature tracing and content analysis, the article traces the evolution of vector databases and defines their core concepts. It then compares and analyzes their characteristics and value and, on that basis, sorts out their application mechanisms, functions, corresponding key technologies, and application modes. It also discusses the challenges facing vector databases and the corresponding countermeasures, and looks ahead to their development trends from theoretical, technical, and application perspectives. [Result/conclusion] Vector databases originated in the construction of vector index method systems, developed through the construction of vector data retrieval engines, and matured with the construction of vector database management systems. Compared with relational databases and graph databases, vector databases show distinct characteristics in their data models and indexing mechanisms. They offer value to users, data managers, developers, and researchers. The key technologies are divided into
First Page
11
Last Page
24
Submission Date
13-Apr-2024
Revision Date
12-Jun-2024
Acceptance Date
02-Jul-2024
Published Date
01-Oct-2024
Reference
[1] LI H Y,SU Y X,DENG C,et al.A Survey on Retrieval-augmented Text Generation[DB/OL].[2024-03-28].https://arxiv.org/abs/2202.01110.
[2] 阿里云.阿里云天池数据集[DS/OL].[2024-03-28].https://tianchi.aliyun.com/dataset/.
[3] VOORHEES E M.The Cluster Hypothesis Revisited[C]//Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.Montreal,Quebec,Canada:Association for Computing Machinery,1985:188-196.
[4] DEERWESTER S,DUMAIS S T,FURNAS G W,et al.Indexing by Latent Semantic Analysis[J].Journal of the American Society for Information Science,1990,41(06):391-407.
[5] SIVIC J,ZISSERMAN A.Video Google:A Text Retrieval Approach to Object Matching in Videos[C]//Ninth IEEE International Conference on Computer Vision.Nice,France:IEEE,2003,2:1470-1477.
[6] GIONIS A,INDYK P,MOTWANI R.Similarity Search in High Dimensions via Hashing[C]//Proceedings of the 25th International Conference on Very Large Data Bases.San Francisco,USA:Morgan Kaufmann,1999:518-529.
[7] JEGOU H,DOUZE M,SCHMID C.Product Quantization for Nearest Neighbor Search[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2010,33(01):117-128.
[8] GE T,HE K,KE Q,et al.Optimized Product Quantization[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2013,36(04):744-755.
[9] FU C,XIANG C,WANG C X,et al.Fast Approximate Nearest Neighbor Search with the Navigating Spreading-out Graph[DB/OL].[2024-03-28].https://arxiv.org/abs/1707.00143.
[10] MALKOV Y A,YASHUNIN D A.Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2018,42(04):824-836.
[11] NOROUZI M,PUNJANI A,FLEET D J.Fast Search in Hamming Space with Multi-index Hashing[C]//2012 IEEE Conference on Computer Vision and Pattern Recognition.Providence,RI,USA:IEEE,2012:3108-3115.
[12] DASGUPTA S,FREUND Y.Random Projection Trees and Low Dimensional Manifolds[C]//Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing.Victoria,British Columbia,Canada:Association for Computing Machinery,2008:537-546.
[13] MUJA M,LOWE D G.Scalable Nearest Neighbor Algorithms for High Dimensional Data[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2014,36(11):2227-2240.
[14] 潘正源,李樵,李月琳,等.智能信息检索研究范式的演进、反思与前瞻[J].图书馆论坛,2024,44(01):137-150.
[15] Facebook.Faiss[DB/OL].[2024-03-28].https://github.com/facebookresearch/faiss.
[16] WEI C,WU B,WANG S,et al.Analyticdb-v:a Hybrid Analytical Engine towards Query Fusion for Structured and Unstructured Data[J].Proceedings of the VLDB Endowment,2020,13(12):3152-3165.
[17] YANG W,LI T,FANG G,et al.Pase:Postgresql Ultra-high-dimensional Approximate Nearest Neighbor Search Extension[C]//Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data.Portland,OR,USA:Association for Computing Machinery,2020:2241-2253.
[18] 刘细文,孙蒙鸽,王茜,等.DIKIW逻辑链下GPT大模型对文献情报工作的潜在影响分析[J].图书情报工作,2023,67(21):3-12.
[19] 刘倩倩,刘圣婴,刘炜.图书情报领域大模型的应用模式和数据治理[J].图书馆杂志,2023,42(12):22-35.
[20] LI J,LIU H,GUI C,et al.The Design and Implementation of a Real Time Visual Search System on JD E-commerce Platform[C]//Proceedings of the 19th International Middleware Conference Industry.Rennes,France:Association for Computing Machinery,2018:9-16.
[21] VESPA.The Open Big Data Serving Engine[EB/OL].[2024-03-28].https://vespa.ai/.
[22] WEAVIATE.The AI Native Vector Database[EB/OL].[2024-03-28].https://weaviate.io/.
[23] QDRANT.High-performance,Massive-scale Vector Database for the Next Generation of AI[EB/OL].[2024-03-28].https://qdrant.tech/.
[24] PINECONE.Long-Term Memory for AI[EB/OL].[2024-03-28].https://www.pinecone.io/.
[25] WANG J,YI X,GUO R,et al.Milvus:A Purpose-built Vector Data Management System[C]//Proceedings of the 2021 International Conference on Management of Data.New York,United States:Association for Computing Machinery,2021:2614-2627.
[26] GUO R T,LUAN X F,XIANG L,et al.Manu:a Cloud Native Vector Database Management System[DB/OL].[2024-03-28].https://arxiv.org/abs/2206.13843.
[27] CHROMA.The AI-native Open-source Embedding Database[DB/OL].[2024-03-28].https://github.com/chroma-core/chroma.
[28] ROIE SCHWABER-COHEN.What is a Vector Database?[EB/OL].[2024-03-28].https://www.pinecone.io/learn/vector-database/.
[29] Amazon.什么是向量数据库?[EB/OL].[2024-03-28].https://aws.amazon.com/cn/what-is/vector-databases/.
[30] ElasticSearch.什么是向量嵌入?[EB/OL].[2024-03-28].https://www.elastic.co/cn/what-is/vector-embedding.
[31] LEONIE MONIGATTI.A Gentle Introduction to Vector Databases[EB/OL].[2024-03-28].https://weaviate.io/blog/what-is-a-vector-database.
[32] FRANK LIU.What is a Vector Database?[EB/OL].[2024-03-28].https://zilliz.com/learn/what-is-vector-database.
[33] Milvus.Milvus Documentation[EB/OL].[2024-03-28].https://milvus.io/docs.
[34] MYSCALE.MyScale Docs[EB/OL].[2024-03-28].https://myscale.com/docs/en/.
[35] 卢小宾,霍帆帆,王壮,等.数智时代的信息分析方法:数据驱动、知识驱动及融合驱动[J].中国图书馆学报,2024,50(01):29-44.
[36] VALD.A Highly Scalable Distributed Vector Search Engine[EB/OL].[2024-03-28].https://vald.vdaas.org/.
[37] DAN DASCALESCU.Vector Embeddings Explained[EB/OL].[2024-03-28].https://weaviate.io/blog/vector-embeddings-explained.
[38] LE Q,MIKOLOV T.Distributed Representations of Sentences and Documents[C]//Proceedings of the 31st International Conference on Machine Learning-Volume 32.Beijing,China:JMLR.org,2014:II-1188-II-1196.
[39] GROHE M.Word2vec,Node2vec,Graph2vec,X2vec:Towards a Theory of Vector Embeddings of Structured Data[C]//Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems.Portland,OR,USA:Association for Computing Machinery,2020:1-16.
[40] ZHANG Y,LI Y F,CUI L Y,et al.Siren's Song in the AI Ocean:A Survey on Hallucination in Large Language Models[DB/OL].[2024-03-28].https://arxiv.org/abs/2309.01219.
[41] 郭小平,秦艺轩.解构智能传播的数据神话:算法偏见的成因与风险治理路径[J].现代传播(中国传媒大学学报),2019,41(09):19-24.
[42] TAIPALUS T.Vector Database Management Systems:Fundamental Concepts,Use-cases,and Current Challenges[J].Cognitive Systems Research,2024:101216.
Digital Object Identifier (DOI)
10.19809/j.cnki.kjqbyj.2024.04.002
Recommended Citation
SUN, Yusheng and ZENG, Junhao (2024) "Research on Vector Database and Its Application," Journal of Scientific Information Research: Vol. 6: Iss. 4, Article 2.
DOI: 10.19809/j.cnki.kjqbyj.2024.04.002
Available at: https://eng.kjqbyj.com/journal/vol6/iss4/2