在前面搭建了Ollama之后,我们就可以在本地使用向量数据库,方便计算自然语言的相似度,包括在语义和BM2上。
通过Docker快速搭建Weaviate向量数据库,本身基于Golang编写,镜像只有160M。由于我只需要测试embedding,即把自然语言转换为向量(或者理解为数组),就只用了text2vec-ollama
模块。
version: '3'
services:
weaviate:
command:
- --host
- 0.0.0.0
- --port
- '8080'
- --scheme
- http
image: cr.weaviate.io/semitechnologies/weaviate:1.29.1
ports:
- 8080:8080
- 50051:50051
volumes:
- weaviate_data:/var/lib/weaviate
restart: on-failure:0
environment:
QUERY_DEFAULTS_LIMIT: 25
AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
ENABLE_API_BASED_MODULES: 'true'
ENABLE_MODULES: 'text2vec-ollama'
CLUSTER_HOSTNAME: 'node1'
volumes:
weaviate_data:
连接数据库
import weaviate
if client is not None and client.is_ready():
client.close()
client = weaviate.connect_to_local()
print(client.is_ready()) # Should print: `True`
插入数据
我这里使用了bge-m3
模型,因为需要支持中文。
from weaviate.classes.config import Configure
client.collections.delete(name="intent_collections")
intent_collections = client.collections.create(
name="intent_collections",
vectorizer_config=Configure.Vectorizer.text2vec_ollama( # Configure the Ollama embedding integration
api_endpoint="http://host.docker.internal:11434", # Allow Weaviate from within a Docker container to contact your Ollama instance
model="bge-m3", # The model to use
)
)
data = [
{
"text": "buy goods on Taobao"
},
{
"text": "buy goods on JD"
}
]
intent_collections = client.collections.get("intent_collections")
with intent_collections.batch.dynamic() as batch:
for d in data:
batch.add_object({
"text": d["text"]
})
if batch.number_errors > 10:
print("Batch import stopped due to excessive errors.")
break
failed_objects = intent_collections.batch.failed_objects
if failed_objects:
print(f"Number of failed imports: {len(failed_objects)}")
print(f"First failed object: {failed_objects[0]}")
查询相似度
相似度上有BM25和语义两个方面,可以通过alpha
控制权重,默认是BM25为0.7,语义是0.3。
import json
intent_collections = client.collections.get("intent_collections")
response = intent_collections.query.hybrid(
query="在淘宝买东西",
alpha=0.5,
limit=3,
return_metadata=['score', 'explain_score']
)
for obj in response.objects:
prop = obj.properties
prop.update({'score': obj.metadata.score})
print(json.dumps(prop, indent=2))
关闭连接
client.close()
通过API调用
curl http://localhost:8080/v1/schema
curl http://localhost:8080/v1/objects/
curl http://localhost:8080/v1/objects/Automation_collections/1aee807e-70a1-45e9-8a16-49228630c863