向量数据库Weaviate

在前面搭建了Ollama之后，我们就可以在本地使用向量数据库，方便计算自然语言的相似度，包括在语义和BM2上。

通过Docker快速搭建Weaviate向量数据库，本身基于Golang编写，镜像只有160M。由于我只需要测试embedding，即把自然语言转换为向量(或者理解为数组)，就只用了text2vec-ollama模块。

version: '3'

services:
  weaviate:
    command:
    - --host
    - 0.0.0.0
    - --port
    - '8080'
    - --scheme
    - http
    image: cr.weaviate.io/semitechnologies/weaviate:1.29.1
    ports:
    - 8080:8080
    - 50051:50051
    volumes:
    - weaviate_data:/var/lib/weaviate
    restart: on-failure:0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      ENABLE_API_BASED_MODULES: 'true'
      ENABLE_MODULES: 'text2vec-ollama'
      CLUSTER_HOSTNAME: 'node1'


volumes:
  weaviate_data:

连接数据库

import weaviate

if client is not None and client.is_ready():
  client.close()

client = weaviate.connect_to_local()

print(client.is_ready())  # Should print: `True`

插入数据

我这里使用了bge-m3模型，因为需要支持中文。

from weaviate.classes.config import Configure

client.collections.delete(name="intent_collections")

intent_collections = client.collections.create(
    name="intent_collections",
    vectorizer_config=Configure.Vectorizer.text2vec_ollama(     # Configure the Ollama embedding integration
        api_endpoint="http://host.docker.internal:11434",       # Allow Weaviate from within a Docker container to contact your Ollama instance
        model="bge-m3",                               # The model to use
    )
)

data = [
    {
        "text": "buy goods on Taobao"
    },
    {
        "text": "buy goods on JD"
    }
]

intent_collections = client.collections.get("intent_collections")

with intent_collections.batch.dynamic() as batch:
    for d in data:
        batch.add_object({
            "text": d["text"]
        })
        if batch.number_errors > 10:
            print("Batch import stopped due to excessive errors.")
            break

failed_objects = intent_collections.batch.failed_objects
if failed_objects:
    print(f"Number of failed imports: {len(failed_objects)}")
    print(f"First failed object: {failed_objects[0]}")

查询相似度

相似度上有BM25和语义两个方面，可以通过alpha控制权重，默认是BM25为0.7，语义是0.3。

import json

intent_collections = client.collections.get("intent_collections")

response = intent_collections.query.hybrid(
    query="在淘宝买东西",
    alpha=0.5,
    limit=3,
    return_metadata=['score', 'explain_score']
)

for obj in response.objects:
    prop = obj.properties
    prop.update({'score': obj.metadata.score})
    print(json.dumps(prop, indent=2))

关闭连接

client.close()

通过API调用

curl http://localhost:8080/v1/schema
curl http://localhost:8080/v1/objects/
curl http://localhost:8080/v1/objects/Automation_collections/1aee807e-70a1-45e9-8a16-49228630c863

Reference

https://weaviate.io/developers/weaviate/quickstart/local