Postgres中的full text search

Database and Ruby, Python, History


文章源于 https://xata.io/blog/postgres-full-text-search-engine

下载数据文件

https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots

创建表

CREATE TABLE movies(
  ReleaseYear int,
  Title text,
  Origin text,
  Director text,
  Casting text,
  Genre text,
  WikiPage text,
  Plot text);

导入数据

\COPY movies(ReleaseYear, Title, Origin, Director, Casting, Genre, WikiPage, Plot)
  FROM 'wiki_movie_plots_deduped.csv' DELIMITER ',' CSV HEADER;

tsvector

用来存储词素向量的排序列表,这里的词素已经被规范化了,比如 refuse 和 refusing 都被转成 refus。weights 是权重,从 A 到 D,默认是 D,权重最低。

 SELECT * FROM unnest(to_tsvector('english', 'I''m going to make him an offer he can''t refuse. Refusing is not an option.'));
 lexeme | positions | weights
--------+-----------+---------
 go     | {3}       | {D}
 m      | {2}       | {D}
 make   | {5}       | {D}
 offer  | {8}       | {D}
 option | {17}      | {D}
 refus  | {12,13}   | {D,D}
(6 rows)

tsquery

规范化查询。通过构建规范化查询,我们就可以进行词素查询了。

SELECT websearch_to_tsquery('english', 'the darth vader');
 websearch_to_tsquery
----------------------
 'darth' & 'vader'
(1 row)

SELECT websearch_to_tsquery('english', 'darth OR vader');
 websearch_to_tsquery
----------------------
 'darth' | 'vader'

SELECT websearch_to_tsquery('english', 'darth vader -wars');
   websearch_to_tsquery
---------------------------
 'darth' & 'vader' & !'war'

SELECT websearch_to_tsquery('english', '"the darth vader son"');
     websearch_to_tsquery
------------------------------
 'darth' <-> 'vader' <-> 'son'

查询

SELECT websearch_to_tsquery('english', 'darth vader') @@
        to_tsvector('english',
                'Darth Vader is my father.');

?column?
----------
 t

构建 GIN

GIN 是广义倒排索引。数据库中的一种索引结构,用于加速文本搜索和全文搜索。GIN 索引通常用于 PostgreSQL 等数据库管理系统中,它允许在文本列中快速查找包含特定词汇或短语的行。

首先需要创建 tsvector 列,然后再创建索引.

ALTER TABLE movies ADD search tsvector GENERATED ALWAYS AS
	(to_tsvector('english', Title) || ' ' ||
   to_tsvector('english', Plot) || ' ' ||
   to_tsvector('simple', Director) || ' ' ||
	 to_tsvector('simple', Genre) || ' ' ||
   to_tsvector('simple', Origin)
) STORED;

CREATE INDEX idx_search ON movies USING GIN(search);

测试一下

SELECT title FROM movies WHERE search @@ websearch_to_tsquery('english','darth vader');
                      title
--------------------------------------------------
 Star Wars Episode IV: A New Hope (aka Star Wars)
 The Empire Strikes Back
 Return of the Jedi
 Meet the Fockers
 Star Wars: Episode III  Revenge of the Sith
 Rogue One: A Star Wars Story (film)
 Star Wars: The Force Unleashed
 Star Wars: The Force Unleashed II
 American Honey
(9 rows)

排序

Postgres 有 ts_rank 函数,可以查看查询结果的相关性。

SELECT title, ts_rank(search, websearch_to_tsquery('english', 'darth vader')) rank FROM movies WHERE search @@ websearch_to_tsquery('english','darth vader') order by rank desc;
                      title                       |    rank
--------------------------------------------------+-------------
 Star Wars: The Force Unleashed                   |  0.39236963
 Star Wars: The Force Unleashed II                |  0.28135812
 The Empire Strikes Back                          |  0.26263964
 Star Wars Episode IV: A New Hope (aka Star Wars) |  0.18902963
 Star Wars: Episode III  Revenge of the Sith     |  0.10292397
 Rogue One: A Star Wars Story (film)              |  0.10049681
 Return of the Jedi                               |  0.09910346
 American Honey                                   |  0.09910322
 Meet the Fockers                                 | 0.098500855
(9 rows)

相关性的调整

有时候,title 会比较重要,又或者 votes 比较重要,我们都需要修改 rank 的结果,用来提升排名。比如,可以通过对数来提升 votes 的权重。

SELECT title,
  ts_rank(search, websearch_to_tsquery('english', 'jedi')) + log(votes)*0.01
 FROM movies
 WHERE search @@ websearch_to_tsquery('english','jedi')
 ORDER BY rank DESC LIMIT 10;

或者我们直接提升某一列的权重,比如 title。

ALTER TABLE movies ADD search tsvector GENERATED ALWAYS AS
   (setweight(to_tsvector('english', Title), 'A') || ' ' ||
   to_tsvector('english', Plot) || ' ' ||
   to_tsvector('simple', Director) || ' ' ||
   to_tsvector('simple', Genre) || ' ' ||
   to_tsvector('simple', Origin) || ' ' ||
   to_tsvector('simple', Casting)
) STORED;

模糊查询/拼写错误

Postgres 在使用 tsvector 和 tsquery 的时候,不支持模糊查询或者拼写错误。但是拼写错误部分我们可以这样实现。

  • 先索引所有的词素
  • 通过相似性查询所有的单词
  • 对找到的单词,再进行查询
CREATE MATERIALIZED VIEW unique_lexeme AS
   SELECT word FROM ts_stat('SELECT search FROM movies');

SELECT * FROM unique_lexeme
   WHERE levenshtein_less_equal(word, 'pregant', 2) < 2;

   word
----------
 premant
 pregrant
 pregnant
 paegant