Study Log 211229: Understanding Elasticsearch Text Analyzers

2021-12-29 17:53 Author: mayoiwill

# 211229

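## Configuring built-in analyzers

- References
  - https://www.elastic.co/guide/en/elasticsearch/reference/current/configuring-analyzers.html
- Customizing a built-in analyzer
  - Difference between built-in and custom analyzers
    - A built-in analyzer is selected by setting `type` to its name
    - A custom analyzer always has `type` set to `custom`, followed by detailed definitions of the tokenizer and the other two components
    - A built-in analyzer only accepts its own specific parameters, such as `stopwords`; check the documentation for each one
    - A custom analyzer takes the same three fixed components: `char_filter`, `tokenizer` and `filter`
- The analyzer of a field already defined under `properties` of an existing index cannot be changed
  - Runtime fields can be changed
  - Otherwise, reindex, or delete the index and recreate it
- With `fields` (multi-fields / sub-fields), the same source field can be analyzed by different analyzers
  - `stopwords: _english_` means common English stop words (the, a, to, be, ...) are not indexed
  - The `to be or not to be` problem
    - Query the sub-field indexed with stop words removed first; if nothing comes back, fall back to the field indexed without stop word removal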
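
As a sketch of the points above, the following request closely follows the example in the linked documentation. The index name `my-index-000001`, the analyzer name `std_english` and the field name `my_text` are only illustrative. `my_text` is analyzed by the plain `standard` analyzer (stop words kept), while the sub-field `my_text.english` uses a `standard` analyzer configured with English stop words.

```
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_english": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "english": {
            "type": "text",
            "analyzer": "std_english"
          }
        }
      }
    }
  }
}
```

The fallback for `to be or not to be` could then look like the two searches below, assuming the application issues the second one only when the first returns no hits. Every word in that phrase is in the `_english_` stop word list, so the first query ends up with no terms to match.

```
GET my-index-000001/_search
{
  "query": { "match": { "my_text.english": "to be or not to be" } }
}

GET my-index-000001/_search
{
  "query": { "match": { "my_text": "to be or not to be" } }
}
```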

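## Configuring custom analyzers

- Built-in `char_filter`, `tokenizer` and token `filter` components can be used
- Each of these three can also be custom-defined
  - For example, create a custom `char_filter` named `emoticons`
    - Parameters: `type: mapping`; each entry in `mappings` has the form `xx => yyy`
  - Create a custom tokenizer
    - Parameters: `type: pattern`; `pattern` is a regular expression string such as `[ .,!?]`
  - Create a custom token filter
    - Parameters: `type: stop`, `stopwords: _english_`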
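
Put together, the three custom components above might be wired into one analyzer as follows. This is a sketch modelled on the example in the Elasticsearch documentation; the index name `my-custom-index` and the names `my_custom_analyzer`, `punctuation` and `english_stop` are illustrative, and the built-in `lowercase` filter is added as an assumption.

```
PUT my-custom-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "emoticons" ],
          "tokenizer": "punctuation",
          "filter": [ "lowercase", "english_stop" ]
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [ ":) => _happy_", ":( => _sad_" ]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my-custom-index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
```

With this definition the `_analyze` call should return the tokens `i'm`, `_happy_`, `person` and `you`: the emoticon is mapped first, the text is split on the pattern, and the stop filter drops `a` and `and`.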

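## Analyzer precedence

- Per-field setting -> index-level default -> `standard`
- Different analyzers can be set for index time and search time
  - The search analyzer is applied to the query
  - `search_analyzer`
- References
  - https://www.elastic.co/guide/en/elasticsearch/reference/current/specify-analyzer.html
- There are several ways to specify one
  - It can be specified in the query
  - The most common way is to specify it on the field in the mapping
- How to specify it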

```
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "whitespace",
        "search_analyzer": "simple"
      }
    }
  }
}
```
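
For the other two levels, a sketch along the lines of the linked documentation: the analyzer names `default` and `default_search` in the index settings act as index-level defaults, and a `match` query accepts an `analyzer` parameter that applies to that query only. The index name `my-index-000002` and the field name `message` are illustrative.

```
PUT my-index-000002
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": { "type": "simple" },
        "default_search": { "type": "whitespace" }
      }
    }
  }
}

GET my-index-000002/_search
{
  "query": {
    "match": {
      "message": {
        "query": "Quick foxes",
        "analyzer": "stop"
      }
    }
  }
}
```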

## Overview of built-in analyzers

- There are many; only two are covered here, `fingerprint` and `standard`
- `fingerprint`
  - Based on the fingerprinting algorithm
    - https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint
  - Test

```
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
```

- Composition of `fingerprint`
  - Standard tokenizer
  - Token filters, applied in this order
    - Lower Case Token Filter
    - ASCII folding
    - Stop Token Filter (disabled by default)
    - Fingerprint
  - The key piece is really the Fingerprint token filter
- `standard`
  - Definition
    - Standard tokenizer
    - Lower Case Token Filter
    - Stop Token Filter (disabled by default)
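
To see that the Fingerprint token filter does the essential work, the composition listed above can be spelled out directly in an `_analyze` request, with no index and no named analyzer. This is a sketch that leaves out the stop filter, since it is disabled by default.

```
POST _analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase", "asciifolding", "fingerprint" ],
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
```

Like the built-in `fingerprint` analyzer, this should produce the single token `and consistent godel is said sentence this yes`: the tokens are lowercased, folded to ASCII, sorted, de-duplicated and joined.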

## Custom analyzers

- Choose a tokenizer
  - Usually `standard`
  - For Chinese: ICU Tokenizer
- Choose some token filters
  - Lower Case Token Filter
  - ASCII folding
  - Synonyms, etc.
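
A minimal sketch of those two choices, assuming an index named `custom_sample` and an analyzer named `my_folding_analyzer` (both names are illustrative): the `standard` tokenizer combined with the built-in `lowercase` and `asciifolding` token filters.

```
PUT custom_sample
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_folding_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}

GET custom_sample/_analyze
{
  "analyzer": "my_folding_analyzer",
  "text": "Déjà Vu CAFÉ"
}
```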

### icu_tokenizer

- Reference: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu-tokenizer.html
- Uses a dictionary-based approach
- For installation, see the second question in the Q&A section (installation on k8s)
- Test

```
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}

GET icu_sample/_analyze
{
  "analyzer": "my_icu_analyzer",
  "text": "南京长江大桥"
}
```

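- To create a custom analyzer, it has to be defined under an index that you create
  - Test it with `/<index name>/_analyze`
  - Specify it as `"analyzer": "<custom analyzer name>"`
  - The index does not need to contain any fields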
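
For contrast, the same text can be run through the built-in `standard` tokenizer, which needs no plugin and no index. This is only a comparison sketch; the `standard` tokenizer is expected to emit each Chinese character as its own token rather than segmenting by dictionary words.

```
POST _analyze
{
  "tokenizer": "standard",
  "text": "南京长江大桥"
}
```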

### Q&A

- Q: `failed to find tokenizer under name [icu_tokenizer]`
  - A: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu.html
    - `sudo bin/elasticsearch-plugin install analysis-icu`
    - The plugin must be installed on every node, and each node must be restarted afterwards
      - https://www.elastic.co/guide/en/elasticsearch/reference/current/restart-cluster.html
- Q: The method above does not work
  - A: Elasticsearch installed on k8s needs a different approach
    - https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-bundles-plugins.html
    - It describes two methods: a custom image, or an `initContainers` section in the k8s manifest
      - I use the second one
      - The command that actually runs becomes `bin/elasticsearch-plugin install analysis-icu`
    - It also mentions adding a custom synonym file
      - This will be needed later
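
Whichever installation route is used, one way to check the result is to list the plugins loaded on each node. This is a sketch; `analysis-icu` should show up once for every node after the restart.

```
GET _cat/plugins?v=true
```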

## token filter

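- This is one of the core features of Elasticsearch
- There are many built-in filters; the ones mentioned earlier
  - Lowercase
  - ASCII folding
- Below I look at a few that interest me
  - For the rest, please check the documentation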

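### Snowball and Stemmer

- Stemming
- Differences
  - `snowball` uses the Snowball stemmers
  - `stemmer` uses one of the algorithmic stemmers described below (for English, its default is the Porter stemmer)
    - https://www.elastic.co/guide/en/elasticsearch/reference/current/stemming.html#algorithmic-stemmers
  - Generally speaking, `snowball` gives slightly better results
    - Though better results usually mean slightly worse performance?
- Test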

```
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "snowball" ],
  "text": "the foxes jumping quickly"
}
```

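- `filter` can be set to `snowball` or `stemmer`
- Difference in the results
  - `snowball`: quickly -> quick
  - `stemmer`: quickly -> quickli
- Another thing learned here: a filter can be tested directly with `_analyze`, without defining any index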
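
The `stemmer` filter also accepts a `language` parameter, and `_analyze` allows a filter to be defined inline. The sketch below picks `porter2`, which the documentation lists as the Snowball-based English stemmer, to compare against the default `stemmer` behaviour shown above.

```
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "stemmer", "language": "porter2" }
  ],
  "text": "the foxes jumping quickly"
}
```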

### Synonyms (synonym)

- A synonym dictionary has to be provided
- The dictionary file supports two formats: Solr and WordNet
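
For small dictionaries the rules can also be given inline in Solr format instead of in a file. A minimal sketch, where the index name `synonym_sample`, the filter name `my_synonyms`, the analyzer name `my_synonym_analyzer` and the rules themselves are all made up for illustration:

```
PUT synonym_sample
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_synonyms": {
            "type": "synonym",
            "synonyms": [ "quick, fast", "laptop, notebook" ]
          }
        },
        "analyzer": {
          "my_synonym_analyzer": {
            "tokenizer": "standard",
            "filter": [ "lowercase", "my_synonyms" ]
          }
        }
      }
    }
  }
}

GET synonym_sample/_analyze
{
  "analyzer": "my_synonym_analyzer",
  "text": "a quick laptop"
}
```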

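## Custom synonym replacement

- Download an English synonym dictionary in WordNet 3.0 format
  - https://github.com/buildbreakdo/elasticsearch-wordnet-synonyms
  - https://github.com/buildbreakdo/elasticsearch-wordnet-synonyms/raw/master/synonyms.json
- Uploading the dictionary file through a k8s ConfigMap does not work
  - Reference
    - https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/
  - `kubectl create configmap synonyms --from-file=synonym`
  - Problem: the file is too large to upload
    - `Request entity too large`
- Use `initContainers` to copy the file over instead
  - Ran into GitHub being unreachable over the network
  - Gitee (in China) cannot be fetched directly with curl; it requires a login
  - Plan: set up an internal nginx backed by a PVC. TODO: continue tomorrow
- Configure Elasticsearch on k8s to use the dictionary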
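
Once the file is in place on every node, the last step might look like the sketch below. It assumes the dictionary has been copied to `analysis/synonyms.txt` under the Elasticsearch config directory (a hypothetical path) and that it really is in WordNet format; for a Solr-format file the `format` line should be dropped. The index, filter and analyzer names are illustrative.

```
PUT wordnet_sample
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "wordnet_synonyms": {
            "type": "synonym",
            "format": "wordnet",
            "synonyms_path": "analysis/synonyms.txt"
          }
        },
        "analyzer": {
          "my_wordnet_analyzer": {
            "tokenizer": "standard",
            "filter": [ "lowercase", "wordnet_synonyms" ]
          }
        }
      }
    }
  }
}
```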

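## Applying analyzers to an index

==========

Today did not go very smoothly; I got stuck on uploading the synonym file.

Still, it taught me how to use initContainers.

To be continued tomorrow.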
