Getting Started with Elasticsearch: Basic Concepts and the IK Chinese Analyzer

2020/4/8 17:01:39

Overview

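Elasticsearch (abbreviated as Elastic below) is open source and currently a leading choice among full-text search engines. It stores, searches, and analyzes large volumes of data quickly; Wikipedia, Stack Overflow, and GitHub all use it.

Under the hood, Elastic is built on the open-source library Lucene. Lucene is not usable on its own: you have to write code that calls its API. Elastic wraps Lucene and exposes a REST API, so it works out of the box.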

Installing and configuring ES

Install ES on MacOS with Homebrew

Install Elasticsearch with Homebrew:

brew tap elastic/tap

brew install elastic/tap/elasticsearch-full

Type	Description	Default Location	Setting
home	Elasticsearch home directory or $ES_HOME	/usr/local/var/homebrew/linked/elasticsearch-full
conf	Configuration files including elasticsearch.yml	/usr/local/etc/elasticsearch	ES_PATH_CONF
data		/usr/local/var/lib/elasticsearch	path.data
logs		/usr/local/var/log/elasticsearch	path.logs
plugins		/usr/local/var/homebrew/linked/elasticsearch/plugins

At this point the ES test environment is in place. Before starting real development, work through the following additional steps:

Learn how to configure Elasticsearch
Configure important Elasticsearch settings
Configure important system settings

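Next, run ES briefly to verify that the setup works.

Run elasticsearch in a terminal to start ES. If startup fails with the error max virtual memory areas vm.max_map_count [65530] is too low (a Linux kernel limit), raise the limit with the following command: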

sudo sysctl -w vm.max_map_count=262144

If everything is fine, Elastic listens on the default port 9200. Open another terminal window and request that port to get a short description of the node:

ZBMAC-b286c5fb6:~ liubaoshuai1$ curl localhost:9200
{
  "name" : "ZBMAC-b286c5fb6",
  "cluster_name" : "elasticsearch_liubaoshuai1",
  "cluster_uuid" : "v6A5SuX2RI2Clgs30qhq7g",
  "version" : {
    "number" : "7.6.2",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "ef48eb35cf30adf4db14086e8aabd07ef6fb113f",
    "build_date" : "2020-03-26T06:34:37.794943Z",
    "build_snapshot" : false,
    "lucene_version" : "8.4.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

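In the output above, the request to port 9200 returns a JSON object with information about the current node, cluster, and version.
Press Ctrl + C to stop Elastic.
By default, Elastic only accepts connections from the local machine. To allow remote access, edit the config/elasticsearch.yml file under the Elastic installation directory, uncomment network.host, set its value to 0.0.0.0, and restart Elastic. The value 0.0.0.0 lets anyone connect, so do not use it for a production service; set a specific IP there instead.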

network.host: 0.0.0.0

The cat API gives developers a quick way to look up information about Elasticsearch.

Requesting _cat lists the supported commands:

ZBMAC-b286c5fb6:~ liubaoshuai1$ curl localhost:9200/_cat
=^.^=
/_cat/allocation
/_cat/shards
/_cat/shards/{index}
/_cat/master
/_cat/nodes
/_cat/tasks
/_cat/indices
/_cat/indices/{index}
/_cat/segments
/_cat/segments/{index}
/_cat/count
/_cat/count/{index}
/_cat/recovery
/_cat/recovery/{index}
/_cat/health
/_cat/pending_tasks
/_cat/aliases
/_cat/aliases/{alias}
/_cat/thread_pool
/_cat/thread_pool/{thread_pools}
/_cat/plugins
/_cat/fielddata
/_cat/fielddata/{fields}
/_cat/nodeattrs
/_cat/repositories
/_cat/snapshots/{repository}
/_cat/templates

Every command accepts the ?v parameter, which turns on verbose output with column headers:

ZBMAC-b286c5fb6:~ liubaoshuai1$ curl localhost:9200/_cat/master?v
id                     host      ip        node
KqsnRZTeRbeKimt8XfGAoQ 127.0.0.1 127.0.0.1 ZBMAC-b286c5fb6

Every command accepts the help parameter, which lists the columns it can output:

ZBMAC-b286c5fb6:~ liubaoshuai1$ curl localhost:9200/_cat/master?help
id   |   | node id    
host | h | host name  
ip   |   | ip address 
node | n | node name 

The h parameter selects which columns to output:

ZBMAC-b286c5fb6:~ liubaoshuai1$ curl localhost:9200/_cat/master?v
id                     host      ip        node
KqsnRZTeRbeKimt8XfGAoQ 127.0.0.1 127.0.0.1 ZBMAC-b286c5fb6
ZBMAC-b286c5fb6:~ liubaoshuai1$ curl localhost:9200/_cat/master?h=host,ip,node
127.0.0.1 127.0.0.1 ZBMAC-b286c5fb6
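
The v, help, and h parameters shown above are ordinary query-string parameters, so the URLs can be assembled programmatically. The following is a minimal Python sketch using only the standard library; cat_url and its argument names are hypothetical helpers introduced for illustration, and the host defaults to the local node used in this article.

```python
from urllib.parse import urlencode


def cat_url(endpoint, host="localhost:9200", verbose=False, columns=None, show_help=False):
    """Build a _cat URL; endpoint is a name such as "master" or "indices"."""
    params = {}
    if verbose:
        params["v"] = "true"          # same effect as ?v
    if show_help:
        params["help"] = "true"       # same effect as ?help
    if columns:
        params["h"] = ",".join(columns)
    # Keep commas readable instead of percent-encoding them.
    query = urlencode(params, safe=",")
    url = f"http://{host}/_cat/{endpoint}"
    return f"{url}?{query}" if query else url


print(cat_url("master", verbose=True))
print(cat_url("master", columns=["host", "ip", "node"]))
```

The resulting strings can be passed to curl or to any HTTP client.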

Number formatting: many commands can return sizes as human-readable numbers, for example in mb or kb.

ZBMAC-b286c5fb6:~ liubaoshuai1$ curl localhost:9200/_cat/indices?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size

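ES basic concepts

Node and Cluster

Elastic is essentially a distributed database: multiple servers work together, and each server can run multiple Elastic instances.

A single Elastic instance is called a node. A group of nodes forms a cluster.

Index

Elastic indexes every field and, after processing, writes it into an inverted index. Lookups go straight to that index.

That is why the top-level unit of data management in Elastic is called an Index. It is a synonym for a single database. The name of every Index (that is, every database) must be lowercase.

The following command lists all Indexes on the current node.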

$ curl -X GET 'http://localhost:9200/_cat/indices?v'

Document

A single record inside an Index is called a Document. Many Documents make up an Index.

A Document is expressed in JSON. Here is an example.

{
  "user": "张三",
  "title": "工程师",
  "desc": "数据库管理"
}

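Documents in the same Index are not required to share the same structure (schema), but keeping it consistent helps search efficiency.

Type

Documents can be grouped. In a weather Index, for example, they can be grouped by city (Beijing and Shanghai) or by climate (sunny and rainy). Such a group is called a Type: a virtual, logical grouping used to filter Documents.

Different Types should have similar structures (schemas). For example, the id field cannot be a string in one group and a number in another. This is one difference from tables in a relational database. Data of a completely different nature (such as products and logs) should be stored as two Indexes rather than as two Types in one Index (even though that is possible).

The following command lists the Types contained in each Index.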

$ curl 'localhost:9200/_mapping?pretty=true'

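According to the roadmap, Elastic 6.x allows only one Type per Index, 7.x deprecates Types in the APIs, and 8.x removes them entirely.

Creating and deleting an Index

To create an Index, send a PUT request directly to the Elastic server. The following example creates an Index named weather.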

$ curl -X PUT 'localhost:9200/weather'

The server returns a JSON object; its acknowledged field indicates that the operation succeeded.

{
  "acknowledged":true,
  "shards_acknowledged":true
}

Then send a DELETE request to delete the Index.

$ curl -X DELETE 'localhost:9200/weather'

{
   "acknowledged":true
}

Data operations

FAQ

If a data operation fails with the error below, the official explanation quoted underneath describes the fix.

elasticsearch6.x {"error":"Content-Type header [application/x-www-form-urlencoded] is not supported"

The Elasticsearch engineering team is busy working on features for Elasticsearch 6.0. One of the changes that is coming in Elasticsearch 6.0 is strict content-type checking.

Starting from Elasticsearch 6.0, all REST requests that include a body must also provide the correct content-type for that body. --- elasticsearch6.x {"error":"Content-Type header [application/x-www-form-urlencoded] is not supported" - official fix

Since ES 6.0 enforces strict content-type checking, the command line needs the -H 'Content-Type: application/json' option. For example, the command

curl -X PUT 'localhost:9200/accounts/person/1' -d '
{
  //...
}' 

has to be updated to

curl -H'Content-Type: application/json' -X PUT 'localhost:9200/accounts/person/1' -d '
{
  //...
}' 
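
The same rule applies to any HTTP client, not only curl. As a rough illustration, the sketch below builds (but does not send) an equivalent request with Python's standard library; json_request is a hypothetical helper name, and the URL and document are the ones used in this article.

```python
import json
from urllib.request import Request


def json_request(method, url, body):
    """Build a request whose body is JSON and whose Content-Type says so."""
    data = json.dumps(body, ensure_ascii=False).encode("utf-8")
    return Request(
        url,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


req = json_request(
    "PUT",
    "http://localhost:9200/accounts/person/1",
    {"user": "张三", "title": "工程师", "desc": "数据库管理"},
)
# urllib stores header names capitalized, hence "Content-type" here.
print(req.get_method(), req.full_url, req.get_header("Content-type"))
# To actually send it against a running node: urllib.request.urlopen(req)
```

Without the explicit header, urllib would fall back to application/x-www-form-urlencoded for a request with a body, which is exactly the content type the error message rejects.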

Adding a record

Sending a PUT request under a given /Index/Type adds a record to the Index. For example, a request under /accounts/person adds a person record.

ZBMAC-b286c5fb6:~ liubaoshuai1$ curl -H "Content-Type: application/json" -X PUT 'localhost:9200/accounts/person/1' -d '
> {
>   "user": "张三",
>   "title": "工程师",
>   "desc": "数据库管理"
> }'

The JSON object returned by the server contains the Index, Type, Id, Version, and other information.

{           
    "_index":"accounts",
    "_type":"person",
    "_id":"1",
    "_version":1,
    "result":"created",
    "_shards":{"total":2,"successful":1,"failed":0},
    "_seq_no":0,
    "_primary_term":1
 }

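Looking closely, the request path is /accounts/person/1; the trailing 1 is the record's Id. It does not have to be a number: any string (such as abc) works.

A record can also be added without specifying an Id, in which case the request must be a POST.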

$ curl -H "Content-Type: application/json" -X POST 'localhost:9200/accounts/person' -d '
{
  "user": "李四",
  "title": "工程师",
  "desc": "系统管理"
}'

In this case, the _id field in the returned JSON object is a random string.

{
  "_index":"accounts",
  "_type":"person",
  "_id":"A9QxNXEBLRNcEQlvnE1_",
  "_version":1,
  "result":"created",
  "_shards":{"total":2,"successful":1,"failed":0},
  "_seq_no":2,
  "_primary_term":2
}

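Viewing a record

Send a GET request to /Index/Type/Id to view the record.

The URL parameter pretty=true asks for the result in an easy-to-read format.

In the returned data, the found field indicates that the lookup succeeded, and the _source field holds the original record.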

$ curl 'localhost:9200/accounts/person/1?pretty=true'

{
  "_index" : "accounts",
  "_type" : "person",
  "_id" : "1",
  "_version" : 2,
  "_seq_no" : 1,
  "_primary_term" : 2,
  "found" : true,
  "_source" : {
    "user" : "张三",
    "title" : "工程师",
    "desc" : "数据库管理"
  }
}

Deleting a record

To delete a record, send a DELETE request.

$ curl -X DELETE 'localhost:9200/accounts/person/1'

Updating a record

To update a record, send the data again with a PUT request.

$ curl -H "Content-Type: application/json" -X PUT 'localhost:9200/accounts/person/1' -d '
{
    "user" : "张三",
    "title" : "工程师",
    "desc" : "数据库管理，软件开发"
}' 

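In the returned JSON object, the record's Id is unchanged, but the version (_version) has changed and the operation type (result) has gone from created to updated, because this time the record was not newly created.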

{
  "_index":"accounts",
  "_type":"person",
  "_id":"1",
  "_version":3,
  "result":"updated",
  "_shards":{"total":2,"successful":1,"failed":0},
  "_seq_no":3,
  "_primary_term":2
}

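In addition, to pretty-print the returned JSON object, follow these steps:

Install json globally with NPM: npm install -g json
Append | json to the commands above
To hide curl's transfer statistics, add the -s option (reference: Curl不显示统计信息% Total % Received % | 简书)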

$ curl -H "Content-Type: application/json" -X PUT 'localhost:9200/accounts/person/1' -d '
{
    "user" : "张三",
    "title" : "工程师",
    "desc" : "数据库管理，软件开发"
}'  |json

Querying data

Returning all records

A GET request to /Index/Type/_search returns all records.

$ curl 'localhost:9200/accounts/person/_search'

{
    "took": 629,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 2,
            "relation": "eq"
        },
        "max_score": 1,
        "hits": [
            {
                "_index": "accounts",
                "_type": "person",
                "_id": "A9QxNXEBLRNcEQlvnE1_",
                "_score": 1,
                "_source": {
                    "user": "李四",
                    "title": "工程师",
                    "desc": "系统管理"
                }
            },
            {
                "_index": "accounts",
                "_type": "person",
                "_id": "1",
                "_score": 1,
                "_source": {
                    "user": "张三",
                    "title": "工程师",
                    "desc": "数据库管理，软件开发"
                }
            }
        ]
    }
}

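In the result above, the took field is the time the operation took (in milliseconds), timed_out indicates whether it timed out, and hits holds the matching records. Its subfields mean the following:

total: the number of matching records
max_score: the highest relevance score
hits: an array of the returned records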
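
To show how these fields are typically read by a client, here is a small Python sketch. summarize_hits is a hypothetical helper, and the sample dictionary is the response above trimmed to the fields the helper touches.

```python
def summarize_hits(response):
    """Return (match count, best score, list of document ids) from a search response."""
    hits = response["hits"]
    total = hits["total"]
    # 7.x reports total as {"value": N, "relation": ...}; older versions used a plain number.
    count = total["value"] if isinstance(total, dict) else total
    ids = [hit["_id"] for hit in hits["hits"]]
    return count, hits["max_score"], ids


# Trimmed copy of the response shown above.
sample = {
    "took": 629,
    "timed_out": False,
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "max_score": 1,
        "hits": [
            {"_id": "A9QxNXEBLRNcEQlvnE1_", "_score": 1, "_source": {"user": "李四"}},
            {"_id": "1", "_score": 1, "_source": {"user": "张三"}},
        ],
    },
}

print(summarize_hits(sample))  # (2, 1, ['A9QxNXEBLRNcEQlvnE1_', '1'])
```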

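Full-text search

Elastic queries are unusual: they use their own query syntax and expect the GET request to carry a body. (With curl, adding -d without -X sends the request as POST, which Elastic accepts as well.)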

$ curl -H "Content-Type: application/json" 'localhost:9200/accounts/person/_search'  -d '
{
  "query" : { "match" : { "desc" : "软件" }}
}'

The code above uses a match query; the condition is that the desc field contains the word "软件". The result is as follows.

{
    "took": 33,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.1978253,
        "hits": [
            {
                "_index": "accounts",
                "_type": "person",
                "_id": "1",
                "_score": 1.1978253,
                "_source": {
                    "user": "张三",
                    "title": "工程师",
                    "desc": "数据库管理，软件开发"
                }
            }
        ]
    }
}

By default Elastic returns 10 results at a time; the size field changes this, and the from field sets the offset.

$ curl -H "Content-Type: application/json" 'localhost:9200/accounts/person/_search'  -d '
{
  "query" : { "match" : { "desc" : "管理" }},
  "size": 20,
  "from": 1
}'

Here size asks for 20 results and from starts at position 1 (the default is position 0).
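
Paging through results comes down to computing from out of a page number and a page size. The sketch below shows one way to build such a request body in Python; match_page and its 1-based page convention are assumptions made for illustration, not part of the Elastic API.

```python
def match_page(field, text, page, page_size=10):
    """Build a match query body for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {
        "query": {"match": {field: text}},
        "from": (page - 1) * page_size,  # number of results to skip
        "size": page_size,               # number of results to return
    }


print(match_page("desc", "管理", page=2, page_size=20))
```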

Logical operators

If there are multiple search keywords, Elastic treats them as an or relationship.

$ curl -H "Content-Type: application/json"  'localhost:9200/accounts/person/_search'  -d '
{
  "query" : { "match" : { "desc" : "软件 系统" }}
}'

The code above searches for 软件 or 系统.

To run an and search over multiple keywords, use a bool query.

$ curl -H "Content-Type: application/json" 'localhost:9200/accounts/person/_search'  -d '
{
  "query": {
    "bool": {
      "must": [
        { "match": { "desc": "软件" } },
        { "match": { "desc": "系统" } }
      ]
    }
  }
}'
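
When the keyword list is not fixed, the bool body above can be generated. The following Python sketch does that; match_all_terms is a hypothetical helper name, and the field and keywords are the ones from the example.

```python
import json


def match_all_terms(field, terms):
    """Build a bool query in which every term must match the given field."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: term}} for term in terms]
            }
        }
    }


body = match_all_terms("desc", ["软件", "系统"])
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON is the same body as in the curl command above and can be passed to it with -d.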

Chinese word segmentation setup

ElasticSearch中文分词 | 简书

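Word segmentation basics

When a document is stored, ES uses an analyzer to extract tokens from it, which support index storage and search. ES ships with many built-in analyzers, but they handle Chinese poorly. The examples below show how a built-in analyzer behaves. From a client, issue the following REST request to analyze an English sentence: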

curl -H "Content-Type: application/json" -X POST 'localhost:9200/_analyze' -d '
{  
    "text": "hello world"  
}' | json

// | json pretty-prints the JSON response

The result is as follows:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   213  100   179  100    34  44750   8500 --:--:-- --:--:-- --:--:-- 53250
{
  "tokens": [
    {
      "token": "hello",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "world",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

The result shows that "hello world" is split into two words. English is naturally delimited by spaces, so splitting on spaces works without any problem.
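
The token, offset, and position fields in that response can be reproduced with a few lines of Python. This is a deliberately simplified stand-in that splits on whitespace only; the real standard analyzer follows Unicode text segmentation rules and also lowercases tokens, and whitespace_tokens is a hypothetical name.

```python
import re


def whitespace_tokens(text):
    """Split on whitespace and record offsets the way the _analyze output does."""
    return [
        {
            "token": match.group(),
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index just past the last character
            "position": position,           # ordinal of the token in the text
        }
        for position, match in enumerate(re.finditer(r"\S+", text))
    ]


for token in whitespace_tokens("hello world"):
    print(token)
```

For "hello world" the offsets come out as 0 to 5 and 6 to 11, matching the response above.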

Now consider a Chinese sentence. The REST request is as follows:

curl -H "Content-Type: application/json" -X POST 'localhost:9200/_analyze' -d '
{  
    "text": "我爱编程"  
}' | json


After the request succeeds, the response looks like this:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   383  100   348  100    35  69600   7000 --:--:-- --:--:-- --:--:-- 76600
{
  "tokens": [
    {
      "token": "我",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "爱",
      "start_offset": 1,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "编",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "程",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    }
  ]
}

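The result shows that this analyzer splits every Chinese character into its own token, which is of little use for Chinese word segmentation, so the default ES analyzer has a problem with Chinese. Fortunately there are many good third-party Chinese analyzers that integrate well with ES. In ES every analyzer, built-in or third-party, has a name. The default operation above actually uses the analyzer named standard. The following request is equivalent to the previous one: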

curl -H "Content-Type: application/json" -X POST 'localhost:9200/_analyze' -d '
{  
    "analyzer": "standard",
    "text": "我爱编程"  
}' | json


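To use a different analyzer, set the "analyzer" field to that analyzer's name.

ES supports third-party analyzers through plugins. Among third-party Chinese analyzers, the commonly used ones are smartcn, from ICTCLAS of the Chinese Academy of Sciences, and IKAnalyzer.

The rest of this section introduces and uses the IKAnalyzer analyzer (called ik below).

Installing the ik analyzer

ES provides a script, elasticsearch-plugin (elasticsearch-plugin.bat on Windows), for installing plugins. It is located in the bin directory of the ES installation.

The elasticsearch-plugin script supports three commands, selected by argument: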

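// install
// Install the given plugin into the current ES node.
elasticsearch-plugin install <plugin URL or path>

// list
// Show the plugins installed on the current ES node.
elasticsearch-plugin list

// remove
// Remove an installed plugin.
elasticsearch-plugin remove <plugin name>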

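When installing with elasticsearch-plugin install, the plugin location can be either a remote file URL (online installation) or a file downloaded to the local disk. Either way, for the ik plugin it is a zip file.

Note that the ik version must match the ES version. The ES version installed here is 7.6.2, so ik 7.6.2 is installed as well.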
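
One way to avoid a version mismatch is to derive the plugin URL from the version that the node itself reports on port 9200. The Python sketch below assumes that ik releases follow the URL pattern of the single 7.6.2 link used in this article; that pattern, the IK_RELEASE_URL constant, and the ik_plugin_url helper are assumptions for illustration, and a release for any other version may not exist at that location.

```python
# URL pattern inferred from the 7.6.2 release link used in this article (an assumption).
IK_RELEASE_URL = (
    "https://github.com/medcl/elasticsearch-analysis-ik/releases/download/"
    "v{version}/elasticsearch-analysis-ik-{version}.zip"
)


def ik_plugin_url(node_info):
    """Derive the ik download URL from the JSON returned by GET localhost:9200."""
    version = node_info["version"]["number"]
    return IK_RELEASE_URL.format(version=version)


# Trimmed copy of the node information shown earlier in this article.
node_info = {"version": {"number": "7.6.2", "lucene_version": "8.4.0"}}
print(ik_plugin_url(node_info))
```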

The command for installing from the remote file is:

elasticsearch-plugin  install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.6.2/elasticsearch-analysis-ik-7.6.2.zip

Note that on Linux, when installing a .zip package from a local path, the file path needs the file:/// prefix, for example:

elasticsearch-plugin install file:///home/hadoop/elasticsearch-analysis-ik-7.6.2.zip

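After installation:

The plugins directory under the ES installation directory contains a new analysis-ik directory (holding all files from the root of the unpacked ik zip package: 5 jar files and 1 properties file)
The config directory under the ES installation directory also contains a new analysis-ik directory (holding all files from the config directory of the unpacked ik zip package; this is where ik's custom dictionaries go)

Some material found online says that after installing the ik plugin you need to add the following line to the ES configuration file elasticsearch.yml: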

index.analysis.analyzer.ik.type: "ik"

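In practice, however, ES fails to start once this line is added.

The reason is that newer ES versions no longer need index.analysis.analyzer.ik.type, and adding it causes an error at startup. Starting with ES 5.x, analysis rules are no longer configured in elasticsearch.yml but are specified when the index is created.

Note that ES must be restarted after ik is installed before ik can be used.

Here is another pitfall. Following the ik usage examples found online, the request is: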

curl -H "Content-Type: application/json" -X POST 'localhost:9200/_analyze' -d '
{  
    "analyzer": "ik",
    "text": "我爱编程"  
}' | json

Running this command always returns the following error:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   264  100   207  100    57  25875   7125 --:--:-- --:--:-- --:--:-- 33000
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "failed to find global analyzer [ik]"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "failed to find global analyzer [ik]"
  },
  "status": 400
}

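The error message is clear: the analyzer ik cannot be found.

The reason is that the analyzer name has changed in newer versions and is no longer ik. Newer versions of ik provide two analyzers, ik_max_word and ik_smart; replacing ik with either one fixes the problem.

Using the ik Chinese analyzer

The following compares the output of ik_max_word and ik_smart.

For the Chinese text “世界如此之大”, ik_max_word produces more Chinese words than ik_smart. The downside is that ik_max_word takes up more storage space.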

ik_max_word

curl -H "Content-Type: application/json" -X POST 'localhost:9200/_analyze' -d '
{
    "analyzer": "ik_max_word",
    "text": "世界如此之大"
}' | json

//echo

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   407  100   339  100    68  12107   2428 --:--:-- --:--:-- --:--:-- 14535
{
  "tokens": [
    {
      "token": "世界",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "如此之",
      "start_offset": 2,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "如此",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "之大",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 3
    }
  ]
}

ik_smart

curl -H "Content-Type: application/json" -X POST 'localhost:9200/_analyze' -d '
{
    "analyzer": "ik_smart",
    "text": "世界如此之大"
}' | json


//echo

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   324  100   255  100    69  51000  13800 --:--:-- --:--:-- --:--:-- 64800
{
  "tokens": [
    {
      "token": "世界",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "如此",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "之大",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 2
    }
  ]
}
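
To make the difference between the two outputs explicit, the token lists can be compared directly. The Python sketch below uses the two responses above, trimmed to their token field; token_texts is a hypothetical helper.

```python
def token_texts(analyze_response):
    """Return just the token strings from an _analyze response."""
    return [entry["token"] for entry in analyze_response["tokens"]]


# Responses copied from the two outputs above, trimmed to the "token" field.
max_word = {"tokens": [{"token": "世界"}, {"token": "如此之"}, {"token": "如此"}, {"token": "之大"}]}
smart = {"tokens": [{"token": "世界"}, {"token": "如此"}, {"token": "之大"}]}

# Tokens that only the finer-grained analyzer produced.
extra = [t for t in token_texts(max_word) if t not in token_texts(smart)]
print(extra)  # ['如此之']
```

For this sentence ik_max_word emits everything ik_smart does plus the overlapping word 如此之, which is where the extra storage comes from.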

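Custom dictionaries for ik

Sometimes the dictionary that ships with ik does not cover specific needs (such as domain-specific terms). ik supports custom dictionaries: users can define their own words, and ik treats them as part of its dictionary.

For example, for the sentence “世界如此之大” used above, the ik dictionary does not contain a word like “界如此”. Suppose “界如此” is a domain-specific term that ik should recognize. In that case, customize the ik dictionary as follows.

Step 1: Create a text file with the extension dic and write the entries to add into it, one entry per line. For example, with the file name test.dic, the contents are: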

界如此
高潜

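The example above contains two custom entries.

Step 2: Save the dic file in the analysis-ik directory under the ES config directory (for the Homebrew installation with home directory /usr/local/var/homebrew/linked/elasticsearch-full, the config directory is /usr/local/etc/elasticsearch). You may create a subdirectory and put the file there, for example /usr/local/etc/elasticsearch/analysis-ik/mydict/test.dic
Step 3: Edit the ik configuration file IKAnalyzer.cfg.xml (in the config/analysis-ik directory) and add the following entry: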

<!-- Users can configure their own extension dictionaries here -->
<!-- <entry key="ext_dict"></entry> -->
<entry key="ext_dict">mydict/test.dic</entry>

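This adds the custom dictionary file to ik's dictionaries.

Step 4: Restart ES so that the configuration takes effect, then send a request to check the result: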

curl -H "Content-Type: application/json" -X POST 'localhost:9200/_analyze' -d '
{
    "analyzer": "ik_max_word",
    "text": "世界如此之大"
}' | json


//echo

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   491  100   423  100    68  70500  11333 --:--:-- --:--:-- --:--:-- 81833
{
  "tokens": [
    {
      "token": "世界",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "界如此",
      "start_offset": 1,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "如此之",
      "start_offset": 2,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "如此",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "之大",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 4
    }
  ]
}


The custom entry “界如此” now appears in the output. However, if analyzer is changed to ik_smart, “界如此” is not recognized.
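
Steps 1 and 2 of the procedure above can be scripted when the word list is maintained elsewhere. The Python sketch below writes one entry per line into a dic file; write_ik_dict is a hypothetical helper, UTF-8 encoding is assumed for the dictionary file, and a temporary directory stands in for the real analysis-ik directory.

```python
import tempfile
from pathlib import Path


def write_ik_dict(path, words):
    """Write one dictionary entry per line (UTF-8 assumed), skipping blank entries."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create mydict/ if needed
    entries = [word.strip() for word in words if word.strip()]
    path.write_text("\n".join(entries) + "\n", encoding="utf-8")
    return path


# Demo: the temporary directory stands in for the analysis-ik config directory.
with tempfile.TemporaryDirectory() as config_dir:
    dic = write_ik_dict(Path(config_dir) / "mydict" / "test.dic", ["界如此", "高潜"])
    written = dic.read_text(encoding="utf-8")
    print(written)
```

Steps 3 and 4 (registering the file in IKAnalyzer.cfg.xml and restarting ES) still have to be done as described above.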

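Using Chinese analyzers on documents

elasticsearch-analysis-ik usage documentation | github

The earlier sections only showed simple examples of ik. The following walks through a more complete example.

After an index is created, its analyzer can be set with a REST command, so that data subsequently inserted into the index is processed by that analyzer.

To compare ik's two analyzers, ik_smart and ik_max_word, with the default analyzer standard, we create 3 indexes, one for each analyzer.

Creating the indexes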

curl -H 'Content-Type: application/json' -X PUT 'localhost:9200/index_ik_s' |json
curl -H 'Content-Type: application/json' -X PUT 'localhost:9200/index_ik_m' |json
curl -H 'Content-Type: application/json' -X PUT 'localhost:9200/index_stan' |json

Setting the analyzer

curl -H "Content-Type: application/json" -X PUT 'localhost:9200/index_ik_s/_mapping' -d '
{
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "ik_smart",
                "search_analyzer": "ik_smart"
            }
        }
}' | json

//echo

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   220  100    21  100   199    132   1251 --:--:-- --:--:-- --:--:--  1383
{
  "acknowledged": true
}

curl -H "Content-Type: application/json" -X PUT 'localhost:9200/index_ik_m/_mapping' -d '
{
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_max_word"
            }
        }
}' | json

curl -H "Content-Type: application/json" -X PUT 'localhost:9200/index_stan/_mapping' -d '
{
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "standard",
                "search_analyzer": "standard"
            }
        }
}' | json

Inserting data

To insert the documents in bulk, we use the Linux curl command to perform the REST operations.

curl -XPOST http://localhost:9200/index_ik_s/_create/1 -H 'Content-Type:application/json' -d'
{"content":"美国留给伊拉克的是个烂摊子吗"}' |json

curl -XPOST http://localhost:9200/index_ik_s/_create/2 -H 'Content-Type:application/json' -d'
{"content":"公安部：各地校车将享最高路权"}' |json

curl -XPOST http://localhost:9200/index_ik_s/_create/3 -H 'Content-Type:application/json' -d'
{"content":"中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"}' |json

curl -XPOST http://localhost:9200/index_ik_s/_create/4 -H 'Content-Type:application/json' -d'
{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}' |json

curl -XPOST http://localhost:9200/index_ik_m/_create/1 -H 'Content-Type:application/json' -d'
{"content":"美国留给伊拉克的是个烂摊子吗"}' |json

curl -XPOST http://localhost:9200/index_ik_m/_create/2 -H 'Content-Type:application/json' -d'
{"content":"公安部：各地校车将享最高路权"}' |json

curl -XPOST http://localhost:9200/index_ik_m/_create/3 -H 'Content-Type:application/json' -d'
{"content":"中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"}' |json

curl -XPOST http://localhost:9200/index_ik_m/_create/4 -H 'Content-Type:application/json' -d'
{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}' |json

curl -XPOST http://localhost:9200/index_stan/_create/1 -H 'Content-Type:application/json' -d'
{"content":"美国留给伊拉克的是个烂摊子吗"}' |json

curl -XPOST http://localhost:9200/index_stan/_create/2 -H 'Content-Type:application/json' -d'
{"content":"公安部：各地校车将享最高路权"}' |json

curl -XPOST http://localhost:9200/index_stan/_create/3 -H 'Content-Type:application/json' -d'
{"content":"中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"}' |json

curl -XPOST http://localhost:9200/index_stan/_create/4 -H 'Content-Type:application/json' -d'
{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}' |json
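
As an alternative to one curl call per document, the four documents for an index could be sent in a single request through the _bulk API, which takes a newline-delimited JSON body. The Python sketch below only builds that body; bulk_create_body is a hypothetical helper, and the result would be sent with POST to /_bulk using the header Content-Type: application/x-ndjson.

```python
import json


def bulk_create_body(index, docs):
    """Build a newline-delimited _bulk body: an action line, then the document."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"create": {"_index": index, "_id": str(doc_id)}}))
        lines.append(json.dumps(source, ensure_ascii=False))
    # The _bulk body has to end with a newline.
    return "\n".join(lines) + "\n"


docs = [
    (1, {"content": "美国留给伊拉克的是个烂摊子吗"}),
    (2, {"content": "公安部：各地校车将享最高路权"}),
    (3, {"content": "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"}),
    (4, {"content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}),
]
body = bulk_create_body("index_ik_s", docs)
print(body)
```

Calling the helper again with index_ik_m and index_stan produces the bodies for the other two indexes.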

A simple query example

Here is a simple query example.

curl -XPOST http://localhost:9200/index_ik_s/_search  -H 'Content-Type:application/json' -d'
{
    "query" : { "match" : { "content" : "中国" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}' |json

//echo

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   928  100   697  100   231   1671    553 --:--:-- --:--:-- --:--:--  2225
{
  "took": 413,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.66590554,
    "hits": [
      {
        "_index": "index_ik_s",
        "_type": "_doc",
        "_id": "4",
        "_score": 0.66590554,
        "_source": {
          "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
        },
        "highlight": {
          "content": [
            "<tag1>中国</tag1>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
          ]
        }
      },
      {
        "_index": "index_ik_s",
        "_type": "_doc",
        "_id": "3",
        "_score": 0.61737806,
        "_source": {
          "content": "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"
        },
        "highlight": {
          "content": [
            "中韩渔警冲突调查：韩警平均每天扣1艘<tag1>中国</tag1>渔船"
          ]
        }
      }
    ]
  }
}

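Testing and comparison

First, test support for a standard Chinese word by querying “中国”. All 3 indexes give the same result and handle it well. The requests are as follows: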

curl -XPOST http://localhost:9200/index_ik_s/_search  -H 'Content-Type:application/json' -d'
{
    "query" : { "match" : { "content" : "中国" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}' |json

curl -XPOST http://localhost:9200/index_ik_m/_search  -H 'Content-Type:application/json' -d'
{
    "query" : { "match" : { "content" : "中国" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}' |json

curl -XPOST http://localhost:9200/index_stan/_search  -H 'Content-Type:application/json' -d'
{
    "query" : { "match" : { "content" : "中国" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}' |json

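Next, try an extreme non-standard search with the keyword “均每”. Neither ik_smart nor ik_max_word returns any result, while the default analyzer standard does find content. This is because standard splits every character into a separate search term at query time.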

curl -XPOST http://localhost:9200/index_stan/_search  -H 'Content-Type:application/json' -d'
{
    "query" : { "match" : { "content" : "均每" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}' |json

//echo

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   672  100   442  100   230  22100  11500 --:--:-- --:--:-- --:--:-- 33600
{
  "took": 16,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 2.2112894,
    "hits": [
      {
        "_index": "index_stan",
        "_type": "_doc",
        "_id": "3",
        "_score": 2.2112894,
        "_source": {
          "content": "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"
        },
        "highlight": {
          "content": [
            "中韩渔警冲突调查：韩警平<tag1>均</tag1><tag1>每</tag1>天扣1艘中国渔船"
          ]
        }
      }
    ]
  }
}
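
Why the standard index matches “均每” can be illustrated with a toy model: split both the query and the document into single characters and treat the query as an or over its characters. This Python sketch is a simplification for illustration only, not the actual analyzer or scoring; char_tokens, or_match, and highlight are hypothetical helpers.

```python
def char_tokens(text):
    """One token per letter, digit, or ideograph; punctuation and spaces are dropped."""
    return [ch for ch in text if ch.isalnum()]


def or_match(query, doc):
    """True if the document shares at least one single-character token with the query."""
    return bool(set(char_tokens(query)) & set(char_tokens(doc)))


def highlight(query, doc, pre="<tag1>", post="</tag1>"):
    """Wrap every character of the document that is also a query token."""
    wanted = set(char_tokens(query))
    return "".join(f"{pre}{ch}{post}" if ch in wanted else ch for ch in doc)


docs = [
    "美国留给伊拉克的是个烂摊子吗",
    "公安部：各地校车将享最高路权",
    "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船",
    "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首",
]
matches = [doc for doc in docs if or_match("均每", doc)]
print(len(matches))
print(highlight("均每", matches[0]))
```

Only the third document contains 均 or 每, and each matching character is wrapped separately, as in the highlight field of the response above. The ik analyzers index dictionary words rather than single characters, which fits the observation that they return nothing for this query.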

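Finally, a rule-of-thumb conclusion:

ik_smart handles English well, is smarter and more lightweight, and uses the least storage, so ik_smart is the first recommendation
standard has the best support for English, but for Chinese it crudely builds an inverted index entry for every single character, which wastes storage and gives poor results
ik_max_word covers Chinese more thoroughly than ik_smart, but its storage overhead is too high, so it is not recommended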
