Websites can hardly do without search. This article introduces a popular full-text search technology: Elasticsearch.

This wiki is based on Elasticsearch 6.8.2.

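Getting to know Elasticsearch

Why choose Elasticsearch?

MySQL & Elasticsearch

Compared with traditional relational databases such as MySQL, the advantage of search engines such as ES comes from full-text search, that is, fuzzy matching on text content.

MySQL uses B+ tree indexes, which balance reads and inserts. ES uses an inverted index, which favors queries. The lookup complexity of the two does not differ much; the difference mainly comes from the design approach.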

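`mysql` & `elasticsearch`	Full-text search process
`mysql`	An ordinary index is built on the whole `field` value (overlong values are truncated) and is not tokenized. Exact-value queries therefore use the index, while fuzzy queries fall back to a full table scan.
`elasticsearch`	Builds an index over the tokenized text, so fuzzy queries use the index.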
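
The difference between scanning every row and looking up a tokenized index can be sketched in a few lines of Python. This is only a toy model: the `tokenize` function (lowercase, split on whitespace) and the sample documents are assumptions made for illustration, not how MySQL or Elasticsearch are implemented.

```python
from collections import defaultdict


def tokenize(text):
    """Toy analyzer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each token to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def scan_search(docs, word):
    """LIKE '%word%' style lookup: every document has to be inspected."""
    return {doc_id for doc_id, text in docs.items() if word in text.lower()}


def index_search(index, word):
    """Inverted-index lookup: a single dictionary access."""
    return index.get(word.lower(), set())


# Hypothetical sample data.
docs = {
    1: "Apple iphone 11",
    2: "Sony xperia 5",
    3: "Apple watch",
}
index = build_inverted_index(docs)
print(index_search(index, "apple"))  # ids found through the index
print(scan_search(docs, "apple"))    # same ids, found by scanning all documents
```

Both calls return the same document ids; the point is that the indexed lookup does not need to touch documents that do not contain the word.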

Lucene & Solr & Elasticsearch

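Lucene, Solr and Elasticsearch all use an inverted index.

Search engine	Description	Pros and cons
`Lucene`	An open-source Apache project written entirely in Java.	Lucene is only a library: to use it you must write Java and embed Lucene in your program. The API is powerful but complex to use.
`Solr`	An open-source Apache project built on Lucene.	A mature product.
`Elasticsearch`	An open-source project developed by Elastic (not an Apache project); a RESTful search engine built on the Lucene library.	Lighter and easier to use; growing fast.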

Elasticsearch

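The ES API follows REST principles and can be accessed directly over HTTP.

ES is essentially also a kind of database, and many of its concepts resemble those of relational databases.

Types are deprecated in 7.x and removed in 8.0, and indices created in 6.x already allow only a single type, so do not create multiple types under one index.

Relational DB -> Databases -> Tables -> Rows -> Columns
Elasticsearch -> Indices -> Types -> Documents -> Fields

Access: http://localhost:9200/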

Kibana

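Kibana is a Node.js-based tool for analyzing and visualizing the data in Elasticsearch indices. It also provides a console for working with indices, Dev Tools.

It must be installed separately. Access: http://localhost:5601/

IK analyzer

Much of ES's strength comes from its tokenization capability, but out of the box it does not segment Chinese into words: the default standard analyzer splits Chinese text into single characters. A Chinese analysis plugin, ik-analyzer, needs to be installed separately.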

Using Elasticsearch

The HTTP examples are run in Kibana's Dev Tools.

Index

`settings`	Description
`number_of_shards`	Number of primary shards
`number_of_replicas`	Number of replicas of each shard

PUT /demo
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0 
  }
}

{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "demo"
}

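Type

Use _mapping to define the data structure of a type and the properties of its fields.

Field property	Description
`type`	Data type, for example `keyword`, `text`, `float`. A `keyword` value is indexed as a single term and is not tokenized; `text` is tokenized.
`index`	Whether the field is indexed; defaults to `true`.
`store`	Whether the field value is stored a second time, separately from `_source`.
`analyzer`	The analyzer to use.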

PUT demo/_mapping/item
{
  "properties": {
    "title": {
      "type": "keyword"
    },
    "subTitle": {
      "type": "text",
      "index": true,
      "store": false,
      "analyzer": "ik_max_word"
    }
  }
}

#! Deprecation: [types removal] Specifying types in put mapping requests is deprecated. To be compatible with 7.0, the mapping definition should not be nested under the type name, and the parameter include_type_name must be provided and set to false.
{
  "acknowledged" : true
}

View the mapping

GET /demo/_mapping

#! Deprecation: [types removal] The parameter include_type_name should be explicitly specified in get mapping requests to prepare for 7.0. In 7.0 include_type_name will default to 'false', which means responses will omit the type name in mapping definitions.
{
  "demo" : {
    "mappings" : {
      "item" : {
        "properties" : {
          "subTitle" : {
            "type" : "text",
            "analyzer" : "ik_max_word"
          },
          "title" : {
            "type" : "keyword"
          }
        }
      }
    }
  }
}

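Document

ES is document-oriented; a document is a piece of data in JSON format.

When a document is created, any field that was not defined in the mapping beforehand is added automatically.
By default the document id is generated automatically; POST /<index>/<type>/[id] can be used to specify the id.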

POST /demo/item
{
  "title": "Apple",
  "subTitle": "Apple 苹果手机 iphone11",
  "price": 4999
}

{
  "_index" : "demo",
  "_type" : "item",
  "_id" : "FIST3W8BJuaok6yjTyCs",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

Basic queries

Match all

GET /demo/item/_search
{
  "query": {
    "match_all": {}
  }
}

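Match query

Parameter	Description
`operator`	Operator: `or` matches documents that contain any of the tokens; `and` requires all of the tokens.
`minimum_should_match`	Minimum number of tokens that must match; either a `specific number` or a `percentage`.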

GET /demo/item/_search
{
  "query": {
    "match": {
      "subTitle": {
        "query": "苹果11",
        "operator": "or", 
        "minimum_should_match": "30%"
      }
    }
  }
}

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "demo",
        "_type" : "item",
        "_id" : "FIST3W8BJuaok6yjTyCs",
        "_score" : 1.0,
        "_source" : {
          "title" : "Apple",
          "subTitle" : "Apple 苹果手机 iphone11",
          "price": 4999
        }
      }
    ]
  }
}
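
How `operator` and `minimum_should_match` interact can be modeled roughly as follows. This is a simplified sketch, not Elasticsearch's implementation: it ignores scoring, supports only a positive integer or a positive percentage, rounds the percentage down, and always requires at least one matching token. The token lists are hypothetical and are not the actual output of ik_max_word.

```python
def matches(query_tokens, doc_tokens, operator="or", minimum_should_match=None):
    """Simplified model of a `match` query against one analyzed field."""
    unique_tokens = set(query_tokens)
    total = len(unique_tokens)
    hits = sum(1 for token in unique_tokens if token in doc_tokens)

    if operator == "and":
        # Every query token has to be present.
        return hits == total

    required = 1  # with "or", at least one token has to match
    if minimum_should_match is not None:
        text = str(minimum_should_match)
        if text.endswith("%"):
            # Percentage of the query tokens, rounded down.
            required = total * int(text[:-1]) // 100
        else:
            required = int(text)
        required = max(1, required)
    return hits >= required


# Hypothetical tokens for the query and for the indexed subTitle.
query_tokens = ["苹果", "11"]
doc_tokens = {"apple", "苹果", "手机", "iphone11"}

print(matches(query_tokens, doc_tokens, operator="or"))
print(matches(query_tokens, doc_tokens, operator="and"))
print(matches(query_tokens, doc_tokens, minimum_should_match="30%"))
print(matches(query_tokens, doc_tokens, minimum_should_match="100%"))
```

Under these assumed tokens only one of the two query tokens is present, so `or` and `30%` match while `and` and `100%` do not.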

Multi-field match

GET /demo/item/_search
{
  "query": {
    "multi_match": {
      "query": "Apple",
      "fields": ["title", "subTitle"]
    }
  }
}

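Term query

A term query compares the given value directly with the indexed terms, without tokenizing it first. It is generally used for values that should not be split.

Single term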

GET /demo/item/_search
{
  "query": {
    "term": {
      "subTitle": "手机"
    }
  }
}

Multiple terms

GET /demo/item/_search
{
  "query": {
    "terms": {
      "subTitle": ["苹果", "iphone"]
    }
  }
}
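
The practical difference between `term`/`terms` and `match` is whether the query input is analyzed before it is compared with the indexed terms. The sketch below illustrates that idea only; the `analyze` function (lowercase, split on whitespace) and the sample text are assumptions, not the behavior of a specific Elasticsearch analyzer.

```python
def analyze(text):
    """Toy analyzer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def term_query(indexed_terms, value):
    """`term`: the raw value is compared with the indexed terms as-is."""
    return value in indexed_terms


def terms_query(indexed_terms, values):
    """`terms`: matches if any one of the raw values is an indexed term."""
    return any(value in indexed_terms for value in values)


def match_query(indexed_terms, text):
    """`match`: the query text is analyzed first, then any token may match."""
    return any(token in indexed_terms for token in analyze(text))


# Terms produced at index time for a hypothetical text field.
indexed_terms = set(analyze("Apple iPhone 11"))

print(term_query(indexed_terms, "Apple"))              # raw value, not analyzed
print(term_query(indexed_terms, "apple"))              # equals an indexed term
print(match_query(indexed_terms, "Apple"))             # analyzed to "apple" first
print(terms_query(indexed_terms, ["sony", "iphone"]))  # one of the values matches
```

In this model a `term` lookup for "Apple" misses because the indexed term is the lowercased "apple", which is why term queries are a better fit for `keyword` fields.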

Source filtering

Use _source to select which fields are returned.

`_source`	Description
`""` or `[]`	The field or fields to return
`includes`	Fields to include
`excludes`	Fields to exclude

GET /demo/item/_search
{
  "query": {
    "terms": {
      "subTitle": ["苹果", "iphone"]
    }
  },
  "_source": "subTitle"
}

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "demo",
        "_type" : "item",
        "_id" : "FIST3W8BJuaok6yjTyCs",
        "_score" : 1.0,
        "_source" : {
          "subTitle" : "Apple 苹果手机 iphone11"
        }
      }
    ]
  }
}

Advanced queries

Bool query

`bool`	Description
`must`	AND
`should`	OR
`must_not`	NOT

GET /demo/item/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "Apple"
          }
        }
      ]
    }
  }
}

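Fuzzy query

As the examples so far show, match is mostly token-based matching, not a fully fuzzy query in the way like is. For typo-tolerant matching, fuzzy can be used: it matches indexed terms that are within a given edit distance of the query value.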

GET /demo/item/_search
{
  "query": {
    "fuzzy": {
      "subTitle": {
        "value": "iphon",
        "fuzziness": 1
      }
    }
  }
}
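
The `fuzziness` parameter is an edit distance. The sketch below uses the classic Levenshtein distance (insert, delete, substitute) to show why "iphon" with fuzziness 1 can reach "iphone". It is a simplification: Elasticsearch by default also counts a transposition of two adjacent characters as a single edit, which this version does not. The indexed terms are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def fuzzy_match(indexed_terms, value, fuzziness=1):
    """Return the indexed terms within `fuzziness` edits of `value`."""
    return sorted(
        term for term in indexed_terms
        if edit_distance(value, term) <= fuzziness
    )


# Hypothetical terms in the index for the subTitle field.
indexed_terms = {"apple", "iphone", "11"}
print(edit_distance("iphon", "iphone"))
print(fuzzy_match(indexed_terms, "iphon", fuzziness=1))
```

Note that the comparison is made against individual indexed terms, so the result depends on how the analyzer tokenized the field.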

Range query

Numeric fields are usually queried by range; gte | gt | lte | lt are supported.

GET /demo/item/_search
{
  "query": {
    "range": {
      "price": {
        "gte": 4000,
        "lte": 6000
      }
    }
  }
}

Filtering results

filter narrows the result set without affecting the relevance score, similar to a nested subquery.

GET /demo/item/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "subTitle": "手机"
          }
        }
      ],
      "filter": {
        "range": {
          "price": {
            "gte": 4000,
            "lte": 6000
          }
        }
      }
    }
  }
}
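
The division of labor between `must` and `filter` can be pictured like this: `must` decides whether a document matches and contributes to its score, while `filter` is a plain yes/no check. The sketch is a toy model only; the scoring function (a simple word count) and the sample documents, including their prices, are assumptions for illustration.

```python
def score(doc, word):
    """Toy relevance (assumption): how often the word occurs in subTitle."""
    return doc["subTitle"].lower().split().count(word)


def bool_search(docs, must_word, price_gte, price_lte):
    """Model of bool { must: match, filter: range } over in-memory docs."""
    results = []
    for doc in docs:
        relevance = score(doc, must_word)
        if relevance == 0:
            continue  # the must clause did not match
        if not (price_gte <= doc["price"] <= price_lte):
            continue  # the filter is yes/no only; it never changes the score
        results.append((relevance, doc["title"]))
    return sorted(results, reverse=True)  # highest score first


# Hypothetical documents.
docs = [
    {"title": "Apple", "subTitle": "apple phone iphone11", "price": 4999},
    {"title": "Sony", "subTitle": "sony phone xperia phone", "price": 4299},
    {"title": "Nokia", "subTitle": "nokia phone", "price": 999},
]
print(bool_search(docs, "phone", 4000, 6000))
```

The third document matches the `must` clause but is dropped by the price filter, and the scores of the remaining two come from the `must` clause alone.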

Sorting

sort orders the result set.

GET /demo/item/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "price": {
        "order": "desc"
      }
    }
  ]
}

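Aggregations

Beyond plain queries, aggregations can be used for statistics and analysis.

Aggregation type	Description	Common examples
`bucket`	Groups the data along some dimension; each group is called a `bucket` in `es`.	Terms buckets, histogram buckets, range buckets, etc.
`metric`	Once the data is grouped, aggregate computations such as average, maximum, minimum and sum can be run on each group; these are called `metrics` in `es`.	Average, maximum, minimum, sum, count, etc.

Terms buckets

Terms buckets work on keyword fields; by default they cannot be used on text fields.
size: the number of hits to return, that is, the content of hits. With 0, only the aggregation results are returned.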

GET /demo/item/_search
{
  "size": 0, 
  "aggs": {
    "agg_title": {
      "terms": {
        "field": "title"
      }
    }
  }
}

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "agg_title" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Apple",
          "doc_count" : 1
        },
        {
          "key" : "Sony",
          "doc_count" : 1
        }
      ]
    }
  }
}
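
Conceptually, a terms aggregation counts documents per distinct field value. A minimal sketch of that idea, assuming in-memory documents with hypothetical prices and ordering buckets by count (descending) and then by key:

```python
from collections import Counter


def terms_buckets(docs, field):
    """Group documents by the exact value of `field` and count each group."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered]


# Hypothetical documents; only `title` matters for this aggregation.
docs = [
    {"title": "Apple", "price": 4999},
    {"title": "Sony", "price": 4299},
]
print(terms_buckets(docs, "title"))
```

Because the grouping is on the exact field value, a tokenized text field would not give meaningful buckets, which is why a keyword field is used.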

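Histogram buckets

Numeric fields are often grouped into fixed-width intervals.

`histogram`	Description
`field`	The field.
`interval`	The interval width.
`min_doc_count`	Minimum document count; a bucket is shown only if its count is greater than or equal to this value.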

GET /demo/item/_search
{
  "size": 0,
  "aggs": {
    "price_histogram": {
      "histogram": {
        "field": "price",
        "interval": 500,
        "min_doc_count": 1
      }
    }
  }
}

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "price_histogram" : {
      "buckets" : [
        {
          "key" : 4000.0,
          "doc_count" : 1
        },
        {
          "key" : 4500.0,
          "doc_count" : 1
        }
      ]
    }
  }
}
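
Each value is assigned to the bucket whose key is the value rounded down to a multiple of `interval`, which is why a price of 4999 lands in the 4500 bucket above. A minimal sketch of that rule, assuming no offset and using hypothetical prices; with `min_doc_count` of 0 the empty buckets between the lowest and highest key are kept:

```python
import math
from collections import Counter


def histogram_buckets(values, interval, min_doc_count=0):
    """Bucket numeric values by floor(value / interval) * interval."""
    if not values:
        return []
    counts = Counter(math.floor(value / interval) * interval for value in values)
    lowest, highest = min(counts), max(counts)
    buckets = []
    key = lowest
    while key <= highest:
        doc_count = counts.get(key, 0)
        if doc_count >= min_doc_count:
            buckets.append({"key": float(key), "doc_count": doc_count})
        key += interval
    return buckets


# Hypothetical prices of two documents.
prices = [4999, 4299]
print(histogram_buckets(prices, interval=500, min_doc_count=1))
```

Setting `min_doc_count` to 1, as in the request above, simply hides the empty intervals.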

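Range buckets

Range buckets are similar to histogram buckets in that numbers are grouped by interval, but the start and end of each range must be specified explicitly.

`range`	Description
`field`	The field.
`ranges`	The ranges; `from` is the start value (inclusive), `to` is the end value (exclusive).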

GET /demo/item/_search
{
  "size": 0,
  "aggs": {
    "price_histogram": {
      "range": {
        "field": "price",
        "ranges": [
          {
            "from": 3000,
            "to": 5000
          }
        ]
      }
    }
  }
}

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "price_histogram" : {
      "buckets" : [
        {
          "key" : "3000.0-5000.0",
          "from" : 3000.0,
          "to" : 5000.0,
          "doc_count" : 2
        }
      ]
    }
  }
}
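
The boundary behavior is the part that is easy to get wrong: in a range aggregation `from` is inclusive and `to` is exclusive. A minimal sketch of the counting rule, using hypothetical prices:

```python
def range_buckets(values, ranges):
    """Count values per range; `from` is inclusive, `to` is exclusive."""
    buckets = []
    for bounds in ranges:
        low = bounds.get("from", float("-inf"))   # open-ended if omitted
        high = bounds.get("to", float("inf"))
        doc_count = sum(1 for value in values if low <= value < high)
        buckets.append({
            "from": bounds.get("from"),
            "to": bounds.get("to"),
            "doc_count": doc_count,
        })
    return buckets


# Hypothetical prices of two documents.
prices = [4999, 4299]
print(range_buckets(prices, [{"from": 3000, "to": 5000}]))
```

A document priced at exactly 5000 would fall outside this bucket, so adjacent ranges such as 3000-5000 and 5000-7000 do not double count.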

Appendix

[Mac] Installing elasticsearch

Only the Homebrew-based installation is covered.

elasticsearch

# Install
$ brew install elasticsearch

# Start
$ brew services start elasticsearch

# Verify
$ curl http://localhost:9200

# Configure
$ vi /usr/local/etc/elasticsearch/elasticsearch.yml

kibana

# Install
$ brew install kibana

# Start
$ brew services start kibana

# Verify (open in a browser)
http://localhost:5601

ik analyzer

Download the package that matches your ES version.

Download: ik-analyzer

$ brew info elasticsearch
# Unzip the package and copy it to: /usr/local/var/elasticsearch/plugins/

# Restart
$ brew services restart elasticsearch