Custom Analyzers
An analyzer definition has the following main fields:
- type: the analyzer type; set it to a built-in analyzer name, or to custom (which may be omitted) when you define your own tokenizer
- tokenizer: the tokenizer to use (required for a custom analyzer)
- char_filter: zero or more character filters
- filter: zero or more token filters
- position_increment_gap: the positional gap inserted between the values of a multi-valued field (default 100), so that phrase queries do not match across separate values
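position_increment_gap is easiest to see on a multi-valued field: the gap is inserted between consecutive values so that phrase queries cannot match across them. A minimal sketch (the index name and field name here are made up for illustration):

PUT my-index-000002
{
  "mappings": {
    "properties": {
      "names": {
        "type": "text",
        "position_increment_gap": 100
      }
    }
  }
}

With ["John Abraham", "Lincoln Smith"] indexed into names, a match_phrase query for "Abraham Lincoln" does not match, because Lincoln is positioned 100 slots after Abraham.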
analyzer / analysis
Text analysis is how a document is turned into the inverted index: a piece of text is broken down into individual terms.
Character Filter
Preprocesses the raw character stream before tokenization, e.g. replacing or mapping characters.
Tokenizer
Splits the text stream into tokens.
Filter (token filter)
Further processes each token, e.g. lowercasing; the processed tokens become the terms stored in the index.
An analyzer consists of exactly 1 tokenizer, 0 or more token filters, and 0 or more character filters; they run in the order char_filter → tokenizer → filter.
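The pipeline can be inspected directly with the _analyze API; for example, with the built-in standard analyzer:

POST _analyze
{
  "analyzer": "standard",
  "text": "The QUICK Brown Foxes!"
}

This returns the tokens [the, quick, brown, foxes]: the standard tokenizer splits on word boundaries and strips punctuation, and the lowercase token filter lowercases each token.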
Example
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
# custom analyzer, referencing the tokenizer, char_filter, and filter defined below
"my_custom_analyzer": {
"char_filter": [
"emoticons"
],
"tokenizer": "punctuation",
"filter": [
"lowercase",
"english_stop"
]
}
},
# definitions of the three building blocks
# tokenizer
"tokenizer": {
"punctuation": {
"type": "pattern",
"pattern": "[ .,!?]"
}
},
# character filter (mapping)
"char_filter": {
"emoticons": {
"type": "mapping",
"mappings": [
":) => _happy_",
":( => _sad_"
]
}
},
# token filter
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_custom_analyzer",
"text": "I'm a :) person, and you?"
}
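Running the request above: the emoticons char filter rewrites :) to _happy_, the punctuation tokenizer splits on [ .,!?], lowercase lowercases each token, and english_stop drops the English stopwords a and and, leaving the terms:

[ i'm, _happy_, person, you ]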
References
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html