Elasticsearch Chinese segmentation + pinyin analysis fails to generate the full pinyin for some words
ES version: 7.6.2; language: PHP
This is my custom analyzer definition:
"analysis" => [
    "analyzer" => [
        "my_analyzer" => [
            "type" => "custom",
            "tokenizer" => "ik_max_word",
            "filter" => "my_filter"
        ]
    ],
    "filter" => [
        "my_filter" => [
            "type" => "pinyin",
            "first_letter" => "prefix",
            "padding_char" => "",
            "keep_separate_first_letter" => false,
            "keep_full_pinyin" => true,
            "keep_original" => true,
            "limit_first_letter_length" => 16,
            "lowercase" => true,
            "keep_first_letter" => true,
            "remove_duplicated_term" => true,
            "keep_joined_full_pinyin" => true // set to true here
        ]
    ]
],
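For reference, this is the raw JSON the PHP array above serializes to, in Kibana console form (the index name my_index is a placeholder):

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": "my_filter"
        }
      },
      "filter": {
        "my_filter": {
          "type": "pinyin",
          "first_letter": "prefix",
          "padding_char": "",
          "keep_separate_first_letter": false,
          "keep_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "keep_first_letter": true,
          "remove_duplicated_term": true,
          "keep_joined_full_pinyin": true
        }
      }
    }
  }
}
```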
Analyzing the term "腾讯" returns:
{
  "tokens": [
    {
      "token": "teng",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "xun",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "腾讯",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "tengxun",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "tx",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}
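The token lists here are `_analyze` API responses; assuming an index created with the settings above (my_index is a placeholder name), the request is of the form:

```
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "腾讯"
}
```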
Analyzing "微信" returns:
{
  "tokens": [
    {
      "token": "wei",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_CHAR",
      "position": 0
    },
    {
      "token": "微",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_CHAR",
      "position": 0
    },
    {
      "token": "w",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_CHAR",
      "position": 0
    },
    {
      "token": "xin",
      "start_offset": 1,
      "end_offset": 2,
      "type": "CN_CHAR",
      "position": 1
    },
    {
      "token": "信",
      "start_offset": 1,
      "end_offset": 2,
      "type": "CN_CHAR",
      "position": 1
    },
    {
      "token": "x",
      "start_offset": 1,
      "end_offset": 2,
      "type": "CN_CHAR",
      "position": 1
    }
  ]
}
The full pinyin for 微信 is simply never generated, so searching for "weixin" finds nothing, while adding a space in the middle ("wei xin") works. This affects not only 微信 but also words such as 钉钉 and 微博.
I tried modifying the plugin source code, but it had no effect.
I don't know where the problem is that prevents the full pinyin from being generated for some words.
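One thing I noticed while narrowing this down: the 微信 tokens above are typed CN_CHAR while the 腾讯 tokens are CN_WORD, which suggests ik_max_word is splitting 微信 into single characters before the pinyin filter ever sees it, and keep_joined_full_pinyin can only join pinyin within a single input token. This can be checked by running the ik tokenizer alone, without the pinyin filter (my_index is a placeholder name):

```
GET my_index/_analyze
{
  "tokenizer": "ik_max_word",
  "text": "微信"
}
```

If this returns 微 and 信 as two separate tokens rather than 微信 as one, the missing full pinyin would come from ik's dictionary rather than the pinyin filter itself.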