I am trying to use Elasticsearch (version 6.8) to find the tags most similar to a piece of text. As the score, I expect the sum of the similarity scores of the matching tags, rather than the value produced by Elasticsearch's default scoring formula.
For example, I create my_test_index and insert three documents:
POST my_test_index/_doc/17
{
  "id": 17,
  "tags": ["devops", "server", "hardware"]
}

POST my_test_index/_doc/20
{
  "id": 20,
  "tags": ["software", "application", "developer", "develop"]
}

POST my_test_index/_doc/21
{
  "id": 21,
  "tags": ["electronic", "electric"]
}
I did not define a mapping, so the index uses the default one, as shown below:
{
"my_test_index" : {
"aliases" : { },
"mappings" : {
"_doc" : {
"properties" : {
"id" : {
"type" : "long"
},
"tags" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
},
"settings" : {
"index" : {
"creation_date" : "1585820383702",
"number_of_shards" : "5",
"number_of_replicas" : "1",
"uuid" : "05SgLog6S-GTSShTatrvQw",
"version" : {
"created" : "6080199"
},
"provided_name" : "my_test_index"
}
}
}
}
Then I run the query below:
GET my_test_index/_search
{
  "query": {
    "more_like_this": {
      "fields": [
        "tags"
      ],
      "like": [
        "i like electric devices and develop some softwares."
      ],
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}
And I get this response:
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "my_test_index",
"_type" : "_doc",
"_id" : "21",
"_score" : 0.2876821,
"_source" : {
"id" : 21,
"tags" : [
"electronic",
"electric"
]
}
},
{
"_index" : "my_test_index",
"_type" : "_doc",
"_id" : "20",
"_score" : 0.2876821,
"_source" : {
"id" : 20,
"tags" : [
"software",
"application",
"developer",
"develop"
]
}
}
]
}
}
If I set explain: true, the result is:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.2876821,
"hits" : [
{
"_shard" : "[my_test_index][1]",
"_node" : "maQL1REnQHaff51ekrqMxA",
"_index" : "my_test_index",
"_type" : "_doc",
"_id" : "21",
"_score" : 0.2876821,
"_source" : {
"id" : 21,
"tags" : [
"electronic",
"electric"
]
},
"_explanation" : {
"value" : 0.2876821,
"description" : "weight(tags:electric in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.2876821,
"description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details" : [
{
"value" : 0.2876821,
"description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details" : [
{
"value" : 1.0,
"description" : "docFreq",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "docCount",
"details" : [ ]
}
]
},
{
"value" : 1.0,
"description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details" : [
{
"value" : 1.0,
"description" : "termFreq=1.0",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "parameter k1",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "parameter b",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "avgFieldLength",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "fieldLength",
"details" : [ ]
}
]
}
]
}
]
}
},
{
"_shard" : "[my_test_index][2]",
"_node" : "maQL1REnQHaff51ekrqMxA",
"_index" : "my_test_index",
"_type" : "_doc",
"_id" : "20",
"_score" : 0.2876821,
"_source" : {
"id" : 20,
"tags" : [
"software",
"application",
"developer",
"develop"
]
},
"_explanation" : {
"value" : 0.2876821,
"description" : "weight(tags:develop in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.2876821,
"description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details" : [
{
"value" : 0.2876821,
"description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details" : [
{
"value" : 1.0,
"description" : "docFreq",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "docCount",
"details" : [ ]
}
]
},
{
"value" : 1.0,
"description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details" : [
{
"value" : 1.0,
"description" : "termFreq=1.0",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "parameter k1",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "parameter b",
"details" : [ ]
},
{
"value" : 4.0,
"description" : "avgFieldLength",
"details" : [ ]
},
{
"value" : 4.0,
"description" : "fieldLength",
"details" : [ ]
}
]
}
]
}
]
}
}
]
}
}
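To check that I am reading the explain output correctly, here is a small Python sketch that recomputes the score of _id 21 from the values shown above. The function name bm25_term_score is my own and not part of Elasticsearch, and I am assuming that log in the idf description is the natural logarithm. Note that docCount is 1 even though the index holds three documents, which would be consistent with term statistics being computed per shard.

```python
import math


def bm25_term_score(doc_freq, doc_count, freq, k1, b, field_length, avg_field_length):
    """Recompute one term's score from the formulas printed in the explain output."""
    # idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # tfNorm, computed as (freq * (k1 + 1)) /
    #   (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))
    tf_norm = (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )
    return idf * tf_norm


# Values copied from the explanation of _id 21 (term "electric").
score_21 = bm25_term_score(
    doc_freq=1.0,
    doc_count=1.0,
    freq=1.0,
    k1=1.2,
    b=0.75,
    field_length=2.0,
    avg_field_length=2.0,
)
print(round(score_21, 7))  # prints 0.2876821
```

The explanation for _id 20 differs only in fieldLength and avgFieldLength (both 4.0), so the ratio is again 1 and the score is the same. Each hit is scored by a single matching term, which is why both documents end up with an identical value.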
But this result is not what I need. I want the score to be the sum of the similarity scores of the matching tags, as follows. The word "electric" in the text is equal to the tag "electric", which gives a score of 1.0, and it is similar to the tag "electronic", which gives ~0.7. The word "develop" in the text is equal to the tag "develop", which gives 1.0, and it is similar to the tag "developer", which gives ~0.8. The word "softwares" in the text is similar to the tag "software", which gives ~0.9, and so on.
So I expect this result: the summed score for _id: 20 is ~2.7, for _id: 21 it is ~1.7, and so on.
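To make the expected scoring concrete, here is a rough Python sketch of what I mean, outside of Elasticsearch. It uses difflib.SequenceMatcher from the standard library as a stand-in similarity measure. The function names and the 0.6 threshold are arbitrary choices of mine, so the numbers it produces will not exactly match the ~0.7 / ~0.8 / ~0.9 values above.

```python
import re
from difflib import SequenceMatcher


def best_similarity(tag, words):
    """Highest similarity ratio between one tag and any word of the text."""
    return max((SequenceMatcher(None, tag, w).ratio() for w in words), default=0.0)


def summed_tag_score(text, tags, threshold=0.6):
    """Sum the best per-tag similarities that reach the (assumed) threshold."""
    # Lowercase the text and keep only alphabetic words, dropping punctuation.
    words = re.findall(r"[a-z]+", text.lower())
    scores = (best_similarity(tag.lower(), words) for tag in tags)
    return sum(s for s in scores if s >= threshold)


text = "i like electric devices and develop some softwares."
docs = {
    17: ["devops", "server", "hardware"],
    20: ["software", "application", "developer", "develop"],
    21: ["electronic", "electric"],
}

for doc_id, tags in docs.items():
    print(doc_id, round(summed_tag_score(text, tags), 2))
```

With this measure an exact match contributes 1.0 and near matches contribute less, so _id 20 ranks above _id 21 as I would like. A tag such as "devops" may still pick up a partial score from "develop", so the similarity measure and threshold would need tuning.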
I was hoping someone could give an example of how to do this, or at least point me in the right direction.
Thanks.