
999 - Elasticsearch Analysis 02 - Analyzer


Standard Analyzer

  • The default analyzer, suitable for most languages.
  • Splits text into terms as defined by the Unicode Text Segmentation algorithm.
  • Removes most punctuation, lowercases terms, and supports removing stop words.

The standard analyzer consists of:

  • Tokenizer
    • Standard Tokenizer
  • Token Filters
    • Standard Token Filter
    • Lower Case Token Filter
    • Stop Token Filter (disabled by default)

Standard Analyzer Example

POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]

Standard Analyzer Configuration

  • max_token_length: The maximum token length. If a token exceeds this length, it is split at max_token_length intervals. Defaults to 255.
  • stopwords: A pre-defined stop word list such as _english_, or an array of stop words. Defaults to _none_.
  • stopwords_path: The path to a file containing stop words.

Example

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_english_analyzer", 
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ] (jumped is split into jumpe and d because of max_token_length, and the stop word the is removed)

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_array_analyzer": {
          "type": "standard",
          "stopwords": ["the","2","quick","brown","foxes","jumped","over","dog's","bone"]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_array_analyzer", 
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ lazy ]

Simple Analyzer

  • The simple analyzer splits text on any non-letter character and lowercases the terms.

The simple analyzer consists of:

  • Tokenizer
    • Lower Case Tokenizer

Simple Analyzer Example

POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

Whitespace Analyzer

  • Splits text on whitespace characters.

The whitespace analyzer consists of:

  • Tokenizer
    • Whitespace Tokenizer

Whitespace Analyzer Example

POST _analyze
{
  "analyzer": "whitespace"
  , "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
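
Since the whitespace analyzer is nothing more than the Whitespace Tokenizer, the same result can be reproduced by passing the tokenizer to _analyze directly. This is a convenient way to experiment with individual building blocks before combining them into a custom analyzer; only built-in components are used below.

POST _analyze
{
  "tokenizer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

A filter array (and a char_filter array) can be added to the same request to try out a complete analysis chain without creating an index.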

Stop Analyzer

  • Like the simple analyzer, but also supports removing stop words. Uses the _english_ stop word list by default.

The stop analyzer consists of:

  • Tokenizer
    • Lower Case Tokenizer
  • Token filters
    • Stop Token Filter

Stop Analyzer Example

POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]

Stop Analyzer Configuration

  • stopwords: A pre-defined stop word list such as _english_, or an array of stop words. Defaults to _english_.
  • stopwords_path: The path to a file containing stop words, relative to the Elasticsearch config directory.

Example

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer":{
          "type": "stop",
          "stopwords":  ["the","2","quick","brown","foxes","jumped","over","dog","s","bone"]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_stop_analyzer", 
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ lazy ]
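
The stopwords_path option is not demonstrated above. As a rough sketch, assuming a hypothetical file stopwords/my_stopwords.txt (one stop word per line) exists under the Elasticsearch config directory on every node, and using a made-up index name:

PUT my_stopfile_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_file_stop_analyzer": {
          "type": "stop",
          "stopwords_path": "stopwords/my_stopwords.txt"
        }
      }
    }
  }
}

POST my_stopfile_index/_analyze
{
  "analyzer": "my_file_stop_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The path is resolved relative to the config directory, and the file has to be present on every node of the cluster.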

Keyword Analyzer

  • Does not split the text at all; the entire input is output as a single token.

The keyword analyzer consists of:

  • Tokenizer
    • Keyword Tokenizer

Keyword Analyzer Example

POST _analyze
{
  "analyzer": "keyword", 
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ]

Pattern Analyzer

  • Splits text according to a regular expression; the default pattern is \W+.

The pattern analyzer consists of:

  • Tokenizer
    • Pattern Tokenizer
  • Token Filters
    • Lower Case Token Filter
    • Stop Token Filter (disabled by default)

Pattern Analyzer Example

POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

Pattern Analyzer Configuration

  • pattern: A Java regular expression. Defaults to \W+.
  • flags: Java regular expression flags; separate multiple flags with |, for example "CASE_INSENSITIVE|COMMENTS".
  • lowercase: Whether terms should be lowercased. Defaults to true.
  • stopwords: A pre-defined stop word list such as _english_, or an array of stop words. Defaults to _none_.
  • stopwords_path: The path to a file containing stop words.

Example

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_pattern_analyzer": {
          "type": "pattern",
          "pattern": "\\W|_",
          "lowercase": true
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_pattern_analyzer", 
  "text": "John_Smith@foo-bar.com"
}

Produces [ john, smith, foo, bar, com ]
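
The lowercase option is not covered by the example above. As a hypothetical illustration (the index and analyzer names are made up), a pattern analyzer that splits comma-separated values and preserves the original case could look like this:

PUT my_csv_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "csv_analyzer": {
          "type": "pattern",
          "pattern": ",\\s*",
          "lowercase": false
        }
      }
    }
  }
}

POST my_csv_index/_analyze
{
  "analyzer": "csv_analyzer",
  "text": "Beijing, Shanghai, Guangzhou"
}

This should produce [ Beijing, Shanghai, Guangzhou ], with capitalization kept because lowercase is false.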

Fingerprint Analyzer

  • Lowercases the input, normalizes it by removing diacritics (ASCII folding), sorts the terms, and removes duplicates; the result is concatenated into a single token.
  • Stop words can also be configured.

The fingerprint analyzer consists of:

  • Tokenizer
    • Standard Tokenizer
  • Token Filters (in the following order)
    • Lower Case Token Filter
    • ASCII Folding Token Filter
    • Stop Token Filter (disabled by default)
    • Fingerprint Token Filter

Fingerprint Analyzer Example

POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}

Produces [ and consistent godel is said sentence this yes ]

Fingerprint Analyzer Configuration

  • separator: The character used to concatenate the terms. Defaults to a space.
  • max_output_size: The maximum size of the emitted token. If the fingerprint exceeds this size, the entire token is discarded (it is not truncated). Defaults to 255.
  • stopwords: A pre-defined stop word list such as _english_, or an array of stop words. Defaults to _none_.
  • stopwords_path: The path to a file containing stop words.

Example

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer":{
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}


Produces [ consistent godel said sentence yes ]

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer":{
          "type": "fingerprint",
          "stopwords": "_english_",
          "separator": "-"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}

Produces [ consistent-godel-said-sentence-yes ]

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer":{
          "type": "fingerprint",
          "max_output_size": 30
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}

Produces nothing: the fingerprint exceeds max_output_size, so the entire token is discarded.

Additional Notes

  • whitespace splits on whitespace, while standard extracts words. For example, whitespace leaves Brown-Foxes unchanged, whereas standard splits it into brown and foxes.
  • simple splits on every non-letter character, while standard does not always. For example, simple splits dog's into dog and s, whereas standard keeps dog's.
  • In short: whitespace splits on whitespace, simple splits on non-letters, and standard splits into words (possessives are kept), as the comparison below shows.
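
A quick way to see these differences side by side is to run the same snippet through all three built-in analyzers:

POST _analyze
{
  "analyzer": "whitespace",
  "text": "Brown-Foxes dog's"
}

POST _analyze
{
  "analyzer": "simple",
  "text": "Brown-Foxes dog's"
}

POST _analyze
{
  "analyzer": "standard",
  "text": "Brown-Foxes dog's"
}

whitespace returns [ Brown-Foxes, dog's ], simple returns [ brown, foxes, dog, s ], and standard returns [ brown, foxes, dog's ].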

Custom Analyzer

  • Zero or more character filters
  • One tokenizer
  • Zero or more token filters

Custom Analyzer Configuration

  • tokenizer: A built-in or custom tokenizer.
  • char_filter: Built-in or custom character filters. Optional.
  • filter: Built-in or custom token filters. Optional.
  • position_increment_gap: When a field value is an array with multiple values, the term positions are shifted between values so that phrase queries cannot match across values. Defaults to 100. For example, [ "John Abraham", "Lincoln Smith" ] is analyzed with positions 0, 1 and 101, 102, which prevents cross-value matches. See the position_increment_gap section of the Mapping article, and the sketch below.
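
The table above describes the analyzer-level setting; the field mapping parameter of the same name is easier to demonstrate, so the sketch below uses it instead (assuming Elasticsearch 7+ typeless mappings and a made-up gap_index):

PUT gap_index
{
  "mappings": {
    "properties": {
      "names": {
        "type": "text",
        "position_increment_gap": 100
      }
    }
  }
}

PUT gap_index/_doc/1?refresh
{
  "names": [ "John Abraham", "Lincoln Smith" ]
}

GET gap_index/_search
{
  "query": {
    "match_phrase": {
      "names": "Abraham Lincoln"
    }
  }
}

abraham ends up at position 1 and lincoln at position 101, so the phrase query finds no match; with "position_increment_gap": 0 it would match across the two values.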

Example 1:

  • Character Filter
    • HTML Strip Character Filter
  • Tokenizer
    • Standard Tokenizer
  • Token Filters
    • Lowercase Token Filter
    • ASCII-Folding Token Filter
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter":[
            "html_strip"
            ],
          "filter": [
            "lowercase",
            "asciifolding"
            ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}

Produces [ is, this, deja, vu ]

Example 2

  • Character Filter
    • Mapping Character Filter: replaces :) with _happy_ and :( with _sad_
  • Tokenizer
    • Pattern Tokenizer
  • Token Filters
    • Lowercase Token Filter
    • Stop Token Filter
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer"


    
: {
          "type": "custom",
          "char_filter": [
              "emoticons"
            ],
          "tokenizer": "punctuation",
          "filter": [
            "lowercase",
            "english_stop"
            ]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
            ]
        }
      },
      "filter": {
        "english_stop":{
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text":     "I'm a :) person, and you?"
}

Produces [ i'm, _happy_, person, you ]
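
Defining my_custom_analyzer in the index settings only makes it available; to actually use it at index and search time it has to be referenced from a field mapping. A minimal sketch, assuming Elasticsearch 7+ typeless mappings and a hypothetical comment field added to the my_index created above:

PUT my_index/_mapping
{
  "properties": {
    "comment": {
      "type": "text",
      "analyzer": "my_custom_analyzer"
    }
  }
}

POST my_index/_analyze
{
  "field": "comment",
  "text": "I'm a :) person, and you?"
}

When _analyze is given a field instead of an analyzer, it uses the analyzer that field is mapped with, so the second request should again produce [ i'm, _happy_, person, you ].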
