/ /

/ /

Token Filters

A token filter performs operations such as the following:

Stemming, which reduces related words, such as "talking", "talked", and "talks" to their root word "talk".
Redaction, the removal of sensitive information from public documents.

Token Filters require a type field, and some take additional options as well.

Syntax

"tokenFilters": [
  {
    "type": "<token-filter-type>",
    "<additional-option>": <value>
  }
]

Token Filter Types

MongoDB Search supports the following types of token filter:

asciiFolding
daitchMokotoffSoundex
edgeGram
englishPossessive
flattenGraph
icuFolding
icuNormalizer
kStemming
length
lowercase
regex
reverse
shingle
snowballStemming
spanishPluralStemming
stempel
stopword
trim
wordDelimiterGraph

The following sample index definitions and queries use the sample collection named minutes. To follow along with these examples, load the minutes collection on your cluster and navigate to the Create a Search Index page in the Atlas UI following the steps in the Create a MongoDB Search Index tutorial. Then, select the minutes collection as your data source, and follow the example procedure to create an index from the Atlas UI or using mongosh.

➤ Use the Select your language drop-down menu to set the language of the example on this page.

asciiFolding

The asciiFolding token filter converts alphabetic, numeric, and symbolic unicode characters that are not in the Basic Latin Unicode block to their ASCII equivalents, if available.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this token filter type. Value must be `asciiFolding`.
`originalTokens`	string	no	String that specifies whether to include or omit the original tokens in the output of the token filter. Value can be one of the following: `include` - include the original tokens with the converted tokens in the output of the token filter. We recommend this value if you want to support queries on both the original tokens as well as the converted forms. `omit` - omit the original tokens and include only the converted tokens in the output of the token filter. Use this value if you want to query only on the converted forms of the original tokens. Default: `omit`

Example

The following index definition indexes the page_updated_by.first_name field in the minutes collection using a custom analyzer named asciiConverter. The custom analyzer specifies the following:

Apply the standard tokenizer to create tokens based on word break rules.
Apply the asciiFolding token filter to convert the field values to their ASCII equivalent.

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type asciiConverter in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Expand Token Filters and click Add token filter.
Select asciiFolding from the dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.first_name field.
Select page_updated_by.first_name from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select asciiConverter from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following example:

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "page_updated_by": {
        "type": "document",
        "dynamic": false,
        "fields": {
          "first_name": {
            "type": "string",
            "analyzer": "asciiConverter"
          }
        }
      }
    }
  },
  "analyzers": [
    {
      "name": "asciiConverter",
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "asciiFolding"
        }
      ]
    }
  ]
}

1 db.minutes.createSearchIndex(
2   "default",
3   {
4     "mappings": {
5       "dynamic": false,
6       "fields": {
7         "page_updated_by": {
8           "type": "document",
9           "dynamic": false,
10           "fields": {
11             "first_name": {
12               "type": "string",
13               "analyzer": "asciiConverter"
14             }
15           }
16         }
17       }
18     },
19     "analyzers": [
20       {
21         "name": "asciiConverter",
22         "tokenizer": {
23           "type": "standard"
24         },
25         "tokenFilters": [
26           {
27             "type": "asciiFolding"
28           }
29         ]
30       }
31     ]
32   }
33 )

The following query searches the first_name field in the minutes collection for names using their ASCII equivalent.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.
Replace the default query with the following and click Find:

{
  "$search": {
    "index": "default",
    "text": {
      "query": "Sian",
      "path": "page_updated_by.first_name"
    }
  }
}

SCORE: 0.5472603440284729 _id:  "1"
  message: "try to siGn-In"
  page_updated_by: Object
    last_name: "AUERBACH"
    first_name: "Siân"
    email: "auerbach@example.com"
    phone: "(123)-456-7890"
  text: Object
    en_US: "<head> This page deals with department meetings.</head>"
    sv_FI: "Cette page traite des réunions de département"

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "text": {
        "query": "Sian",
        "path": "page_updated_by.first_name"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.last_name": 1,
      "page_updated_by.first_name": 1
    }
  }
])

[
 {
    _id: 1,
    page_updated_by: { last_name: 'AUERBACH', first_name: 'Siân'}
 }
]

MongoDB Search returns document with _id: 1 in the results because MongoDB Search created the following tokens (searchable terms) for the page_updated_by.first_name field in the document, which it then used to match to the query term Sian:

Field Name	Output Tokens
`page_updated_by.first_name`	`Sian`

daitchMokotoffSoundex

The daitchMokotoffSoundex token filter creates tokens for words that sound the same based on the Daitch-Mokotoff Soundex phonetic algorithm. This filter can generate multiple encodings for each input, where each encoded token is a 6 digit number.

Note

Don't use the daitchMokotoffSoundex token filter in:

Synonym or autocomplete type mapping definitions.
Operators where fuzzy is enabled. MongoDB Search supports the fuzzy option for the following operators:
- autocomplete
- text

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this token filter type. Value must be `daitchMokotoffSoundex`.
`originalTokens`	string	no	String that specifies whether to include or omit the original tokens in the output of the token filter. Value can be one of the following: `include` - include the original tokens with the encoded tokens in the output of the token filter. We recommend this value if you want queries on both the original tokens as well as the encoded forms. `omit` - omit the original tokens and include only the encoded tokens in the output of the token filter. Use this value if you want to only query on the encoded forms of the original tokens. Default: `include`

Example

The following index definition indexes the page_updated_by.last_name field in the minutes collection using a custom analyzer named dmsAnalyzer. The custom analyzer specifies the following:

Apply the standard tokenizer to create tokens based on word break rules.
Apply the daitchMokotoffSoundex token filter to encode the tokens for words that sound the same.

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type dmsAnalyzer in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Expand Token Filters and click Add token filter.
Select daitchMokotoffSoundex from the dropdown and select the value shown in the following table for the originalTokens field:
Field
Value
originalTokens
include
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.last_name field.
Select page_updated_by.last_name from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select dmsAnalyzer from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following example:

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "page_updated_by": {
        "type": "document",
        "dynamic": false,
        "fields": {
          "last_name": {
            "type": "string",
            "analyzer": "dmsAnalyzer"
          }
        }
      }
    }
  },
  "analyzers": [
    {
      "name": "dmsAnalyzer",
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "daitchMokotoffSoundex",
          "originalTokens": "include"
        }
      ]
    }
  ]
}

1 db.minutes.createSearchIndex(
2   "default",
3   {
4     "mappings": {
5       "dynamic": false,
6       "fields": {
7         "page_updated_by": {
8           "type": "document",
9           "dynamic": false,
10           "fields": {
11             "last_name": {
12               "type": "string",
13               "analyzer": "dmsAnalyzer"
14             }
15           }
16         }
17       }
18     },
19     "analyzers": [
20       {
21         "name": "dmsAnalyzer",
22         "tokenizer": {
23           "type": "standard"
24         },
25         "tokenFilters": [
26           {
27             "type": "daitchMokotoffSoundex",
28             "originalTokens": "include"
29           }
30         ]
31       }
32     ]
33   }
34 )

The following query searches for terms that sound similar to AUERBACH in the page_updated_by.last_name field of the minutes collection.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.
Replace the default query with the following and click Find:

{
  "$search": {
    "index": "default",
    "text": {
      "query": "AUERBACH",
      "path": "page_updated_by.last_name"
    }
  }
}

SCORE: 0.568153440952301  _id:  "1"
  message: "try to siGn-In"
  page_updated_by: Object
    last_name: "AUERBACH"
    first_name: "Siân"
    email: "auerbach@example.com"
    phone: "(123)-456-7890"
  text: Object
    en_US: "<head> This page deals with department meetings.</head>"
    sv_FI: "Den här sidan behandlar avdelningsmöten"
    fr_CA: "Cette page traite des réunions de département"
SCORE: 0.521163284778595  _id:  "2"
  message: "do not forget to SIGN-IN. See ① for details."
  page_updated_by: Object
    last_name: "OHRBACH"
    first_name: "Noël"
    email: "ohrbach@example.com"
    phone: "(123) 456 0987"
  text: Object
    en_US: "The head of the sales department spoke first."
    fa_IR: "ابتدا رئیس بخش فروش صحبت کرد"
    sv_FI: "Först talade chefen för försäljningsavdelningen"

db.minutes.aggregate([
  {
    "$search": {
     "index": "default",
     "text": {
        "query": "AUERBACH",
        "path": "page_updated_by.last_name"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.last_name": 1
    }
  }
])

[
 { "_id" : 1, "page_updated_by" : { "last_name" : "AUERBACH" } }
 { "_id" : 2, "page_updated_by" : { "last_name" : "OHRBACH" } }
]

MongoDB Search returns documents with _id: 1 and _id: 2 because the terms in both documents are phonetically similar, and are coded using the same six digit numbers (097400 and 097500). The following table shows the tokens (searchable terms and six digit encodings) that MongoDB Search creates for the documents in the results:

Document ID	Output Tokens
`"_id": 1`	`AUERBACH`, `097400`, `097500`
`"_id": 2`	`OHRBACH`, `097400`, `097500`

edgeGram

The edgeGram token filter tokenizes input from the left side, or "edge", of a text input into n-grams of configured sizes.

Note

Typically, token filters operate similarly to a pipeline, with each input token yielding no more than one output token that is then inputted into the subsequent token. The edgeGram token filter, by contrast, is a graph-producing filter that yields multiple output tokens from a single input token.

Because synonym and autocomplete field type mapping definitions only work when used with non-graph-producing token filters, you can't use the edgeGram token filter in synonym or autocomplete field type mapping definitions.

For querying with regex (MongoDB Search Operator) or wildcard Operator, you can't use edgeGram token filter as the searchAnalyzer as it produces more than one output token per input token. Specify a different analyzer as the searchAnalyzer in your index definition.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this token filter type. Value must be `edgeGram`.
`minGram`	integer	yes	Number that specifies the minimum length of generated n-grams. Value must be less than or equal to `maxGram`.
`maxGram`	integer	yes	Number that specifies the maximum length of generated n-grams. Value must be greater than or equal to `minGram`.
`termNotInBounds`	string	no	String that specifies whether to index tokens shorter than `minGram` or longer than `maxGram`. Accepted values are: `include` `omit` If `include` is specified, tokens shorter than `minGram` or longer than `maxGram` are indexed as-is. If `omit` is specified, those tokens are not indexed. Default: `omit`

Example

The following index definition indexes the title field in the minutes collection using a custom analyzer named titleAutocomplete. The custom analyzer specifies the following:

Apply the standard tokenizer to create tokens based on word break rules.
Apply the following filters on the tokens:
- icuFolding token filter to apply character foldings to the tokens.
- edgeGram token filter to create 4 to 7 character long tokens from the left side.

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type titleAutocomplete in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Expand Token Filters and click Add token filter.
Select icuFolding from the dropdown and click Add token filter to add the token filter to your custom analyzer.
Click Add token filter to add another token filter.
Select edgeGram from the dropdown and type the value shown in the following table for the fields:
Field
Value
minGram
4
maxGram
7
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the title field.
Select title from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select titleAutocomplete from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following example:

{
  "analyzer": "titleAutocomplete",
  "mappings": {
    "dynamic": false,
    "fields": {
      "title": {
        "type": "string",
        "analyzer": "titleAutocomplete"
      }
    }
  },
  "analyzers": [
    {
      "name": "titleAutocomplete",
      "charFilters": [],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "icuFolding"
        },
        {
          "type": "edgeGram",
          "minGram": 4,
          "maxGram": 7
        }
      ]
    }
  ]
}

1 db.minutes.createSearchIndex(
2   "default",
3   {
4     "analyzer": "titleAutocomplete",
5     "mappings": {
6       "dynamic": false,
7       "fields": {
8         "title": {
9           "type": "string",
10           "analyzer": "titleAutocomplete"
11         }
12       }
13     },
14     "analyzers": [
15       {
16         "name": "titleAutocomplete",
17         "charFilters": [],
18         "tokenizer": {
19           "type": "standard"
20         },
21         "tokenFilters": [
22           {
23             "type": "icuFolding"
24           },
25           {
26             "type": "edgeGram",
27             "minGram": 4,
28             "maxGram": 7
29           }
30         ]
31       }
32     ]
33   }
34 )

The following query searches the title field of the minutes collection for terms that begin with mee, followed by any number of other characters.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.
Replace the default query with the following and click Find:

{
  "$search": {
    "wildcard": {
      "query": "mee*",
      "path": "title",
      "allowAnalyzedField": true
    }
  }
}

SCORE: 1  _id:  "1"
  message: "try to siGn-In"
  page_updated_by: Object
    last_name: "AUERBACH"
    first_name: "Siân"
    email: "auerbach@example.com"
    phone: "(123)-456-7890"
  text: Object
    en_US: "<head> This page deals with department meetings.</head>"
    sv_FI: "Den här sidan behandlar avdelningsmöten"
    fr_CA: "Cette page traite des réunions de département"
SCORE: 1  _id:  "3"
  message: "try to sign-in"
  page_updated_by: Object
    last_name: "LEWINSKY"
    first_name: "Brièle"
    email: "lewinsky@example.com"
    phone: "(123).456.9870"
  text: Object
    en_US: "<body>We'll head out to the conference room by noon.</body>"

db.minutes.aggregate([
  {
    "$search": {
      "wildcard": {
        "query": "mee*",
        "path": "title",
        "allowAnalyzedField": true
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "title": 1
    }
  }
])

[
  { _id: 1, title: 'The team's weekly meeting' },
  { _id: 3, title: 'The regular board meeting' }
]

MongoDB Search returns documents with _id: 1 and _id: 3 because the documents contain the term meeting, which matches the query criteria. Specifically, MongoDB Search creates the following 4 to 7 character tokens (searchable terms) for the documents in the results, which it then matches to the query term mee*:

Document ID	Output Tokens
`"_id": 1`	`team`, `team'`, `team's`, `week`, `weekl`, `weekly`, `meet`, `meeti`, `meetin`, `meeting`
`"_id": 3`	`regu`, `regul`, `regula`, `regular`, `boar`, `board`, `meet`, `meeti`, `meetin`, `meeting`

englishPossessive

The englishPossessive token filter removes possessives (trailing 's) from words.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this token filter type. Value must be `englishPossessive`.

Example

The following index definition indexes the title field in the minutes collection using a custom analyzer named englishPossessiveStemmer. The custom analyzer specifies the following:

Apply the standard tokenizer to create tokens (search terms) based on word break rules.
Apply the englishPossessive token filter to remove possessives (trailing 's) from the tokens.

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type englishPossessiveStemmer in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Expand Token Filters and click Add token filter.
Select englishPossessive from the dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the title field.
Select title from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select englishPossessiveStemmer from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following example:

{
  "mappings": {
    "fields": {
      "title": {
        "type": "string",
        "analyzer": "englishPossessiveStemmer"
      }
    }
  },
  "analyzers": [
    {
      "name": "englishPossessiveStemmer",
      "charFilters": [],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "englishPossessive"
        }
      ]
    }
  ]
}

1 db.minutes.createSearchIndex(
2   "default",
3   {
4     "mappings": {
5       "fields": {
6         "title": {
7           "type": "string",
8           "analyzer": "englishPossessiveStemmer"
9         }
10       }
11     },
12     "analyzers": [
13       {
14         "name": "englishPossessiveStemmer",
15         "charFilters": [],
16         "tokenizer": {
17           "type": "standard"
18         },
19         "tokenFilters": [
20           {
21             "type": "englishPossessive"
22           }
23         ]
24       }
25     ]
26   }
27 )

The following query searches the title field in the minutes collection for the term team.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.
Replace the default query with the following and click Find:

{
  "$search": {
    "index": "default",
    "text": {
      "query": "team",
      "path": "title"
    }
  }
}

SCORE: 0.34314215183258057  _id:  "1"
  message: "try to siGn-In"
  page_updated_by: Object
  text: Object
SCORE: 0.29123833775520325  _id:  "2"
  message: "do not forget to SIGN-IN. See ① for details."
  page_updated_by: Object
  text: Object

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "text": {
        "query": "team",
        "path": "title"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "title": 1
    }
  }
])

[
  {
    _id: 1,
    title: 'The team's weekly meeting'
  },
  {
    _id: 2,
    title: 'The check-in with sales team'
  }
]

MongoDB Search returns results that contain the term team in the title field. MongoDB Search returns the document with _id: 1 because MongoDB Search transforms team's in the title field to the token team during analysis. Specifically, MongoDB Search creates the following tokens (searchable terms) for the documents in the results, which it then matches to the query term:

Document ID	Output Tokens
`"_id": 1`	`The`, `team`, `weekly`, `meeting`
`"_id": 2`	`The`, `check`, `in`, `with`, `sales`, `team`

flattenGraph

The flattenGraph token filter transforms a token filter graph into a flat form suitable for indexing. If you use the wordDelimiterGraph token filter, use this filter after the wordDelimiterGraph token filter.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this token filter type. Value must be `flattenGraph`.

Example

The following index definition indexes the message field in the minutes collection using a custom analyzer called wordDelimiterGraphFlatten. The custom analyzer specifies the following:

Apply the whitespace tokenizer to create tokens based on occurrences of whitespace between words.
Apply the following filters to the tokens:
- wordDelimiterGraph token filter to split tokens based on sub-words, generate tokens for the original words, and also protect the word SIGN_IN from delimination.
- flattenGraph token filter to flatten the tokens to a flat form.

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type wordDelimiterGraphFlatten in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select whitespace from the dropdown.
Expand Token Filters and click Add token filter.
Select wordDelimiterGraph from the dropdown and configure the following fields for the token filter.
1. Select the following fields:
  Field
  Value
  delimiterOptions.generateWordParts
  true
  delimiterOptions.preserveOriginal
  true
2. Type SIGN_IN in the protectedWords.words field.
3. Select protectedWords.ignoreCase.
Click Add token filter to add the token filter to your custom analyzer.
Click Add token filter to add another token filter.
Select flattenGraph from the dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.
Select message from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select wordDelimiterGraphFlatten from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following example:

{
  "mappings": {
    "fields": {
      "message": {
        "type": "string",
        "analyzer": "wordDelimiterGraphFlatten"
      }
    }
  },
  "analyzers": [
    {
      "name": "wordDelimiterGraphFlatten",
      "charFilters": [],
      "tokenizer": {
        "type": "whitespace"
      },
      "tokenFilters": [
        {
          "type": "wordDelimiterGraph",
          "delimiterOptions" : {
            "generateWordParts" : true,
            "preserveOriginal" : true
          },
          "protectedWords": {
            "words": [
              "SIGN_IN"
            ],
            "ignoreCase": false
          }
        },
        {
          "type": "flattenGraph"
        }
      ]
    }
  ]
}

1 db.minutes.createSearchIndex(
2   "default",
3   {
4     "mappings": {
5       "fields": {
6         "message": {
7           "type": "string",
8           "analyzer": "wordDelimiterGraphFlatten"
9         }
10       }
11     },
12     "analyzers": [
13       {
14         "name": "wordDelimiterGraphFlatten",
15         "charFilters": [],
16         "tokenizer": {
17           "type": "whitespace"
18         },
19         "tokenFilters": [
20           {
21             "type": "wordDelimiterGraph",
22             "delimiterOptions": {
23               "generateWordParts": true,
24               "preserveOriginal": true
25             },
26             "protectedWords": {
27               "words": [
28                 "SIGN_IN"
29               ],
30               "ignoreCase": false
31             }
32           },
33           {
34             "type": "flattenGraph"
35           }
36         ]
37       }
38     ]
39   }
40 )

The following query searches the message field in the minutes collection for the term sign.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.
Replace the default query with the following and click Find:

{
  "$search": {
    "index": "default",
    "text": {
      "query": "sign",
      "path": "message"
    }
  }
}

SCORE: 0.6763891577720642 _id:  "3"
  message: "try to sign-in"
  page_updated_by: Object
  text: Object

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "text": {
        "query": "sign",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1
    }
  }
])

[
  {
    _id: 3,
    message: 'try to sign-in'
  }
]

MongoDB Search returns the document with _id: 3 in the results for the query term sign even though the document contains the hyphenated term sign-in in the title field. The wordDelimiterGraph token filter creates a token filter graph and the flattenGraph token filter transforms the token filter graph into a flat form suitable for indexing. Specifically, MongoDB Search creates the following tokens (searchable terms) for the document in the results, which it then matches to the query term sign:

Document ID	Output Tokens
`_id: 3`	`try`, `to`, `sign-in`, `sign`, `in`

icuFolding

The icuFolding token filter applies character folding from Unicode Technical Report #30 such as accent removal, case folding, canonical duplicates folding, and many others detailed in the report.

Attributes

It has the following attribute:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this token filter type. Value must be `icuFolding`.

Example

The following index definition indexes the text.sv_FI field in the minutes collection using a custom analyzer named diacriticFolder. The custom analyzer specifies the following:

Apply the keyword tokenizer to tokenize all the terms in the string field as a single term.
Use the icuFolding token filter to apply foldings such as accent removal, case folding, canonical duplicates folding, and so on.

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type diacriticFolder in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select keyword from the dropdown.
Expand Token Filters and click Add token filter.
Select icuFolding from the dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the text.sv_FI nested field.
Select text.sv_FI nested from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select diacriticFolder from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following example:

{
  "analyzer": "diacriticFolder",
  "mappings": {
    "fields": {
      "text": {
        "type": "document",
        "fields": {
          "sv_FI": {
            "analyzer": "diacriticFolder",
            "type": "string"
          }
        }
      }
    }
  },
  "analyzers": [
    {
      "name": "diacriticFolder",
      "charFilters": [],
      "tokenizer": {
        "type": "keyword"
      },
      "tokenFilters": [
        {
          "type": "icuFolding"
        }
      ]
    }
  ]
}

1 db.minutes.createSearchIndex(
2   "default",
3   {
4     "analyzer": "diacriticFolder",
5     "mappings": {
6       "fields": {
7         "text": {
8           "type": "document",
9           "fields": {
10             "sv_FI": {
11               "analyzer": "diacriticFolder",
12               "type": "string"
13             }
14           }
15         }
16       }
17     },
18     "analyzers": [
19       {
20         "name": "diacriticFolder",
21         "charFilters": [],
22         "tokenizer": {
23           "type": "keyword"
24         },
25         "tokenFilters": [
26           {
27             "type": "icuFolding"
28           }
29         ]
30       }
31     ]
32   }
33 )

The following query uses the the wildcard operator to search the text.sv_FI field in the minutes collection for all terms that contain the term avdelning, preceded and followed by any number of other characters.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.

Replace the default query with the following and click Find:

{
  "$search": {
    "index": "default",
    "wildcard": {
      "query": "*avdelning*",
      "path": "text.sv_FI",
      "allowAnalyzedField": true
    }
  }
}

SCORE: 1  _id:  "1"
 message: "try to siGn-In"
 page_updated_by: Object
 text: Object
   en_US: "<head> This page deals with department meetings.</head>"
   sv_FI: "Den här sidan behandlar avdelningsmöten"
   fr_CA: "Cette page traite des réunions de département"
SCORE: 1  _id:  "2"
 message: "do not forget to SIGN-IN. See ① for details."
 page_updated_by: Object
 text: Object
   en_US: "The head of the sales department spoke first."
   fa_IR: "ابتدا رئیس بخش فروش صحبت کرد"
   sv_FI: "Först talade chefen för försäljningsavdelningen"

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "wildcard": {
        "query": "*avdelning*",
        "path": "text.sv_FI",
        "allowAnalyzedField": true
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "text.sv_FI": 1
    }
  }
])

[
  {
    _id: 1,
    text: { sv_FI: 'Den här sidan behandlar avdelningsmöten' }
  },
  {
    _id: 2,
    text: { sv_FI: 'Först talade chefen för försäljningsavdelningen' }
  }
]

MongoDB Search returns the document with _id: 1 and _id: 2 in the results because the documents contain the query term avdelning followed by other characters in the document with _id: 1 and preceded and followed by other characters in the document with _id: 2. Specifically, MongoDB Search creates the following tokens for the documents in the results, which it then matches to the query term *avdelning*.

Document ID	Output Tokens
`_id: 1`	`den har sidan behandlar avdelningsmoten`
`_id: 2`	`forst talade chefen for forsaljningsavdelningen`

icuNormalizer

The icuNormalizer token filter normalizes tokens using a standard Unicode Normalization Mode.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this token filter type. Value must be `icuNormalizer`.
`normalizationForm`	string	no	Normalization form to apply. Accepted values are: `nfd` (Canonical Decomposition) `nfc` (Canonical Decomposition, followed by Canonical Composition) `nfkd` (Compatibility Decomposition) `nfkc` (Compatibility Decomposition, followed by Canonical Composition) To learn more about the supported normalization forms, see Section 1.2: Normalization Forms, UTR#15. Default: `nfc`

Example

The following index definition indexes the message field in the minutes collection using a custom analyzer named textNormalizer. The custom analyzer specifies the following:

Use the whitespace tokenizer to create tokens based on occurrences of whitespace between words.
Use the icuNormalizer token filter to normalize tokens by Compatibility Decomposition, followed by Canonical Composition.

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type textNormalizer in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select whitespace from the dropdown.
Expand Token Filters and click Add token filter.
Select icuNormalizer from the dropdown and select nfkc from the normalizationForm dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.
Select message from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select textNormalizer from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following example:

{
  "analyzer": "textNormalizer",
  "mappings": {
    "fields": {
      "message": {
        "type": "string",
        "analyzer": "textNormalizer"
      }
    }
  },
  "analyzers": [
    {
      "name": "textNormalizer",
      "charFilters": [],
      "tokenizer": {
        "type": "whitespace"
      },
      "tokenFilters": [
        {
          "type": "icuNormalizer",
          "normalizationForm": "nfkc"
        }
      ]
    }
  ]
}

1 db.minutes.createSearchIndex(
2   "default",
3   {
4     "analyzer": "textNormalizer",
5     "mappings": {
6       "fields": {
7         "message": {
8           "type": "string",
9           "analyzer": "textNormalizer"
10         }
11       }
12     },
13     "analyzers": [
14       {
15         "name": "textNormalizer",
16         "charFilters": [],
17         "tokenizer": {
18           "type": "whitespace"
19         },
20         "tokenFilters": [
21           {
22             "type": "icuNormalizer",
23             "normalizationForm": "nfkc"
24           }
25         ]
26       }
27     ]
28   }
29 )

The following query searches the message field in the minutes collection for the term 1.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.
Replace the default query with the following and click Find:
{ "$search": { "index": "default", "text": { "query": "1", "path": "message" } } }
SCORE: 0.4342196583747864 _id: "2" message: "do not forget to SIGN-IN. See ① for details." page_updated_by: Object text: Object

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "text": {
        "query": "1",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1
    }
  }
])

[ { _id: 2, message: 'do not forget to SIGN-IN. See ① for details.' } ]

MongoDB Search returns the document with _id: 2 in the results for the query term 1 even though the document contains the circled number ① in the message field because the icuNormalizer token filter creates the token 1 for this character using the nfkc normalization form. The following table shows the tokens (searchable terms) that MongoDB Search creates for the document in the results using the nfkc normalization form and by comparison, the tokens it creates for the other normalization forms.

Normalization Forms	Output Tokens	Matches `1`
`nfd`	`do`, `not`, `forget`, `to`, `SIGN-IN.`, `See`, `①`, `for`, `details.`	X
`nfc`	`do`, `not`, `forget`, `to`, `SIGN-IN.`, `See`, `①`, `for`, `details.`	X
`nfkd`	`do`, `not`, `forget`, `to`, `SIGN-IN.`, `See`, `1`, `for`, `details.`	√
`nfkc`	`do`, `not`, `forget`, `to`, `SIGN-IN.`, `See`, `1`, `for`, `details.`	√

kStemming

The kStemming token filter combines algorithmic stemming with a built-in dictionary for the english language to stem words. It expects lowercase text and doesn't modify uppercase text.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this token filter type. Value must be `kStemming`.

Example

The following index definition indexes the text.en_US field in the minutes collection using a custom analyzer named kStemmer. The custom analyzer specifies the following:

Apply the standard tokenizer to create tokens based on word break rules.
Apply the following filters on the tokens:
- lowercase token filter to convert the tokens to lowercase.
- kStemming token filter to stem words using a combination of algorithmic stemming and a built-in dictionary for the english language.

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type kStemmer in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Expand Token Filters and click Add token filter.
Select lowercase from the dropdown and click Add token filter to add the token filter to your custom analyzer.
Click Add token filter to add another token filter.
Select kStemming from the dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the text.en_US nested field.
Select text.en_US nested from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select kStemmer from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following example:

{
  "analyzer": "kStemmer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "kStemmer",
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        },
        {
          "type": "kStemming"
        }
      ]
    }
  ]
}

1 db.minutes.createSearchIndex(
2   "default",
3   {
4     "analyzer": "kStemmer",
5     "mappings": {
6       "dynamic": true
7     },
8     "analyzers": [
9       {
10         "name": "kStemmer",
11         "tokenizer": {
12           "type": "standard"
13         },
14         "tokenFilters": [
15           {
16             "type": "lowercase"
17           },
18           {
19             "type": "kStemming"
20           }
21         ]
22       }
23     ]
24   }
25 )

The following query searches the text.en_US field in the minutes collection for the term Meeting.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.

Replace the default query with the following and click Find:

{
  "$search": {
    "index": "default",
    "text": {
      "query": "Meeting",
      "path": "text.en_US"
    }
  }
}

SCORE: 0.5960260629653931 _id:  "1"
 message: "try to siGn-In"
 page_updated_by: Object
 text: Object
   en_US: "<head> This page deals with department meetings.</head>"
   sv_FI: "Den här sidan behandlar avdelningsmöten"
   fr_CA: "Cette page traite des réunions de département"

db.minutes.aggregate([
 {
    "$search": {
      "index": "default",
      "text": {
        "query": "Meeting",
        "path": "text.en_US"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "text.en_US": 1
    }
  }
])

[
  {
    _id: 1,
    text: {
      en_US: '<head> This page deals with department meetings. </head>'
    }
  }
]

MongoDB Search returns the document with _id: 1, which contains the plural term meetings in lowercase. MongoDB Search matches the query term to the document because the lowercase token filter normalizes token text to lowercase and the kStemming token filter lets MongoDB Search match the plural meetings in the text.en_US field of the document to the singular query term. MongoDB Search also analyzes the query term using the index analyzer (or if specified, using the searchAnalyzer). Specifically, MongoDB Search creates the following tokens (searchable terms) for the document in the results, which it then uses to match to the query term:

head, this, page, deal, with, department, meeting, head

length

The length token filter removes tokens that are too short or too long.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this token filter type. Value must be `length`.
`min`	integer	no	Number that specifies the minimum length of a token. Value must be less than or equal to `max`. Default: `0`
`max`	integer	no	Number that specifies the maximum length of a token. Value must be greater than or equal to `min`. Default: `255`

Example

The following index definition indexes the text.sv_FI field in the minutes collection using a custom analyzer named longOnly. The custom analyzer specifies the following:

Use the standard tokenizer to create tokens based on word break rules.
Apply the following filters on the tokens:
- icuFolding token filter to apply character foldings.
- length token filter to index only tokens that are at least 20 UTF-16 code units long after tokenizing.

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type longOnly in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Expand Token Filters and click Add token filter.
Select icuFolding from the dropdown and click Add token filter to add the token filter to your custom analyzer.
Click Add token filter to add another token filter.
Select length from the dropdown and configure the following field for the token filter:
Field
Value
min
20
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the text.sv.FI nested field.
Select text.sv.FI nested from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select longOnly from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following example:

{
  "mappings": {
    "fields": {
      "text": {
        "type": "document",
        "dynamic": true,
        "fields": {
          "sv_FI": {
            "type": "string",
            "analyzer": "longOnly"
          }
        }
      }
    }
  },
  "analyzers": [
    {
      "name": "longOnly",
      "charFilters": [],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "icuFolding"
        },
        {
          "type": "length",
          "min": 20
        }
      ]
    }
  ]
}

1 db.minutes.createSearchIndex(
2   "default",
3   {
4     "mappings": {
5       "fields": {
6         "text": {
7           "type": "document",
8           "dynamic": true,
9           "fields": {
10             "sv_FI": {
11               "type": "string",
12               "analyzer": "longOnly"
13             }
14           }
15         }
16       }
17     },
18     "analyzers": [
19       {
20         "name": "longOnly",
21         "charFilters": [],
22         "tokenizer": {
23           "type": "standard"
24         },
25         "tokenFilters": [
26           {
27             "type": "icuFolding"
28           },
29           {
30             "type": "length",
31             "min": 20
32           }
33         ]
34       }
35     ]
36   }
37 )

The following query searches the text.sv_FI field in the minutes collection for the term forsaljningsavdelningen.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.

Replace the default query with the following and click Find:

{
  "$search": {
    "index": "default",
    "text": {
      "query": "forsaljningsavdelningen",
      "path": "text.sv_FI"
    }
  }
}

SCORE: 0.13076457381248474  _id:  "2"
 message: "do not forget to SIGN-IN. See ① for details."
 page_updated_by: Object
 text: Object
   en_US: "The head of the sales department spoke first."
   fa_IR: "ابتدا رئیس بخش فروش صحبت کرد"
   sv_FI: "Först talade chefen för försäljningsavdelningen"

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "text": {
        "query": "forsaljningsavdelningen",
        "path": "text.sv_FI"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "text.sv_FI": 1
    }
  }
])

[
  {
    _id: 2,
    text: {
      sv_FI: 'Först talade chefen för försäljningsavdelningen'
    }
  }
]

MongoDB Search returns the document with _id: 2, which contains the term försäljningsavdelningen. MongoDB Search matches the document to the query term because the term has more than 20 characters. Additionally, although the query term forsaljningsavdelningen doesn't include the diacritic characters, MongoDB Search matches the query term to the document by folding the diacritics in the original term in the document. Specifically, MongoDB Search creates the following tokens (searchable terms) for the document with _id: 2.

forsaljningsavdelningen

MongoDB Search won't return any results for a search for any other term in the text.sv_FI field in the collection because all other terms in the field have less than 20 characters.

lowercase

The lowercase token filter normalizes token text to lowercase.

Attributes

It has the following attribute:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this token filter type. Value must be `lowercase`.

Examples

The following example index definition indexes the title field in the minutes collection as type autocomplete with the nGram tokenization strategy. It applies a custom analyzer named keywordLowerer on the title field. The custom analyzer specifies the following:

Apply keyword tokenizer to create a single token for a string or array of strings.
Apply the lowercase token filter to convert token text to lowercase.

In the Custom Analyzers section, click Add Custom Analyzer.
Choose Create Your Own radio button and click Next.
Type keywordLowerer in the Analyzer Name field.
Expand Tokenizer if it's collapsed and select the keyword from the dropdown.
Expand Token Filters and click Add token filter.
Select lowercase from the dropdown and click Add token filter to add the token filter to your custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the title field.
Select title from the Field Name dropdown and Autocomplete from the Data Type dropdown.
In the properties section for the data type, select the following values from the dropdown for the property:
Property Name
Value
Analyzer
keywordLowerer
Tokenization
nGram
Click Add, then Save Changes.

Replace the default index definition with the following:

{
  "mappings": {
    "fields": {
      "title": {
        "analyzer": "keywordLowerer",
        "tokenization": "nGram",
        "type": "autocomplete"
      }
    }
  },
  "analyzers": [
    {
      "name": "keywordLowerer",
      "charFilters": [],
      "tokenizer": {
        "type": "keyword"
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        }
      ]
    }
  ]
}

1 db.minutes.createSearchIndex(
2   "default",
3   {
4     "mappings": {
5       "fields": {
6         "title": {
7           "analyzer": "keywordLowerer",
8           "tokenization": "nGram",
9           "type": "autocomplete"
10         }
11       }
12     },
13     "analyzers": [
14       {
15         "name": "keywordLowerer",
16         "charFilters": [],
17         "tokenizer": {
18           "type": "keyword"
19         },
20         "tokenFilters": [
21           {
22             "type": "lowercase"
23           }
24         ]
25       }
26     ]
27   }
28 )

The following query searches the title field using the autocomplete operator for the characters standup.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.
Replace the default query with the following and click Find:
{ "$search": { "index": "default", "autocomplete": { "query": "standup", "path": "title" } } }
SCORE: 0.9239386320114136 _id: “4” message: "write down your signature or phone №" page_updated_by: Object text: Object

db.minutes.aggregate([
 {
    "$search": {
      "index": "default",
      "autocomplete": {
        "query": "standup",
        "path": "title"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "title": 1
    }
  }
])

[ { _id: 4, title: 'The daily huddle on tHe StandUpApp2' } ]

MongoDB Search returns the document with _id: 4 in the results because the document contains the query term standup. MongoDB Search creates tokens for the title field using the keyword tokenizer, lowercase token filter, and the nGram tokenization strategy for the autocomplete type. Specifically, MongoDB Search uses the keyword tokenizer to tokenize the entire string as a single token, which supports only exact matches on the entire string, and then it uses the lowercase token filter to convert the tokens to lowercase. For the document in the results, MongoDB Search creates the following token using the custom analyzer:

Document ID	Output Tokens
`_id: 4`	`the daily huddle on the standupapp2`

After applying the custom analyzer, MongoDB Search creates further tokens of n-grams because MongoDB Search indexes the title field as the autocomplete type as specified in the index definition. MongoDB Search uses the tokens of n-grams, which includes a token for standup, to match the document to the query term standup.

The following index definition indexes the message field in the minutes collection using a custom analyzer named lowerCaser. The custom analyzer specifies the following:

Apply standard tokenizer to create tokens based on word break rules.
Apply the following filters on the tokens:
- icuNormalizer to normalize the tokens using a standard Unicode Normalization Mode.
- lowercase token filter to convert token text to lowercase.

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type lowerCaser in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Expand Token Filters and click Add token filter.
Select icuNormalizer from the dropdown and then select nfkd from the normalizationForm dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add token filter to add another token filter.
Select lowercase from the dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.
Select message from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select lowerCaser from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

{
   "mappings": {
   "fields": {
       "message": {
       "type": "string",
       "analyzer": "lowerCaser"
       }
   }
   },
   "analyzers": [
   {
       "name": "lowerCaser",
       "charFilters": [],
       "tokenizer": {
       "type": "standard"
       },
       "tokenFilters": [
       {
           "type": "icuNormalizer",
           "normalizationForm": "nfkd"
       },
       {
           "type": "lowercase"
       }
       ]
   }
   ]
}

1 db.minutes.createSearchIndex(
2   "default",
3   {
4     "mappings": {
5       "fields": {
6         "message": {
7           "type": "string",
8           "analyzer": "lowerCaser"
9         }
10       }
11     },
12     "analyzers": [
13       {
14         "name": "lowerCaser",
15         "charFilters": [],
16         "tokenizer": {
17           "type": "standard"
18         },
19         "tokenFilters": [
20           {
21             "type": "icuNormalizer",
22             "normalizationForm": "nfkd"
23           },
24           {
25             "type": "lowercase"
26           }
27         ]
28       }
29     ]
30   }
31 )

The following query searches the message field for the term sign-in.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.

Replace the default query with the following and click Find:

{
  "$search": {
    "index": "default",
    "text": {
      "query": "sign-in",
      "path": "message"
    }
  }
}

SCORE: 0.37036222219467163  _id:  "1"
  message: "try to siGn-In"
  page_updated_by: Object
 text: Object
SCORE: 0.37036222219467163  _id:  "3"
  message: "try to sign-in"
  page_updated_by: Object
  text: Object
SCORE: 0.2633555233478546   _id:  "2"
  message: "do not forget to SIGN-IN. See ① for details."
  page_updated_by: Object
  text: Object

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "text": {
        "query": "sign-in",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1
    }
  }
])

[
  { _id: 1, message: 'try to siGn-In' },
  { _id: 3, message: 'try to sign-in' },
  { _id: 2, message: 'do not forget to SIGN-IN. See ① for details.' }
]

MongoDB Search returns the documents with _id: 1, _id: 3, and _id: 2 in the results for the query term sign-in because the icuNormalizer tokenizer first creates separate tokens by splitting the text, including the hyphenated word, but retains the original letter case in the document and then the lowercase token filter converts the tokens to lowercase. MongoDB Search also analyzes the query term using the index analyzer (or if specified, using the searchAnalyzer) to split the query term and match it to the document.

Normalization Forms	Output Tokens
`_id: 1`	`try`, `to`, `sign`, `in`
`_id: 3`	`try`, `to`, `sign`, `in`
`_id: 2`	`do`, `not`, `forget`, `to`, `sign`, `in`, `see`, `for`, `details`

nGram

The nGram token filter tokenizes input into n-grams of configured sizes. You can't use the nGram token filter in synonym or autocomplete mapping definitions.

Note

For querying with regex (MongoDB Search Operator) or wildcard Operator, you can't use ngram token filter as the searchAnalyzer as it produces more than one output token per input token. Specify a different analyzer as the searchAnalyzer in your index definition.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this token filter type. Value must be `nGram`.
`minGram`	integer	yes	Number that specifies the minimum length of generated n-grams. Value must be less than or equal to `maxGram`.
`maxGram`	integer	yes	Number that specifies the maximum length of generated n-grams. Value must be greater than or equal to `minGram`.
`termNotInBounds`	string	no	String that specifies whether to index tokens shorter than `minGram` or longer than `maxGram`. Accepted values are: `include` `omit` If `include` is specified, tokens shorter than `minGram` or longer than `maxGram` are indexed as-is. If `omit` is specified, those tokens are not indexed. Default: `omit`

Example

The following index definition indexes the title field in the minutes collection using the custom analyzer named titleAutocomplete. It specifies the Keyword Analyzer as the searchAnalyzer. The custom analyzer function specifies the following:

Apply the standard tokenizer to create tokens based on the word break rules.
Apply a series of token filters on the tokens:
- englishPossessive to remove possessives (trailing 's) from words.
- nGram to tokenize words into 4 to 7 characters in length.

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type titleAutocomplete in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Expand Token Filters and click Add token filter.
Select englishPossessive from the dropdown and click Add token filter to add the token filter to your custom analyzer.
Click Add token filter to add another token filter.
Select nGram from the dropdown and configure the following fields for the token filter:
Field
Value
minGram
4
maxGram
7
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the title field.
Select title from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select titleAutocomplete from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following example:

{
 "mappings": {
    "fields": {
      "title": {
        "type": "string",
        "analyzer": "titleAutocomplete",
        "searchAnalyzer": "lucene.keyword"
      }
    }
  },
  "analyzers": [
    {
      "name": "titleAutocomplete",
      "charFilters": [],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "englishPossessive"
        },
        {
          "type": "nGram",
          "minGram": 4,
          "maxGram": 7
        }
      ]
    }
  ]
}

1 db.minutes.createSearchIndex(
2   "default",
3   {
4     "mappings": {
5       "fields": {
6         "title": {
7           "type": "string",
8           "analyzer": "titleAutocomplete",
9           "searchAnalyzer": "lucene.keyword"
10         }
11       }
12     },
13     "analyzers": [
14       {
15         "name": "titleAutocomplete",
16         "charFilters": [],
17         "tokenizer": {
18           "type": "standard"
19         },
20         "tokenFilters": [
21           {
22             "type": "englishPossessive"
23           },
24           {
25             "type": "nGram",
26             "minGram": 4,
27             "maxGram": 7
28           }
29         ]
30       }
31     ]
32   }
33 )

The following query uses the wildcard operator to search the title field in the minutes collection for the term meet followed by any number of other characters after the term.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.

Replace the default query with the following and click Find:

{
  "$search": {
    "index": "default",
    "wildcard": {
      "query": "meet*",
      "path": "title",
      "allowAnalyzedField": true
    }
  }
}

SCORE: 1  _id:  "1"
  message: "try to siGn-In"
  page_updated_by: Object
  text: Object
  title: "The team's weekly meeting"
SCORE: 1  _id:  "3"
  message: "try to sign-in"
  page_updated_by: Object
  text: Object
  title: "The regular board meeting"

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "wildcard": {
        "query": "meet*",
        "path": "title",
        "allowAnalyzedField": true
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "title": 1
    }
  }
])

[
  { _id: 1, title: 'The team's weekly meeting' },
  { _id: 3, title: 'The regular board meeting' }
]

MongoDB Search returns the documents with _id: 1 and _id: 3 because the documents contain the term meeting, which MongoDB Search matches to the query criteria meet* by creating the following tokens (searchable terms).

Normalization Forms	Output Tokens
`_id: 2`	`team`, `week`, `weekl`, `weekly`, `eekl`, `eekly`, `ekly`, `meet`, `meeti`, `meetin`, `meeting`, `eeti`, `eeti`, `eeting`, `etin`, `eting`, `ting`
`_id: 3`	`regu`, `regul`, `regula`, `regular`, `egul`, `egula`, `egular`, `gula`, `gular`, `ular`, `boar`, `board`, `oard`, `meet`, `meeti`, `meetin`, `meeting`, `eeti`, `eeti`, `eeting`, `etin`, `eting`, `ting`

Note

MongoDB Search doesn't create tokens for terms less than 4 characters (such as the) and greater than 7 characters because the termNotInBounds parameter is set to omit by default. If you set the value for termNotInBounds parameter to include, MongoDB Search would create tokens for the term the also.

porterStemming

The porterStemming token filter uses the porter stemming algorithm to remove the common morphological and inflectional suffixes from words in English. It expects lowercase text and doesn't work as expected for uppercase text.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this token filter type. Value must be `porterStemming`.

Example

The following index definition indexes the title field in the minutes collection using a custom analyzer named porterStemmer. The custom analyzer specifies the following:

Apply the standard tokenizer to create tokens based on word break rules.
Apply the following token filters on the tokens:
- lowercase token filter to convert the words to lowercase.
- porterStemming token filter to remove the common morphological and inflectional suffixes from the words.

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type porterStemmer in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Expand Token Filters and click Add token filter.
Select lowercase from the dropdown and click Add token filter to add the token filter to your custom analyzer.
Click Add token filter to add another token filter.
Select porterStemming from the dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the title field.
Select title from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select porterStemmer from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following example:

{
  "mappings": {
    "fields": {
      "title": {
        "type": "string",
        "analyzer": "porterStemmer"
      }
    }
  },
  "analyzers": [
    {
      "name": "porterStemmer",
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        },
        {
          "type": "porterStemming"
        }
      ]
    }
  ]
}

db.minutes.createSearchIndex("default", {
  "mappings": {
    "fields": {
      "title": {
        "type": "string",
        "analyzer": "porterStemmer"
      }
    }
  },
  "analyzers": [
    {
      "name": "porterStemmer",
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        },
        {
          "type": "porterStemming"
        }
      ]
    }
  ]
})

The following query searches the title field in the minutes collection for the term Meet.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.

Replace the default query with the following and click Find:

{
  "$search": {
    "index": "default",
    "text": {
      "query": "Meet",
      "path": "title"
    }
  }
}

SCORE: 0.34314215183258057  _id:  “1”
 message: "try to siGn-In"
 page_updated_by: Object
 text: Object
SCORE: 0.34314215183258057  _id:  “3”
 message: "try to sign-in"
 page_updated_by: Object
 text: Object

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "text": {
        "query": "Meet",
        "path": "title"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "title": 1
    }
  }
])

[
  {
    _id: 1,
    title: 'The team's weekly meeting'
  },
  {
    _id: 3,
    title: 'The regular board meeting'
  }
]

MongoDB Search returns the documents with _id: 1 and _id: 3 because the lowercase token filter normalizes token text to lowercase and then the porterStemming token filter stems the morphological suffix from the meeting token to create the meet token, which MongoDB Search matches to the query term Meet. Specifically, MongoDB Search creates the following tokens (searchable terms) for the documents in the results, which it then matches to the query term Meet:

Normalization Forms	Output Tokens
`_id: 1`	`the`, `team'`, `weekli`, `meet`
`_id: 3`	`the`, `regular`, `board`, `meet`

regex

The regex token filter applies a regular expression with Java regex syntax to each token, replacing matches with a specified string.

Attributes

It has the following attributes:

Name

Type

Required?

Description

type

string

yes

Human-readable label that identifies this token filter. Value must be regex.

pattern

string

yes

Regular expression pattern to apply to each token.

replacement

string

yes

Replacement string to substitute wherever a matching pattern occurs.

If you specify an empty string ("") to ignore or delete a token, MongoDB Search creates a token with an empty string instead. To delete tokens with empty strings, use the stopword token filter after the regex token filter. For example:

"analyzers": [
  {
    "name": "custom.analyzer.name",
    "charFilters": [],
    "tokenizer": {
      "type": "whitespace"
    },
    "tokenFilters": [
      {
        "matches": "all",
        "pattern": "^(?!\\$)\\w+",
        "replacement": "",
        "type": "regex"
      },
      {
        "type": "stopword",
        "tokens": [""]
      }
    ]
  }
]

matches

string

yes

Acceptable values are:

all
first

If matches is set to all, replace all matching patterns. Otherwise, replace only the first matching pattern.

Example

The following index definition indexes the page_updated_by.email field in the minutes collection using a custom analyzer named emailRedact. The custom analyzer specifies the following:

Apply the keyword tokenizer to index all words in the field value as a single term.
Apply the following token filters on the tokens:
- lowercase token filter to turn uppercase characters in the tokens to lowercase.
- regex token filter to find strings that look like email addresses in the tokens and replace them with the word redacted.

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type emailRedact in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select keyword from the dropdown.
Expand Token Filters and click Add token filter.
Select lowercase from the dropdown and click Add token filter to add the token filter to your custom analyzer.
Click Add token filter to add another token filter.
Select regex from the dropdown and configure the following for the token filter:
1. Type ^([a-z0-9_\\.-]+)@([\\da-z\\.-]+)\\.([a-z\\.]{2,5})$ in the pattern field.
2. Type redacted in the replacement field.
3. Select all from the matches dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.email nested field.
Select page_updated_by.email nested from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select emailRedact from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following example:

{
  "analyzer": "lucene.standard",
  "mappings": {
    "dynamic": false,
    "fields": {
      "page_updated_by": {
        "type": "document",
        "fields": {
          "email": {
            "type": "string",
            "analyzer": "emailRedact"
          }
        }
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "emailRedact",
      "tokenizer": {
        "type": "keyword"
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        },
        {
          "matches": "all",
          "pattern": "^([a-z0-9_\\.-]+)@([\\da-z\\.-]+)\\.([a-z\\.]{2,5})$",
          "replacement": "redacted",
          "type": "regex"
        }
      ]
    }
  ]
}

db.minutes.createSearchIndex("default", {
  "analyzer": "lucene.standard",
  "mappings": {
    "dynamic": false,
    "fields": {
      "page_updated_by": {
        "type": "document",
        "fields": {
          "email": {
            "type": "string",
            "analyzer": "emailRedact"
          }
        }
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "emailRedact",
      "tokenizer": {
        "type": "keyword"
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        },
        {
          "matches": "all",
          "pattern": "^([a-z0-9_\\.-]+)@([\\da-z\\.-]+)\\.([a-z\\.]{2,5})$",
          "replacement": "redacted",
          "type": "regex"
        }
      ]
    }
  ]
})

The following query searches the page_updated_by.email field in the minutes collection using the wildcard operator for the term example.com preceded by any number of other characters.

1 db.minutes.aggregate([
2   {
3     "$search": {
4       "index": "default",
5       "wildcard": {
6         "query": "*example.com",
7         "path": "page_updated_by.email",
8         "allowAnalyzedField": true
9       }
10     }
11   },
12   {
13     "$project": {
14       "_id": 1,
15       "page_updated_by.email": 1
16     }
17   }
18 ])

MongoDB Search doesn't return any results for the query although the page_updated_by.email field contains the word example.com in the email addresses. MongoDB Search tokenizes strings that match the regular expression provided in the custom analyzer with the word redacted and so, MongoDB Search doesn't match the query term to any document.

reverse

The reverse token filter reverses each string token.

Attributes

It has the following attribute:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this token filter. Value must be `reverse`.

Example

The following index definition indexes the page_updated_by.email fields in the minutes collection using a custom analyzer named keywordReverse. The custom analyzer specifies the following:

Apply the keyword tokenizer to tokenize entire strings as single terms.
Apply the reverse token filter to reverse the string tokens.

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type reverseAnalyzer in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Expand Token Filters and click Add token filter.
Select reverse from the dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the title field.
Select title from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select reverseAnalyzer from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following example:

{
  "analyzer": "keywordReverse",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "keywordReverse",
      "charFilters": [],
      "tokenizer": {
        "type": "keyword"
      },
      "tokenFilters": [
        {
          "type": "reverse"
        }
      ]
    }
  ]
}

db.minutes.createSearchIndex("default", {
  "analyzer": "keywordReverse",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "keywordReverse",
      "charFilters": [],
      "tokenizer": {
        "type": "keyword"
      },
      "tokenFilters": [
        {
          "type": "reverse"
        }
      ]
    }
  ]
})

The following query searches the page_updated_by.email field in the minutes collection using the wildcard operator to match any characters preceding the characters @example.com in reverse order. The reverse token filter can speed up leading wildcard queries.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.

Replace the default query with the following and click Find:

{
  "$search": {
    "wildcard": {
      "query": "*@example.com",
      "path": "page_updated_by.email",
      "allowAnalyzedField": true
    }
  }
}

SCORE: 1 _id:  "1"
 message: "try to siGn-In"
 page_updated_by: Object
   last_name: "AUERBACH"
   first_name: "Siân"
   email: "auerbach@example.com"
   phone: "(123)-456-7890"
 text: Object
SCORE: 1  _id:  "2"
 message: "do not forget to SIGN-IN. See ① for details."
 page_updated_by: Object
   last_name: "OHRBACH"
   first_name: "Noël"
   email: "ohrbach@example.com"
   phone: "(123) 456 0987"
 text: Object
SCORE: 1  _id:  "3"
 message: "try to sign-in"
 page_updated_by: Object
   last_name: "LEWINSKY"
   first_name: "Brièle"
   email: "lewinsky@example.com"
   phone: "(123).456.9870"
 text: Object
SCORE: 1  _id:  "4"
 message: "write down your signature or phone №"
 page_updated_by: Object
   last_name: "LEVINSKI"
   first_name: "François"
   email: "levinski@example.com"
   phone: "123-456-8907"
 text: Object

.. io-code-block::
   :copyable: true
   
   .. input:: 
      :language:  json
      db.minutes.aggregate([
        {
          "$search": {
            "index": "default",
            "wildcard": {
              "query": "*@example.com",
              "path": "page_updated_by.email",
              "allowAnalyzedField": true
            }
          }
        },
        {
          "$project": {
            "_id": 1,
            "page_updated_by.email": 1,
          }
        }
      ])
   
   .. output:: 
      :language:  json
         
      [
        { _id: 1, page_updated_by: { email: 'auerbach@example.com' } },
        { _id: 2, page_updated_by: { email: 'ohrback@example.com' } },
        { _id: 3, page_updated_by: { email: 'lewinsky@example.com' } },
        { _id: 4, page_updated_by: { email: 'levinski@example.com' } }
      ]

For the preceding query, MongoDB Search applies the custom analyzer to the wildcard query to transform the query as follows:

moc.elpmaxe@*

MongoDB Search then runs the query against the indexed tokens, which are also reversed. Specifically, MongoDB Search creates the following tokens (searchable terms) for the documents in the results, which it then matches to the query term moc.elpmaxe@*:

Normalization Forms	Output Tokens
`_id: 1`	`moc.elpmaxe@hcabreua`
`_id: 2`	`moc.elpmaxe@kcabrho`
`_id: 3`	`moc.elpmaxe@yksniwel`
`_id: 4`	`moc.elpmaxe@iksnivel`

shingle

The shingle token filter constructs shingles (token n-grams) from a series of tokens. You can't use the shingle token filter in synonym or autocomplete mapping definitions.

Note

For querying with regex (MongoDB Search Operator) or wildcard Operator, you can't use shingle token filter as the searchAnalyzer as it produces more than one output token per input token. Specify a different analyzer as the searchAnalyzer in your index definition.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this token filter type. Value must be `shingle`.
`minShingleSize`	integer	yes	Minimum number of tokens per shingle. Must be greater than or equal to `2` and less than or equal to `maxShingleSize`.
`maxShingleSize`	integer	yes	Maximum number of tokens per shingle. Must be greater than or equal to `minShingleSize`.

Example

The following index definition example on the page_updated_by.email field in the minutes collection uses two custom analyzers, emailAutocompleteIndex and emailAutocompleteSearch, to implement autocomplete-like functionality. MongoDB Search uses the emailAutocompleteIndex analyzer during index creation to:

Replace @ characters in a field with AT
Create tokens with the whitespace tokenizer
Shingle tokens
Create edgeGram of those shingled tokens

MongoDB Search uses the emailAutocompleteSearch analyzer during a search to:

Replace @ characters in a field with AT
Create tokens with the whitespace tokenizer tokenizer

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type emailAutocompleteIndex in the Analyzer Name field.
Expand Character Filters and click Add character filter.
Select mapping from the dropdown and click Add mapping.
Enter the following key and value:
Key
Value
@
AT
Click Add character filter to add the character filter to your custom analyzer.
Expand Tokenizer if it's collapsed.
Select whitespace from the dropdown and enter 15 in the maxTokenLength field.
Expand Token Filters and click Add token filter.
Select shingle from the dropdown and configure the following fields.
Field
Field Value
minShingleSize
2
minShingleSize
3
Click Add token filter to add another token filter.
Click Add token filter to add another token filter.
Select edgeGram from the dropdown and configure the following fields for the token filter:
Field
Field Value
minGram
2
maxGram
15
Click Add token filter to add the token filter to your custom analyzer.
Click Add to add the custom analyzer to your index.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type emailAutocompleteSearch in the Analyzer Name field.
Expand Character Filters and click Add character filter.
Select mapping from the dropdown and click Add mapping.
Enter the following key and value:
Key
Value
@
AT
Click Add character filter to add the character filter to your custom analyzer.
Expand Tokenizer if it's collapsed.
Select whitespace from the dropdown and enter 15 in the maxTokenLength field.
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.email nested field.
Select page_updated_by.email nested from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select emailAutocompleteIndex from the Index Analyzer dropdown and emailAutocompleteSearch from the Search Analyzer dropdown.
Click Add, then Save Changes.

Replace the default index definition with the following example:

1 {
2   "analyzer": "lucene.keyword",
3   "mappings": {
4     "dynamic": true,
5     "fields": {
6       "page_updated_by": {
7         "type": "document",
8         "fields": {
9           "email": {
10             "type": "string",
11             "analyzer": "emailAutocompleteIndex",
12             "searchAnalyzer": "emailAutocompleteSearch"
13           }
14         }
15       }
16     }
17   },
18   "analyzers": [
19     {
20       "name": "emailAutocompleteIndex",
21       "charFilters": [
22         {
23           "mappings": {
24             "@": "AT"
25           },
26           "type": "mapping"
27         }
28       ],
29       "tokenizer": {
30         "maxTokenLength": 15,
31         "type": "whitespace"
32       },
33       "tokenFilters": [
34         {
35           "maxShingleSize": 3,
36            **** "minShingleSize": 2,
37           "type": "shingle"
38         },
39         {
40           "maxGram": 15,
41           "minGram": 2,
42           "type": "edgeGram"
43         }
44       ]
45     },
46     {
47       "name": "emailAutocompleteSearch",
48       "charFilters": [
49         {
50           "mappings": {
51             "@": "AT"
52           },
53           "type": "mapping"
54         }
55       ],
56       "tokenizer": {
57         "maxTokenLength": 15,
58         "type": "whitespace"
59       }
60     }
61   ]
62 }

db.minutes.createSearchIndex("default", {
  "analyzer": "lucene.keyword",
  "mappings": {
    "dynamic": true,
    "fields": {
      "page_updated_by": {
        "type": "document",
        "fields": {
          "email": {
            "type": "string",
            "analyzer": "emailAutocompleteIndex",
            "searchAnalyzer": "emailAutocompleteSearch"
          }
        }
      }
    }
  },
  "analyzers": [
    {
      "name": "emailAutocompleteIndex",
      "charFilters": [
        {
          "mappings": {
            "@": "AT"
          },
          "type": "mapping"
        }
      ],
      "tokenizer": {
        "maxTokenLength": 15,
        "type": "whitespace"
      },
      "tokenFilters": [
        {
          "maxShingleSize": 3,
          "minShingleSize": 2,
          "type": "shingle"
        },
        {
          "maxGram": 15,
          "minGram": 2,
          "type": "edgeGram"
        }
      ]
    },
    {
      "name": "emailAutocompleteSearch",
      "charFilters": [
        {
          "mappings": {
            "@": "AT"
          },
          "type": "mapping"
        }
      ],
      "tokenizer": {
        "maxTokenLength": 15,
        "type": "whitespace"
      }
    }
  ]
})

The following query searches for an email address in the page_updated_by.email field of the minutes collection:

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.

Replace the default query with the following and click Find:

{
  "$search": {
    "index": "default",
    "text": {
      "query": "auerbach@ex",
      "path": "page_updated_by.email"
    }
  }
}

SCORE: 0.8824931383132935  _id:  "1"
 message: "try to siGn-In"
 page_updated_by: Object
   last_name: "AUERBACH"
   first_name: "Siân"
   email: "auerbach@example.com"
   phone: "(123)-456-7890"
 text: Object

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "text": {
        "query": "auerbach@ex",
        "path": "page_updated_by.email"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "text.es_MX": 1
    }
  }
])

[ { _id: 1, page_updated_by: { email: 'auerbach@example.com' } } ]

MongoDB Search creates search tokens using the emailAutocompleteSearch analyzer, which it then matches to the index tokens that it created using the emailAutocompleteIndex analyzer. The following table shows the search and index tokens (up to 15 characters) that MongoDB Search creates:

Search Tokens	Index Tokens
`auerbachATexamp`	`au`, `aue`, `auer`, `auerb`, `auerba`, `auerbac`, `auerbach`, `auerbachA`, `auerbachAT`, `auerbachATe`, `auerbachATex`, `auerbachATexa`, `auerbachATexam`, `auerbachATexamp`

snowballStemming

The snowballStemming token filters Stems tokens using a Snowball-generated stemmer.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this token filter type. Value must be `snowballStemming`.
`stemmerName`	string	yes	The following values are valid: `arabic` `armenian` `basque` `catalan` `danish` `dutch` `english` `estonian` `finnish` `french` `german` `german2` (outdated) `hungarian` `irish` `italian` `lithuanian` `norwegian` `porter` (The original Porter English stemming algorithm.) `portuguese` `romanian` `russian` `spanish` `swedish` `turkish`

Example

The following index definition indexes the text.fr_CA field in the minutes collection using a custom analyzer named frenchStemmer. The custom analyzer specifies the following:

Apply the standard tokenizer to create tokens based on word break rules.
Apply the following token filters on the tokens:
- lowercase token filter to convert the tokens to lowercase.
- french variant of the snowballStemming token filter to stem words.

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type frenchStemmer in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Expand Token Filters and click Add token filter.
Select lowercase from the dropdown and click Add token filter to add the token filter to your custom analyzer.
Click Add token filter to add another token filter.
Select snowballStemming from the dropdown and then select french from the stemmerName dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the text.fr_CA nested field.
Select text.fr_CA nested from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select frenchStemmer from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following example:

{
  "mappings": {
    "fields": {
      "text": {
        "type": "document",
        "fields": {
          "fr_CA": {
            "type": "string",
            "analyzer": "frenchStemmer"
          }
        }
      }
    }
  },
  "analyzers": [
    {
      "name": "frenchStemmer",
      "charFilters": [],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        },
        {
          "type": "snowballStemming",
          "stemmerName": "french"
        }
      ]
    }
  ]
}

db.minutes.createSearchIndex("default", {
  "mappings": {
    "fields": {
      "text": {
        "type": "document",
        "fields": {
          "fr_CA": {
            "type": "string",
            "analyzer": "frenchStemmer"
          }
        }
      }
    }
  },
  "analyzers": [
    {
      "name": "frenchStemmer",
      "charFilters": [],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        },
        {
          "type": "snowballStemming",
          "stemmerName": "french"
        }
      ]
    }
  ]
})

The following query searches the text.fr_CA field in the minutes collection for the term réunion.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.

Replace the default query with the following and click Find:

{
  "$search": {
    "index": "default",
    "text": {
      "query": "réunion",
      "path": "text.fr_CA"
    }
  }
}

SCORE: 0.13076457381248474 _id:  "1"
 message: "try to siGn-In"
 page_updated_by: Object
 text: Object
   en_US: "<head> This page deals with department meetings.</head>"
   sv_FI: "Den här sidan behandlar avdelningsmöten"
   fr_CA: "Cette page traite des réunions de département"

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "text": {
        "query": "réunion",
        "path": "text.fr_CA"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "text.fr_CA": 1
    }
  }
])

[
  {
    _id: 1,
    text: {
      fr_CA: 'Cette page traite des réunions de département'
    }
  }
]

MongoDB Search returns document with _id: 1 in the results. MongoDB Search matches the query term to the document because it creates the following tokens for the document, which it then used to match to the query term réunion:

Document ID	Output Tokens
`_id: 1`	`cet`, `pag`, `trait`, `de`, `réunion`, `de`, `départ`

spanishPluralStemming

The spanishPluralStemming token filter stems spanish plural words. It expects lowercase text.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this token filter type. Value must be `spanishPluralStemming`.

Example

The following index definition indexes the text.es_MX field in the minutes collection using a custom analyzer named spanishPluralStemmer. The custom analyzer specifies the following:

Apply the standard tokenizer to create tokens based on word break rules.
Apply the following token filters on the tokens:
- lowercase token filter to convert spanish terms to lowercase.
- spanishPluralStemming token filter to stem plural spanish words in the tokens into their singular form.

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type spanishPluralStemmer in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Expand Token Filters and click Add token filter.
Select lowercase from the dropdown and click Add token filter to add the token filter to your custom analyzer.
Click Add token filter to add another token filter.
Select spanishPluralStemming from the dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the text.es_MX nested field.
Select text.es_MX nested from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select spanishPluralStemmer from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following example:

{
  "analyzer": "spanishPluralStemmer",
  "mappings": {
    "fields": {
      "text: {
        "type": "document",
        "fields": {
          "es_MX": {
            "analyzer": "spanishPluralStemmer",
            "searchAnalyzer": "spanishPluralStemmer",
            "type": "string"
          }
        }
      }
    }
  },
  "analyzers": [
    {
      "name": "spanishPluralStemmer",
      "charFilters": [],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        },
        {
          "type": "spanishPluralStemming"
        }
      ]
    }
  ]
}

db.minutes.createSearchIndex("default", {
  "analyzer": "spanishPluralStemmer", 
  "mappings": {
    "fields": {
      "text": { 
        "type": "document",
        "fields": {
          "es_MX": {
            "analyzer": "spanishPluralStemmer",
            "searchAnalyzer": "spanishPluralStemmer",
            "type": "string"
          }
        }
      }
    }
  },
  "analyzers": [
    {
      "name": "spanishPluralStemmer",
      "charFilters": [],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        },
        {
          "type": "spanishPluralStemming"
        }
      ]
    }
  ]
})

The following query searches the text.es_MX field in the minutes collection for the spanish term punto.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.

Replace the default query with the following and click Find:

{
  "$search": {
    "index": "default",
    "text": {
      "query": "punto",
      "path": "text.es_MX"
    }
  }
}

 SCORE: 0.13076457381248474  _id:  "4"
   message: "write down your signature or phone №"
   page_updated_by: Object
   text: Object
     en_US: "<body>This page has been updated with the items on the agenda.</body>"
     es_MX: "La página ha sido actualizada con los puntos de la agenda."
     pl_PL: "Strona została zaktualizowana o punkty porządku obrad."

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "text": {
        "query": "punto",
        "path": "text.es_MX"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "text.es_MX": 1
    }
  }
])

[
  {
    _id: 4,
    text : {
      es_MX: 'La página ha sido actualizada con los puntos de la agenda.',
    }
  }
]

MongoDB Search returns the document with _id: 4 because the text.es_MX field in the document contains the plural term puntos. MongoDB Search matches this document for the query term punto because MongoDB Search analyzes puntos as punto by stemming the plural (s) from the term. Specifically, MongoDB Search creates the following tokens (searchable terms) for the document in the results, which it then uses to match to the query term:

Document ID	Output Tokens
`_id: 4`	`la`, `pagina`, `ha`, `sido`, `actualizada`, `con`, `los`, `punto`, `de`, `la`, `agenda`

stempel

The stempel token filter uses Lucene's default Polish stemmer table to stem words in the Polish language. It expects lowercase text.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this token filter type. Value must be `stempel`.

Example

The following index definition indexes the text.pl_PL field in the minutes collection using a custom analyzer named stempelStemmer. The custom analyzer specifies the following:

Apply the standard tokenizer to create tokens based on word break rules.
Apply the following filters on the tokens:
- lowercase token filter to convert the words to lowercase.
- stempel token filter to stem the Polish words.

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type stempelAnalyzer in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Expand Token Filters and click Add token filter.
Select lowercase from the dropdown and click Add token filter to add the token filter to your custom analyzer.
Click Add token filter to add another token filter.
Select spanishPluralStemming from the dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the title field.
Select title from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select stempelAnalyzer from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following example:

{
  "mappings": {
    "fields": {
      "title": {
        "type": "string",
        "analyzer": "stempelAnalyzer"
      }
    }
  },
  "analyzers": [
    {
      "name": "stempelAnalyzer",
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "stempel"
        }
      ]
    }
  ]
}

db.minutes.createSearchIndex("default", {
  "analyzer": "stempelStemmer", 
  "mappings": {
    "dynamic": true,
    "fields": {
      "text.pl_PL": {
        "analyzer": "stempelStemmer",
        "searchAnalyzer": "stempelStemmer",
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "name": "stempelStemmer",
      "charFilters": [],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        },
        {
          "type": "stempel"
        }
      ]
    }
  ]
})

The following query searches the text.pl_PL field in the minutes collection for the Polish term punkt.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.
Replace the default query with the following and click Find:
{ "$search": { "index": "default", "text": { "query": "punkt", "path": "text.pl_PL" } } }
SCORE: 0.5376965999603271 _id: "4" text: Object pl_PL: "Strona została zaktualizowana o punkty porządku obrad."

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "text": {
        "query": "punkt",
        "path": "text.pl_PL"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "text.pl_PL": 1
    }
  }
])

[
  {
    _id: 4,
    text: {
      pl_PL: 'Strona została zaktualizowana o punkty porządku obrad.'
    }
  }
]

MongoDB Search returns the document with _id: 4 because the text.pl_PL field in the document contains the plural term punkty. MongoDB Search matches this document for the query term punkt because MongoDB Search analyzes punkty as punkt by stemming the plural (y) from the term. Specifically, MongoDB Search creates the following tokens (searchable terms) for the document in the results, which it then matches to the query term:

Document ID	Output Tokens
`_id: 4`	`strona`, `zostać`, `zaktualizować`, `o`, `punkt`, `porządek`, `obrada`

stopword

The stopword token filter removes tokens that correspond to the specified stop words. This token filter doesn't analyze the specified stop words.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this token filter type. Value must be `stopword`.
`tokens`	array of strings	yes	List that contains the stop words that correspond to the tokens to remove. Value must be one or more stop words.
`ignoreCase`	boolean	no	Flag that indicates whether to ignore the case of stop words when filtering the tokens to remove. The value can be one of the following: `true` - ignore case and remove all tokens that match the specified stop words `false` - be case-sensitive and remove only tokens that exactly match the specified case Default: `true`

Example

The following index definition indexes the title field in the minutes collection using a custom analyzer named stopwordRemover. The custom analyzer specifies the following:

Apply the whitespace tokenizer to create tokens based on occurrences of whitespace between words.
Apply the stopword token filter to remove the tokens that match the defined stop words is, the, and at. The token filter is case-insensitive and will remove all tokens that match the specified stopwords.

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type stopwordRemover in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select whitespace from the dropdown.
Expand Token Filters and click Add token filter.
Select stopword from the dropdown and type the following in the tokens field:
is, the, at
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the text.en_US nested field.
Select text.en_US nested from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select stopwordRemover from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following example:

{
  "mappings": {
    "fields": {
      "text": {
        "type" : "document",
        "fields": {
          "en_US": {
            "type": "string",
            "analyzer": "stopwordRemover"
          }
        }
      }
    }
  },
  "analyzers": [
    {
      "name": "stopwordRemover",
      "charFilters": [],
      "tokenizer": {
        "type": "whitespace"
      },
      "tokenFilters": [
        {
          "type": "stopword",
          "tokens": ["is", "the", "at"]
        }
      ]
    }
  ]
}

1 db.minutes.createSearchIndex(
2   "default",
3   {
4     "mappings": {
5       "fields": {
6         "text": {
7           "type": "document",
8           "fields": {
9             "en_US": {
10               "type": "string",
11               "analyzer": "stopwordRemover"
12             }
13           }
14         }
15       }
16     },
17     "analyzers": [
18       {
19         "name": "stopwordRemover",
20         "charFilters": [],
21         "tokenizer": {
22           "type": "whitespace"
23         },
24         "tokenFilters": [
25           {
26             "type": "stopword",
27             "tokens": ["is", "the", "at"]
28           }
29         ]
30       }
31     ]
32   }
33 )

The following query searches for the phrase head of the sales in the text.en_US field in the minutes collection.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.
Replace the default query with the following and click Find:
{ "$search": { "phrase": { "query": "head of the sales", "path": "text.en_US" } } }
SCORE: 1.5351942777633667 _id: "2" message: "do not forget to SIGN-IN. See ① for details." page_updated_by: Object text: Object

1 db.minutes.aggregate([
2   {
3     "$search": {
4       "phrase": {
5         "query": "head of the sales",
6         "path": "text.en_US"
7       }
8     }
9   },
10   {
11     "$project": {
12       "_id": 1,
13       "text.en_US": 1
14     }
15   }
16 ])

[
  {
    _id: 2,
    text: { en_US: 'The head of the sales department spoke first.' }
  }
]

MongoDB Search returns the document with _id: 2 because the en_US field contains the query term. MongoDB Search doesn't create tokens for the stopword the in the document during analysis, but is still able to match it to the query term because for string fields, it also analyzes the query term using the index analyzer (or if specified, using the searchAnalyzer) and removes the stopword from the query term, which allows MongoDB Search to match the query term to the document. Specifically, MongoDB Search creates the following tokens for the document in the results:

Document ID	Output Tokens
`_id: 2`	`head`, `of`, `sales`, `department`, `spoke`, `first.`

trim

The trim token filter trims leading and trailing whitespace from tokens.

Attributes

It has the following attribute:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this token filter type. Value must be `trim`.

Example

The following index definition indexes the text.en_US in the the minutes collection using a custom analyzer named tokenTrimmer. The custom analyzer specifies the following:

Apply the htmlStrip character filter to remove all HTML tags from the text except the a tag.
Apply the keyword tokenizer to create a single token for the entire string.
Apply the trim token filter to remove leading and trailing whitespace in the tokens.

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type tokenTrimmer in the Analyzer Name field.
Expand Character Filters and click Add character filter.
Select htmlStrip from the dropdown and type a in the ignoredTags field.
Click Add character filter to add the character filter to your custom analyzer.
Expand Tokenizer if it's collapsed.
Select keyword from the dropdown.
Expand Token Filters and click Add token filter.
Select trim from the dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the text.en_US nested field.
Select text.en_US nested from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select tokenTrimmer from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following example:

{
  "mappings": {
    "fields": {
      "text": {
        "type": "document",
        "fields": {
          "en_US": {
            "type": "string",
            "analyzer": "tokenTrimmer"
          }
        }
      }
    }
  },
  "analyzers": [
    {
      "name": "tokenTrimmer",
      "charFilters": [{
        "type": "htmlStrip",
        "ignoredTags": ["a"]
      }],
      "tokenizer": {
        "type": "keyword"
      },
      "tokenFilters": [
        {
          "type": "trim"
        }
      ]
    }
  ]
}

db.minutes.createSearchIndex("default", {
  "mappings": {
    "fields": {
      "text": {
        "type": "document",
        "fields": {
          "en_US": {
            "type": "string",
            "analyzer": "tokenTrimmer" 
          }
        }
      }
    }
  },
  "analyzers": [
    {
      "name": "tokenTrimmer",
      "charFilters": [{
        "type": "htmlStrip",
        "ignoredTags": ["a"]
      }],
      "tokenizer": {
        "type": "keyword"
      },
      "tokenFilters": [
        {
          "type": "trim"
        }
      ]
    }
  ]
})

The following query searches for the phrase *department meetings* preceded and followed by any number of other characters in the text.en_US field in the minutes collection.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.

Replace the default query with the following and click Find:

{
  "$search": {
    "wildcard": {
      "query": "*department meetings*",
      "path": "text.en_US",
      "allowAnalyzedField": true
    }
  }
}

SCORE: 1  _id:  "1"
message: "try to siGn-In"
page_updated_by: Object
text: Object
  en_US: "<head> This page deals with department meetings.</head>"
  sv_FI: "Den här sidan behandlar avdelningsmöten"
  fr_CA: "Cette page traite des réunions de département"

1 db.minutes.aggregate([
2   {
3     "$search": {
4       "wildcard": {
5         "query": "*department meetings*",
6         "path": "text.en_US",
7         "allowAnalyzedField": true
8       }
9     }
10   },
11   {
12     "$project": {
13       "_id": 1,
14       "text.en_US": 1
15     }
16   }
17 ])

1 [
2   {
3     _id: 1,
4     text: { en_US: '<head> This page deals with department meetings. </head>' }
5   }
6 ]

MongoDB Search returns the document with _id: 1 because the en_US field contains the query term department meetings. MongoDB Search creates the following token for the document in the results, which shows that MongoDB Search removed the HTML tags, created a single token for the entire string, and removed leading and trailing whitespaces in the token:

Document ID	Output Tokens
`_id: 1`	`This page deals with department meetings.`

wordDelimiterGraph

The wordDelimiterGraph token filter splits tokens into sub-tokens based on configured rules. We recommend that you don't use this token filter with the standard tokenizer because this tokenizer removes many of the intra-word delimiters that this token filter uses to determine boundaries.

Note

For querying with regex (MongoDB Search Operator) or wildcard Operator, you can't use wordDelimiterGraph token filter as the searchAnalyzer as it produces more than one output token per input token. Specify a different analyzer as the searchAnalyzer in your index definition.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this token filter type. Value must be `wordDelimiterGraph`.
`delimiterOptions`	object	no	Object that contains the rules that determine how to split words into sub-words. Default: `{}`
`delimiterOptions` `.generateWordParts`	boolean	no	Flag that indicates whether to split tokens based on sub-words. For example, if `true`, this option splits `PowerShot` into `Power` and `Shot`. Default: `true`
`delimiterOptions` `.generateNumberParts`	boolean	no	Flag that indicates whether to split tokens based on sub-numbers. For example, if `true`, this option splits `100-2` into `100` and `2`. Default: `true`
`delimiterOptions` `.concatenateWords`	boolean	no	Flag that indicates whether to concatenate runs of sub-words. For example, if `true`, this option concatenates `wi-fi` into `wifi`. Default: `false`
`delimiterOptions` `.concatenateNumbers`	boolean	no	Flag that indicates whether to concatenate runs of sub-numbers. For example, if `true`, this option concatenates `100-2` into `1002`. Default: `false`
`delimiterOptions` `.concatenateAll`	boolean	no	Flag that indicates whether to concatenate all runs. For example, if `true`, this option concatenates `wi-fi-100-2` into `wifi1002`. Default: `false`
`delimiterOptions` `.preserveOriginal`	boolean	no	Flag that indicates whether to generate tokens of the original words. Default: `true`
`delimiterOptions` `.splitOnCaseChange`	boolean	no	Flag that indicates whether to split tokens based on letter-case transitions. For example, if `true`, this option splits `camelCase` into `camel` and `Case`. Default: `true`
`delimiterOptions` `.splitOnNumerics`	boolean	no	Flag that indicates whether to split tokens based on letter-number transitions. For example, if `true`, this option splits `g2g` into `g`, `2`, and `g`. Default: `true`
`delimiterOptions` `.stemEnglishPossessive`	boolean	no	Flag that indicates whether to remove trailing possessives from each sub-word. For example, if `true`, this option changes `who's` into `who`. Default: `true`
`delimiterOptions` `.ignoreKeywords`	boolean	no	Flag that indicates whether to skip tokens with the `keyword` attribute set to `true`. Default: `false`
`protectedWords`	object	no	Object that contains options for protected words. Default: `{}`
`protectedWords` `.words`	array	conditional	List that contains the tokens to protect from delimination. If you specify `protectedWords`, you must specify this option.
`protectedWords` `.ignoreCase`	boolean	no	Flag that indicates whether to ignore case sensisitivity for protected words. Default: `true`

If true, apply the flattenGraph token filter after this option to make the token stream suitable for indexing.

Example

The following index definition indexes the title field in the minutes collection using a custom analyzer named wordDelimiterGraphAnalyzer. The custom analyzer specifies the following:

Apply the whitespace tokenizer to create tokens based on occurrences of whitespace between words.
Apply the wordDelimiterGraph token filter for the following:
- Don't try and split is, the, and at. The exclusion is case sensitive. For example Is and tHe are not excluded.
- Split tokens on case changes and remove tokens that contain only alphabetical letters from the English alphabet.

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type wordDelimiterGraphAnalyzer in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select whitespace from the dropdown.
Expand Token Filters and click Add token filter.
Select lowercase from the dropdown and click Add token filter to add the token filter to your custom analyzer.
Click Add token filter to add another token filter.
Select wordDelimiterGraph from the dropdown and configure the following fields:
1. Deselect delimiterOptions.generateWordParts and select delimiterOptions.splitOnCaseChange.
2. Type and then select from the dropdown the words is, the, and at, one at a time, in the protectedWords.words field.
3. Deselect protectedWords.ignoreCase.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the title nested field.
Select title nested from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select wordDelimiterGraphAnalyzer from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following example:

{
  "mappings": {
    "fields": {
      "title": {
        "type": "string",
        "analyzer": "wordDelimiterGraphAnalyzer"
      }
    }
  },
  "analyzers": [
    {
      "name": "wordDelimiterGraphAnalyzer",
      "charFilters": [],
      "tokenizer": {
        "type": "whitespace"
      },
      "tokenFilters": [
        {
          "type": "wordDelimiterGraph",
          "protectedWords": {
            "words": ["is", "the", "at"],
            "ignoreCase": false
          },
          "delimiterOptions" : {
            "generateWordParts" : false,
            "splitOnCaseChange" : true
          }
        }
      ]
    }
  ]
}

db.minutes.createSearchIndex("default", {
  "mappings": {
    "fields": {
      "title": {
        "type": "string",
        "analyzer": "wordDelimiterGraphAnalyzer"
      }
    }
  },
  "analyzers": [
    {
      "name": "wordDelimiterGraphAnalyzer",
      "charFilters": [],
      "tokenizer": {
        "type": "whitespace"
      },
      "tokenFilters": [
        {
          "type": "wordDelimiterGraph",
          "protectedWords": {
            "words": ["is", "the", "at"],
            "ignoreCase": false
          },
          "delimiterOptions" : {
            "generateWordParts" : false,
            "splitOnCaseChange" : true
          }
        }
      ]
    }
  ]
})

The following query searches the title field in the minutes collection for the term App2.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.
Replace the default query with the following and click Find:
{ "$search": { "index": "default", "text": { "query": "App2", "path": "title" } } }
SCORE: 0.5104123950004578 _id: "4" message: "write down your signature or phone №" page_updated_by: Object text: Object

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "text": {
        "query": "App2",
        "path": "title"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "title": 1
    }
  }
])

[
  {
    _id: 4,
    title: 'The daily huddle on tHe StandUpApp2'
  }
]

MongoDB Search returns the document with _id: 4 because the title field in the document contains App2. MongoDB Search splits tokens on case changes and removes tokens created by a split that contain only alphabetical letters. It also analyzes the query term using the index analyzer (or if specified, using the searchAnalyzer) to split the word on case change and remove the letters preceding 2. Specifically, MongoDB Search creates the following tokens for the document with _id : 4 for the protectedWords and delimiterOptions options:

`wordDelimiterGraph` Options	Output Tokens
`protectedWords`	`The`, `daily`, `huddle`, `on`, `t`, `He`, `Stand`, `Up`, `App`, `2`
`delimiterOptions`	`The`, `daily`, `huddle`, `on`, `2`

Back

Tokenizers

Field Mappings

1	db.minutes.createSearchIndex(
2	"default",
3	{
4	"mappings": {
5	"dynamic": false,
6	"fields": {
7	"page_updated_by": {
8	"type": "document",
9	"dynamic": false,
10	"fields": {
11	"first_name": {
12	"type": "string",
13	"analyzer": "asciiConverter"
14	}
15	}
16	}
17	}
18	},
19	"analyzers": [
20	{
21	"name": "asciiConverter",
22	"tokenizer": {
23	"type": "standard"
24	},
25	"tokenFilters": [
26	{
27	"type": "asciiFolding"
28	}
29	]
30	}
31	]
32	}
33	)

1	db.minutes.createSearchIndex(
2	"default",
3	{
4	"analyzer": "titleAutocomplete",
5	"mappings": {
6	"dynamic": false,
7	"fields": {
8	"title": {
9	"type": "string",
10	"analyzer": "titleAutocomplete"
11	}
12	}
13	},
14	"analyzers": [
15	{
16	"name": "titleAutocomplete",
17	"charFilters": [],
18	"tokenizer": {
19	"type": "standard"
20	},
21	"tokenFilters": [
22	{
23	"type": "icuFolding"
24	},
25	{
26	"type": "edgeGram",
27	"minGram": 4,
28	"maxGram": 7
29	}
30	]
31	}
32	]
33	}
34	)

Field	Value
`delimiterOptions.generateWordParts`	true
`delimiterOptions.preserveOriginal`	true

1	db.minutes.createSearchIndex(
2	"default",
3	{
4	"analyzer": "diacriticFolder",
5	"mappings": {
6	"fields": {
7	"text": {
8	"type": "document",
9	"fields": {
10	"sv_FI": {
11	"analyzer": "diacriticFolder",
12	"type": "string"
13	}
14	}
15	}
16	}
17	},
18	"analyzers": [
19	{
20	"name": "diacriticFolder",
21	"charFilters": [],
22	"tokenizer": {
23	"type": "keyword"
24	},
25	"tokenFilters": [
26	{
27	"type": "icuFolding"
28	}
29	]
30	}
31	]
32	}
33	)

1	db.minutes.createSearchIndex(
2	"default",
3	{
4	"analyzer": "textNormalizer",
5	"mappings": {
6	"fields": {
7	"message": {
8	"type": "string",
9	"analyzer": "textNormalizer"
10	}
11	}
12	},
13	"analyzers": [
14	{
15	"name": "textNormalizer",
16	"charFilters": [],
17	"tokenizer": {
18	"type": "whitespace"
19	},
20	"tokenFilters": [
21	{
22	"type": "icuNormalizer",
23	"normalizationForm": "nfkc"
24	}
25	]
26	}
27	]
28	}
29	)

1	db.minutes.createSearchIndex(
2	"default",
3	{
4	"analyzer": "kStemmer",
5	"mappings": {
6	"dynamic": true
7	},
8	"analyzers": [
9	{
10	"name": "kStemmer",
11	"tokenizer": {
12	"type": "standard"
13	},
14	"tokenFilters": [
15	{
16	"type": "lowercase"
17	},
18	{
19	"type": "kStemming"
20	}
21	]
22	}
23	]
24	}
25	)

1	db.minutes.aggregate([
2	{
3	"$search": {
4	"index": "default",
5	"wildcard": {
6	"query": "*example.com",
7	"path": "page_updated_by.email",
8	"allowAnalyzedField": true
9	}
10	}
11	},
12	{
13	"$project": {
14	"_id": 1,
15	"page_updated_by.email": 1
16	}
17	}
18	])

1	{
2	"analyzer": "lucene.keyword",
3	"mappings": {
4	"dynamic": true,
5	"fields": {
6	"page_updated_by": {
7	"type": "document",
8	"fields": {
9	"email": {
10	"type": "string",
11	"analyzer": "emailAutocompleteIndex",
12	"searchAnalyzer": "emailAutocompleteSearch"
13	}
14	}
15	}
16	}
17	},
18	"analyzers": [
19	{
20	"name": "emailAutocompleteIndex",
21	"charFilters": [
22	{
23	"mappings": {
24	"@": "AT"
25	},
26	"type": "mapping"
27	}
28	],
29	"tokenizer": {
30	"maxTokenLength": 15,
31	"type": "whitespace"
32	},
33	"tokenFilters": [
34	{
35	"maxShingleSize": 3,
36	**** "minShingleSize": 2,
37	"type": "shingle"
38	},
39	{
40	"maxGram": 15,
41	"minGram": 2,
42	"type": "edgeGram"
43	}
44	]
45	},
46	{
47	"name": "emailAutocompleteSearch",
48	"charFilters": [
49	{
50	"mappings": {
51	"@": "AT"
52	},
53	"type": "mapping"
54	}
55	],
56	"tokenizer": {
57	"maxTokenLength": 15,
58	"type": "whitespace"
59	}
60	}
61	]
62	}

1	db.minutes.aggregate([
2	{
3	"$search": {
4	"phrase": {
5	"query": "head of the sales",
6	"path": "text.en_US"
7	}
8	}
9	},
10	{
11	"$project": {
12	"_id": 1,
13	"text.en_US": 1
14	}
15	}
16	])

1	[
2	{
3	_id: 1,
4	text: { en_US: '<head> This page deals with department meetings. </head>' }
5	}
6	]