A custom analyzer's tokenizer determines how MongoDB Search splits up text into discrete chunks for indexing. Tokenizers require a type field, and some take additional options as well.
"tokenizer": { "type": "<tokenizer-type>", "<additional-option>": "<value>" }
Tokenizer Types
MongoDB Search supports the following tokenizer types:
The following sample index definitions and queries use the sample
collection named minutes.
To follow along with these examples, load the minutes collection on your cluster
and navigate to the Create a Search Index page in the Atlas UI following the steps
in the Create a MongoDB Search Index tutorial.
Then, select the minutes collection as your data source, and follow the example procedure
to create an index from the Atlas UI or using mongosh.
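If you only want to experiment with the examples on this page, the following mongosh sketch inserts a minimal stand-in for the sample documents. The field values are copied from the example results shown later on this page; the actual minutes sample collection contains additional fields, plus a document with _id: 2 that isn't reproduced here.

```javascript
// Hypothetical minimal stand-in for the minutes sample collection,
// reconstructed from the example results on this page.
db.minutes.insertMany([
  {
    _id: 1,
    title: "The team's weekly meeting",
    message: "try to siGn-In",
    page_updated_by: {
      last_name: "AUERBACH", first_name: "Siân",
      email: "auerbach@example.com", phone: "(123)-456-7890"
    }
  },
  {
    _id: 3,
    message: "try to sign-in",
    page_updated_by: {
      last_name: "LEWINSKY", first_name: "Brièle",
      email: "lewinsky@example.com", phone: "(123).456.9870"
    }
  },
  {
    _id: 4,
    message: "write down your signature or phone №",
    page_updated_by: {
      last_name: "LEVINSKI", first_name: "François",
      email: "levinski@example.com", phone: "123-456-8907"
    }
  }
])
```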
edgeGram
The edgeGram tokenizer tokenizes input from the left side, or "edge", of the text into n-grams of given sizes. You can't use a custom analyzer with the edgeGram tokenizer in the analyzer field for synonym or autocomplete field mapping definitions.
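To build intuition for the behavior described above, the following standalone sketch (illustrative only; MongoDB Search uses Lucene's tokenizer implementation) emits one token per gram length, anchored at the left edge of the input:

```javascript
// Illustrative sketch of edge n-gram tokenization, not Lucene's
// implementation: emit the leading minGram..maxGram characters.
function edgeGramTokenize(input, minGram, maxGram) {
  const tokens = [];
  for (let len = minGram; len <= Math.min(maxGram, input.length); len++) {
    tokens.push(input.slice(0, len));
  }
  return tokens;
}

edgeGramTokenize("try to sign-in", 2, 7);
// => [ 'tr', 'try', 'try ', 'try t', 'try to', 'try to ' ]
```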
Attributes
It has the following attributes:
Note
The edgeGram tokenizer yields multiple output tokens per word and across words in input text, producing token graphs. Because autocomplete field type mapping definitions and analyzers with synonym mappings only work when used with non-graph-producing tokenizers, you can't use a custom analyzer with the edgeGram tokenizer in the analyzer field for autocomplete field type mapping definitions or analyzers with synonym mappings.
| Name | Type | Required? | Description |
|---|---|---|---|
| `type` | string | yes | Human-readable label that identifies this tokenizer type. Value must be `edgeGram`. |
| `minGram` | integer | yes | Number of characters to include in the shortest token created. |
| `maxGram` | integer | yes | Number of characters to include in the longest token created. |
Example
The following index definition indexes the message field in the
minutes collection using a custom analyzer named
edgegramExample. It uses the edgeGram tokenizer to create
tokens (searchable terms) between 2 and 7 characters long
starting from the first character on the left side of words in the
message field.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type edgegramExample in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select edgeGram from the dropdown and type the value for the following fields:
| Field | Value |
|---|---|
| minGram | 2 |
| maxGram | 7 |

Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.
Select message from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select edgegramExample from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following:
{ "mappings": { "dynamic": true, "fields": { "message": { "analyzer": "edgegramExample", "type": "string" } } }, "analyzers": [ { "charFilters": [], "name": "edgegramExample", "tokenFilters": [], "tokenizer": { "maxGram": 7, "minGram": 2, "type": "edgeGram" } } ] }
```javascript
db.minutes.createSearchIndex(
  "default",
  {
    "mappings": {
      "dynamic": true,
      "fields": {
        "message": {
          "analyzer": "edgegramExample",
          "type": "string"
        }
      }
    },
    "analyzers": [
      {
        "charFilters": [],
        "name": "edgegramExample",
        "tokenFilters": [],
        "tokenizer": {
          "maxGram": 7,
          "minGram": 2,
          "type": "edgeGram"
        }
      }
    ]
  }
)
```
The following query searches the message field in the minutes collection for text that begins with tr.
Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.
Replace the default query with the following and click Find:
{ "$search": { "text": { "query": "tr", "path": "message" } } } SCORE: 0.3150668740272522 _id: "1" message: "try to siGn-In" page_updated_by: Object last_name: "AUERBACH" first_name: "Siân" email: "auerbach@example.com" phone: "(123)-456-7890" text: Object en_US: "<head> This page deals with department meetings.</head>" sv_FI: "Den här sidan behandlar avdelningsmöten" fr_CA: "Cette page traite des réunions de département" SCORE: 0.3150668740272522 _id: "3" message: "try to sign-in" page_updated_by: Object last_name: "LEWINSKY" first_name: "Brièle" email: "lewinsky@example.com" phone: "(123).456.9870" text: Object en_US: "<body>We'll head out to the conference room by noon.</body>"
```javascript
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "tr",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1
    }
  }
])
```
```javascript
[
  { _id: 1, message: 'try to siGn-In' },
  { _id: 3, message: 'try to sign-in' }
]
```
MongoDB Search returns documents with _id: 1 and _id: 3 in the results because MongoDB Search created a token with the value tr using the edgeGram tokenizer for those documents, which matches the search term. If you index the message field using the standard tokenizer instead, MongoDB Search doesn't return any results for the search term tr.
The following table shows the tokens that the edgeGram tokenizer and, by comparison, the standard tokenizer create for the documents in the results:

| Tokenizer | Token Outputs |
|---|---|
| edgeGram | `tr`, `try`, `try `, `try t`, `try to`, `try to ` |
| standard | `try`, `to`, `siGn`, `In` (for _id: 1); `try`, `to`, `sign`, `in` (for _id: 3) |
keyword
The keyword tokenizer tokenizes the entire input as a single token.
MongoDB Search doesn't index string fields that exceed 32766 characters using the
keyword tokenizer.
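A minimal sketch of this behavior (illustrative only; not the Lucene implementation):

```javascript
// The keyword tokenizer emits the entire input as one token.
// MongoDB Search skips values longer than 32766 characters.
function keywordTokenize(input) {
  return input.length > 32766 ? [] : [input];
}

keywordTokenize("try to sign-in"); // => [ 'try to sign-in' ]
```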
Attributes
It has the following attributes:
| Name | Type | Required? | Description |
|---|---|---|---|
| `type` | string | yes | Human-readable label that identifies this tokenizer type. Value must be `keyword`. |
Example
The following index definition indexes the message field in the minutes collection using a custom analyzer named keywordExample. It uses the keyword tokenizer to create a single token (searchable term) from the entire field.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type keywordExample in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select keyword from the dropdown.
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.
Select message from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select keywordExample from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following:
```json
{
  "mappings": {
    "dynamic": true,
    "fields": {
      "message": {
        "analyzer": "keywordExample",
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "keywordExample",
      "tokenFilters": [],
      "tokenizer": {
        "type": "keyword"
      }
    }
  ]
}
```
```javascript
db.minutes.createSearchIndex(
  "default",
  {
    "mappings": {
      "dynamic": true,
      "fields": {
        "message": {
          "analyzer": "keywordExample",
          "type": "string"
        }
      }
    },
    "analyzers": [
      {
        "charFilters": [],
        "name": "keywordExample",
        "tokenFilters": [],
        "tokenizer": {
          "type": "keyword"
        }
      }
    ]
  }
)
```
The following query searches the message field in
the minutes collection for the phrase try to sign-in.
Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.
Replace the default query with the following and click Find:
{ "$search": { "text": { "query": "try to sign-in", "path": "message" } } } SCORE: 0.5472603440284729 _id: "3" message: "try to sign-in" page_updated_by: Object last_name: "LEWINSKY" first_name: "Brièle" email: "lewinsky@example.com" phone: "(123).456.9870" text: Object en_US: "<body>We'll head out to the conference room by noon.</body>"
```javascript
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "try to sign-in",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1
    }
  }
])
```
```javascript
{ _id: 3, message: 'try to sign-in' }
```
MongoDB Search returns the document with _id: 3 in the results because MongoDB Search created a token with the value try to sign-in using the keyword tokenizer for that document, which matches the search term. If you index the message field using the standard tokenizer, MongoDB Search returns documents with _id: 1, _id: 2, and _id: 3 for the search term try to sign-in because each document contains some of the tokens that the standard tokenizer creates.
The following table shows the tokens that the keyword tokenizer and, by comparison, the standard tokenizer create for the document with _id: 3:

| Tokenizer | Token Outputs |
|---|---|
| keyword | `try to sign-in` |
| standard | `try`, `to`, `sign`, `in` |
nGram
The nGram tokenizer tokenizes input into text chunks, or "n-grams", of given sizes. You can't use a custom analyzer with the nGram tokenizer in the analyzer field for synonym or autocomplete field mapping definitions.
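The following standalone sketch (illustrative only; not Lucene's implementation) shows how n-grams differ from edge n-grams: every position in the input, not just the left edge, anchors a gram:

```javascript
// Illustrative n-gram tokenization: emit every substring between
// minGram and maxGram characters long, starting at each position.
function nGramTokenize(input, minGram, maxGram) {
  const tokens = [];
  for (let start = 0; start < input.length; start++) {
    for (let len = minGram; start + len <= input.length && len <= maxGram; len++) {
      tokens.push(input.slice(start, start + len));
    }
  }
  return tokens;
}

nGramTokenize("weekly", 4, 6);
// => [ 'week', 'weekl', 'weekly', 'eekl', 'eekly', 'ekly' ]
```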
Attributes
It has the following attributes:
| Name | Type | Required? | Description |
|---|---|---|---|
| `type` | string | yes | Human-readable label that identifies this tokenizer type. Value must be `nGram`. |
| `minGram` | integer | yes | Number of characters to include in the shortest token created. |
| `maxGram` | integer | yes | Number of characters to include in the longest token created. |
Example
The following index definition indexes the title field in the
minutes collection using a custom analyzer named
ngramExample. It uses the nGram tokenizer to create
tokens (searchable terms) between 4 and 6 characters long
in the title field.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type ngramExample in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select nGram from the dropdown and type the value for the following fields:
| Field | Value |
|---|---|
| minGram | 4 |
| maxGram | 6 |

Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the title field.
Select title from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select ngramExample from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following:
```json
{
  "mappings": {
    "dynamic": true,
    "fields": {
      "title": {
        "analyzer": "ngramExample",
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "ngramExample",
      "tokenFilters": [],
      "tokenizer": {
        "maxGram": 6,
        "minGram": 4,
        "type": "nGram"
      }
    }
  ]
}
```
```javascript
db.minutes.createSearchIndex(
  "default",
  {
    "mappings": {
      "dynamic": true,
      "fields": {
        "title": {
          "analyzer": "ngramExample",
          "type": "string"
        }
      }
    },
    "analyzers": [
      {
        "charFilters": [],
        "name": "ngramExample",
        "tokenFilters": [],
        "tokenizer": {
          "maxGram": 6,
          "minGram": 4,
          "type": "nGram"
        }
      }
    ]
  }
)
```
The following query searches the title field in
the minutes collection for the term week.
Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.
Replace the default query with the following and click Find:
{ "$search": { "index": "default", "text": { "query": "week", "path": "title" } } } SCORE: 0.5895273089408875 _id: "1" message: "try to siGn-In" page_updated_by: Object last_name: "AUERBACH" first_name: "Siân" email: "auerbach@example.com" phone: "(123)-456-7890" text: Object en_US: "<head> This page deals with department meetings.</head>" sv_FI: "Den här sidan behandlar avdelningsmöten" fr_CA: "Cette page traite des réunions de département"
```javascript
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "week",
        "path": "title"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "title": 1
    }
  }
])
```
```javascript
{ _id: 1, title: "The team's weekly meeting" }
```
MongoDB Search returns the document with _id: 1 in the results because MongoDB Search created a token with the value week using the nGram tokenizer for that document, which matches the search term. If you index the title field using the standard or edgeGram tokenizer, MongoDB Search doesn't return any results for the search term week.
The following table shows the tokens that the nGram tokenizer and, by comparison, the standard and edgeGram tokenizers create for the document with _id: 1:

| Tokenizer | Token Outputs |
|---|---|
| nGram | Every run of 4 to 6 consecutive characters in `The team's weekly meeting`, including `week`, `weekl`, and `weekly` |
| edgeGram | `The `, `The t`, `The te` |
| standard | `The`, `team's`, `weekly`, `meeting` |
regexCaptureGroup
The regexCaptureGroup tokenizer matches a Java regular expression
pattern to extract tokens.
Tip
To learn more about the Java regular expression syntax, see the Pattern class in the Java documentation.
Attributes
It has the following attributes:
| Name | Type | Required? | Description |
|---|---|---|---|
| `type` | string | yes | Human-readable label that identifies this tokenizer type. Value must be `regexCaptureGroup`. |
| `pattern` | string | yes | Regular expression to match against. |
| `group` | integer | yes | Index of the character group within the matching expression to extract into tokens. Use `0` to extract all character groups. |
Example
The following index definition indexes the page_updated_by.phone
field in the minutes collection using a custom analyzer named
phoneNumberExtractor. It uses the following:
- mapping character filter to remove the parentheses around the first three digits and replace all spaces and periods with dashes
- regexCaptureGroup tokenizer to create a single token from the first US-formatted phone number present in the text input
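The following standalone JavaScript sketch traces this two-stage pipeline (illustrative only; MongoDB Search runs the equivalent Lucene components, and the helper names here are hypothetical):

```javascript
// Stage 1: the mapping character filter from the index definition.
function mapChars(input) {
  const mappings = { " ": "-", "(": "", ")": "", ".": "-" };
  return [...input].map(c => mappings[c] ?? c).join("");
}

// Stage 2: a simplified regexCaptureGroup tokenizer that extracts
// only the first match of the given capture group.
function regexCaptureGroupTokenize(input, pattern, group) {
  const match = input.match(new RegExp(pattern));
  return match ? [match[group]] : [];
}

const filtered = mapChars("(123).456.9870"); // => '123-456-9870'
regexCaptureGroupTokenize(filtered, "^\\b\\d{3}[-]?\\d{3}[-]?\\d{4}\\b$", 0);
// => [ '123-456-9870' ]
```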
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type phoneNumberExtractor in the Analyzer Name field.
Expand Character Filters and click Add character filter.
Select mapping from the dropdown and click Add mapping.
Enter the following characters in the Original field, one at a time, with the value shown in the corresponding Replacement field (leave Replacement empty to remove the character):

| Original | Replacement |
|---|---|
| (space) | - |
| . | - |
| ( | (empty) |
| ) | (empty) |

Click Add character filter.
Expand Tokenizer if it's collapsed.
Select regexCaptureGroup from the dropdown and type the value for the following fields:
| Field | Value |
|---|---|
| pattern | `^\\b\\d{3}[-]?\\d{3}[-]?\\d{4}\\b$` |
| group | 0 |

Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.phone field.
Select page_updated_by.phone from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select phoneNumberExtractor from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following example:
{ "mappings": { "dynamic": true, "fields": { "page_updated_by": { "fields": { "phone": { "analyzer": "phoneNumberExtractor", "type": "string" } }, "type": "document" } } }, "analyzers": [ { "charFilters": [ { "mappings": { " ": "-", "(": "", ")": "", ".": "-" }, "type": "mapping" } ], "name": "phoneNumberExtractor", "tokenFilters": [], "tokenizer": { "group": 0, "pattern": "^\\b\\d{3}[-]?\\d{3}[-]?\\d{4}\\b$", "type": "regexCaptureGroup" } } ] }
```javascript
db.minutes.createSearchIndex(
  "default",
  {
    "mappings": {
      "dynamic": true,
      "fields": {
        "page_updated_by": {
          "fields": {
            "phone": {
              "analyzer": "phoneNumberExtractor",
              "type": "string"
            }
          },
          "type": "document"
        }
      }
    },
    "analyzers": [
      {
        "charFilters": [
          {
            "mappings": {
              " ": "-",
              "(": "",
              ")": "",
              ".": "-"
            },
            "type": "mapping"
          }
        ],
        "name": "phoneNumberExtractor",
        "tokenFilters": [],
        "tokenizer": {
          "group": 0,
          "pattern": "^\\b\\d{3}[-]?\\d{3}[-]?\\d{4}\\b$",
          "type": "regexCaptureGroup"
        }
      }
    ]
  }
)
```
The following query searches the page_updated_by.phone field in
the minutes collection for the phone number 123-456-9870.
Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.
Replace the default query with the following and click Find:
{ "$search": { "index": "default", "text": { "query": "123-456-9870", "path": "page_updated_by.phone" } } } SCORE: 0.5472603440284729 _id: "3" message: "try to sign-in" page_updated_by: Object last_name: "LEWINSKY" first_name: "Brièle" email: "lewinsky@example.com" phone: "(123).456.9870" text: Object en_US: "<body>We'll head out to the conference room by noon.</body>"
```javascript
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "123-456-9870",
        "path": "page_updated_by.phone"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.phone": 1
    }
  }
])
```
```javascript
{ _id: 3, page_updated_by: { phone: '(123).456.9870' } }
```
MongoDB Search returns the document with _id: 3 in the results because MongoDB Search created a token with the value 123-456-9870 using the regexCaptureGroup tokenizer for that document, which matches the search term. If you index the page_updated_by.phone field using the standard tokenizer, MongoDB Search returns all of the documents for the search term 123-456-9870.
The following table shows the tokens that the regexCaptureGroup tokenizer and, by comparison, the standard tokenizer create for the document with _id: 3:

| Tokenizer | Token Outputs |
|---|---|
| regexCaptureGroup | `123-456-9870` |
| standard | `123`, `456.9870` |
regexSplit
The regexSplit tokenizer splits input into tokens using a Java regular expression as the delimiter.
Tip
To learn more about the Java regular expression syntax, see the Pattern class in the Java documentation.
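A minimal sketch of the behavior (illustrative only; not the Lucene implementation): the pattern marks the delimiters, and everything between delimiters becomes a token:

```javascript
// Split on the delimiter pattern and drop empty strings.
function regexSplitTokenize(input, pattern) {
  return input.split(new RegExp(pattern)).filter(t => t.length > 0);
}

regexSplitTokenize("(123).456.9870", "[-. ]+");
// => [ '(123)', '456', '9870' ]
```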
Attributes
It has the following attributes:
| Name | Type | Required? | Description |
|---|---|---|---|
| `type` | string | yes | Human-readable label that identifies this tokenizer type. Value must be `regexSplit`. |
| `pattern` | string | yes | Regular expression to match against. |
Example
The following index definition indexes the page_updated_by.phone field in the minutes collection using a custom analyzer named dashDotSpaceSplitter. It uses the regexSplit tokenizer to create tokens (searchable terms) delimited by one or more hyphens, periods, and spaces in the page_updated_by.phone field.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type dashDotSpaceSplitter in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select regexSplit from the dropdown and type the value for the following field:
| Field | Value |
|---|---|
| pattern | `[-. ]+` |

Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.phone field.
Select page_updated_by.phone from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select dashDotSpaceSplitter from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following:
{ "mappings": { "dynamic": true, "fields": { "page_updated_by": { "fields": { "phone": { "analyzer": "dashDotSpaceSplitter", "type": "string" } }, "type": "document" } } }, "analyzers": [ { "charFilters": [], "name": "dashDotSpaceSplitter", "tokenFilters": [], "tokenizer": { "pattern": "[-. ]+", "type": "regexSplit" } } ] }
```javascript
db.minutes.createSearchIndex(
  "default",
  {
    "mappings": {
      "dynamic": true,
      "fields": {
        "page_updated_by": {
          "fields": {
            "phone": {
              "analyzer": "dashDotSpaceSplitter",
              "type": "string"
            }
          },
          "type": "document"
        }
      }
    },
    "analyzers": [
      {
        "charFilters": [],
        "name": "dashDotSpaceSplitter",
        "tokenFilters": [],
        "tokenizer": {
          "pattern": "[-. ]+",
          "type": "regexSplit"
        }
      }
    ]
  }
)
```
The following query searches the page_updated_by.phone field in
the minutes collection for the digits 9870.
Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.
Replace the default query with the following and click Find:
{ "$search": { "index": "default", "text": { "query": "9870", "path": "page_updated_by.phone" } } } SCORE: 0.5472603440284729 _id: "3" message: "try to sign-in" page_updated_by: Object last_name: "LEWINSKY" first_name: "Brièle" email: "lewinsky@example.com" phone: "(123).456.9870" text: Object en_US: "<body>We'll head out to the conference room by noon.</body>"
```javascript
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "9870",
        "path": "page_updated_by.phone"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.phone": 1
    }
  }
])
```
```javascript
{ _id: 3, page_updated_by: { phone: '(123).456.9870' } }
```
MongoDB Search returns the document with _id: 3 in the results because MongoDB Search created a token with the value 9870 using the regexSplit tokenizer for that document, which matches the search term. If you index the page_updated_by.phone field using the standard tokenizer, MongoDB Search doesn't return any results for the search term 9870.
The following table shows the tokens that the regexSplit tokenizer and, by comparison, the standard tokenizer create for the document with _id: 3:

| Tokenizer | Token Outputs |
|---|---|
| regexSplit | `(123)`, `456`, `9870` |
| standard | `123`, `456.9870` |
standard
The standard tokenizer tokenizes based on word break rules from the
Unicode Text Segmentation algorithm.
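You can approximate the standard tokenizer's output in Node.js with the built-in Intl.Segmenter, which implements the same Unicode word break rules (an approximation only; MongoDB Search uses Lucene's StandardTokenizer):

```javascript
// Approximate the standard tokenizer with Unicode word segmentation,
// keeping only word-like segments (letters and numbers).
function standardTokenize(input) {
  const segmenter = new Intl.Segmenter("en", { granularity: "word" });
  return [...segmenter.segment(input)]
    .filter(s => s.isWordLike)
    .map(s => s.segment);
}

standardTokenize("write down your signature or phone №");
// => [ 'write', 'down', 'your', 'signature', 'or', 'phone' ]
```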
Attributes
It has the following attributes:
| Name | Type | Required? | Description |
|---|---|---|---|
| `type` | string | yes | Human-readable label that identifies this tokenizer type. Value must be `standard`. |
| `maxTokenLength` | integer | no | Maximum length for a single token. Tokens greater than this length are split at maxTokenLength into multiple tokens. Default: `255` |
Example
The following index definition indexes the message field in the minutes collection using a custom analyzer named standardExample. It uses the standard tokenizer with no token filters.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type standardExample in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.
Select message from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select standardExample from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following:
{ "mappings": { "dynamic": true, "fields": { "message": { "analyzer": "standardExample", "type": "string" } } }, "analyzers": [ { "charFilters": [], "name": "standardExample", "tokenFilters": [], "tokenizer": { "type": "standard" } } ] }
```javascript
db.minutes.createSearchIndex(
  "default",
  {
    "mappings": {
      "dynamic": true,
      "fields": {
        "message": {
          "analyzer": "standardExample",
          "type": "string"
        }
      }
    },
    "analyzers": [
      {
        "charFilters": [],
        "name": "standardExample",
        "tokenFilters": [],
        "tokenizer": {
          "type": "standard"
        }
      }
    ]
  }
)
```
The following query searches the message field in
the minutes collection for the term signature.
Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.
Replace the default query with the following and click Find:
{ "$search": { "text": { "query": "signature", "path": "message" } } } SCORE: 0.5376965999603271 _id: "4" message: "write down your signature or phone №" page_updated_by: Object last_name: "LEVINSKI" first_name: "François" email: "levinski@example.com" phone: "123-456-8907" text: Object en_US: "<body>This page has been updated with the items on the agenda.</body>" es_MX: "La página ha sido actualizada con los puntos de la agenda." pl_PL: "Strona została zaktualizowana o punkty porządku obrad."
```javascript
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "signature",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1
    }
  }
])
```
```javascript
{ _id: 4, message: 'write down your signature or phone №' }
```
MongoDB Search returns the document with _id: 4 because MongoDB Search created a token with the value signature using the standard tokenizer for that document, which matches the search term. If you index the message field using the keyword tokenizer, MongoDB Search doesn't return any results for the search term signature.
The following table shows the tokens that the standard tokenizer and, by comparison, the keyword tokenizer create for the document with _id: 4:

| Tokenizer | Token Outputs |
|---|---|
| standard | `write`, `down`, `your`, `signature`, `or`, `phone` |
| keyword | `write down your signature or phone №` |
uaxUrlEmail
The uaxUrlEmail tokenizer tokenizes URLs and email addresses.
Although the uaxUrlEmail tokenizer tokenizes based on word break rules from the Unicode Text Segmentation algorithm, we recommend using the uaxUrlEmail tokenizer only when the indexed field value includes URLs and email addresses. For fields that don't include URLs or email addresses, use the standard tokenizer to create tokens based on word break rules.
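The difference is easy to see with the same Intl.Segmenter approximation of the standard tokenizer used earlier on this page (an approximation only; not the Lucene implementation): word break rules split an email address at the @ sign, which is exactly what the uaxUrlEmail tokenizer avoids:

```javascript
// Word break rules split an email address at '@'; the uaxUrlEmail
// tokenizer instead keeps 'lewinsky@example.com' as a single token.
const segmenter = new Intl.Segmenter("en", { granularity: "word" });
const words = [...segmenter.segment("lewinsky@example.com")]
  .filter(s => s.isWordLike)
  .map(s => s.segment);
// words => [ 'lewinsky', 'example.com' ]
```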
Attributes
It has the following attributes:
| Name | Type | Required? | Description |
|---|---|---|---|
| `type` | string | yes | Human-readable label that identifies this tokenizer type. Value must be `uaxUrlEmail`. |
| `maxTokenLength` | integer | no | Maximum number of characters in one token. Default: `255` |
Example
The following index definition indexes the page_updated_by.email
field in the minutes collection using a custom analyzer named
basicEmailAddressAnalyzer. It uses the uaxUrlEmail tokenizer to
create tokens (searchable terms) from URLs and email
addresses in the page_updated_by.email field.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type basicEmailAddressAnalyzer in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select uaxUrlEmail from the dropdown.
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.email field.
Select page_updated_by.email from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select basicEmailAddressAnalyzer from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following:
{ "mappings": { "fields": { "page_updated_by": { "fields": { "email": { "analyzer": "basicEmailAddressAnalyzer", "type": "string" } }, "type": "document" } } }, "analyzers": [ { "name": "basicEmailAddressAnalyzer", "tokenizer": { "type": "uaxUrlEmail" } } ] }
```javascript
db.minutes.createSearchIndex(
  "default",
  {
    "mappings": {
      "fields": {
        "page_updated_by": {
          "fields": {
            "email": {
              "analyzer": "basicEmailAddressAnalyzer",
              "type": "string"
            }
          },
          "type": "document"
        }
      }
    },
    "analyzers": [
      {
        "name": "basicEmailAddressAnalyzer",
        "tokenizer": {
          "type": "uaxUrlEmail"
        }
      }
    ]
  }
)
```
The following query searches the page_updated_by.email field in
the minutes collection for the email lewinsky@example.com.
Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.
Replace the default query with the following and click Find:
{ "$search": { "text": { "query": "lewinsky@example.com", "path": "page_updated_by.email" } } } SCORE: 0.5472603440284729 _id: "3" message: "try to sign-in" page_updated_by: Object last_name: "LEWINSKY" first_name: "Brièle" email: "lewinsky@example.com" phone: "(123).456.9870" text: Object en_US: "<body>We'll head out to the conference room by noon.</body>"
```javascript
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "lewinsky@example.com",
        "path": "page_updated_by.email"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.email": 1
    }
  }
])
```
```javascript
{ _id: 3, page_updated_by: { email: 'lewinsky@example.com' } }
```
MongoDB Search returns the document with _id: 3 in the results because MongoDB Search created a token with the value lewinsky@example.com using the uaxUrlEmail tokenizer for that document, which matches the search term. If you index the page_updated_by.email field using the standard tokenizer, MongoDB Search returns all the documents for the search term lewinsky@example.com.
The following table shows the tokens that the uaxUrlEmail tokenizer and, by comparison, the standard tokenizer create for the document with _id: 3:

| Tokenizer | Token Outputs |
|---|---|
| uaxUrlEmail | `lewinsky@example.com` |
| standard | `lewinsky`, `example.com` |
The following index definition indexes the page_updated_by.email
field in the minutes collection using a custom analyzer named
emailAddressAnalyzer. It uses the following:
- The autocomplete type with an edgeGram tokenization strategy
- The uaxUrlEmail tokenizer to create tokens (searchable terms) from URLs and email addresses
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type emailAddressAnalyzer in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select uaxUrlEmail from the dropdown.
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.email field.
Select page_updated_by.email from the Field Name dropdown and Autocomplete from the Data Type dropdown.
In the properties section for the data type, select emailAddressAnalyzer from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following:
{ "mappings": { "fields": { "page_updated_by": { "fields": { "email": { "analyzer": "emailAddressAnalyzer", "tokenization": "edgeGram", "type": "autocomplete" } }, "type": "document" } } }, "analyzers": [ { "name": "emailAddressAnalyzer", "tokenizer": { "type": "uaxUrlEmail" } } ] }
```javascript
db.minutes.createSearchIndex(
  "default",
  {
    "mappings": {
      "fields": {
        "page_updated_by": {
          "fields": {
            "email": {
              "analyzer": "emailAddressAnalyzer",
              "tokenization": "edgeGram",
              "type": "autocomplete"
            }
          },
          "type": "document"
        }
      }
    },
    "analyzers": [
      {
        "name": "emailAddressAnalyzer",
        "tokenizer": {
          "type": "uaxUrlEmail"
        }
      }
    ]
  }
)
```
The following query searches the page_updated_by.email field in the minutes collection for the email address lewinsky@example.com.
Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.
Replace the default query with the following and click Find:
{ "$search": { "autocomplete": { "query": "lewinsky@example.com", "path": "page_updated_by.email" } } } SCORE: 1.0203158855438232 _id: "3" message: "try to sign-in" page_updated_by: Object last_name: "LEWINSKY" first_name: "Brièle" email: "lewinsky@example.com" phone: "(123).456.9870" text: Object en_US: "<body>We'll head out to the conference room by noon.</body>"
```javascript
db.minutes.aggregate([
  {
    "$search": {
      "autocomplete": {
        "query": "lewinsky@example.com",
        "path": "page_updated_by.email"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.email": 1
    }
  }
])
```
```javascript
{ _id: 3, page_updated_by: { email: 'lewinsky@example.com' } }
```
MongoDB Search returns the document with _id: 3 in the results because MongoDB Search created tokens that match the value lewinsky@example.com using the uaxUrlEmail tokenizer for that document. If you index the page_updated_by.email field using the standard tokenizer, MongoDB Search returns all the documents for the search term lewinsky@example.com.
The following table shows the tokens that the uaxUrlEmail tokenizer and, by comparison, the standard tokenizer create for the document with _id: 3:

| Tokenizer | MongoDB Search Field Type | Token Outputs |
|---|---|---|
| uaxUrlEmail | autocomplete | Edge n-grams of the single token `lewinsky@example.com`, starting with `le`, `lew`, `lewi` |
| standard | autocomplete | Edge n-grams of the tokens `lewinsky` and `example.com` |
whitespace
The whitespace tokenizer tokenizes based on occurrences of
whitespace between words.
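A minimal sketch of the behavior (illustrative only; not the Lucene implementation):

```javascript
// Split on runs of whitespace; punctuation stays inside the tokens.
function whitespaceTokenize(input) {
  return input.split(/\s+/).filter(t => t.length > 0);
}

whitespaceTokenize("try to sign-in");
// => [ 'try', 'to', 'sign-in' ]
```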
Attributes
It has the following attributes:
| Name | Type | Required? | Description |
|---|---|---|---|
| `type` | string | yes | Human-readable label that identifies this tokenizer type. Value must be `whitespace`. |
| `maxTokenLength` | integer | no | Maximum length for a single token. Tokens greater than this length are split at maxTokenLength into multiple tokens. Default: `255` |
Example
The following index definition indexes the message field in the minutes collection using a custom analyzer named whitespaceExample. It uses the whitespace tokenizer to create tokens (searchable terms) split on whitespace in the message field.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type whitespaceExample in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select whitespace from the dropdown.
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.
Select message from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select whitespaceExample from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following example:
{ "mappings": { "dynamic": true, "fields": { "message": { "analyzer": "whitespaceExample", "type": "string" } } }, "analyzers": [ { "charFilters": [], "name": "whitespaceExample", "tokenFilters": [], "tokenizer": { "type": "whitespace" } } ] }
```javascript
db.minutes.createSearchIndex(
  "default",
  {
    "mappings": {
      "dynamic": true,
      "fields": {
        "message": {
          "analyzer": "whitespaceExample",
          "type": "string"
        }
      }
    },
    "analyzers": [
      {
        "charFilters": [],
        "name": "whitespaceExample",
        "tokenFilters": [],
        "tokenizer": {
          "type": "whitespace"
        }
      }
    ]
  }
)
```
The following query searches the message field in the minutes collection for the term sign-in.
Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.
Replace the default query with the following and click Find:
{ "$search": { "text": { "query": "sign-in", "path": "message" } } } SCORE: 0.6722691059112549 _id: "3" message: "try to sign-in" page_updated_by: Object last_name: "LEWINSKY" first_name: "Brièle" email: "lewinsky@example.com" phone: "(123).456.9870" text: Object en_US: "<body>We'll head out to the conference room by noon.</body>"
```javascript
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "sign-in",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1
    }
  }
])
```
```javascript
[ { _id: 3, message: 'try to sign-in' } ]
```
MongoDB Search returns the document with _id: 3 in the results because MongoDB Search created a token with the value sign-in using the whitespace tokenizer for that document, which matches the search term. If you index the message field using the standard tokenizer, MongoDB Search returns documents with _id: 1, _id: 2, and _id: 3 for the search term sign-in.
The following table shows the tokens that the whitespace tokenizer and, by comparison, the standard tokenizer create for the document with _id: 3:

| Tokenizer | Token Outputs |
|---|---|
| whitespace | `try`, `to`, `sign-in` |
| standard | `try`, `to`, `sign`, `in` |