Merge pull request #233 from scott1702/feature/enable-macrons-default

Enable macrons and non-ASCII chars in search by default

This is done with [`ASCIIFoldingFilter`](https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-ASCIIFoldingFilter), which converts characters outside the ASCII range to their closest ASCII equivalent, e.g. an ō into a plain o.

>## ASCII Folding Filter
>This filter converts alphabetic, numeric, and symbolic Unicode characters which are not in the Basic Latin Unicode block (the first 127 ASCII characters) to their ASCII equivalents, if one exists. This filter converts characters from the following Unicode blocks:
>
> - C1 Controls and Latin-1 Supplement (PDF)
> - Latin Extended-A (PDF)
> - Latin Extended-B (PDF)
> - Latin Extended Additional (PDF)
> - Latin Extended-C (PDF)
> - Latin Extended-D (PDF)
> - IPA Extensions (PDF)
> - Phonetic Extensions (PDF)
> - Phonetic Extensions Supplement (PDF)
> - General Punctuation (PDF)
> - Superscripts and Subscripts (PDF)
> - Enclosed Alphanumerics (PDF)
> - Dingbats (PDF)
> - Supplemental Punctuation (PDF)
> - Alphabetic Presentation Forms (PDF)
> - Halfwidth and Fullwidth Forms (PDF)
> 
> 
> **Factory class**: solr.ASCIIFoldingFilterFactory
> 
> **Arguments**:
> `preserveOriginal` - (boolean, default false) If true, the original token is preserved: "thé" → "the", "thé"
> 
> **Example**:
> ```xml
> <analyzer>
>   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>   <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
> </analyzer>
> ```
> **In**: "á" (Unicode character 00E1)
> **Out**: "a" (ASCII character 97)
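To make the folding behaviour quoted above concrete, here is a minimal Python sketch that approximates it. This is not Lucene's implementation: `ASCIIFoldingFilter` uses its own mapping table, whereas this sketch (with the hypothetical helper name `fold_to_ascii`) only applies Unicode NFKD decomposition and drops combining marks. That covers macrons and most accented Latin letters, but not characters without a decomposition, such as "ø".

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Approximate ASCII folding (hypothetical helper, not Lucene's table).

    Decomposes each character into base letter + combining marks (NFKD),
    then discards the combining marks.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The same folding is applied at index time and at query time, so a query
# typed without macrons still matches indexed text that has them.
indexed_term = fold_to_ascii("Māori")
query_term = fold_to_ascii("Maori")
print(indexed_term)                # Maori
print(indexed_term == query_term)  # True
```

Folding on both sides is the reason the filter is added to both the index analyzer and the query analyzer in this change: if only one side were folded, "Māori" and "Maori" would produce different tokens and would not match.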
Dylan Wagstaff 2018-10-24 09:22:38 +13:00 committed by GitHub
commit 6eb6ebacb4
2 changed files with 32 additions and 28 deletions


@@ -136,9 +136,11 @@
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
+<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
+<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
@@ -162,9 +164,11 @@
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
+<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
+<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>


@@ -428,7 +428,7 @@ To allow searches on words containing numeric tokens, you'll need to change the
The `ASCIIFoldingFilterFactory` filter converts alphabetic, numeric, and symbolic Unicode characters which are not in the Basic Latin Unicode block (the first 127 ASCII characters) to their ASCII equivalents, if one exists.
-Find the fields in your overloaded `types.ss` that you want to enable this behaviour in, for example inside the `<fieldType name="htmltext">` block, add the following to both its index analyzer and query analyzer records.
+By default, this functionality is enabled on the `htmltext` and `text` fieldTypes. To enable it for any other fieldType, find that fieldType in your overloaded `types.ss` (for example, the `<fieldType name="textTight">` block) and add the following to both its index analyzer and query analyzer records.
```xml
<filter class="solr.ASCIIFoldingFilterFactory"/>