Switchable tokenizer #776

trinity-1686a · 2020-05-25T17:48:04Z

KitaitiMakoto commented

2020-05-25 17:48:04 +00:00

(Migrated from github.com)

Hi,

This pull request adds a Japanese full-text search feature. It makes it possible to:

- switch the search tokenizer via an environment variable
- use the Lindera morphological analysis library for Japanese full-text search
  - I extracted this as the search-lindera feature so that users can choose not to include it, since it bundles a 10 MiB dictionary
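
As a rough illustration of the optional feature, the sketch below shows one way a Cargo feature can gate Japanese support at compile time. It assumes a feature named search-lindera is declared under [features] in Cargo.toml and that "ja" is the language code in use; the helper names lindera_available and check_search_lang are hypothetical and are not taken from this pull request.

```rust
/// Hypothetical helper: true only when the crate was built with
/// `--features search-lindera`.
pub fn lindera_available() -> bool {
    cfg!(feature = "search-lindera")
}

/// Hypothetical check: reject a Japanese search configuration when the
/// Lindera dictionary was not compiled in. The "ja" code is an assumption.
pub fn check_search_lang(lang: &str) -> Result<(), String> {
    if lang == "ja" && !lindera_available() {
        return Err(String::from(
            "Japanese search requires a build with the search-lindera feature",
        ));
    }
    Ok(())
}
```

With this shape, a build without the feature never links the dictionary, and a misconfiguration is reported at startup rather than at query time.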

I have not updated the CI tests, though. What do you think? Should we run four builds: postgres or sqlite, each with and without search-lindera?

Thanks.

codecov[bot] commented

2020-05-25 18:23:35 +00:00

(Migrated from github.com)

Codecov Report

Merging #776 into master will increase coverage by 0.01%.
The diff coverage is 80.35%.

@@            Coverage Diff             @@
##           master     #776      +/-   ##
==========================================
+ Coverage   39.08%   39.09%   +0.01%     
==========================================
  Files          73       73              
  Lines        9736     9756      +20     
  Branches     2227     2233       +6     
==========================================
+ Hits         3805     3814       +9     
- Misses       4879     4886       +7     
- Partials     1052     1056       +4

elegaanz (Migrated from github.com) reviewed 2020-06-10 17:42:23 +00:00

elegaanz (Migrated from github.com) left a comment

Sorry I didn't review this earlier.

plume-models/src/config.rs

					
@ -191,0 +213,4 @@
                    ),
                    property_tokenizer: Ngram,
                }
            }

elegaanz (Migrated from github.com) commented

2020-06-10 17:41:56 +00:00

Does this mean that search can only work with one language that the admin chooses?

KitaitiMakoto (Migrated from github.com) reviewed 2020-06-10 23:48:39 +00:00

KitaitiMakoto (Migrated from github.com) commented

2020-06-10 23:48:39 +00:00

Yes, it does.

Two things are needed to support multiple languages:

- making it possible for Plume to accept search configuration for multiple languages
- making it possible to detect the language of a query, either automatically or by user selection

I think the latter is hard to solve. Detecting the language automatically from a few search words is technically difficult, and selecting a language every time they search is bothersome for users.

elegaanz (Migrated from github.com) reviewed 2020-06-11 06:43:02 +00:00

elegaanz (Migrated from github.com) commented

2020-06-11 06:43:02 +00:00

Maybe we could default to the interface language, and let the user change it if needed? Or remember the last searched language, so that you only have to change it once.

Having only one language/alphabet supported by search seems more inconvenient to me...
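
The fallback order suggested here could be sketched roughly as follows. This is only an illustration of the idea, not something implemented in this pull request; the function name pick_search_lang and its parameters are hypothetical.

```rust
/// Hypothetical fallback order for the language of a search query:
/// an explicit user choice wins, then the last searched language,
/// then the interface language.
pub fn pick_search_lang<'a>(
    explicit: Option<&'a str>,
    last_searched: Option<&'a str>,
    interface_lang: &'a str,
) -> &'a str {
    explicit.or(last_searched).unwrap_or(interface_lang)
}
```

Note that this only picks a language; each supported language would still need its own tokenizer configuration and index.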

KitaitiMakoto (Migrated from github.com) reviewed 2020-06-13 14:33:42 +00:00

KitaitiMakoto (Migrated from github.com) commented

2020-06-13 14:33:42 +00:00

First, let me explain this pull request. One essential part of it is that we become able to choose search tokenizers via environment variables (the other part is introducing Lindera). This is achieved by the env vars SEARCH_TAG_TOKENIZER and SEARCH_CONTENT_TOKENIZER. SEARCH_LANG is just a shortcut for combinations of those env vars.
If you don't set any of SEARCH_TAG_TOKENIZER, SEARCH_CONTENT_TOKENIZER and SEARCH_LANG, Plume behaves as before. Therefore, nothing is lost if you don't set those env vars.

Accepting only one SEARCH_LANG might be inconvenient, as you say. Defaulting to the interface language and remembering the last language are both possible. But allowing both of those together with explicit tokenizer selection introduces complexity. It makes SEARCH_LANG more than just a shortcut, and it requires multiple index directories.

Those are worth working on. But isn't a single search language a good starting point, given that nothing is lost? In general, I don't think it's good for a pull request to grow bigger.
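
The relationship between the three env vars described in this comment can be sketched roughly as below. The names TokenizerKind, SearchTokenizerConfig and determine_tokenizer appear in the commit summary for this pull request, but the signatures, the accepted values ("simple", "ngram", "whitespace", "lindera", "ja") and the default tokenizers shown here are assumptions made for illustration, not the actual Plume code.

```rust
use std::env;

/// Tokenizer choices; the variant set is an assumption for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenizerKind {
    Simple,
    Ngram,
    Whitespace,
    Lindera,
}

#[derive(Debug, PartialEq, Eq)]
pub struct SearchTokenizerConfig {
    pub tag_tokenizer: TokenizerKind,
    pub content_tokenizer: TokenizerKind,
}

/// Map an explicit tokenizer name to a kind, falling back to `default`
/// when the variable is unset or its value is not recognized.
fn determine_tokenizer(value: Option<&str>, default: TokenizerKind) -> TokenizerKind {
    match value {
        Some("simple") => TokenizerKind::Simple,
        Some("ngram") => TokenizerKind::Ngram,
        Some("whitespace") => TokenizerKind::Whitespace,
        Some("lindera") => TokenizerKind::Lindera,
        _ => default,
    }
}

/// SEARCH_LANG only selects the defaults; the two tokenizer variables
/// override them individually.
pub fn resolve(
    lang: Option<&str>,
    tag: Option<&str>,
    content: Option<&str>,
) -> SearchTokenizerConfig {
    let (tag_default, content_default) = match lang {
        // Assumed shortcut: "ja" selects Lindera for both fields.
        Some("ja") => (TokenizerKind::Lindera, TokenizerKind::Lindera),
        // Assumed defaults when SEARCH_LANG is unset or unknown.
        _ => (TokenizerKind::Whitespace, TokenizerKind::Simple),
    };
    SearchTokenizerConfig {
        tag_tokenizer: determine_tokenizer(tag, tag_default),
        content_tokenizer: determine_tokenizer(content, content_default),
    }
}

/// Read the three variables from the process environment.
pub fn from_env() -> SearchTokenizerConfig {
    let lang = env::var("SEARCH_LANG").ok();
    let tag = env::var("SEARCH_TAG_TOKENIZER").ok();
    let content = env::var("SEARCH_CONTENT_TOKENIZER").ok();
    resolve(lang.as_deref(), tag.as_deref(), content.as_deref())
}
```

Keeping resolve free of environment access makes the precedence easy to test: with nothing set the defaults apply, and an explicit tokenizer variable always wins over the SEARCH_LANG shortcut.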

elegaanz (Migrated from github.com) reviewed 2020-06-13 17:39:51 +00:00

elegaanz (Migrated from github.com) commented

2020-06-13 17:39:51 +00:00

OK, it would indeed add a lot of complexity. Let's do it this way!

Could you please make a PR to the documentation (https://github.com/Plume-org/docs/) to document the new variables?

elegaanz (Migrated from github.com) approved these changes 2020-06-13 17:40:21 +00:00

elegaanz commented

2020-06-13 18:28:22 +00:00

(Migrated from github.com)

One test keeps failing… Did you test your branch with both SQLite and PostgreSQL?

KitaitiMakoto commented

2020-06-16 16:01:47 +00:00

(Migrated from github.com)

I'm sorry, I didn't. But now my dev env is broken. Can you wait a little bit?

igalic commented

2020-06-16 17:14:40 +00:00

(Migrated from github.com)

we have all the time in the world

👍 1 😆 1

trinity-1686a commented

2020-06-16 23:09:03 +00:00

It bugged me that the penultimate commit passed CI but a simple typo fix failed, and the error looked more like the test script being unable to connect to Selenium (the thing that allows running tests inside an actual web browser) than an actual failure, so I reran it. Apparently CI is back at it again, failing for no obvious reason.
Anyway, all tests passed 👍

👍 2 😆 1

elegaanz commented

2020-06-17 14:57:21 +00:00

(Migrated from github.com)

OK, no need to do more tests then, @KitaitiMakoto. LGTM! Thank you @trinity-1686a for restarting the job. I tried once too, but it didn't help. Computers are weird.

elegaanz commented

2020-06-17 14:58:04 +00:00

(Migrated from github.com)

@KitaitiMakoto do you want me to write the documentation for this feature, or do you want to do it yourself?

KitaitiMakoto commented

2020-06-21 15:53:21 +00:00

(Migrated from github.com)

Thank you, @trinity-1686a and @elegaanz, for investigating and rerunning CI!

> @KitaitiMakoto do you want me to write the documentation for this feature, or do you want to do it yourself?

I'd like to write it myself, but I'm a little busy right now. If you're in a hurry, could you write it?

elegaanz commented

2020-06-22 12:26:57 +00:00

(Migrated from github.com)

Take your time, don't worry. 🙂

kiwii referenced this pull request from a commit

2020-08-11 18:12:11 +00:00

Switchable tokenizer (#776)

* [REFACTORING]Rename whitespace_tokenizer to tag_tokenizer for registration

  Name representing its purpose is preferred.
* Add lindera-tantivy to plume-model's dependencies
* Install lindera-tantivy
* Add SearchTokenizerConfig struct
* Add search tokenizers to config option
* Use CONFIG for tokenizers
* Use enum to hold tokenizer config instead of initializing on config phase
* Use guard instead of duplicate default values
* Use as_deref() instead of guard
* Move SearchTokenizer from plume-models to plume-models::search::tokenizer
* Rename SearchTokenizer to TokenizerKind
* Define SearchTokenierConfig::determine_tokenizer()
* Use determine_tokenizer in SearchTokenizerConfig::init()
* Pass tokenizer config to Searcher methods
* Add LowerCase filter to Lindera tokenizer
* Add test for Lindera tokenizer
* Define SEARCH_LANG env to specify tokenizers set
* Run cargo fmt
* Make Lindera tokenizer optional
* Fix typos
