#776 Switchable tokenizer

Merged
KitaitiMakoto merged 20 commits from switchable-tokenizer into master 5 months ago
KitaitiMakoto commented 6 months ago (Migrated from github.com)

Hi,

This is a pull request for Japanese full-text search feature. I made it possible to:

  • switch search tokenizer by environment variable
  • use Lindera morphological analysis library for Japanese full-text search
    • I extracted this as search-lindera feature so that users can choose not to include it, which has 10MiB dictionary

But I did not work with CI tests. How do you think about it? Run by four build of postgres or sqlite x with or without search-lindera?

Thanks.

Hi, This is a pull request for Japanese full-text search feature. I made it possible to: * switch search tokenizer by environment variable * use Lindera morphological analysis library for Japanese full-text search * I extracted this as `search-lindera` feature so that users can choose not to include it, which has 10MiB dictionary But I did not work with CI tests. How do you think about it? Run by four build of postgres or sqlite x with or without search-lindera? Thanks.
codecov[bot] commented 6 months ago (Migrated from github.com)
Owner

Codecov Report

Merging #776 into master will increase coverage by 0.01%.
The diff coverage is 80.35%.

@@            Coverage Diff             @@
##           master     #776      +/-   ##
==========================================
+ Coverage   39.08%   39.09%   +0.01%     
==========================================
  Files          73       73              
  Lines        9736     9756      +20     
  Branches     2227     2233       +6     
==========================================
+ Hits         3805     3814       +9     
- Misses       4879     4886       +7     
- Partials     1052     1056       +4     
# [Codecov](https://codecov.io/gh/Plume-org/Plume/pull/776?src=pr&el=h1) Report > Merging [#776](https://codecov.io/gh/Plume-org/Plume/pull/776?src=pr&el=desc) into [master](https://codecov.io/gh/Plume-org/Plume/commit/ef70cb93e6d9457355bce4f6dae485c700bb07c6&el=desc) will **increase** coverage by `0.01%`. > The diff coverage is `80.35%`. ```diff @@ Coverage Diff @@ ## master #776 +/- ## ========================================== + Coverage 39.08% 39.09% +0.01% ========================================== Files 73 73 Lines 9736 9756 +20 Branches 2227 2233 +6 ========================================== + Hits 3805 3814 +9 - Misses 4879 4886 +7 - Partials 1052 1056 +4 ```
elegaanz (Migrated from github.com) reviewed 5 months ago
elegaanz (Migrated from github.com) left a comment

Sorry I didn’t reviewed that earlier.

@@ -191,0 +213,4 @@
),
property_tokenizer: Ngram,
}
}
Plume_migration_agent commented 5 months ago

Does this mean that search can only work with one language that the admin chooses?

Does this mean that search can only work with one language that the admin chooses?
KitaitiMakoto (Migrated from github.com) reviewed 5 months ago
@@ -191,0 +213,4 @@
),
property_tokenizer: Ngram,
}
}
Plume_migration_agent commented 5 months ago

Yes, this does.

There are two things to do for supporting multiple languages:

  • making Plume possible to accept multiple language search configuration
  • makeing it possible to detect language of query automatically or by user selection

I think the latter thing is hard to solve. Detecting language automatically from a few search words is technically difficult. For users, selecting language every time they search is bothering.

Yes, this does. There are two things to do for supporting multiple languages: * making Plume possible to accept multiple language search configuration * makeing it possible to detect language of query automatically or by user selection I think the latter thing is hard to solve. Detecting language automatically from a few search words is technically difficult. For users, selecting language every time they search is bothering.
elegaanz (Migrated from github.com) reviewed 5 months ago
@@ -191,0 +213,4 @@
),
property_tokenizer: Ngram,
}
}
Plume_migration_agent commented 5 months ago

We could default to the interface language maybe? And let user change it of needed. Or remember the last searched language so that you only have to change it only once.

Having only one language/alphabet supported by search seems more inconvenient to me...

We could default to the interface language maybe? And let user change it of needed. Or remember the last searched language so that you only have to change it only once. Having only one language/alphabet supported by search seems more inconvenient to me...
KitaitiMakoto (Migrated from github.com) reviewed 5 months ago
@@ -191,0 +213,4 @@
),
property_tokenizer: Ngram,
}
}
Plume_migration_agent commented 5 months ago

At first, let me explain this pull request. One of essential parts of this pull request is that we become able to choose search tokenizers via environment variables(another part is introducing Lindera). This is achieved by env vars SEARCH_TAG_TOKENIZER and SEARCH_CONTENT_TOKENIZER. SEARCH_LANG is just a shortcut for combinations of those env vars.
If you don’t set any env vars of SEARCH_TAG_TOKENIZER, SEARCH_CONTENT_TOKENIZER and SEARCH_LANG, Plume behaves as always. Therefore, nothing will be lost if you don’t set those env vars.

Accepting only one SEARCH_LANG might be inconvenient as you say. Setting default interface language and remembering the last lang are possible. But allowing both them and specifying tokenizers introduces complexity. It makes SEARCH_LANG more than just a shortcut and it requires multiple index directories.

Those are worthy to work. But is one search lang a good start point if it doesn’t lost anything? I don’t think it’s good for each pull request to be bigger in general.

At first, let me explain this pull request. One of essential parts of this pull request is that we become able to choose search *tokenizers* via environment variables(another part is introducing Lindera). This is achieved by env vars `SEARCH_TAG_TOKENIZER` and `SEARCH_CONTENT_TOKENIZER`. `SEARCH_LANG` is just a shortcut for combinations of those env vars. If you don't set any env vars of `SEARCH_TAG_TOKENIZER`, `SEARCH_CONTENT_TOKENIZER` and `SEARCH_LANG`, Plume behaves as always. Therefore, nothing will be lost if you don't set those env vars. Accepting only one `SEARCH_LANG` might be inconvenient as you say. Setting default interface language and remembering the last lang are possible. But allowing both them and specifying tokenizers introduces complexity. It makes `SEARCH_LANG` more than just a shortcut and it requires multiple index directories. Those are worthy to work. But is one search lang a good start point if it doesn't lost anything? I don't think it's good for each pull request to be bigger in general.
elegaanz (Migrated from github.com) reviewed 5 months ago
@@ -191,0 +213,4 @@
),
property_tokenizer: Ngram,
}
}
Plume_migration_agent commented 5 months ago

OK, it would indeed add a lot of complexity. Let’s do it this way!

Could you please make a PR to the documentation to document the new variables please?

OK, it would indeed add a lot of complexity. Let's do it this way! Could you please make a PR to [the documentation](https://github.com/Plume-org/docs/) to document the new variables please?
elegaanz (Migrated from github.com) approved these changes 5 months ago
elegaanz commented 5 months ago (Migrated from github.com)
Owner

One test keeps failing… Did you tested your branch with both SQLite and PostgreSQL?

One test keeps failing… Did you tested your branch with both SQLite and PostgreSQL?
KitaitiMakoto commented 5 months ago (Migrated from github.com)
Owner

I’m sorry, I didn’t. But now my dev env is broken. Can you wait a little bit?

I'm sorry, I didn't. But now my dev env is broken. Can you wait a little bit?
igalic commented 5 months ago (Migrated from github.com)
Owner

we have all the time in the world

we have all the time in the world
trinity-1686a commented 5 months ago
Owner

I was bugged by penultimate commit passing the CI, but a simple typo fix fail, and the error looked more like test script not able to connect to Selenium (thing that allow to run tests inside an actual web browser) than an actual error, so I reran it. Apparently CI is back at it again, failing for no obvious reason.
Anyway all tests passed 👍

I was bugged by penultimate commit passing the CI, but a simple typo fix fail, and the error looked more like test script not able to connect to Selenium (thing that allow to run tests inside an actual web browser) than an actual error, so I reran it. Apparently CI is back at it again, failing for no obvious reason. Anyway all tests passed :+1:
elegaanz commented 5 months ago (Migrated from github.com)
Owner

OK, no need to do more tests @KitaitiMakoto then, LGTM! Thank you @trinity-1686a for restarting the job, I tried once too, but it didn’t helped. Computers are weird.

OK, no need to do more tests @KitaitiMakoto then, LGTM! Thank you @trinity-1686a for restarting the job, I tried once too, but it didn't helped. Computers are weird.
elegaanz commented 5 months ago (Migrated from github.com)
Owner

@KitaitiMakoto do you want me to write the documentation for this feature, or do you want to do it yourself?

@KitaitiMakoto do you want me to write the documentation for this feature, or do you want to do it yourself?
KitaitiMakoto commented 5 months ago (Migrated from github.com)
Owner

Thank you, @trinity-1686a and @elegaanz for researching and rerunning CI!

@KitaitiMakoto do you want me to write the documentation for this feature, or do you want to do it yourself?

I want to write myself. But I’m a little bit busy now. If you hurry, can you write it?

Thank you, @trinity-1686a and @elegaanz for researching and rerunning CI! > @KitaitiMakoto do you want me to write the documentation for this feature, or do you want to do it yourself? I want to write myself. But I'm a little bit busy now. If you hurry, can you write it?
elegaanz commented 5 months ago (Migrated from github.com)
Owner

Take your time, don’t worry. 🙂

Take your time, don't worry. :slightly_smiling_face:

Reviewers

Plume_migration_agent approved these changes 5 months ago
The pull request has been merged as 92a386277b.
Sign in to join this conversation.
No Milestone
No Assignees
2 Participants
Notifications
Due Date

No due date set.

Dependencies

This pull request currently doesn't have any dependencies.

Loading…
There is no content yet.