Switchable tokenizer #776

Sammanfogat
KitaitiMakoto sammanfogade 20 incheckningar från switchable-tokenizer in i master 4 år sedan
KitaitiMakoto kommenterad 4 år sedan (Migrerad från github.com)

Hi,

This is a pull request for Japanese full-text search feature. I made it possible to:

  • switch search tokenizer by environment variable
  • use Lindera morphological analysis library for Japanese full-text search
    • I extracted this as search-lindera feature so that users can choose not to include it, which has 10MiB dictionary

But I did not work with CI tests. How do you think about it? Run by four build of postgres or sqlite x with or without search-lindera?

Thanks.

Hi, This is a pull request for Japanese full-text search feature. I made it possible to: * switch search tokenizer by environment variable * use Lindera morphological analysis library for Japanese full-text search * I extracted this as `search-lindera` feature so that users can choose not to include it, which has 10MiB dictionary But I did not work with CI tests. How do you think about it? Run by four build of postgres or sqlite x with or without search-lindera? Thanks.
codecov[bot] kommenterad 4 år sedan (Migrerad från github.com)

Codecov Report

Merging #776 into master will increase coverage by 0.01%.
The diff coverage is 80.35%.

@@            Coverage Diff             @@
##           master     #776      +/-   ##
==========================================
+ Coverage   39.08%   39.09%   +0.01%     
==========================================
  Files          73       73              
  Lines        9736     9756      +20     
  Branches     2227     2233       +6     
==========================================
+ Hits         3805     3814       +9     
- Misses       4879     4886       +7     
- Partials     1052     1056       +4     
# [Codecov](https://codecov.io/gh/Plume-org/Plume/pull/776?src=pr&el=h1) Report > Merging [#776](https://codecov.io/gh/Plume-org/Plume/pull/776?src=pr&el=desc) into [master](https://codecov.io/gh/Plume-org/Plume/commit/ef70cb93e6d9457355bce4f6dae485c700bb07c6&el=desc) will **increase** coverage by `0.01%`. > The diff coverage is `80.35%`. ```diff @@ Coverage Diff @@ ## master #776 +/- ## ========================================== + Coverage 39.08% 39.09% +0.01% ========================================== Files 73 73 Lines 9736 9756 +20 Branches 2227 2233 +6 ========================================== + Hits 3805 3814 +9 - Misses 4879 4886 +7 - Partials 1052 1056 +4 ```
elegaanz (Migrerad från github.com) granskad av 4 år sedan
elegaanz (Migrerad från github.com) lämnade en kommentar

Sorry I didn't reviewed that earlier.

Sorry I didn't reviewed that earlier.
@ -191,0 +213,4 @@
),
property_tokenizer: Ngram,
}
}
elegaanz (Migrerad från github.com) kommenterad 4 år sedan

Does this mean that search can only work with one language that the admin chooses?

Does this mean that search can only work with one language that the admin chooses?
KitaitiMakoto (Migrerad från github.com) granskad av 4 år sedan
@ -191,0 +213,4 @@
),
property_tokenizer: Ngram,
}
}
KitaitiMakoto (Migrerad från github.com) kommenterad 4 år sedan

Yes, this does.

There are two things to do for supporting multiple languages:

  • making Plume possible to accept multiple language search configuration
  • makeing it possible to detect language of query automatically or by user selection

I think the latter thing is hard to solve. Detecting language automatically from a few search words is technically difficult. For users, selecting language every time they search is bothering.

Yes, this does. There are two things to do for supporting multiple languages: * making Plume possible to accept multiple language search configuration * makeing it possible to detect language of query automatically or by user selection I think the latter thing is hard to solve. Detecting language automatically from a few search words is technically difficult. For users, selecting language every time they search is bothering.
elegaanz (Migrerad från github.com) granskad av 4 år sedan
@ -191,0 +213,4 @@
),
property_tokenizer: Ngram,
}
}
elegaanz (Migrerad från github.com) kommenterad 4 år sedan

We could default to the interface language maybe? And let user change it of needed. Or remember the last searched language so that you only have to change it only once.

Having only one language/alphabet supported by search seems more inconvenient to me...

We could default to the interface language maybe? And let user change it of needed. Or remember the last searched language so that you only have to change it only once. Having only one language/alphabet supported by search seems more inconvenient to me...
KitaitiMakoto (Migrerad från github.com) granskad av 4 år sedan
@ -191,0 +213,4 @@
),
property_tokenizer: Ngram,
}
}
KitaitiMakoto (Migrerad från github.com) kommenterad 4 år sedan

At first, let me explain this pull request. One of essential parts of this pull request is that we become able to choose search tokenizers via environment variables(another part is introducing Lindera). This is achieved by env vars SEARCH_TAG_TOKENIZER and SEARCH_CONTENT_TOKENIZER. SEARCH_LANG is just a shortcut for combinations of those env vars.
If you don't set any env vars of SEARCH_TAG_TOKENIZER, SEARCH_CONTENT_TOKENIZER and SEARCH_LANG, Plume behaves as always. Therefore, nothing will be lost if you don't set those env vars.

Accepting only one SEARCH_LANG might be inconvenient as you say. Setting default interface language and remembering the last lang are possible. But allowing both them and specifying tokenizers introduces complexity. It makes SEARCH_LANG more than just a shortcut and it requires multiple index directories.

Those are worthy to work. But is one search lang a good start point if it doesn't lost anything? I don't think it's good for each pull request to be bigger in general.

At first, let me explain this pull request. One of essential parts of this pull request is that we become able to choose search *tokenizers* via environment variables(another part is introducing Lindera). This is achieved by env vars `SEARCH_TAG_TOKENIZER` and `SEARCH_CONTENT_TOKENIZER`. `SEARCH_LANG` is just a shortcut for combinations of those env vars. If you don't set any env vars of `SEARCH_TAG_TOKENIZER`, `SEARCH_CONTENT_TOKENIZER` and `SEARCH_LANG`, Plume behaves as always. Therefore, nothing will be lost if you don't set those env vars. Accepting only one `SEARCH_LANG` might be inconvenient as you say. Setting default interface language and remembering the last lang are possible. But allowing both them and specifying tokenizers introduces complexity. It makes `SEARCH_LANG` more than just a shortcut and it requires multiple index directories. Those are worthy to work. But is one search lang a good start point if it doesn't lost anything? I don't think it's good for each pull request to be bigger in general.
elegaanz (Migrerad från github.com) granskad av 4 år sedan
@ -191,0 +213,4 @@
),
property_tokenizer: Ngram,
}
}
elegaanz (Migrerad från github.com) kommenterad 4 år sedan

OK, it would indeed add a lot of complexity. Let's do it this way!

Could you please make a PR to the documentation to document the new variables please?

OK, it would indeed add a lot of complexity. Let's do it this way! Could you please make a PR to [the documentation](https://github.com/Plume-org/docs/) to document the new variables please?
elegaanz (Migrerad från github.com) godkände dessa ändringar 4 år sedan
elegaanz kommenterad 4 år sedan (Migrerad från github.com)

One test keeps failing… Did you tested your branch with both SQLite and PostgreSQL?

One test keeps failing… Did you tested your branch with both SQLite and PostgreSQL?
KitaitiMakoto kommenterad 4 år sedan (Migrerad från github.com)

I'm sorry, I didn't. But now my dev env is broken. Can you wait a little bit?

I'm sorry, I didn't. But now my dev env is broken. Can you wait a little bit?
igalic kommenterad 4 år sedan (Migrerad från github.com)

we have all the time in the world

we have all the time in the world
Ägare

I was bugged by penultimate commit passing the CI, but a simple typo fix fail, and the error looked more like test script not able to connect to Selenium (thing that allow to run tests inside an actual web browser) than an actual error, so I reran it. Apparently CI is back at it again, failing for no obvious reason.
Anyway all tests passed 👍

I was bugged by penultimate commit passing the CI, but a simple typo fix fail, and the error looked more like test script not able to connect to Selenium (thing that allow to run tests inside an actual web browser) than an actual error, so I reran it. Apparently CI is back at it again, failing for no obvious reason. Anyway all tests passed :+1:
elegaanz kommenterad 4 år sedan (Migrerad från github.com)

OK, no need to do more tests @KitaitiMakoto then, LGTM! Thank you @trinity-1686a for restarting the job, I tried once too, but it didn't helped. Computers are weird.

OK, no need to do more tests @KitaitiMakoto then, LGTM! Thank you @trinity-1686a for restarting the job, I tried once too, but it didn't helped. Computers are weird.
elegaanz kommenterad 4 år sedan (Migrerad från github.com)

@KitaitiMakoto do you want me to write the documentation for this feature, or do you want to do it yourself?

@KitaitiMakoto do you want me to write the documentation for this feature, or do you want to do it yourself?
KitaitiMakoto kommenterad 4 år sedan (Migrerad från github.com)

Thank you, @trinity-1686a and @elegaanz for researching and rerunning CI!

@KitaitiMakoto do you want me to write the documentation for this feature, or do you want to do it yourself?

I want to write myself. But I'm a little bit busy now. If you hurry, can you write it?

Thank you, @trinity-1686a and @elegaanz for researching and rerunning CI! > @KitaitiMakoto do you want me to write the documentation for this feature, or do you want to do it yourself? I want to write myself. But I'm a little bit busy now. If you hurry, can you write it?
elegaanz kommenterad 4 år sedan (Migrerad från github.com)

Take your time, don't worry. 🙂

Take your time, don't worry. :slightly_smiling_face:

Granskare

Pull-förfrågan har sammanfogats som 92a386277b.
You can also view command line instructions.

Step 1:

From your project repository, check out a new branch and test the changes.
git checkout -b switchable-tokenizer master
git pull origin switchable-tokenizer

Step 2:

Merge the changes and update on Forgejo.
git checkout master
git merge --no-ff switchable-tokenizer
git push origin master
Logga in för att delta i denna konversation.
Inga granskare
Ingen Milsten
Ingen tilldelad
2 Deltagare
Notiser
Förfallodatum
Förfallodatumet är ogiltigt eller utanför gränserna. Använd formatet 'åååå-mm-dd'.

Inget förfallodatum satt.

Beroenden

No dependencies set.

Reference: Plume/Plume#776
Laddar…
Det finns inget innehåll än.