Switchable tokenizer #776

Sapludināts
KitaitiMakoto sapludināja 20 revīzijas no switchable-tokenizer uz master pirms 4 gadiem
KitaitiMakoto " komentēja pirms 4 gadiem" (Migrēts no github.com)

Hi,

This is a pull request for Japanese full-text search feature. I made it possible to:

  • switch search tokenizer by environment variable
  • use Lindera morphological analysis library for Japanese full-text search
    • I extracted this as search-lindera feature so that users can choose not to include it, which has 10MiB dictionary

But I did not work with CI tests. How do you think about it? Run by four build of postgres or sqlite x with or without search-lindera?

Thanks.

Hi, This is a pull request for Japanese full-text search feature. I made it possible to: * switch search tokenizer by environment variable * use Lindera morphological analysis library for Japanese full-text search * I extracted this as `search-lindera` feature so that users can choose not to include it, which has 10MiB dictionary But I did not work with CI tests. How do you think about it? Run by four build of postgres or sqlite x with or without search-lindera? Thanks.
codecov[bot] " komentēja pirms 4 gadiem" (Migrēts no github.com)

Codecov Report

Merging #776 into master will increase coverage by 0.01%.
The diff coverage is 80.35%.

@@            Coverage Diff             @@
##           master     #776      +/-   ##
==========================================
+ Coverage   39.08%   39.09%   +0.01%     
==========================================
  Files          73       73              
  Lines        9736     9756      +20     
  Branches     2227     2233       +6     
==========================================
+ Hits         3805     3814       +9     
- Misses       4879     4886       +7     
- Partials     1052     1056       +4     
# [Codecov](https://codecov.io/gh/Plume-org/Plume/pull/776?src=pr&el=h1) Report > Merging [#776](https://codecov.io/gh/Plume-org/Plume/pull/776?src=pr&el=desc) into [master](https://codecov.io/gh/Plume-org/Plume/commit/ef70cb93e6d9457355bce4f6dae485c700bb07c6&el=desc) will **increase** coverage by `0.01%`. > The diff coverage is `80.35%`. ```diff @@ Coverage Diff @@ ## master #776 +/- ## ========================================== + Coverage 39.08% 39.09% +0.01% ========================================== Files 73 73 Lines 9736 9756 +20 Branches 2227 2233 +6 ========================================== + Hits 3805 3814 +9 - Misses 4879 4886 +7 - Partials 1052 1056 +4 ```
elegaanz (Migrēts no github.com) recenzēja pirms 4 gadiem
elegaanz (Migrēts no github.com) atstāja komentāru

Sorry I didn't reviewed that earlier.

Sorry I didn't reviewed that earlier.
@ -191,0 +213,4 @@
),
property_tokenizer: Ngram,
}
}
elegaanz (Migrēts no github.com) " komentēja pirms 4 gadiem"

Does this mean that search can only work with one language that the admin chooses?

Does this mean that search can only work with one language that the admin chooses?
KitaitiMakoto (Migrēts no github.com) recenzēja pirms 4 gadiem
@ -191,0 +213,4 @@
),
property_tokenizer: Ngram,
}
}
KitaitiMakoto (Migrēts no github.com) " komentēja pirms 4 gadiem"

Yes, this does.

There are two things to do for supporting multiple languages:

  • making Plume possible to accept multiple language search configuration
  • makeing it possible to detect language of query automatically or by user selection

I think the latter thing is hard to solve. Detecting language automatically from a few search words is technically difficult. For users, selecting language every time they search is bothering.

Yes, this does. There are two things to do for supporting multiple languages: * making Plume possible to accept multiple language search configuration * makeing it possible to detect language of query automatically or by user selection I think the latter thing is hard to solve. Detecting language automatically from a few search words is technically difficult. For users, selecting language every time they search is bothering.
elegaanz (Migrēts no github.com) recenzēja pirms 4 gadiem
@ -191,0 +213,4 @@
),
property_tokenizer: Ngram,
}
}
elegaanz (Migrēts no github.com) " komentēja pirms 4 gadiem"

We could default to the interface language maybe? And let user change it of needed. Or remember the last searched language so that you only have to change it only once.

Having only one language/alphabet supported by search seems more inconvenient to me...

We could default to the interface language maybe? And let user change it of needed. Or remember the last searched language so that you only have to change it only once. Having only one language/alphabet supported by search seems more inconvenient to me...
KitaitiMakoto (Migrēts no github.com) recenzēja pirms 4 gadiem
@ -191,0 +213,4 @@
),
property_tokenizer: Ngram,
}
}
KitaitiMakoto (Migrēts no github.com) " komentēja pirms 4 gadiem"

At first, let me explain this pull request. One of essential parts of this pull request is that we become able to choose search tokenizers via environment variables(another part is introducing Lindera). This is achieved by env vars SEARCH_TAG_TOKENIZER and SEARCH_CONTENT_TOKENIZER. SEARCH_LANG is just a shortcut for combinations of those env vars.
If you don't set any env vars of SEARCH_TAG_TOKENIZER, SEARCH_CONTENT_TOKENIZER and SEARCH_LANG, Plume behaves as always. Therefore, nothing will be lost if you don't set those env vars.

Accepting only one SEARCH_LANG might be inconvenient as you say. Setting default interface language and remembering the last lang are possible. But allowing both them and specifying tokenizers introduces complexity. It makes SEARCH_LANG more than just a shortcut and it requires multiple index directories.

Those are worthy to work. But is one search lang a good start point if it doesn't lost anything? I don't think it's good for each pull request to be bigger in general.

At first, let me explain this pull request. One of essential parts of this pull request is that we become able to choose search *tokenizers* via environment variables(another part is introducing Lindera). This is achieved by env vars `SEARCH_TAG_TOKENIZER` and `SEARCH_CONTENT_TOKENIZER`. `SEARCH_LANG` is just a shortcut for combinations of those env vars. If you don't set any env vars of `SEARCH_TAG_TOKENIZER`, `SEARCH_CONTENT_TOKENIZER` and `SEARCH_LANG`, Plume behaves as always. Therefore, nothing will be lost if you don't set those env vars. Accepting only one `SEARCH_LANG` might be inconvenient as you say. Setting default interface language and remembering the last lang are possible. But allowing both them and specifying tokenizers introduces complexity. It makes `SEARCH_LANG` more than just a shortcut and it requires multiple index directories. Those are worthy to work. But is one search lang a good start point if it doesn't lost anything? I don't think it's good for each pull request to be bigger in general.
elegaanz (Migrēts no github.com) recenzēja pirms 4 gadiem
@ -191,0 +213,4 @@
),
property_tokenizer: Ngram,
}
}
elegaanz (Migrēts no github.com) " komentēja pirms 4 gadiem"

OK, it would indeed add a lot of complexity. Let's do it this way!

Could you please make a PR to the documentation to document the new variables please?

OK, it would indeed add a lot of complexity. Let's do it this way! Could you please make a PR to [the documentation](https://github.com/Plume-org/docs/) to document the new variables please?
elegaanz (Migrēts no github.com) apstiprināja izmaiņas pirms 4 gadiem
elegaanz " komentēja pirms 4 gadiem" (Migrēts no github.com)

One test keeps failing… Did you tested your branch with both SQLite and PostgreSQL?

One test keeps failing… Did you tested your branch with both SQLite and PostgreSQL?
KitaitiMakoto " komentēja pirms 4 gadiem" (Migrēts no github.com)

I'm sorry, I didn't. But now my dev env is broken. Can you wait a little bit?

I'm sorry, I didn't. But now my dev env is broken. Can you wait a little bit?
igalic " komentēja pirms 4 gadiem" (Migrēts no github.com)

we have all the time in the world

we have all the time in the world
Īpašnieks

I was bugged by penultimate commit passing the CI, but a simple typo fix fail, and the error looked more like test script not able to connect to Selenium (thing that allow to run tests inside an actual web browser) than an actual error, so I reran it. Apparently CI is back at it again, failing for no obvious reason.
Anyway all tests passed 👍

I was bugged by penultimate commit passing the CI, but a simple typo fix fail, and the error looked more like test script not able to connect to Selenium (thing that allow to run tests inside an actual web browser) than an actual error, so I reran it. Apparently CI is back at it again, failing for no obvious reason. Anyway all tests passed :+1:
elegaanz " komentēja pirms 4 gadiem" (Migrēts no github.com)

OK, no need to do more tests @KitaitiMakoto then, LGTM! Thank you @trinity-1686a for restarting the job, I tried once too, but it didn't helped. Computers are weird.

OK, no need to do more tests @KitaitiMakoto then, LGTM! Thank you @trinity-1686a for restarting the job, I tried once too, but it didn't helped. Computers are weird.
elegaanz " komentēja pirms 4 gadiem" (Migrēts no github.com)

@KitaitiMakoto do you want me to write the documentation for this feature, or do you want to do it yourself?

@KitaitiMakoto do you want me to write the documentation for this feature, or do you want to do it yourself?
KitaitiMakoto " komentēja pirms 4 gadiem" (Migrēts no github.com)

Thank you, @trinity-1686a and @elegaanz for researching and rerunning CI!

@KitaitiMakoto do you want me to write the documentation for this feature, or do you want to do it yourself?

I want to write myself. But I'm a little bit busy now. If you hurry, can you write it?

Thank you, @trinity-1686a and @elegaanz for researching and rerunning CI! > @KitaitiMakoto do you want me to write the documentation for this feature, or do you want to do it yourself? I want to write myself. But I'm a little bit busy now. If you hurry, can you write it?
elegaanz " komentēja pirms 4 gadiem" (Migrēts no github.com)

Take your time, don't worry. 🙂

Take your time, don't worry. :slightly_smiling_face:

Recenzenti

Izmaiņu pieprasījums tika sapludināts ar revīziju 92a386277b.
Varat aplūkot arī komandrindas instrukcijas.

Solis 1:

Projekta repozitorijā izveidojiet jaunu jaunu atzaru un pārbaudiet savas izmaiņas.
git checkout -b switchable-tokenizer master
git pull origin switchable-tokenizer

Solis 2:

Sapludināt izmaiņas un atjaunot tās Forgejo.
git checkout master
git merge --no-ff switchable-tokenizer
git push origin master
Pierakstieties, lai pievienotos šai sarunai.
Nav recenzentu
Nav atskaites punktu
Nav atbildīgo
2 dalībnieki
Paziņojumi
Izpildes termiņš
Datums līdz nav korekts. Izmantojiet formātu 'gggg-mm-dd'.

Izpildes termiņš nav uzstādīts.

Atkarības

Nav atkarību.

Atsaucas uz: Plume/Plume#776
Notiek ielāde…
Vēl nav satura.