Switchable tokenizer #776

Fusionado
KitaitiMakoto fusionados 20 commits de switchable-tokenizer en master hace 4 años
KitaitiMakoto comentado hace 4 años (Migrado desde github.com)

Hi,

This is a pull request for Japanese full-text search feature. I made it possible to:

  • switch search tokenizer by environment variable
  • use Lindera morphological analysis library for Japanese full-text search
    • I extracted this as search-lindera feature so that users can choose not to include it, which has 10MiB dictionary

But I did not work with CI tests. How do you think about it? Run by four build of postgres or sqlite x with or without search-lindera?

Thanks.

Hi, This is a pull request for Japanese full-text search feature. I made it possible to: * switch search tokenizer by environment variable * use Lindera morphological analysis library for Japanese full-text search * I extracted this as `search-lindera` feature so that users can choose not to include it, which has 10MiB dictionary But I did not work with CI tests. How do you think about it? Run by four build of postgres or sqlite x with or without search-lindera? Thanks.
codecov[bot] comentado hace 4 años (Migrado desde github.com)

Codecov Report

Merging #776 into master will increase coverage by 0.01%.
The diff coverage is 80.35%.

@@            Coverage Diff             @@
##           master     #776      +/-   ##
==========================================
+ Coverage   39.08%   39.09%   +0.01%     
==========================================
  Files          73       73              
  Lines        9736     9756      +20     
  Branches     2227     2233       +6     
==========================================
+ Hits         3805     3814       +9     
- Misses       4879     4886       +7     
- Partials     1052     1056       +4     
# [Codecov](https://codecov.io/gh/Plume-org/Plume/pull/776?src=pr&el=h1) Report > Merging [#776](https://codecov.io/gh/Plume-org/Plume/pull/776?src=pr&el=desc) into [master](https://codecov.io/gh/Plume-org/Plume/commit/ef70cb93e6d9457355bce4f6dae485c700bb07c6&el=desc) will **increase** coverage by `0.01%`. > The diff coverage is `80.35%`. ```diff @@ Coverage Diff @@ ## master #776 +/- ## ========================================== + Coverage 39.08% 39.09% +0.01% ========================================== Files 73 73 Lines 9736 9756 +20 Branches 2227 2233 +6 ========================================== + Hits 3805 3814 +9 - Misses 4879 4886 +7 - Partials 1052 1056 +4 ```
elegaanz (Migrado desde github.com) revisado hace 4 años
elegaanz (Migrado desde github.com) dejó un comentario

Sorry I didn't reviewed that earlier.

Sorry I didn't reviewed that earlier.
@ -191,0 +213,4 @@
),
property_tokenizer: Ngram,
}
}
elegaanz (Migrado desde github.com) comentado hace 4 años

Does this mean that search can only work with one language that the admin chooses?

Does this mean that search can only work with one language that the admin chooses?
KitaitiMakoto (Migrado desde github.com) revisado hace 4 años
@ -191,0 +213,4 @@
),
property_tokenizer: Ngram,
}
}
KitaitiMakoto (Migrado desde github.com) comentado hace 4 años

Yes, this does.

There are two things to do for supporting multiple languages:

  • making Plume possible to accept multiple language search configuration
  • makeing it possible to detect language of query automatically or by user selection

I think the latter thing is hard to solve. Detecting language automatically from a few search words is technically difficult. For users, selecting language every time they search is bothering.

Yes, this does. There are two things to do for supporting multiple languages: * making Plume possible to accept multiple language search configuration * makeing it possible to detect language of query automatically or by user selection I think the latter thing is hard to solve. Detecting language automatically from a few search words is technically difficult. For users, selecting language every time they search is bothering.
elegaanz (Migrado desde github.com) revisado hace 4 años
@ -191,0 +213,4 @@
),
property_tokenizer: Ngram,
}
}
elegaanz (Migrado desde github.com) comentado hace 4 años

We could default to the interface language maybe? And let user change it of needed. Or remember the last searched language so that you only have to change it only once.

Having only one language/alphabet supported by search seems more inconvenient to me...

We could default to the interface language maybe? And let user change it of needed. Or remember the last searched language so that you only have to change it only once. Having only one language/alphabet supported by search seems more inconvenient to me...
KitaitiMakoto (Migrado desde github.com) revisado hace 4 años
@ -191,0 +213,4 @@
),
property_tokenizer: Ngram,
}
}
KitaitiMakoto (Migrado desde github.com) comentado hace 4 años

At first, let me explain this pull request. One of essential parts of this pull request is that we become able to choose search tokenizers via environment variables(another part is introducing Lindera). This is achieved by env vars SEARCH_TAG_TOKENIZER and SEARCH_CONTENT_TOKENIZER. SEARCH_LANG is just a shortcut for combinations of those env vars.
If you don't set any env vars of SEARCH_TAG_TOKENIZER, SEARCH_CONTENT_TOKENIZER and SEARCH_LANG, Plume behaves as always. Therefore, nothing will be lost if you don't set those env vars.

Accepting only one SEARCH_LANG might be inconvenient as you say. Setting default interface language and remembering the last lang are possible. But allowing both them and specifying tokenizers introduces complexity. It makes SEARCH_LANG more than just a shortcut and it requires multiple index directories.

Those are worthy to work. But is one search lang a good start point if it doesn't lost anything? I don't think it's good for each pull request to be bigger in general.

At first, let me explain this pull request. One of essential parts of this pull request is that we become able to choose search *tokenizers* via environment variables(another part is introducing Lindera). This is achieved by env vars `SEARCH_TAG_TOKENIZER` and `SEARCH_CONTENT_TOKENIZER`. `SEARCH_LANG` is just a shortcut for combinations of those env vars. If you don't set any env vars of `SEARCH_TAG_TOKENIZER`, `SEARCH_CONTENT_TOKENIZER` and `SEARCH_LANG`, Plume behaves as always. Therefore, nothing will be lost if you don't set those env vars. Accepting only one `SEARCH_LANG` might be inconvenient as you say. Setting default interface language and remembering the last lang are possible. But allowing both them and specifying tokenizers introduces complexity. It makes `SEARCH_LANG` more than just a shortcut and it requires multiple index directories. Those are worthy to work. But is one search lang a good start point if it doesn't lost anything? I don't think it's good for each pull request to be bigger in general.
elegaanz (Migrado desde github.com) revisado hace 4 años
@ -191,0 +213,4 @@
),
property_tokenizer: Ngram,
}
}
elegaanz (Migrado desde github.com) comentado hace 4 años

OK, it would indeed add a lot of complexity. Let's do it this way!

Could you please make a PR to the documentation to document the new variables please?

OK, it would indeed add a lot of complexity. Let's do it this way! Could you please make a PR to [the documentation](https://github.com/Plume-org/docs/) to document the new variables please?
elegaanz (Migrado desde github.com) aprobado estos cambios hace 4 años
elegaanz comentado hace 4 años (Migrado desde github.com)

One test keeps failing… Did you tested your branch with both SQLite and PostgreSQL?

One test keeps failing… Did you tested your branch with both SQLite and PostgreSQL?
KitaitiMakoto comentado hace 4 años (Migrado desde github.com)

I'm sorry, I didn't. But now my dev env is broken. Can you wait a little bit?

I'm sorry, I didn't. But now my dev env is broken. Can you wait a little bit?
igalic comentado hace 4 años (Migrado desde github.com)

we have all the time in the world

we have all the time in the world
Propietario

I was bugged by penultimate commit passing the CI, but a simple typo fix fail, and the error looked more like test script not able to connect to Selenium (thing that allow to run tests inside an actual web browser) than an actual error, so I reran it. Apparently CI is back at it again, failing for no obvious reason.
Anyway all tests passed 👍

I was bugged by penultimate commit passing the CI, but a simple typo fix fail, and the error looked more like test script not able to connect to Selenium (thing that allow to run tests inside an actual web browser) than an actual error, so I reran it. Apparently CI is back at it again, failing for no obvious reason. Anyway all tests passed :+1:
elegaanz comentado hace 4 años (Migrado desde github.com)

OK, no need to do more tests @KitaitiMakoto then, LGTM! Thank you @trinity-1686a for restarting the job, I tried once too, but it didn't helped. Computers are weird.

OK, no need to do more tests @KitaitiMakoto then, LGTM! Thank you @trinity-1686a for restarting the job, I tried once too, but it didn't helped. Computers are weird.
elegaanz comentado hace 4 años (Migrado desde github.com)

@KitaitiMakoto do you want me to write the documentation for this feature, or do you want to do it yourself?

@KitaitiMakoto do you want me to write the documentation for this feature, or do you want to do it yourself?
KitaitiMakoto comentado hace 4 años (Migrado desde github.com)

Thank you, @trinity-1686a and @elegaanz for researching and rerunning CI!

@KitaitiMakoto do you want me to write the documentation for this feature, or do you want to do it yourself?

I want to write myself. But I'm a little bit busy now. If you hurry, can you write it?

Thank you, @trinity-1686a and @elegaanz for researching and rerunning CI! > @KitaitiMakoto do you want me to write the documentation for this feature, or do you want to do it yourself? I want to write myself. But I'm a little bit busy now. If you hurry, can you write it?
elegaanz comentado hace 4 años (Migrado desde github.com)

Take your time, don't worry. 🙂

Take your time, don't worry. :slightly_smiling_face:

Revisores

El Pull Request se ha fusionado como 92a386277b.

Paso 1:

Desde el repositorio de su proyecto, revisa una nueva rama y prueba los cambios.
git checkout -b switchable-tokenizer master
git pull origin switchable-tokenizer

Paso 2:

Combine los cambios y actualice en Forgejo.
git checkout master
git merge --no-ff switchable-tokenizer
git push origin master
Inicie sesión para unirse a esta conversación.
No hay revisores
Sin Milestone
No asignados
2 participantes
Notificaciones
Fecha de vencimiento
La fecha de vencimiento es inválida o está fuera de rango. Por favor utilice el formato 'aaaa-mm-dd'.

Sin fecha de vencimiento.

Dependencias

No se han establecido dependencias.

Referencia: Plume/Plume#776
Cargando…
Aún no existe contenido.