Switchable tokenizer #776
Reference: Plume/Plume#776
Hi,
This is a pull request for a Japanese full-text search feature. I made it possible to:
search-lindera feature so that users can choose not to include it, since it comes with a 10 MiB dictionary
But I have not worked on the CI tests. What do you think about it? Should CI run four builds: PostgreSQL or SQLite, each with or without search-lindera?
Thanks.
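To make the optional feature and the four-build question concrete, here is a minimal sketch. Only the feature name search-lindera and the two database backends come from the description above; the function names and the tokenizer strings are hypothetical and are not taken from the Plume code base.

```rust
/// Tokenizer names available in this build (hypothetical helper).
fn available_tokenizers() -> Vec<&'static str> {
    let mut names = vec!["ngram"];
    // `cfg!` keeps this sketch compilable on its own. A real optional
    // dependency would be gated with `#[cfg(feature = "search-lindera")]`
    // so that the dictionary is not compiled into default builds at all.
    if cfg!(feature = "search-lindera") {
        names.push("lindera");
    }
    names
}

/// The four CI combinations: database backend x with/without the feature.
fn ci_matrix() -> Vec<(&'static str, bool)> {
    let mut matrix = Vec::new();
    for db in ["postgres", "sqlite"] {
        for with_lindera in [false, true] {
            matrix.push((db, with_lindera));
        }
    }
    matrix
}

fn main() {
    println!("tokenizers in this build: {:?}", available_tokenizers());
    for (db, with_lindera) in ci_matrix() {
        println!("build: db={db}, search-lindera={with_lindera}");
    }
}
```

Built without the feature, only the default tokenizer name is listed; the matrix always has four entries, one per CI job.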
Sorry I didn't review that earlier.
@@ -191,0 +213,4 @@
),
property_tokenizer: Ngram,
}
}
Does this mean that search can only work with one language that the admin chooses?
Yes, it does.
There are two things to do for supporting multiple languages:
I think the latter is hard to solve. Detecting the language automatically from a few search words is technically difficult. For users, selecting a language every time they search is bothersome.
We could default to the interface language maybe? And let the user change it if needed. Or remember the last searched language so that you only have to change it once.
Having only one language/alphabet supported by search seems more inconvenient to me...
First, let me explain this pull request. One of its essential parts is that we become able to choose search tokenizers via environment variables (another part is introducing Lindera). This is achieved by the env vars SEARCH_TAG_TOKENIZER and SEARCH_CONTENT_TOKENIZER. SEARCH_LANG is just a shortcut for combinations of those env vars.
If you don't set any of SEARCH_TAG_TOKENIZER, SEARCH_CONTENT_TOKENIZER and SEARCH_LANG, Plume behaves as always. Therefore, nothing will be lost if you don't set those env vars.
Accepting only one SEARCH_LANG might be inconvenient, as you say. Defaulting to the interface language and remembering the last language are possible. But allowing both of them and specifying tokenizers introduces complexity. It makes SEARCH_LANG more than just a shortcut and it requires multiple index directories.
Those are worth working on. But isn't one search language a good starting point if nothing is lost? In general, I don't think it's good for a pull request to get bigger.
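As an illustration of how these variables could interact, here is a minimal sketch, not the code in this pull request. The three variable names come from the explanation above. The type names, the accepted values ("simple", "ngram", "lindera", "ja"), the defaults, and the rule that the explicit variables override SEARCH_LANG are all assumptions made for the example.

```rust
use std::env;

/// Hypothetical set of tokenizer choices.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenizerKind {
    Simple,
    Ngram,
    Lindera,
}

impl TokenizerKind {
    /// Parse a tokenizer name; the accepted spellings are assumptions.
    fn parse(name: &str) -> Option<Self> {
        match name.to_ascii_lowercase().as_str() {
            "simple" => Some(Self::Simple),
            "ngram" => Some(Self::Ngram),
            "lindera" => Some(Self::Lindera),
            _ => None,
        }
    }
}

#[derive(Debug, PartialEq, Eq)]
struct SearchTokenizerConfig {
    tag_tokenizer: TokenizerKind,
    content_tokenizer: TokenizerKind,
}

/// Resolve the configuration from a key lookup (normally the environment).
fn resolve_config<F>(lookup: F) -> SearchTokenizerConfig
where
    F: Fn(&str) -> Option<String>,
{
    // With nothing set, keep the previous behaviour (assumed defaults).
    let mut config = SearchTokenizerConfig {
        tag_tokenizer: TokenizerKind::Simple,
        content_tokenizer: TokenizerKind::Simple,
    };
    // SEARCH_LANG is only a shortcut that selects a preset combination.
    if let Some(lang) = lookup("SEARCH_LANG") {
        if lang.eq_ignore_ascii_case("ja") {
            config.tag_tokenizer = TokenizerKind::Ngram;
            config.content_tokenizer = TokenizerKind::Lindera;
        }
    }
    // Explicit variables win over the shortcut (assumed precedence).
    if let Some(kind) = lookup("SEARCH_TAG_TOKENIZER").and_then(|v| TokenizerKind::parse(&v)) {
        config.tag_tokenizer = kind;
    }
    if let Some(kind) = lookup("SEARCH_CONTENT_TOKENIZER").and_then(|v| TokenizerKind::parse(&v)) {
        config.content_tokenizer = kind;
    }
    config
}

fn main() {
    let config = resolve_config(|key| env::var(key).ok());
    println!("{:?}", config);
}
```

Passing the lookup as a closure keeps the resolution logic testable without touching the process environment. It also shows why a single SEARCH_LANG stays simple: one resolved configuration maps to one index directory.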
OK, it would indeed add a lot of complexity. Let's do it this way!
Could you please make a PR to the documentation to document the new variables?
One test keeps failing… Did you test your branch with both SQLite and PostgreSQL?
I'm sorry, I didn't. But now my dev env is broken. Can you wait a little bit?
we have all the time in the world
I was bugged by the penultimate commit passing CI while a simple typo fix failed, and the error looked more like the test script being unable to connect to Selenium (the thing that allows running tests inside an actual web browser) than an actual error, so I reran it. Apparently CI is back at it again, failing for no obvious reason.
Anyway all tests passed 👍
OK, no need to do more tests then, @KitaitiMakoto. LGTM! Thank you @trinity-1686a for restarting the job. I tried once too, but it didn't help. Computers are weird.
@KitaitiMakoto do you want me to write the documentation for this feature, or do you want to do it yourself?
Thank you, @trinity-1686a and @elegaanz for researching and rerunning CI!
I want to write it myself. But I'm a little bit busy now. If you are in a hurry, could you write it?
Take your time, don't worry. 🙂