Multi-language support #12
I was going to drop some Chinese and Thai in and submit a PR, but I saw this issue first, and I couldn't agree more. Having a `langs` param would be really important for getting rid of false positives, especially for languages written in something other than the Latin alphabet when they are romanized or abbreviated.
When reviewing the code to adapt it for Portuguese, I realized that integrating everything into the single .csv file would not make much sense. In my opinion, an alternative implementation method has several potential advantages and benefits.
I have little to no experience with language-based content, but I've seen that ISO 639 is the go-to web/API standard for language codes.

Two sub-issues come to mind: whether we need 3-letter codes at all, and whether we need localization (regional variants of a language). Although I'm pretty sure the answer is no for both questions in most cases, I would like to implement the ability to use both 3-letter codes and localization too. If we were to implement them, we could expose an endpoint that lists the supported `langs` params.
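A minimal sketch of what normalizing user-supplied codes could look like, assuming a hand-rolled ISO 639-1 → 639-3 lookup table (the table and function name here are hypothetical, not part of the project):

```ts
// Hypothetical lookup table: ISO 639-1 (2-letter) -> ISO 639-3 (3-letter).
const ISO_639_1_TO_3: Record<string, string> = {
  en: "eng", de: "deu", it: "ita", pt: "por", es: "spa", id: "ind",
};

// Normalize a user-supplied code ("en", "eng", "en-US") to a 3-letter code.
// Returns undefined for codes we don't recognize.
function normalizeLang(code: string): string | undefined {
  const base = code.toLowerCase().split("-")[0]; // drop any locale suffix
  if (base.length === 3) return base;
  return ISO_639_1_TO_3[base];
}
```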
Great discussion. From a user perspective I would find it confusing if there were multiple variations of the API. On the other side, the risk might be that a word in Spanish that looks similar to an English swear word gets flagged without meaning anything bad in Spanish. Then again, some profanities remain the same: the most popular English profanities also work in Spanish, I suppose, and we'd have to re-index all of them for the Spanish, Portuguese, etc. versions. So I propose just adding them all to the same database and seeing how that goes; the lang parameters make sense.
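One way the "same database" idea could look with the Upstash Vector client is tagging each record with a `lang` metadata field and filtering at query time. A sketch, assuming the `@upstash/vector` package and placeholder embeddings (the ids and metadata shape are made up for illustration):

```ts
import { Index } from "@upstash/vector";

const index = new Index({
  url: process.env.UPSTASH_VECTOR_REST_URL!,
  token: process.env.UPSTASH_VECTOR_REST_TOKEN!,
});

// Placeholder embedding; in practice this comes from an embedding model.
const embedding = Array(1536).fill(0);

// Store every profanity in the one shared index, tagged with its language.
await index.upsert({
  id: "ita-1",
  vector: embedding,
  metadata: { lang: "ita", text: "..." },
});

// Restrict a query to the requested languages with a metadata filter.
const matches = await index.query({
  vector: embedding,
  topK: 5,
  includeMetadata: true,
  filter: "lang = 'ita' OR lang = 'eng'",
});
```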
An example: the word "negro" in Italian is literally the n-word; in Spanish, however, it's just the color black, without any negative connotation whatsoever (as far as I know).
Yes, same thing in Portuguese.
It would be nice to have some input on this from the community for Asian and Arabic languages (on Wikipedia I saw multiple 3-letter codes grouped under the common "ar" macrolanguage).
I think an easier approach would be to just let the dev pass the `langs` parameter explicitly. For languages with English/German roots there are certain shared words (for example English and German share "shit", and probably more that you @joschan21 know better), but at least I can tell you that most profanities in Italian are... in Italian. You might hear someone say the n-word in English, but it's not extremely common. With that said, Italian is full of swear words, so it might also just be that we don't "need" English for profanities.

Also, I'm sorry to ask, but: what's the difference between "namespace" and "database"? Is the database the single training csv data?
I tried writing harsh words in Indonesian, but they weren't detected as toxic.
Quick update: namespace support was added last week to the Upstash JS API: upstash/vector-js#25

Edit: answering my old question: namespaces are a way to group data under a single index, in a similar fashion to metadata; however, contrary to metadata, the namespace is selectable on query. In other words: one database with groups.
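A minimal sketch of what per-language namespaces could look like with `@upstash/vector` (the namespace names and record shape are assumptions for illustration, not the project's actual schema):

```ts
import { Index } from "@upstash/vector";

const index = new Index({
  url: process.env.UPSTASH_VECTOR_REST_URL!,
  token: process.env.UPSTASH_VECTOR_REST_TOKEN!,
});

// One namespace per language: one index ("database"), separate groups.
const ita = index.namespace("ita");

// Placeholder embedding; in practice this comes from an embedding model.
const embedding = Array(1536).fill(0);

await ita.upsert({ id: "ita-1", vector: embedding, metadata: { text: "..." } });

// A query against the "ita" namespace only ever sees Italian training data.
const res = await ita.query({ vector: embedding, topK: 5, includeMetadata: true });
```

Compared to the metadata-filter approach above, namespaces do the partitioning server-side, so a `langs` param could map one-to-one onto namespace names.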
Also @joschan21, which model do you recommend for converting the raw text to vector data? I'm writing a guide in a README.md to keep track of what I did in order to get started, and I would like to know if you have a recommendation.
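For reference, one common option (not necessarily the author's recommendation) is a hosted embeddings API; a sketch using OpenAI's `text-embedding-3-small`:

```ts
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Convert raw text into a vector suitable for upserting into the index.
async function embed(text: string): Promise<number[]> {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return res.data[0].embedding;
}
```

Self-hosted sentence-transformer models are another option if avoiding a paid API matters; the main constraint is that the index dimension must match whatever model is chosen.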
While checking out PR #11 I realized that supporting multiple languages should be fairly easy to implement. However, I would allow an optional `langs` parameter to pass a list of languages (e.g. `eng,deu,ita`) and split each language into its own training data. If no `langs` param is passed, then all languages are checked.