Add fjaccard #42
Comments
Dear Kohei, Thank you very much for this and for all the work you are doing for quanteda. Really amazing! I saw the vignette, but I do not have enough sophistication with the notation to understand how that formula (which looks like the Tanimoto coefficient?) relates to the other variants of Jaccard. Min-max or Ruzicka is: I suppose I was hoping for more clarification on the mathematics rather than on the code implementation.
This package was originally created to replicate the proxy package for text analysis (for https://cran.r-project.org/web/packages/proxy/vignettes/overview.pdf
Oh yes, of course. They list both of them, so it makes sense that they are distinct coefficients. Sorry about that.
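For reference, the two coefficients discussed here can be written side by side. This is only a sketch: it assumes non-negative feature vectors $x$ and $y$ (symbols introduced here, not taken from the thread) and assumes that eJaccard takes the Tanimoto form listed in the proxy overview.

```latex
\[
s_{\text{min-max}}(x, y) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)},
\qquad
s_{\text{eJaccard}}(x, y) = \frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sum_i y_i^2 - \sum_i x_i y_i}
\]
```

On binary vectors both expressions reduce to the ordinary Jaccard coefficient, since $\min(x_i, y_i) = x_i y_i$ and $x_i^2 = x_i$ there; on weighted vectors they generally give different values, which is why they are listed as distinct coefficients.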
Dear Kohei, I forked your repo because I believed I could easily add this myself and then send a pull request. I think I made the change, but I'm not an expert in C++ and I'm stuck at loading the R package to test it. It seems a library is missing or not on the right path. The change I made was to add the following function to pair.cpp
and then of course add "fjaccard" as an option in the similarity functions. If the code above is correct and the change is quite small, would you mind adding it yourself? I'm not sure how long it would take me to figure out what's wrong with my path. This similarity measure is very important in stylometry, and I am developing a package for stylometry that depends on quanteda (https://github.com/andreanini/idiolect), so adding this to proxyC and/or quanteda would help many future users of my package and of quanteda.
I am developing the fuzzy Jaccard measure in

```r
v1 <- c(0.1, 0.2, 0.3, 0.9)
v2 <- c(0.3, 0.1, 0.2, 0.4)
sum(pmin(v1, v2)) / sum(pmax(v1, v2))
#> [1] 0.4705882
proxyC::simil(v1, v2, method = "fjaccard", margin = 2)
#> 1 x 1 sparse Matrix of class "dgTMatrix"
#>
#> [1,] 0.4705882
proxy::simil(v1, v2, method = "fjaccard", by_rows = FALSE)
#> [,1]
#> [1,] 0.6538462
1 - proxy::dist(v1, v2, method = "fjaccard", by_rows = FALSE)
#> [,1]
#> [1,] 0.4705882
```
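The numbers in the example above can be checked by hand. The arithmetic below uses only the two vectors from the example; the symbol $d$ for the distance is introduced here for convenience.

```latex
\[
\sum_i \min(v_{1i}, v_{2i}) = 0.1 + 0.1 + 0.2 + 0.4 = 0.8,
\qquad
\sum_i \max(v_{1i}, v_{2i}) = 0.3 + 0.2 + 0.3 + 0.9 = 1.7
\]
\[
\frac{0.8}{1.7} \approx 0.4705882,
\qquad
d = 1 - \frac{0.8}{1.7} \approx 0.5294118,
\qquad
\frac{1}{1 + d} = \frac{1.7}{2.6} \approx 0.6538462
\]
```

The last value coincides with the output of `proxy::simil()` in the example, which would be consistent with a distance-to-similarity conversion of the form $1/(1+d)$ rather than $1-d$. The thread does not state which conversion proxy applies, so this is an observation about the numbers only.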
I use
I think the C code for
Yeah, this should be an easy coefficient to transform. Thanks again!
Is the eJaccard in this package equivalent to the min-max similarity (aka Ruzicka Distance aka fuzzy Jaccard)?