Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Examples show boolean vectors, what about string vectors? #27

Open
stenpiren opened this issue Oct 24, 2019 · 4 comments
Open

Examples show boolean vectors, what about string vectors? #27

stenpiren opened this issue Oct 24, 2019 · 4 comments

Comments

@stenpiren
Copy link

Hi,
I was wondering how to use this library for comparing two different Strings that are tokenized into a string vector each.
The examples only show boolean vectors which are just "post-transformation". As a newbie and to make great use of the library, it would be great to have the transformation part covered in the examples.

@linusjf
Copy link

linusjf commented Nov 10, 2019

Excellent point. It would help to have examples converting strings and documents to vectors using something like word2vec but more lightweight.

Something like this but with a fully coded example:

https://link.medium.com/xF9m59LMr1

@nanshakov
Copy link

nanshakov commented Mar 26, 2020

I use this approach:


import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.stream.IntStream;

import info.debatty.java.lsh.MinHash;

/**
 * Hello world!
 */
public class App {

    //https://github.com/tdebatty/java-LSH
    public static void main(String[] args) {
        int signature_size = 128;
        int dictionary_size = 256;
        int initial_seed = 1234567890;

        MinHash mh = new MinHash(signature_size, dictionary_size, initial_seed);
        String pass1 = "Nikita";
        String pass2 = "Nikika";
        String pass3 = "Nikika1995strong";
        String pass4 = "qweertyuiop";

        var sig1 = mh.signature(fromString(pass1));
        var sig2 = mh.signature(fromString(pass2));
        var sig3 = mh.signature(fromString(pass3));
        var sig4 = mh.signature(fromString(pass4));

        System.out.println("sig1-sig2 : " + mh.similarity(sig1, sig2));
        System.out.println("sig1-sig3 : " + mh.similarity(sig1, sig3));
        System.out.println("sig1-sig4 : " + mh.similarity(sig1, sig4));

        println(sig1);
        println(sig2);
        println(sig3);
        println(sig4);
    }

    public static boolean[] fromString(String str) {
        var bytes = str.getBytes();
        var result = new boolean[256];
        int index = 0;
        for (int i = 0; i < str.length(); i++, index += 8) {
            var bits = byteToBoolArr(bytes[i]);
            System.arraycopy(bits, 0, result, index, bits.length);
        }
        return result;
    }

    public static boolean[] byteToBoolArr(byte b) {
        var boolArr = new boolean[8];
        for (int i = 0; i < 8; i++) {
            boolArr[i] = (b & (byte) (128 / Math.pow(2, i))) != 0;
        }
        return boolArr;
    }

    static void println(final int[] array) {
        System.out.println(convertToBase64(array));
    }

    public static String convertToBase64(int[] ints) {
        ByteBuffer buf = ByteBuffer.allocate(ints.length);
        IntStream.of(ints).forEach(i -> buf.put((byte)i));
        return Base64.getEncoder().encodeToString(buf.array());
    }
}

@Jurian
Copy link

Jurian commented Jul 8, 2022

This is a very interesting method and I would like to use it in my project. At the moment I am using ngrams to create boolean vectors, perhaps this works better and faster.

I do have a few questions/observations:

  1. You copy 8 bits into the result array for every character in the string, this means the strings have a maximum length?
  2. I assume the signature size and the number 128 in byteToBoolArr() are the same variable?
  3. When does it make sense to change the signature size?

Cheers

PS. I slightly changed your code to make more use of (Java) constants:

protected static final int DIMENSIONS = 512;
protected static final int SIGNATURE_SIZE = 128;

private boolean[] byteToBoolArr(byte b) {
    boolean[] boolArr = new boolean[Byte.SIZE];
    for (int i = 0; i < Byte.SIZE; i++) {
        boolArr[i] = (b & (byte) (SIGNATURE_SIZE / Math.pow(2, i))) != 0;
    }
    return boolArr;
}

private boolean[] fromString(String token) {
    if(token.length() / Byte.SIZE > DIMENSIONS) {
        throw new IllegalArgumentException(
                "Input string too large to be encoded. " +
                "Would need at least " + (token.length()*Byte.SIZE) + " dimensions.");
    }
    byte[] bytes = token.getBytes();
    boolean[] result = new boolean[DIMENSIONS];
    for (int c_i = 0, b_i = 0; c_i < bytes.length; c_i++, b_i += Byte.SIZE) {
        System.arraycopy(byteToBoolArr(bytes[c_i]), 0, result, b_i, Byte.SIZE);
    }
    return result;
}

@jianshu93
Copy link

Hi all,

I have the same question here, why do not we use some random hash functions to hash strings into some u64 and consider it as bit vectors/binary vectors, this is how it was used in practice since I can imagine for truely random hash functions, same strings will be hashed to the same binary vectors. Check the Rust version of SRP-LSH here: https://github.com/serega/gaoya

Any idea how to modified it?

Thanks,

Jianshu

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants