Add a min_tokens flag to SFileFilter.filter_sfile()? #37
Right now the built-in filtering is on tokens, not on docs, i.e. the filter_extremes method. For individual documents the idea was that you would first generate a list of doc_ids to keep and then pass this to filter_sfile(), but I can see passing criteria directly to the method. Another option is to allow filter_sfile() to take a list of filters, all with the same API; for example, each function would take a record_dict item, an idf list, and **kwargs, and return a bool. Then in filter_sfile a document would be written if `all(filter_func(record_dict, idf, **kwargs) for filter_func in filter_func_list)`. This way the user can pass in whatever filters they want, and we can add your idf_filter to a list of utils. Thoughts?
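A rough sketch of what that shared API could look like, assuming each parsed line yields a record_dict with a doc_id and a feature_values token-count dict (the names here are illustrative, not rosetta's actual internals):

```python
def idf_filter(record_dict, idf, min_idf=1.0, **kwargs):
    """Keep the doc only if at least one of its tokens has idf >= min_idf."""
    return any(idf.get(tok, 0.0) >= min_idf for tok in record_dict['feature_values'])

def doc_id_filter(record_dict, idf, doc_ids_to_keep=frozenset(), **kwargs):
    """Keep the doc only if its doc_id is in an explicit keep-list."""
    return record_dict['doc_id'] in doc_ids_to_keep

# Inside filter_sfile, each document is written only if every filter passes:
# if all(f(record_dict, idf, **kwargs) for f in filter_func_list):
#     write(record_dict)
```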
I like the idea of a list of filters. One thing I'm noticing is that there are two types of filters. First, there is the kind that operates on tokens and filters the same tokens from every file, e.g. […]
We could definitely have two methods, say filter_tokens and filter_documents, and each could take a list of filters. I do like the current ease of use, especially when first starting with the lib, i.e. it's nice to have some "natural" filters such as max_doc_freq and min_doc_freq, as opposed to having these filters live somewhere else and adding an extra step to a first-time run-through, but perhaps I am wrong... There could be a compromise, such as keeping the default filters in filter_tokens (i.e. the ones that are there now) and doc_id in filter_documents, and then allowing the user to pass in a list which would be checked, as sketched below.
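Hypothetical signatures for that compromise: the "natural" filters stay as keyword defaults, and each method also takes an optional list of user filters. The attribute names (doc_freq, num_docs, token2id) are assumptions for illustration, not rosetta's actual API:

```python
class SFileFilterSketch(object):

    def filter_tokens(self, max_doc_freq=1.0, min_doc_freq=1, token_filters=None):
        """Built-in frequency filters run first, then any user-supplied ones."""
        token_filters = token_filters or []
        for token, n_docs in list(self.doc_freq.items()):
            frac = n_docs / float(self.num_docs)
            keep = (min_doc_freq <= n_docs and frac <= max_doc_freq
                    and all(f(token, n_docs) for f in token_filters))
            if not keep:
                del self.token2id[token]

    def filter_documents(self, doc_ids=None, doc_filters=None):
        """Store doc-level criteria; they take effect when filter_sfile runs."""
        self.doc_ids_to_keep = set(doc_ids) if doc_ids is not None else None
        self.doc_filters = list(doc_filters or [])
```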
Yeah, SFileFilter already has a […]. However, the other difference is that (currently) token filtering is done instantly: we immediately remove items from our mapping of tokens to keep. But document filtering only makes sense as you execute filter_sfile().
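In loop form, the timing difference looks roughly like this (a sketch; the records iterable and write callback are stand-ins, not rosetta functions):

```python
def filter_tokens_eagerly(token2id, doc_freq, min_doc_freq):
    # Token filtering acts immediately, before any pass over the sfile:
    for token in list(token2id):
        if doc_freq.get(token, 0) < min_doc_freq:
            del token2id[token]  # the vocabulary shrinks right away

def filter_docs_streaming(records, doc_filters, write):
    # Document filtering can only act line by line, during the streaming pass:
    for record_dict in records:
        if all(f(record_dict) for f in doc_filters):
            write(record_dict)  # only now do we know whether this doc survives
```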
I like the idea of passing a function that does the filtering. One thing to note, though, is that the way I wrote the tf_idf stuff isn't really filtering whole columns: it can filter out 'and' from one doc while leaving it in another, so that one is more like a document-level filter. The tf-idf stuff could actually kind of be unified with the doc-filtering function idea. tf-idf could be implemented as a function that's passed in, and the min_tokens flag could be implemented by throwing away all tokens in a document if the number of tokens in that document is low. Then there could be a flag (say drop_empty_docs=False) if we want to actually drop those documents entirely. I think what I'll do is whip up a couple of quick examples to see how it feels.
Here's a basic example of one thing I'm thinking about: ApproximateIdentity@e5e27f5 (the second commit fixes a little bug in the first). The point is that having this generic list of filters would allow more or less anything to be done without adding any future parameters. I also added the check `if not record_dict.has_key('skip_doc'):`, which lets the user specify whether a document should be kept or dropped. The two functions at the end are two example filters: one drops the doc, and one leaves it but removes all its words. This allows most kinds of streaming filters and means that you wouldn't have to keep adding function parameters. If I changed it to pass self to func as well in line 713, then even the min_tf_idf could be rewritten in this framework, but that might be overkill. To me this feels like it would be right, but I haven't really used it yet. I might just branch a version doing this and then write some stuff with it and see how it feels. Thoughts? Is this a totally terrible idea?
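For concreteness, the two example filters could look something like this under the skip_doc convention (a sketch of the pattern in the linked commit, assuming the same record_dict shape as above):

```python
def drop_short_docs(record_dict, min_tokens=3):
    """Flag the whole document to be skipped by setting 'skip_doc'."""
    if sum(record_dict['feature_values'].values()) < min_tokens:
        record_dict['skip_doc'] = True

def blank_short_docs(record_dict, min_tokens=3):
    """Keep the document's line but remove all of its words."""
    if sum(record_dict['feature_values'].values()) < min_tokens:
        record_dict['feature_values'] = {}

# In the write loop, each filter mutates record_dict in place, and the doc is
# written only if nothing flagged it:
# for func in filter_funcs:
#     func(record_dict)
# if 'skip_doc' not in record_dict:
#     write(record_dict)
```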
When dealing with sfiles I sometimes just want to drop documents (lines) altogether if they have too few tokens in them. Up until now I've just used hacked-together solutions and been too lazy to actually figure out how, and whether, it should be integrated into rosetta. Here is a mock-up commit of what I mean:
ApproximateIdentity@6e2916d
All it does is add a min_tokens flag to filter_sfile() which makes the filtering skip writing lines with fewer than that many tokens. Would it make sense to add this here? Is there a better place for it?
And maybe most importantly, does this functionality already exist somewhere else and I've just always missed it?
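For reference, the core of the mock-up is just a guard in the write loop (a paraphrase of the commit, assuming the usual record_dict with a feature_values token-count dict):

```python
def keep_line(record_dict, min_tokens=0):
    """With min_tokens=0 this is a no-op; otherwise short lines are skipped."""
    return sum(record_dict['feature_values'].values()) >= min_tokens

# e.g. keep_line({'doc_id': 'a', 'feature_values': {'cat': 1, 'dog': 1}}, min_tokens=3)
# returns False, so that line would not be written.
```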