Add in a min_tokens flag in SFileFilter.filter_sfile()? #37

Open

ApproximateIdentity opened this issue Dec 7, 2014 · 6 comments

@ApproximateIdentity (Contributor)

When dealing with sfiles I sometimes just want to drop documents (lines) altogether if they have too few tokens in them. Up until now I've just used hacked-together solutions and been too lazy to figure out how (and whether) it should be integrated into rosetta. Here is a mock-up commit of what I mean:

ApproximateIdentity@6e2916d

All it does is add a min_tokens flag to filter_sfile() which causes the filtering to skip writing lines with fewer than that many tokens. Would it make sense to add this here? Is there a better place for it?

And maybe most importantly, does this functionality already exist somewhere else and I've just always missed it?
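
For concreteness, the change boils down to a check like the following inside the filter_sfile write loop (a rough sketch only; record_dict and feature_values follow the names in my mock-up commit and may not match the library internals exactly):

```python
# Rough sketch of the min_tokens idea -- assumes each parsed sfile line is a
# record_dict whose 'feature_values' entry maps token -> count (names taken
# from the mock-up commit, not necessarily the real rosetta internals).

def keep_record(record_dict, min_tokens=0):
    """Return True if the document has at least min_tokens tokens."""
    # Use len(record_dict['feature_values']) instead if "tokens" should
    # mean distinct tokens rather than total count.
    num_tokens = sum(record_dict['feature_values'].values())
    return num_tokens >= min_tokens

# Inside filter_sfile, a line would only be written when
# keep_record(record_dict, min_tokens) is True.
```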

@dkrasner (Contributor) commented Dec 7, 2014

Right now the built-in filtering is on tokens, not on docs, i.e. the filter_extremes method. For individual documents the idea was that you would first generate a list of doc_ids to keep and then pass this to filter_sfile(), but I can see passing criteria directly to the method.

Another option is to allow filter_sfile() to take a list of filters, all with the same API; for example, each function would take a record_dict item, an idf list, and **kwargs, and return a bool. Then in filter_sfile a document would be written if all(filter_func(record_dict, idf, **kwargs) for filter_func in filter_func_list). This way the user can pass in whatever filters they like, and we can add your idf_filter to a list of utils.
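
Roughly, the write condition would look like this (filter_func_list and the filter signature are just placeholders for whatever we settle on):

```python
# Sketch of the proposed mechanism -- each filter has the signature
# filter_func(record_dict, idf, **kwargs) -> bool; none of these names
# exist in rosetta yet.

def should_write(record_dict, idf, filter_func_list, **kwargs):
    """A document is written only if every filter accepts it."""
    return all(f(record_dict, idf, **kwargs) for f in filter_func_list)

# Example filter with that signature: keep only whitelisted doc_ids.
def doc_id_filter(record_dict, idf, keep_doc_ids=(), **kwargs):
    return record_dict['doc_id'] in keep_doc_ids
```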

thoughts?

@langmore (Contributor) commented Dec 7, 2014

I like the idea of a list of filters. One thing I'm noticing is that there are two types of filters. First, there are the ones that operate on tokens and filter the same tokens from every file, e.g. filter_extremes. In a sense these filter columns. Second, there are the tf-idf filter and the doc_id filter (and the additional filter proposed here), which filter individual docs (they filter in the space of rows). If the distinction is made clear it will be helpful for users... so if I could re-design the API I might make that explicit in the function names... but maybe there is another way.
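
To make the column/row distinction concrete, the two kinds of filters would naturally have different signatures (illustrative only, neither exists in rosetta as written):

```python
# Illustrative only -- neither function exists in rosetta as written.

def token_filter(token_doc_freq, min_doc_freq=2):
    """Column filter: decides once which tokens survive in *every* document."""
    return {tok for tok, df in token_doc_freq.items() if df >= min_doc_freq}

def doc_filter(record_dict):
    """Row filter: decides per document whether (or how much of) it survives."""
    return len(record_dict['feature_values']) > 0
```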

@dkrasner (Contributor) commented Dec 7, 2014

We could definitely have two methods, say filter_tokens and filter_documents, each taking a list of filters. I do like the current ease of use, especially when first starting with the lib, i.e. it's nice to have some "natural" filters such as max_doc_freq and min_doc_freq rather than having these filters live somewhere else and add an extra step to a first-time run-through, but perhaps I am wrong...

There could be a compromise, such as default filters in filter_tokens (i.e. the ones that are already there) and in filter_documents (doc_id), and then allowing the user to pass in a list of additional filters which would be checked.

@langmore (Contributor) commented Dec 7, 2014

Yeah, SFileFilter already has a filter_tokens method, and filter_extremes simply calls filter_tokens under the hood. So the same thing could be done with document filters, i.e. we add filter_documents and the current "higher level" functions call it under the hood... but the user can call either one. That way we keep the natural filters, e.g. tf-idf, as high-level functions, and the lower-level interface is halfway done... the real question is where the generic list of filters should be plugged in. Maybe it gets passed as a kwarg to filter_tokens or filter_documents.

However... the other difference is that (currently) token filtering is done instantly: we immediately remove items from our mapping of tokens to keep. But document filtering only makes sense as you execute filter_sfile. So maybe it doesn't make sense to have filter_documents, since that name would seem to imply an action, but really the "action" doesn't happen until you call filter_sfile.
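
In other words, the timing is different: token filtering mutates shared state right away, while a document filter is only a predicate that gets applied as filter_sfile streams over the records. A small sketch of that difference (all names are placeholders):

```python
# All names are placeholders; this only illustrates the timing difference.

# Token (column) filtering acts immediately on shared state:
token2id = {'the': 0, 'cat': 1, 'sat': 2}
token2id = {t: i for t, i in token2id.items() if t != 'the'}  # gone everywhere, right now

# Document (row) filtering is a deferred predicate -- nothing happens until
# filter_sfile actually streams over the records:
doc_filters = [lambda record_dict: sum(record_dict['feature_values'].values()) >= 10]

def stream_filter(records, doc_filters):
    """The 'action' only happens here, record by record."""
    for record_dict in records:
        if all(f(record_dict) for f in doc_filters):
            yield record_dict
```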

@ApproximateIdentity (Contributor, Author)

I like the idea of passing a function that does the filtering. One thing to note though is that the way I wrote the tf_idf stuff isn't really filtering whole columns. It can filter out 'and' from one doc while leaving it in another. So that one is more like a document-level filter.

The tf-idf stuff could actually be unified with the doc-filtering-function idea: tf-idf could be implemented as a function that's passed in, and the min_tokens flag could be implemented by throwing away all tokens in a document if the number of tokens in that document is low. Then there could be a flag (say drop_empty_docs=False) for whether we want to actually drop such documents entirely.

I think what I'll do is whip up a couple quick examples to see how it feels.

@ApproximateIdentity (Contributor, Author)

Here's a basic example of one thing I'm thinking about:

ApproximateIdentity@e5e27f5
0a9db41

(The second commit fixes a little bug in the first.)

The point is that having this generic list of filters would allow more or less anything to be done without adding any future parameters. I also added the check 'if not record_dict.has_key('skip_doc'):', which lets a filter specify whether a document should be kept or dropped. The two functions at the end are example filters: one drops the doc, and one keeps it but removes all its words. This covers most kinds of streaming filters and means we wouldn't have to keep adding function parameters. If I changed it to pass self to func as well in line 713, then even min_tf_idf could be rewritten in this framework, but that might be overkill.
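
Roughly, the mechanism is this (a sketch of the idea described above, not the exact code in the commit):

```python
# Sketch of the idea described above, not the exact code from the commit.
# Filters mutate record_dict in place; filter_sfile writes the record only
# if no filter set the 'skip_doc' key.

def drop_short_docs(record_dict, min_tokens=5):
    """Example filter: mark the whole document to be skipped."""
    if sum(record_dict['feature_values'].values()) < min_tokens:
        record_dict['skip_doc'] = True

def strip_short_docs(record_dict, min_tokens=5):
    """Example filter: keep the line but remove all of its tokens."""
    if sum(record_dict['feature_values'].values()) < min_tokens:
        record_dict['feature_values'] = {}

def apply_filters(record_dict, filters):
    for func in filters:
        func(record_dict)
    return 'skip_doc' not in record_dict  # write only if nothing flagged it
```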

To me this feels like it would be right, but I haven't really used it yet. I might just branch a version doing this and then write some stuff with it and see how it feels. Thoughts? Is this a totally terrible idea?
