Add in a min_tokens flag in SFileFilter.filter_sfile()? #37

Open

ApproximateIdentity opened this issue Dec 7, 2014 · 6 comments

@ApproximateIdentity (Contributor)

When dealing with sfiles I sometimes just want to drop documents (lines) altogether if they have too few tokens in them. Up until now I've just used hacked-together solutions and been too lazy to figure out how (and whether) it should be integrated into rosetta. Here is a mock-up commit of what I mean:

ApproximateIdentity@6e2916d

All it does is add a min_tokens flag to filter_sfile() which causes the filtering to skip writing lines with fewer than that many tokens. Would it make sense to add this here? Is there a better place for it?

And maybe most importantly, does this functionality already exist somewhere else and I've just always missed it?
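
For concreteness, the change boils down to a check like the following inside the filter_sfile write loop (a rough sketch only; record_dict and feature_values follow the names in my mock-up commit and may not match the library internals exactly):

```python
# Rough sketch of the min_tokens idea -- assumes each parsed sfile line is a
# record_dict whose 'feature_values' entry maps token -> count (names taken
# from the mock-up commit, not necessarily the real rosetta internals).

def keep_record(record_dict, min_tokens=0):
    """Return True if the document has at least min_tokens tokens."""
    # Use len(record_dict['feature_values']) instead if "tokens" should
    # mean distinct tokens rather than total count.
    num_tokens = sum(record_dict['feature_values'].values())
    return num_tokens >= min_tokens

# Inside filter_sfile, a line would only be written when
# keep_record(record_dict, min_tokens) is True.
```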

@dkrasner (Contributor) commented Dec 7, 2014

Right now the built-in filtering is on tokens, not on docs, i.e. the filter_extremes method. For individual documents the idea was that you would first generate a list of doc_ids to keep and then pass this to filter_sfile(), but I can see passing criteria directly to the method.

Another option is to allow filter_sfile() to take a list of filters, all with the same API; for example, each function would take a record_dict item, an idf list, and **kwargs, and return a bool. Then in filter_sfile a document would be written if all(filter_func(record_dict, idf, **kwargs) for filter_func in filter_func_list). This way the user can pass in whatever filters they like, and we can add your idf_filter to a list of utils.
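
Roughly, the write condition would look like this (filter_func_list and the filter signature are just placeholders for whatever we settle on):

```python
# Sketch of the proposed mechanism -- each filter has the signature
# filter_func(record_dict, idf, **kwargs) -> bool; none of these names
# exist in rosetta yet.

def should_write(record_dict, idf, filter_func_list, **kwargs):
    """A document is written only if every filter accepts it."""
    return all(f(record_dict, idf, **kwargs) for f in filter_func_list)

# Example filter with that signature: keep only whitelisted doc_ids.
def doc_id_filter(record_dict, idf, keep_doc_ids=(), **kwargs):
    return record_dict['doc_id'] in keep_doc_ids
```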

thoughts?

@langmore (Contributor) commented Dec 7, 2014

I like the idea of a list of filters. One thing I'm noticing is that there are two types of filters. First, there are the ones that operate on tokens and filter the same tokens from every file, e.g. filter_extremes. In a sense these filter columns. Second, there are the tf-idf filter and the doc_id filter (and the additional filter proposed here), which filter individual docs (they filter in the space of rows). If the distinction is made clear it will be helpful for users... so if I could re-design the API I might make that explicit in the function names... but maybe there is another way.
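
To make the column/row distinction concrete, the two kinds of filters would naturally have different signatures (illustrative only, neither exists in rosetta as written):

```python
# Illustrative only -- neither function exists in rosetta as written.

def token_filter(token_doc_freq, min_doc_freq=2):
    """Column filter: decides once which tokens survive in *every* document."""
    return {tok for tok, df in token_doc_freq.items() if df >= min_doc_freq}

def doc_filter(record_dict):
    """Row filter: decides per document whether (or how much of) it survives."""
    return len(record_dict['feature_values']) > 0
```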

@dkrasner (Contributor) commented Dec 7, 2014

We could definitely have two methods, say filter_tokens and filter_documents, each taking a list of filters. I do like the current ease of use, especially when first starting with the lib, i.e. it's nice to have some "natural" filters such as max_doc_freq and min_doc_freq rather than having these filters live somewhere else and add an extra step to a first-time run-through, but perhaps I am wrong...

There could be a compromise, such as default filters in filter_tokens (i.e. the ones that are already there) and in filter_documents (doc_id), and then allowing the user to pass in a list of additional filters which would be checked.

@langmore (Contributor) commented Dec 7, 2014

Yeah, SFileFilter already has a filter_tokens method, and filter_extremes simply calls filter_tokens under the hood. So the same thing could be done with document filters, i.e. we add filter_documents and the current "higher level" functions call it under the hood... but the user can call either one. That way we keep the natural filters, e.g. tf-idf, as high-level functions, and the lower-level interface is halfway done... the real question is where the generic list of filters should be plugged in. Maybe it gets passed as a kwarg to filter_tokens or filter_documents.

However... the other difference is that (currently) token filtering is done instantly: we immediately remove items from our mapping of tokens to keep. But document filtering only makes sense as you execute filter_sfile. So maybe it doesn't make sense to have filter_documents, since that name would seem to imply an action, but really the "action" doesn't happen until you call filter_sfile.
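
In other words, the timing is different: token filtering mutates shared state right away, while a document filter is only a predicate that gets applied as filter_sfile streams over the records. A small sketch of that difference (all names are placeholders):

```python
# All names are placeholders; this only illustrates the timing difference.

# Token (column) filtering acts immediately on shared state:
token2id = {'the': 0, 'cat': 1, 'sat': 2}
token2id = {t: i for t, i in token2id.items() if t != 'the'}  # gone everywhere, right now

# Document (row) filtering is a deferred predicate -- nothing happens until
# filter_sfile actually streams over the records:
doc_filters = [lambda record_dict: sum(record_dict['feature_values'].values()) >= 10]

def stream_filter(records, doc_filters):
    """The 'action' only happens here, record by record."""
    for record_dict in records:
        if all(f(record_dict) for f in doc_filters):
            yield record_dict
```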

@ApproximateIdentity (Contributor, Author)

I like the idea of passing a function that does the filtering. One thing to note though is that the way I wrote the tf_idf stuff isn't really filtering whole columns. It can filter out 'and' from one doc while leaving it in another. So that one is more like a document-level filter.

The tf-idf stuff could actually be unified with the doc-filtering-function idea: tf-idf could be implemented as a function that's passed in, and the min_tokens flag could be implemented by throwing away all tokens in a document if the number of tokens in that document is low. Then there could be a flag (say drop_empty_docs=False) for whether we want to actually drop such documents entirely.

I think what I'll do is whip up a couple quick examples to see how it feels.

@ApproximateIdentity (Contributor, Author)

Here's a basic example of one thing I'm thinking about:

ApproximateIdentity@e5e27f5
0a9db41

(The second commit fixes a little bug in the first.)

The point is that having this generic list of filters would allow more or less anything to be done without adding any future parameters. I also added the check 'if not record_dict.has_key('skip_doc'):', which lets a filter specify whether a document should be kept or dropped. The two functions at the end are example filters: one drops the doc, and one keeps it but removes all its words. This covers most kinds of streaming filters and means we wouldn't have to keep adding function parameters. If I changed it to pass self to func as well in line 713, then even min_tf_idf could be rewritten in this framework, but that might be overkill.
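
Roughly, the mechanism is this (a sketch of the idea described above, not the exact code in the commit):

```python
# Sketch of the idea described above, not the exact code from the commit.
# Filters mutate record_dict in place; filter_sfile writes the record only
# if no filter set the 'skip_doc' key.

def drop_short_docs(record_dict, min_tokens=5):
    """Example filter: mark the whole document to be skipped."""
    if sum(record_dict['feature_values'].values()) < min_tokens:
        record_dict['skip_doc'] = True

def strip_short_docs(record_dict, min_tokens=5):
    """Example filter: keep the line but remove all of its tokens."""
    if sum(record_dict['feature_values'].values()) < min_tokens:
        record_dict['feature_values'] = {}

def apply_filters(record_dict, filters):
    for func in filters:
        func(record_dict)
    return 'skip_doc' not in record_dict  # write only if nothing flagged it
```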

To me this feels like it would be right, but I haven't really used it yet. I might just branch a version doing this and then write some stuff with it and see how it feels. Thoughts? Is this a totally terrible idea?
