token selection #45
Comments
This is such an interesting issue. Ranking relative to position could be hugely useful. E.g. give me the top-ranked turn-initial words in this corpus. Or the top-ranked turn-final ones that ALSO occur standalone.
In terms of sensible defaults, the only one that doesn't make conceptual sense is "middle" (because that is so variable with utterance length).
Terminology alert that relates to your final point, I think. What has been very productive for us so far is a simple turn-level rank (the current
In trying to answer your questions I'm trying to get my head around the place of this function in a workflow. So maybe it helps to describe (adding to #38) a few more cases that I think would be very cool.
The main attraction of these kinds of queries, which are conceptually pretty simple, is that they can all be done in language-agnostic ways, as they refer only to relative positions, utterances, and ranks. This is what a position-sensitive tokenize function brings into reach.
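To make the two example queries above concrete, here is a minimal Python sketch. It is not the project's implementation: the function names `tokenize_with_position` and `top_tokens`, the position labels, and the whitespace-based splitting are all assumptions made for illustration (whitespace splitting in particular will not suit every language).

```python
from collections import Counter


def tokenize_with_position(utterances):
    """Split each utterance on whitespace and label each token's relative position.

    Hypothetical helper: labels are "standalone", "first", "last" or "middle".
    """
    records = []
    for turn_id, utterance in enumerate(utterances):
        tokens = utterance.lower().split()
        last_index = len(tokens) - 1
        for i, token in enumerate(tokens):
            if len(tokens) == 1:
                position = "standalone"
            elif i == 0:
                position = "first"
            elif i == last_index:
                position = "last"
            else:
                position = "middle"
            records.append({"turn": turn_id, "token": token, "position": position})
    return records


def top_tokens(records, position, n=3, also_in=None):
    """Return the n most frequent tokens in `position`.

    If `also_in` is given, keep only tokens that occur at least once
    in that other position as well.
    """
    counts = Counter(r["token"] for r in records if r["position"] == position)
    if also_in is not None:
        allowed = {r["token"] for r in records if r["position"] == also_in}
        counts = Counter({t: c for t, c in counts.items() if t in allowed})
    return counts.most_common(n)


utterances = ["oh really", "yeah", "oh that is nice yeah", "yeah I know", "okay"]
records = tokenize_with_position(utterances)

# Top-ranked turn-initial token in this toy corpus
top_initial = top_tokens(records, "first", n=1)
# Turn-final tokens that ALSO occur standalone
top_final_standalone = top_tokens(records, "last", also_in="standalone")
```

Because the queries only refer to positions and counts, nothing in the sketch depends on the language of the corpus beyond the tokenization step itself.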
The current tokenize() function splits the text up into individual tokens and ranks them. It would be useful to provide customization to ranking, as there are several potential use cases for ranking tokens.
In #38 @mdingemanse listed:
The tokenize function therefore should be able to:
[...] (only; should we rename this?),
[...] (first, last, middle)
I envisage something like:
This way, a user can e.g. rank only those tokens that also occur as standalone, but return all tokens. Some tokens therefore will not have a rank.
A potential confusion would be that ranking could take place either only in the context of e.g. standalone tokens, or in the context of the entire conversation. This could be another option that the user could provide? To be quite honest, I am not sure what would even be a sensible default here, so happy to hear from @mdingemanse @liesenf @aliandalopez here!
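To illustrate the difference between the two ranking contexts, here is a rough Python sketch. Everything in it is hypothetical: the function name `rank_tokens`, the `rank_only` and `scope` parameters and their values, frequency as the ranking criterion, and dense ranking for ties are assumptions, not the actual or proposed tokenize() interface.

```python
from collections import Counter


def rank_tokens(utterances, rank_only=None, scope="all"):
    """Return every token, ranking only a subset by frequency.

    rank_only: None ranks all tokens; "standalone" ranks only tokens that
               also occur as a one-token utterance.
    scope:     "all" counts occurrences across the whole conversation;
               "subset" counts only the standalone occurrences.
               (Ignored when rank_only is None.)
    Unranked tokens get rank None.
    """
    tokenized = [u.lower().split() for u in utterances]
    all_counts = Counter(t for tokens in tokenized for t in tokens)
    standalone_counts = Counter(tokens[0] for tokens in tokenized if len(tokens) == 1)

    if rank_only is None:
        counts = all_counts
    elif rank_only == "standalone":
        if scope == "subset":
            counts = standalone_counts
        else:
            counts = Counter({t: all_counts[t] for t in standalone_counts})
    else:
        raise ValueError(f"unknown rank_only value: {rank_only!r}")

    # Dense ranking: the highest count gets rank 1, ties share a rank
    distinct = sorted(set(counts.values()), reverse=True)
    rank_of_count = {count: i + 1 for i, count in enumerate(distinct)}
    ranks = {t: rank_of_count[c] for t, c in counts.items()}

    return [
        {"turn": turn_id, "token": t, "rank": ranks.get(t)}
        for turn_id, tokens in enumerate(tokenized)
        for t in tokens
    ]


utterances = ["okay", "okay", "yeah", "yeah yeah I know", "then"]

whole = rank_tokens(utterances, rank_only="standalone", scope="all")
subset = rank_tokens(utterances, rank_only="standalone", scope="subset")

whole_ranks = {r["token"]: r["rank"] for r in whole}
subset_ranks = {r["token"]: r["rank"] for r in subset}
```

In this toy conversation "yeah" outranks "okay" when counts come from the whole conversation, but "okay" outranks "yeah" when only standalone occurrences are counted, which is exactly why the choice of default matters. In both cases all tokens are returned, and tokens that never occur standalone have no rank.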