token selection #45

Open
1 of 3 tasks
bvreede opened this issue Jan 10, 2024 · 1 comment

@bvreede
Collaborator

bvreede commented Jan 10, 2024

The current tokenize() function splits the text into individual tokens and ranks them.

It would be useful to provide customization to ranking, as there are several potential use cases for ranking tokens.

In #38 @mdingemanse listed:

  • find top turns (e.g. mhmm, oh, yes)
  • pick a turn format of interest (e.g. oh)
  • find all standalone turns of this format (oh)
  • find this format as a token within turns (e.g., oh I see)
  • plot the standalone turns as one layer
  • plot the within-turn tokens of oh as another layer
  • plot as text

The tokenize function therefore should be able to:

  • Rank all tokens, irrespective of context
  • Rank only tokens in a specific location: standalone (currently 'only'; should we rename this?), first, last, or middle
  • Return only tokens in a specific context

I envisage something like:

tokenize(data,
         rank = 'all',    # or e.g. c('only', 'first')
         return = 'all')  # or e.g. c('first', 'middle', 'last')

This way, a user can e.g. rank only those tokens that also occur as standalone, but return all tokens. Some tokens therefore will not have a rank.
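
A minimal base-R sketch of how this could behave (the column names `uid` and `token`, and the helper name `tokenize_ranked`, are assumptions for illustration, not the package's actual API):

```r
# Hypothetical sketch, not the package's implementation: classify each token's
# position within its utterance, then rank token types by frequency counting
# only tokens in the requested position(s), while still returning all tokens.
tokenize_ranked <- function(tokens, rank = "all") {
  # tokens: data frame with columns `uid` (utterance id) and `token`,
  # one row per token, in utterance order (assumed schema)
  idx <- ave(seq_along(tokens$uid), tokens$uid, FUN = seq_along)
  len <- ave(seq_along(tokens$uid), tokens$uid, FUN = length)
  tokens$position <- ifelse(len == 1, "standalone",
                     ifelse(idx == 1, "first",
                     ifelse(idx == len, "last", "middle")))
  in_scope <- if (identical(rank, "all")) tokens
              else tokens[tokens$position %in% rank, ]
  freq <- sort(table(in_scope$token), decreasing = TRUE)
  tokens$rank <- match(tokens$token, names(freq))  # NA when never in scope
  tokens
}
```

With `rank = 'standalone'`, token types that never occur standalone come back with `rank = NA`, matching the behaviour described above.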

A potential confusion would be that ranking could take place either only in the context of e.g. standalone tokens, or in the context of the entire conversation. This could be another option that the user could provide? To be quite honest, I am not sure what would even be a sensible default here, so happy to hear from @mdingemanse @liesenf @aliandalopez here!
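
A toy illustration of why that choice of ranking scope matters (assumed columns `uid` and `token` again, invented data):

```r
# Two ranking scopes: counting occurrences only among standalone turns
# vs. counting in the whole conversation
toks <- data.frame(uid   = c(1, 2, 3, 3, 4, 4),
                   token = c("mhmm", "mhmm", "oh", "no", "oh", "dear"))
len <- ave(seq_along(toks$uid), toks$uid, FUN = length)
standalone_freq <- sort(table(toks$token[len == 1]), decreasing = TRUE)
overall_freq    <- sort(table(toks$token),           decreasing = TRUE)
# "mhmm" outranks "oh" among standalone turns, but the two tie overall,
# so the scope chosen changes the resulting ranks
```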

@mdingemanse
Contributor

mdingemanse commented Jan 11, 2024

This is such an interesting issue. Ranking relative to position could be hugely useful. E.g. give me the top-ranked turn-initial words in this corpus. Or the top-ranked turn-final ones that ALSO occur standalone. In terms of sensible defaults, the only one that doesn't make conceptual sense is "middle" (because it is so variable with utterance length).

Terminology alert that relates to your final point, I think. What has been very productive for us so far is a simple turn-level rank (the current rank in e.g. the IFADV columns), which takes all turns without tokenizing them, sorts them by frequency, then spits out a ranking. To avoid terminological confusion we might rename this one to sth like rank_turn. (And ideally the other, token-level one to rank_token?)

In trying to answer your questions I'm trying to get my head around the place of this function in a workflow. So maybe it helps to describe (adding to #38) a few more cases that I think would be very cool.

  • For a given set of common standalone words (= top 5 of rank_turn), get utterance relative rank and frequency info (in terms of standalone, initial, medial, final). Would allow you to make distributional plots like so:
    [image: distributional plot of rank/frequency by utterance position]

  • Find the top 10 turn-initial words in this language / corpus (this gets you oh, well, so, but, I)

    • Define these as a set (a vector of strings), e.g. `top_initial <- tokenize(data, rank = c('only', 'first'), return = c('first'))`.
    • Now give me uids of utterances for which first word is top_initial[1], i.e. the most common utterance-initial word.
    • Now convplot a random sample of ten of these uids and highlight that particular word.
  • Or find the top 10 turn-initial words (top_initial as above) and the top 10 standalone (top_standalone).

    • Now allow me to build two sets of uids: set 1: uids of standalone oh followed by an utterance by same speaker in which first word is another from the top_initial set. Example would be [oh] [but I thought]. Set 2: uids of utterances where turn-initial oh is immediately followed by that same word. Example would be [oh but yesterday].
    • Now convplot 10 cases from set 1 and 10 from set 2. This would allow one to eyeball how standalone uses of particular words (like oh) followed by more material from same speaker relate to turn-initial uses of same. (The hypothesis is that if some combination of items is common enough in set 1, a language will over time develop a format in the turn-initial space in which they are glommed together as in set 2.)
  • Or find the top 10 turn-initial words (top_initial as above).

    • Now find me utterances that feature multiple adjacent ones from the set (oh so you think, oh but I don't know, well but you see). This enables you to build a small distributional grammar (e.g. you'd find that oh tends to come before everything else, oh but > but oh).

The main attraction of these kinds of queries, which are conceptually pretty simple, is that they can all be done in language-agnostic ways, as they refer only to relative positions, utterances, and ranks. This is what a position-sensitive tokenize function brings into reach.
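
The first of these queries could be sketched in a few lines of base R (the column names `uid`/`token` and the helper `top_initial_uids` are assumptions for illustration, not part of the package):

```r
# Hedged sketch: find the top n turn-initial words, then the uids of
# utterances whose first word is the single most common one
top_initial_uids <- function(tokens, n_top = 10) {
  idx <- ave(seq_along(tokens$uid), tokens$uid, FUN = seq_along)
  len <- ave(seq_along(tokens$uid), tokens$uid, FUN = length)
  firsts <- tokens[idx == 1 & len > 1, ]   # turn-initial, excluding standalone
  freq <- sort(table(firsts$token), decreasing = TRUE)
  top_initial <- names(freq)[seq_len(min(n_top, length(freq)))]
  list(top_initial = top_initial,
       uids = firsts$uid[firsts$token == top_initial[1]])
}
```

The resulting uids could then be fed to a plotting step (e.g. a random sample of ten, as described above).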
