token selection #45

Open
1 of 3 tasks
bvreede opened this issue Jan 10, 2024 · 1 comment

@bvreede
Collaborator

bvreede commented Jan 10, 2024

The current tokenize() function splits the text into individual tokens and ranks them.

It would be useful to provide customization to ranking, as there are several potential use cases for ranking tokens.

In #38 @mdingemanse listed:

  • find top turns (e.g. mhmm, oh, yes)
  • pick a turn format of interest (e.g. oh)
  • find all standalone turns of this format (oh)
  • find this format as a token within turns (e.g., oh I see)
  • plot the standalone turns as one layer
  • plot the within-turn tokens of oh as another layer
  • plot as text

The tokenize function therefore should be able to:

  • Rank all tokens, irrespective of context
  • Rank only tokens in a specific location: standalone (currently 'only'; should we rename this?), first, last, or middle
  • Return only tokens in a specific context

I envisage something like:

tokenize(data,
         rank = 'all',    # or e.g. c('only', 'first')
         return = 'all')  # or e.g. c('first', 'middle', 'last')

This way, a user can e.g. rank only those tokens that also occur as standalone, but return all tokens. Some tokens therefore will not have a rank.
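
A minimal base-R sketch of how this could behave (the column names `uid` and `token`, and the helper name `tokenize_ranked`, are assumptions for illustration, not the package's actual API):

```r
# Hypothetical sketch, not the package's implementation: classify each token's
# position within its utterance, then rank token types by frequency counting
# only tokens in the requested position(s), while still returning all tokens.
tokenize_ranked <- function(tokens, rank = "all") {
  # tokens: data frame with columns `uid` (utterance id) and `token`,
  # one row per token, in utterance order (assumed schema)
  idx <- ave(seq_along(tokens$uid), tokens$uid, FUN = seq_along)
  len <- ave(seq_along(tokens$uid), tokens$uid, FUN = length)
  tokens$position <- ifelse(len == 1, "standalone",
                     ifelse(idx == 1, "first",
                     ifelse(idx == len, "last", "middle")))
  in_scope <- if (identical(rank, "all")) tokens
              else tokens[tokens$position %in% rank, ]
  freq <- sort(table(in_scope$token), decreasing = TRUE)
  tokens$rank <- match(tokens$token, names(freq))  # NA when never in scope
  tokens
}
```

With `rank = 'standalone'`, token types that never occur standalone come back with `rank = NA`, matching the behaviour described above.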

A potential confusion would be that ranking could take place either only in the context of e.g. standalone tokens, or in the context of the entire conversation. This could be another option that the user could provide? To be quite honest, I am not sure what would even be a sensible default here, so happy to hear from @mdingemanse @liesenf @aliandalopez here!
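
A toy illustration of why that choice of ranking scope matters (assumed columns `uid` and `token` again, invented data):

```r
# Two ranking scopes: counting occurrences only among standalone turns
# vs. counting in the whole conversation
toks <- data.frame(uid   = c(1, 2, 3, 3, 4, 4),
                   token = c("mhmm", "mhmm", "oh", "no", "oh", "dear"))
len <- ave(seq_along(toks$uid), toks$uid, FUN = length)
standalone_freq <- sort(table(toks$token[len == 1]), decreasing = TRUE)
overall_freq    <- sort(table(toks$token),           decreasing = TRUE)
# "mhmm" outranks "oh" among standalone turns, but the two tie overall,
# so the scope chosen changes the resulting ranks
```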

@mdingemanse
Contributor

mdingemanse commented Jan 11, 2024

This is such an interesting issue. Ranking relative to position could be hugely useful. E.g. give me the top-ranked turn-initial words in this corpus. Or the top-ranked turn-final ones that ALSO occur standalone. In terms of sensible defaults, the only one that doesn't make conceptual sense is "middle" (because it is so variable with utterance length).

Terminology alert that relates to your final point, I think. What has been very productive for us so far is a simple turn-level rank (the current rank in e.g. the IFADV columns), which takes all turns without tokenizing them, sorts them by frequency, then spits out a ranking. To avoid terminological confusion we might rename this one to sth like rank_turn. (And ideally the other, token-level one to rank_token?)

In trying to answer your questions I'm trying to get my head around the place of this function in a workflow. So maybe it helps to describe (adding to #38) a few more cases that I think would be very cool.

  • For a given set of common standalone words (= top 5 of rank_turn), get utterance relative rank and frequency info (in terms of standalone, initial, medial, final). Would allow you to make distributional plots like so:
    [image: distributional plot of rank/frequency by utterance position]

  • Find the top 10 turn-initial words in this language / corpus (this gets you oh, well, so, but, I)

    • Define these as a set (a vector of strings), e.g. `top_initial <- tokenize(data, rank = c('only', 'first'), return = c('first'))`.
    • Now give me uids of utterances for which first word is top_initial[1], i.e. the most common utterance-initial word.
    • Now convplot a random sample of ten of these uids and highlight that particular word.
  • Or find the top 10 turn-initial words (top_initial as above) and the top 10 standalone (top_standalone).

    • Now allow me to build two sets of uids: set 1: uids of standalone oh followed by an utterance by same speaker in which first word is another from the top_initial set. Example would be [oh] [but I thought]. Set 2: uids of utterances where turn-initial oh is immediately followed by that same word. Example would be [oh but yesterday].
    • Now convplot 10 cases from set 1 and 10 from set 2. This would allow one to eyeball how standalone uses of particular words (like oh) followed by more material from same speaker relate to turn-initial uses of same. (The hypothesis is that if some combination of items is common enough in set 1, a language will over time develop a format in the turn-initial space in which they are glommed together as in set 2.)
  • Or find the top 10 turn-initial words (top_initial as above).

    • Now find me utterances that feature multiple adjacent ones from the set (oh so you think, oh but I don't know, well but you see). This enables you to build a small distributional grammar (e.g. you'd find that oh tends to come before everything else, oh but > but oh).

The main attraction of these kinds of queries, which are conceptually pretty simple, is that they can all be done in language-agnostic ways, as they refer only to relative positions, utterances, and ranks. This is what a position-sensitive tokenize function brings into reach.
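
The first of these queries could be sketched in a few lines of base R (the column names `uid`/`token` and the helper `top_initial_uids` are assumptions for illustration, not part of the package):

```r
# Hedged sketch: find the top n turn-initial words, then the uids of
# utterances whose first word is the single most common one
top_initial_uids <- function(tokens, n_top = 10) {
  idx <- ave(seq_along(tokens$uid), tokens$uid, FUN = seq_along)
  len <- ave(seq_along(tokens$uid), tokens$uid, FUN = length)
  firsts <- tokens[idx == 1 & len > 1, ]   # turn-initial, excluding standalone
  freq <- sort(table(firsts$token), decreasing = TRUE)
  top_initial <- names(freq)[seq_len(min(n_top, length(freq)))]
  list(top_initial = top_initial,
       uids = firsts$uid[firsts$token == top_initial[1]])
}
```

The resulting uids could then be fed to a plotting step (e.g. a random sample of ten, as described above).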
