This repository has been archived by the owner on Jan 9, 2025. It is now read-only.

TCAT fails to extract all hashtags / mentions in tweets longer than 140 characters #311

Closed
Snurb opened this issue Apr 16, 2018 · 14 comments

Snurb commented Apr 16, 2018

There is a critical issue with TCAT's handling of tweets that are longer than 140 characters when they are received via the streaming API.

TCAT now uses the full_text attribute from the extended_tweet JSON element, rather than the plain text attribute, for the tweet text; this is correct and avoids the truncation of tweets to 140 characters.

However, to extract hashtags and mentions it still relies on the hashtags and user_mentions arrays in the default entities JSON element - but these contain only the hashtags and mentions that are visible in the truncated 140-character version of the tweet.

Instead, TCAT should use the hashtags and user_mentions arrays from the entities sub-element within the extended_tweet element - only these contain all hashtags and mentions. (This is very badly documented on Twitter's developer site.)

Below is a sample JSON from the streaming API that demonstrates the issue (user data removed to save space). Note that all hashtags and mentions in the original 280-character tweet occur after the 140-character mark, and are only visible in the full_text version of the tweet text. As a result, they are also only contained in the extended_tweet > entities > hashtags / user_mentions arrays, while the entities > hashtags / user_mentions arrays at the higher level of the JSON hierarchy remain empty.

Sample JSON:

    {
        "created_at": "Mon Apr 16 03:43:07 +0000 2018",
        "id": 985725247889518592,
        "id_str": "985725247889518592",
        "text": "This is another test tweet to track down some TCAT issues. Move along, nothing to see here people. No really, why a… https://t.co/Vezna2TS6V",
        "source": "<a href=\"http://www.hootsuite.com\" rel=\"nofollow\">Hootsuite</a>",
        "truncated": true,
        "in_reply_to_status_id": null,
        "in_reply_to_status_id_str": null,
        "in_reply_to_user_id": null,
        "in_reply_to_user_id_str": null,
        "in_reply_to_screen_name": null,
        "user": {

            [...]

        },
        "geo": null,
        "coordinates": null,
        "place": null,
        "contributors": null,
        "is_quote_status": false,
        "extended_tweet": {
            "full_text": "This is another test tweet to track down some TCAT issues. Move along, nothing to see here people. No really, why are you still looking at any of this - there's really nothing to see here. #TCAT #debugging #doesthiswork @socialmediaQUT @_ATNIX_ #testing #moretesting #bugsquashing",
            "display_text_range": [0, 280],
            "entities": {
                "hashtags": [{
                    "text": "TCAT",
                    "indices": [189, 194]
                }, {
                    "text": "debugging",
                    "indices": [195, 205]
                }, {
                    "text": "doesthiswork",
                    "indices": [206, 219]
                }, {
                    "text": "testing",
                    "indices": [245, 253]
                }, {
                    "text": "moretesting",
                    "indices": [254, 266]
                }, {
                    "text": "bugsquashing",
                    "indices": [267, 280]
                }],
                "urls": [],
                "user_mentions": [{
                    "screen_name": "socialmediaQUT",
                    "name": "Social Media @ QUT",
                    "id": 950802878,
                    "id_str": "950802878",
                    "indices": [220, 235]
                }, {
                    "screen_name": "_ATNIX_",
                    "name": "ATNIX",
                    "id": 2470512481,
                    "id_str": "2470512481",
                    "indices": [236, 244]
                }],
                "symbols": []
            }
        },
        "quote_count": 0,
        "reply_count": 0,
        "retweet_count": 0,
        "favorite_count": 0,
        "entities": {
            "hashtags": [],
            "urls": [{
                "url": "https://t.co/Vezna2TS6V",
                "expanded_url": "https://twitter.com/i/web/status/985725247889518592",
                "display_url": "twitter.com/i/web/status/9…",
                "indices": [117, 140]
            }],
            "user_mentions": [],
            "symbols": []
        },
        "favorited": false,
        "retweeted": false,
        "filter_level": "low",
        "lang": "en",
        "timestamp_ms": "1523850187643"
    }

Suggested fix: the problem is in capture/common/functions.php, lines 1775 and 1776 of the current revision, where the arrays for hashtags and user_mentions are extracted from the JSON. Each of these assignments will need to be wrapped in code similar to how the extended information for media objects is handled.

Unfortunately we've been unable to implement a fix successfully so far, so I'm reporting this issue now - will post an update if we manage to fix it on our end.

In principle, it seems that on line 1775,

    $this->user_mentions = json_decode(json_encode($data["entities"]["user_mentions"]), FALSE);

should become

    if (array_key_exists('extended_tweet', $data) &&
        array_key_exists('user_mentions', $data["extended_tweet"]["entities"])) {
        $this->user_mentions = json_decode(json_encode($data["extended_tweet"]["entities"]["user_mentions"]), FALSE);
    } else {
        $this->user_mentions = json_decode(json_encode($data["entities"]["user_mentions"]), FALSE);
    }

(and similarly for the hashtags on line 1776), but this doesn't seem to work. Perhaps we're missing something to do with the multiple nested arrays.
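
For reference, a more defensive variant of the same idea - an untested sketch only, guarding the whole nested path with isset() rather than separate array_key_exists() checks, and assuming the hashtags assignment on line 1776 mirrors the one for user_mentions - would be:

    // Untested sketch (not the committed fix): prefer the extended_tweet
    // entities when the whole nested path exists, otherwise fall back to
    // the top-level entities as before. isset() guards every level at once.
    if (isset($data["extended_tweet"]["entities"]["user_mentions"])) {
        $this->user_mentions = json_decode(json_encode($data["extended_tweet"]["entities"]["user_mentions"]), FALSE);
    } else {
        $this->user_mentions = json_decode(json_encode($data["entities"]["user_mentions"]), FALSE);
    }
    // ... and the equivalent for hashtags (property name assumed):
    if (isset($data["extended_tweet"]["entities"]["hashtags"])) {
        $this->hashtags = json_decode(json_encode($data["extended_tweet"]["entities"]["hashtags"]), FALSE);
    } else {
        $this->hashtags = json_decode(json_encode($data["entities"]["hashtags"]), FALSE);
    }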

Hope this helps address this issue. Note that this error is quite critical: any TCAT running the current revision will systematically underestimate the volume of hashtags and mentions in the data, potentially by a significant margin. And worse, because some hashtags and mentions are being captured, it does not look as if there is anything obviously wrong...

PS: we haven't explored this yet, but the same issue presumably also applies to URLs contained in the tweet text. Here, too, TCAT uses entities > urls rather than extended_tweet > entities > urls (see line 1713ff.).

dentoir (Contributor) commented Apr 16, 2018

Hi @Snurb

Thanks for revealing this. Indeed, I did not see this in the documentation. There is no sensible reason to split this metadata into a separate hierarchy. I'll look into it now and at least branch an experimental fix.

dentoir (Contributor) commented Apr 16, 2018

Hi @Snurb

I can confirm this issue and am quite baffled, as the official documentation of 'extended tweets' still mentions only media, and the API overview page (https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/extended-entities-object) also sends us to 'entities' (https://developer.twitter.com/en/docs/basics/getting-started). I'm curious how you uncovered this and where you found the correct specifications. I'm assuming this will relate to URLs as well?

dentoir (Contributor) commented Apr 16, 2018

Hi @Snurb

I've branched issue311, which tackles this problem by using the extended_tweet->entities hierarchy where the context demands it (see commit above). It should be tested further; I've tried it only with the onepercent stream, where it appears to work.

If TCAT is successfully patched, it should be feasible to use common/upgrade.php to scan existing (recent) tweets that may be affected, using a query similar to the one below (it could be further refined to limit false positives by looking only beyond the first 140 characters, and it should obviously handle mentions and URLs too).

    SELECT id, text, LENGTH(text),
           LENGTH(text) - LENGTH(REPLACE(text, '#', '')) AS nr_of_hashtags
    FROM some_bin_name_tweets
    WHERE LENGTH(text) > 140
      AND LENGTH(text) - LENGTH(REPLACE(text, '#', '')) > 0
    ORDER BY nr_of_hashtags DESC;

Snurb (Author) commented Apr 17, 2018

Thanks @dentoir! We’ll test the new fix and let you know whether it works for us.

We found the issue largely by accident, while testing a new TCAT setup that pushes gathered data (tweets, hashtags, and mentions exports) straight to BigQuery, where we join those three tables together. It turned out that there were more tweets containing # and @ symbols than tweets with hashtags and mentions captured in those respective tables, and further investigation showed that this consistently affected tweets longer than 140 characters. We then confirmed our hunch that there was a difference between the standard entities and the extended_tweet entities by capturing the JSON for the sample tweet above.

As you say, sadly Twitter just hasn’t documented this behaviour at all, so this is all based on experimentation only; I don’t think there are any official specs on this anywhere! I’d also be very interested to see how URLs behave in long tweets - we haven’t tested this fully yet.

(All we know for URLs is that in tweets longer than 140 characters with one or more URLs, one of those URLs is always contained in the truncated tweet - the surrounding tweet text is therefore truncated at 140 minus the length of the short URL. That URL is also provided in the standard entities > urls element. I assume that the extended_tweet > entities > urls element will instead be an array containing all URLs - but again, we haven’t tested that yet. So building in logic similar to that for hashtags, mentions, and media objects for the URLs as well seems like a good idea.)

Seems to me that the whole 280-character switch was introduced rather hurriedly...

Snurb (Author) commented Apr 17, 2018

@dentoir, another follow-up:

we've done some more testing on a number of TCATs, and the results look good: hashtags and @mentions are now being picked up reliably for all tweets, regardless of length.

Thanks for your fast work on this! Happy to test any further changes to the URL handling as well, of course...

dentoir (Contributor) commented Apr 17, 2018

Hi @Snurb

I've been testing the new method (again, on a small onepercent set) and I believe URLs end up correctly in the bin. Comparing the full text of a tweet with the extracted URLs is a bit complicated, because embedded media is present in the text as a plain URL but does not end up in extended_tweet->entities->urls in the JSON; it appears in extended_tweet->extended_entities->media instead.

From my diagnostics, if the tweet is <= 140 characters, the extended_tweet hierarchy does not exist at all in the JSON; it exists only for tweets > 140 characters.

To summarize, I believe the new code should get everything right for the streaming API: URLs, hashtags & mentions. I'll now test the state of affairs for the REST API, where I hope the issue does not exist at all, because we're using the tweet_mode=extended parameter (not available for the stream), which keeps the structure as it was but with 280-character support.

We could theoretically fully reconstruct mentions and hashtags from historical tweet texts (in the relevant period), but not embedded media, because image dimensions etc. will not be there. For an upgrade step, there are two options:

1. Reconstruct hashtags, URLs and mentions for tweets longer than 140 characters, and handle embedded media as a regular URL. This would be defensible because searching for media is not fully implemented in TCAT, and ideally we'd like to include the URLs table in searches, not just search for embedded media. The URL expander would automatically pick up references to https://pbs.twimg.com/media/*

2. Perform REST lookups for suspect tweets and re-insert them into the database.

The second alternative is more time-consuming but more precise and easier to implement, so I'm leaning towards that option. I've got initial upgrade code, which I hope to push later today.
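
In rough terms, such a lookup would be assembled like this (a hypothetical sketch rather than the actual upgrade code; OAuth signing and error handling are omitted):

    // Hypothetical sketch: re-fetch suspect tweets through the REST API with
    // tweet_mode=extended, which returns the full 280-character text in
    // full_text together with a single, complete entities object.
    $suspect_ids = array('985725247889518592');      // ids flagged by a query like the one above
    $query = http_build_query(array(
        'id'         => implode(',', $suspect_ids), // statuses/lookup accepts up to 100 ids per call
        'tweet_mode' => 'extended',
    ));
    $url = 'https://api.twitter.com/1.1/statuses/lookup.json?' . $query;
    // ... perform an OAuth-signed GET on $url with the configured REST API key,
    // json_decode() the response, and re-insert the returned tweets into the bin.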

dentoir (Contributor) commented Apr 17, 2018

Update: the capture code did not yet pick up embedded media from extended_tweet; this should now work. I've also added an experimental upgrade step to reconstruct entities by performing REST lookups on suspect tweets. All of this now needs testing.

The upgrade step can be performed in the usual way by running common/upgrade.php

I suggest testing the upgrade on arbitrary capture data, or copies of data, because it manipulates content (obviously).

Snurb (Author) commented Apr 22, 2018

Thanks @dentoir! Both of those options for reconstructing past tweets affected by these issues sound good to me, so I'm happy with whatever you think is best there...

I haven't had any chance to do more testing this past week, but will try to have a closer look at the URL handling when I get a moment...

brendam (Contributor) commented Apr 23, 2018

I've just upgraded a DMI-TCAT with stopped bins to the issue311 branch. I'm getting 400 errors when it tries to upgrade the old bins.

2018-04-23 06:29:01	Starting work on agchatoz
2018-04-23 06:29:01	performing lookup for 924 tweets
2018-04-23 06:29:01	Warning: API key 0 got response code 400

These bins probably don't need to be fixed, but thought I'd report it just in case it is a problem for other people.

dentoir (Contributor) commented Apr 24, 2018

Hi @brendam

Could you maybe verify whether you can use scripts such as search.php on that TCAT instance, by running a dummy query? I'd like to verify that your REST API key is actually correct. The REST API key needs to be defined in config.php (it can be the same key as used for streaming) for the upgrade to work.

brendam (Contributor) commented Apr 24, 2018

Yes, the API key was tested and correct. But I worked out today that the server was running Ubuntu 14 and an old PHP, so that might have been the source of the problem. I’ve built a new server and will delete the old one once I confirm the bins have been downloaded. I don’t think it’s worth spending more time on this unless someone else reports the problem on an up-to-date server.

dentoir (Contributor) commented Apr 30, 2018

I'm running the upgrade on multiple servers at the moment. If all appears to be normal, I hope to merge the issue311 branch into master tomorrow.

dentoir (Contributor) commented May 7, 2018

Hi all, another small fix in the upgrade process (MySQL connection stalls). For large bins, the upgrade is still much, much slower than expected. This is partially caused by an ineffective _media table index (issue #299), but also by updating the _urls table. It is probably best to merge now, as I see no easy way to optimize this process. At least it is non-blocking.

dentoir (Contributor) commented May 8, 2018

Hi all, this morning I merged the branch to fix this issue. I'll keep the issue open for several more days for feedback. The upgrade step now uses DISABLE KEYS and ENABLE KEYS to speed up updating rows in intermediate or large bins; the downside is a small lock time at the end of the upgrade. In addition, an OPTIMIZE TABLE remedies the potential negative effects of deleting/updating rows in MyISAM. The upgrade should be reasonably fast for recent or moderately sized bins.
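
Schematically, the pattern is the following (a rough sketch with hypothetical connection details and table name, not the literal upgrade code):

    // Schematic sketch of the MyISAM speed-up described above.
    $dbh = new PDO('mysql:host=localhost;dbname=twittercapture', 'user', 'password');
    // Suspend non-unique index maintenance while rows are updated in bulk ...
    $dbh->exec('ALTER TABLE some_bin_name_hashtags DISABLE KEYS');
    // ... run the per-tweet entity updates here ...
    $dbh->exec('ALTER TABLE some_bin_name_hashtags ENABLE KEYS'); // short lock while indexes rebuild
    // Defragment the table to undo the effects of heavy deleting/updating in MyISAM.
    $dbh->query('OPTIMIZE TABLE some_bin_name_hashtags');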
