TCAT fails to extract all hashtags / mentions in tweets longer than 140 characters #311
Hi @Snurb Thanks for revealing this. Indeed, I did not see this in the documentation. There is no sensible reason to split this metadata into a separate hierarchy. I'll look into it now and at least branch an experimental fix.
Hi @Snurb I can confirm this issue and indeed am quite baffled, as the official documentation of 'extended tweets' still mentions only media, and the API overview page (https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/extended-entities-object) also sends us to 'entities' (https://developer.twitter.com/en/docs/basics/getting-started). I'm curious how you uncovered this and where you found the correct specifications. I'm assuming this will relate to URLs as well?
Hi @Snurb I've created an issue311 branch which tackles this problem by using the extended_tweet->entities hierarchy where the context demands it (see commit above). This still needs testing; I've tried it only on a onepercent stream, where it appears to work. Once TCAT is successfully patched, it should be feasible to use common/upgrade.php to scan existing (recent) tweets which may be affected, using a query similar to the one below (this could be further refined to limit false positives by looking only beyond the first 140 characters, and it should obviously handle mentions and URLs too).
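The query itself was not preserved in this copy of the thread. As a rough stand-in, the detection idea described (only tweets with a `#` or `@` beyond the first 140 characters can be affected) can be sketched in Python; the helper name is hypothetical and not from the issue:

```python
def is_suspect(tweet_text: str) -> bool:
    """Flag tweets whose text beyond the first 140 characters contains
    a '#' or '@' -- these may have hashtags/mentions missed by the
    pre-fix extraction code. (Hypothetical helper, not TCAT code.)"""
    tail = tweet_text[140:]
    return "#" in tail or "@" in tail

print(is_suspect("x" * 150 + " #latehashtag"))  # True
print(is_suspect("short tweet #early"))         # False
```

A SQL version of the same idea would scan the tweets table with a substring condition rather than loading each text into application code.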
Thanks @dentoir! We'll test the new fix and let you know whether it works for us. We found the issue largely by accident, while testing a new TCAT setup that pushes gathered data (tweets, hashtags, and mentions exports) straight to BigQuery, where we join those three tables together. We found that there were more tweets containing # and @ symbols than tweets with hashtags and mentions captured in those respective tables, and further investigation showed that this consistently affected tweets longer than 140 characters. We then confirmed our hunch that there was a difference between the standard `entities` element and the one nested inside `extended_tweet`. As you say, sadly Twitter just hasn't documented this behaviour at all, so this is all based on experimentation only; I don't think there are any official specs on this anywhere! I'd also be very interested to see how URLs behave in long tweets - we haven't tested this fully yet. (All we know for URLs is that in tweets longer than 140 characters with one or more URLs, one of those URLs is always contained in the truncated tweet - and the surrounding tweet text is therefore truncated at 140-[length of short URL] characters. That URL is also provided in the standard `entities` element.) Seems to me that the whole 280-character switch was introduced rather hurriedly...
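The truncation arithmetic described in the parenthetical above is simple to state explicitly. The 23-character length for t.co short URLs is an assumption here, not something stated in the thread:

```python
SHORT_URL_LEN = 23  # typical t.co short-URL length (an assumption)

def truncation_point(short_url_len: int = SHORT_URL_LEN) -> int:
    # Per the observation above: the truncated `text` keeps
    # 140 - len(short URL) characters of prose, with the short URL
    # itself making up the remainder of the 140-character limit.
    return 140 - short_url_len

print(truncation_point())  # 117
```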
@dentoir, another follow-up: we've done some more testing on a number of TCATs, and the results look good - it looks like hashtags and @mentions are now being picked up reliably for all tweets, regardless of length. Thanks for your fast work on this! Happy to test any further changes to the URL handling as well, of course...
Hi @Snurb I've been testing the new method (again, on a small onepercent set) and I believe URLs end up correctly in the bin. It's a bit complicated comparing the full text of a tweet with the URLs extracted, because embedded media in a tweet is present in the text as a plain URL, but it does not end up in the JSON in extended_tweet->entities->urls; it ends up in extended_tweet->extended_entities->media instead. From my diagnostics, if the tweet is <= 140 characters, the extended_tweet hierarchy will not exist at all in the JSON; it exists only for tweets > 140 characters. To summarize, I believe the new code should get everything right for the streaming API: URLs, hashtags and mentions.

I'll now test the state of affairs for the REST API, for which I hope the issue does not exist at all, because there we use a tweet_mode=extended parameter (not available for the stream), which keeps the structure as it was but with 280-character support. We could theoretically fully reconstruct mentions and hashtags from historical tweet texts (in the relevant period), but not embedded media, because image dimensions etc. will not be there.

For an upgrade step, there are two options:

1. Reconstruct hashtags, URLs and mentions for tweets longer than 140 characters, and handle embedded media as a regular URL. This would be defensible because searching for media is not fully implemented in TCAT, and ideally we'd like to include the URLs table in searches so as not to search only for embedded media. The URL expander would automatically pick up references to https://pbs.twimg.com/media/*
2. Perform REST lookups for suspect tweets and re-insert them into the database.

The second alternative is more time-consuming but more precise and easier to implement, so I'm leaning towards that option. I've got initial upgrade code, which I hope to push later today.
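The extraction rules described above (extended_tweet present only for tweets over 140 characters; embedded media under extended_entities rather than entities->urls) can be summarized in a short sketch. This is an illustration of the behaviour described in this thread, rendered in Python rather than TCAT's PHP, and not the project's actual code:

```python
def extract_entities(tweet: dict) -> dict:
    """Pick the entity source per the behaviour described above:
    tweets > 140 characters carry their complete entities under
    extended_tweet; shorter tweets carry them at the top level.
    Embedded media sit under extended_entities, not entities->urls."""
    if "extended_tweet" in tweet:
        ext = tweet["extended_tweet"]
        entities = ext.get("entities", {})
        media = ext.get("extended_entities", {}).get("media", [])
    else:
        entities = tweet.get("entities", {})
        media = tweet.get("extended_entities", {}).get("media", [])
    return {
        "hashtags": entities.get("hashtags", []),
        "user_mentions": entities.get("user_mentions", []),
        "urls": entities.get("urls", []),
        "media": media,
    }
```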
Update: the capture code did not yet get the embedded media from extended_tweet; this should now work. I've also added an experimental upgrade step to reconstruct entities by performing REST lookups on suspect tweets. All this now needs testing. The upgrade step can be performed in the usual way, by running common/upgrade.php. I suggest testing the upgrade on arbitrary capture data, or on copies of data, because it manipulates content (obviously).
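For reference, the REST lookup described above relies on the tweet_mode=extended parameter mentioned earlier in the thread. A minimal sketch of the request parameters such an upgrade step might build, assuming the v1.1 statuses/lookup endpoint (the helper name is hypothetical; the real logic lives in common/upgrade.php):

```python
def lookup_params(tweet_ids):
    """Build query parameters for a Twitter v1.1 statuses/lookup call
    with tweet_mode=extended, which returns the untruncated structure.
    The endpoint accepts at most 100 ids per request.
    (Hypothetical sketch, not TCAT's actual upgrade code.)"""
    ids = list(tweet_ids)
    if len(ids) > 100:
        raise ValueError("statuses/lookup accepts at most 100 ids")
    return {
        "id": ",".join(str(i) for i in ids),
        "tweet_mode": "extended",
    }

print(lookup_params([1234567890, 9876543210]))
```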
Thanks @dentoir! Both those options for reconstructing past tweets affected by those issues sound good to me, so I'm happy with whatever you think is best there... I haven't had any chance to do more testing this past week, but will try to have a closer look at the URL handling when I get a moment...
I've just upgraded a DMI-TCAT with stopped bins to the issue311 branch. I'm getting 400 errors when it tries to upgrade the old bins.
These bins probably don't need to be fixed, but I thought I'd report it just in case it's a problem for other people.
Hi @brendam Could you maybe verify whether you can use scripts such as search.php on that TCAT instance, by running a dummy query? I'd like to verify that your REST API key is actually correct. The REST API key needs to be defined in config.php (it can be the same key as used for streaming) for the upgrade to work.
Yes, the API key was tested and correct. But I worked out today that the server was Ubuntu 14 with an old PHP, so that might have been the source of the problem. I've built a new server and will delete the old one once I confirm the bins have been downloaded. I don't think it's worth spending more time on this unless someone else reports the problem on an up-to-date server.
I'm running the upgrade on multiple servers at the moment. If all appears to be normal, I hope to merge the issue311 branch into master tomorrow.
Hi all, another small fix in the upgrade process (MySQL connection stalls). For large bins, the upgrade is still much, much slower than expected. This is partially caused by an ineffective _media table index (issue #299), but also by updating the _urls table. It is probably best to merge now, as I see no easy way to optimize this process. At least it is non-blocking.
Hi all, this morning I've merged the branch to fix this issue. I'll keep the issue open for several more days for feedback. The upgrade step now uses an
There is a critical issue with TCAT's handling of tweets that are longer than 140 characters when they are received via the streaming API.
TCAT now uses the `full_text` attribute from the `extended_tweet` JSON element, rather than the simple `text` attribute, for the tweet text; this is correct and avoids the truncation of tweets to 140 characters. However, to extract hashtags and mentions it still relies on the `hashtags` and `user_mentions` arrays in the default `entities` JSON element - but these contain only the hashtags and mentions that are visible in the truncated 140-character version of the tweet.

Instead, TCAT should use the `hashtags` and `user_mentions` arrays from the `entities` sub-element within the `extended_tweet` element - only these contain all hashtags and mentions. (This is very badly documented on Twitter's developer site.)

Below is a sample JSON from the streaming API that demonstrates the issue (user data removed to save space). Note that all hashtags and mentions in the original 280-character tweet occur after the 140-character mark, and are only visible in the `full_text` version of the tweet text. As a result, they are also only contained in the `extended_tweet` > `entities` > `hashtags` / `user_mentions` arrays, while the `entities` > `hashtags` / `user_mentions` arrays at the higher level of the JSON hierarchy remain empty.

Sample JSON:
Suggested fix: the problem is in `capture/common/functions.php`, lines 1775 and 1776 of the current revision, where the arrays for hashtags and user_mentions are extracted from the JSON. Each of these assignments will need to be wrapped in code similar to how the extended information for media objects is already handled.

Unfortunately we've been unable to implement a fix successfully so far, so I'm reporting this issue now - I will post an update if we manage to fix it on our end.
In principle, it seems that on line 1775,
should become
(and the same with the hashtags on 1776), but this doesn't seem to work. Perhaps we're missing something to do with the multiple nested arrays.
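The wrapping described above amounts to preferring the extended hierarchy when it exists and falling back to the top-level arrays otherwise. A Python rendering of that conditional for clarity (TCAT itself is PHP, and the names here mirror the JSON, not functions.php):

```python
def entity_array(tweet: dict, kind: str) -> list:
    """Return the complete entity array for `kind` ('hashtags',
    'user_mentions', or 'urls'): prefer extended_tweet->entities,
    which alone is complete for tweets > 140 characters, and fall
    back to the top-level entities for short tweets.
    (Illustrative sketch of the suggested fix, not TCAT code.)"""
    ext = tweet.get("extended_tweet")
    if ext and "entities" in ext:
        return ext["entities"].get(kind, [])
    return tweet.get("entities", {}).get(kind, [])
```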
Hope this helps address this issue. Note that this error is quite critical: any TCAT running the current revision will systematically underestimate the volume of hashtags and mentions in the data, potentially by a significant margin. And worse, because some hashtags and mentions are being captured, it does not look as if anything is obviously wrong...
PS: we haven't explored this yet, but the same issue presumably also applies to URLs contained in the tweet text. Here, too, TCAT uses `entities` > `urls` rather than `extended_tweet` > `entities` > `urls` (see line 1713ff.).