From 0ebca8ef57e21621f0c6a3d0bc77abf9d5e11b4b Mon Sep 17 00:00:00 2001
From: Ahmet Taspinar
Date: Sun, 6 May 2018 07:06:33 +0200
Subject: [PATCH 1/3] Update README.rst

- Change the mention of Chinese --> Japanese. This fixes https://github.com/taspinar/twitterscraper/issues/111
- Add not being able to scrape retweets to the to do list.
---
 README.rst | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/README.rst b/README.rst
index 03a5f29..7f7c4bc 100644
--- a/README.rst
+++ b/README.rst
@@ -169,7 +169,7 @@ contents of the output file will look like:
 ---------------------------
 
 In order to correctly handle all possible characters in the tweets
-(think of chinese or arabic characters), the output is saved as utf-8
+(think of Japanese/Chinese or Arabic characters), the output is saved as utf-8
 encoded bytes. That is why you could see text like
 "":raw-latex:`\u3`0b1:raw-latex:`\u3`0f3:raw-latex:`\u3`055:raw-latex:`\u3`07e:raw-latex:`\u3`0fe
 ..." in the output file.
@@ -177,13 +177,14 @@ encoded bytes. That is why you could see text like
 What you should do is open the file with the proper encoding:
 
 .. figure:: https://user-images.githubusercontent.com/4409108/30702318-f05bc196-9eec-11e7-8234-a07aabec294f.PNG
-   :alt: Example of output with chinese characters
+   :alt: Example of output with Japanese characters
 
-   Example of output with chinese characters
+   Example of output with Japanese characters
 
 TO DO
 =====
 
+- Twitterscraper can not retrieve retweets.
 - Add caching potentially? Would be nice to be able to resume scraping
   if something goes wrong and have half of the data of a request cached
   or so.
From 0e06fcdfc1776017553efe0c56aeaf662f261754 Mon Sep 17 00:00:00 2001
From: Ahmet Taspinar
Date: Sun, 6 May 2018 07:08:16 +0200
Subject: [PATCH 2/3] Update README.rst

---
 README.rst | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/README.rst b/README.rst
index 7f7c4bc..d10f265 100644
--- a/README.rst
+++ b/README.rst
@@ -171,8 +171,7 @@ contents of the output file will look like:
 In order to correctly handle all possible characters in the tweets
 (think of Japanese/Chinese or Arabic characters), the output is saved as utf-8
 encoded bytes. That is why you could see text like
-"":raw-latex:`\u3`0b1:raw-latex:`\u3`0f3:raw-latex:`\u3`055:raw-latex:`\u3`07e:raw-latex:`\u3`0fe
-..." in the output file.
+"\u30b1 \u30f3 \u3055 \u307e \u30fe ..." in the output file.
 
 What you should do is open the file with the proper encoding:
 
From 9a3b3bf0c7d6b7fba9720cbf8e410a24b4535234 Mon Sep 17 00:00:00 2001
From: Ahmet Taspinar
Date: Sun, 6 May 2018 07:10:24 +0200
Subject: [PATCH 3/3] Update README.rst

---
 README.rst | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/README.rst b/README.rst
index d10f265..18813e9 100644
--- a/README.rst
+++ b/README.rst
@@ -169,14 +169,13 @@ contents of the output file will look like:
 ---------------------------
 
 In order to correctly handle all possible characters in the tweets
-(think of Japanese/Chinese or Arabic characters), the output is saved as utf-8
+(think of Japanese or Arabic characters), the output is saved as utf-8
 encoded bytes. That is why you could see text like
 "\u30b1 \u30f3 \u3055 \u307e \u30fe ..." in the output file.
 
 What you should do is open the file with the proper encoding:
 
 .. figure:: https://user-images.githubusercontent.com/4409108/30702318-f05bc196-9eec-11e7-8234-a07aabec294f.PNG
-   :alt: Example of output with Japanese characters
 
    Example of output with Japanese characters
 
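
Review note: the README text in these patches tells readers to "open the file with the proper encoding" but only shows a screenshot. A minimal Python sketch of what that could look like is below. It assumes the output file is JSON in which non-ASCII characters are written as ``\uXXXX`` escapes; the file name ``tweets_sample.json`` and the ``text`` key are hypothetical and used only for illustration.

```python
import json
import os
import tempfile

# The escaped sample quoted in the README text, as it would appear in the file.
escaped = r'"\u30b1 \u30f3 \u3055 \u307e \u30fe"'

# Parsing the JSON string turns the \uXXXX escapes back into real characters.
text = json.loads(escaped)

# Hypothetical output file, created here only so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "tweets_sample.json")

# json.dump escapes non-ASCII characters by default (ensure_ascii=True),
# which is one way escaped text like the sample above ends up on disk.
with open(path, "w", encoding="utf-8") as f:
    json.dump({"text": text}, f)

# Reading the file as plain text still shows the escapes ...
with open(path, encoding="utf-8") as f:
    raw = f.read()

# ... while opening it as UTF-8 and parsing it as JSON restores the characters.
with open(path, encoding="utf-8") as f:
    decoded = json.load(f)["text"]

print(raw)
print(decoded)
```

Passing ``encoding="utf-8"`` explicitly avoids depending on the platform's default encoding, which is not UTF-8 on every system.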