
Selenium crawler for Dcard, a popular social media website in Taiwan.

Dcard Crawler

Because Dcard added many crawling restrictions after its update, this program scrapes the Dcard boards on PTT Brain instead, which lists the top 200 posts for each Dcard board.

Getting Started

  1. Clone this repo from your terminal:
git clone https://github.com/enjuichang/Dcard-Crawler.git
  2. Install selenium: remember to install the selenium package and to download chromedriver. For further questions, see the tutorial video from Tech with Tim.
  3. Change the URLs variable:
    • URLs defaults to a list of lists, where each inner list corresponds to one label in labelLIST (e.g., in [[A,B],[C,D]], boards A and B map to one label while boards C and D map to another).
    • If building a dataset for supervised learning, change URLs and labelLIST together so that the two lists have exactly the same length (len(URLs) == len(labelLIST)).
    • If you only want plain crawling, leave labelLIST as NA (labelLIST = [np.nan]) and use a single list in URLs (URLs = [[A,B,C,D]]). (A future update will add a function for unsupervised learning.)
  4. Change the output file name: change the file name saved at df.to_csv; the file will be written to ~/csv/.
  5. Run the program.
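The URLs / labelLIST setup described in step 3 can be sketched as follows. The board URLs and label names below are hypothetical placeholders, not actual PTT Brain links:

```python
import numpy as np

# Supervised setup: each inner list of board URLs maps to one entry in
# labelLIST. All URLs and labels here are hypothetical placeholders.
URLs = [
    ["https://example.com/board-a", "https://example.com/board-b"],  # first label
    ["https://example.com/board-c", "https://example.com/board-d"],  # second label
]
labelLIST = ["label_one", "label_two"]

# The two lists must have exactly the same length: one label per URL group.
assert len(URLs) == len(labelLIST)

# Plain crawling (no labels): a single URL group, with labelLIST left as NA.
URLs_unlabeled = [["https://example.com/board-a", "https://example.com/board-b"]]
labelLIST_unlabeled = [np.nan]
assert len(URLs_unlabeled) == len(labelLIST_unlabeled)
```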
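Step 4 can be sketched like this, assuming the scraped posts end up in a pandas DataFrame named df; the column names and output file name are hypothetical examples:

```python
import os
import pandas as pd

# Hypothetical scraped result; the real columns come from the crawler.
df = pd.DataFrame(
    {"title": ["post 1", "post 2"], "label": ["label_one", "label_one"]}
)

# The script saves under ~/csv/; change the file name passed to to_csv.
out_dir = os.path.expanduser("~/csv")
os.makedirs(out_dir, exist_ok=True)
df.to_csv(os.path.join(out_dir, "dcard_posts.csv"), index=False)
```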

