Since Dcard added many anti-crawling restrictions after its update, this program crawls the Dcard boards on PTT Brain instead, which lists the top 200 posts for each Dcard tag.

- Clone this repo to your local machine:
```
git clone https://github.com/enjuichang/Dcard-Crawler.git
```
- Install `selenium`: remember to install the `selenium` package and to download `chromedriver`. If you have further questions, see the tutorial videos by Tech with Tim.
- Change the `URLs` variable: `URLs` defaults to a list of lists, where each inner list corresponds to one label in `labelLIST` (e.g., in `[[A,B],[C,D]]`, boards A and B map to one label, while C and D map to another).
- If you are building a dataset for supervised learning, change both `URLs` and `labelLIST`, keeping their lengths identical (`len(URLs) == len(labelLIST)`).
- If you only want to crawl, leave `labelLIST` as NA (`labelLIST = [np.nan]`) and use a single inner list in `URLs` (`URLs = [[A,B,C,D]]`). (A future update will add a function for unsupervised learning.)
- Change the output file name: edit the file name passed to `df.to_csv`; the file will be saved under `~/csv/`.
- Run the program.
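The configuration steps above can be sketched as follows. This is a minimal illustration, not the repo's actual code: the board URLs, label values, and output file name are placeholders, and the `DataFrame` here stands in for whatever the crawler actually collects.

```python
# Sketch of the URLs / labelLIST configuration described above.
# All board URLs, labels, and the output file name are placeholders.
import numpy as np
import pandas as pd

# Supervised case: each inner list of board URLs maps to one entry
# in labelLIST, so the two lists must have the same length.
URLs = [
    ["https://example.com/boardA", "https://example.com/boardB"],  # -> label "A"
    ["https://example.com/boardC", "https://example.com/boardD"],  # -> label "B"
]
labelLIST = ["A", "B"]
assert len(URLs) == len(labelLIST)

# Crawl-only case: a single inner list and an NA label.
URLs = [["https://example.com/boardA",
         "https://example.com/boardB",
         "https://example.com/boardC",
         "https://example.com/boardD"]]
labelLIST = [np.nan]
assert len(URLs) == len(labelLIST)

# Output step: change the file name passed to df.to_csv
# (the repo itself writes under ~/csv/).
df = pd.DataFrame({"text": ["example post"], "label": labelLIST})
df.to_csv("my_dcard_output.csv", index=False)
```

The length check mirrors the requirement `len(URLs) == len(labelLIST)` from the steps above; keeping it as an `assert` fails fast if the two lists drift out of sync.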