Since Dcard added many anti-crawling restrictions after its update, this program crawls the Dcard boards on PTT Brain instead, which lists the top 200 posts for each Dcard tag.

- Clone this repo to your local machine:
```
git clone https://github.com/enjuichang/Dcard-Crawler.git
```
- Install `selenium`: remember to install the `selenium` package and to download `chromedriver`. If you have further questions, see the tutorial videos by Tech with Tim.
- Change the `URLs` variable: `URLs` defaults to a list of lists, where each inner list corresponds to one label in `labelLIST` (e.g., in `[[A,B],[C,D]]`, boards A and B map to one label, while C and D map to another).
- If you are building a dataset for supervised learning, change both `URLs` and `labelLIST`, keeping their lengths identical (`len(URLs) == len(labelLIST)`).
- If you only want to crawl, leave `labelLIST` as NA (`labelLIST = [np.nan]`) and use a single inner list in `URLs` (`URLs = [[A,B,C,D]]`). (A future update will add a function for unsupervised learning.)
- Change the output file name: edit the file name passed to `df.to_csv`; the file will be saved under `~/csv/`.
- Run the program.
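The configuration steps above can be sketched as follows. This is a minimal illustration, not the repo's actual code: the board URLs, label values, and output file name are placeholders, and the `DataFrame` here stands in for whatever the crawler actually collects.

```python
# Sketch of the URLs / labelLIST configuration described above.
# All board URLs, labels, and the output file name are placeholders.
import numpy as np
import pandas as pd

# Supervised case: each inner list of board URLs maps to one entry
# in labelLIST, so the two lists must have the same length.
URLs = [
    ["https://example.com/boardA", "https://example.com/boardB"],  # -> label "A"
    ["https://example.com/boardC", "https://example.com/boardD"],  # -> label "B"
]
labelLIST = ["A", "B"]
assert len(URLs) == len(labelLIST)

# Crawl-only case: a single inner list and an NA label.
URLs = [["https://example.com/boardA",
         "https://example.com/boardB",
         "https://example.com/boardC",
         "https://example.com/boardD"]]
labelLIST = [np.nan]
assert len(URLs) == len(labelLIST)

# Output step: change the file name passed to df.to_csv
# (the repo itself writes under ~/csv/).
df = pd.DataFrame({"text": ["example post"], "label": labelLIST})
df.to_csv("my_dcard_output.csv", index=False)
```

The length check mirrors the requirement `len(URLs) == len(labelLIST)` from the steps above; keeping it as an `assert` fails fast if the two lists drift out of sync.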