googling4data

Autonomous web text crawling (googling) for big data (natural language processing)

For what

Given a query string (e.g., "apple"), this code (1) searches Google for the string, (2) retrieves the resulting HTML pages, (3) extracts the visible text from each page, and (4) compresses all the text into a zip file.
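As a rough illustration of steps (3) and (4), here is a minimal sketch that uses only the Python standard library. It is not this repository's implementation: `html.parser` stands in for html2text, and the names `VisibleTextParser`, `extract_visible_text`, and `zip_texts`, as well as the `0000.txt` naming scheme, are hypothetical.

```python
import io
import zipfile
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping text inside tags that are not rendered."""

    # Assumption: these tags are treated as "not visible" for this sketch.
    HIDDEN_TAGS = {"script", "style", "head", "title", "noscript"}

    def __init__(self):
        super().__init__()
        self._hidden_depth = 0  # > 0 while inside a hidden tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._hidden_depth == 0:
            self.chunks.append(text)


def extract_visible_text(html):
    """Step (3): return the visible text of an HTML page, one chunk per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def zip_texts(texts, zip_path):
    """Step (4): write each text as NNNN.txt into a compressed zip archive.

    zip_path may be a file name or a binary file-like object.
    """
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for index, text in enumerate(texts):
            zf.writestr(f"{index:04d}.txt", text)
```

For example, `zip_texts([extract_visible_text(page) for page in pages], "apple.zip")` would produce one text file per page, assuming `pages` is a list of HTML strings that have already been downloaded.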

How to use

prerequisite

To run the code, you need a key.json file, which this repository does not include. It should have the format below, with your own values filled in. Google requires both values.

{
    "api_key": "your-google-api-key",
    "cse_id": "your-cse-id"
}
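A file in this format could be read and checked before any request is made. The following is only a sketch under assumptions: the function name `load_credentials` and the validation behavior are hypothetical and not taken from this repository.

```python
import json

# The two fields shown in the key.json format above.
REQUIRED_KEYS = ("api_key", "cse_id")


def load_credentials(path="key.json"):
    """Read key.json and return (api_key, cse_id).

    Raises ValueError if either field is missing or empty.
    """
    with open(path, encoding="utf-8") as f:
        credentials = json.load(f)
    missing = [key for key in REQUIRED_KEYS if not credentials.get(key)]
    if missing:
        raise ValueError(f"{path} is missing: {', '.join(missing)}")
    return credentials["api_key"], credentials["cse_id"]
```

Failing early on a missing field gives a clearer error than a rejected request later on.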

I referred to http://stackoverflow.com/questions/37083058/programmatically-searching-google-in-python-using-custom-search.

pip install google-api-python-client
pip install html2text
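With google-api-python-client installed, the search step might look roughly like the sketch below, which follows the approach in the Stack Overflow question linked above. The function names `google_search` and `extract_links` are hypothetical, and this is not necessarily how this repository does it. It assumes the Custom Search JSON API response shape, where results are listed under `items` and each result has a `link` field.

```python
def extract_links(response):
    """Return the result URLs from a Custom Search response dict.

    A response with no results has no "items" key, so default to an empty list.
    """
    return [item["link"] for item in response.get("items", [])]


def google_search(query, api_key, cse_id, num=10):
    """Search Google through the Custom Search API and return result URLs.

    num is the number of results to request (the API allows at most 10 per call).
    """
    # Imported here so extract_links can be used without the package installed.
    from googleapiclient.discovery import build

    service = build("customsearch", "v1", developerKey=api_key)
    response = service.cse().list(q=query, cx=cse_id, num=num).execute()
    return extract_links(response)
```

A call such as `google_search("apple", api_key, cse_id)` would then return up to ten URLs, using the `api_key` and `cse_id` values from key.json.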

If there is a simpler or easier way to do this, please enlighten me.

example

** This is still under construction.

If this helps, please add a star for me ;)