-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathPROBLEM.txt
60 lines (41 loc) · 2.33 KB
/
PROBLEM.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
The following exercise is composed of both mandatory requirements and bonus parts.
*Please make sure to read it completely before beginning to write code.*
The exercise is as follows:
Write a simple web crawler (https://en.wikipedia.org/wiki/Web_crawler) in Python 3.
The crawler should crawl the site: https://pastebin.com/ and should store the most recent "pastes" in a structured data model.
A paste model must have the following parameters:
- Author - String
- Title - String
- Content - String
- Date - Date
You'll need to gather each one of the new "pastes" from pastebin, and parse it into the above structure.
The code must be self managed. It should crawl the site every 2 minutes and look for any new pastes to save.
For the most basic form of this exercise it will suffice to keep each paste in the above format in a different text file in a directory of your choosing.
Following are the bonus parts for this exercise:
Bonus #1
----
Each one of the paste model's parameters must be normalized.
For example:
* Author - In cases it's Guest, Unknown, Anonymous, etc... the author name must be the same, for example: "" (empty string)
* Title - Same as with Author.
* Date - UTC Date
* Content - Must be stripped of trailing spaces.
Bonus #2
----
Store each one of the pastes in an organized database. It could be executed with anything from a local tinydb / SQLite database to a mongo db docker container.
Bonus #3
----
Ship your crawler in a Docker image. This will allow it to run on any platform and any computer without any necessary installations. docker-compose and docker swarm solutions are also welcome.
See https://www.docker.com/ for more details.
General Notes (Apply with or without the bonus parts):
---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
The code must be supplied with a README.md that explains how to create the environment for the code to run.
Emphasis should be placed on the following:
- Write clean code. Do not document unnecessary lines, (you can consult the "Zen of Python" - https://gist.github.com/evandrix/2030615)
- Use Object Oriented Methodologies
- The code must be readable, and the directory tree must be chosen wisely.
We strongly recommend using the following libraries:
- requests (http client)
- lxml (html parsing)
- tinydb (storage)
- arrow (dates)