Guided tour - interactive tutorial complete version (#27)
* guided tour

* QA based on the interactive tutorial
daigotanaka authored Aug 6, 2020
1 parent 69540bb commit c1078f4
Showing 41 changed files with 2,205 additions and 1,007 deletions.
224 changes: 224 additions & 0 deletions docs/01_run_local.md
@@ -0,0 +1,224 @@
## Running a task locally

In this section, we will learn how to run a task with handoff locally,
using project 01_word_count.



Each project directory contains:

```
> ls -l 01_word_count
```
```
files
project.yml
```


project.yml looks like:

```
> cat 01_word_count/project.yml
```

```
commands:
  - command: cat
    args: "./files/the_great_dictator_speech.txt"
  - command: wc
    args: "-w"
envs:
  - key: TITLE
    value: "The Great Dictator"


Here,

- `commands` lists the commands and arguments.
- `envs` lists the environment variables.
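
As a rough illustration of the `envs` section, the sketch below mirrors the project file as a plain Python dictionary and merges its `envs` entries into an environment mapping. The `PROJECT` dictionary and the `build_environment` helper are hypothetical names used only for this example; they are not part of handoff, and handoff's own handling may differ.

```python
import os

# Hypothetical in-memory mirror of 01_word_count/project.yml (not read from disk).
PROJECT = {
    "commands": [
        {"command": "cat", "args": "./files/the_great_dictator_speech.txt"},
        {"command": "wc", "args": "-w"},
    ],
    "envs": [{"key": "TITLE", "value": "The Great Dictator"}],
}


def build_environment(project, base_env=None):
    """Return a copy of base_env with the project's `envs` entries added."""
    # Fall back to the current process environment when no base is given.
    env = dict(os.environ if base_env is None else base_env)
    for item in project.get("envs", []):
        env[item["key"]] = str(item["value"])
    return env
```

Calling `build_environment(PROJECT, {})` returns a mapping with a single `TITLE` entry, which could then be passed to a subprocess as its environment.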


The example from 01_word_count runs the command-line equivalent of:

```
cat ./files/the_great_dictator_speech.txt | wc -w
```
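
To make the idea of chaining concrete, here is a minimal Python sketch that feeds each command's stdout into the next command's stdin. It is an illustration only, not handoff's implementation: `run_pipeline` is a hypothetical helper, and it buffers each stage's output in memory instead of streaming it through a real pipe.

```python
import shlex
import subprocess


def run_pipeline(commands, input_bytes=None):
    """Run commands in order, passing each stdout to the next stdin.

    `commands` is a list of dicts with a "command" key and an optional
    "args" key, shaped like the `commands` entries in project.yml.
    """
    data = input_bytes
    for entry in commands:
        # Split the command and its arguments into an argv list.
        argv = shlex.split(entry["command"]) + shlex.split(entry.get("args", ""))
        completed = subprocess.run(
            argv, input=data, stdout=subprocess.PIPE, check=True
        )
        data = completed.stdout
    return data
```

For example, passing `[{"command": "cat"}, {"command": "wc", "args": "-w"}]` together with some input bytes returns the word count produced by `wc`.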

Now let's run it. Try entering the command below:

```
> handoff --project 01_word_count --workspace workspace run local
```
```
INFO - 2020-08-06 03:35:12,691 - handoff.config - Reading configurations from 01_word_count/project.yml
INFO - 2020-08-06 03:35:12,693 - handoff.config - Setting environment variables from config.
INFO - 2020-08-06 03:35:12,771 - botocore.credentials - Found credentials in shared credentials file: ~/.aws/credentials
INFO - 2020-08-06 03:35:13,056 - handoff.config - You have the access to AWS resources.
WARNING - 2020-08-06 03:35:13,123 - handoff.config - Environment variable HO_BUCKET is not set. Remote file read/write will fail.
INFO - 2020-08-06 03:35:13,123 - handoff.config - Writing configuration files in the workspace configuration directory workspace/config
INFO - 2020-08-06 03:35:13,123 - handoff.config - Copying files from the local project directory 01_word_count
INFO - 2020-08-06 03:35:13,124 - handoff.config - Running run local in workspace directory
INFO - 2020-08-06 03:35:13,124 - handoff.config - Job started at 2020-08-06 03:35:13.124542
INFO - 2020-08-06 03:35:13,130 - handoff.config - Job ended at 2020-08-06 03:35:13.130391
```


If you see output that looks like:

```
INFO - 2020-08-03 04:51:01,971 - handoff.config - Reading configurations from 01_word_count/project.yml
...
INFO - 2020-08-03 04:51:02,690 - handoff.config - Processed in 0:00:00.005756
```


Then great! You just ran the first local test. It created a workspace
directory that looks like:

```
> ls -l workspace
```
```
artifacts
config
files
```

And the word count is stored at workspace/artifacts/state. Here is the content:

```
> cat workspace/artifacts/state
```

```
644
```
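
For readers who want to cross-check the number, the sketch below counts whitespace-delimited words in Python. It assumes that, for plain text like this speech, splitting on whitespace matches what `wc -w` counts; `count_words` is a hypothetical helper, not part of handoff.

```python
def count_words(lines):
    """Count whitespace-delimited words across an iterable of text lines."""
    return sum(len(line.split()) for line in lines)
```

An open text file is an iterable of lines, so it can be passed to `count_words` directly.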


By the way, the example text is from Charlie Chaplin's awesome speech
in the movie The Great Dictator.

Here is a link to the famous speech scene.
Check it out on YouTube: https://www.youtube.com/watch?v=J7GY1Xg6X20



And here are the first few paragraphs of the text:

```
I’m sorry, but I don’t want to be an emperor. That’s not my business. I don’t want to rule or conquer anyone. I should like to help everyone - if possible - Jew, Gentile - black man - white. We all want to help one another. Human beings are like that. We want to live by each other’s happiness - not by each other’s misery. We don’t want to hate and despise one another. In this world there is room for everyone. And the good earth is rich and can provide for everyone. The way of life can be free and beautiful, but we have lost the way.
Greed has poisoned men’s souls, has barricaded the world with hate, has goose-stepped us into misery and bloodshed. We have developed speed, but we have shut ourselves in. Machinery that gives abundance has left us in want. Our knowledge has made us cynical. Our cleverness, hard and unkind. We think too much and feel too little. More than machinery we need humanity. More than cleverness we need kindness and gentleness. Without these qualities, life will be violent and all will be lost….
```


Now to the second example. This time project.yml looks like:

```
> cat 02_collect_stats/project.yml
```

```
commands:
  - command: cat
    args: ./files/the_great_dictator_speech.txt
  - command: python files/stats_collector.py
  - command: wc
    args: -w


...which is the shell equivalent of

```
cat ./files/the_great_dictator_speech.txt | python ./files/stats_collector.py | wc -w
```

The script for the second command, stats_collector.py, can be found in the
02_collect_stats/files directory. It is a Python script that looks like:


```
> cat 02_collect_stats/files/stats_collector.py
```

```
#!/usr/bin/python
import io, json, logging, sys, os

LOGGER = logging.getLogger()


def collect_stats(outfile):
    """
    Read from stdin and count the lines. Output to a file after done.
    """
    lines = io.TextIOWrapper(sys.stdin.buffer, encoding="utf-8")
    output = {"rows_read": 0}
    for line in lines:
        try:
            o = json.loads(line)
            print(json.dumps(o))
            if o["type"].lower() == "record":
                output["rows_read"] += 1
        except json.decoder.JSONDecodeError:
            print(line)
            output["rows_read"] += 1
    with open(outfile, "w") as f:
        json.dump(output, f)
        f.write("\n")


if __name__ == "__main__":
    collect_stats("artifacts/collect_stats.json")

The script reads from stdin and counts the lines while passing the raw input to stdout.
The raw text is then processed by the third command (wc -w), which counts the number of words.
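
The counting rule can be tried out without stdin using the sketch below. It follows the same logic as the script (non-JSON lines always count, JSON lines count only when their `type` is `record`), with one labeled difference: it skips JSON values that are not objects, which the script does not guard against. `count_rows` is a hypothetical helper written for this illustration.

```python
import json


def count_rows(lines):
    """Return (passed_through_lines, rows_read) for an iterable of lines."""
    passed, rows = [], 0
    for line in lines:
        try:
            message = json.loads(line)
        except json.JSONDecodeError:
            # Plain text (including blank lines) is passed through and counted.
            passed.append(line.rstrip("\n"))
            rows += 1
            continue
        passed.append(json.dumps(message))
        # Assumption: only JSON objects with type "record" count as rows.
        if isinstance(message, dict) and str(message.get("type", "")).lower() == "record":
            rows += 1
    return passed, rows
```

Blank lines are not valid JSON, so they fall into the plain-text branch and are counted as rows too.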



Now let's run it. Try entering the command below:

```
> handoff --project 02_collect_stats --workspace workspace run local
```
```
INFO - 2020-08-06 03:35:13,401 - handoff.config - Reading configurations from 02_collect_stats/project.yml
INFO - 2020-08-06 03:35:13,402 - handoff.config - Setting environment variables from config.
INFO - 2020-08-06 03:35:13,481 - botocore.credentials - Found credentials in shared credentials file: ~/.aws/credentials
INFO - 2020-08-06 03:35:13,765 - handoff.config - You have the access to AWS resources.
WARNING - 2020-08-06 03:35:13,830 - handoff.config - Environment variable HO_BUCKET is not set. Remote file read/write will fail.
INFO - 2020-08-06 03:35:13,830 - handoff.config - Writing configuration files in the workspace configuration directory workspace/config
INFO - 2020-08-06 03:35:13,830 - handoff.config - Copying files from the local project directory 02_collect_stats
INFO - 2020-08-06 03:35:13,831 - handoff.config - Running run local in workspace directory
INFO - 2020-08-06 03:35:13,831 - handoff.config - Job started at 2020-08-06 03:35:13.831683
INFO - 2020-08-06 03:35:13,881 - handoff.config - Job ended at 2020-08-06 03:35:13.881507
INFO - 2020-08-06 03:35:13,881 - handoff.config - Processed in 0:00:00.049824
```

Let's check the output written by the second command:


```
> cat workspace/artifacts/collect_stats.json
```

```
{"rows_read": 15}
```


In the next section, we will try pulling currency exchange rate data.
You will also learn how to create a Python virtual environment for each command
and pip-install the commands.

130 changes: 130 additions & 0 deletions docs/02_exchange_rates.md
@@ -0,0 +1,130 @@
## Virtual environment and install

In this section, we will retrieve currency exchange rates and write them out
to a CSV file.

We will install singer.io (https://singer.io), a data collection framework,
in a Python virtual environment.



We will use the 03_exchange_rates project. project.yml looks like:

```
> cat 03_exchange_rates/project.yml
```

```
commands:
  - command: "tap-exchangeratesapi"
    args: "--config config/tap-config.json"
    venv: "proc_01"
    installs:
      - "pip install tap-exchangeratesapi"
  - command: "python files/stats_collector.py"
    venv: "proc_01"
  - command: "target-csv"
    args: "--config config/target-config.json"
    venv: "proc_02"
    installs:
      - "pip install target-csv"
deploy:
  provider: "aws"
  platform: "fargate"
  envs:
    resource_group: "handoff-test"
    docker_image: "singer_exchange_rates_to_csv"
    task: "test-03-exchange-rates"


...which is the shell equivalent of

    tap-exchangeratesapi --config config/tap-config.json | python files/stats_collector.py | target-csv --config config/target-config.json



Before we can run this, we need to install tap-exchangeratesapi and target-csv.
The install instructions are listed under the `installs` keys in project.yml.

Notice the `venv` entry for each command. handoff can create a Python virtual
environment for each command to avoid conflicting dependencies among the
commands.
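
As a sketch of how the `venv` and `installs` entries fit together, the function below builds one shell line per install step, modeled on the `source proc_01/bin/activate && pip install wheel && pip install ...` line that handoff prints during installation. `build_install_commands` is a hypothetical helper, and it assumes each virtual environment has already been created (for example with `python -m venv proc_01`); handoff's actual logic may differ.

```python
def build_install_commands(commands):
    """Return the shell lines that would install each command's dependencies."""
    lines = []
    for entry in commands:
        venv = entry.get("venv")
        for install in entry.get("installs", []):
            if venv:
                # Activate the command's own virtual environment first.
                lines.append(
                    f"source {venv}/bin/activate && pip install wheel && {install}"
                )
            else:
                lines.append(install)
    return lines
```

Commands that share a `venv` name, like the first two entries in this project, would share one environment, while target-csv gets its own.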

To install everything, run this command:

```
> handoff -p 03_exchange_rates -w workspace_03 workspace install
```
```
INFO - 2020-08-06 03:35:14,158 - handoff.config - Reading configurations from 03_exchange_rates/project.yml
INFO - 2020-08-06 03:35:14,240 - botocore.credentials - Found credentials in shared credentials file: ~/.aws/credentials
INFO - 2020-08-06 03:35:14,524 - handoff.config - You have the access to AWS resources.
INFO - 2020-08-06 03:35:14,524 - handoff.config - Platform: aws
INFO - 2020-08-06 03:35:19,456 - handoff.config - Running /bin/bash -c "source proc_01/bin/activate && pip install wheel && pip install tap-exchangeratesapi"
Requirement already satisfied: wheel in ./proc_01/lib/python3.6/site-packages (0.34.2)
Processing /home/ubuntu/.cache/pip/wheels/1f/73/f9/xxxxxxxx0dba8423841c1404f319bb/tap_exchangeratesapi-0.1.1-cp36-none-any.whl
Processing /home/ubuntu/.cache/pip/wheels/6e/07/1b/xxxxxxxx6d9ce55c05f67a69127e25/singer_python-5.3.3-cp36-none-any.whl
Processing /home/ubuntu/.cache/pip/wheels/fc/d8/34/xxxxxxxx027b62dfcf922fdf8e396d/backoff-1.3.2-cp36-none-any.whl
Collecting requests==2.21.0
.
.
.
Collecting python-dateutil
Using cached python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
Collecting pytzdata
Using cached pytzdata-2020.1-py2.py3-none-any.whl (489 kB)
Collecting pytz
Using cached pytz-2020.1-py2.py3-none-any.whl (510 kB)
Collecting six>=1.5
Using cached six-1.15.0-py2.py3-none-any.whl (10 kB)
Installing collected packages: jsonschema, simplejson, pytz, tzlocal, six, python-dateutil, pytzdata, pendulum, singer-python, target-csv
Successfully installed jsonschema-2.6.0 pendulum-1.2.0 python-dateutil-2.8.1 pytz-2020.1 pytzdata-2020.1 simplejson-3.11.1 singer-python-2.1.4 six-1.15.0 target-csv-0.3.0 tzlocal-2.1
```

Now let's run the task. Try entering the command below:

```
> handoff -p 03_exchange_rates -w workspace_03 run local
```
```
INFO - 2020-08-06 03:35:29,258 - handoff.config - Reading configurations from 03_exchange_rates/project.yml
INFO - 2020-08-06 03:35:29,339 - botocore.credentials - Found credentials in shared credentials file: ~/.aws/credentials
INFO - 2020-08-06 03:35:29,626 - handoff.config - You have the access to AWS resources.
INFO - 2020-08-06 03:35:29,626 - handoff.config - Platform: aws
INFO - 2020-08-06 03:35:29,626 - handoff.config - Setting environment variables from config.
INFO - 2020-08-06 03:35:29,693 - handoff.config - Environment variable HO_BUCKET was set autoamtically as xxxxxxxxxxxx-handoff-test
INFO - 2020-08-06 03:35:29,693 - handoff.config - Writing configuration files in the workspace configuration directory workspace_03/config
INFO - 2020-08-06 03:35:29,694 - handoff.config - Copying files from the local project directory 03_exchange_rates
INFO - 2020-08-06 03:35:29,695 - handoff.config - Running run local in workspace_03 directory
INFO - 2020-08-06 03:35:29,695 - handoff.config - Job started at 2020-08-06 03:35:29.695732
.
.
.
INFO - 2020-08-06 03:35:33,964 - handoff.config - Job ended at 2020-08-06 03:35:33.964206
INFO - 2020-08-06 03:35:33,964 - handoff.config - Processed in 0:00:04.268474
```

This process should have created a CSV file in the artifacts directory:

```
exchange_rate-20200806T033530.csv
```

...which looks like:

```
CAD,HKD,ISK,PHP,DKK,HUF,CZK,GBP,RON,SEK,IDR,INR,BRL,RUB,HRK,JPY,THB,CHF,EUR,MYR,BGN,TRY,CNY,NOK,NZD,ZAR,USD,MXN,SGD,AUD,ILS,KRW,PLN,date
0.0127290837,0.0725398406,1.3197211155,0.4630976096,0.0618218792,2.9357569721,0.2215388446,0.007434429,0.0401958831,0.0863047809,135.1005146082,0.7041915671,0.050374336,0.6657569721,0.0625373506,1.0,0.29312749,0.0088188911,0.0083001328,0.0399311089,0.0162333997,0.0642571381,0.0655312085,0.0889467131,0.0142670983,0.158440405,0.0093592297,0.2132744024,0.0130336985,0.0134852258,0.032375498,11.244189907,0.0371372842,2020-07-10T00:00:00Z
0.0126573311,0.072330313,1.313014827,0.4612685338,0.061324547,2.9145799012,0.2195057661,0.007408402,0.0399036244,0.085529654,134.613509061,0.7019439868,0.049830313,0.6601894563,0.0620593081,1.0,0.2929324547,0.0088014827,0.0082372323,0.0397907743,0.0161103789,0.0641054366,0.0653286656,0.0878500824,0.0141894563,0.1562817133,0.0093319605,0.209931631,0.0129678748,0.013383855,0.0321466227,11.2139209226,0.0368682043,2020-07-13T00:00:00Z
```
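
A file in this layout can be loaded with Python's standard `csv` module. The sketch below assumes the layout shown above (one column per currency plus a `date` column); `read_rates` is a hypothetical helper for illustration. The JPY column is 1.0 in both sample rows, which suggests the rates are expressed relative to JPY, but that is an inference from the sample, not something the file states.

```python
import csv
import io


def read_rates(csv_text):
    """Parse CSV text into a list of {"date": str, "rates": {currency: float}}."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        date = record.pop("date")
        # Every remaining column is a currency code mapped to a numeric rate.
        rates = {currency: float(value) for currency, value in record.items()}
        rows.append({"date": date, "rates": rates})
    return rows
```

After reading the file contents into a string, `read_rates(text)[0]["rates"]["USD"]` would give the USD value from the first row.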


Now that we know how to run locally, we will gradually start thinking about how to deploy this in the cloud *serverlessly*.
We will learn how to save configurations to remote storage and fetch them back.
Before doing that, we will cover how to set up an AWS account and profile in the next section.
