See the parent repository for more information on the dynamic-org-chart project.

Preprocessing scripts that prepare data for the Dynamic Org Chart project.

This repository runs a scheduled job that fetches the Government of Canada Employee Contact Information (GEDS) dataset, made available under the Open Government Licence - Canada. The data produced by the scripts in this repository are used by a REST API, which can be found [here](ADD LINK), and by an instance of Elasticsearch. These services then provide data to a front end that displays a searchable, interactive organizational chart, which can be found [here](ADD LINK).
TODO:
- Make the org chart write `children` keys instead of `_children` keys by default, so that the front end does not need to unhide and rehide nodes during d3 indexing.
- Create separate Elasticsearch indices for French/English fields (they will likely need to be searched differently).
- Check how the organization name column is compared with the organization path column; some typo/casing/spacing errors cause failures to identify a search path.
- Pull the Elasticsearch Docker image and run Elasticsearch as a Docker container.

```
docker pull docker.elastic.co/elasticsearch/elasticsearch:6.5.1
docker run -p 9200:9200 -p 9300:9300 -e "http.cors.enabled=true" -e "http.cors.allow-origin=*" -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:6.5.1
```

- Clone the `dynamic-org-chart-scripts` repository to a folder on your computer.

```
git clone https://github.com/DSD-ESDC-EDSC/dynamic-org-chart-scripts
```

- In a new terminal, create and activate the virtual environment in the root of the `dynamic-org-chart-scripts` folder.

```
conda env create -f environment.yml
conda activate ./venv
```

- Start the main script by running the following command from within the virtual environment in the project root:

```
python start.py
```
The data produced by the scripts in this repository can be broken into two types: flat and hierarchical. Flat data are written into the SQL tables described below.
employees
| key | column name | type | description |
|---|---|---|---|
| PK | employee_id | INTEGER | |
| | last_name | TEXT | |
| | first_name | TEXT | |
| | job_title_en | TEXT | |
| | job_title_fr | TEXT | |
| | phone_number | TEXT | |
| | email | TEXT | |
| | address_en | TEXT | |
| | address_fr | TEXT | |
| | province_en | TEXT | |
| | province_fr | TEXT | |
| | city_en | TEXT | |
| | city_fr | TEXT | |
| | postal_code | TEXT | |
| FK | org_id | INTEGER | |
| FK | dept_id | INTEGER | |
organizations
| key | column name | type | description |
|---|---|---|---|
| PK | org_id | INTEGER | |
| | org_name_en | TEXT | |
| | org_name_fr | TEXT | |
| FK | dept_id | INTEGER | |
| | org_chart_path | TEXT | An array (serialized to a string) describing the tree traversal required to arrive at the organization in the org chart. |
departments
| key | column name | type | description |
|---|---|---|---|
| PK | dept_id | INTEGER | |
| | department_en | TEXT | |
| | department_fr | TEXT | |
| | org_chart_en | TEXT | A JSON object (serialized to a string) describing the English org chart for the department. |
| | org_chart_fr | TEXT | A JSON object (serialized to a string) describing the French org chart for the department. |
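For illustration, the schema above can be sketched as SQL DDL, run here against an in-memory SQLite database. This is a sketch of the table shapes only; the repo builds its tables with SQLAlchemy, and the actual types and constraints may differ.

```python
# Illustrative DDL mirroring the tables described above; run against an
# in-memory SQLite database so the sketch is self-contained.
import sqlite3

DDL = """
CREATE TABLE departments (
    dept_id INTEGER PRIMARY KEY,
    department_en TEXT,
    department_fr TEXT,
    org_chart_en TEXT,  -- serialized JSON org chart (English)
    org_chart_fr TEXT   -- serialized JSON org chart (French)
);
CREATE TABLE organizations (
    org_id INTEGER PRIMARY KEY,
    org_name_en TEXT,
    org_name_fr TEXT,
    dept_id INTEGER REFERENCES departments(dept_id),
    org_chart_path TEXT  -- serialized array describing the tree traversal
);
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    last_name TEXT,
    first_name TEXT,
    job_title_en TEXT,
    job_title_fr TEXT,
    phone_number TEXT,
    org_id INTEGER REFERENCES organizations(org_id),
    dept_id INTEGER REFERENCES departments(dept_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
tables = {row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}
print(sorted(tables))
```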
Using a column from the CSV extracted from GEDS ("Organization Structure (EN/FR)"), it is possible to extract a hierarchical structure describing the organization of the Government of Canada, as captured in GEDS. The format of the data is as follows:
```json
{
  "name": "Employment and Social Development Canada",
  "_children": [
    {
      "name": "Deputy Minister of Employment and Social Development Canada",
      "_children": [
        {
          "name": "Strategic and Service Policy Branch",
          "_children": [
            {
              ...
            }
          ]
        }
      ]
    }
  ]
}
```
Each node has a `name` and a `_children` property. `name` is, of course, the name of the organizational unit being described, and `_children` is an array of nodes that are direct descendants of the current node. The recursive nature of this structure makes it possible to search for specific business units, but a linear search through the tree is slow, so the path to each organization (starting at the root of its tree) is computed offline during the creation of the database tables and stored in the `org_chart_path` column of the `organizations` table. This way, as soon as an organization name or person is matched in a search (e.g. using Elasticsearch), the path to that organization or person is instantly available, without requiring a linear search through the tree at request time.
For illustrative purposes, a small csv with fake data is included to simulate what raw data on employees of an organization might look like.
Configurations are stored here using `*.cfg` files and parsed with Python's `configparser` library.
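A minimal sketch of reading such a file with `configparser` follows; the section and option names are hypothetical, not the repo's actual configuration keys.

```python
# Parse a small .cfg snippet with the standard-library configparser.
# The [elasticsearch] section and its options are illustrative only.
import configparser

cfg_text = """
[elasticsearch]
host = localhost
port = 9200
"""

config = configparser.ConfigParser()
config.read_string(cfg_text)  # config.read("settings.cfg") for a file
print(config["elasticsearch"]["host"])
print(config.getint("elasticsearch", "port"))
```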
The code below runs a scheduled job using Python's `apscheduler` library. In practice, this job will likely be run every 24 hours to pick up the latest posting of GEDS data.
Every function is called from `main`. The call graph below outlines which functions are called, and in which order, in development vs. production. The main difference between the two is that the development graph will not fetch the GEDS data from the URL every time it is run; instead, it stores a local copy as a CSV file and reuses it, to avoid unnecessarily downloading the same file many times while testing.
Production Module Dependency Graph
TODO: need to test that the org chart paths generated agree with the paths to nodes in the org chart tree structure.
This repository assumes there exists an instance of Elasticsearch to which it can upload data via Elasticsearch's bulk API. This repository makes use of the Python client for Elasticsearch.

Note: Elasticsearch 6.0.0+ removed support for multiple mapping types. As an alternative, this repository uses a separate index for each type of document. In this case, the two types are `employee` and `organization`, so there are two Elasticsearch indices, named `employee` and `organization`.
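As a sketch, documents for the two indices might be shaped for the bulk helper as below. The field names and the wrapper function are illustrative assumptions; the actual upload would pass these actions to `elasticsearch.helpers.bulk` against a live cluster, which this self-contained snippet does not do.

```python
# Build action dicts in the shape consumed by the Python client's bulk
# helper: one action per document, routed to a per-type index.
def to_bulk_actions(index, docs, id_field):
    """Wrap plain dicts in the bulk-action format (illustrative helper)."""
    return [
        {"_index": index, "_id": doc[id_field], "_source": doc}
        for doc in docs
    ]

employees = [
    {"employee_id": 1, "last_name": "Doe", "first_name": "Jane"},
    {"employee_id": 2, "last_name": "Roe", "first_name": "Rick"},
]
actions = to_bulk_actions("employee", employees, "employee_id")
print(actions[0]["_index"])
# With a live cluster: from elasticsearch.helpers import bulk; bulk(es, actions)
```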
If you have Docker installed on your system, you can get up and running with Elasticsearch in only a few steps. These steps are outlined briefly below, but see here for more information.

- Pull the Elasticsearch image from the Elastic Docker registry.

```
docker pull docker.elastic.co/elasticsearch/elasticsearch:6.5.1
```

- Start a single-node Elasticsearch cluster on your local machine (the host defaults to `localhost`). Note that the arguments `-e "http.cors.enabled=true" -e "http.cors.allow-origin=*"` should only be used in development.

```
docker run -p 9200:9200 -p 9300:9300 -e "http.cors.enabled=true" -e "http.cors.allow-origin=*" -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:6.5.1
```
Note: the JVM's `-Xms` and `-Xmx` flags control the minimum and maximum heap size for Elasticsearch.
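For example, both flags can be passed to the container through the `ES_JAVA_OPTS` environment variable; the 512 MB figure below is an arbitrary illustration, not a recommended value.

```shell
# Assumption: cap the JVM heap at 512 MB via ES_JAVA_OPTS (illustrative size)
docker run -p 9200:9200 -e "discovery.type=single-node" \
  -e ES_JAVA_OPTS="-Xms512m -Xmx512m" \
  docker.elastic.co/elasticsearch/elasticsearch:6.5.1
```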
TODO: create this and document it
- SQLAlchemy guide, ch. 1
- SQLAlchemy schema
- pandas with SQLAlchemy
- Python ES client tutorial
- Python ES client documentation
- Python ES client bulk helpers
- SO post on using Elasticsearch with human names
Unless otherwise noted, program source code of this project is covered under Crown Copyright, Government of Canada, and is distributed under the MIT License.
The Canada wordmark and related graphics associated with this distribution are protected under trademark law and copyright law. No permission is granted to use them outside the parameters of the Government of Canada's corporate identity program. For more information, see Federal identity requirements.
This project would not be possible without the availability and use of open source software. Acknowledgement and attribution to the open source tools used, along with their corresponding open licenses (where one was found), can be found in the ATTRIBUTION.md file in this repository. Users are advised to consult original sources for official information, especially if they plan on re-distributing all or parts of these code artifacts.
Instructions for how to contribute can be found in the CONTRIBUTING.md file.