See parent repository for more information on the dynamic-org-chart project

dynamic-org-chart-scripts

Some preprocessing scripts to prepare data for the Dynamic Org Chart project.

This repository runs a scheduled job that fetches the Government of Canada Employee Contact Information (GEDS) dataset, which is made available under the Open Government License - Canada. The data produced by the scripts in this repository are used by a REST API, which can be found [here](ADD LINK), and by an instance of Elasticsearch. These services then provide data to a front end that shows a searchable, interactive organizational chart, which can be found [here](ADD LINK).

TODO:

  1. Make the org chart write "children" keys instead of "_children" keys by default, so that the front end does not need to unhide and rehide nodes during d3 indexing.
  2. Create separate Elasticsearch indices for French/English fields (they will likely need to be searched differently).
  3. Check how the organization name column is compared with the organization path column; some typos/casing/spacing errors cause a failure to identify a search path.
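For the third item, normalizing both columns before comparing would absorb most casing and spacing discrepancies. A minimal sketch, assuming a simple lowercase-and-collapse-whitespace rule (the `normalize` helper is hypothetical and not part of the current scripts):

```python
import re

def normalize(name: str) -> str:
    """Normalize an organization name for comparison (hypothetical helper):
    strip surrounding whitespace, lowercase, and collapse internal runs of
    whitespace to a single space."""
    name = name.strip().lower()
    return re.sub(r"\s+", " ", name)

# Comparing an organization-name value against a segment of the
# organization-path column is then tolerant of casing/spacing typos.
print(normalize("  Strategic and  Service Policy Branch "))
```

Punctuation differences would still defeat this comparison and may need additional rules.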

Start-up instructions

  1. Pull the Elasticsearch Docker image and run Elasticsearch as a Docker container.

```shell
docker pull docker.elastic.co/elasticsearch/elasticsearch:6.5.1
docker run -p 9200:9200 -p 9300:9300 -e "http.cors.enabled=true" -e "http.cors.allow-origin=*" -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:6.5.1
```

  2. Clone the dynamic-org-chart-scripts repository to a folder on your computer.

```shell
git clone https://github.com/DSD-ESDC-EDSC/dynamic-org-chart-scripts
```

  3. In a new terminal, create and activate the virtual environment in the root of the dynamic-org-chart-scripts folder.

```shell
conda env create -f environment.yml
conda activate ./venv
```

  4. Start the main script by running the following command from within the virtual environment in the project root:

```shell
python start.py
```

Data

The data produced by the scripts in this repository can be broken into two types: flat and hierarchical. Flat data are written into the SQL tables described below.

SQL Tables

employees

| key | column name | type | description |
| --- | ----------- | ---- | ----------- |
| PK | employee_id | INTEGER | |
| | last_name | TEXT | |
| | first_name | TEXT | |
| | job_title_en | TEXT | |
| | job_title_fr | TEXT | |
| | phone_number | TEXT | |
| | email | TEXT | |
| | address_en | TEXT | |
| | address_fr | TEXT | |
| | province_en | TEXT | |
| | province_fr | TEXT | |
| | city_en | TEXT | |
| | city_fr | TEXT | |
| | postal_code | TEXT | |
| FK | org_id | INTEGER | |
| FK | dept_id | INTEGER | |

organizations

| key | column name | type | description |
| --- | ----------- | ---- | ----------- |
| PK | org_id | INTEGER | |
| | org_name_en | TEXT | |
| | org_name_fr | TEXT | |
| FK | dept_id | INTEGER | |
| | org_chart_path | TEXT | An array (serialized to string) describing the tree traversal required to arrive at the organization in the org chart. |

departments

| key | column name | type | description |
| --- | ----------- | ---- | ----------- |
| PK | dept_id | INTEGER | |
| | department_en | TEXT | |
| | department_fr | TEXT | |
| | org_chart_en | TEXT | A JSON object (serialized to string) describing the English org chart for the department. |
| | org_chart_fr | TEXT | A JSON object (serialized to string) describing the French org chart for the department. |
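The three tables above can be sketched as a relational schema. The snippet below uses the stdlib sqlite3 module purely for illustration; the actual scripts may create the tables differently (e.g. via SQLAlchemy, listed under Resources), so the exact constraints are assumptions based on the documented columns:

```python
import sqlite3

# In-memory database just to exercise the schema; the real scripts write
# to a persistent database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (
    dept_id INTEGER PRIMARY KEY,
    department_en TEXT,
    department_fr TEXT,
    org_chart_en TEXT,   -- serialized JSON org chart (English)
    org_chart_fr TEXT    -- serialized JSON org chart (French)
);
CREATE TABLE organizations (
    org_id INTEGER PRIMARY KEY,
    org_name_en TEXT,
    org_name_fr TEXT,
    dept_id INTEGER REFERENCES departments(dept_id),
    org_chart_path TEXT  -- serialized array of tree-traversal steps
);
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    last_name TEXT, first_name TEXT,
    job_title_en TEXT, job_title_fr TEXT,
    phone_number TEXT, email TEXT,
    address_en TEXT, address_fr TEXT,
    province_en TEXT, province_fr TEXT,
    city_en TEXT, city_fr TEXT,
    postal_code TEXT,
    org_id INTEGER REFERENCES organizations(org_id),
    dept_id INTEGER REFERENCES departments(dept_id)
);
""")
```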

Hierarchical Data

Using a column from the csv extracted from GEDS ("Organization Structure (EN/FR)"), it is possible to derive a hierarchical structure representing the organization of the Government of Canada, as described in GEDS. The format of the data is as follows:

```json
{
  "name": "Employment and Social Development Canada",
  "_children": [
    {
      "name": "Deputy Minister of Employment and Social Development Canada",
      "_children": [
        {
          "name": "Strategic and Service Policy Branch",
          "_children": [
            {
              ...
            }
          ]
        }
      ]
    }
  ]
}
```

Each node has a name and a _children property. name is the name of the organizational unit being described, and _children is an array of nodes that are direct descendants of the current node. Because the structure is recursive, it can be searched for specific business units. Since a linear search through the tree is slow, the path to each organization (starting at the root of its tree) is computed offline when the database tables are created and stored in the org_chart_path column of the organizations table. This way, as soon as an organization name or person is searched for (e.g. using Elasticsearch), the path to that organization or person is immediately available, without requiring a linear search through the tree at request time.
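The offline path computation described above amounts to a depth-first search that records the node names from the root down to each organization. A minimal sketch of the idea (the function name and return format are assumptions, not the project's actual code):

```python
def find_path(node, target_name, path=None):
    """Depth-first search for target_name in a {"name", "_children"} tree.
    Returns the list of node names from the root to the match, or None."""
    path = (path or []) + [node["name"]]
    if node["name"] == target_name:
        return path
    for child in node.get("_children", []):
        found = find_path(child, target_name, path)
        if found:
            return found
    return None

# Toy tree mirroring the JSON structure shown above.
tree = {
    "name": "Employment and Social Development Canada",
    "_children": [
        {"name": "Deputy Minister of Employment and Social Development Canada",
         "_children": [
             {"name": "Strategic and Service Policy Branch", "_children": []}
         ]}
    ],
}
print(find_path(tree, "Strategic and Service Policy Branch"))
```

Serializing each such path to a string (e.g. with json.dumps) gives the kind of value stored in org_chart_path.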

Folder Organization

data

For illustrative purposes, a small csv with fake data is included to simulate what raw data on employees of an organization might look like.

schedule

config

Configurations are stored here using *.cfg files, and parsed with Python's configparser library.
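For illustration, reading such a file with configparser might look like the following; the section and option names here are invented for the example and are not the project's actual keys:

```python
import configparser

# Stand-in for the contents of a *.cfg file in the config folder.
cfg_text = """
[elasticsearch]
host = localhost
port = 9200
"""

config = configparser.ConfigParser()
config.read_string(cfg_text)  # config.read("path/to/file.cfg") for real files
print(config["elasticsearch"]["host"], config.getint("elasticsearch", "port"))
```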

main

The code in this folder runs a scheduled job using Python's apscheduler library. In practice, this job will likely be run every 24 hours to pick up the latest posting of GEDS data.

Every function is called from main. The call graphs below outline which functions are called, and in which order, in development vs. production. The main difference between the two is that the production graph will not fetch the GEDS data from the URL every time it is run; instead, it stores a local copy as a csv file and uses that, to avoid unnecessarily downloading the same file many times while testing during development.

Dev Module Dependency Graph (call graph image)

Production Module Dependency Graph (call graph image)

test

Elasticsearch

TODO: need to test that the org chart paths generated agree with the paths to nodes in the org chart tree structure.

This repository assumes there exists an instance of Elasticsearch to which it can upload data via Elasticsearch's bulk API. It makes use of the Python client for Elasticsearch.

Note: Elasticsearch 6.0.0+ removed support for multiple mapping types. As an alternative to multiple mapping types, this repository uses a single index for each type of document. In this case, the two types are employee and organization. As such, there are two Elasticsearch indices named employee and organization.
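The bulk upload amounts to shaping each record into an action dict for the client's bulk helper. The sketch below only builds the actions, so it runs without a cluster; the field names and the surrounding setup are assumptions, and in the real scripts the result would be passed to elasticsearch.helpers.bulk along with a client instance:

```python
def to_bulk_actions(docs, index):
    """Yield action dicts in the shape expected by elasticsearch.helpers.bulk.
    With Elasticsearch 6.x (no multiple mapping types), each index holds one
    document type, so "_doc" is used as the type for every action."""
    for doc in docs:
        yield {"_index": index, "_type": "_doc", "_source": doc}

# Hypothetical employee records destined for the "employee" index.
employees = [{"first_name": "Jane", "last_name": "Doe"}]
actions = list(to_bulk_actions(employees, "employee"))
# Real usage (requires a running cluster):
#   from elasticsearch import Elasticsearch
#   from elasticsearch.helpers import bulk
#   bulk(Elasticsearch(), actions)
print(actions[0]["_index"])
```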

Set up Elasticsearch with Docker

If you have Docker installed on your system, you can get up and running with Elasticsearch in only a few steps. These steps are outlined briefly below; see here for more information.

  1. Pull the Elasticsearch image from the Elastic Docker registry.

```shell
docker pull docker.elastic.co/elasticsearch/elasticsearch:6.5.1
```

  2. Start a single-node Elasticsearch cluster on your local machine (the host defaults to localhost). Note that the arguments -e "http.cors.enabled=true" -e "http.cors.allow-origin=*" should only be used in development.

```shell
docker run -p 9200:9200 -p 9300:9300 -e "http.cors.enabled=true" -e "http.cors.allow-origin=*" -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:6.5.1
```

Note: the JVM's minimum/maximum heap size is controlled by the -Xms and -Xmx flags; when running under Docker, these can be set through an environment variable, e.g. -e ES_JAVA_OPTS="-Xms512m -Xmx512m".

Postman Collection

TODO: create this and document it

Resources

  1. SQLAlchemy guide, chapter 1
  2. SQLAlchemy schema documentation
  3. pandas with SQLAlchemy
  4. Python Elasticsearch client tutorial
  5. Python Elasticsearch client documentation
  6. Python Elasticsearch client bulk helpers
  7. Stack Overflow post on using Elasticsearch with human names

License

Unless otherwise noted, program source code of this project is covered under Crown Copyright, Government of Canada, and is distributed under the MIT License.

The Canada wordmark and related graphics associated with this distribution are protected under trademark law and copyright law. No permission is granted to use them outside the parameters of the Government of Canada's corporate identity program. For more information, see Federal identity requirements.

Attribution

This project would not be possible without the availability and use of open source software. Acknowledgement and attribution to the open source tools used, along with their corresponding open licenses (where one was found), can be found in the ATTRIBUTION.md file in this repository. Users are advised to consult original sources for official information, especially if they plan on re-distributing all or parts of these code artifacts.

How to Contribute

Instructions for how to contribute can be found in the CONTRIBUTING.md file.
