Skip to content

Data Conjurer Reference

Tao Dong edited this page May 22, 2024 · 3 revisions

Concepts

The data conjurer tool, designed for efficient data generating, separates the input into two components: the reusable schema defining database structures, and the adaptable plan specifying desired output. The following table is a quick reference between the terminology used between the tool and MySQL database

Terminology MySQL Description
schema schema Map to a MySQL Database Schema
entity table Map to a MySQL Table
property column Map to a MySQL Column
index index Map to a database index, such as primary key, unique index
reference foreign key Map to a database foreign key
entity entry data row Map to a predefined data row

Define Schema

Schema

A database schema is defined through a YAML file. It contains two required properties:

  1. name - The name of the schema; it doesn't have to match the actual database schema name, but it is recommended to do so.
  2. entities - A set of entity definitions. At least one entity is required. Entities with the same name are considered duplicates, which will cause an error.

The following is a sample schema with name "text" and an entity "country"

name: test
entities:
  - name: country
    properties:
      - name: "id"
        type: "sequence"
        index:
          id: 0

This schema definition is equivalent to the a MySQL schema "test" with the following structure

CREATE TABLE `country` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
);

Entity

Entity presents a definition of a database table. It contains two properties: name and properties. Both properties are required for a valid definition.

  1. name - The name of the entity; it doesn't have to match the actual database table name, but it is recommended to do so.
  2. Properties - A set of property definitions. Each property can be mapped to a database column. Following is a sample entity definition
  - name: country
    properties:
      - name: "id"
        type: "sequence"
        index:
          id: 0
      - name: "name"
        type: "text"
        constraints:
          - type: size
            min: 0
            max: 20
      - name: "code"
        type: "text"
        constraints:
          - type: length
            max: 2

The above entity definition can be used to generate data for the following MySQL table

CREATE TABLE `country` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(20) NOT NULL,
  `code` char(2) NOT NULL,
  PRIMARY KEY (`id`)
)

Property

Property is an abstraction of a database column. It has four properties:

  1. name - A (column) name to identify the property
  2. type - A predefined data type for generation
  3. index - A unique constraint enforced on one or multiple properties under same entity, can be used to define database primary key or unique index
  4. constraints - A list of Constraint definition to provide rules for the data generation
  5. reference - The way to identify a value is provided by another entity. It's how a database foreign key is defined.

Type

The following five property types can be used for this tool

Type Description
sequence An auto incremental number, by default starting at 1.
text Any text value
number Any decimal number value, such as 100, -54 or 99.99
date A date value, the output format is "yyyy-MM-dd"
datetime A date time value, the output format is "yyyy-MM-dd HH:mm:ss"
time A time value, the output format is "HH:mm:ss"

Index

An index definition contains three parts: id, type and qualifier.

  1. id - id should be a non-negative integer. The property is used as primary key when its value is 0 which will generate sequence starting with 1 when the property type is sequence
  2. type - An integer number, allowed value are 0(regular index), 1(set index), 2 (non circular index), 3 (ordered non circular index) and 4 (distinct element index).
  3. qualifier - An integer. When applying to non circular index or ordered non circular index, the allowed values are 1 and 2. When applying to distinct element value index, the allowed value is 1.

Regular Index

A regular index only enforce unique combination.

Sample regular index

properties:
  - name: "product_id"
    type: "sequence"
    index:
      id: 1
  - name: "order_id"
    type: "sequence"
    index:
      id: 1

The above definition allows the following combination of product_id and order_id to be generated

[
  {"product_id": 1, "order_id": 1},
  {"product_id": 2, "order_id": 1},
  {"product_id": 1, "order_id": 2},
  {"product_id": 2, "order_id": 3},
  {"product_id": 3, "order_id": 2}
]

Set Index

A set index enforces the unique combination regardless of element order. Values of [1, 2] and [2, 1] are treated as the same value.

Sample set index

properties:
  - name: "product_id"
    type: "sequence"
    index:
      id: 1
      type: 1
  - name: "order_id"
    type: "sequence"
    index:
      id: 1
      type: 1

With above index definition, same five entries generated in the regular index example will be reduced to the following three

[
  {"product_id": 1, "order_id": 1},
  {"product_id": 2, "order_id": 1},
  {"product_id": 2, "order_id": 3}
]

Non Circular Index

A non circular index treat each index entry as a branch of a graph. Each branch has a direction from parent to child, a qualifier value 1 identifies a parent while 2 identifies a child. A newly generated index is accepted only the following two conditions are met:

  1. The new branch won't create a circle with existing values, for example [1, 1] creates a circle to itself, [1, 2] [2, 1] creates another circle.
  2. The new branch won't create a multiple parents for the same child. For example, [{"parent_id": 1, "child_id": 3}, {"parent_id": 2, "child_id": 3}] creates a multiple parents situation where child_id 3 has two parent_id 1 and 2.

A common usage of non circular index is to use with linked reference to create a tree structure, such as FB type comments.

Sample non circular index

properties:
  - name: "product_id"
    type: "sequence"
    index:
      id: 1
      type: 2
      qualifier: 1
  - name: "order_id"
    type: "sequence"
    index:
      id: 1
      type: 2
      qualifier: 2

In this case, the five generated index entries will be reduced to the following

[
  {"product_id": 2, "order_id": 1},
  {"product_id": 2, "order_id": 3}
]

Ordered Non Circular Index

An ordered non circular index is a non circular index with one extra constraint which enforces the property value of any parent id is less than the value of child id. Please make sure the properties values of marked with parent id and child id are comparable before apply it.

Example

When applying the following code, the value of product_id will be less than order_id.

properties:
  - name: "product_id"
    type: "sequence"
    index:
      id: 1
      type: 3
      qualifier: 1
  - name: "order_id"
    type: "sequence"
    index:
      id: 1
      type: 3
      qualifier: 2

Sample outputs

[
  {"product_id": 2, "order_id": 3}
]

Distinct Element Index

Distinct element index does not allow properties with same value to be created. To add a property to the check, qualifier is set to 1. If no properties with qualifier are found, distinct element index works same as regular index.

⚠️ The tool assumes that properties marked by qualifier have the same data type. To simplify the data type check, the string values of desired property are used to guarantee the unique.

Example

The following code will generate bi-direction links between from and to except no same value of from and to is allowed.

properties:
  - name: "from"
    type: "sequence"
    index:
      id: 2
      type: 4
      qualifier: 1
  - name: "to"
    type: "sequence"
    index:
      id: 2
      type: 4
      qualifier: 1

Sample outputs

[
  {"from": 2, "to": 3},
  {"from": 3, "to": 2}
]

Constraints

Constraints define rules when generating data, e.g. maximum or minimum length of a string. A constraint can be defined in schema or data plan. It's recommended to define database schema related constrains within schema and all other purpose constraints with data plan. The following table lists all supported constraints and their usages:

Name Compatible Type Description
chain number, datetime, date A Markov chain on records generated in sequence
category text A pre-formatted text, such as email
char_group text A predefined character group
correlation number, datetime, date, time An expression or function
duration date, datetime A date time range
interval sequence An interval of a sequence
length text A fixed length text value
precision number A decimal places of a number
range number A range for integral part of a number
size text A variant length text value

⚠️ Constraints applied to a non-compatible type are ignored

Chain

It enforces a Markov chain to all the targeted property value based on the record generating sequence. Chain is defined with four properties:

  1. type - the value needs to be "chain"
  2. seed - a positive number which defines the base value difference between two adjacent values.
  3. direction - how the value is changed. When set to 0, the next element can either increase or decrease the previous value, when set to 1, value always increases, and when set to -1, value always decreases
  4. style - defines how the seed is used to generate next number: same value of seed is applied when set to 0, a random number between 0 and seed is applied when set to 1, and a Gaussian random with μ = seed for value 2.

⚠️ When using Gaussian random, the values used will be tween 0 and 2 * seed to avoid possible value overflow.

Example

The following code generates chain of values for date beginning at 12/05/2023 00:00:00 with 8 seconds incremental value.

- name: "date"
  type: "datetime"
  constraints:
    - type: "duration"
      start:
        year: 2023
        month: 12
        day: 5
    - type: "chain"
      seed: 8000
      direction: 1
      style: 0

Category

Category is a constraint that a predefined data provider is used to generate certain desired text value, such as name or address. The tool provides the following categories

Category Description Qualifiers
Name A person name firstname, lastname
Address An address street, city, country, state, zip
Email An email
Id Id formartedℹ️

⚠️ Category is an exclusive constraint! When a category is applied, all the other constraints defined for the same property will be ignored.

ℹ️ Before version 1.1, If a qualifier is used for Id, the tool will process the qualifier and replace any ? sign with a random letter and any # sign with a random number. For example id-??.### may result as id-fd.872. Otherwise an UUID is generated. Starting from version 1.1, id format is defined using format constraint.

A category definition includes four properties

Properties Description Value
type constraint type Fixed value:"category"
name name of the provider name, address or email
qualifier It used to extract part of the output. For example, if name generated "Mark Twain", only "Mark" will be used when qualifier is set to firstname Please refer to previous table for allowed qualifies of each category
locale Desired local String value using BCP 47 format. When generating country name, the locale needs to have a country part after the language such as "en-US", "ja-JP"
compoundId integer type compound id. default value is 0 If an entity has two properties with same category name and compound id, the same value will be applied to both properties. Otherwise a different values will be applied

Examples

  1. The following code generates email values for "username"
- name: "username"
  constraints:
    - type: "category"
      name: "email"
  1. The following code generates Chinese character first names for "firstname"
- name: "first_name"
  constraints:
    - type: "category"
      name: "name"
      qualifier: "firstname"
      locale: "zh-CN"
  1. The following code generates US zip codes for "zip"
- name: "zip"
  constraints:
    - type: "category"
      name: "address"
      qualifier: "zip"
      locale: "en-US"
  1. The following code generates an UUID for id property and a formatted id for refId property

Before Version 1.1

properties:
  - name: "id"
    type: "text"
    constraints:
      - type: "category"
        name: "id"
  - name: "refId"
    type: "text"
    constraints:
      - type: "category"
        name: "id"
        qualifier: "ref-##?##"
        compoundId: 1

After Version 1.1

properties:
  - name: "id"
    type: "text"
    constraints:
      - type: "category"
        name: "id"
  - name: "refId"
    type: "text"
    constraints:
      - type: "category"
        name: "id"
        compoundId: 1
      - type: "format"
        format: "ref-##?##"        

Character Group

By default, texts generated by the tool are using ASCII alpha numerals characters. By applying a character group (char_group) constraint allows text generated using different character sets. The tool supports the following character groups:

Character Group Description
arabic_numerals It represents a number between 0 and 9.
ascii_alpha_numerals It represents a letter between a and Z or a number between 0 and 9.
ascii_letters It represents a letter between a and Z.
ascii_lowercase_letters It represents a letter between a and z.
ascii_uppercase_letters It represents a letter between A and Z.
digits It presents any digit character in unicode
letters It presents any letter character in unicode
ea_hiragana It presents Japanese Hiragana characters
ea_katakana It presents Japanese Katakana characters
ea_cjk_a It presents east asian CJK group A characters

⚠️ If a Character Group of unicode is used, the text length is calculated differently from acii characters.

Character group includes two properties type and groups. Type value has to be "char_group" and the groups is an array with any combination of supported character group. For example, the following code generates texts combining Hiragana, Katakana and numbers for "comment"

- name: "comment"
  constraints:
    - type: "char_group"
      groups:
        - "ea_hiragana"
        - "ea_katakana"
        - "arabic_numerals"

Correlation

Correlation instructs the tool to generate value by applying rules to existing values of other properties of same entity. Correlation works for number, date and datetime type properties. A correlation has three properties:

  1. type - the value needs to be "correlation"
  2. properties - A set of properties used in calculation
  3. formula - A formula for calculation

⚠️ When using correlation, the property name can't contain mathematical operators such as dash, you need to use alias feature if your column name includes them.

The expression evaluator of project EvalEx is used to process formulas, you may need to refer to its document for supported functions. For date or datetime, we provide two extra functions:

Function Description Example
TIME_AFTER Generate a date/datetime after a given value but before 10 years from now TIME_AFTER(p1)
PAST_TIME_AFTER Generate a date/datetime after a given value but before current time PAST_TIME_AFTER(p1)

Examples

  1. The following code applies the sum of "p1" and "p2" as the value of property "p3"
- name: "p3"
  constraints:
    - type: "correlation"
      properties: ["p1", "p2"]
      formula: "p1 + p2"
  1. The following code generate a past date time later than the value of "user_created"
- name: "created"
  constraints:
    - type: "correlation"
      properties: ["user_created"]
      formula: "PAST_TIME_AFTER(user_created)"

Duration

Duration defines a date range where the generated date or date time needs to be within. Duration has three properties:

  1. type - the value should be "duration"
  2. start - the earliest time mark.
  3. end - the latest time mark.

A time mark consists of six properties: year, month (1 - 12), day, hour (0 - 23), minute and second, those are used to define the time boundary with the most accuracy at second level.

Examples

  1. The following code generates a date time between 03/01/2023 00:00:00 and 11/10/2023 14:05:30
- name: "created"
  type: "datetime"
  constraints:
    - type: duration
      start:
        year: 2023
        month: 3
        day: 1
      end:
        year: 2023
        month: 11
        day: 10
        hour: 14
        minute: 5
        second: 30
  1. The following code generates a date between 04/22/2023 and today
- name: "created"
  type: "date"
  constraints:
    - type: duration
      start:
        year: 2023
        month: 4
        day: 22

Interval

An interval defines the base and leap when generate a sequence. It has three properties:

  1. type - value is "interval"
  2. base - the starting integer value of the sequence, default is 1
  3. leap - integer to define the value difference between two sequence number. It's required.

The following code generates a sequence starts with 3 and increasing 2 each time. The output is 3, 5, 7, 9 ...

- name: "id"
type: "sequence"
constraints:
  - type: "interval"
    base: 3
    leap: 2

Length

Length instructs tool to generate fixed length text. It has two properties:

  1. type - value is "length"
  2. max - the character size needed

The following code generates a two letter text for "code"

- name: "code"
  type: "text"
  constraints:
    - type: length
      max: 2

Precision

Precision is used to define the decimal places of a generated number. A number type property without precision constraint, only integer will be generated. It has two properties:

  1. type - value should be "precision"
  2. max - maximum decimal places. Its value is an integer between 0 and 10

The following code generates a number of two digits for "price"

- name: "price"
  type: "number"
  constraints:
    - type: "precision"
      max: 2

Range

Range constraint is to define a number value range. It has three properties which one of "min" and "max" could be optional.

  1. type - its value should be "range"
  2. min - an inclusive minimum value
  3. max - an exclusive maximum value

the following code generates an integer between 1 and 998 for "price"

- name: "price"
  type: "number"
  constraints:
    - type: "range"
        min: 1
        max: 999

Size

Constraint size is used to generate text within given length range. It contains three properties, in those one of "min" and "max" may be optional.

  1. type - its value is "size"
  2. min - An inclusive minimum length
  3. max - An exclusive maximum length

The following code generates a random text of size between 2 and 19 for "comment"

      - name: "comment"
        type: "text"
        constraints:
          - type: size
            min: 2
            max: 20

Reference

Instead of generating values, a reference applies an existing property value from another entity. It can be used to define a foreign key reference between two tables. Reference has three properties:

  1. entity - required. It defines the target entity
  2. property - required. It defines the target property
  3. linked - optional. When reference multiple properties of the same entity, it allows values used are from the same record.

Examples

  1. The following code creates a foreign key to product.id for product_id property
- name: "product_id"
  type: "sequence"
  reference:
    entity: "product"
    property: "id"
  1. The following code populates values of user_id and user_created using same record under usr_user
- name: "order_detail"
  properties:
    - name: "user_id"
      type: "sequence"
      reference:
        entity: "usr_user"
        property: "id"
        linked: "id"
    - name: "user_created"
      type: "datetime"
      reference:
        entity: "usr_user"
        property: "created"
        linked: "id"

Define Data Generation Plan

A data generation plan instructs the tool that how the data is generated against targeted schema. Defining using a YAML format, it contains two parts:

  1. plan - A required property which defines how many records are generated for each entity and any extra data requirements
  2. output - An optional property to tailor the output data, such as hide columns or alias

Sample data plan yaml layout

plan:
	...
output:
	...

Data Plan

Data plan has three first level elements

  1. name - name of the data plan, it can be any text
  2. schema - the targeted schema, its value has to match the name defined in the schema yaml
  3. data - a list of entity data definitions. Each element in this list matches one entity in schema. If there is no match element for a defined entity, no data is generated for that entity.

Example

  1. The following code defines a data plan named as hw-plan which generates data against schema test. It requests 10 records for country table and 100 records for city table.
plan:
  name: "hw-plan"
  schema: "test"
  data:
    - entity: "country"
      count: 10
    - entity: "city"
      count: 100

Entity Data

An entity data element provides guidance to the tool on how to generate data for each entity. It has the following properties:

  1. entity - The name of targeted entity
  2. dataId - An optional but unique non-negative number, default is 0. It's used when multiple batches of data with different rules are needed for the same entity. For example, if there is a user table with a country column, we can use two entity data definitions, one entity data generates 100 US user and the other generates 50 UK user.
  3. count - A positive number to define how many records are needed
  4. properties - A list of property input control element which provides extra details how the data is required for each property
  5. entries - It defines user provided entries used in data generation

Examples

  1. The following code create two batch of record for the city, the first one contains 10 records and the second contains 5 records
- entity: "city"
  count: 10
- entity: "city"
  dataId: 1
  count: 5
Property Input Control

Property input control is used to list extra specifications of property data generation. It has five properties:

  1. name - target property name. It's required.
  2. values - a list of property value distribution, for example, 30 percent of the value generated would be null
  3. defaultValue - a String presentation of a default value for target property
  4. referenceStrategy - if the target property is a reference property. It defines how the referenced values are applied. The available strategies are either random or loop. If it's not set, random strategy is used.
  5. constraints - a list of extra constraints to be applied to this property. For details of defining constraints please refer to Constraints section.
Default Value

A predefined default value will be applied with the following rules:

  1. If the target property has a given value in data.entries, the default value is ignored when generating those entries.
  2. If values are defined, the default value is ignored when generating all the records within the percentage defined in values.
  3. The default value will be applied to all records not covered by rule 1 and 2.
Values

Values are a list of percentage distribution will provide granular control to applying predefined values. It has two properties

  1. values - a set of String presentation of the desired values. A number value needs to be enclosed within double quotes. If there are more than one value given, the values are randomly applied.
  2. weight - a double value to define the record percentage to be applied. Each weight value should in the range of (0, 1] and the total weight of all values shouldn't be more than 1. For example, if the weight is 0.1 and total records generated are 100, there are about 10 records will apply the given value. However the weight is not a precisely guaranteed.

Examples

  1. The following code assigns the value "CLOSED" to 10% of the generated records, "RETURNED" to 15% of the records, and "PROCESSING" to the remaining records in the "status" column.
properties:
  - name: "status"
    values:
      - values:
          - "CLOSED"
        weight: 0.1
      - values:
          - "RETURNED"
        weight: 0.15
    defaultValue: "PROCESSING"
  1. The following code assigns the value "CLOSED" or "RETURNED" to 25% of the records and "PROCESSING" to the remaining records in the "status" column.
properties:
  - name: "status"
    values:
      - values:
          - "CLOSED"
          - "RETURNED"
        weight: 0.25
    defaultValue: "PROCESSING"
Entries

Entries allows user to apply predefined records to the generation process. All the records defined under entries will be applied orderly prior to any data generation. For example, if 11 records are defined under entries, and the desired record number is 10, the first 10 records will be used. However if the desired number is 100, the tool will apply all 11 records first then generate the remaining 89 records.

Entries contains the following two properties:

  1. properties - a list of property names which values will be assigned. The tool will generate values for properties not listed when applying the records.
  2. values - a list of records to be used. Each records defined should have the same number of value defined in properties.

Example The following code uses the three records defined under entries first then generates 7 remaining for product table.

- entity: "product"
  count: 10
  entries:
    properties:
      - "name"
      - "desc"
      - "sku"
      - "category_id"
    values:
      - ["tv_1", "desc_1", "elect00001", "1"]
      - ["tv_2", "desc_2", "elect00002", "1"]
      - ["hp_win", "a computer", "comp000001", "2"]

Output Control

The output section enables two post data generation operations to tailor the final output files. For each entity, it allows to hide certain properties or output properties using alias. Hereto entities under output is a list of entities to be altered.

For each entity element, there are two sub-properties:

  1. name - name of the targeted entity
  2. properties - a list of property output control

Property Output Control

Property output control has three properties:

  1. name - name of the target property
  2. hide - optional boolean value, default is false. When set, the target property is ignored in the output SQL statement.
  3. alias - optional String value. When set, the given alias will be used in the output SQL statement.

Examples The following code hides created_date when generating insert statement for post table and uses user instead of user_id as the column name when creating insert SQL statement for comment table.

output:
    entities:
      - name: "post"
        properties:
          - name: "created_date"
            hide: true
      - name: "comment"
        properties:
          - name: "user_id"
            alias: "user"

Miscellaneous

Key Words

Null Value

The key word used to define a null value is <?null?>. The tool will covert this key word into null when output sql statements.

Example The following code sets null to product_category.desc about 70% records generated.

- entity: "product_category"
  count: 50
  properties:
    - name: "desc"
      values:
        - values:
            - "<?null?>"
          weight: 0.7