Data Conjurer Reference

Concepts

The data conjurer tool, designed for efficient data generating, separates the input into two components: the reusable schema defining database structures, and the adaptable plan specifying desired output. The following table is a quick reference between the terminology used between the tool and MySQL database

Terminology	MySQL	Description
schema	schema	Map to a MySQL Database Schema
entity	table	Map to a MySQL Table
property	column	Map to a MySQL Column
index	index	Map to a database index, such as primary key, unique index
reference	foreign key	Map to a database foreign key
entity entry	data row	Map to a predefined data row

Define Schema

Schema

A database schema is defined through a YAML file. It contains two required properties:

name - The name of the schema; it doesn't have to match the actual database schema name, but it is recommended to do so.
entities - A set of entity definitions. At least one entity is required. Entities with the same name are considered duplicates, which will cause an error.

The following is a sample schema with name "text" and an entity "country"

name: test
entities:
  - name: country
    properties:
      - name: "id"
        type: "sequence"
        index:
          id: 0

This schema definition is equivalent to the a MySQL schema "test" with the following structure

CREATE TABLE `country` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
);

Entity

Entity presents a definition of a database table. It contains two properties: name and properties. Both properties are required for a valid definition.

name - The name of the entity; it doesn't have to match the actual database table name, but it is recommended to do so.
Properties - A set of property definitions. Each property can be mapped to a database column. Following is a sample entity definition

  - name: country
    properties:
      - name: "id"
        type: "sequence"
        index:
          id: 0
      - name: "name"
        type: "text"
        constraints:
          - type: size
            min: 0
            max: 20
      - name: "code"
        type: "text"
        constraints:
          - type: length
            max: 2

The above entity definition can be used to generate data for the following MySQL table

CREATE TABLE `country` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(20) NOT NULL,
  `code` char(2) NOT NULL,
  PRIMARY KEY (`id`)
)

Property

Property is an abstraction of a database column. It has four properties:

name - A (column) name to identify the property
type - A predefined data type for generation
index - A unique constraint enforced on one or multiple properties under same entity, can be used to define database primary key or unique index
constraints - A list of Constraint definition to provide rules for the data generation
reference - The way to identify a value is provided by another entity. It's how a database foreign key is defined.

Type

The following five property types can be used for this tool

Type	Description
sequence	An auto incremental number, by default starting at 1.
text	Any text value
number	Any decimal number value, such as 100, -54 or 99.99
date	A date value, the output format is "yyyy-MM-dd"
datetime	A date time value, the output format is "yyyy-MM-dd HH:mm:ss"
time	A time value, the output format is "HH:mm:ss"

Index

An index definition contains three parts: id, type and qualifier.

id - id should be a non-negative integer. The property is used as primary key when its value is 0 which will generate sequence starting with 1 when the property type is sequence
type - An integer number, allowed value are 0(regular index), 1(set index), 2 (non circular index), 3 (ordered non circular index) and 4 (distinct element index).
qualifier - An integer. When applying to non circular index or ordered non circular index, the allowed values are 1 and 2. When applying to distinct element value index, the allowed value is 1.

Regular Index

A regular index only enforce unique combination.

Sample regular index

properties:
  - name: "product_id"
    type: "sequence"
    index:
      id: 1
  - name: "order_id"
    type: "sequence"
    index:
      id: 1

The above definition allows the following combination of product_id and order_id to be generated

[
  {"product_id": 1, "order_id": 1},
  {"product_id": 2, "order_id": 1},
  {"product_id": 1, "order_id": 2},
  {"product_id": 2, "order_id": 3},
  {"product_id": 3, "order_id": 2}
]

Set Index

A set index enforces the unique combination regardless of element order. Values of [1, 2] and [2, 1] are treated as the same value.

Sample set index

properties:
  - name: "product_id"
    type: "sequence"
    index:
      id: 1
      type: 1
  - name: "order_id"
    type: "sequence"
    index:
      id: 1
      type: 1

With above index definition, same five entries generated in the regular index example will be reduced to the following three

[
  {"product_id": 1, "order_id": 1},
  {"product_id": 2, "order_id": 1},
  {"product_id": 2, "order_id": 3}
]

Non Circular Index

A non circular index treat each index entry as a branch of a graph. Each branch has a direction from parent to child, a qualifier value 1 identifies a parent while 2 identifies a child. A newly generated index is accepted only the following two conditions are met:

The new branch won't create a circle with existing values, for example [1, 1] creates a circle to itself, [1, 2] [2, 1] creates another circle.
The new branch won't create a multiple parents for the same child. For example, [{"parent_id": 1, "child_id": 3}, {"parent_id": 2, "child_id": 3}] creates a multiple parents situation where child_id 3 has two parent_id 1 and 2.

A common usage of non circular index is to use with linked reference to create a tree structure, such as FB type comments.

Sample non circular index

properties:
  - name: "product_id"
    type: "sequence"
    index:
      id: 1
      type: 2
      qualifier: 1
  - name: "order_id"
    type: "sequence"
    index:
      id: 1
      type: 2
      qualifier: 2

In this case, the five generated index entries will be reduced to the following

[
  {"product_id": 2, "order_id": 1},
  {"product_id": 2, "order_id": 3}
]

Ordered Non Circular Index

An ordered non circular index is a non circular index with one extra constraint which enforces the property value of any parent id is less than the value of child id. Please make sure the properties values of marked with parent id and child id are comparable before apply it.

Example

When applying the following code, the value of product_id will be less than order_id.

properties:
  - name: "product_id"
    type: "sequence"
    index:
      id: 1
      type: 3
      qualifier: 1
  - name: "order_id"
    type: "sequence"
    index:
      id: 1
      type: 3
      qualifier: 2

Sample outputs

[
  {"product_id": 2, "order_id": 3}
]

Distinct Element Index

Distinct element index does not allow properties with same value to be created. To add a property to the check, qualifier is set to 1. If no properties with qualifier are found, distinct element index works same as regular index.

⚠️ The tool assumes that properties marked by qualifier have the same data type. To simplify the data type check, the string values of desired property are used to guarantee the unique.

Example

The following code will generate bi-direction links between from and to except no same value of from and to is allowed.

properties:
  - name: "from"
    type: "sequence"
    index:
      id: 2
      type: 4
      qualifier: 1
  - name: "to"
    type: "sequence"
    index:
      id: 2
      type: 4
      qualifier: 1

Sample outputs

[
  {"from": 2, "to": 3},
  {"from": 3, "to": 2}
]

Constraints

Constraints define rules when generating data, e.g. maximum or minimum length of a string. A constraint can be defined in schema or data plan. It's recommended to define database schema related constrains within schema and all other purpose constraints with data plan. The following table lists all supported constraints and their usages:

Name	Compatible Type	Description
chain	number, datetime, date	A Markov chain on records generated in sequence
category	text	A pre-formatted text, such as email
char_group	text	A predefined character group
correlation	number, datetime, date, time	An expression or function
duration	date, datetime	A date time range
interval	sequence	An interval of a sequence
length	text	A fixed length text value
precision	number	A decimal places of a number
range	number	A range for integral part of a number
size	text	A variant length text value

⚠️ Constraints applied to a non-compatible type are ignored

Chain

It enforces a Markov chain to all the targeted property value based on the record generating sequence. Chain is defined with four properties:

type - the value needs to be "chain"
seed - a positive number which defines the base value difference between two adjacent values.
direction - how the value is changed. When set to 0, the next element can either increase or decrease the previous value, when set to 1, value always increases, and when set to -1, value always decreases
style - defines how the seed is used to generate next number: same value of seed is applied when set to 0, a random number between 0 and seed is applied when set to 1, and a Gaussian random with μ = seed for value 2.

⚠️ When using Gaussian random, the values used will be tween 0 and 2 * seed to avoid possible value overflow.

Example

The following code generates chain of values for date beginning at 12/05/2023 00:00:00 with 8 seconds incremental value.

- name: "date"
  type: "datetime"
  constraints:
    - type: "duration"
      start:
        year: 2023
        month: 12
        day: 5
    - type: "chain"
      seed: 8000
      direction: 1
      style: 0

Character Group

By default, texts generated by the tool are using ASCII alpha numerals characters. By applying a character group (char_group) constraint allows text generated using different character sets. The tool supports the following character groups:

Character Group	Description
arabic_numerals	It represents a number between 0 and 9.
ascii_alpha_numerals	It represents a letter between a and Z or a number between 0 and 9.
ascii_letters	It represents a letter between a and Z.
ascii_lowercase_letters	It represents a letter between a and z.
ascii_uppercase_letters	It represents a letter between A and Z.
digits	It presents any digit character in unicode
letters	It presents any letter character in unicode
ea_hiragana	It presents Japanese Hiragana characters
ea_katakana	It presents Japanese Katakana characters
ea_cjk_a	It presents east asian CJK group A characters

⚠️ If a Character Group of unicode is used, the text length is calculated differently from acii characters.

Character group includes two properties type and groups. Type value has to be "char_group" and the groups is an array with any combination of supported character group. For example, the following code generates texts combining Hiragana, Katakana and numbers for "comment"

- name: "comment"
  constraints:
    - type: "char_group"
      groups:
        - "ea_hiragana"
        - "ea_katakana"
        - "arabic_numerals"

Correlation

Correlation instructs the tool to generate value by applying rules to existing values of other properties of same entity. Correlation works for number, date and datetime type properties. A correlation has three properties:

type - the value needs to be "correlation"
properties - A set of properties used in calculation
formula - A formula for calculation

⚠️ When using correlation, the property name can't contain mathematical operators such as dash, you need to use alias feature if your column name includes them.

The expression evaluator of project EvalEx is used to process formulas, you may need to refer to its document for supported functions. For date or datetime, we provide two extra functions:

Function	Description	Example
TIME_AFTER	Generate a date/datetime after a given value but before 10 years from now	TIME_AFTER(p1)
PAST_TIME_AFTER	Generate a date/datetime after a given value but before current time	PAST_TIME_AFTER(p1)

Examples

The following code applies the sum of "p1" and "p2" as the value of property "p3"

- name: "p3"
  constraints:
    - type: "correlation"
      properties: ["p1", "p2"]
      formula: "p1 + p2"

The following code generate a past date time later than the value of "user_created"

- name: "created"
  constraints:
    - type: "correlation"
      properties: ["user_created"]
      formula: "PAST_TIME_AFTER(user_created)"

Duration

Duration defines a date range where the generated date or date time needs to be within. Duration has three properties:

type - the value should be "duration"
start - the earliest time mark.
end - the latest time mark.

A time mark consists of six properties: year, month (1 - 12), day, hour (0 - 23), minute and second, those are used to define the time boundary with the most accuracy at second level.

Examples

The following code generates a date time between 03/01/2023 00:00:00 and 11/10/2023 14:05:30

- name: "created"
  type: "datetime"
  constraints:
    - type: duration
      start:
        year: 2023
        month: 3
        day: 1
      end:
        year: 2023
        month: 11
        day: 10
        hour: 14
        minute: 5
        second: 30

The following code generates a date between 04/22/2023 and today

- name: "created"
  type: "date"
  constraints:
    - type: duration
      start:
        year: 2023
        month: 4
        day: 22

Interval

An interval defines the base and leap when generate a sequence. It has three properties:

type - value is "interval"
base - the starting integer value of the sequence, default is 1
leap - integer to define the value difference between two sequence number. It's required.

The following code generates a sequence starts with 3 and increasing 2 each time. The output is 3, 5, 7, 9 ...

- name: "id"
type: "sequence"
constraints:
  - type: "interval"
    base: 3
    leap: 2

Length

Length instructs tool to generate fixed length text. It has two properties:

type - value is "length"
max - the character size needed

The following code generates a two letter text for "code"

- name: "code"
  type: "text"
  constraints:
    - type: length
      max: 2

Precision

Precision is used to define the decimal places of a generated number. A number type property without precision constraint, only integer will be generated. It has two properties:

type - value should be "precision"
max - maximum decimal places. Its value is an integer between 0 and 10

The following code generates a number of two digits for "price"

- name: "price"
  type: "number"
  constraints:
    - type: "precision"
      max: 2

Range

Range constraint is to define a number value range. It has three properties which one of "min" and "max" could be optional.

type - its value should be "range"
min - an inclusive minimum value
max - an exclusive maximum value

the following code generates an integer between 1 and 998 for "price"

- name: "price"
  type: "number"
  constraints:
    - type: "range"
        min: 1
        max: 999

Size

Constraint size is used to generate text within given length range. It contains three properties, in those one of "min" and "max" may be optional.

type - its value is "size"
min - An inclusive minimum length
max - An exclusive maximum length

The following code generates a random text of size between 2 and 19 for "comment"

      - name: "comment"
        type: "text"
        constraints:
          - type: size
            min: 2
            max: 20

Reference

Instead of generating values, a reference applies an existing property value from another entity. It can be used to define a foreign key reference between two tables. Reference has three properties:

entity - required. It defines the target entity
property - required. It defines the target property
linked - optional. When reference multiple properties of the same entity, it allows values used are from the same record.

Examples

The following code creates a foreign key to product.id for product_id property

- name: "product_id"
  type: "sequence"
  reference:
    entity: "product"
    property: "id"

The following code populates values of user_id and user_created using same record under usr_user

- name: "order_detail"
  properties:
    - name: "user_id"
      type: "sequence"
      reference:
        entity: "usr_user"
        property: "id"
        linked: "id"
    - name: "user_created"
      type: "datetime"
      reference:
        entity: "usr_user"
        property: "created"
        linked: "id"

Define Data Generation Plan

A data generation plan instructs the tool that how the data is generated against targeted schema. Defining using a YAML format, it contains two parts:

plan - A required property which defines how many records are generated for each entity and any extra data requirements
output - An optional property to tailor the output data, such as hide columns or alias

Sample data plan yaml layout

plan:
	...
output:
	...

Data Plan

Data plan has three first level elements

name - name of the data plan, it can be any text
schema - the targeted schema, its value has to match the name defined in the schema yaml
data - a list of entity data definitions. Each element in this list matches one entity in schema. If there is no match element for a defined entity, no data is generated for that entity.

Example

The following code defines a data plan named as hw-plan which generates data against schema test. It requests 10 records for country table and 100 records for city table.

plan:
  name: "hw-plan"
  schema: "test"
  data:
    - entity: "country"
      count: 10
    - entity: "city"
      count: 100

Entity Data

An entity data element provides guidance to the tool on how to generate data for each entity. It has the following properties:

entity - The name of targeted entity
dataId - An optional but unique non-negative number, default is 0. It's used when multiple batches of data with different rules are needed for the same entity. For example, if there is a user table with a country column, we can use two entity data definitions, one entity data generates 100 US user and the other generates 50 UK user.
count - A positive number to define how many records are needed
properties - A list of property input control element which provides extra details how the data is required for each property
entries - It defines user provided entries used in data generation

Examples

The following code create two batch of record for the city, the first one contains 10 records and the second contains 5 records

- entity: "city"
  count: 10
- entity: "city"
  dataId: 1
  count: 5

Property Input Control

Property input control is used to list extra specifications of property data generation. It has five properties:

name - target property name. It's required.
values - a list of property value distribution, for example, 30 percent of the value generated would be null
defaultValue - a String presentation of a default value for target property
referenceStrategy - if the target property is a reference property. It defines how the referenced values are applied. The available strategies are either random or loop. If it's not set, random strategy is used.
constraints - a list of extra constraints to be applied to this property. For details of defining constraints please refer to Constraints section.

Default Value

A predefined default value will be applied with the following rules:

If the target property has a given value in data.entries, the default value is ignored when generating those entries.
If values are defined, the default value is ignored when generating all the records within the percentage defined in values.
The default value will be applied to all records not covered by rule 1 and 2.

Values

Values are a list of percentage distribution will provide granular control to applying predefined values. It has two properties

values - a set of String presentation of the desired values. A number value needs to be enclosed within double quotes. If there are more than one value given, the values are randomly applied.
weight - a double value to define the record percentage to be applied. Each weight value should in the range of (0, 1] and the total weight of all values shouldn't be more than 1. For example, if the weight is 0.1 and total records generated are 100, there are about 10 records will apply the given value. However the weight is not a precisely guaranteed.

Examples

The following code assigns the value "CLOSED" to 10% of the generated records, "RETURNED" to 15% of the records, and "PROCESSING" to the remaining records in the "status" column.

properties:
  - name: "status"
    values:
      - values:
          - "CLOSED"
        weight: 0.1
      - values:
          - "RETURNED"
        weight: 0.15
    defaultValue: "PROCESSING"

The following code assigns the value "CLOSED" or "RETURNED" to 25% of the records and "PROCESSING" to the remaining records in the "status" column.

properties:
  - name: "status"
    values:
      - values:
          - "CLOSED"
          - "RETURNED"
        weight: 0.25
    defaultValue: "PROCESSING"

Entries

Entries allows user to apply predefined records to the generation process. All the records defined under entries will be applied orderly prior to any data generation. For example, if 11 records are defined under entries, and the desired record number is 10, the first 10 records will be used. However if the desired number is 100, the tool will apply all 11 records first then generate the remaining 89 records.

Entries contains the following two properties:

properties - a list of property names which values will be assigned. The tool will generate values for properties not listed when applying the records.
values - a list of records to be used. Each records defined should have the same number of value defined in properties.

Example The following code uses the three records defined under entries first then generates 7 remaining for product table.

- entity: "product"
  count: 10
  entries:
    properties:
      - "name"
      - "desc"
      - "sku"
      - "category_id"
    values:
      - ["tv_1", "desc_1", "elect00001", "1"]
      - ["tv_2", "desc_2", "elect00002", "1"]
      - ["hp_win", "a computer", "comp000001", "2"]

Output Control

The output section enables two post data generation operations to tailor the final output files. For each entity, it allows to hide certain properties or output properties using alias. Hereto entities under output is a list of entities to be altered.

For each entity element, there are two sub-properties:

name - name of the targeted entity
properties - a list of property output control

Property Output Control

Property output control has three properties:

name - name of the target property
hide - optional boolean value, default is false. When set, the target property is ignored in the output SQL statement.
alias - optional String value. When set, the given alias will be used in the output SQL statement.

Examples The following code hides created_date when generating insert statement for post table and uses user instead of user_id as the column name when creating insert SQL statement for comment table.

output:
    entities:
      - name: "post"
        properties:
          - name: "created_date"
            hide: true
      - name: "comment"
        properties:
          - name: "user_id"
            alias: "user"

Miscellaneous

Key Words

Null Value

The key word used to define a null value is <?null?>. The tool will covert this key word into null when output sql statements.

Example The following code sets null to product_category.desc about 70% records generated.

- entity: "product_category"
  count: 50
  properties:
    - name: "desc"
      values:
        - values:
            - "<?null?>"
          weight: 0.7

Category	Description	Qualifiers
Name	A person name	firstname, lastname
Address	An address	street, city, country, state, zip
Email	An email
Id	Id	formartedℹ️

Properties	Description	Value
type	constraint type	Fixed value:"category"
name	name of the provider	name, address or email
qualifier	It used to extract part of the output. For example, if name generated "Mark Twain", only "Mark" will be used when qualifier is set to firstname	Please refer to previous table for allowed qualifies of each category
locale	Desired local	String value using BCP 47 format. When generating country name, the locale needs to have a country part after the language such as "en-US", "ja-JP"
compoundId	integer type compound id. default value is 0	If an entity has two properties with same category name and compound id, the same value will be applied to both properties. Otherwise a different values will be applied

Data Conjurer Reference

Concepts

Define Schema

Schema

Entity

Property

Type

Index

Regular Index

Set Index

Non Circular Index

Ordered Non Circular Index

Distinct Element Index

Constraints

Chain

Category

Character Group

Correlation

Duration

Interval

Length

Precision

Range

Size

Reference

Define Data Generation Plan

Data Plan

Entity Data

Property Input Control

Default Value

Values

Entries

Output Control

Property Output Control

Miscellaneous

Key Words

Null Value

Clone this wiki locally