-
Notifications
You must be signed in to change notification settings - Fork 0
Data Conjurer Reference
The data conjurer tool, designed for efficient data generating, separates the input into two components: the reusable schema defining database structures, and the adaptable plan specifying desired output. The following table is a quick reference between the terminology used between the tool and MySQL database
Terminology | MySQL | Description |
---|---|---|
schema | schema | Map to a MySQL Database Schema |
entity | table | Map to a MySQL Table |
property | column | Map to a MySQL Column |
index | index | Map to a database index, such as primary key, unique index |
reference | foreign key | Map to a database foreign key |
entity entry | data row | Map to a predefined data row |
A database schema is defined through a YAML file. It contains two required properties:
- name - The name of the schema; it doesn't have to match the actual database schema name, but it is recommended to do so.
- entities - A set of entity definitions. At least one entity is required. Entities with the same name are considered duplicates, which will cause an error.
The following is a sample schema with name "text" and an entity "country"
name: test
entities:
- name: country
properties:
- name: "id"
type: "sequence"
index:
id: 0
This schema definition is equivalent to the a MySQL schema "test" with the following structure
CREATE TABLE `country` (
`id` int unsigned NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`)
);
Entity presents a definition of a database table. It contains two properties: name and properties. Both properties are required for a valid definition.
- name - The name of the entity; it doesn't have to match the actual database table name, but it is recommended to do so.
- Properties - A set of property definitions. Each property can be mapped to a database column. Following is a sample entity definition
- name: country
properties:
- name: "id"
type: "sequence"
index:
id: 0
- name: "name"
type: "text"
constraints:
- type: size
min: 0
max: 20
- name: "code"
type: "text"
constraints:
- type: length
max: 2
The above entity definition can be used to generate data for the following MySQL table
CREATE TABLE `country` (
`id` int unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(20) NOT NULL,
`code` char(2) NOT NULL,
PRIMARY KEY (`id`)
)
Property is an abstraction of a database column. It has four properties:
- name - A (column) name to identify the property
- type - A predefined data type for generation
- index - A unique constraint enforced on one or multiple properties under same entity, can be used to define database primary key or unique index
-
constraints - A list of
Constraint
definition to provide rules for the data generation - reference - The way to identify a value is provided by another entity. It's how a database foreign key is defined.
The following five property types can be used for this tool
Type | Description |
---|---|
sequence | An auto incremental number, by default starting at 1. |
text | Any text value |
number | Any decimal number value, such as 100, -54 or 99.99 |
date | A date value, the output format is "yyyy-MM-dd" |
datetime | A date time value, the output format is "yyyy-MM-dd HH:mm:ss" |
time | A time value, the output format is "HH:mm:ss" |
An index definition contains three parts: id, type and qualifier.
-
id - id should be a non-negative integer. The property is used as primary key when its value is 0 which will generate sequence starting with 1 when the property type is
sequence
- type - An integer number, allowed value are 0(regular index), 1(set index), 2 (non circular index), 3 (ordered non circular index) and 4 (distinct element index).
-
qualifier - An integer. When applying to
non circular index
orordered non circular index
, the allowed values are 1 and 2. When applying todistinct element value index
, the allowed value is 1.
A regular index only enforce unique combination.
Sample regular index
properties:
- name: "product_id"
type: "sequence"
index:
id: 1
- name: "order_id"
type: "sequence"
index:
id: 1
The above definition allows the following combination of product_id and order_id to be generated
[
{"product_id": 1, "order_id": 1},
{"product_id": 2, "order_id": 1},
{"product_id": 1, "order_id": 2},
{"product_id": 2, "order_id": 3},
{"product_id": 3, "order_id": 2}
]
A set index enforces the unique combination regardless of element order. Values of [1, 2] and [2, 1] are treated as the same value.
Sample set index
properties:
- name: "product_id"
type: "sequence"
index:
id: 1
type: 1
- name: "order_id"
type: "sequence"
index:
id: 1
type: 1
With above index definition, same five entries generated in the regular index example will be reduced to the following three
[
{"product_id": 1, "order_id": 1},
{"product_id": 2, "order_id": 1},
{"product_id": 2, "order_id": 3}
]
A non circular index treat each index entry as a branch of a graph. Each branch has a direction from parent to child, a qualifier value 1 identifies a parent while 2 identifies a child. A newly generated index is accepted only the following two conditions are met:
- The new branch won't create a circle with existing values, for example [1, 1] creates a circle to itself, [1, 2] [2, 1] creates another circle.
- The new branch won't create a multiple parents for the same child. For example, [{"parent_id": 1, "child_id": 3}, {"parent_id": 2, "child_id": 3}] creates a multiple parents situation where child_id 3 has two parent_id 1 and 2.
A common usage of non circular index is to use with linked reference to create a tree structure, such as FB type comments.
Sample non circular index
properties:
- name: "product_id"
type: "sequence"
index:
id: 1
type: 2
qualifier: 1
- name: "order_id"
type: "sequence"
index:
id: 1
type: 2
qualifier: 2
In this case, the five generated index entries will be reduced to the following
[
{"product_id": 2, "order_id": 1},
{"product_id": 2, "order_id": 3}
]
An ordered non circular index is a non circular index with one extra constraint which enforces the property value of any parent id is less than the value of child id. Please make sure the properties values of marked with parent id and child id are comparable before apply it.
Example
When applying the following code, the value of product_id
will be less than order_id
.
properties:
- name: "product_id"
type: "sequence"
index:
id: 1
type: 3
qualifier: 1
- name: "order_id"
type: "sequence"
index:
id: 1
type: 3
qualifier: 2
Sample outputs
[
{"product_id": 2, "order_id": 3}
]
Distinct element index does not allow properties with same value to be created. To add a property to the check, qualifier
is set to 1. If no properties with qualifier
are found, distinct element index works same as regular index.
qualifier
have the same data type. To simplify the data type check, the string values of desired property are used to guarantee the unique.
Example
The following code will generate bi-direction links between from
and to
except no same value of from
and to
is allowed.
properties:
- name: "from"
type: "sequence"
index:
id: 2
type: 4
qualifier: 1
- name: "to"
type: "sequence"
index:
id: 2
type: 4
qualifier: 1
Sample outputs
[
{"from": 2, "to": 3},
{"from": 3, "to": 2}
]
Constraints define rules when generating data, e.g. maximum or minimum length of a string. A constraint can be defined in schema or data plan. It's recommended to define database schema related constrains within schema and all other purpose constraints with data plan. The following table lists all supported constraints and their usages:
Name | Compatible Type | Description |
---|---|---|
chain | number, datetime, date | A Markov chain on records generated in sequence |
category | text | A pre-formatted text, such as email |
char_group | text | A predefined character group |
correlation | number, datetime, date, time | An expression or function |
duration | date, datetime | A date time range |
interval | sequence | An interval of a sequence |
length | text | A fixed length text value |
precision | number | A decimal places of a number |
range | number | A range for integral part of a number |
size | text | A variant length text value |
It enforces a Markov chain to all the targeted property value based on the record generating sequence. Chain is defined with four properties:
- type - the value needs to be "chain"
- seed - a positive number which defines the base value difference between two adjacent values.
- direction - how the value is changed. When set to 0, the next element can either increase or decrease the previous value, when set to 1, value always increases, and when set to -1, value always decreases
-
style - defines how the seed is used to generate next number: same value of seed is applied when set to 0, a random number between 0 and
seed
is applied when set to 1, and a Gaussian random withμ = seed
for value 2.
2 * seed
to avoid possible value overflow.
Example
The following code generates chain of values for date
beginning at 12/05/2023 00:00:00 with 8 seconds incremental value.
- name: "date"
type: "datetime"
constraints:
- type: "duration"
start:
year: 2023
month: 12
day: 5
- type: "chain"
seed: 8000
direction: 1
style: 0
Category is a constraint that a predefined data provider is used to generate certain desired text value, such as name or address. The tool provides the following categories
Category | Description | Qualifiers |
---|---|---|
Name | A person name | firstname, lastname |
Address | An address | street, city, country, state, zip |
An email | ||
Id | Id | formartedℹ️ |
ℹ️ Before version 1.1, If a qualifier is used for Id, the tool will process the qualifier and replace any ?
sign with a random letter and any #
sign with a random number. For example id-??.###
may result as id-fd.872
. Otherwise an UUID is generated. Starting from version 1.1, id format is defined using format
constraint.
A category definition includes four properties
Properties | Description | Value |
---|---|---|
type | constraint type | Fixed value:"category" |
name | name of the provider | name, address or email |
qualifier | It used to extract part of the output. For example, if name generated "Mark Twain", only "Mark" will be used when qualifier is set to firstname | Please refer to previous table for allowed qualifies of each category |
locale | Desired local | String value using BCP 47 format. When generating country name, the locale needs to have a country part after the language such as "en-US", "ja-JP" |
compoundId | integer type compound id. default value is 0 | If an entity has two properties with same category name and compound id, the same value will be applied to both properties. Otherwise a different values will be applied |
Examples
- The following code generates email values for "username"
- name: "username"
constraints:
- type: "category"
name: "email"
- The following code generates Chinese character first names for "firstname"
- name: "first_name"
constraints:
- type: "category"
name: "name"
qualifier: "firstname"
locale: "zh-CN"
- The following code generates US zip codes for "zip"
- name: "zip"
constraints:
- type: "category"
name: "address"
qualifier: "zip"
locale: "en-US"
- The following code generates an UUID for
id
property and a formatted id forrefId
property
Before Version 1.1
properties:
- name: "id"
type: "text"
constraints:
- type: "category"
name: "id"
- name: "refId"
type: "text"
constraints:
- type: "category"
name: "id"
qualifier: "ref-##?##"
compoundId: 1
After Version 1.1
properties:
- name: "id"
type: "text"
constraints:
- type: "category"
name: "id"
- name: "refId"
type: "text"
constraints:
- type: "category"
name: "id"
compoundId: 1
- type: "format"
format: "ref-##?##"
By default, texts generated by the tool are using ASCII alpha numerals characters. By applying a character group (char_group) constraint allows text generated using different character sets. The tool supports the following character groups:
Character Group | Description |
---|---|
arabic_numerals | It represents a number between 0 and 9. |
ascii_alpha_numerals | It represents a letter between a and Z or a number between 0 and 9. |
ascii_letters | It represents a letter between a and Z. |
ascii_lowercase_letters | It represents a letter between a and z. |
ascii_uppercase_letters | It represents a letter between A and Z. |
digits | It presents any digit character in unicode |
letters | It presents any letter character in unicode |
ea_hiragana | It presents Japanese Hiragana characters |
ea_katakana | It presents Japanese Katakana characters |
ea_cjk_a | It presents east asian CJK group A characters |
Character group includes two properties type and groups. Type value has to be "char_group" and the groups is an array with any combination of supported character group. For example, the following code generates texts combining Hiragana, Katakana and numbers for "comment"
- name: "comment"
constraints:
- type: "char_group"
groups:
- "ea_hiragana"
- "ea_katakana"
- "arabic_numerals"
Correlation instructs the tool to generate value by applying rules to existing values of other properties of same entity. Correlation works for number, date and datetime type properties. A correlation has three properties:
- type - the value needs to be "correlation"
- properties - A set of properties used in calculation
- formula - A formula for calculation
The expression evaluator of project EvalEx is used to process formulas, you may need to refer to its document for supported functions. For date or datetime, we provide two extra functions:
Function | Description | Example |
---|---|---|
TIME_AFTER | Generate a date/datetime after a given value but before 10 years from now | TIME_AFTER(p1) |
PAST_TIME_AFTER | Generate a date/datetime after a given value but before current time | PAST_TIME_AFTER(p1) |
Examples
- The following code applies the sum of "p1" and "p2" as the value of property "p3"
- name: "p3"
constraints:
- type: "correlation"
properties: ["p1", "p2"]
formula: "p1 + p2"
- The following code generate a past date time later than the value of "user_created"
- name: "created"
constraints:
- type: "correlation"
properties: ["user_created"]
formula: "PAST_TIME_AFTER(user_created)"
Duration defines a date range where the generated date or date time needs to be within. Duration has three properties:
- type - the value should be "duration"
- start - the earliest time mark.
- end - the latest time mark.
A time mark consists of six properties: year, month (1 - 12), day, hour (0 - 23), minute and second, those are used to define the time boundary with the most accuracy at second level.
Examples
- The following code generates a date time between 03/01/2023 00:00:00 and 11/10/2023 14:05:30
- name: "created"
type: "datetime"
constraints:
- type: duration
start:
year: 2023
month: 3
day: 1
end:
year: 2023
month: 11
day: 10
hour: 14
minute: 5
second: 30
- The following code generates a date between 04/22/2023 and today
- name: "created"
type: "date"
constraints:
- type: duration
start:
year: 2023
month: 4
day: 22
An interval defines the base and leap when generate a sequence. It has three properties:
- type - value is "interval"
- base - the starting integer value of the sequence, default is 1
- leap - integer to define the value difference between two sequence number. It's required.
The following code generates a sequence starts with 3 and increasing 2 each time. The output is 3, 5, 7, 9 ...
- name: "id"
type: "sequence"
constraints:
- type: "interval"
base: 3
leap: 2
Length instructs tool to generate fixed length text. It has two properties:
- type - value is "length"
- max - the character size needed
The following code generates a two letter text for "code"
- name: "code"
type: "text"
constraints:
- type: length
max: 2
Precision is used to define the decimal places of a generated number. A number type property without precision constraint, only integer will be generated. It has two properties:
- type - value should be "precision"
- max - maximum decimal places. Its value is an integer between 0 and 10
The following code generates a number of two digits for "price"
- name: "price"
type: "number"
constraints:
- type: "precision"
max: 2
Range constraint is to define a number value range. It has three properties which one of "min" and "max" could be optional.
- type - its value should be "range"
- min - an inclusive minimum value
- max - an exclusive maximum value
the following code generates an integer between 1 and 998 for "price"
- name: "price"
type: "number"
constraints:
- type: "range"
min: 1
max: 999
Constraint size is used to generate text within given length range. It contains three properties, in those one of "min" and "max" may be optional.
- type - its value is "size"
- min - An inclusive minimum length
- max - An exclusive maximum length
The following code generates a random text of size between 2 and 19 for "comment"
- name: "comment"
type: "text"
constraints:
- type: size
min: 2
max: 20
Instead of generating values, a reference applies an existing property value from another entity. It can be used to define a foreign key reference between two tables. Reference has three properties:
- entity - required. It defines the target entity
- property - required. It defines the target property
- linked - optional. When reference multiple properties of the same entity, it allows values used are from the same record.
Examples
- The following code creates a foreign key to
product.id
forproduct_id
property
- name: "product_id"
type: "sequence"
reference:
entity: "product"
property: "id"
- The following code populates values of
user_id
anduser_created
using same record underusr_user
- name: "order_detail"
properties:
- name: "user_id"
type: "sequence"
reference:
entity: "usr_user"
property: "id"
linked: "id"
- name: "user_created"
type: "datetime"
reference:
entity: "usr_user"
property: "created"
linked: "id"
A data generation plan instructs the tool that how the data is generated against targeted schema. Defining using a YAML format, it contains two parts:
- plan - A required property which defines how many records are generated for each entity and any extra data requirements
- output - An optional property to tailor the output data, such as hide columns or alias
Sample data plan yaml layout
plan:
...
output:
...
Data plan has three first level elements
- name - name of the data plan, it can be any text
- schema - the targeted schema, its value has to match the name defined in the schema yaml
- data - a list of entity data definitions. Each element in this list matches one entity in schema. If there is no match element for a defined entity, no data is generated for that entity.
Example
- The following code defines a data plan named as
hw-plan
which generates data against schematest
. It requests 10 records forcountry
table and 100 records forcity
table.
plan:
name: "hw-plan"
schema: "test"
data:
- entity: "country"
count: 10
- entity: "city"
count: 100
An entity data element provides guidance to the tool on how to generate data for each entity. It has the following properties:
- entity - The name of targeted entity
-
dataId - An optional but unique non-negative number, default is 0. It's used when multiple batches of data with different rules are needed for the same entity. For example, if there is a
user
table with acountry
column, we can use two entity data definitions, one entity data generates 100US
user and the other generates 50UK
user. - count - A positive number to define how many records are needed
- properties - A list of property input control element which provides extra details how the data is required for each property
- entries - It defines user provided entries used in data generation
Examples
- The following code create two batch of record for the city, the first one contains 10 records and the second contains 5 records
- entity: "city"
count: 10
- entity: "city"
dataId: 1
count: 5
Property input control is used to list extra specifications of property data generation. It has five properties:
- name - target property name. It's required.
- values - a list of property value distribution, for example, 30 percent of the value generated would be null
- defaultValue - a String presentation of a default value for target property
-
referenceStrategy - if the target property is a reference property. It defines how the referenced values are applied. The available strategies are either
random
orloop
. If it's not set, random strategy is used. - constraints - a list of extra constraints to be applied to this property. For details of defining constraints please refer to Constraints section.
A predefined default value will be applied with the following rules:
- If the target property has a given value in
data.entries
, the default value is ignored when generating those entries. - If
values
are defined, the default value is ignored when generating all the records within the percentage defined invalues
. - The default value will be applied to all records not covered by rule 1 and 2.
Values are a list of percentage distribution will provide granular control to applying predefined values. It has two properties
- values - a set of String presentation of the desired values. A number value needs to be enclosed within double quotes. If there are more than one value given, the values are randomly applied.
- weight - a double value to define the record percentage to be applied. Each weight value should in the range of (0, 1] and the total weight of all values shouldn't be more than 1. For example, if the weight is 0.1 and total records generated are 100, there are about 10 records will apply the given value. However the weight is not a precisely guaranteed.
Examples
- The following code assigns the value "CLOSED" to 10% of the generated records, "RETURNED" to 15% of the records, and "PROCESSING" to the remaining records in the "status" column.
properties:
- name: "status"
values:
- values:
- "CLOSED"
weight: 0.1
- values:
- "RETURNED"
weight: 0.15
defaultValue: "PROCESSING"
- The following code assigns the value "CLOSED" or "RETURNED" to 25% of the records and "PROCESSING" to the remaining records in the "status" column.
properties:
- name: "status"
values:
- values:
- "CLOSED"
- "RETURNED"
weight: 0.25
defaultValue: "PROCESSING"
Entries allows user to apply predefined records to the generation process. All the records defined under entries
will be applied orderly prior to any data generation. For example, if 11 records are defined under entries
, and the desired record number is 10, the first 10 records will be used. However if the desired number is 100, the tool will apply all 11 records first then generate the remaining 89 records.
Entries contains the following two properties:
- properties - a list of property names which values will be assigned. The tool will generate values for properties not listed when applying the records.
-
values - a list of records to be used. Each records defined should have the same number of value defined in
properties
.
Example
The following code uses the three records defined under entries first then generates 7 remaining for product
table.
- entity: "product"
count: 10
entries:
properties:
- "name"
- "desc"
- "sku"
- "category_id"
values:
- ["tv_1", "desc_1", "elect00001", "1"]
- ["tv_2", "desc_2", "elect00002", "1"]
- ["hp_win", "a computer", "comp000001", "2"]
The output
section enables two post data generation operations to tailor the final output files. For each entity, it allows to hide certain properties or output properties using alias. Hereto entities
under output
is a list of entities to be altered.
For each entity element, there are two sub-properties:
- name - name of the targeted entity
- properties - a list of property output control
Property output control has three properties:
- name - name of the target property
- hide - optional boolean value, default is false. When set, the target property is ignored in the output SQL statement.
- alias - optional String value. When set, the given alias will be used in the output SQL statement.
Examples
The following code hides created_date
when generating insert statement for post
table and uses user
instead of user_id
as the column name when creating insert SQL statement for comment
table.
output:
entities:
- name: "post"
properties:
- name: "created_date"
hide: true
- name: "comment"
properties:
- name: "user_id"
alias: "user"
The key word used to define a null value is <?null?>
. The tool will covert this key word into null
when output sql statements.
Example
The following code sets null
to product_category.desc
about 70% records generated.
- entity: "product_category"
count: 50
properties:
- name: "desc"
values:
- values:
- "<?null?>"
weight: 0.7