From 0aa04a7bd8764866bcee12abc330e0d32ad1edcf Mon Sep 17 00:00:00 2001 From: Flook Peter Date: Mon, 13 Nov 2023 14:50:26 +0800 Subject: [PATCH] Clean up layout of guide grid cards --- docs/setup/guide/index.md | 108 ++++++++++++++-------------------- site/search/search_index.json | 2 +- site/setup/guide/index.html | 28 ++------- site/sitemap.xml.gz | Bin 524 -> 524 bytes 4 files changed, 49 insertions(+), 89 deletions(-) diff --git a/docs/setup/guide/index.md b/docs/setup/guide/index.md index 188f1396..3cd072fd 100644 --- a/docs/setup/guide/index.md +++ b/docs/setup/guide/index.md @@ -12,73 +12,53 @@ the trial can be found [**here**](../../get-started/docker.md#paid-version-trial ## Scenarios -!!! note "Free tier scenarios" - -
- - - __[First Data Generation]__ - If you are new, this is the place to start - - __[Multiple Records Per Column Value]__ - How you can generate multiple records per set of columns - - __[Foreign Keys Across Data Sources]__ - Generate matching values across generated data sets - - __[Data Validations]__ - (Soon to document) Run data validations after generating data - -
- - [First Data Generation]: scenario/first-data-generation.md - [Multiple Records Per Column Value]: scenario/records-per-column.md - [Foreign Keys Across Data Sources]: scenario/batch-and-event.md - [Data Validations]: scenario/first-data-generation.md - -!!! example "Paid tier scenarios" - -
- - - __[Auto Generate From Data Connection]__ - Automatically generating data from just defining data sources - - __[Delete Generated Data]__ - Delete the generated data whilst leaving other data - - __[Generate Batch and Event Data]__ - Generate matching batch and event data - -
- - [Auto Generate From Data Connection]: scenario/auto-generate-connection.md - [Delete Generated Data]: scenario/delete-generated-data.md - [Generate Batch and Event Data]: scenario/batch-and-event.md +
+ +- __[First Data Generation]__ - If you are new, this is the place to start +- __[Multiple Records Per Column Value]__ - How you can generate multiple records per set of columns +- __[Foreign Keys Across Data Sources]__ - Generate matching values across generated data sets +- __[Data Validations]__ - (Soon to document) Run data validations after generating data +- __[Auto Generate From Data Connection]__ - Automatically generating data from just defining data sources +- __[Delete Generated Data]__ - Delete the generated data whilst leaving other data +- __[Generate Batch and Event Data]__ - Generate matching batch and event data + +
+ + [First Data Generation]: scenario/first-data-generation.md + [Multiple Records Per Column Value]: scenario/records-per-column.md + [Foreign Keys Across Data Sources]: scenario/batch-and-event.md + [Data Validations]: scenario/first-data-generation.md + [Auto Generate From Data Connection]: scenario/auto-generate-connection.md + [Delete Generated Data]: scenario/delete-generated-data.md + [Generate Batch and Event Data]: scenario/batch-and-event.md ## Data Sources -!!! note "Free tier data sources" - -
- - - __[Files (CSV, JSON, ORC, Parquet)]__ - Generate data for files with separator (comma separated by default) - - __[Files (Fixed width)]__ - (Soon to document) A variant of CSV but with no separator - - __[Postgres]__ - JDBC Postgres tables - -
- - [Files (CSV, JSON, ORC, Parquet)]: scenario/first-data-generation.md - [Files (Fixed width)]: scenario/first-data-generation.md - [Postgres]: scenario/first-data-generation.md - -!!! example "Paid tier data sources" - -
- - - __[Cassandra]__ - Cassandra tables - - __[Kafka]__ - Kafka topics - - __[Solace]__ - Solace messages - - __[Marquez]__ - Generate data based on metadata in Marquez - - __[OpenMetadata]__ - Generate data based on metadata in OpenMetadata - - __[HTTP]__ - HTTP requests - - __[MySql]__ - (Soon to document) JDBC MySql tables - -
- - [Cassandra]: data-source/cassandra.md - [Kafka]: data-source/kafka.md - [Solace]: data-source/solace.md - [Marquez]: data-source/marquez-metadata-source.md - [OpenMetadata]: data-source/open-metadata-source.md - [HTTP]: data-source/http.md - [MySql]: data-source/cassandra.md +
+ +- __[Files (CSV, JSON, ORC, Parquet)]__ - Generate data for popular file formats +- __[Postgres]__ - JDBC Postgres tables +- __[Cassandra]__ - Cassandra tables +- __[Kafka]__ - Kafka topics +- __[Solace]__ - Solace messages +- __[Marquez]__ - Generate data based on metadata in Marquez +- __[OpenMetadata]__ - Generate data based on metadata in OpenMetadata +- __[HTTP]__ - HTTP requests +- __[Files (Fixed width)]__ - (Soon to document) A variant of CSV but with no separator +- __[MySql]__ - (Soon to document) JDBC MySql tables + +
+ + [Files (CSV, JSON, ORC, Parquet)]: scenario/first-data-generation.md + [Files (Fixed width)]: scenario/first-data-generation.md + [Postgres]: scenario/first-data-generation.md + [Cassandra]: data-source/cassandra.md + [Kafka]: data-source/kafka.md + [Solace]: data-source/solace.md + [Marquez]: data-source/marquez-metadata-source.md + [OpenMetadata]: data-source/open-metadata-source.md + [HTTP]: data-source/http.md + [MySql]: data-source/cassandra.md ## YAML Files diff --git a/site/search/search_index.json b/site/search/search_index.json index 18b85bc9..da104358 100644 --- a/site/search/search_index.json +++ b/site/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"],"fields":{"title":{"boost":1000.0},"text":{"boost":1.0},"tags":{"boost":1000000.0}}},"docs":[{"location":"","title":"Home","text":"Data Caterer is a metadata-driven data generation and testing tool that aids in creating production-like data across both batch and event data systems. Run data validations to ensure your systems have ingested it as expected, then clean up the data afterwards. Simplify your data testing Take away the pain and complexity of your data landscape and let Data Caterer handle it

Try now

Data testing is difficult and fragmented Current solutions only cover half the story

Try now

What you need is a reliable tool that can handle changes to your data landscape

With Data Caterer, you get:

Try now

"},{"location":"#tech-summary","title":"Tech Summary","text":"

Use the Java, Scala API, or YAML files to help with setup or customisation that are all run via a Docker image. Want to get into details? Checkout the setup pages here to get code examples and guides that will take you through scenarios and data sources.

Main features include:

Check other run configurations here.

"},{"location":"#what-is-it","title":"What is it","text":""},{"location":"#what-it-is-not","title":"What it is not","text":"

Try now

Data Catering vs Other tools vs In-house

Data Catering Other tools In-house Data flow Batch and events generation with validation Batch generation only or validation only Depends on architecture and design Time to results 1 day 1+ month to integrate, deploy and onboard 1+ month to build and deploy Solution Connect with your existing data ecosystem, automatic generation and validation Manual UI data entry or via SDK Depends on engineer(s) building it

"},{"location":"about/","title":"About","text":"

Hi, my name is Peter. I am a independent Software Developer, mainly focussing on data related services. My experience can be found on my LinkedIn.

I have created Data Caterer to help serve individuals and companies with data generation and data testing. It is a complex area that has many edge cases or intricacies that are hard to summarise or turn into something actionable and repeatable. Through the use of metadata, Data Caterer can help simplify your data testing, simulating production environment data, aid in data debugging, or whatever your data use case may be.

Given that it is going to save you and your team time and money, please help in considering financial support. This will help the product grow into a sustainable and feature-full service.

"},{"location":"about/#contact","title":"Contact","text":"

Please contact Peter Flook via Slack or via email peter.flook@data.catering if you have any questions or queries.

"},{"location":"about/#terms-of-service","title":"Terms of service","text":"

Terms of service can be found here.

"},{"location":"about/#privacy-policy","title":"Privacy policy","text":"

Privacy policy can be found here.

"},{"location":"pricing/","title":"Pricing","text":"

To have access to the paid features of Data Caterer, you can subscribe according to your situation. You will not be charged by usage. As you continue to subscribe, you will have access to the latest version of Data Caterer as new bug fixes and features get published.

"},{"location":"pricing/#paid-features","title":"Paid Features","text":""},{"location":"pricing/#pricing-table","title":"Pricing Table","text":""},{"location":"pricing/#manage-subscription","title":"Manage Subscription","text":"

Manage via this link

"},{"location":"pricing/#contact","title":"Contact","text":"

Please contact Peter Flook via Slack or via email peter.flook@data.catering if you have any questions or queries.

"},{"location":"get-started/docker/","title":"Run Data Caterer","text":""},{"location":"get-started/docker/#quick-start","title":"Quick start","text":"

Ensure you have docker installed and running.

git clone git@github.com:pflooky/data-caterer-example.git\ncd data-caterer-example && ./run.sh\n#check results under docker/sample/report/index.html folder\n
"},{"location":"get-started/docker/#report","title":"Report","text":"

Check the report generated under docker/data/custom/report/index.html.

Sample report can also be seen here

"},{"location":"get-started/docker/#paid-version-trial","title":"Paid Version Trial","text":"

30 day trial of the paid version can be accessed via these steps:

  1. Join the Slack Data Catering Slack group here
  2. Get an API_KEY by using slash command /token in the Slack group (will only be visible to you)
  3. git clone git@github.com:pflooky/data-caterer-example.git\ncd data-caterer-example && export DATA_CATERING_API_KEY=<insert api key>\n./run.sh\n

If you want to check how long your trial has left, you can check back in the Slack group or type /token again.

"},{"location":"get-started/docker/#guided-tour","title":"Guided tour","text":"

Check out the starter guide here that will take your through step by step. You can also check the other guides here to see the other possibilities of what Data Caterer can achieve for you.

"},{"location":"legal/privacy-policy/","title":"Privacy Policy","text":"

Last updated September 25, 2023

"},{"location":"legal/privacy-policy/#data-caterer-policy-on-privacy-of-customer-personal-information","title":"Data Caterer Policy on Privacy of Customer Personal Information","text":"

Peter John Flook is committed to protecting the privacy and security of your personal information obtained by reason of your use of Data Caterer. This policy explains the types of customer personal information we collect, how it is used, and the steps we take to ensure your personal information is handled appropriately.

"},{"location":"legal/privacy-policy/#who-is-peter-john-flook","title":"Who is Peter John Flook?","text":"

For purposes of this Privacy Policy, \u201cPeter John Flook\u201d means Peter John Flook, the company developing and providing Data Caterer and related websites and services.

"},{"location":"legal/privacy-policy/#what-is-personal-information","title":"What is personal information?","text":"

Personal information is information that refers to an individual specifically and is recorded in any form. Personal information includes such things as age, income, date of birth, ethnic origin and credit records. Information about individuals contained in the following documents is not considered personal information:

"},{"location":"legal/privacy-policy/#we-are-accountable-to-you","title":"We are accountable to you","text":"

Peter John Flook is responsible for all personal information under its control. Our team is accountable for compliance with these privacy and security principles.

"},{"location":"legal/privacy-policy/#we-let-you-know-why-we-collect-and-use-your-personal-information-and-get-your-consent","title":"We let you know why we collect and use your personal information and get your consent","text":"

Peter John Flook identifies the purpose for which your personal information is collected and will be used or disclosed. If that purpose is not listed below we will do this before or at the time the information is actually being collected. You will be deemed to consent to our use of your personal information for the purpose of:

Otherwise, Peter John Flook will obtain your express consent (by verbal, written or electronic agreement) to collect, use or disclose your personal information. You can change your consent preferences at any time by contacting Peter John Flook (please refer to the \u201cHow to contact us\u201d section below).

"},{"location":"legal/privacy-policy/#we-limit-collection-of-your-personal-information","title":"We limit collection of your personal information","text":"

Peter John Flook collects only the information required to provide products and services to you. Peter John Flook will collect personal information only by clear, fair and lawful means.

We receive and store any information you enter on our website or give us in any other way. You can choose not to provide certain information, but then you might not be able to take advantage of many of our features.

Peter John Flook does not receive or store personal content saved to your local device while using Data Caterer.

We also receive and store certain types of information whenever you interact with us.

"},{"location":"legal/privacy-policy/#information-provided-to-stripe","title":"Information provided to Stripe","text":"

All purchases that are made through this site are processed securely and externally by Stripe. Unless you expressly consent otherwise, we do not see or have access to any personal information that you may provide to Stripe, other than information that is required in order to process your order and deliver your purchased items to you (eg, your name, email address and billing/postal address).

"},{"location":"legal/privacy-policy/#we-limit-disclosure-and-retention-of-your-personal-information","title":"We limit disclosure and retention of your personal information","text":"

Peter John Flook does not disclose personal information to any organization or person for any reason except the following:

We employ other companies and individuals to perform functions on our behalf. Examples include fulfilling orders, delivering packages, sending postal mail and e-mail, removing repetitive information from customer lists, analyzing data, providing marketing assistance, processing credit card payments, and providing customer service. They have access to personal information needed to perform their functions, but may not use it for other purposes. We may use service providers located outside of Australia, and, if applicable, your personal information may be processed and stored in other countries and therefore may be subject to disclosure under the laws of those countries. As we continue to develop our business, we might sell or buy stores, subsidiaries, or business units. In such transactions, customer information generally is one of the transferred business assets but remains subject to the promises made in any pre-existing Privacy Notice (unless, of course, the customer consents otherwise). Also, in the unlikely event that Peter John Flook or substantially all of its assets are acquired, customer information of course will be one of the transferred assets. You are deemed to consent to disclosure of your personal information for those purposes. If your personal information is shared with third parties, those third parties are bound by appropriate agreements with Peter John Flook to secure and protect the confidentiality of your personal information.

Peter John Flook retains your personal information only as long as it is required for our business relationship or as required by federal and provincial laws.

"},{"location":"legal/privacy-policy/#we-keep-your-personal-information-up-to-date-and-accurate","title":"We keep your personal information up to date and accurate","text":"

Peter John Flook keeps your personal information up to date, accurate and relevant for its intended use.

You may request access to the personal information we have on record in order to review and amend the information, as appropriate. In circumstances where your personal information has been provided by a third party, we will refer you to that party (e.g. credit bureaus). To access your personal information, refer to the \u201cHow to contact us\u201d section below.

"},{"location":"legal/privacy-policy/#the-security-of-your-personal-information-is-a-priority-for-peter-john-flook","title":"The security of your personal information is a priority for Peter John Flook","text":"

We take steps to safeguard your personal information, regardless of the format in which it is held, including:

physical security measures such as restricted access facilities and locked filing cabinets electronic security measures for computerized personal information such as password protection, database encryption and personal identification numbers. We work to protect the security of your information during transmission by using \u201cTransport Layer Security\u201d (TLS) protocol. organizational processes such as limiting access to your personal information to a selected group of individuals contractual obligations with third parties who need access to your personal information requiring them to protect and secure your personal information It\u2019s important for you to protect against unauthorized access to your password and your computer. Be sure to sign off when you\u2019ve finished using any shared computer.

"},{"location":"legal/privacy-policy/#what-about-third-party-advertisers-and-links-to-other-websites","title":"What About Third-Party Advertisers and Links to Other Websites?","text":"

Our site may include third-party advertising and links to other websites. We do not provide any personally identifiable customer information to these advertisers or third-party websites.

These third-party websites and advertisers, or Internet advertising companies working on their behalf, sometimes use technology to send (or \u201cserve\u201d) the advertisements that appear on our website directly to your browser. They automatically receive your IP address when this happens. They may also use cookies, JavaScript, web beacons (also known as action tags or single-pixel gifs), and other technologies to measure the effectiveness of their ads and to personalize advertising content. We do not have access to or control over cookies or other features that they may use, and the information practices of these advertisers and third-party websites are not covered by this Privacy Notice. Please contact them directly for more information about their privacy practices. In addition, the Network Advertising Initiative offers useful information about Internet advertising companies (also called \u201cad networks\u201d or \u201cnetwork advertisers\u201d), including information about how to opt-out of their information collection. You can access the Network Advertising Initiative at http://www.networkadvertising.org.

"},{"location":"legal/privacy-policy/#redirection-to-stripe","title":"Redirection to Stripe","text":"

In particular, when you submit an order to us, you may be automatically redirected to Stripe in order to complete the required payment. The payment page that is provided by Stripe is not part of this site. As noted above, we are not privy to any of the bank account, credit card or other personal information that you may provide to Stripe, other than information that is required in order to process your order and deliver your purchased items to you (eg, your name, email address and billing/postal address). We recommend that you refer to Stripe\u2019s privacy statement if you would like more information about how Stripe collects and handles your personal information.

"},{"location":"legal/privacy-policy/#we-are-open-about-our-privacy-and-security-policy","title":"We are open about our privacy and security policy","text":"

We are committed to providing you with understandable and easily available information about our policy and practices related to management of your personal information. This policy and any related information is available at all times on our website, https://data.catering/about/ under Privacy or on request. To contact us, refer to the \u201cHow to contact us\u201d section below.

"},{"location":"legal/privacy-policy/#we-provide-access-to-your-personal-information-stored-by-peter-john-flook","title":"We provide access to your personal information stored by Peter John Flook","text":"

You can request access to your personal information stored by Peter John Flook. To contact us, refer to the \u201cHow to contact us\u201d section below. Upon receiving such a request, Peter John Flook will:

inform you about what type of personal information we have on record or in our control, how it is used and to whom it may have been disclosed provide you with access to your information so you can review and verify the accuracy and completeness and request changes to the information make any necessary updates to your personal information We respond to your questions, concerns and complaints about privacy Peter John Flook responds in a timely manner to your questions, concerns and complaints about the privacy of your personal information and our privacy policies and procedures.

"},{"location":"legal/privacy-policy/#how-to-contact-us","title":"How to contact us","text":"

Our business changes constantly, and this privacy notice will change also. We may e-mail periodic reminders of our notices and conditions, unless you have instructed us not to, but you should check our website frequently to see recent changes. We are, however, committed to protecting your information and will never materially change our policies and practices to make them less protective of customer information collected in the past without the consent of affected customers.

"},{"location":"legal/terms-of-service/","title":"Terms and Conditions","text":"

Last updated: September 25, 2023

Please read these terms and conditions carefully before using Our Service.

"},{"location":"legal/terms-of-service/#interpretation-and-definitions","title":"Interpretation and Definitions","text":""},{"location":"legal/terms-of-service/#interpretation","title":"Interpretation","text":"

The words of which the initial letter is capitalized have meanings defined under the following conditions. The following definitions shall have the same meaning regardless of whether they appear in singular or in plural.

"},{"location":"legal/terms-of-service/#definitions","title":"Definitions","text":"

For the purposes of these Terms and Conditions:

"},{"location":"legal/terms-of-service/#acknowledgment","title":"Acknowledgment","text":"

These are the Terms and Conditions governing the use of this Service and the agreement that operates between You and the Company. These Terms and Conditions set out the rights and obligations of all users regarding the use of the Service.

Your access to and use of the Service is conditioned on Your acceptance of and compliance with these Terms and Conditions. These Terms and Conditions apply to all visitors, users and others who access or use the Service.

By accessing or using the Service You agree to be bound by these Terms and Conditions. If You disagree with any part of these Terms and Conditions then You may not access the Service.

You represent that you are over the age of 18. The Company does not permit those under 18 to use the Service.

Your access to and use of the Service is also conditioned on Your acceptance of and compliance with the Privacy Policy of the Company. Our Privacy Policy describes Our policies and procedures on the collection, use and disclosure of Your personal information when You use the Application or the Website and tells You about Your privacy rights and how the law protects You. Please read Our Privacy Policy carefully before using Our Service.

"},{"location":"legal/terms-of-service/#links-to-other-websites","title":"Links to Other Websites","text":"

Our Service may contain links to third-party websites or services that are not owned or controlled by the Company.

The Company has no control over, and assumes no responsibility for, the content, privacy policies, or practices of any third party websites or services. You further acknowledge and agree that the Company shall not be responsible or liable, directly or indirectly, for any damage or loss caused or alleged to be caused by or in connection with the use of or reliance on any such content, goods or services available on or through any such websites or services.

We strongly advise You to read the terms and conditions and privacy policies of any third-party websites or services that You visit.

"},{"location":"legal/terms-of-service/#termination","title":"Termination","text":"

We may terminate or suspend Your access immediately, without prior notice or liability, for any reason whatsoever, including without limitation if You breach these Terms and Conditions.

Upon termination, Your right to use the Service will cease immediately.

"},{"location":"legal/terms-of-service/#limitation-of-liability","title":"Limitation of Liability","text":"

Notwithstanding any damages that You might incur, the entire liability of the Company and any of its suppliers under any provision of these Terms and Your exclusive remedy for all the foregoing shall be limited to the amount actually paid by You through the Service or 100 USD if You haven't purchased anything through the Service.

To the maximum extent permitted by applicable law, in no event shall the Company or its suppliers be liable for any special, incidental, indirect, or consequential damages whatsoever (including, but not limited to, damages for loss of profits, loss of data or other information, for business interruption, for personal injury, loss of privacy arising out of or in any way related to the use of or inability to use the Service, third-party software and/or third-party hardware used with the Service, or otherwise in connection with any provision of these Terms), even if the Company or any supplier has been advised of the possibility of such damages and even if the remedy fails of its essential purpose.

Some states do not allow the exclusion of implied warranties or limitation of liability for incidental or consequential damages, which means that some of the above limitations may not apply. In these states, each party's liability will be limited to the greatest extent permitted by law.

"},{"location":"legal/terms-of-service/#as-is-and-as-available-disclaimer","title":"\"AS IS\" and \"AS AVAILABLE\" Disclaimer","text":"

The Service is provided to You \"AS IS\" and \"AS AVAILABLE\" and with all faults and defects without warranty of any kind. To the maximum extent permitted under applicable law, the Company, on its own behalf and on behalf of its Affiliates and its and their respective licensors and service providers, expressly disclaims all warranties, whether express, implied, statutory or otherwise, with respect to the Service, including all implied warranties of merchantability, fitness for a particular purpose, title and non-infringement, and warranties that may arise out of course of dealing, course of performance, usage or trade practice. Without limitation to the foregoing, the Company provides no warranty or undertaking, and makes no representation of any kind that the Service will meet Your requirements, achieve any intended results, be compatible or work with any other software, applications, systems or services, operate without interruption, meet any performance or reliability standards or be error free or that any errors or defects can or will be corrected.

Without limiting the foregoing, neither the Company nor any of the company's provider makes any representation or warranty of any kind, express or implied: (i) as to the operation or availability of the Service, or the information, content, and materials or products included thereon; (ii) that the Service will be uninterrupted or error-free; (iii) as to the accuracy, reliability, or currency of any information or content provided through the Service; or (iv) that the Service, its servers, the content, or e-mails sent from or on behalf of the Company are free of viruses, scripts, trojan horses, worms, malware, time-bombs or other harmful components.

Some jurisdictions do not allow the exclusion of certain types of warranties or limitations on applicable statutory rights of a consumer, so some or all of the above exclusions and limitations may not apply to You. But in such a case the exclusions and limitations set forth in this section shall be applied to the greatest extent enforceable under applicable law.

"},{"location":"legal/terms-of-service/#governing-law","title":"Governing Law","text":"

The laws of the Country, excluding its conflicts of law rules, shall govern this Terms and Your use of the Service. Your use of the Application may also be subject to other local, state, national, or international laws.

"},{"location":"legal/terms-of-service/#disputes-resolution","title":"Disputes Resolution","text":"

If You have any concern or dispute about the Service, You agree to first try to resolve the dispute informally by contacting the Company.

"},{"location":"legal/terms-of-service/#for-european-union-eu-users","title":"For European Union (EU) Users","text":"

If You are a European Union consumer, you will benefit from any mandatory provisions of the law of the country in which you are resident in.

"},{"location":"legal/terms-of-service/#united-states-legal-compliance","title":"United States Legal Compliance","text":"

You represent and warrant that (i) You are not located in a country that is subject to the United States government embargo, or that has been designated by the United States government as a \"terrorist supporting\" country, and (ii) You are not listed on any United States government list of prohibited or restricted parties.

"},{"location":"legal/terms-of-service/#severability-and-waiver","title":"Severability and Waiver","text":""},{"location":"legal/terms-of-service/#severability","title":"Severability","text":"

If any provision of these Terms is held to be unenforceable or invalid, such provision will be changed and interpreted to accomplish the objectives of such provision to the greatest extent possible under applicable law and the remaining provisions will continue in full force and effect.

"},{"location":"legal/terms-of-service/#waiver","title":"Waiver","text":"

Except as provided herein, the failure to exercise a right or to require performance of an obligation under these Terms shall not affect a party's ability to exercise such right or require such performance at any time thereafter nor shall the waiver of a breach constitute a waiver of any subsequent breach.

"},{"location":"legal/terms-of-service/#translation-interpretation","title":"Translation Interpretation","text":"

These Terms and Conditions may have been translated if We have made them available to You on our Service. You agree that the original English text shall prevail in the case of a dispute.

"},{"location":"legal/terms-of-service/#changes-to-these-terms-and-conditions","title":"Changes to These Terms and Conditions","text":"

We reserve the right, at Our sole discretion, to modify or replace these Terms at any time. If a revision is material We will make reasonable efforts to provide at least 30 days' notice prior to any new terms taking effect. What constitutes a material change will be determined at Our sole discretion.

By continuing to access or use Our Service after those revisions become effective, You agree to be bound by the revised terms. If You do not agree to the new terms, in whole or in part, please stop using the website and the Service.

"},{"location":"legal/terms-of-service/#contact-us","title":"Contact Us","text":"

If you have any questions about these Terms and Conditions, You can contact us:

"},{"location":"setup/","title":"Setup","text":"

All the configurations and customisation related to Data Caterer can be found under here.

"},{"location":"setup/#guide","title":"Guide","text":"

If you want a guided tour of using the Java or Scala API, you can follow one of the guides found here.

"},{"location":"setup/#specific-configuration","title":"Specific Configuration","text":""},{"location":"setup/#high-level-run-configurations","title":"High Level Run Configurations","text":""},{"location":"setup/configuration/","title":"Configuration","text":"

A number of configurations can be made and customised within Data Caterer to help control what gets run and/or where any metadata gets saved.

These configurations are defined from within your Java or Scala class via configuration or for YAML file setup, application.conf file as seen here.

"},{"location":"setup/configuration/#flags","title":"Flags","text":"

Flags are used to control which processes are executed when you run Data Caterer.

Config Default Paid Description enableGenerateData true N Enable/disable data generation enableCount true N Count the number of records generated. Can be disabled to improve performance enableFailOnError true N Whilst saving generated data, if there is an error, it will stop any further data from being generated enableSaveReports true N Enable/disable HTML reports summarising data generated, metadata of data generated (if enableSinkMetadata is enabled) and validation results (if enableValidation is enabled). Sample here enableSinkMetadata true N Run data profiling for the generated data. Shown in HTML reports if enableSaveSinkMetadata is enabled enableValidation false N Run validations as described in plan. Results can be viewed from logs or from HTML report if enableSaveSinkMetadata is enabled. Sample here enableGeneratePlanAndTasks false Y Enable/disable plan and task auto generation based off data source connections enableRecordTracking false Y Enable/disable which data records have been generated for any data source enableDeleteGeneratedRecords false Y Delete all generated records based off record tracking (if enableRecordTracking has been set to true) enableGenerateValidations false Y If enabled, it will generate validations based on the data sources defined. JavaScalaapplication.conf
configuration()\n.enableGenerateData(true)\n.enableCount(true)\n.enableFailOnError(true)\n.enableSaveReports(true)\n.enableSinkMetadata(true)\n.enableValidation(false)\n.enableGeneratePlanAndTasks(false)\n.enableRecordTracking(false)\n.enableDeleteGeneratedRecords(false)\n.enableGenerateValidations(false);\n
configuration\n.enableGenerateData(true)\n.enableCount(true)\n.enableFailOnError(true)\n.enableSaveReports(true)\n.enableSinkMetadata(true)\n.enableValidation(false)\n.enableGeneratePlanAndTasks(false)\n.enableRecordTracking(false)\n.enableDeleteGeneratedRecords(false)\n.enableGenerateValidations(false)\n
flags {\n  enableCount = false\n  enableCount = ${?ENABLE_COUNT}\n  enableGenerateData = true\n  enableGenerateData = ${?ENABLE_GENERATE_DATA}\n  enableFailOnError = true\n  enableFailOnError = ${?ENABLE_FAIL_ON_ERROR}\n  enableGeneratePlanAndTasks = false\n  enableGeneratePlanAndTasks = ${?ENABLE_GENERATE_PLAN_AND_TASKS}\n  enableRecordTracking = false\n  enableRecordTracking = ${?ENABLE_RECORD_TRACKING}\n  enableDeleteGeneratedRecords = false\n  enableDeleteGeneratedRecords = ${?ENABLE_DELETE_GENERATED_RECORDS}\n  enableGenerateValidations = false\n  enableGenerateValidations = ${?ENABLE_GENERATE_VALIDATIONS}\n}\n
"},{"location":"setup/configuration/#folders","title":"Folders","text":"

Depending on which flags are enabled, there are folders that get used to save metadata, store HTML reports or track the records generated.

These folder pathways can be defined as a cloud storage pathway (i.e. s3a://my-bucket/task).

Config Default Paid Description planFilePath /opt/app/plan/customer-create-plan.yaml N Plan file path to use when generating and/or validating data taskFolderPath /opt/app/task N Task folder path that contains all the task files (can have nested directories) validationFolderPath /opt/app/validation N Validation folder path that contains all the validation files (can have nested directories) generatedReportsFolderPath /opt/app/report N Where HTML reports get generated that contain information about data generated along with any validations performed generatedPlanAndTaskFolderPath /tmp Y Folder path where generated plan and task files will be saved recordTrackingFolderPath /opt/app/record-tracking Y Where record tracking parquet files get saved JavaScalaapplication.conf
configuration()\n.planFilePath(\"/opt/app/custom/plan/postgres-plan.yaml\")\n.taskFolderPath(\"/opt/app/custom/task\")\n.validationFolderPath(\"/opt/app/custom/validation\")\n.generatedReportsFolderPath(\"/opt/app/custom/report\")\n.generatedPlanAndTaskFolderPath(\"/opt/app/custom/generated\")\n.recordTrackingFolderPath(\"/opt/app/custom/record-tracking\");\n
configuration\n.planFilePath(\"/opt/app/custom/plan/postgres-plan.yaml\")\n.taskFolderPath(\"/opt/app/custom/task\")\n.validationFolderPath(\"/opt/app/custom/validation\")\n.generatedReportsFolderPath(\"/opt/app/custom/report\")\n.generatedPlanAndTaskFolderPath(\"/opt/app/custom/generated\")\n.recordTrackingFolderPath(\"/opt/app/custom/record-tracking\")\n
folders {\n  planFilePath = \"/opt/app/custom/plan/postgres-plan.yaml\"\n  planFilePath = ${?PLAN_FILE_PATH}\n  taskFolderPath = \"/opt/app/custom/task\"\n  taskFolderPath = ${?TASK_FOLDER_PATH}\n  validationFolderPath = \"/opt/app/custom/validation\"\n  validationFolderPath = ${?VALIDATION_FOLDER_PATH}\n  generatedReportsFolderPath = \"/opt/app/custom/report\"\n  generatedReportsFolderPath = ${?GENERATED_REPORTS_FOLDER_PATH}\n  generatedPlanAndTaskFolderPath = \"/opt/app/custom/generated\"\n  generatedPlanAndTaskFolderPath = ${?GENERATED_PLAN_AND_TASK_FOLDER_PATH}\n  recordTrackingFolderPath = \"/opt/app/custom/record-tracking\"\n  recordTrackingFolderPath = ${?RECORD_TRACKING_FOLDER_PATH}\n}\n
"},{"location":"setup/configuration/#metadata","title":"Metadata","text":"

When metadata gets generated, there are some configurations that can be altered to help with performance or accuracy related issues. Metadata gets generated from two processes: 1) if enableGeneratePlanAndTasks or 2) if enableSinkMetadata are enabled.

During the generation of plan and tasks, data profiling is used to create the metadata for each of the fields defined in the data source. You may face issues if the number of records in the data source is large as data profiling is an expensive task. Similarly, it can be expensive when analysing the generated data if the number of records generated is large.

Config Default Paid Description numRecordsFromDataSource 10000 Y Number of records read in from the data source that could be used for data profiling numRecordsForAnalysis 10000 Y Number of records used for data profiling from the records gathered in numRecordsFromDataSource oneOfMinCount 1000 Y Minimum number of records required before considering if a field can be of type oneOf oneOfDistinctCountVsCountThreshold 0.2 Y Threshold ratio to determine if a field is of type oneOf (i.e. a field called status that only contains open or closed. Distinct count = 2, total count = 10, ratio = 2 / 10 = 0.2 therefore marked as oneOf) numGeneratedSamples 10 N Number of sample records from generated data to take. Shown in HTML report JavaScalaapplication.conf
configuration()\n.numRecordsFromDataSourceForDataProfiling(10000)\n.numRecordsForAnalysisForDataProfiling(10000)\n.oneOfMinCount(1000)\n.oneOfDistinctCountVsCountThreshold(1000)\n.numGeneratedSamples(10);\n
configuration\n.numRecordsFromDataSourceForDataProfiling(10000)\n.numRecordsForAnalysisForDataProfiling(10000)\n.oneOfMinCount(1000)\n.oneOfDistinctCountVsCountThreshold(1000)\n.numGeneratedSamples(10)\n
metadata {\n  numRecordsFromDataSource = 10000\n  numRecordsForAnalysis = 10000\n  oneOfMinCount = 1000\n  oneOfDistinctCountVsCountThreshold = 0.2\n  numGeneratedSamples = 10\n}\n
"},{"location":"setup/configuration/#generation","title":"Generation","text":"

When generating data, you may have some limitations such as limited CPU or memory, large number of data sources, or data sources prone to failure under load. To help alleviate these issues or speed up performance, you can control the number of records that get generated in each batch.

Config Default Paid Description numRecordsPerBatch 100000 N Number of records across all data sources to generate per batch numRecordsPerStep N Overrides the count defined in each step with this value if defined (i.e. if set to 1000, for each step, 1000 records will be generated) ScalaScalaapplication.conf
configuration()\n.numRecordsPerBatch(100000)\n.numRecordsPerStep(1000);\n
configuration\n.numRecordsPerBatch(100000)\n.numRecordsPerStep(1000)\n
generation {\n  numRecordsPerBatch = 100000\n  numRecordsPerStep = 1000\n}\n
"},{"location":"setup/configuration/#runtime","title":"Runtime","text":"

Given Data Caterer uses Spark as the base framework for data processing, you can configure the job as to your specifications via configuration as seen here.

JavaScalaapplication.conf
configuration()\n.master(\"local[*]\")\n.runtimeConfig(Map.of(\"spark.driver.cores\", \"5\"))\n.addRuntimeConfig(\"spark.driver.memory\", \"10g\");\n
configuration\n.master(\"local[*]\")\n.runtimeConfig(Map(\"spark.driver.cores\" -> \"5\"))\n.addRuntimeConfig(\"spark.driver.memory\" -> \"10g\")\n
runtime {\n  master = \"local[*]\"\n  master = ${?DATA_CATERER_MASTER}\n  config {\n    \"spark.driver.cores\" = \"5\"\n    \"spark.driver.memory\" = \"10g\"\n  }\n}\n
"},{"location":"setup/advanced/advanced/","title":"Advanced use cases","text":""},{"location":"setup/advanced/advanced/#special-data-formats","title":"Special data formats","text":"

There are many options available for you to use when you have a scenario when data has to be a certain format.

  1. Create expression datafaker
    1. Can be used to create names, addresses, or anything that can be found under here
  2. Create regex
"},{"location":"setup/advanced/advanced/#foreign-keys-across-data-sets","title":"Foreign keys across data sets","text":"

Details for how you can configure foreign keys can be found here.

"},{"location":"setup/advanced/advanced/#edge-cases","title":"Edge cases","text":"

For each given data type, there are edge cases which can cause issues when your application processes the data. This can be controlled at a column level by including the following flag in the generator options:

JavaScalaYAML
field()\n.name(\"amount\")\n.type(DoubleType.instance())\n.enableEdgeCases(true)\n.edgeCaseProbability(0.1)\n
field\n.name(\"amount\")\n.`type`(DoubleType)\n.enableEdgeCases(true)\n.edgeCaseProbability(0.1)\n
fields:\n- name: \"amount\"\ntype: \"double\"\ngenerator:\ntype: \"random\"\noptions:\nenableEdgeCases: \"true\"\nedgeCaseProb: 0.1\n

If you want to know all the possible edge cases for each data type, can check the documentation here.

"},{"location":"setup/advanced/advanced/#scenario-testing","title":"Scenario testing","text":"

You can create specific scenarios by adjusting the metadata found in the plan and tasks to your liking. For example, if you had two data sources, a Postgres database and a parquet file, and you wanted to save account data into Postgres and transactions related to those accounts into a parquet file. You can alter the status column in the account data to only generate open accounts and define a foreign key between Postgres and parquet to ensure the same account_id is being used. Then in the parquet task, define 1 to 10 transactions per account_id to be generated.

Postgres account generation example task Parquet transaction generation example task Plan

"},{"location":"setup/advanced/advanced/#cloud-storage","title":"Cloud storage","text":""},{"location":"setup/advanced/advanced/#data-source","title":"Data source","text":"

If you want to save the file types CSV, JSON, Parquet or ORC into cloud storage, you can do so via adding extra configurations. Below is an example for S3.

JavaScalaYAML
var csvTask = csv(\"my_csv\", \"s3a://my-bucket/csv/accounts\")\n.schema(\nfield().name(\"account_id\"),\n...\n);\n\nvar s3Configuration = configuration()\n.runtimeConfig(Map.of(\n\"spark.hadoop.fs.s3a.directory.marker.retention\", \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\", \"true\",\n\"spark.hadoop.fs.defaultFS\", \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\", \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\", \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\", \"secret_key\"\n));\n\nexecute(s3Configuration, csvTask);\n
val csvTask = csv(\"my_csv\", \"s3a://my-bucket/csv/accounts\")\n.schema(\nfield.name(\"account_id\"),\n...\n)\n\nval s3Configuration = configuration\n.runtimeConfig(Map(\n\"spark.hadoop.fs.s3a.directory.marker.retention\" -> \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" -> \"true\",\n\"spark.hadoop.fs.defaultFS\" -> \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" -> \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\" -> \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\" -> \"secret_key\"\n))\n\nexecute(s3Configuration, csvTask)\n
folders {\ngeneratedPlanAndTaskFolderPath = \"s3a://my-bucket/data-caterer/generated\"\nplanFilePath = \"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\"\ntaskFolderPath = \"s3a://my-bucket/data-caterer/generated/task\"\n}\n\nruntime {\nconfig {\n...\n#S3\n\"spark.hadoop.fs.s3a.directory.marker.retention\" = \"keep\"\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" = \"true\"\n\"spark.hadoop.fs.defaultFS\" = \"s3a://my-bucket\"\n#can change to other credential providers as shown here\n#https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" = \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\"\n\"spark.hadoop.fs.s3a.access.key\" = \"access_key\"\n\"spark.hadoop.fs.s3a.secret.key\" = \"secret_key\"\n}\n}\n
"},{"location":"setup/advanced/advanced/#storing-plantasks","title":"Storing plan/task(s)","text":"

You can generate and store the plan/task files inside either AWS S3, Azure Blob Storage or Google GCS. This can be controlled via configuration set in the application.conf file where you can set something like the below:

JavaScalaYAML
configuration()\n.generatedReportsFolderPath(\"s3a://my-bucket/data-caterer/generated\")\n.planFilePath(\"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\")\n.taskFolderPath(\"s3a://my-bucket/data-caterer/generated/task\")\n.runtimeConfig(Map.of(\n\"spark.hadoop.fs.s3a.directory.marker.retention\", \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\", \"true\",\n\"spark.hadoop.fs.defaultFS\", \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\", \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\", \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\", \"secret_key\"\n));\n
configuration\n.generatedReportsFolderPath(\"s3a://my-bucket/data-caterer/generated\")\n.planFilePath(\"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\")\n.taskFolderPath(\"s3a://my-bucket/data-caterer/generated/task\")\n.runtimeConfig(Map(\n\"spark.hadoop.fs.s3a.directory.marker.retention\" -> \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" -> \"true\",\n\"spark.hadoop.fs.defaultFS\" -> \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" -> \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\" -> \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\" -> \"secret_key\"\n))\n
folders {\ngeneratedPlanAndTaskFolderPath = \"s3a://my-bucket/data-caterer/generated\"\nplanFilePath = \"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\"\ntaskFolderPath = \"s3a://my-bucket/data-caterer/generated/task\"\n}\n\nruntime {\nconfig {\n...\n#S3\n\"spark.hadoop.fs.s3a.directory.marker.retention\" = \"keep\"\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" = \"true\"\n\"spark.hadoop.fs.defaultFS\" = \"s3a://my-bucket\"\n#can change to other credential providers as shown here\n#https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" = \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\"\n\"spark.hadoop.fs.s3a.access.key\" = \"access_key\"\n\"spark.hadoop.fs.s3a.secret.key\" = \"secret_key\"\n}\n}\n
"},{"location":"setup/connection/connection/","title":"Data Source Connections","text":"

Details of all the connection configuration supported can be found in the below subsections for each type of connection.

These configurations can be done via API or from configuration. Examples of both are shown for each data source below.

"},{"location":"setup/connection/connection/#supported-data-connections","title":"Supported Data Connections","text":"Data Source Type Data Source Paid Database Postgres, MySQL, Cassandra N (Postgres), Y (rest) File CSV, JSON, ORC, Parquet N Messaging Kafka, Solace Y HTTP REST API Y Metadata Marquez, OpenMetadata, OpenAPI/Swagger Y"},{"location":"setup/connection/connection/#api","title":"API","text":"

All connection details require a name. Depending on the data source, you can define additional options which may be used by the driver or connector for connecting to the data source.

"},{"location":"setup/connection/connection/#configuration-file","title":"Configuration file","text":"

All connection details follow the same pattern.

<connection format> {\n    <connection name> {\n        <key> = <value>\n    }\n}\n

Overriding configuration

When defining a configuration value that can be defined by a system property or environment variable at runtime, you can define that via the following:

url = \"localhost\"\nurl = ${?POSTGRES_URL}\n

The above defines that if there is a system property or environment variable named POSTGRES_URL, then that value will be used for the url, otherwise, it will default to localhost.

"},{"location":"setup/connection/connection/#data-sources","title":"Data sources","text":"

To find examples of a task for each type of data source, please check out this page.

"},{"location":"setup/connection/connection/#file","title":"File","text":"

Linked here is a list of generic options that can be included as part of your file data source configuration if required. Links to specific file type configurations can be found below.

"},{"location":"setup/connection/connection/#csv","title":"CSV","text":"JavaScalaapplication.conf
csv(\"customer_transactions\", \"/data/customer/transaction\")\n
csv(\"customer_transactions\", \"/data/customer/transaction\")\n
csv {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?CSV_PATH}\n  }\n}\n

Other available configuration for CSV can be found here

"},{"location":"setup/connection/connection/#json","title":"JSON","text":"JavaScalaapplication.conf
json(\"customer_transactions\", \"/data/customer/transaction\")\n
json(\"customer_transactions\", \"/data/customer/transaction\")\n
json {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?JSON_PATH}\n  }\n}\n

Other available configuration for JSON can be found here

"},{"location":"setup/connection/connection/#orc","title":"ORC","text":"JavaScalaapplication.conf
orc(\"customer_transactions\", \"/data/customer/transaction\")\n
orc(\"customer_transactions\", \"/data/customer/transaction\")\n
orc {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?ORC_PATH}\n  }\n}\n

Other available configuration for ORC can be found here

"},{"location":"setup/connection/connection/#parquet","title":"Parquet","text":"JavaScalaapplication.conf
parquet(\"customer_transactions\", \"/data/customer/transaction\")\n
parquet(\"customer_transactions\", \"/data/customer/transaction\")\n
parquet {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?PARQUET_PATH}\n  }\n}\n

Other available configuration for Parquet can be found here

"},{"location":"setup/connection/connection/#delta-not-supported-yet","title":"Delta (not supported yet)","text":"JavaScalaapplication.conf
delta(\"customer_transactions\", \"/data/customer/transaction\")\n
delta(\"customer_transactions\", \"/data/customer/transaction\")\n
delta {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?DELTA_PATH}\n  }\n}\n
"},{"location":"setup/connection/connection/#rmdbs","title":"RMDBS","text":"

Follows the same configuration used by Spark as found here. Sample can be found below

JavaScalaapplication.conf
postgres(\n\"customer_postgres\",                            #name\n\"jdbc:postgresql://localhost:5432/customer\",    #url\n\"postgres\",                                     #username\n\"postgres\"                                      #password\n)\n
postgres(\n\"customer_postgres\",                            #name\n\"jdbc:postgresql://localhost:5432/customer\",    #url\n\"postgres\",                                     #username\n\"postgres\"                                      #password\n)\n
jdbc {\n    customer_postgres {\n        url = \"jdbc:postgresql://localhost:5432/customer\"\n        url = ${?POSTGRES_URL}\n        user = \"postgres\"\n        user = ${?POSTGRES_USERNAME}\n        password = \"postgres\"\n        password = ${?POSTGRES_PASSWORD}\n        driver = \"org.postgresql.Driver\"\n    }\n}\n

Ensure that the user has write permission, so it is able to save the table to the target tables.

SQL Permission Statements
GRANT INSERT ON <schema>.<table> TO <user>;\n
"},{"location":"setup/connection/connection/#postgres","title":"Postgres","text":"

Can see example API or Config definition for Postgres connection above.

"},{"location":"setup/connection/connection/#permissions","title":"Permissions","text":"

Following permissions are required when generating plan and tasks:

SQL Permission Statements
GRANT SELECT ON information_schema.tables TO < user >;\nGRANT SELECT ON information_schema.columns TO < user >;\nGRANT SELECT ON information_schema.key_column_usage TO < user >;\nGRANT SELECT ON information_schema.table_constraints TO < user >;\nGRANT SELECT ON information_schema.constraint_column_usage TO < user >;\n
"},{"location":"setup/connection/connection/#mysql","title":"MySQL","text":"JavaScalaapplication.conf
mysql(\n\"customer_mysql\",                       #name\n\"jdbc:mysql://localhost:3306/customer\", #url\n\"root\",                                 #username\n\"root\"                                  #password\n)\n
mysql(\n\"customer_mysql\",                       #name\n\"jdbc:mysql://localhost:3306/customer\", #url\n\"root\",                                 #username\n\"root\"                                  #password\n)\n
jdbc {\n    customer_mysql {\n        url = \"jdbc:mysql://localhost:3306/customer\"\n        user = \"root\"\n        password = \"root\"\n        driver = \"com.mysql.cj.jdbc.Driver\"\n    }\n}\n
"},{"location":"setup/connection/connection/#permissions_1","title":"Permissions","text":"

Following permissions are required when generating plan and tasks:

SQL Permission Statements
GRANT SELECT ON information_schema.columns TO < user >;\nGRANT SELECT ON information_schema.statistics TO < user >;\nGRANT SELECT ON information_schema.key_column_usage TO < user >;\n
"},{"location":"setup/connection/connection/#cassandra","title":"Cassandra","text":"

Follows same configuration as defined by the Spark Cassandra Connector as found here

JavaScalaapplication.conf
cassandra(\n\"customer_cassandra\",   #name\n\"localhost:9042\",       #url\n\"cassandra\",            #username\n\"cassandra\",            #password\nMap.of()                #optional additional connection options\n)\n
cassandra(\n\"customer_cassandra\",   #name\n\"localhost:9042\",       #url\n\"cassandra\",            #username\n\"cassandra\",            #password\nMap()                #optional additional connection options\n)\n
org.apache.spark.sql.cassandra {\n    customer_cassandra {\n        spark.cassandra.connection.host = \"localhost\"\n        spark.cassandra.connection.host = ${?CASSANDRA_HOST}\n        spark.cassandra.connection.port = \"9042\"\n        spark.cassandra.connection.port = ${?CASSANDRA_PORT}\n        spark.cassandra.auth.username = \"cassandra\"\n        spark.cassandra.auth.username = ${?CASSANDRA_USERNAME}\n        spark.cassandra.auth.password = \"cassandra\"\n        spark.cassandra.auth.password = ${?CASSANDRA_PASSWORD}\n    }\n}\n
"},{"location":"setup/connection/connection/#permissions_2","title":"Permissions","text":"

Ensure that the user has write permission, so it is able to save the table to the target tables.

CQL Permission Statements
GRANT INSERT ON <schema>.<table> TO <user>;\n

Following permissions are required when enabling configuration.enableGeneratePlanAndTasks(true) as it will gather metadata information about tables and columns from the below tables.

CQL Permission Statements
GRANT SELECT ON system_schema.tables TO <user>;\nGRANT SELECT ON system_schema.columns TO <user>;\n
"},{"location":"setup/connection/connection/#kafka","title":"Kafka","text":"

Define your Kafka bootstrap server to connect and send generated data to corresponding topics. Topic gets set at a step level. Further details can be found here

JavaScalaapplication.conf
kafka(\n\"customer_kafka\",   #name\n\"localhost:9092\"    #url\n)\n
kafka(\n\"customer_kafka\",   #name\n\"localhost:9092\"    #url\n)\n
kafka {\n    customer_kafka {\n        kafka.bootstrap.servers = \"localhost:9092\"\n        kafka.bootstrap.servers = ${?KAFKA_BOOTSTRAP_SERVERS}\n    }\n}\n

When defining your schema for pushing data to Kafka, it follows a specific top level schema. An example can be found here . You can define the key, value, headers, partition or topic by following the linked schema.

"},{"location":"setup/connection/connection/#jms","title":"JMS","text":"

Uses JNDI lookup to send messages to JMS queue. Ensure that the messaging system you are using has your queue/topic registered via JNDI otherwise a connection cannot be created.

JavaScalaapplication.conf
solace(\n\"customer_solace\",                                      #name\n\"smf://localhost:55554\",                                #url\n\"admin\",                                                #username\n\"admin\",                                                #password\n\"default\",                                              #vpn name\n\"/jms/cf/default\",                                      #connection factory\n\"com.solacesystems.jndi.SolJNDIInitialContextFactory\"   #initial context factory\n)\n
solace(\n\"customer_solace\",                                      #name\n\"smf://localhost:55554\",                                #url\n\"admin\",                                                #username\n\"admin\",                                                #password\n\"default\",                                              #vpn name\n\"/jms/cf/default\",                                      #connection factory\n\"com.solacesystems.jndi.SolJNDIInitialContextFactory\"   #initial context factory\n)\n
jms {\n    customer_solace {\n        initialContextFactory = \"com.solacesystems.jndi.SolJNDIInitialContextFactory\"\n        connectionFactory = \"/jms/cf/default\"\n        url = \"smf://localhost:55555\"\n        url = ${?SOLACE_URL}\n        user = \"admin\"\n        user = ${?SOLACE_USER}\n        password = \"admin\"\n        password = ${?SOLACE_PASSWORD}\n        vpnName = \"default\"\n        vpnName = ${?SOLACE_VPN}\n    }\n}\n
"},{"location":"setup/connection/connection/#http","title":"HTTP","text":"

Define any username and/or password needed for the HTTP requests. The url is defined in the tasks to allow for generated data to be populated in the url.

JavaScalaapplication.conf
http(\n\"customer_api\", #name\n\"admin\",        #username\n\"admin\"         #password\n)\n
http(\n\"customer_api\", #name\n\"admin\",        #username\n\"admin\"         #password\n)\n
http {\n    customer_api {\n        user = \"admin\"\n        user = ${?HTTP_USER}\n        password = \"admin\"\n        password = ${?HTTP_PASSWORD}\n    }\n}\n
"},{"location":"setup/deployment/deployment/","title":"Deployment","text":"

Two main ways to deploy and run Data Caterer:

"},{"location":"setup/deployment/deployment/#docker","title":"Docker","text":"

To package up your class along with the Data Caterer base image, you can follow the Dockerfile that is created for you here.

Then you can run the following:

./gradlew clean build\ndocker build -t <my_image_name>:<my_image_tag> .\n
"},{"location":"setup/deployment/deployment/#helm","title":"Helm","text":"

Link to sample helm on GitHub here

Update the configuration to your own data connections and configuration or own image created from above.

git clone git@github.com:pflooky/data-caterer-example.git\nhelm install data-caterer ./data-caterer-example/helm/data-caterer\n
"},{"location":"setup/foreign-key/foreign-key/","title":"Foreign Keys","text":"

Foreign keys can be defined to represent the relationships between datasets where values are required to match for particular columns.

"},{"location":"setup/foreign-key/foreign-key/#single-column","title":"Single column","text":"

Define a column in one data source to match against another column. Below example shows a postgres data source with two tables, accounts and transactions that have a foreign key for account_id.

JavaScalaYAML
var postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\n...\n);\nvar postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\n...\n);\n\nplan().addForeignKeyRelationship(\npostgresAcc, \"account_id\",\nList.of(Map.entry(postgresTxn, \"account_id\"))\n);\n
val postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\n...\n)\nval postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\n...\n)\n\nplan.addForeignKeyRelationship(\npostgresAcc, \"account_id\",\nList(postgresTxn -> \"account_id\")\n)\n
---\nname: \"postgres_data\"\nsteps:\n- name: \"accounts\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.accounts\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"name\"\n- name: \"transactions\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.transactions\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"full_name\"\n---\nname: \"customer_create_plan\"\ndescription: \"Create customers in JDBC\"\ntasks:\n- name: \"postgres_data\"\ndataSourceName: \"my_postgres\"\n\nsinkOptions:\nforeignKeys:\n\"postgres.accounts.account_id\":\n- \"postgres.transactions.account_id\"\n
"},{"location":"setup/foreign-key/foreign-key/#multiple-columns","title":"Multiple columns","text":"

You may have a scenario where multiple columns need to be aligned. From the same example, we want account_id and name from accounts to match with account_id and full_name to match in transactions respectively.

JavaScalaYAML
var postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\n...\n);\nvar postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\n...\n);\n\nplan().addForeignKeyRelationship(\npostgresAcc, List.of(\"account_id\", \"name\"),\nList.of(Map.entry(postgresTxn, List.of(\"account_id\", \"full_name\")))\n);\n
val postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\n...\n)\nval postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\n...\n)\n\nplan.addForeignKeyRelationship(\npostgresAcc, List(\"account_id\", \"name\"),\nList(postgresTxn -> List(\"account_id\", \"full_name\"))\n)\n
---\nname: \"postgres_data\"\nsteps:\n- name: \"accounts\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.accounts\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"name\"\n- name: \"transactions\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.transactions\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"full_name\"\n---\nname: \"customer_create_plan\"\ndescription: \"Create customers in JDBC\"\ntasks:\n- name: \"postgres_data\"\ndataSourceName: \"my_postgres\"\n\nsinkOptions:\nforeignKeys:\n\"my_postgres.accounts.account_id,name\":\n- \"my_postgres.transactions.account_id,full_name\"\n
"},{"location":"setup/foreign-key/foreign-key/#nested-column","title":"Nested column","text":"

Your schema structure can have nested fields which can also be referenced as foreign keys. But to do so, you need to create a proxy field that gets omitted from the final saved data.

In the example below, the nested customer_details.name field inside the json task needs to match with name from postgres. A new field in the json called _txn_name is used as a temporary column to facilitate the foreign key definition.

JavaScalaYAML
var postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\n...\n);\nvar jsonTask = json(\"my_json\", \"/tmp/json\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").sql(\"_txn_name\"), #nested field will get value from '_txn_name'\n...\n),\nfield().name(\"_txn_name\").omit(true)       #value will not be included in output\n);\n\nplan().addForeignKeyRelationship(\npostgresAcc, List.of(\"account_id\", \"name\"),\nList.of(Map.entry(jsonTask, List.of(\"account_id\", \"_txn_name\")))\n);\n
val postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\n...\n)\nvar jsonTask = json(\"my_json\", \"/tmp/json\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").sql(\"_txn_name\"), #nested field will get value from '_txn_name'\n...\n), field.name(\"_txn_name\").omit(true)       #value will not be included in output\n)\n\nplan.addForeignKeyRelationship(\npostgresAcc, List(\"account_id\", \"name\"),\nList(jsonTask -> List(\"account_id\", \"_txn_name\"))\n)\n
---\n#postgres task yaml\nname: \"postgres_data\"\nsteps:\n- name: \"accounts\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.accounts\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"name\"\n---\n#json task yaml\nname: \"json_data\"\nsteps:\n- name: \"transactions\"\ntype: \"json\"\noptions:\ndbtable: \"account.transactions\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"_txn_name\"\ngenerator:\noptions:\nomit: true\n- name: \"cusotmer_details\"\nschema:\nfields:\nname: \"name\"\ngenerator:\ntype: \"sql\"\noptions:\nsql: \"_txn_name\"\n\n---\n#plan yaml\nname: \"customer_create_plan\"\ndescription: \"Create customers in JDBC\"\ntasks:\n- name: \"postgres_data\"\ndataSourceName: \"my_postgres\"\n- name: \"json_data\"\ndataSourceName: \"my_json\"\n\nsinkOptions:\nforeignKeys:\n\"my_postgres.accounts.account_id,name\":\n- \"my_json.transactions.account_id,_txn_name\"\n
"},{"location":"setup/generator/count/","title":"Record Count","text":"

There are options related to controlling the number of records generated that can help in generating the scenarios or data required.

"},{"location":"setup/generator/count/#record-count_1","title":"Record Count","text":"

Record count is the simplest as you define the total number of records you require for that particular step. For example, in the below step, it will generate 1000 records for the CSV file

JavaScalaYAML
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(1000);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(1000)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\nrecords: 1000\n
"},{"location":"setup/generator/count/#generated-count","title":"Generated Count","text":"

As like most things in Data Caterer, the count can be generated based on some metadata. For example, if I wanted to generate between 1000 and 2000 records, I could define that by the below configuration:

JavaScalaYAML
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(generator().min(1000).max(2000));\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(generator.min(1000).max(2000))\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\ngenerator:\ntype: \"random\"\noptions:\nmin: 1000\nmax: 2000\n
"},{"location":"setup/generator/count/#per-column-count","title":"Per Column Count","text":"

When defining a per column count, this allows you to generate records \"per set of columns\". This means that for a given set of columns, it will generate a particular amount of records per combination of values for those columns.

One example of this would be when generating transactions relating to a customer, a customer may be defined by columns account_id, name. A number of transactions would be generated per account_id, name.

You can also use a combination of the above two methods to generate the number of records per column.

"},{"location":"setup/generator/count/#records","title":"Records","text":"

When defining a base number of records within the perColumn configuration, it translates to creating (count.records * count.recordsPerColumn) records. This is a fixed number of records that will be generated each time, with no variation between runs.

In the example below, we have count.records = 1000 and count.recordsPerColumn = 2. Which means that 1000 * 2 = 2000 records will be generated in total.

JavaScalaYAML
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount()\n.records(1000)\n.recordsPerColumn(2, \"account_id\", \"name\")\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount\n.records(1000)\n.recordsPerColumn(2, \"account_id\", \"name\")\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\nrecords: 1000\nperColumn:\nrecords: 2\ncolumnNames:\n- \"account_id\"\n- \"name\"\n
"},{"location":"setup/generator/count/#generated","title":"Generated","text":"

You can also define a generator for the count per column. This can be used in scenarios where you want a variable number of records per set of columns.

In the example below, it will generate between (count.records * count.perColumnGenerator.generator.min) = (1000 * 1) = 1000 and (count.records * count.perColumnGenerator.generator.max) = (1000 * 2) = 2000 records.

JavaScalaYAML
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount()\n.records(1000)\n.recordsPerColumnGenerator(generator().min(1).max(2), \"account_id\", \"name\")\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount\n.records(1000)\n.recordsPerColumnGenerator(generator.min(1).max(2), \"account_id\", \"name\")\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\nrecords: 1000\nperColumn:\ncolumnNames:\n- \"account_id\"\n- \"name\"\ngenerator:\ntype: \"random\"\noptions:\nmin: 1\nmax: 2\n
"},{"location":"setup/generator/generator/","title":"Data Generators","text":""},{"location":"setup/generator/generator/#data-types","title":"Data Types","text":"

Below is a list of all supported data types for generating data:

Data Type Spark Data Type Options Description string StringType minLen, maxLen, expression, enableNull integer IntegerType min, max, stddev, mean long LongType min, max, stddev, mean short ShortType min, max, stddev, mean decimal(precision, scale) DecimalType(precision, scale) min, max, stddev, mean double DoubleType min, max, stddev, mean float FloatType min, max, stddev, mean date DateType min, max, enableNull timestamp TimestampType min, max, enableNull boolean BooleanType binary BinaryType minLen, maxLen, enableNull byte ByteType array ArrayType arrayMinLen, arrayMaxLen, arrayType _ StructType Implicitly supported when a schema is defined for a field"},{"location":"setup/generator/generator/#options","title":"Options","text":""},{"location":"setup/generator/generator/#all-data-types","title":"All data types","text":"

Some options are available to use for all types of data generators. Below is the list along with example and descriptions:

Option Default Example Description enableEdgeCase false enableEdgeCase: \"true\" Enable/disable generated data to contain edge cases based on the data type. For example, integer data type has edge cases of (Int.MaxValue, Int.MinValue and 0) edgeCaseProbability 0.0 edgeCaseProb: \"0.1\" Probability of generating a random edge case value if enableEdgeCase is true isUnique false isUnique: \"true\" Enable/disable generated data to be unique for that column. Errors will be thrown when it is unable to generate unique data seed seed: \"1\" Defines the random seed for generating data for that particular column. It will override any seed defined at a global level sql sql: \"CASE WHEN amount < 10 THEN true ELSE false END\" Define any SQL statement for generating that columns value. Computation occurs after all non-SQL fields are generated. This means any columns used in the SQL cannot be based on other SQL generated columns. Data type of generated value from SQL needs to match data type defined for the field"},{"location":"setup/generator/generator/#string","title":"String","text":"Option Default Example Description minLen 1 minLen: \"2\" Ensures that all generated strings have at least length minLen maxLen 10 maxLen: \"15\" Ensures that all generated strings have at most length maxLen expression expression: \"#{Name.name}\"expression:\"#{Address.city}/#{Demographic.maritalStatus}\" Will generate a string based on the faker expression provided. All possible faker expressions can be found here Expression has to be in format #{<faker expression name>} enableNull false enableNull: \"true\" Enable/disable null values being generated nullProbability 0.0 nullProb: \"0.1\" Probability to generate null values if enableNull is true

Edge cases: (\"\", \"\\n\", \"\\r\", \"\\t\", \" \", \"\\u0000\", \"\\ufff\", \"\u0130yi g\u00fcnler\", \"\u0421\u043f\u0430\u0441\u0438\u0431\u043e\", \"\u039a\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1\", \"\u0635\u0628\u0627\u062d \u0627\u0644\u062e\u064a\u0631\", \" F\u00f6rl\u00e5t\", \"\u4f60\u597d\u5417\", \"Nh\u00e0 v\u1ec7 sinh \u1edf \u0111\u00e2u\", \"\u3053\u3093\u306b\u3061\u306f\", \"\u0928\u092e\u0938\u094d\u0924\u0947\", \"\u0532\u0561\u0580\u0565\u0582\", \"\u0417\u0434\u0440\u0430\u0432\u0435\u0439\u0442\u0435\")

"},{"location":"setup/generator/generator/#sample","title":"Sample","text":"JavaScalaYAML
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield()\n.name(\"name\")\n.type(StringType.instance())\n.expression(\"#{Name.name}\")\n.enableNull(true)\n.nullProbability(0.1)\n.minLength(4)\n.maxLength(20)\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield\n.name(\"name\")\n.`type`(StringType)\n.expression(\"#{Name.name}\")\n.enableNull(true)\n.nullProbability(0.1)\n.minLength(4)\n.maxLength(20)\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\nschema:\nfields:\n- name: \"name\"\ntype: \"string\"\ngenerator:\noptions:\nexpression: \"#{Name.name}\"\nenableNull: true\nnullProb: 0.1\nminLength: 4\nmaxLength: 20\n
"},{"location":"setup/generator/generator/#numeric","title":"Numeric","text":"

For all the numeric data types, there are 4 options to choose from: min, max and maxValue. Generally speaking, you only need to define one of min or minValue, similarly with max or maxValue. The reason why there are 2 options for each is because of when metadata is automatically gathered, we gather the statistics of the observed min and max values. Also, it will attempt to gather any restriction on the min or max value as defined by the data source (i.e. max value as per database type).

"},{"location":"setup/generator/generator/#integerlongshort","title":"Integer/Long/Short","text":"Option Default Example Description min 0 min: \"2\" Ensures that all generated values are greater than or equal to min max 1000 max: \"25\" Ensures that all generated values are less than or equal to max stddev 1.0 stddev: \"2.0\" Standard deviation for normal distributed data mean max - min mean: \"5.0\" Mean for normal distributed data

Edge cases Integer: (2147483647, -2147483648, 0) Edge cases Long: (9223372036854775807, -9223372036854775808, 0) Edge cases Short: (32767, -32768, 0)

"},{"location":"setup/generator/generator/#sample_1","title":"Sample","text":"JavaScalaYAML
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"year\").type(IntegerType.instance()).min(2020).max(2023),\nfield().name(\"customer_id\").type(LongType.instance()),\nfield().name(\"customer_group\").type(ShortType.instance())\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"year\").`type`(IntegerType).min(2020).max(2023),\nfield.name(\"customer_id\").`type`(LongType),\nfield.name(\"customer_group\").`type`(ShortType)\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"year\"\ntype: \"integer\"\ngenerator:\noptions:\nmin: 2020\nmax: 2023\n- name: \"customer_id\"\ntype: \"long\"\n- name: \"customer_group\"\ntype: \"short\"\n
"},{"location":"setup/generator/generator/#decimal","title":"Decimal","text":"Option Default Example Description min 0 min: \"2\" Ensures that all generated values are greater than or equal to min max 1000 max: \"25\" Ensures that all generated values are less than or equal to max stddev 1.0 stddev: \"2.0\" Standard deviation for normal distributed data mean max - min mean: \"5.0\" Mean for normal distributed data numericPrecision 10 precision: \"25\" The maximum number of digits numericScale 0 scale: \"25\" The number of digits on the right side of the decimal point (has to be less than or equal to precision)

Edge cases Decimal: (9223372036854775807, -9223372036854775808, 0)

"},{"location":"setup/generator/generator/#sample_2","title":"Sample","text":"JavaScalaYAML
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"balance\").type(DecimalType.instance()).numericPrecision(10).numericScale(5)\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"balance\").`type`(DecimalType).numericPrecision(10).numericScale(5)\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"balance\"\ntype: \"decimal\"\ngenerator:\noptions:\nprecision: 10\nscale: 5\n
"},{"location":"setup/generator/generator/#doublefloat","title":"Double/Float","text":"Option Default Example Description min 0.0 min: \"2.1\" Ensures that all generated values are greater than or equal to min max 1000.0 max: \"25.9\" Ensures that all generated values are less than or equal to max stddev 1.0 stddev: \"2.0\" Standard deviation for normal distributed data mean max - min mean: \"5.0\" Mean for normal distributed data

Edge cases Double: (+infinity, 1.7976931348623157e+308, 4.9e-324, 0.0, -0.0, -1.7976931348623157e+308, -infinity, NaN) Edge cases Float: (+infinity, 3.4028235e+38, 1.4e-45, 0.0, -0.0, -3.4028235e+38, -infinity, NaN)

"},{"location":"setup/generator/generator/#sample_3","title":"Sample","text":"JavaScalaYAML
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"amount\").type(DoubleType.instance())\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"amount\").`type`(DoubleType)\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"amount\"\ntype: \"double\"\n
"},{"location":"setup/generator/generator/#date","title":"Date","text":"Option Default Example Description min now() - 365 days min: \"2023-01-31\" Ensures that all generated values are greater than or equal to min max now() max: \"2023-12-31\" Ensures that all generated values are less than or equal to max enableNull false enableNull: \"true\" Enable/disable null values being generated nullProbability 0.0 nullProb: \"0.1\" Probability to generate null values if enableNull is true

Edge cases: (0001-01-01, 1582-10-15, 1970-01-01, 9999-12-31) (reference)

"},{"location":"setup/generator/generator/#sample_4","title":"Sample","text":"JavaScalaYAML
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"created_date\").type(DateType.instance()).min(java.sql.Date.valueOf(\"2020-01-01\"))\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"created_date\").`type`(DateType).min(java.sql.Date.valueOf(\"2020-01-01\"))\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"created_date\"\ntype: \"date\"\ngenerator:\noptions:\nmin: \"2020-01-01\"\n
"},{"location":"setup/generator/generator/#timestamp","title":"Timestamp","text":"Option Default Example Description min now() - 365 days min: \"2023-01-31 23:10:10\" Ensures that all generated values are greater than or equal to min max now() max: \"2023-12-31 23:10:10\" Ensures that all generated values are less than or equal to max enableNull false enableNull: \"true\" Enable/disable null values being generated nullProbability 0.0 nullProb: \"0.1\" Probability to generate null values if enableNull is true

Edge cases: (0001-01-01 00:00:00, 1582-10-15 23:59:59, 1970-01-01 00:00:00, 9999-12-31 23:59:59)

"},{"location":"setup/generator/generator/#sample_5","title":"Sample","text":"JavaScalaYAML
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"created_time\").type(TimestampType.instance()).min(java.sql.Timestamp.valueOf(\"2020-01-01 00:00:00\"))\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"created_time\").`type`(TimestampType).min(java.sql.Timestamp.valueOf(\"2020-01-01 00:00:00\"))\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"created_time\"\ntype: \"timestamp\"\ngenerator:\noptions:\nmin: \"2020-01-01 00:00:00\"\n
"},{"location":"setup/generator/generator/#binary","title":"Binary","text":"Option Default Example Description minLen 1 minLen: \"2\" Ensures that all generated array of bytes have at least length minLen maxLen 20 maxLen: \"15\" Ensures that all generated array of bytes have at most length maxLen enableNull false enableNull: \"true\" Enable/disable null values being generated nullProbability 0.0 nullProb: \"0.1\" Probability to generate null values if enableNull is true

Edge cases: (\"\", \"\\n\", \"\\r\", \"\\t\", \" \", \"\\u0000\", \"\\ufff\", -128, 127)

"},{"location":"setup/generator/generator/#sample_6","title":"Sample","text":"JavaScalaYAML
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"payload\").type(BinaryType.instance())\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"payload\").`type`(BinaryType)\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"payload\"\ntype: \"binary\"\n
"},{"location":"setup/generator/generator/#array","title":"Array","text":"Option Default Example Description arrayMinLen 0 arrayMinLen: \"2\" Ensures that all generated arrays have at least length arrayMinLen arrayMaxLen 5 arrayMaxLen: \"15\" Ensures that all generated arrays have at most length arrayMaxLen arrayType arrayType: \"double\" Inner data type of the array. Optional when using Java/Scala API. Allows for nested data types to be defined like struct enableNull false enableNull: \"true\" Enable/disable null values being generated nullProbability 0.0 nullProb: \"0.1\" Probability to generate null values if enableNull is true"},{"location":"setup/generator/generator/#sample_7","title":"Sample","text":"JavaScalaYAML
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"last_5_amounts\").type(ArrayType.instance()).arrayType(\"double\")\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"last_5_amounts\").`type`(ArrayType).arrayType(\"double\")\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"last_5_amounts\"\ntype: \"array<double>\"\n
"},{"location":"setup/generator/report/","title":"Report","text":"

Data Caterer can be configured to produce a report of the data generated to help users understand what was run, how much data was generated, where it was generated, validation results and any associated metadata.

"},{"location":"setup/generator/report/#sample","title":"Sample","text":"

Once run, it will produce a report like this.

"},{"location":"setup/guide/","title":"Guides","text":"

Below are a list of guides you can follow to create your data generation for your use case.

For any of the paid tier guides, you can use the trial version fo the app to try it out. Details on how to get the trial can be found here.

"},{"location":"setup/guide/#scenarios","title":"Scenarios","text":"

Free tier scenarios

Paid tier scenarios

"},{"location":"setup/guide/#data-sources","title":"Data Sources","text":"

Free tier data sources

Paid tier data sources

"},{"location":"setup/guide/#yaml-files","title":"YAML Files","text":""},{"location":"setup/guide/#base-concept","title":"Base Concept","text":"

The execution of the data generator is based on the concept of plans and tasks. A plan represent the set of tasks that need to be executed, along with other information that spans across tasks, such as foreign keys between data sources. A task represent the component(s) of a data source and its associated metadata so that it understands what the data should look like and how many steps (sub data sources) there are (i.e. tables in a database, topics in Kafka). Tasks can define one or more steps.

"},{"location":"setup/guide/#plan","title":"Plan","text":""},{"location":"setup/guide/#foreign-keys","title":"Foreign Keys","text":"

Define foreign keys across data sources in your plan to ensure generated data can match Link to associated task 1 Link to associated task 2

"},{"location":"setup/guide/#task","title":"Task","text":"Data Source Type Data Source Sample Task Notes Database Postgres Sample Database MySQL Sample Database Cassandra Sample File CSV Sample File JSON Sample Contains nested schemas and use of SQL for generated values File Parquet Sample Partition by year column Kafka Kafka Sample Specific base schema to be used, define headers, key, value, etc. JMS Solace Sample JSON formatted message HTTP PUT Sample JSON formatted PUT body"},{"location":"setup/guide/#configuration","title":"Configuration","text":"

Basic configuration

"},{"location":"setup/guide/#docker-compose","title":"Docker-compose","text":"

To see how it runs against different data sources, you can run using docker-compose and set DATA_SOURCE like below

./gradlew build\ncd docker\nDATA_SOURCE=postgres docker-compose up -d datacaterer\n

Can set it to one of the following:

"},{"location":"setup/guide/data-source/cassandra/","title":"Cassandra","text":"

Info

Writing data to Cassandra is a paid feature. Try the free trial here.

Creating a data generator for Cassandra. You will build a Docker image that will be able to populate data in Cassandra for the tables you configure.

"},{"location":"setup/guide/data-source/cassandra/#requirements","title":"Requirements","text":""},{"location":"setup/guide/data-source/cassandra/#get-started","title":"Get Started","text":"

First, we will clone the data-caterer-example repo which will already have the base project setup required.

git clone git@github.com:pflooky/data-caterer-example.git\n

If you already have a Cassandra instance running, you can skip to this step.

"},{"location":"setup/guide/data-source/cassandra/#cassandra-setup","title":"Cassandra Setup","text":"

Next, let's make sure you have an instance of Cassandra up and running in your local environment. This will make it easy for us to iterate and check our changes.

cd docker\ndocker-compose up -d cassandra\n
"},{"location":"setup/guide/data-source/cassandra/#permissions","title":"Permissions","text":"

Let's make a new user that has the required permissions needed to push data into the Cassandra tables we want.

CQL Permission Statements
GRANT INSERT ON <schema>.<table> TO data_caterer_user;\n

Following permissions are required when enabling configuration.enableGeneratePlanAndTasks(true) as it will gather metadata information about tables and columns from the below tables.

CQL Permission Statements
GRANT SELECT ON system_schema.tables TO data_caterer_user;\nGRANT SELECT ON system_schema.columns TO data_caterer_user;\n
"},{"location":"setup/guide/data-source/cassandra/#plan-setup","title":"Plan Setup","text":"

Create a new Java or Scala class.

Make sure your class extends PlanRun.

JavaScala
import com.github.pflooky.datacaterer.java.api.PlanRun;\n\npublic class MyAdvancedCassandraJavaPlan extends PlanRun {\n}\n
import com.github.pflooky.datacaterer.api.PlanRun\n\nclass MyAdvancedCassandraPlan extends PlanRun {\n}\n

This class defines where we need to define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.

"},{"location":"setup/guide/data-source/cassandra/#connection-configuration","title":"Connection Configuration","text":"

Within our class, we can start by defining the connection properties to connect to Cassandra.

JavaScala
var accountTask = cassandra(\n\"customer_cassandra\",   //name\n\"localhost:9042\",       //url\n\"cassandra\",            //username\n\"cassandra\",            //password\nMap.of()                //optional additional connection options\n)\n

Additional options such as SSL configuration, etc can be found here.

val accountTask = cassandra(\n\"customer_cassandra\",   //name\n\"localhost:9042\",       //url\n\"cassandra\",            //username\n\"cassandra\",            //password\nMap()                   //optional additional connection options\n)\n

Additional options such as SSL configuration, etc can be found here.

"},{"location":"setup/guide/data-source/cassandra/#schema","title":"Schema","text":"

Let's create a task for inserting data into the account.accounts and account.account_status_history tables as defined underdocker/data/cql/customer.cql. This table should already be setup for you if you followed this step. We can check if the table is setup already via the following command:

docker exec host.docker.internal cqlsh -e 'describe account.accounts; describe account.account_status_history;'\n

Here we should see some output that looks like the below. This tells us what schema we need to follow when generating data. We need to define that alongside any metadata that is useful to add constraints on what are possible values the generated data should contain.

CREATE TABLE account.accounts (\naccount_id text PRIMARY KEY,\n    amount double,\n    created_by text,\n    name text,\n    open_time timestamp,\n    status text\n)...\n\nCREATE TABLE account.account_status_history (\naccount_id text,\n    eod_date date,\n    status text,\n    updated_by text,\n    updated_time timestamp,\n    PRIMARY KEY (account_id, eod_date)\n)...\n

Trimming the connection details to work with the docker-compose Cassandra, we have a base Cassandra connection to define the table and schema required. Let's define each field along with their corresponding data type. You will notice that the text fields do not have a data type defined. This is because the default data type is StringType which corresponds to text in Cassandra.

JavaScala
{\nvar accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"amount\").type(DoubleType.instance()),\nfield().name(\"created_by\"),\nfield().name(\"name\"),\nfield().name(\"open_time\").type(TimestampType.instance()),\nfield().name(\"status\")\n);\n}\n
val accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"amount\").`type`(DoubleType),\nfield.name(\"created_by\"),\nfield.name(\"name\"),\nfield.name(\"open_time\").`type`(TimestampType),\nfield.name(\"status\")\n)\n
"},{"location":"setup/guide/data-source/cassandra/#field-metadata","title":"Field Metadata","text":"

We could stop here and generate random data for the accounts table. But wouldn't it be more useful if we produced data that is closer to the structure of the data that would come in production? We can do this by defining various metadata that add guidelines that the data generator will understand when generating data.

"},{"location":"setup/guide/data-source/cassandra/#account_id","title":"account_id","text":"

account_id follows a particular pattern that where it starts with ACC and has 8 digits after it. This can be defined via a regex like below. Alongside, we also mention that it is the primary key to prompt ensure that unique values are generated.

JavaScala
field().name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\n
field.name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\n
"},{"location":"setup/guide/data-source/cassandra/#amount","title":"amount","text":"

amount the numbers shouldn't be too large, so we can define a min and max for the generated numbers to be between 1 and 1000.

JavaScala
field().name(\"amount\").type(DoubleType.instance()).min(1).max(1000),\n
field.name(\"amount\").`type`(DoubleType).min(1).max(1000),\n
"},{"location":"setup/guide/data-source/cassandra/#name","title":"name","text":"

name is a string that also follows a certain pattern, so we could also define a regex but here we will choose to leverage the DataFaker library and create an expression to generate real looking name. All possible faker expressions can be found here

JavaScala
field().name(\"name\").expression(\"#{Name.name}\"),\n
field.name(\"name\").expression(\"#{Name.name}\"),\n
"},{"location":"setup/guide/data-source/cassandra/#open_time","title":"open_time","text":"

open_time is a timestamp that we want to have a value greater than a specific date. We can define a min date by using java.sql.Date like below.

JavaScala
field().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\n
field.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\n
"},{"location":"setup/guide/data-source/cassandra/#status","title":"status","text":"

status is a field that can only obtain one of four values, open, closed, suspended or pending.

JavaScala
field().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n
field.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n
"},{"location":"setup/guide/data-source/cassandra/#created_by","title":"created_by","text":"

created_by is a field that is based on the status field where it follows the logic: if status is open or closed, then it is created_by eod else created_by event. This can be achieved by defining a SQL expression like below.

JavaScala
field().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n
field.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n

Putting it all the fields together, our class should now look like this.

JavaScala
var accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n
val accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield.name(\"account_id\").primaryKey(true),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n
"},{"location":"setup/guide/data-source/cassandra/#additional-configurations","title":"Additional Configurations","text":"

At the end of data generation, a report gets generated that summarises the actions it performed. We can control the output folder of that report via configurations. We will also enable the unique check to ensure any unique fields will have unique values generated.

JavaScala
var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n
val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n
"},{"location":"setup/guide/data-source/cassandra/#execute","title":"Execute","text":"

To tell Data Caterer that we want to run with the configurations along with the accountTask, we have to call execute . So our full plan run will look like this.

JavaScala
public class MyAdvancedCassandraJavaPlan extends PlanRun {\n{\nvar accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nexecute(config, accountTask);\n}\n}\n
class MyAdvancedCassandraPlan extends PlanRun {\nval accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield.name(\"account_id\").primaryKey(true),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n\nexecute(config, accountTask)\n}\n
"},{"location":"setup/guide/data-source/cassandra/#run","title":"Run","text":"

Now we can run via the script ./run.sh that is in the top level directory of the data-caterer-example to run the class we just created.

./run.sh\n#input class MyAdvancedCassandraJavaPlan or MyAdvancedCassandraPlan\n#after completing\ndocker exec docker-cassandraserver-1 cqlsh -e 'select count(1) from account.accounts;select * from account.accounts limit 10;'\n

Your output should look like this.

 count\n-------\n  1000\n\n(1 rows)\n\nWarnings :\nAggregation query used without partition key\n\n\n account_id  | amount    | created_by         | name                   | open_time                       | status\n-------------+-----------+--------------------+------------------------+---------------------------------+-----------\n ACC13554145 | 917.00418 | zb CVvbBTTzitjo5fK |          Jan Sanford I | 2023-06-21 21:50:10.463000+0000 | suspended\n ACC19154140 |  46.99177 |             VH88H9 |       Clyde Bailey PhD | 2023-07-18 11:33:03.675000+0000 |      open\n ACC50587836 |  774.9872 |         GENANwPm t |           Sang Monahan | 2023-03-21 00:16:53.308000+0000 |    closed\n ACC67619387 | 452.86706 |       5msTpcBLStTH |         Jewell Gerlach | 2022-10-18 19:13:07.606000+0000 | suspended\n ACC69889784 |  14.69298 |           WDmOh7NT |          Dale Schulist | 2022-10-25 12:10:52.239000+0000 | suspended\n ACC41977254 |  51.26492 |          J8jAKzvj2 |           Norma Nienow | 2023-08-19 18:54:39.195000+0000 | suspended\n ACC40932912 | 349.68067 |   SLcJgKZdLp5ALMyg | Vincenzo Considine III | 2023-05-16 00:22:45.991000+0000 |    closed\n ACC20642011 | 658.40713 |          clyZRD4fI |  Lannie McLaughlin DDS | 2023-05-11 23:14:30.249000+0000 |      open\n ACC74962085 | 970.98218 |       ZLETTSnj4NpD |          Ima Jerde DVM | 2023-05-07 10:01:56.218000+0000 |   pending\n ACC72848439 | 481.64267 |                 cc |        Kyla Deckow DDS | 2023-08-16 13:28:23.362000+0000 | suspended\n\n(10 rows)\n

Also check the HTML report, found at docker/sample/report/index.html, that gets generated to get an overview of what was executed.

"},{"location":"setup/guide/data-source/http/","title":"HTTP Source","text":"

Info

Generating data based on OpenAPI/Swagger document and pushing to HTTP endpoint is a paid feature. Try the free trial here.

Creating a data generator based on an OpenAPI/Swagger document.

"},{"location":"setup/guide/data-source/http/#requirements","title":"Requirements","text":""},{"location":"setup/guide/data-source/http/#get-started","title":"Get Started","text":"

First, we will clone the data-caterer-example repo which will already have the base project setup required.

git clone git@github.com:pflooky/data-caterer-example.git\n
"},{"location":"setup/guide/data-source/http/#http-setup","title":"HTTP Setup","text":"

We will be using the http-bin docker image to help simulate a service with HTTP endpoints.

Start it via:

cd docker\ndocker-compose up -d http\ndocker ps\n
"},{"location":"setup/guide/data-source/http/#plan-setup","title":"Plan Setup","text":"

Create a new Java or Scala class.

Make sure your class extends PlanRun.

JavaScala
import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedHttpJavaPlanRun extends PlanRun {\n{\nvar conf = configuration().enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n}\n}\n
import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedHttpPlanRun extends PlanRun {\nval conf = configuration.enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n}\n

We will enable generate plan and tasks so that we can read from external sources for metadata and save the reports under a folder we can easily access.

"},{"location":"setup/guide/data-source/http/#schema","title":"Schema","text":"

We can point the schema of a data source to a OpenAPI/Swagger document or URL. For this example, we will use the OpenAPI document found under docker/mount/http/petstore.json in the data-caterer-example repo. This is a simplified version of the original OpenAPI spec that can be found here.

We have kept the following endpoints to test out:

JavaScala
var httpTask = http(\"my_http\")\n.schema(metadataSource().openApi(\"/opt/app/mount/http/petstore.json\"))\n.count(count().records(2));\n
val httpTask = http(\"my_http\")\n.schema(metadataSource.openApi(\"/opt/app/mount/http/petstore.json\"))\n.count(count.records(2))\n

The above defines that the schema will come from an OpenAPI document found on the pathway defined. It will then generate 2 requests per request method and endpoint combination.

"},{"location":"setup/guide/data-source/http/#run","title":"Run","text":"

Let's try run and see what happens.

cd ..\n./run.sh\n#input class MyAdvancedHttpJavaPlanRun or MyAdvancedHttpPlanRun\n#after completing\ndocker logs -f docker-http-1\n

It should look something like this.

172.21.0.1 [06/Nov/2023:01:06:53 +0000] GET /anything/pets?tags%3DeXQxFUHVja+EYm%26limit%3D33895 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:06:53 +0000] GET /anything/pets?tags%3DSXaFvAqwYGF%26tags%3DjdNRFONA%26limit%3D40975 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:06:56 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:06:56 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:00 +0000] GET /anything/pets/kbH8D7rDuq HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:00 +0000] GET /anything/pets/REsa0tnu7dvekGDvxR HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:03 +0000] DELETE /anything/pets/EqrOr1dHFfKUjWb HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:03 +0000] DELETE /anything/pets/7WG7JHPaNxP HTTP/1.1 200 Host: host.docker.internal}\n

Looks like we have some data now. But we can do better and add some enhancements to it.

"},{"location":"setup/guide/data-source/http/#foreign-keys","title":"Foreign keys","text":"

The four different requests that get sent could have the same id passed across to each of them if we define a foreign key relationship. This will make it more realistic to a real life scenario as pets get created and queried by a particular id value. We note that the id value is first used when a pet is created in the body of the POST request. Then it gets used as a path parameter in the DELETE and GET requests.

To link them all together, we must follow a particular pattern when referring to request body, query parameter or path parameter columns.

HTTP Type Column Prefix Example Request Body bodyContent bodyContent.id Path Parameter pathParam pathParamid Query Parameter queryParam queryParamid Header header headerContent_Type

Also note, that when creating a foreign field definition for a HTTP data source, to refer to a specific endpoint and method, we have to follow the pattern of {http method}{http path}. For example, POST/pets. Let's apply this knowledge to link all the id values together.

JavaScala
var myPlan = plan().addForeignKeyRelationship(\nforeignField(\"my_http\", \"POST/pets\", \"bodyContent.id\"),     //source of foreign key value\nforeignField(\"my_http\", \"DELETE/pets/{id}\", \"pathParamid\"),\nforeignField(\"my_http\", \"GET/pets/{id}\", \"pathParamid\")\n);\n\nexecute(myPlan, conf, httpTask);\n
val myPlan = plan.addForeignKeyRelationship(\nforeignField(\"my_http\", \"POST/pets\", \"bodyContent.id\"),     //source of foreign key value\nforeignField(\"my_http\", \"DELETE/pets/{id}\", \"pathParamid\"),\nforeignField(\"my_http\", \"GET/pets/{id}\", \"pathParamid\")\n)\n\nexecute(myPlan, conf, httpTask)\n

Let's test it out by running it again

./run.sh\n#input class MyAdvancedHttpJavaPlanRun or MyAdvancedHttpPlanRun\ndocker logs -f docker-http-1\n
172.21.0.1 [06/Nov/2023:01:33:59 +0000] GET /anything/pets?limit%3D45971 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:00 +0000] GET /anything/pets?limit%3D62015 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:04 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:05 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:09 +0000] DELETE /anything/pets/5e HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:09 +0000] DELETE /anything/pets/IHPm2 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:14 +0000] GET /anything/pets/IHPm2 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:14 +0000] GET /anything/pets/5e HTTP/1.1 200 Host: host.docker.internal}\n

Now we have the same id values being produced across the POST, DELETE and GET requests! What if we knew that the id values should follow a particular pattern?

"},{"location":"setup/guide/data-source/http/#custom-metadata","title":"Custom metadata","text":"

So given that we have defined a foreign key where the root of the foreign key values is from the POST request, we can update the metadata of the id column for the POST request and it will proliferate to the other endpoints as well. Given the id column is a nested column as noted in the foreign key, we can alter its metadata via the following:

JavaScala
var httpTask = http(\"my_http\")\n.schema(metadataSource().openApi(\"/opt/app/mount/http/petstore.json\"))\n.schema(field().name(\"bodyContent\").schema(field().name(\"id\").regex(\"ID[0-9]{8}\")))\n.count(count().records(2));\n
val httpTask = http(\"my_http\")\n.schema(metadataSource.openApi(\"/opt/app/mount/http/petstore.json\"))\n.schema(field.name(\"bodyContent\").schema(field.name(\"id\").regex(\"ID[0-9]{8}\")))\n.count(count.records(2))\n

We first get the column bodyContent, then get the nested schema and get the column id and add metadata stating that id should follow the patter ID[0-9]{8}.

Let's try run again, and hopefully we should see some proper ID values.

./run.sh\n#input class MyAdvancedHttpJavaPlanRun or MyAdvancedHttpPlanRun\ndocker logs -f docker-http-1\n
172.21.0.1 [06/Nov/2023:01:45:45 +0000] GET /anything/pets?tags%3D10fWnNoDz%26limit%3D66804 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:46 +0000] GET /anything/pets?tags%3DhyO6mI8LZUUpS HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:50 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:51 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:52 +0000] DELETE /anything/pets/ID55185420 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:52 +0000] DELETE /anything/pets/ID20618951 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:57 +0000] GET /anything/pets/ID55185420 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:57 +0000] GET /anything/pets/ID20618951 HTTP/1.1 200 Host: host.docker.internal}\n

Great! Now we have replicated a production-like flow of HTTP requests.

"},{"location":"setup/guide/data-source/http/#ordering","title":"Ordering","text":"

If you wanted to change the ordering of the requests, you can alter the order from within the OpenAPI/Swagger document. This is particularly useful when you want to simulate the same flow that users would take when utilising your application (i.e. create account, query account, update account).

"},{"location":"setup/guide/data-source/http/#rows-per-second","title":"Rows per second","text":"

By default, Data Caterer will push requests per method and endpoint at a rate of around 5 requests per second. If you want to alter this value, you can do so via the below configuration. The lowest supported requests per second is 1.

JavaScala
import com.github.pflooky.datacaterer.api.model.Constants;\n\n...\nvar httpTask = http(\"my_http\", Map.of(Constants.ROWS_PER_SECOND(), \"1\"))\n...\n
import com.github.pflooky.datacaterer.api.model.Constants.ROWS_PER_SECOND\n\n...\nval httpTask = http(\"my_http\", options = Map(ROWS_PER_SECOND -> \"1\"))\n...\n

Check out the full example under AdvancedHttpPlanRun in the example repo.

"},{"location":"setup/guide/data-source/kafka/","title":"Kafka","text":"

Info

Writing data to Kafka is a paid feature. Try the free trial here.

Creating a data generator for Kafka. You will build a Docker image that will be able to populate data in kafka for the topics you configure.

"},{"location":"setup/guide/data-source/kafka/#requirements","title":"Requirements","text":""},{"location":"setup/guide/data-source/kafka/#get-started","title":"Get Started","text":"

First, we will clone the data-caterer-example repo which will already have the base project setup required.

git clone git@github.com:pflooky/data-caterer-example.git\n

If you already have a Kafka instance running, you can skip to this step.

"},{"location":"setup/guide/data-source/kafka/#kafka-setup","title":"Kafka Setup","text":"

Next, let's make sure you have an instance of Kafka up and running in your local environment. This will make it easy for us to iterate and check our changes.

cd docker\ndocker-compose up -d kafka\n
"},{"location":"setup/guide/data-source/kafka/#plan-setup","title":"Plan Setup","text":"

Create a new Java or Scala class.

Make sure your class extends PlanRun.

JavaScala
import com.github.pflooky.datacaterer.java.api.PlanRun;\n\npublic class MyAdvancedKafkaJavaPlan extends PlanRun {\n}\n
import com.github.pflooky.datacaterer.api.PlanRun\n\nclass MyAdvancedKafkaPlan extends PlanRun {\n}\n

This class defines where we need to define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.

"},{"location":"setup/guide/data-source/kafka/#connection-configuration","title":"Connection Configuration","text":"

Within our class, we can start by defining the connection properties to connect to Kafka.

JavaScala
var accountTask = kafka(\n\"my_kafka\",       //name\n\"localhost:9092\", //url\nMap.of()          //optional additional connection options\n);\n

Additional options can be found here.

val accountTask = kafka(\n\"my_kafka\",       //name\n\"localhost:9092\", //url\nMap()             //optional additional connection options\n)\n

Additional options can be found here.

"},{"location":"setup/guide/data-source/kafka/#schema","title":"Schema","text":"

Let's create a task for inserting data into the account-topic that is already defined underdocker/data/kafka/setup_kafka.sh. This topic should already be setup for you if you followed this step. We can check if the topic is set up already via the following command:

docker exec docker-kafkaserver-1 kafka-topics --bootstrap-server localhost:9092 --list\n

Trimming the connection details to work with the docker-compose Kafka, we have a base Kafka connection to define the topic we will publish to. Let's define each field along with their corresponding data type. You will notice that the text fields do not have a data type defined. This is because the default data type is StringType.

JavaScala
{\nvar kafkaTask = kafka(\"my_kafka\", \"kafkaserver:29092\")\n.topic(\"account-topic\")\n.schema(\nfield().name(\"key\").sql(\"content.account_id\"),\nfield().name(\"value\").sql(\"TO_JSON(content)\"),\n//field().name(\"partition\").type(IntegerType.instance()),  can define partition here\nfield().name(\"headers\")\n.type(ArrayType.instance())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n),\nfield().name(\"content\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"year\").type(IntegerType.instance()),\nfield().name(\"amount\").type(DoubleType.instance()),\nfield().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n),\nfield().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n),\nfield().name(\"tmp_year\").sql(\"content.year\").omit(true),\nfield().name(\"tmp_name\").sql(\"content.details.name\").omit(true)\n)\n}\n
val kafkaTask = kafka(\"my_kafka\", \"kafkaserver:29092\")\n.topic(\"account-topic\")\n.schema(\nfield.name(\"key\").sql(\"content.account_id\"),\nfield.name(\"value\").sql(\"TO_JSON(content)\"),\n//field.name(\"partition\").type(IntegerType),  can define partition here\nfield.name(\"headers\")\n.`type`(ArrayType)\n.sql(\n\"\"\"ARRAY(\n          |  NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n          |  NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n          |)\"\"\".stripMargin\n),\nfield.name(\"content\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"year\").`type`(IntegerType).min(2021).max(2023),\nfield.name(\"amount\").`type`(DoubleType),\nfield.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n),\nfield.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n),\nfield.name(\"tmp_year\").sql(\"content.year\").omit(true),\nfield.name(\"tmp_name\").sql(\"content.details.name\").omit(true)\n)\n
"},{"location":"setup/guide/data-source/kafka/#fields","title":"Fields","text":"

The schema defined for Kafka has a format that needs to be followed as noted above. Specifically, the required fields are: - value

Whilst, the other fields are optional: - key - partition - headers

"},{"location":"setup/guide/data-source/kafka/#headers","title":"headers","text":"

headers follows a particular pattern that where it is of type array<struct<key: string,value: binary>>. To be able to generate data for this data type, we need to use an SQL expression like the one below. You will notice that in the value part, it refers to content.account_id where content is another field defined at the top level of the schema. This allows you to reference other values that have already been generated.

JavaScala
field().name(\"headers\")\n.type(ArrayType.instance())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n)\n
field.name(\"headers\")\n.`type`(ArrayType)\n.sql(\n\"\"\"ARRAY(\n      |  NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n      |  NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n      |)\"\"\".stripMargin\n)\n
"},{"location":"setup/guide/data-source/kafka/#transactions","title":"transactions","text":"

transactions is an array that contains an inner structure of txn_date and amount. The size of the array generated can be controlled via arrayMinLength and arrayMaxLength.

JavaScala
field().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n
field.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n
"},{"location":"setup/guide/data-source/kafka/#details","title":"details","text":"

details is another example of a nested schema structure where it also has a nested structure itself in updated_by. One thing to note here is the first_txn_date field has a reference to the content.transactions array where it will sort the array by txn_date and get the first element.

JavaScala
field().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n)\n
field.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n)\n
"},{"location":"setup/guide/data-source/kafka/#additional-configurations","title":"Additional Configurations","text":"

At the end of data generation, a report gets generated that summarises the actions it performed. We can control the output folder of that report via configurations.

JavaScala
var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n
val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n
"},{"location":"setup/guide/data-source/kafka/#execute","title":"Execute","text":"

To tell Data Caterer that we want to run with the configurations along with the kafkaTask, we have to call execute .

"},{"location":"setup/guide/data-source/kafka/#run","title":"Run","text":"

Now we can run via the script ./run.sh that is in the top level directory of the data-caterer-example to run the class we just created.

./run.sh\n#input class AdvancedKafkaJavaPlanRun or AdvancedKafkaPlanRun\n#after completing\ndocker exec docker-kafkaserver-1 kafka-console-consumer --bootstrap-server localhost:9092 --topic account-topic --from-beginning\n

Your output should look like this.

{\"account_id\":\"ACC56292178\",\"year\":2022,\"amount\":18338.627721151555,\"details\":{\"name\":\"Isaias Reilly\",\"first_txn_date\":\"2021-01-22\",\"updated_by\":{\"user\":\"FgYXbKDWdhHVc3\",\"time\":\"2022-12-30T13:49:07.309Z\"}},\"transactions\":[{\"txn_date\":\"2021-01-22\",\"amount\":30556.52125487579},{\"txn_date\":\"2021-10-29\",\"amount\":39372.302259554635},{\"txn_date\":\"2021-10-29\",\"amount\":61887.31389495968}]}\n{\"account_id\":\"ACC37729457\",\"year\":2022,\"amount\":96885.31758764731,\"details\":{\"name\":\"Randell Witting\",\"first_txn_date\":\"2021-06-30\",\"updated_by\":{\"user\":\"HCKYEBHN8AJ3TB\",\"time\":\"2022-12-02T02:05:01.144Z\"}},\"transactions\":[{\"txn_date\":\"2021-06-30\",\"amount\":98042.09647765031},{\"txn_date\":\"2021-10-06\",\"amount\":41191.43564742036},{\"txn_date\":\"2021-11-16\",\"amount\":78852.08184809204},{\"txn_date\":\"2021-10-09\",\"amount\":13747.157653571106}]}\n{\"account_id\":\"ACC23127317\",\"year\":2023,\"amount\":81164.49304198896,\"details\":{\"name\":\"Jed Wisozk\",\"updated_by\":{\"user\":\"9MBFZZ\",\"time\":\"2023-07-12T05:56:52.397Z\"}},\"transactions\":[]}\n

Also check the HTML report, found at docker/sample/report/index.html, that gets generated to get an overview of what was executed.

"},{"location":"setup/guide/data-source/marquez-metadata-source/","title":"Metadata Source","text":"

Info

Generating data based on an external metadata source is a paid feature. Try the free trial here.

Creating a data generator for Postgres tables and CSV file based on metadata stored in Marquez ( follows OpenLineage API).

"},{"location":"setup/guide/data-source/marquez-metadata-source/#requirements","title":"Requirements","text":""},{"location":"setup/guide/data-source/marquez-metadata-source/#get-started","title":"Get Started","text":"

First, we will clone the data-caterer-example repo which will already have the base project setup required.

git clone git@github.com:pflooky/data-caterer-example.git\n
"},{"location":"setup/guide/data-source/marquez-metadata-source/#marquez-setup","title":"Marquez Setup","text":"

You can follow the README found here to help with setting up Marquez in your local environment. This comes with an instance of Postgres which we will also be using as a data store for generated data.

The command that was run for this example to help with setup of dummy data was ./docker/up.sh -a 5001 -m 5002 --seed.

Check that the following url shows some data like below once you click on food_delivery from the ns drop down in the top right corner.

"},{"location":"setup/guide/data-source/marquez-metadata-source/#postgres-setup","title":"Postgres Setup","text":"

Since we will also be using the Marquez Postgres instance as a data source, we will set up a separate database to store the generated data in via:

docker exec marquez-db psql -Upostgres -c 'CREATE DATABASE food_delivery'\n
"},{"location":"setup/guide/data-source/marquez-metadata-source/#plan-setup","title":"Plan Setup","text":"

Create a new Java or Scala class.

Make sure your class extends PlanRun.

JavaScala
import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedMetadataSourceJavaPlanRun extends PlanRun {\n{\nvar conf = configuration().enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n}\n}\n
import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedMetadataSourcePlanRun extends PlanRun {\nval conf = configuration.enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n}\n

We will enable generate plan and tasks so that we can read from external sources for metadata and save the reports under a folder we can easily access.

"},{"location":"setup/guide/data-source/marquez-metadata-source/#schema","title":"Schema","text":"

We can point the schema of a data source to our Marquez instance. For the Postgres data source, we will point to a namespace, which in Marquez or OpenLineage, represents a set of datasets. For the CSV data source, we will point to a specific namespace and dataset.

"},{"location":"setup/guide/data-source/marquez-metadata-source/#single-schema","title":"Single Schema","text":"JavaScala
var csvTask = csv(\"my_csv\", \"/tmp/data/csv\", Map.of(\"saveMode\", \"overwrite\", \"header\", \"true\"))\n.schema(metadataSource().marquez(\"http://localhost:5001\", \"food_delivery\", \"public.delivery_7_days\"))\n.count(count().records(10));\n
val csvTask = csv(\"my_csv\", \"/tmp/data/csv\", Map(\"saveMode\" -> \"overwrite\", \"header\" -> \"true\"))\n.schema(metadataSource.marquez(\"http://localhost:5001\", \"food_delivery\", \"public.delivery_7_days\"))\n.count(count.records(10))\n

The above defines that the schema will come from Marquez, which is a type of metadata source that contains information about schemas. Specifically, it points to the food_delivery namespace and public.categories dataset to retrieve the schema information from.

"},{"location":"setup/guide/data-source/marquez-metadata-source/#multiple-schemas","title":"Multiple Schemas","text":"JavaScala
var postgresTask = postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/food_delivery\", \"postgres\", \"password\", Map.of())\n.schema(metadataSource().marquez(\"http://host.docker.internal:5001\", \"food_delivery\"))\n.count(count().records(10));\n
val postgresTask = postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/food_delivery\", \"postgres\", \"password\")\n.schema(metadataSource.marquez(\"http://host.docker.internal:5001\", \"food_delivery\"))\n.count(count.records(10))\n

We now have pointed this Postgres instance to produce multiple schemas that are defined under the food_delivery namespace. Also note that we are using database food_delivery in Postgres to push our generated data to, and we have set the number of records per sub data source (in this case, per table) to be 10.

"},{"location":"setup/guide/data-source/marquez-metadata-source/#run","title":"Run","text":"

Let's try run and see what happens.

cd ..\n./run.sh\n#input class MyAdvancedMetadataSourceJavaPlanRun or MyAdvancedMetadataSourcePlanRun\n#after completing\ndocker exec marquez-db psql -Upostgres -d food_delivery -c 'SELECT * FROM public.delivery_7_days'\n

It should look something like this.

 order_id |     order_placed_on     |   order_dispatched_on   |   order_delivered_on    |         customer_email         |                     customer_address                     | menu_id | restaurant_id |                        restaurant_address\n   | menu_item_id | category_id | discount_id | city_id | driver_id\n----------+-------------------------+-------------------------+-------------------------+--------------------------------+----------------------------------------------------------+---------+---------------+---------------------------------------------------------------\n---+--------------+-------------+-------------+---------+-----------\n    38736 | 2023-02-05 06:05:23.755 | 2023-09-08 04:29:10.878 | 2023-09-03 23:58:34.285 | april.skiles@hotmail.com       | 5018 Lang Dam, Gaylordfurt, MO 35172                     |   59841 |         30971 | Suite 439 51366 Bartoletti Plains, West Lashawndamouth, CA 242\n42 |        55697 |       36370 |       21574 |   88022 |     16569\n4376 | 2022-12-19 14:39:53.442 | 2023-08-30 07:40:06.948 | 2023-03-15 20:38:26.11  | adelina.balistreri@hotmail.com | Apt. 340 9146 Novella Motorway, East Troyhaven, UT 34773 |   66195 |         42765 | Suite 670 8956 Rob Fork, Rennershire, CA 04524\n|        26516 |       81335 |       87615 |   27433 |     45649\n11083 | 2022-10-30 12:46:38.692 | 2023-06-02 13:05:52.493 | 2022-11-27 18:38:07.873 | johnny.gleason@gmail.com       | Apt. 385 99701 Lemke Place, New Irvin, RI 73305          |   66427 |         44438 | 1309 Danny Cape, Weimanntown, AL 15865\n|        41686 |       36508 |       34498 |   24191 |     92405\n58759 | 2023-07-26 14:32:30.883 | 2022-12-25 11:04:08.561 | 2023-04-21 17:43:05.86  | isabelle.ohara@hotmail.com     | 2225 Evie Lane, South Ardella, SD 90805                  |   27106 |         25287 | Suite 678 3731 Dovie Park, Port Luigi, ID 08250\n|        94205 |       66207 |       81051 |   52553 |     27483\n

You can also try query some other tables. Let's also check what is in the CSV file.

$ head docker/sample/csv/part-0000*\nmenu_item_id,category_id,discount_id,city_id,driver_id,order_id,order_placed_on,order_dispatched_on,order_delivered_on,customer_email,customer_address,menu_id,restaurant_id,restaurant_address\n72248,37098,80135,45888,5036,11090,2023-09-20T05:33:08.036+08:00,2023-05-16T23:10:57.119+08:00,2023-05-01T22:02:23.272+08:00,demetrice.rohan@hotmail.com,\"406 Harmony Rue, Wisozkburgh, MD 12282\",33762,9042,\"Apt. 751 0796 Ellan Flats, Lake Chetville, WI 81957\"\n41644,40029,48565,83373,89919,58359,2023-04-18T06:28:26.194+08:00,2022-10-15T18:17:48.998+08:00,2023-02-06T17:02:04.104+08:00,joannie.okuneva@yahoo.com,\"Suite 889 022 Susan Lane, Zemlakport, OR 56996\",27467,6216,\"Suite 016 286 Derick Grove, Dooleytown, NY 14664\"\n49299,53699,79675,40821,61764,72234,2023-07-16T21:33:48.739+08:00,2023-02-14T21:23:10.265+08:00,2023-09-18T02:08:51.433+08:00,ina.heller@yahoo.com,\"Suite 600 86844 Heller Island, New Celestinestad, DE 42622\",48002,12462,\"5418 Okuneva Mountain, East Blairchester, MN 04060\"\n83197,86141,11085,29944,81164,65382,2023-01-20T06:08:25.981+08:00,2023-01-11T13:24:32.968+08:00,2023-09-09T02:30:16.890+08:00,lakisha.bashirian@yahoo.com,\"Suite 938 534 Theodore Lock, Port Caitlynland, LA 67308\",69109,47727,\"4464 Stewart Tunnel, Marguritemouth, AR 56791\"\n

Looks like we have some data now. But we can do better and add some enhancements to it.

What if we wanted the same records in Postgres public.delivery_7_days to also show up in the CSV file? That's where we can use a foreign key definition.

"},{"location":"setup/guide/data-source/marquez-metadata-source/#foreign-key","title":"Foreign Key","text":"

We can take a look at the report (under docker/sample/report/index.html) to see what we need to do to create the foreign key. From the overview, you should see under Tasks there is a my_postgres task which has food_delivery_public.delivery_7_days as a step. Click on the link for food_delivery_public.delivery_7_days and it will take us to a page where we can find out about the columns used in this table. Click on the Fields button on the far right to see.

We can copy all of a subset of fields that we want matched across the CSV file and Postgres. For this example, we will take all the fields.

JavaScala
var myPlan = plan().addForeignKeyRelationship(\npostgresTask, List.of(\"key\", \"tmp_year\", \"tmp_name\", \"value\"),\nList.of(Map.entry(csvTask, List.of(\"account_number\", \"year\", \"name\", \"payload\")))\n);\n\nvar conf = ...\n\nexecute(myPlan, conf, postgresTask, csvTask);\n
val foreignCols = List(\"order_id\", \"order_placed_on\", \"order_dispatched_on\", \"order_delivered_on\", \"customer_email\",\n\"customer_address\", \"menu_id\", \"restaurant_id\", \"restaurant_address\", \"menu_item_id\", \"category_id\", \"discount_id\",\n\"city_id\", \"driver_id\")\n\nval myPlan = plan.addForeignKeyRelationships(\ncsvTask, foreignCols,\nList(foreignField(postgresTask, \"food_delivery_public.delivery_7_days\", foreignCols))\n)\n\nval conf = ...\n\nexecute(myPlan, conf, postgresTask, csvTask)\n

Notice how we have defined the csvTask and foreignCols as the main foreign key but for postgresTask, we had to define it as a foreignField. This is because postgresTask has multiple tables within it, and we only want to define our foreign key with respect to the public.delivery_7_days table. We use the step name (can be seen from the report) to specify the table to target.

To test this out, we will truncate the public.delivery_7_days table in Postgres first, and then try run again.

docker exec marquez-db psql -Upostgres -d food_delivery -c 'TRUNCATE public.delivery_7_days'\n./run.sh\n#input class MyAdvancedMetadataSourceJavaPlanRun or MyAdvancedMetadataSourcePlanRun\ndocker exec marquez-db psql -Upostgres -d food_delivery -c 'SELECT * FROM public.delivery_7_days'\n
 order_id |     order_placed_on     |   order_dispatched_on   |   order_delivered_on    |        customer_email        |\ncustomer_address                     | menu_id | restaurant_id |                   restaurant_address                   | menu\n_item_id | category_id | discount_id | city_id | driver_id\n----------+-------------------------+-------------------------+-------------------------+------------------------------+-------------\n--------------------------------------------+---------+---------------+--------------------------------------------------------+-----\n---------+-------------+-------------+---------+-----------\n    53333 | 2022-10-15 08:40:23.394 | 2023-01-23 09:42:48.397 | 2023-08-12 08:50:52.397 | normand.aufderhar@gmail.com  | Apt. 036 449\n27 Wilderman Forge, Marvinchester, CT 15952 |   40412 |         70130 | Suite 146 98176 Schaden Village, Grahammouth, SD 12354 |\n90141 |       44210 |       83966 |   78614 |     77449\n

Let's grab the first email from the Postgres table and check whether the same record exists in the CSV file.

$ cat docker/sample/csv/part-0000* | grep normand.aufderhar\n90141,44210,83966,78614,77449,53333,2022-10-15T08:40:23.394+08:00,2023-01-23T09:42:48.397+08:00,2023-08-12T08:50:52.397+08:00,normand.aufderhar@gmail.com,\"Apt. 036 44927 Wilderman Forge, Marvinchester, CT 15952\",40412,70130,\"Suite 146 98176 Schaden Village, Grahammouth, SD 12354\"\n

Great! Now we have the ability to get schema information from an external source, add our own foreign keys and generate data.

Check out the full example under AdvancedMetadataSourcePlanRun in the example repo.

"},{"location":"setup/guide/data-source/open-metadata-source/","title":"OpenMetadata Source","text":"

Info

Generating data based on an external metadata source is a paid feature. Try the free trial here.

Creating a data generator for a JSON file based on metadata stored in OpenMetadata.

"},{"location":"setup/guide/data-source/open-metadata-source/#requirements","title":"Requirements","text":""},{"location":"setup/guide/data-source/open-metadata-source/#get-started","title":"Get Started","text":"

First, we will clone the data-caterer-example repo which will already have the base project setup required.

git clone git@github.com:pflooky/data-caterer-example.git\n
"},{"location":"setup/guide/data-source/open-metadata-source/#openmetadata-setup","title":"OpenMetadata Setup","text":"

You can follow the local docker setup found here to help with setting up OpenMetadata in your local environment.

If that page becomes outdated or the link doesn't work, below are the commands I used to run it:

mkdir openmetadata-docker && cd openmetadata-docker\ncurl -sL https://github.com/open-metadata/OpenMetadata/releases/download/1.2.0-release/docker-compose.yml > docker-compose.yml\ndocker compose -f docker-compose.yml up --detach\n

Check that the following url works and login with admin:admin. Then you should see some data like below:

"},{"location":"setup/guide/data-source/open-metadata-source/#plan-setup","title":"Plan Setup","text":"

Create a new Java or Scala class.

Make sure your class extends PlanRun.

JavaScala
import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedOpenMetadataSourceJavaPlanRun extends PlanRun {\n{\nvar conf = configuration().enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n}\n}\n
import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedOpenMetadataSourcePlanRun extends PlanRun {\nval conf = configuration.enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n}\n

We will enable generate plan and tasks so that we can read from external sources for metadata and save the reports under a folder we can easily access.

"},{"location":"setup/guide/data-source/open-metadata-source/#schema","title":"Schema","text":"

We can point the schema of a data source to our OpenMetadata instance. We will use a JSON data source so that we can show how nested data types are handled and how we could customise it.

"},{"location":"setup/guide/data-source/open-metadata-source/#single-schema","title":"Single Schema","text":"JavaScala
import com.github.pflooky.datacaterer.api.model.Constants;\n...\n\nvar jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map.of(\"saveMode\", \"overwrite\"))\n.schema(metadataSource().openMetadataJava(\n\"http://localhost:8585/api\",                                                              //url\nConstants.OPEN_METADATA_AUTH_TYPE_OPEN_METADATA(),                                        //auth type\nMap.of(                                                                                   //additional options (including auth options)\nConstants.OPEN_METADATA_JWT_TOKEN(), \"abc123\",                                        //get from settings/bots/ingestion-bot\nConstants.OPEN_METADATA_TABLE_FQN(), \"sample_data.ecommerce_db.shopify.raw_customer\"  //table fully qualified name\n)\n))\n.count(count().records(10));\n
import com.github.pflooky.datacaterer.api.model.Constants.{OPEN_METADATA_AUTH_TYPE_OPEN_METADATA, OPEN_METADATA_JWT_TOKEN, OPEN_METADATA_TABLE_FQN, SAVE_MODE}\n...\n\nval jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map(\"saveMode\" -> \"overwrite\"))\n.schema(metadataSource.openMetadata(\n\"http://localhost:8585/api\",                                                  //url\nOPEN_METADATA_AUTH_TYPE_OPEN_METADATA,                                        //auth type\nMap(                                                                          //additional options (including auth options)\nOPEN_METADATA_JWT_TOKEN -> \"abc123\",                                        //get from settings/bots/ingestion-bot\nOPEN_METADATA_TABLE_FQN -> \"sample_data.ecommerce_db.shopify.raw_customer\"  //table fully qualified name\n)\n))\n.count(count.records(10))\n

The above defines that the schema will come from OpenMetadata, which is a type of metadata source that contains information about schemas. Specifically, it points to the sample_data.ecommerce_db.shopify.raw_customer table. You can check out the schema here to see what it looks like.

"},{"location":"setup/guide/data-source/open-metadata-source/#run","title":"Run","text":"

Let's try run and see what happens.

cd ..\n./run.sh\n#input class MyAdvancedOpenMetadataSourceJavaPlanRun or MyAdvancedOpenMetadataSourcePlanRun\n#after completing\ncat docker/sample/json/part-00000-*\n

It should look something like this.

{\n\"comments\": \"Mh6jqpD5e4M\",\n\"creditcard\": \"6771839575926717\",\n\"membership\": \"Za3wCQUl9E  EJj712\",\n\"orders\": [\n{\n\"product_id\": \"Aa6NG0hxfHVq\",\n\"price\": 16139,\n\"onsale\": false,\n\"tax\": 58134,\n\"weight\": 40734,\n\"others\": 45813,\n\"vendor\": \"Kh\"\n},\n{\n\"product_id\": \"zbHBY \",\n\"price\": 17903,\n\"onsale\": false,\n\"tax\": 39526,\n\"weight\": 9346,\n\"others\": 52035,\n\"vendor\": \"jbkbnXAa\"\n},\n{\n\"product_id\": \"5qs3gakppd7Nw5\",\n\"price\": 48731,\n\"onsale\": true,\n\"tax\": 81105,\n\"weight\": 2004,\n\"others\": 20465,\n\"vendor\": \"nozCDMSXRPH Ev\"\n},\n{\n\"product_id\": \"CA6h17ANRwvb\",\n\"price\": 62102,\n\"onsale\": true,\n\"tax\": 96601,\n\"weight\": 78849,\n\"others\": 79453,\n\"vendor\": \" ihVXEJz7E2EFS\"\n}\n],\n\"platform\": \"GLt9\",\n\"preference\": {\n\"key\": \"nmPmsPjg C\",\n\"value\": true\n},\n\"shipping_address\": [\n{\n\"name\": \"Loren Bechtelar\",\n\"street_address\": \"Suite 526 293 Rohan Road, Wunschshire, NE 25532\",\n\"city\": \"South Norrisland\",\n\"postcode\": \"56863\"\n}\n],\n\"shipping_date\": \"2022-11-03\",\n\"transaction_date\": \"2023-02-01\",\n\"customer\": {\n\"username\": \"lance.murphy\",\n\"name\": \"Zane Brakus DVM\",\n\"sex\": \"7HcAaPiO\",\n\"address\": \"594 Loida Haven, Gilland, MA 26071\",\n\"mail\": \"Un3fhbvK2rEbenIYdnq\",\n\"birthdate\": \"2023-01-31\"\n}\n}\n

Looks like we have some data now. But we can do better and add some enhancements to it.

"},{"location":"setup/guide/data-source/open-metadata-source/#custom-metadata","title":"Custom metadata","text":"

We can see from the data generated, that it isn't quite what we want. The metadata is not sufficient for us to produce production-like data yet. Let's try to add some enhancements to it.

Let's make the platform field a choice field that can only be a set of certain values and the nested field customer.sex is also from a predefined set of values.

JavaScala
var jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map.of(\"saveMode\", \"overwrite\"))\n.schema(\nmetadata...\n))\n.schema(\nfield().name(\"platform\").oneOf(\"website\", \"mobile\"),\nfield().name(\"customer\").schema(field().name(\"sex\").oneOf(\"M\", \"F\", \"O\"))\n)\n.count(count().records(10));\n
val jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map(\"saveMode\" -> \"overwrite\"))\n.schema(\nmetadata...\n))\n.schema(\nfield.name(\"platform\").oneOf(\"website\", \"mobile\"),\nfield.name(\"customer\").schema(field.name(\"sex\").oneOf(\"M\", \"F\", \"O\"))\n)\n.count(count.records(10))\n

Let's test it out by running it again

./run.sh\n#input class MyAdvancedMetadataSourceJavaPlanRun or MyAdvancedMetadataSourcePlanRun\ncat docker/sample/json/part-00000-*\n
{\n\"comments\": \"vqbPUm\",\n\"creditcard\": \"6304867705548636\",\n\"membership\": \"GZ1xOnpZSUOKN\",\n\"orders\": [\n{\n\"product_id\": \"rgOokDAv\",\n\"price\": 77367,\n\"onsale\": false,\n\"tax\": 61742,\n\"weight\": 87855,\n\"others\": 26857,\n\"vendor\": \"04XHR64ImMr9T\"\n}\n],\n\"platform\": \"mobile\",\n\"preference\": {\n\"key\": \"IB5vNdWka\",\n\"value\": true\n},\n\"shipping_address\": [\n{\n\"name\": \"Isiah Bins\",\n\"street_address\": \"36512 Ross Spurs, Hillhaven, IA 18760\",\n\"city\": \"Averymouth\",\n\"postcode\": \"75818\"\n},\n{\n\"name\": \"Scott Prohaska\",\n\"street_address\": \"26573 Haley Ports, Dariusland, MS 90642\",\n\"city\": \"Ashantimouth\",\n\"postcode\": \"31792\"\n},\n{\n\"name\": \"Rudolf Stamm\",\n\"street_address\": \"Suite 878 0516 Danica Path, New Christiaport, ID 10525\",\n\"city\": \"Doreathaport\",\n\"postcode\": \"62497\"\n}\n],\n\"shipping_date\": \"2023-08-24\",\n\"transaction_date\": \"2023-02-01\",\n\"customer\": {\n\"username\": \"jolie.cremin\",\n\"name\": \"Fay Klein\",\n\"sex\": \"O\",\n\"address\": \"Apt. 174 5084 Volkman Creek, Hillborough, PA 61959\",\n\"mail\": \"BiTmzb7\",\n\"birthdate\": \"2023-04-07\"\n}\n}\n

Great! Now we have the ability to get schema information from an external source, add our own metadata and generate data.

"},{"location":"setup/guide/data-source/open-metadata-source/#data-validation","title":"Data validation","text":"

Another aspect of OpenMetadata that can be leveraged is the definition of data quality rules. These rules can be incorporated into your Data Caterer job as well by enabling data validations via enableGenerateValidations in configuration.

JavaScala
var conf = configuration().enableGeneratePlanAndTasks(true)\n.enableGenerateValidations(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(conf, jsonTask);\n
val conf = configuration.enableGeneratePlanAndTasks(true)\n.enableGenerateValidations(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(conf, jsonTask)\n

Check out the full example under AdvancedOpenMetadataSourcePlanRun in the example repo.

"},{"location":"setup/guide/data-source/solace/","title":"Solace","text":"

Info

Writing data to Solace is a paid feature. Try the free trial here.

Creating a data generator for Solace. You will build a Docker image that will be able to populate data in Solace for the queues/topics you configure.

"},{"location":"setup/guide/data-source/solace/#requirements","title":"Requirements","text":""},{"location":"setup/guide/data-source/solace/#get-started","title":"Get Started","text":"

First, we will clone the data-caterer-example repo which will already have the base project setup required.

git clone git@github.com:pflooky/data-caterer-example.git\n

If you already have a Solace instance running, you can skip to this step.

"},{"location":"setup/guide/data-source/solace/#solace-setup","title":"Solace Setup","text":"

Next, let's make sure you have an instance of Solace up and running in your local environment. This will make it easy for us to iterate and check our changes.

cd docker\ndocker-compose up -d solace\n

Open up localhost:8080 and login with admin:admin and check there is the default VPN like below. Notice there is 2 queues/topics created. If you do not see 2 created, try to run the script found under docker/data/solace/setup_solace.sh and change the host to localhost.

"},{"location":"setup/guide/data-source/solace/#plan-setup","title":"Plan Setup","text":"

Create a new Java or Scala class.

Make sure your class extends PlanRun.

JavaScala
import com.github.pflooky.datacaterer.java.api.PlanRun;\n\npublic class MyAdvancedSolaceJavaPlan extends PlanRun {\n}\n
import com.github.pflooky.datacaterer.api.PlanRun\n\nclass MyAdvancedSolacePlan extends PlanRun {\n}\n

This class defines where we need to define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.

"},{"location":"setup/guide/data-source/solace/#connection-configuration","title":"Connection Configuration","text":"

Within our class, we can start by defining the connection properties to connect to Solace.

JavaScala
var accountTask = solace(\n\"my_solace\",                        //name\n\"smf://host.docker.internal:55554\", //url\nMap.of()                            //optional additional connection options\n);\n

Additional connection options can be found here.

val accountTask = solace(\n\"my_solace\",                        //name\n\"smf://host.docker.internal:55554\", //url\nMap()                               //optional additional connection options\n)\n

Additional connection options can be found here.

"},{"location":"setup/guide/data-source/solace/#schema","title":"Schema","text":"

Let's create a task for inserting data into the rest_test_queue or rest_test_topic that is already created for us from this step.

Trimming the connection details to work with the docker-compose Solace, we have a base Solace connection to define the JNDI destination we will publish to. Let's define each field along with their corresponding data type. You will notice that the text fields do not have a data type defined. This is because the default data type is StringType.

JavaScala
{\nvar solaceTask = solace(\"my_solace\", \"smf://host.docker.internal:55554\")\n.destination(\"/JNDI/Q/rest_test_queue\")\n.schema(\nfield().name(\"value\").sql(\"TO_JSON(content)\"),\n//field().name(\"partition\").type(IntegerType.instance()),   //can define message JMS priority here\nfield().name(\"headers\")                                     //set message properties via headers field\n.type(HeaderType.getType())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n),\nfield().name(\"content\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"year\").type(IntegerType.instance()).min(2021).max(2023),\nfield().name(\"amount\").type(DoubleType.instance()),\nfield().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n),\nfield().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n)\n)\n.count(count().records(10));\n}\n
val solaceTask = solace(\"my_solace\", \"smf://host.docker.internal:55554\")\n.destination(\"/JNDI/Q/rest_test_queue\")\n.schema(\nfield.name(\"value\").sql(\"TO_JSON(content)\"),\n//field.name(\"partition\").`type`(IntegerType),  //can define message JMS priority here\nfield.name(\"headers\")                           //set message properties via headers field\n.`type`(HeaderType.getType)\n.sql(\n\"\"\"ARRAY(\n          |  NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n          |  NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n          |)\"\"\".stripMargin\n),\nfield.name(\"content\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"year\").`type`(IntegerType).min(2021).max(2023),\nfield.name(\"amount\").`type`(DoubleType),\nfield.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n),\nfield.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n),\n).count(count.records(10))\n
"},{"location":"setup/guide/data-source/solace/#fields","title":"Fields","text":"

The schema defined for Solace has a format that needs to be followed as noted above. Specifically, the required fields are:

Whilst, the other fields are optional:

"},{"location":"setup/guide/data-source/solace/#headers","title":"headers","text":"

headers follows a particular pattern that where it is of type HeaderType.getType which behind the scenes, translates toarray<struct<key: string,value: binary>>. To be able to generate data for this data type, we need to use an SQL expression like the one below. You will notice that in thevalue part, it refers to content.account_id where content is another field defined at the top level of the schema. This allows you to reference other values that have already been generated.

JavaScala
field().name(\"headers\")\n.type(HeaderType.getType())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n)\n
field.name(\"headers\")\n.`type`(HeaderType.getType)\n.sql(\n\"\"\"ARRAY(\n      |  NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n      |  NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n      |)\"\"\".stripMargin\n)\n
"},{"location":"setup/guide/data-source/solace/#transactions","title":"transactions","text":"

transactions is an array that contains an inner structure of txn_date and amount. The size of the array generated can be controlled via arrayMinLength and arrayMaxLength.

JavaScala
field().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n
field.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n
"},{"location":"setup/guide/data-source/solace/#details","title":"details","text":"

details is another example of a nested schema structure where it also has a nested structure itself in updated_by. One thing to note here is the first_txn_date field has a reference to the content.transactions array where it will sort the array by txn_date and get the first element.

JavaScala
field().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n)\n
field.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n)\n
"},{"location":"setup/guide/data-source/solace/#additional-configurations","title":"Additional Configurations","text":"

At the end of data generation, a report gets generated that summarises the actions it performed. We can control the output folder of that report via configurations.

JavaScala
var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n
val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n
"},{"location":"setup/guide/data-source/solace/#execute","title":"Execute","text":"

To tell Data Caterer that we want to run with the configurations along with the kafkaTask, we have to call execute.

"},{"location":"setup/guide/data-source/solace/#run","title":"Run","text":"

Now we can run via the script ./run.sh that is in the top level directory of the data-caterer-example to run the class we just created.

./run.sh\n#input class AdvancedSolaceJavaPlanRun or AdvancedSolacePlanRun\n#after completing, check http://localhost:8080 from browser\n

Your output should look like this.

Unfortunately, there is no easy way to see the message content. You can check the message content from your application or service that consumes these messages.

Also check the HTML report, found at docker/sample/report/index.html, that gets generated to get an overview of what was executed. Or view the sample report found here.

"},{"location":"setup/guide/scenario/auto-generate-connection/","title":"Auto Generate From Data Connection","text":"

Info

Auto data generation from data connection is a paid feature. Try the free trial here.

Creating a data generator based on only a data connection to Postgres.

"},{"location":"setup/guide/scenario/auto-generate-connection/#requirements","title":"Requirements","text":""},{"location":"setup/guide/scenario/auto-generate-connection/#get-started","title":"Get Started","text":"

First, we will clone the data-caterer-example repo which will already have the base project setup required.

git clone git@github.com:pflooky/data-caterer-example.git\n
"},{"location":"setup/guide/scenario/auto-generate-connection/#plan-setup","title":"Plan Setup","text":"

Create a new Java or Scala class.

Make sure your class extends PlanRun.

JavaScala
import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedAutomatedJavaPlanRun extends PlanRun {\n{\nvar autoRun = configuration()\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\")  (1)\n.enableGeneratePlanAndTasks(true)                                                 (2)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\")                        (3)\n.enableUniqueCheck(true)                                                          (4)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(autoRun);\n}\n}\n
import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedAutomatedPlanRun extends PlanRun {\n\nval autoRun = configuration\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\")  (1)\n.enableGeneratePlanAndTasks(true)                                                 (2)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\")                        (3)\n.enableUniqueCheck(true)                                                          (4)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(configuration = autoRun)\n}\n

In the above code, we note the following:

  1. Data source configuration to a Postgres data source called my_postgres
  2. We have enabled the flag enableGeneratePlanAndTasks which tells Data Caterer to go to my_postgres and generate data for all the tables found under the database customer (which is defined in the connection string).
  3. The config generatedPlanAndTaskFolderPath defines where the metadata that is gathered from my_postgres should be saved at so that we could re-use it later.
  4. enableUniqueCheck is set to true to ensure that generated data is unique based on primary key or foreign key definitions.

Note

Unique check will only ensure generated data is unique. Any existing data in your data source is not taken into account, so generated data may fail to insert depending on the data source restrictions

"},{"location":"setup/guide/scenario/auto-generate-connection/#postgres-setup","title":"Postgres Setup","text":"

If you don't have your own Postgres up and running, you can set up and run an instance configured in the docker folder via.

cd docker\ndocker-compose up -d postgres\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c '\\dt+ account.*'\n

This will create the tables found under docker/data/sql/postgres/customer.sql. You can change this file to contain your own tables. We can see there are 4 tables created for us, accounts, balances, transactions and mapping.

"},{"location":"setup/guide/scenario/auto-generate-connection/#run","title":"Run","text":"

Let's try run.

cd ..\n./run.sh\n#input class MyAdvancedAutomatedJavaPlanRun or MyAdvancedAutomatedPlanRun\n#after completing\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select * from account.accounts limit 1;'\n

It should look something like this.

   id   | account_number  | account_status | created_by | created_by_fixed_length | customer_id_int | customer_id_smallint | customer_id_bigint |   customer_id_decimal    | customer_id_real | customer_id_double | open_date  |     open_timestamp      | last_opened_time |                                                           payload_bytes\n--------+-----------------+----------------+------------+-------------------------+-----------------+----------------------+--------------------+--------------------------+------------------+--------------------+------------+-------------------------+------------------+------------------------------------------------------------------------------------------------------------------------------------\n 100414 | 5uROOVOUyQUbubN | h3H            | SfA0eZJcTm | CuRw                    |              13 |                   42 |               6041 | 76987.745612542900000000 |         91866.78 |  66400.37433202339 | 2023-03-05 | 2023-08-14 11:33:11.343 | 23:58:01.736     | \\x604d315d4547616e6a233050415373317274736f5e682d516132524f3d23233c37463463322f342d34376d597e665d6b3d395b4238284028622b7d6d2b4f5042\n(1 row)\n

The data that gets inserted will follow the foreign keys that are defined within Postgres and also ensure the insertion order is correct.

Also check the HTML report that gets generated under docker/sample/report/index.html. You can see a summary of what was generated along with other metadata.

You can now look to play around with other tables or data sources and auto generate for them.

"},{"location":"setup/guide/scenario/auto-generate-connection/#additional-topics","title":"Additional Topics","text":""},{"location":"setup/guide/scenario/auto-generate-connection/#learn-from-existing-data","title":"Learn From Existing Data","text":"

If you have any existing data within your data source, Data Caterer will gather metadata about the existing data to help guide it when generating new data. There are configurations that can help tune the metadata analysis found here.

"},{"location":"setup/guide/scenario/auto-generate-connection/#filter-out-schematables","title":"Filter Out Schema/Tables","text":"

As part of your connection definition, you can define any schemas and/or tables your don't want to generate data for. In the example below, it will not generate any data for any tables under the history and audit schemas. Also, any table with the name balances or transactions in any schema will also not have data generated.

JavaScala
var autoRun = configuration()\n.postgres(\n\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\",\nMap.of(\n\"filterOutSchema\", \"history, audit\",\n\"filterOutTable\", \"balances, transactions\")\n)\n)\n
val autoRun = configuration\n.postgres(\n\"my_postgres\",\n\"jdbc:postgresql://host.docker.internal:5432/customer\",\nMap(\n\"filterOutSchema\" -> \"history, audit\",\n\"filterOutTable\" -> \"balances, transactions\")\n)\n)\n
"},{"location":"setup/guide/scenario/auto-generate-connection/#define-record-count","title":"Define record count","text":"

You can control the record count per sub data source via numRecordsPerStep.

JavaScala
var autoRun = configuration()\n...\n.numRecordsPerStep(100)\n\nexecute(autoRun)\n
val autoRun = configuration\n...\n.numRecordsPerStep(100)\n\nexecute(configuration = autoRun)\n
"},{"location":"setup/guide/scenario/batch-and-event/","title":"Generate Batch and Event Data","text":"

Info

Generating event data is a paid feature. Try the free trial here.

Creating a data generator for Kafka topic with matching records in a CSV file.

"},{"location":"setup/guide/scenario/batch-and-event/#requirements","title":"Requirements","text":""},{"location":"setup/guide/scenario/batch-and-event/#get-started","title":"Get Started","text":"

First, we will clone the data-caterer-example repo which will already have the base project setup required.

git clone git@github.com:pflooky/data-caterer-example.git\n
"},{"location":"setup/guide/scenario/batch-and-event/#kafka-setup","title":"Kafka Setup","text":"

If you don't have your own Kafka up and running, you can set up and run an instance configured in the docker folder via.

cd docker\ndocker-compose up -d kafka\ndocker exec docker-kafkaserver-1 kafka-topics --bootstrap-server localhost:9092 --list\n

Let's create a task for inserting data into the account-topic that is already defined underdocker/data/kafka/setup_kafka.sh.

"},{"location":"setup/guide/scenario/batch-and-event/#plan-setup","title":"Plan Setup","text":"

Create a new Java or Scala class.

Make sure your class extends PlanRun.

JavaScala
import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedBatchEventJavaPlanRun extends PlanRun {\n{\nvar kafkaTask = new AdvancedKafkaJavaPlanRun().getKafkaTask();\n}\n}\n
import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedBatchEventPlanRun extends PlanRun {\nval kafkaTask = new AdvancedKafkaPlanRun().kafkaTask\n}\n

We will borrow the Kafka task that is already defined under the class AdvancedKafkaPlanRun or AdvancedKafkaJavaPlanRun. You can go through the Kafka guide here for more details.

"},{"location":"setup/guide/scenario/batch-and-event/#schema","title":"Schema","text":"

Let us set up the corresponding schema for the CSV file where we want to match the values that are generated for the Kafka messages.

JavaScala
var kafkaTask = new AdvancedKafkaJavaPlanRun().getKafkaTask();\n\nvar csvTask = csv(\"my_csv\", \"/opt/app/data/csv/account\")\n.schema(\nfield().name(\"account_number\"),\nfield().name(\"year\"),\nfield().name(\"name\"),\nfield().name(\"payload\")\n);\n
val kafkaTask = new AdvancedKafkaPlanRun().kafkaTask\n\nval csvTask = csv(\"my_csv\", \"/opt/app/data/csv/account\")\n.schema(\nfield.name(\"account_number\"),\nfield.name(\"year\"),\nfield.name(\"name\"),\nfield.name(\"payload\")\n)\n

This is a simple schema where we want to use the values and metadata that is already defined in the kafkaTask to determine what the data will look like for the CSV file. Even if we defined some metadata here, it would be overridden when we define our foreign key relationships.

"},{"location":"setup/guide/scenario/batch-and-event/#foreign-keys","title":"Foreign Keys","text":"

From the above CSV schema, we see note the following against the Kafka schema:

Our foreign keys are therefore defined like below. Order is important when defining the list of columns. The index needs to match with the corresponding column in the other data source.

JavaScala
var myPlan = plan().addForeignKeyRelationship(\nkafkaTask, List.of(\"key\", \"tmp_year\", \"tmp_name\", \"value\"),\nList.of(Map.entry(csvTask, List.of(\"account_number\", \"year\", \"name\", \"payload\")))\n);\n\nvar conf = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(myPlan, conf, kafkaTask, csvTask);\n
val myPlan = plan.addForeignKeyRelationship(\nkafkaTask, List(\"key\", \"tmp_year\", \"tmp_name\", \"value\"),\nList(csvTask -> List(\"account_number\", \"year\", \"name\", \"payload\"))\n)\n\nval conf = configuration.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(myPlan, conf, kafkaTask, csvTask)\n
"},{"location":"setup/guide/scenario/batch-and-event/#run","title":"Run","text":"

Let's try run.

cd ..\n./run.sh\n#input class MyAdvancedBatchEventJavaPlanRun or MyAdvancedBatchEventPlanRun\n#after completing\ndocker exec docker-kafkaserver-1 kafka-console-consumer --bootstrap-server localhost:9092 --topic account-topic --from-beginning\n

It should look something like this.

{\"account_id\":\"ACC03093143\",\"year\":2023,\"amount\":87990.37196728592,\"details\":{\"name\":\"Nadine Heidenreich Jr.\",\"first_txn_date\":\"2021-11-09\",\"updated_by\":{\"user\":\"YfEyJCe8ohrl0j IfyT\",\"time\":\"2022-09-26T20:47:53.404Z\"}},\"transactions\":[{\"txn_date\":\"2021-11-09\",\"amount\":97073.7914706189}]}\n{\"account_id\":\"ACC08764544\",\"year\":2021,\"amount\":28675.58758765888,\"details\":{\"name\":\"Delila Beer\",\"first_txn_date\":\"2021-05-19\",\"updated_by\":{\"user\":\"IzB5ksXu\",\"time\":\"2023-01-26T20:47:26.389Z\"}},\"transactions\":[{\"txn_date\":\"2021-10-01\",\"amount\":80995.23818711648},{\"txn_date\":\"2021-05-19\",\"amount\":92572.40049217848},{\"txn_date\":\"2021-12-11\",\"amount\":99398.79832225188}]}\n{\"account_id\":\"ACC62505420\",\"year\":2023,\"amount\":96125.3125884202,\"details\":{\"name\":\"Shawn Goodwin\",\"updated_by\":{\"user\":\"F3dqIvYp2pFtena4\",\"time\":\"2023-02-11T04:38:29.832Z\"}},\"transactions\":[]}\n

Let's also check if there is a corresponding record in the CSV file.

$ cat docker/sample/csv/account/part-0000* | grep ACC03093143\nACC03093143,2023,Nadine Heidenreich Jr.,\"{\\\"account_id\\\":\\\"ACC03093143\\\",\\\"year\\\":2023,\\\"amount\\\":87990.37196728592,\\\"details\\\":{\\\"name\\\":\\\"Nadine Heidenreich Jr.\\\",\\\"first_txn_date\\\":\\\"2021-11-09\\\",\\\"updated_by\\\":{\\\"user\\\":\\\"YfEyJCe8ohrl0j IfyT\\\",\\\"time\\\":\\\"2022-09-26T20:47:53.404Z\\\"}},\\\"transactions\\\":[{\\\"txn_date\\\":\\\"2021-11-09\\\",\\\"amount\\\":97073.7914706189}]}\"\n

Great! The account, year, name and payload look to all match up.

"},{"location":"setup/guide/scenario/batch-and-event/#additional-topics","title":"Additional Topics","text":""},{"location":"setup/guide/scenario/batch-and-event/#order-of-execution","title":"Order of execution","text":"

You may notice that the events are generated first, then the CSV file. This is because as part of the execute function, we passed in the kafkaTask first, before the csvTask. You can change the order of execution by passing in csvTask before kafkaTask into the execute function.

"},{"location":"setup/guide/scenario/delete-generated-data/","title":"Delete Generated Data","text":"

Info

Delete generated data is a paid feature. Try the free trial here.

Creating a data generator for Postgres and delete the generated data after using it.

"},{"location":"setup/guide/scenario/delete-generated-data/#requirements","title":"Requirements","text":""},{"location":"setup/guide/scenario/delete-generated-data/#get-started","title":"Get Started","text":"

First, we will clone the data-caterer-example repo which will already have the base project setup required.

git clone git@github.com:pflooky/data-caterer-example.git\n
"},{"location":"setup/guide/scenario/delete-generated-data/#plan-setup","title":"Plan Setup","text":"

Create a new Java or Scala class.

Make sure your class extends PlanRun.

JavaScala
import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedDeleteJavaPlanRun extends PlanRun {\n{\nvar autoRun = configuration()\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\")  (1)\n.enableGeneratePlanAndTasks(true)                                                 (2)\n.enableRecordTracking(true)                                                       (3)\n.enableDeleteGeneratedRecords(false)                                              (4)\n.enableUniqueCheck(true)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\")                        (5)\n.recordTrackingFolderPath(\"/opt/app/data/recordTracking\")                         (6)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(autoRun);\n}\n}\n
import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedDeletePlanRun extends PlanRun {\n\nval autoRun = configuration\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\")  (1)\n.enableGeneratePlanAndTasks(true)                                                 (2)\n.enableRecordTracking(true)                                                       (3)\n.enableDeleteGeneratedRecords(false)                                              (4)\n.enableUniqueCheck(true)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\")                        (5)\n.recordTrackingFolderPath(\"/opt/app/data/recordTracking\")                         (6)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(configuration = autoRun)\n}\n

In the above code we note the following:

  1. We have defined a Postgres connection called my_postgres
  2. enableGeneratePlanAndTasks is enabled to auto generate data for all tables under customer database
  3. enableRecordTracking is enabled to ensure that all generated records are tracked. This will get used when we want to delete data afterwards
  4. enableDeleteGeneratedRecords is disabled for now. We want to see the generated data first and delete sometime after
  5. generatedPlanAndTaskFolderPath is the folder path where we saved the metadata we have gathered from my_postgres
  6. recordTrackingFolderPath is the folder path where record tracking is maintained. We need to persist this data to ensure it is still available when we want to delete data
"},{"location":"setup/guide/scenario/delete-generated-data/#postgres-setup","title":"Postgres Setup","text":"

If you don't have your own Postgres up and running, you can set up and run an instance configured in the docker folder via.

cd docker\ndocker-compose up -d postgres\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c '\\dt+ account.*'\n

This will create the tables found under docker/data/sql/postgres/customer.sql. You can change this file to contain your own tables. We can see there are 4 tables created for us, accounts, balances, transactions and mapping.

"},{"location":"setup/guide/scenario/delete-generated-data/#run","title":"Run","text":"

Let's try run.

cd ..\n./run.sh\n#input class MyAdvancedDeleteJavaPlanRun or MyAdvancedDeletePlanRun\n#after completing\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select * from account.accounts limit 1'\n

It should look something like this.

   id   | account_number  | account_status | created_by | created_by_fixed_length | customer_id_int | customer_id_smallint | customer_id_bigint |   customer_id_decimal    | customer_id_real | customer_id_double | open_date  |     open_timestamp      | last_opened_time |                                                           payload_bytes\n--------+-----------------+----------------+------------+-------------------------+-----------------+----------------------+--------------------+--------------------------+------------------+--------------------+------------+-------------------------+------------------+------------------------------------------------------------------------------------------------------------------------------------\n 100414 | 5uROOVOUyQUbubN | h3H            | SfA0eZJcTm | CuRw                    |              13 |                   42 |               6041 | 76987.745612542900000000 |         91866.78 |  66400.37433202339 | 2023-03-05 | 2023-08-14 11:33:11.343 | 23:58:01.736     | \\x604d315d4547616e6a233050415373317274736f5e682d516132524f3d23233c37463463322f342d34376d597e665d6b3d395b4238284028622b7d6d2b4f5042\n(1 row)\n

The data that gets inserted will follow the foreign keys that are defined within Postgres and also ensure the insertion order is correct.

Check the number of records via:

docker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select count(1) from account.accounts'\n#open report under docker/sample/report/index.html\n
"},{"location":"setup/guide/scenario/delete-generated-data/#delete","title":"Delete","text":"

We are now at a stage where we want to delete the data that was generated. All we need to do is flip two flags.

.enableDeleteGeneratedRecords(true)\n.enableGenerateData(false)  //we need to explicitly disable generating data\n

Enable delete generated records and disable generating data.

Before we run again, let us insert a record manually to see if that data will survive after running the job to delete the generated data.

docker exec docker-postgresserver-1 psql -Upostgres -d customer -c \"insert into account.accounts (account_number) values ('my_account_number')\"\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c \"select count(1) from account.accounts\"\n

We now should have 1001 records in our account.accounts table. Let's delete the generated data now.

./run.sh\n#input class MyAdvancedDeleteJavaPlanRun or MyAdvancedDeletePlanRun\n#after completing\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select * from account.accounts limit 1'\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select count(1) from account.accounts'\n

You should see that only 1 record is left, the one that we manually inserted. Great, now we can generate data reliably and also be able to clean it up.

"},{"location":"setup/guide/scenario/delete-generated-data/#additional-topics","title":"Additional Topics","text":""},{"location":"setup/guide/scenario/delete-generated-data/#one-class-for-generating-another-for-deleting","title":"One class for generating, another for deleting?","text":"

Yes, this is possible. There are two requirements: - the connection names used need to be the same across both classes - recordTrackingFolderPath needs to be set to the same value

"},{"location":"setup/guide/scenario/delete-generated-data/#define-record-count","title":"Define record count","text":"

You can control the record count per sub data source via numRecordsPerStep.

JavaScala
var autoRun = configuration()\n...\n.numRecordsPerStep(100)\n\nexecute(autoRun)\n
val autoRun = configuration\n...\n.numRecordsPerStep(100)\n\nexecute(configuration = autoRun)\n
"},{"location":"setup/guide/scenario/first-data-generation/","title":"First Data Generation","text":"

Creating a data generator for a CSV file.

"},{"location":"setup/guide/scenario/first-data-generation/#requirements","title":"Requirements","text":""},{"location":"setup/guide/scenario/first-data-generation/#get-started","title":"Get Started","text":"

First, we will clone the data-caterer-example repo which will already have the base project setup required.

git clone git@github.com:pflooky/data-caterer-example.git\n
"},{"location":"setup/guide/scenario/first-data-generation/#plan-setup","title":"Plan Setup","text":"

Create a new Java or Scala class.

Make sure your class extends PlanRun.

JavaScala
import com.github.pflooky.datacaterer.java.api.PlanRun;\n\npublic class MyCsvJavaPlan extends PlanRun {\n}\n
import com.github.pflooky.datacaterer.api.PlanRun\n\nclass MyCsvPlan extends PlanRun {\n}\n

This class defines where we need to define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.

"},{"location":"setup/guide/scenario/first-data-generation/#connection-configuration","title":"Connection Configuration","text":"

When dealing with CSV files, we need to define a path for our generated CSV files to be saved at, along with any other high level configurations.

JavaScala
csv(\n\"customer_accounts\",              //name\n\"/opt/app/data/customer/account\", //path\nMap.of(\"header\", \"true\")          //optional additional options\n)\n

Other additional options for CSV can be found here

csv(\n\"customer_accounts\",              //name\n\"/opt/app/data/customer/account\", //path\nMap(\"header\" -> \"true\")           //optional additional options\n)\n

Other additional options for CSV can be found here

"},{"location":"setup/guide/scenario/first-data-generation/#schema","title":"Schema","text":"

Our CSV file that we generate should adhere to a defined schema where we can also define data types.

Let's define each field along with their corresponding data type. You will notice that the string fields do not have a data type defined. This is because the default data type is StringType.

JavaScala
var accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"balance\").type(DoubleType.instance()),\nfield().name(\"created_by\"),\nfield().name(\"name\"),\nfield().name(\"open_time\").type(TimestampType.instance()),\nfield().name(\"status\")\n);\n
val accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"balance\").`type`(DoubleType),\nfield.name(\"created_by\"),\nfield.name(\"name\"),\nfield.name(\"open_time\").`type`(TimestampType),\nfield.name(\"status\")\n)\n
"},{"location":"setup/guide/scenario/first-data-generation/#field-metadata","title":"Field Metadata","text":"

We could stop here and generate random data for the accounts table. But wouldn't it be more useful if we produced data that is closer to the structure of the data that would come in production? We can do this by defining various metadata attributes that add guidelines that the data generator will understand when generating data.

"},{"location":"setup/guide/scenario/first-data-generation/#account_id","title":"account_id","text":" JavaScala
field().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\n
field.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\n
"},{"location":"setup/guide/scenario/first-data-generation/#balance","title":"balance","text":" JavaScala
field().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\n
field.name(\"balance\").`type`(DoubleType).min(1).max(1000),\n
"},{"location":"setup/guide/scenario/first-data-generation/#name","title":"name","text":" JavaScala
field().name(\"name\").expression(\"#{Name.name}\"),\n
field.name(\"name\").expression(\"#{Name.name}\"),\n
"},{"location":"setup/guide/scenario/first-data-generation/#open_time","title":"open_time","text":" JavaScala
field().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\n
field.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\n
"},{"location":"setup/guide/scenario/first-data-generation/#status","title":"status","text":" JavaScala
field().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n
field.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n
"},{"location":"setup/guide/scenario/first-data-generation/#created_by","title":"created_by","text":" JavaScala
field().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n
field.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n

Putting it all the fields together, our class should now look like this.

JavaScala
var accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n
val accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield.name(\"balance\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n
"},{"location":"setup/guide/scenario/first-data-generation/#record-count","title":"Record Count","text":"

We only want to generate 100 records, so that we can see what the output looks like. This is controlled at the accountTask level like below. If you want to generate more records, set it to the value you want.

JavaScala
var accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().records(100));\n
val accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -> \"true\"))\n.schema(\n...\n)\n.count(count.records(100))\n
"},{"location":"setup/guide/scenario/first-data-generation/#additional-configurations","title":"Additional Configurations","text":"

At the end of data generation, a report gets generated that summarises the actions it performed. We can control the output folder of that report via configurations. We will also enable the unique check to ensure any unique fields will have unique values generated.

JavaScala
var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n
val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n
"},{"location":"setup/guide/scenario/first-data-generation/#execute","title":"Execute","text":"

To tell Data Caterer that we want to run with the configurations along with the accountTask, we have to call execute . So our full plan run will look like this.

JavaScala
public class MyCsvJavaPlan extends PlanRun {\n{\nvar accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nexecute(config, accountTask);\n}\n}\n
class MyCsvPlan extends PlanRun {\n\nval accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield.name(\"balance\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n\nexecute(config, accountTask)\n}\n
"},{"location":"setup/guide/scenario/first-data-generation/#run","title":"Run","text":"

Now we can run via the script ./run.sh that is in the top level directory of the data-caterer-example to run the class we just created.

./run.sh\n#input class MyCsvJavaPlan or MyCsvPlan\n#after completing\nhead docker/sample/customer/account/part-00000*\n

Your output should look like this.

account_id,balance,created_by,name,open_time,status\nACC06192462,853.9843359645766,eod,Hoyt Kertzmann MD,2023-07-22T11:17:01.713Z,closed\nACC15350419,632.5969895326234,eod,Dr. Claude White,2022-12-13T21:57:56.840Z,open\nACC25134369,592.0958847218986,eod,Fabian Rolfson,2023-04-26T04:54:41.068Z,open\nACC48021786,656.6413439322964,eod,Dewayne Stroman,2023-05-17T06:31:27.603Z,open\nACC26705211,447.2850352884595,event,Garrett Funk,2023-07-14T03:50:22.746Z,pending\nACC03150585,750.4568929015996,event,Natisha Reichel,2023-04-11T11:13:10.080Z,suspended\nACC29834210,686.4257811608622,event,Gisele Ondricka,2022-11-15T22:09:41.172Z,suspended\nACC39373863,583.5110618128994,event,Thaddeus Ortiz,2022-09-30T06:33:57.193Z,suspended\nACC39405798,989.2623959059525,eod,Shelby Reinger,2022-10-23T17:29:17.564Z,open\n

Also check the HTML report, found at docker/sample/report/index.html, that gets generated to get an overview of what was executed.

"},{"location":"setup/guide/scenario/first-data-generation/#join-with-another-csv","title":"Join With Another CSV","text":"

Now that we have generated some accounts, let's also try to generate a set of transactions for those accounts in CSV format as well. The transactions could be in any other format, but to keep this simple, we will continue using CSV.

We can define our schema the same way along with any additional metadata.

JavaScala
var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(100),\nfield().name(\"time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"date\").type(DateType.instance()).sql(\"DATE(time)\")\n);\n
val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(100),\nfield.name(\"time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"date\").`type`(DateType).sql(\"DATE(time)\")\n)\n
"},{"location":"setup/guide/scenario/first-data-generation/#records-per-column","title":"Records Per Column","text":"

Usually, for a given account_id, full_name, there should be multiple records for it as we want to simulate a customer having multiple transactions. We can achieve this through defining the number of records to generate in the count function.

JavaScala
var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumn(5, \"account_id\", \"full_name\"));\n
val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumn(5, \"account_id\", \"full_name\"))\n
"},{"location":"setup/guide/scenario/first-data-generation/#random-records-per-column","title":"Random Records Per Column","text":"

Above, you will notice that we are generating 5 records per account_id, full_name. This is okay but still not quite reflective of the real world. Sometimes, people have accounts with no transactions in them, or they could have many. We can accommodate for this via defining a random number of records per column.

JavaScala
var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumnGenerator(generator().min(0).max(5), \"account_id\", \"full_name\"));\n
val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumnGenerator(generator.min(0).max(5), \"account_id\", \"full_name\"))\n

Here we set the minimum number of records per column to be 0 and the maximum to 5.

"},{"location":"setup/guide/scenario/first-data-generation/#foreign-key","title":"Foreign Key","text":"

In this scenario, we want to match the account_id in account to match the same column values in transaction. We also want to match name in account to full_name in transaction. This can be done via plan configuration like below.

JavaScala
var myPlan = plan().addForeignKeyRelationship(\naccountTask, List.of(\"account_id\", \"name\"), //the task and columns we want linked\nList.of(Map.entry(transactionTask, List.of(\"account_id\", \"full_name\"))) //list of other tasks and their respective column names we want matched\n);\n
val myPlan = plan.addForeignKeyRelationship(\naccountTask, List(\"account_id\", \"name\"),  //the task and columns we want linked\nList(transactionTask -> List(\"account_id\", \"full_name\"))  //list of other tasks and their respective column names we want matched\n)\n

Now, stitching it all together for the execute function, our final plan should look like this.

JavaScala
public class MyCsvJavaPlan extends PlanRun {\n{\nvar accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n.count(count().records(100));\n\nvar transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(100),\nfield().name(\"time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"date\").type(DateType.instance()).sql(\"DATE(time)\")\n)\n.count(count().recordsPerColumnGenerator(generator().min(0).max(5), \"account_id\", \"full_name\"));\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nvar myPlan = plan().addForeignKeyRelationship(\naccountTask, List.of(\"account_id\", \"name\"),\nList.of(Map.entry(transactionTask, List.of(\"account_id\", \"full_name\")))\n);\n\nexecute(myPlan, config, accountTask, transactionTask);\n}\n}\n
class MyCsvPlan extends PlanRun {\n\nval accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield.name(\"balance\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n.count(count.records(100))\n\nval transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(100),\nfield.name(\"time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"date\").`type`(DateType).sql(\"DATE(time)\")\n)\n.count(count.recordsPerColumnGenerator(generator.min(0).max(5), \"account_id\", \"full_name\"))\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n\nval myPlan = plan.addForeignKeyRelationship(\naccountTask, List(\"account_id\", \"name\"),\nList(transactionTask -> List(\"account_id\", \"full_name\"))\n)\n\nexecute(myPlan, config, accountTask, transactionTask)\n}\n

Let's try run again.

#clean up old data\nrm -rf docker/sample/customer/account\n./run.sh\n#input class MyCsvJavaPlan or MyCsvPlan\n#after completing, let's pick an account and check the transactions for that account\naccount=$(tail -1 docker/sample/customer/account/part-00000* | awk -F \",\" '{print $1 \",\" $4}')\necho $account\ncat docker/sample/customer/transaction/part-00000* | grep $account\n

It should look something like this.

ACC29117767,Willodean Sauer\nACC29117767,Willodean Sauer,84.99145871948083,2023-05-14T09:55:51.439Z,2023-05-14\nACC29117767,Willodean Sauer,58.89345733567232,2022-11-22T07:38:20.143Z,2022-11-22\n

Congratulations! You have now made a data generator that has simulated a real world data scenario. You can check the DocumentationJavaPlanRun.java or DocumentationPlanRun.scala files as well to check that your plan is the same.

We can now look to consume this CSV data from a job or service. Usually, once we have consumed the data, we would also want to check and validate that our consumer has correctly ingested the data.

"},{"location":"setup/guide/scenario/first-data-generation/#validate","title":"Validate","text":"

In this scenario, our consumer will read in the CSV file, do some transformations, and then save the data to Postgres. Let's try to configure data validations for the data that gets pushed into Postgres.

"},{"location":"setup/guide/scenario/first-data-generation/#postgres-setup","title":"Postgres setup","text":"

First, we define our connection properties for Postgres. You can check out the full options available here.

JavaScala
var postgresValidateTask = postgres(\n\"my_postgres\",                                          //connection name\n\"jdbc:postgresql://host.docker.internal:5432/customer\", //url\n\"postgres\",                                             //username\n\"password\"                                              //password\n).table(\"account\", \"transactions\");\n
val postgresValidateTask = postgres(\n\"my_postgres\",                                          //connection name\n\"jdbc:postgresql://host.docker.internal:5432/customer\", //url\n\"postgres\",                                             //username\n\"password\"                                              //password\n).table(\"account\", \"transactions\")\n

We can connect and access the data inside the table account.transactions. Now to define our data validations.

"},{"location":"setup/guide/scenario/first-data-generation/#validations","title":"Validations","text":"

For full information about validation options and configurations, check here. Below, we have an example that should give you a good understanding of what validations are possible.

JavaScala
var postgresValidateTask = postgres(...)\n.table(\"account\", \"transactions\")\n.validations(\nvalidation().col(\"account_id\").isNotNull(),\nvalidation().col(\"name\").matches(\"[A-Z][a-z]+ [A-Z][a-z]+\").errorThreshold(0.2).description(\"Some names have different formats\"),\nvalidation().col(\"balance\").greaterThanOrEqual(0).errorThreshold(10).description(\"Account can have negative balance if overdraft\"),\nvalidation().expr(\"CASE WHEN status == 'closed' THEN isNotNull(close_date) ELSE isNull(close_date) END\"),\nvalidation().unique(\"account_id\", \"name\"),\nvalidation().groupBy(\"account_id\", \"name\").max(\"login_retry\").lessThan(10)\n);\n
val postgresValidateTask = postgres(...)\n.table(\"account\", \"transactions\")\n.validations(\nvalidation.col(\"account_id\").isNotNull,\nvalidation.col(\"name\").matches(\"[A-Z][a-z]+ [A-Z][a-z]+\").errorThreshold(0.2).description(\"Some names have different formats\"),\nvalidation.col(\"balance\").greaterThanOrEqual(0).errorThreshold(10).description(\"Account can have negative balance if overdraft\"),\nvalidation.expr(\"CASE WHEN status == 'closed' THEN isNotNull(close_date) ELSE isNull(close_date) END\"),\nvalidation.unique(\"account_id\", \"name\"),\nvalidation.groupBy(\"account_id\", \"name\").max(\"login_retry\").lessThan(10)\n)\n
"},{"location":"setup/guide/scenario/first-data-generation/#name_1","title":"name","text":"

For all values in the name column, we check if they match the regex [A-Z][a-z]+ [A-Z][a-z]+. As we know in the real world, names do not always follow the same pattern, so we allow for an errorThreshold before marking the validation as failed. Here, we define the errorThreshold to be 0.2, which means, if the error percentage is greater than 20%, then fail the validation. We also append on a helpful description so other developers/users can understand the context of the validation.

"},{"location":"setup/guide/scenario/first-data-generation/#balance_1","title":"balance","text":"

We check that all balance values are greater than or equal to 0. This time, we have a slightly different errorThreshold as it is set to 10, which means, if the number of errors is greater than 10, then fail the validation.

"},{"location":"setup/guide/scenario/first-data-generation/#expr","title":"expr","text":"

Sometimes, we may need to include the values of multiple columns to validate a certain condition. This is where we can use expr to define a SQL expression that returns a boolean. In this scenario, we are checking if the status column has value closed, then the close_date should be not null, otherwise, close_date is null.

"},{"location":"setup/guide/scenario/first-data-generation/#unique","title":"unique","text":"

We check whether the combination of account_id and name are unique within the dataset. You can define one or more columns for unique validations.

"},{"location":"setup/guide/scenario/first-data-generation/#groupby","title":"groupBy","text":"

There may be some business rule that states the number of login_retry should be less than 10 for each account. We can check this via a group by validation where we group by the account_id, name, take the maximum value for login_retry per account_id,name combination, then check if it is less than 10.

You can now look to play around with other configurations or data sources to meet your needs. Also, make sure to explore the docs further as it can guide you on what can be configured.

"},{"location":"setup/guide/scenario/records-per-column/","title":"Multiple Records Per Column","text":"

Creating a data generator for a CSV file where there are multiple records per column values.

"},{"location":"setup/guide/scenario/records-per-column/#requirements","title":"Requirements","text":""},{"location":"setup/guide/scenario/records-per-column/#get-started","title":"Get Started","text":"

First, we will clone the data-caterer-example repo which will already have the base project setup required.

git clone git@github.com:pflooky/data-caterer-example.git\n
"},{"location":"setup/guide/scenario/records-per-column/#plan-setup","title":"Plan Setup","text":"

Create a new Java or Scala class.

Make sure your class extends PlanRun.

JavaScala
import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyMultipleRecordsPerColJavaPlan extends PlanRun {\n{\nvar transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(100),\nfield().name(\"time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"date\").type(DateType.instance()).sql(\"DATE(time)\")\n)\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nexecute(config, transactionTask);\n}\n}\n
import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyMultipleRecordsPerColPlan extends PlanRun {\n\nval transactionTask: ConnectionTaskBuilder[FileBuilder] = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"), field.name(\"full_name\").expression(\"#{Name.name}\"), field.name(\"amount\").`type`(DoubleType.instance).min(1).max(100),\nfield.name(\"time\").`type`(TimestampType.instance).min(java.sql.Date.valueOf(\"2022-01-01\")), field.name(\"date\").`type`(DateType.instance).sql(\"DATE(time)\")\n)\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(config, transactionTask)\n}\n
"},{"location":"setup/guide/scenario/records-per-column/#record-count","title":"Record Count","text":"

By default, tasks will generate 1000 records. You can alter this value via the count configuration which can be applied to individual tasks. For example, in Scala, csv(...).count(count.records(100)) to generate only 100 records.

"},{"location":"setup/guide/scenario/records-per-column/#records-per-column","title":"Records Per Column","text":"

In this scenario, for a given account_id, full_name, there should be multiple records for it as we want to simulate a customer having multiple transactions. We can achieve this through defining the number of records to generate in the count function.

JavaScala
var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumn(5, \"account_id\", \"full_name\"));\n
val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumn(5, \"account_id\", \"full_name\"))\n

This will generate 1000 * 5 = 5000 records as the default number of records is set (1000) and per account_id, full_name from the initial 1000 records, 5 records will be generated.

"},{"location":"setup/guide/scenario/records-per-column/#random-records-per-column","title":"Random Records Per Column","text":"

Generating 5 records per column is okay but still not quite reflective of the real world. Sometimes, people have accounts with no transactions in them, or they could have many. We can accommodate for this via defining a random number of records per column.

JavaScala
var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumnGenerator(generator().min(0).max(5), \"account_id\", \"full_name\"));\n
val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumnGenerator(generator.min(0).max(5), \"account_id\", \"full_name\"))\n

Here we set the minimum number of records per column to be 0 and the maximum to 5. This will follow a uniform distribution so the average number of records per account is 2.5. We could also define other metadata, just like we did with fields, when defining the generator. For example, we could set standardDeviation and mean for the number of records generated per column to follow a normal distribution.

"},{"location":"setup/guide/scenario/records-per-column/#run","title":"Run","text":"

Let's try run.

#clean up old data\nrm -rf docker/sample/customer/account\n./run.sh\n#input class MyMultipleRecordsPerColJavaPlan or MyMultipleRecordsPerColPlan\n#after completing\nhead docker/sample/customer/transaction/part-00000*\n

It should look something like this.

ACC29117767,Willodean Sauer\nACC29117767,Willodean Sauer,84.99145871948083,2023-05-14T09:55:51.439Z,2023-05-14\nACC29117767,Willodean Sauer,58.89345733567232,2022-11-22T07:38:20.143Z,2022-11-22\n

You can now look to play around with other count configurations found here.

"},{"location":"setup/validation/basic-validation/","title":"Basic Validations","text":"

Run validations on a column to ensure the values adhere to your requirement. Can be set to complex validation logic via SQL expression as well if needed (see here).

"},{"location":"setup/validation/basic-validation/#equal","title":"Equal","text":"

Ensure all data in column is equal to certain value. Value can be of any data type.

JavaScalaYAML
validation().col(\"year\").isEqual(2021)\n
validation.col(\"year\").isEqual(2021)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"year == 2021\"\n
"},{"location":"setup/validation/basic-validation/#not-equal","title":"Not Equal","text":"

Ensure all data in column is not equal to certain value. Value can be of any data type.

JavaScalaYAML
validation().col(\"year\").isNotEqual(2021)\n
validation.col(\"year\").isNotEqual(2021)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"year != 2021\"\n
"},{"location":"setup/validation/basic-validation/#null","title":"Null","text":"

Ensure all data in column is null.

JavaScalaYAML
validation().col(\"year\").isNull()\n
validation.col(\"year\").isNull\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"ISNULL(year)\"\n
"},{"location":"setup/validation/basic-validation/#not-null","title":"Not Null","text":"

Ensure all data in column is not null.

JavaScalaYAML
validation().col(\"year\").isNotNull()\n
validation.col(\"year\").isNotNull\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"ISNOTNULL(year)\"\n
"},{"location":"setup/validation/basic-validation/#contains","title":"Contains","text":"

Ensure all data in column is contains certain string. Column has to have type string.

JavaScalaYAML
validation().col(\"name\").contains(\"peter\")\n
validation.col(\"name\").contains(\"peter\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"CONTAINS(name, 'peter')\"\n
"},{"location":"setup/validation/basic-validation/#not-contains","title":"Not Contains","text":"

Ensure all data in column does not contain certain string. Column has to have type string.

JavaScalaYAML
validation().col(\"name\").notContains(\"peter\")\n
validation.col(\"name\").notContains(\"peter\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!CONTAINS(name, 'peter')\"\n
"},{"location":"setup/validation/basic-validation/#unqiue","title":"Unqiue","text":"

Ensure all data in column is unique.

JavaScalaYAML
validation().unique(\"account_id\", \"name\")\n
validation.unique(\"account_id\", \"name\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- unique: [\"account_id\", \"name\"]\n
"},{"location":"setup/validation/basic-validation/#less-than","title":"Less Than","text":"

Ensure all data in column is less than certain value.

JavaScalaYAML
validation().col(\"amount\").lessThan(100)\n
validation.col(\"amount\").lessThan(100)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount < 100\"\n
"},{"location":"setup/validation/basic-validation/#less-than-or-equal","title":"Less Than Or Equal","text":"

Ensure all data in column is less than or equal to certain value.

JavaScalaYAML
validation().col(\"amount\").lessThanOrEqual(100)\n
validation.col(\"amount\").lessThanOrEqual(100)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount <= 100\"\n
"},{"location":"setup/validation/basic-validation/#greater-than","title":"Greater Than","text":"

Ensure all data in column is greater than certain value.

JavaScalaYAML
validation().col(\"amount\").greaterThan(100)\n
validation.col(\"amount\").greaterThan(100)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount > 100\"\n
"},{"location":"setup/validation/basic-validation/#greater-than-or-equal","title":"Greater Than Or Equal","text":"

Ensure all data in column is greater than or equal to certain value.

JavaScalaYAML
validation().col(\"amount\").greaterThanOrEqual(100)\n
validation.col(\"amount\").greaterThanOrEqual(100)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount >= 100\"\n
"},{"location":"setup/validation/basic-validation/#between","title":"Between","text":"

Ensure all data in column is between two values.

JavaScalaYAML
validation().col(\"amount\").between(100, 200)\n
validation.col(\"amount\").between(100, 200)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount BETWEEN 100 AND 200\"\n
"},{"location":"setup/validation/basic-validation/#not-between","title":"Not Between","text":"

Ensure all data in column is not between two values.

JavaScalaYAML
validation().col(\"amount\").notBetween(100, 200)\n
validation.col(\"amount\").notBetween(100, 200)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount NOT BETWEEN 100 AND 200\"\n
"},{"location":"setup/validation/basic-validation/#in","title":"In","text":"

Ensure all data in column is in set of defined values.

JavaScalaYAML
validation().col(\"status\").in(\"open\", \"closed\")\n
validation.col(\"status\").in(\"open\", \"closed\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"status IN ('open', 'closed')\"\n
"},{"location":"setup/validation/basic-validation/#matches","title":"Matches","text":"

Ensure all data in column matches certain regex expression.

JavaScalaYAML
validation().col(\"account_id\").matches(\"ACC[0-9]{8}\")\n
validation.col(\"account_id\").matches(\"ACC[0-9]{8}\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"REGEXP(account_id, ACC[0-9]{8})\"\n
"},{"location":"setup/validation/basic-validation/#not-matches","title":"Not Matches","text":"

Ensure all data in column does not match certain regex expression.

JavaScalaYAML
validation().col(\"account_id\").notMatches(\"^acc.*\")\n
validation.col(\"account_id\").notMatches(\"^acc.*\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!REGEXP(account_id, '^acc.*')\"\n
"},{"location":"setup/validation/basic-validation/#starts-with","title":"Starts With","text":"

Ensure all data in column starts with certain string. Column has to have type string.

JavaScalaYAML
validation().col(\"account_id\").startsWith(\"ACC\")\n
validation.col(\"account_id\").startsWith(\"ACC\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"STARTSWITH(account_id, 'ACC')\"\n
"},{"location":"setup/validation/basic-validation/#not-starts-with","title":"Not Starts With","text":"

Ensure all data in column does not start with certain string. Column has to have type string.

JavaScalaYAML
validation().col(\"account_id\").notStartsWith(\"ACC\")\n
validation.col(\"account_id\").notStartsWith(\"ACC\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!STARTSWITH(account_id, 'ACC')\"\n
"},{"location":"setup/validation/basic-validation/#ends-with","title":"Ends With","text":"

Ensure all data in column ends with certain string. Column has to have type string.

JavaScalaYAML
validation().col(\"account_id\").endsWith(\"ACC\")\n
validation.col(\"account_id\").endsWith(\"ACC\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"ENDWITH(account_id, 'ACC')\"\n
"},{"location":"setup/validation/basic-validation/#not-ends-with","title":"Not Ends With","text":"

Ensure all data in column does not end with certain string. Column has to have type string.

JavaScalaYAML
validation().col(\"account_id\").notEndsWith(\"ACC\")\n
validation.col(\"account_id\").notEndsWith(\"ACC\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!ENDWITH(account_id, 'ACC')\"\n
"},{"location":"setup/validation/basic-validation/#size","title":"Size","text":"

Ensure all data in column has certain size. Column has to have type array or map.

JavaScalaYAML
validation().col(\"transactions\").size(5)\n
validation.col(\"transactions\").size(5)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions, 5)\"\n
"},{"location":"setup/validation/basic-validation/#not-size","title":"Not Size","text":"

Ensure all data in column does not have certain size. Column has to have type array or map.

JavaScalaYAML
validation().col(\"transactions\").notSize(5)\n
validation.col(\"transactions\").notSize(5)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) != 5\"\n
"},{"location":"setup/validation/basic-validation/#less-than-size","title":"Less Than Size","text":"

Ensure all data in column has size less than certain value. Column has to have type array or map.

JavaScalaYAML
validation().col(\"transactions\").lessThanSize(5)\n
validation.col(\"transactions\").lessThanSize(5)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) < 5\"\n
"},{"location":"setup/validation/basic-validation/#less-than-or-equal-size","title":"Less Than Or Equal Size","text":"

Ensure all data in column has size less than or equal to certain value. Column has to have type array or map.

JavaScalaYAML
validation().col(\"transactions\").lessThanOrEqualSize(5)\n
validation.col(\"transactions\").lessThanOrEqualSize(5)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) <= 5\"\n
"},{"location":"setup/validation/basic-validation/#greater-than-size","title":"Greater Than Size","text":"

Ensure all data in column has size greater than certain value. Column has to have type array or map.

JavaScalaYAML
validation().col(\"transactions\").greaterThanSize(5)\n
validation.col(\"transactions\").greaterThanSize(5)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) > 5\"\n
"},{"location":"setup/validation/basic-validation/#greater-than-or-equal-size","title":"Greater Than Or Equal Size","text":"

Ensure all data in column has size greater than or equal to certain value. Column has to have type array or map.

JavaScalaYAML
validation().col(\"transactions\").greaterThanOrEqualSize(5)\n
validation.col(\"transactions\").greaterThanOrEqualSize(5)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) >= 5\"\n
"},{"location":"setup/validation/basic-validation/#luhn-check","title":"Luhn Check","text":"

Ensure all data in column passes luhn check. Luhn check is used to validate credit card numbers and certain identification numbers (see here for more details).

JavaScalaYAML
validation().col(\"credit_card\").luhnCheck()\n
validation.col(\"credit_card\").luhnCheck\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"LUHN_CHECK(credit_card)\"\n
"},{"location":"setup/validation/basic-validation/#has-type","title":"Has Type","text":"

Ensure all data in column has certain data type.

JavaScalaYAML
validation().col(\"id\").hasType(\"string\")\n
validation.col(\"id\").hasType(\"string\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"TYPEOF(id) == 'string'\"\n
"},{"location":"setup/validation/basic-validation/#expression","title":"Expression","text":"

Ensure all data in column adheres to SQL expression defined that returns back a boolean. You can define complex logic in here that could combine multiple columns.

For example, CASE WHEN status == 'open' THEN balance > 0 ELSE balance == 0 END would check all rows with status open to have balance greater than 0, otherwise, check the balance is 0.

JavaScalaYAML
var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validations(\nvalidation().expr(\"amount < 100\"),\nvalidation().expr(\"year == 2021\").errorThreshold(0.1),  //equivalent to if error percentage is > 10%, then fail\nvalidation().expr(\"REGEXP_LIKE(name, 'Peter .*')\").errorThreshold(200)  //equivalent to if number of errors is > 200, then fail\n);\n\nvar conf = configuration().enableValidation(true);\n
val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validations(\nvalidation.expr(\"amount < 100\"),\nvalidation.expr(\"year == 2021\").errorThreshold(0.1),  //equivalent to if error percentage is > 10%, then fail\nvalidation.expr(\"REGEXP_LIKE(name, 'Peter .*')\").errorThreshold(200)  //equivalent to if number of errors is > 200, then fail\n)\n\nval conf = configuration.enableValidation(true)\n
---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nvalidations:\n- expr: \"amount < 100\"\n- expr: \"year == 2021\"\nerrorThreshold: 0.1   #equivalent to if error percentage is > 10%, then fail\n- expr: \"REGEXP_LIKE(name, 'Peter .*')\"\nerrorThreshold: 200   #equivalent to if number of errors is > 200, then fail\ndescription: \"Should be lots of Peters\"\n\n#enableValidation inside application.conf\n
"},{"location":"setup/validation/group-by-validation/","title":"Group By Validation","text":"

If you want to run aggregations based on a particular set of columns, you can do so via group by validations. An example would be checking that the sum of amount is less than 1000 per account_id, year. The validations applied can be one of the validations from above.

"},{"location":"setup/validation/group-by-validation/#sum","title":"Sum","text":"

Check the sum of a columns values for each group adheres to validation.

JavaScala
validation().groupBy(\"account_id\", \"year\").sum(\"amount\").lessThan(1000)\n
validation.groupBy(\"account_id\", \"year\").sum(\"amount\").lessThan(1000)\n
"},{"location":"setup/validation/group-by-validation/#count","title":"Count","text":"

Check the count for each group adheres to validation.

JavaScala
validation().groupBy(\"account_id\", \"year\").count(\"amount\").lessThan(10)\n
validation.groupBy(\"account_id\", \"year\").count(\"amount\").lessThan(10)\n
"},{"location":"setup/validation/group-by-validation/#min","title":"Min","text":"

Check the min for each group adheres to validation.

JavaScala
validation().groupBy(\"account_id\", \"year\").min(\"amount\").greaterThan(0)\n
validation.groupBy(\"account_id\", \"year\").min(\"amount\").greaterThan(0)\n
"},{"location":"setup/validation/group-by-validation/#max","title":"Max","text":"

Check the max for each group adheres to validation.

JavaScala
validation().groupBy(\"account_id\", \"year\").max(\"amount\").lessThanOrEqual(100)\n
validation.groupBy(\"account_id\", \"year\").max(\"amount\").lessThanOrEqual(100)\n
"},{"location":"setup/validation/group-by-validation/#average","title":"Average","text":"

Check the average for each group adheres to validation.

JavaScala
validation().groupBy(\"account_id\", \"year\").avg(\"amount\").between(40, 60)\n
validation.groupBy(\"account_id\", \"year\").avg(\"amount\").between(40, 60)\n
"},{"location":"setup/validation/validation/","title":"Validations","text":"

Validations can be used to run data checks after you have run the data generator or even as a standalone task. A report summarising the success or failure of the validations is produced and can be examined for further investigation.

Currently, SQL expression validations are supported (can see here for reference what other expressions are valid), but will later be extended out to supported other validations such as aggregates (group by account_number, sum of amounts should be greater than 100), ordering (transaction dates should be in descending order), relationships (at least one transaction per account_number) or data profiling (how close produced data profile is to expected data profile).

"},{"location":"setup/validation/validation/#define-validations","title":"Define Validations","text":"

Full example validation can be found below. For more details, check out each of the subsections defined further below.

JavaScalaYAML
var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validations(\nvalidation().col(\"amount\").lessThan(100),\nvalidation().col(\"year\").isEqual(2021).errorThreshold(0.1),  //equivalent to if error percentage is > 10%, then fail\nvalidation().col(\"name\").matches(\"Peter .*\").errorThreshold(200)  //equivalent to if number of errors is > 200, then fail\n)\n.validationWait(waitCondition().pause(1));\n\nvar conf = configuration().enableValidation(true);\n
val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validations(\nvalidation.col(\"amount\").lessThan(100),\nvalidation.col(\"year\").isEqual(2021).errorThreshold(0.1),  //equivalent to if error percentage is > 10%, then fail\nvalidation.col(\"name\").matches(\"Peter .*\").errorThreshold(200)  //equivalent to if number of errors is > 200, then fail\n)  .validationWait(waitCondition.pause(1))\n\nval conf = configuration.enableValidation(true)\n
---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nvalidations:\n- expr: \"amount < 100\"\n- expr: \"year == 2021\"\nerrorThreshold: 0.1   #equivalent to if error percentage is > 10%, then fail\n- expr: \"REGEXP_LIKE(name, 'Peter .*')\"\nerrorThreshold: 200   #equivalent to if number of errors is > 200, then fail\ndescription: \"Should be lots of Peters\"\nwaitCondition:\npauseInSeconds: 1\n
"},{"location":"setup/validation/validation/#wait-condition","title":"Wait Condition","text":"

Once data has been generated, you may want to wait for a certain condition to be met before starting the data validations. This can be via:

"},{"location":"setup/validation/validation/#pause","title":"Pause","text":"JavaScalaYAML
var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().pause(1));\n
val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validationWait(waitCondition.pause(1))\n
---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\npauseInSeconds: 1\n
"},{"location":"setup/validation/validation/#data-exists","title":"Data exists","text":"JavaScalaYAML
var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWaitDataExists(\"updated_date > DATE('2023-01-01')\");\n
val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validationWaitDataExists(\"updated_date > DATE('2023-01-01')\")\n
---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\ndataSourceName: \"transactions\"\noptions:\npath: \"/tmp/csv\"\nexpr: \"updated_date > DATE('2023-01-01')\"\n
"},{"location":"setup/validation/validation/#webhook","title":"Webhook","text":"JavaScalaYAML
var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().webhook(\"http://localhost:8080/finished\")); //by default, GET request successful when 200 status code\n\n//or\n\nvar csvTxnsWithStatusCodes = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().webhook(\"http://localhost:8080/finished\", \"GET\", 200, 202));  //successful if 200 or 202 status code\n\n//or\n\nvar csvTxnsWithExistingHttpConnection = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().webhook(\"my_http\", \"http://localhost:8080/finished\"));  //use connection configuration from existing 'my_http' connection definition\n
val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validationWait(waitCondition.webhook(\"http://localhost:8080/finished\"))  //by default, GET request successful when 200 status code\n\n//or\n\nval csvTxnsWithStatusCodes = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validationWait(waitCondition.webhook(\"http://localhost:8080/finished\", \"GET\", 200, 202)) //successful if 200 or 202 status code\n\n//or\n\nval csvTxnsWithExistingHttpConnection = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validationWait(waitCondition.webhook(\"my_http\", \"http://localhost:8080/finished\")) //use connection configuration from existing 'my_http' connection definition\n
---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\nurl: \"http://localhost:8080/finished\" #by default, GET request successful when 200 status code\n\n#or\n\n---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\nurl: \"http://localhost:8080/finished\"\nmethod: \"GET\"\nstatusCodes: [200, 202] #successful if 200 or 202 status code\n\n#or\n\n---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\ndataSourceName: \"my_http\" #use connection configuration from existing 'my_http' connection definition\nurl: \"http://localhost:8080/finished\"\n
"},{"location":"setup/validation/validation/#file-available","title":"File available","text":"JavaScalaYAML
var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().file(\"/tmp/json\"));\n
val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition.file(\"/tmp/json\"))\n
---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\npath: \"/tmp/json\"\n
"},{"location":"setup/validation/validation/#report","title":"Report","text":"

Once run, it will produce a report like this.

"},{"location":"use-case/business-value/","title":"Business Value","text":"

Below is a list of the business related benefits from using Data Caterer which may be applicable for your use case.

Problem Data Caterer Solution Resources Effects Reliable test data creation - Profile existing data- Create scenarios- Generate data Software Engineers, QA, Testers Cost reduction in labor, more time spent on development, more bugs caught before production Faster development cycles - Generate data in local, test, UAT, pre-prod- Run different scenarios Software Engineers, QA, Testers More defects caught in lower environments, features pushed to production faster, common framework used across all environments Data compliance - Profiling existing data- Generate based on metadata- No complex masking- No production data used in lower environments Audit and compliance No chance for production data breaches Storage costs - Delete generated data- Test specific scenarios Infrastructure Lower data storage costs, less time spent on data management and clean up Schema evolution - Create metadata from data sources- Generate data based off fresh metadata Software Engineers, QA, Testers Less time spent altering tests due to schema changes, ease of use between environments and application versions"},{"location":"use-case/comparison/","title":"Comparison to similar tools","text":"

I have tried to include all the companies found in the list here from Mostly AI blog post and used information that is publicly available.

The companies/products not shown below either have:

"},{"location":"use-case/comparison/#data-generation","title":"Data Generation","text":"Tool Description Cost Pros Cons Clearbox AI Python based data generation tool via ML Unclear Python SDK UI interface Detect private data Report generation Batch data only No data clean up Limited/no documentation Curiosity Software Platform solution for test data management Unclear Extensive documentation Generate data based off test cases UI interface Web/API/UI/mobile testing No quick start No SDK Many components that may not be required No event generation support DataCebo Synthetic Data Vault Python based data generation tool via ML Unclear Python SDK Report generation Data quality checks Business logic constraints No data connection support No data clean up No foreign key support Datafaker Realistic data generation library Free SDK for many languages Simple, easy to use Extensible Open source Generate realistic values No data connection support No data clean up No validation No foreign key support DBLDatagen Python based data generation tool Free Python SDK Open source Good documentation Customisable scenarios Customisable column generation Generate from existing data/schemas Plugin third-party libraries Limited support if issues Code required No data clean up No data validation Gatling HTTP API load testing tool Free (Open Source)Gatling Enterprise, usage based, starts from \u20ac89 per month, 1 user, 6.25 hours of testing Kotlin, Java & Scala SDK Widely used Open source Clear documentation Extensive testing/validation support Customisable scenarios Report generation Only supports HTTP, JMS and JDBC No data clean up Data feeders not based off metadata Gretel Python based data generation tool via ML Usage based, starts from $295 per month, $2.20 per credit, assumed USD CLI & Python SDK UI interface Training and re-use of models Detect private data Customisable scenarios Batch data only No relationships between data sources Only simple foreign key relations defined No data clean up Charge by usage Howso Python based data generation tool via ML Unclear Python SDK Playground to try Open source library Customisable scenarios No support for data sources No data validation No data clean up Mostly AI Python based data generation tool via ML Usage based, Enterprise 1 user, 100 columns, 100K rows $3,100 per month, assumed USD Report generation Non-technical users can use UI Customisable scenarios Charge by usage Batch data only No data clean up Confusing use of 'smart select' for multiple foreign keys Limited custom column generation logic Multiple deployment components No SDK Octopize Python based data generation tool via ML Unclear Python & R SDK Report generation API for metadata Customisable scenarios Input data source is only CSV Multiple manual steps before starting Quickstart is not a quickstart Documentation lacks code examples Synthesized Python based data generation tool via ML Unclear CLI & Python SDK API for metadata IDE setup Data quality checks Not sure what is SDK & TDK Charge by usage No report of what was generated No relationships between data sources Tonic Platform solution for generating data Unclear UI interface Good documentation Detect private data Support for encrypted columns Report generation Alerting Batch data only Multiple deployment components No relationships between data sources No data validation No data clean up No SDK (only API) Difficult to embed complex business logic YData Python based data generation tool via ML. Platform solution as well Unclear Python SDK Open source Detect private data Compare datasets Report generation No data connection support Batch data only No data clean up Separate data generation and data validation No foreign key support"},{"location":"use-case/comparison/#use-of-ml-models","title":"Use of ML models","text":"

You may notice that the majority of data generators use machine learning (ML) models to learn from your existing datasets to generate new data. Below are some pros and cons to the approach.

Pros

Cons

"},{"location":"use-case/roadmap/","title":"Roadmap","text":""},{"location":"use-case/use-case/","title":"Use cases","text":""},{"location":"use-case/use-case/#replicate-production-in-lower-environment","title":"Replicate production in lower environment","text":"

Having a stable and reliable test environment is a challenge for a number of companies, especially where teams are asynchronously deploying and testing changes at faster rates. Data Caterer can help alleviate these issues by doing the following:

  1. Generates data with the latest schema changes and production like field values
  2. Run as a job on a daily/regular basis to replicate production traffic or data flows
  3. Validate data to ensure your system runs as expected
  4. Clean up data to avoid build up of generated data

"},{"location":"use-case/use-case/#local-development","title":"Local development","text":"

Similar to the above, being able to replicate production like data in your local environment can be key to developing more reliable code as you can test directly against data in your local computer. This has a number of benefits including:

  1. Fewer assumptions or ambiguities when the developer codes
  2. Direct feedback loop in local computer rather than waiting for test environment for more reliable test data
  3. No domain expertise required to understand the data
  4. Easy for new developers to be onboarded and developing/testing code for jobs/services
"},{"location":"use-case/use-case/#systemintegration-testing","title":"System/integration testing","text":"

When working with third-party, external or internal data providers, it can be difficult to have all setup ready to produce reliable data that abides by relationship contracts between each of the systems. You have to rely on these data providers in order for you to run your tests which may not align to their priorities. With Data Caterer, you can generate the same data that they would produce, along with maintaining referential integrity across the data providers, so that you can run your tests without relying on their systems being up and reliable in their corresponding lower environments.

"},{"location":"use-case/use-case/#scenario-testing","title":"Scenario testing","text":"

If you want to set up particular data scenarios, you can customise the generated data to fit your scenario. Once the data gets generated and is consumed, you can also run validations to ensure your system has consumed the data correctly. These scenarios can be put together from existing tasks or data sources can be enabled/disabled based on your requirement. Built into Data Caterer and controlled via feature flags, is the ability to test edge cases based on the data type of the fields used for data generation (enableEdgeCases flag within <field>.generator.options, see more here).

"},{"location":"use-case/use-case/#data-debugging","title":"Data debugging","text":"

When data related issues occur in production, it may be difficult to replicate in a lower or local environment. It could be related to specific fields not containing expected results, size of data is too large or missing corresponding referenced data. This becomes key to resolving the issue as you can directly code against the exact data scenario and have confidence that your code changes will fix the problem. Data Caterer can be used to generate the appropriate data in whichever environment you want to test your changes against.

"},{"location":"use-case/use-case/#data-profiling","title":"Data profiling","text":"

When using Data Caterer with the feature flag enableGeneratePlanAndTasks enabled (see here), metadata relating all the fields defined in the data sources you have configured will be generated via data profiling. You can run this as a standalone job (can disable enableGenerateData) so that you can focus on the profile of the data you are utilising. This can be run against your production data sources to ensure the metadata can be used to accurately generate data in other environments. This is a key feature of Data Caterer as no direct production connections need to be maintained to generate data in other environments (which can lead to serious concerns about data security as seen here).

"},{"location":"use-case/use-case/#schema-gathering","title":"Schema gathering","text":"

When using Data Caterer with the feature flag enableGeneratePlanAndTasks enabled (see here), all schemas of the data sources defined will be tracked in a common format (as tasks). This data, along with the data profiling metadata, could then feed back into your schema registries to help keep them up to date with your system.

"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"],"fields":{"title":{"boost":1000.0},"text":{"boost":1.0},"tags":{"boost":1000000.0}}},"docs":[{"location":"","title":"Home","text":"Data Caterer is a metadata-driven data generation and testing tool that aids in creating production-like data across both batch and event data systems. Run data validations to ensure your systems have ingested it as expected, then clean up the data afterwards. Simplify your data testing Take away the pain and complexity of your data landscape and let Data Caterer handle it

Try now

Data testing is difficult and fragmented Current solutions only cover half the story

Try now

What you need is a reliable tool that can handle changes to your data landscape

With Data Caterer, you get:

Try now

"},{"location":"#tech-summary","title":"Tech Summary","text":"

Use the Java, Scala API, or YAML files to help with setup or customisation that are all run via a Docker image. Want to get into details? Checkout the setup pages here to get code examples and guides that will take you through scenarios and data sources.

Main features include:

Check other run configurations here.

"},{"location":"#what-is-it","title":"What is it","text":""},{"location":"#what-it-is-not","title":"What it is not","text":"

Try now

Data Catering vs Other tools vs In-house

Data Catering Other tools In-house Data flow Batch and events generation with validation Batch generation only or validation only Depends on architecture and design Time to results 1 day 1+ month to integrate, deploy and onboard 1+ month to build and deploy Solution Connect with your existing data ecosystem, automatic generation and validation Manual UI data entry or via SDK Depends on engineer(s) building it

"},{"location":"about/","title":"About","text":"

Hi, my name is Peter. I am a independent Software Developer, mainly focussing on data related services. My experience can be found on my LinkedIn.

I have created Data Caterer to help serve individuals and companies with data generation and data testing. It is a complex area that has many edge cases or intricacies that are hard to summarise or turn into something actionable and repeatable. Through the use of metadata, Data Caterer can help simplify your data testing, simulating production environment data, aid in data debugging, or whatever your data use case may be.

Given that it is going to save you and your team time and money, please help in considering financial support. This will help the product grow into a sustainable and feature-full service.

"},{"location":"about/#contact","title":"Contact","text":"

Please contact Peter Flook via Slack or via email peter.flook@data.catering if you have any questions or queries.

"},{"location":"about/#terms-of-service","title":"Terms of service","text":"

Terms of service can be found here.

"},{"location":"about/#privacy-policy","title":"Privacy policy","text":"

Privacy policy can be found here.

"},{"location":"pricing/","title":"Pricing","text":"

To have access to the paid features of Data Caterer, you can subscribe according to your situation. You will not be charged by usage. As you continue to subscribe, you will have access to the latest version of Data Caterer as new bug fixes and features get published.

"},{"location":"pricing/#paid-features","title":"Paid Features","text":""},{"location":"pricing/#pricing-table","title":"Pricing Table","text":""},{"location":"pricing/#manage-subscription","title":"Manage Subscription","text":"

Manage via this link

"},{"location":"pricing/#contact","title":"Contact","text":"

Please contact Peter Flook via Slack or via email peter.flook@data.catering if you have any questions or queries.

"},{"location":"get-started/docker/","title":"Run Data Caterer","text":""},{"location":"get-started/docker/#quick-start","title":"Quick start","text":"

Ensure you have docker installed and running.

git clone git@github.com:pflooky/data-caterer-example.git\ncd data-caterer-example && ./run.sh\n#check results under docker/sample/report/index.html folder\n
"},{"location":"get-started/docker/#report","title":"Report","text":"

Check the report generated under docker/data/custom/report/index.html.

Sample report can also be seen here

"},{"location":"get-started/docker/#paid-version-trial","title":"Paid Version Trial","text":"

30 day trial of the paid version can be accessed via these steps:

  1. Join the Slack Data Catering Slack group here
  2. Get an API_KEY by using slash command /token in the Slack group (will only be visible to you)
  3. git clone git@github.com:pflooky/data-caterer-example.git\ncd data-caterer-example && export DATA_CATERING_API_KEY=<insert api key>\n./run.sh\n

If you want to check how long your trial has left, you can check back in the Slack group or type /token again.

"},{"location":"get-started/docker/#guided-tour","title":"Guided tour","text":"

Check out the starter guide here that will take your through step by step. You can also check the other guides here to see the other possibilities of what Data Caterer can achieve for you.

"},{"location":"legal/privacy-policy/","title":"Privacy Policy","text":"

Last updated September 25, 2023

"},{"location":"legal/privacy-policy/#data-caterer-policy-on-privacy-of-customer-personal-information","title":"Data Caterer Policy on Privacy of Customer Personal Information","text":"

Peter John Flook is committed to protecting the privacy and security of your personal information obtained by reason of your use of Data Caterer. This policy explains the types of customer personal information we collect, how it is used, and the steps we take to ensure your personal information is handled appropriately.

"},{"location":"legal/privacy-policy/#who-is-peter-john-flook","title":"Who is Peter John Flook?","text":"

For purposes of this Privacy Policy, \u201cPeter John Flook\u201d means Peter John Flook, the company developing and providing Data Caterer and related websites and services.

"},{"location":"legal/privacy-policy/#what-is-personal-information","title":"What is personal information?","text":"

Personal information is information that refers to an individual specifically and is recorded in any form. Personal information includes such things as age, income, date of birth, ethnic origin and credit records. Information about individuals contained in the following documents is not considered personal information:

"},{"location":"legal/privacy-policy/#we-are-accountable-to-you","title":"We are accountable to you","text":"

Peter John Flook is responsible for all personal information under its control. Our team is accountable for compliance with these privacy and security principles.

"},{"location":"legal/privacy-policy/#we-let-you-know-why-we-collect-and-use-your-personal-information-and-get-your-consent","title":"We let you know why we collect and use your personal information and get your consent","text":"

Peter John Flook identifies the purpose for which your personal information is collected and will be used or disclosed. If that purpose is not listed below we will do this before or at the time the information is actually being collected. You will be deemed to consent to our use of your personal information for the purpose of:

Otherwise, Peter John Flook will obtain your express consent (by verbal, written or electronic agreement) to collect, use or disclose your personal information. You can change your consent preferences at any time by contacting Peter John Flook (please refer to the \u201cHow to contact us\u201d section below).

"},{"location":"legal/privacy-policy/#we-limit-collection-of-your-personal-information","title":"We limit collection of your personal information","text":"

Peter John Flook collects only the information required to provide products and services to you. Peter John Flook will collect personal information only by clear, fair and lawful means.

We receive and store any information you enter on our website or give us in any other way. You can choose not to provide certain information, but then you might not be able to take advantage of many of our features.

Peter John Flook does not receive or store personal content saved to your local device while using Data Caterer.

We also receive and store certain types of information whenever you interact with us.

"},{"location":"legal/privacy-policy/#information-provided-to-stripe","title":"Information provided to Stripe","text":"

All purchases that are made through this site are processed securely and externally by Stripe. Unless you expressly consent otherwise, we do not see or have access to any personal information that you may provide to Stripe, other than information that is required in order to process your order and deliver your purchased items to you (eg, your name, email address and billing/postal address).

"},{"location":"legal/privacy-policy/#we-limit-disclosure-and-retention-of-your-personal-information","title":"We limit disclosure and retention of your personal information","text":"

Peter John Flook does not disclose personal information to any organization or person for any reason except the following:

We employ other companies and individuals to perform functions on our behalf. Examples include fulfilling orders, delivering packages, sending postal mail and e-mail, removing repetitive information from customer lists, analyzing data, providing marketing assistance, processing credit card payments, and providing customer service. They have access to personal information needed to perform their functions, but may not use it for other purposes. We may use service providers located outside of Australia, and, if applicable, your personal information may be processed and stored in other countries and therefore may be subject to disclosure under the laws of those countries. As we continue to develop our business, we might sell or buy stores, subsidiaries, or business units. In such transactions, customer information generally is one of the transferred business assets but remains subject to the promises made in any pre-existing Privacy Notice (unless, of course, the customer consents otherwise). Also, in the unlikely event that Peter John Flook or substantially all of its assets are acquired, customer information of course will be one of the transferred assets. You are deemed to consent to disclosure of your personal information for those purposes. If your personal information is shared with third parties, those third parties are bound by appropriate agreements with Peter John Flook to secure and protect the confidentiality of your personal information.

Peter John Flook retains your personal information only as long as it is required for our business relationship or as required by federal and provincial laws.

"},{"location":"legal/privacy-policy/#we-keep-your-personal-information-up-to-date-and-accurate","title":"We keep your personal information up to date and accurate","text":"

Peter John Flook keeps your personal information up to date, accurate and relevant for its intended use.

You may request access to the personal information we have on record in order to review and amend the information, as appropriate. In circumstances where your personal information has been provided by a third party, we will refer you to that party (e.g. credit bureaus). To access your personal information, refer to the \u201cHow to contact us\u201d section below.

"},{"location":"legal/privacy-policy/#the-security-of-your-personal-information-is-a-priority-for-peter-john-flook","title":"The security of your personal information is a priority for Peter John Flook","text":"

We take steps to safeguard your personal information, regardless of the format in which it is held, including:

physical security measures such as restricted access facilities and locked filing cabinets electronic security measures for computerized personal information such as password protection, database encryption and personal identification numbers. We work to protect the security of your information during transmission by using \u201cTransport Layer Security\u201d (TLS) protocol. organizational processes such as limiting access to your personal information to a selected group of individuals contractual obligations with third parties who need access to your personal information requiring them to protect and secure your personal information It\u2019s important for you to protect against unauthorized access to your password and your computer. Be sure to sign off when you\u2019ve finished using any shared computer.

"},{"location":"legal/privacy-policy/#what-about-third-party-advertisers-and-links-to-other-websites","title":"What About Third-Party Advertisers and Links to Other Websites?","text":"

Our site may include third-party advertising and links to other websites. We do not provide any personally identifiable customer information to these advertisers or third-party websites.

These third-party websites and advertisers, or Internet advertising companies working on their behalf, sometimes use technology to send (or \u201cserve\u201d) the advertisements that appear on our website directly to your browser. They automatically receive your IP address when this happens. They may also use cookies, JavaScript, web beacons (also known as action tags or single-pixel gifs), and other technologies to measure the effectiveness of their ads and to personalize advertising content. We do not have access to or control over cookies or other features that they may use, and the information practices of these advertisers and third-party websites are not covered by this Privacy Notice. Please contact them directly for more information about their privacy practices. In addition, the Network Advertising Initiative offers useful information about Internet advertising companies (also called \u201cad networks\u201d or \u201cnetwork advertisers\u201d), including information about how to opt-out of their information collection. You can access the Network Advertising Initiative at http://www.networkadvertising.org.

"},{"location":"legal/privacy-policy/#redirection-to-stripe","title":"Redirection to Stripe","text":"

In particular, when you submit an order to us, you may be automatically redirected to Stripe in order to complete the required payment. The payment page that is provided by Stripe is not part of this site. As noted above, we are not privy to any of the bank account, credit card or other personal information that you may provide to Stripe, other than information that is required in order to process your order and deliver your purchased items to you (eg, your name, email address and billing/postal address). We recommend that you refer to Stripe\u2019s privacy statement if you would like more information about how Stripe collects and handles your personal information.

"},{"location":"legal/privacy-policy/#we-are-open-about-our-privacy-and-security-policy","title":"We are open about our privacy and security policy","text":"

We are committed to providing you with understandable and easily available information about our policy and practices related to management of your personal information. This policy and any related information is available at all times on our website, https://data.catering/about/ under Privacy or on request. To contact us, refer to the \u201cHow to contact us\u201d section below.

"},{"location":"legal/privacy-policy/#we-provide-access-to-your-personal-information-stored-by-peter-john-flook","title":"We provide access to your personal information stored by Peter John Flook","text":"

You can request access to your personal information stored by Peter John Flook. To contact us, refer to the \u201cHow to contact us\u201d section below. Upon receiving such a request, Peter John Flook will:

inform you about what type of personal information we have on record or in our control, how it is used and to whom it may have been disclosed provide you with access to your information so you can review and verify the accuracy and completeness and request changes to the information make any necessary updates to your personal information We respond to your questions, concerns and complaints about privacy Peter John Flook responds in a timely manner to your questions, concerns and complaints about the privacy of your personal information and our privacy policies and procedures.

"},{"location":"legal/privacy-policy/#how-to-contact-us","title":"How to contact us","text":"

Our business changes constantly, and this privacy notice will change also. We may e-mail periodic reminders of our notices and conditions, unless you have instructed us not to, but you should check our website frequently to see recent changes. We are, however, committed to protecting your information and will never materially change our policies and practices to make them less protective of customer information collected in the past without the consent of affected customers.

"},{"location":"legal/terms-of-service/","title":"Terms and Conditions","text":"

Last updated: September 25, 2023

Please read these terms and conditions carefully before using Our Service.

"},{"location":"legal/terms-of-service/#interpretation-and-definitions","title":"Interpretation and Definitions","text":""},{"location":"legal/terms-of-service/#interpretation","title":"Interpretation","text":"

The words of which the initial letter is capitalized have meanings defined under the following conditions. The following definitions shall have the same meaning regardless of whether they appear in singular or in plural.

"},{"location":"legal/terms-of-service/#definitions","title":"Definitions","text":"

For the purposes of these Terms and Conditions:

"},{"location":"legal/terms-of-service/#acknowledgment","title":"Acknowledgment","text":"

These are the Terms and Conditions governing the use of this Service and the agreement that operates between You and the Company. These Terms and Conditions set out the rights and obligations of all users regarding the use of the Service.

Your access to and use of the Service is conditioned on Your acceptance of and compliance with these Terms and Conditions. These Terms and Conditions apply to all visitors, users and others who access or use the Service.

By accessing or using the Service You agree to be bound by these Terms and Conditions. If You disagree with any part of these Terms and Conditions then You may not access the Service.

You represent that you are over the age of 18. The Company does not permit those under 18 to use the Service.

Your access to and use of the Service is also conditioned on Your acceptance of and compliance with the Privacy Policy of the Company. Our Privacy Policy describes Our policies and procedures on the collection, use and disclosure of Your personal information when You use the Application or the Website and tells You about Your privacy rights and how the law protects You. Please read Our Privacy Policy carefully before using Our Service.

"},{"location":"legal/terms-of-service/#links-to-other-websites","title":"Links to Other Websites","text":"

Our Service may contain links to third-party websites or services that are not owned or controlled by the Company.

The Company has no control over, and assumes no responsibility for, the content, privacy policies, or practices of any third party websites or services. You further acknowledge and agree that the Company shall not be responsible or liable, directly or indirectly, for any damage or loss caused or alleged to be caused by or in connection with the use of or reliance on any such content, goods or services available on or through any such websites or services.

We strongly advise You to read the terms and conditions and privacy policies of any third-party websites or services that You visit.

"},{"location":"legal/terms-of-service/#termination","title":"Termination","text":"

We may terminate or suspend Your access immediately, without prior notice or liability, for any reason whatsoever, including without limitation if You breach these Terms and Conditions.

Upon termination, Your right to use the Service will cease immediately.

"},{"location":"legal/terms-of-service/#limitation-of-liability","title":"Limitation of Liability","text":"

Notwithstanding any damages that You might incur, the entire liability of the Company and any of its suppliers under any provision of these Terms and Your exclusive remedy for all the foregoing shall be limited to the amount actually paid by You through the Service or 100 USD if You haven't purchased anything through the Service.

To the maximum extent permitted by applicable law, in no event shall the Company or its suppliers be liable for any special, incidental, indirect, or consequential damages whatsoever (including, but not limited to, damages for loss of profits, loss of data or other information, for business interruption, for personal injury, loss of privacy arising out of or in any way related to the use of or inability to use the Service, third-party software and/or third-party hardware used with the Service, or otherwise in connection with any provision of these Terms), even if the Company or any supplier has been advised of the possibility of such damages and even if the remedy fails of its essential purpose.

Some states do not allow the exclusion of implied warranties or limitation of liability for incidental or consequential damages, which means that some of the above limitations may not apply. In these states, each party's liability will be limited to the greatest extent permitted by law.

"},{"location":"legal/terms-of-service/#as-is-and-as-available-disclaimer","title":"\"AS IS\" and \"AS AVAILABLE\" Disclaimer","text":"

The Service is provided to You \"AS IS\" and \"AS AVAILABLE\" and with all faults and defects without warranty of any kind. To the maximum extent permitted under applicable law, the Company, on its own behalf and on behalf of its Affiliates and its and their respective licensors and service providers, expressly disclaims all warranties, whether express, implied, statutory or otherwise, with respect to the Service, including all implied warranties of merchantability, fitness for a particular purpose, title and non-infringement, and warranties that may arise out of course of dealing, course of performance, usage or trade practice. Without limitation to the foregoing, the Company provides no warranty or undertaking, and makes no representation of any kind that the Service will meet Your requirements, achieve any intended results, be compatible or work with any other software, applications, systems or services, operate without interruption, meet any performance or reliability standards or be error free or that any errors or defects can or will be corrected.

Without limiting the foregoing, neither the Company nor any of the company's provider makes any representation or warranty of any kind, express or implied: (i) as to the operation or availability of the Service, or the information, content, and materials or products included thereon; (ii) that the Service will be uninterrupted or error-free; (iii) as to the accuracy, reliability, or currency of any information or content provided through the Service; or (iv) that the Service, its servers, the content, or e-mails sent from or on behalf of the Company are free of viruses, scripts, trojan horses, worms, malware, time-bombs or other harmful components.

Some jurisdictions do not allow the exclusion of certain types of warranties or limitations on applicable statutory rights of a consumer, so some or all of the above exclusions and limitations may not apply to You. But in such a case the exclusions and limitations set forth in this section shall be applied to the greatest extent enforceable under applicable law.

"},{"location":"legal/terms-of-service/#governing-law","title":"Governing Law","text":"

The laws of the Country, excluding its conflicts of law rules, shall govern this Terms and Your use of the Service. Your use of the Application may also be subject to other local, state, national, or international laws.

"},{"location":"legal/terms-of-service/#disputes-resolution","title":"Disputes Resolution","text":"

If You have any concern or dispute about the Service, You agree to first try to resolve the dispute informally by contacting the Company.

"},{"location":"legal/terms-of-service/#for-european-union-eu-users","title":"For European Union (EU) Users","text":"

If You are a European Union consumer, you will benefit from any mandatory provisions of the law of the country in which you are resident in.

"},{"location":"legal/terms-of-service/#united-states-legal-compliance","title":"United States Legal Compliance","text":"

You represent and warrant that (i) You are not located in a country that is subject to the United States government embargo, or that has been designated by the United States government as a \"terrorist supporting\" country, and (ii) You are not listed on any United States government list of prohibited or restricted parties.

"},{"location":"legal/terms-of-service/#severability-and-waiver","title":"Severability and Waiver","text":""},{"location":"legal/terms-of-service/#severability","title":"Severability","text":"

If any provision of these Terms is held to be unenforceable or invalid, such provision will be changed and interpreted to accomplish the objectives of such provision to the greatest extent possible under applicable law and the remaining provisions will continue in full force and effect.

"},{"location":"legal/terms-of-service/#waiver","title":"Waiver","text":"

Except as provided herein, the failure to exercise a right or to require performance of an obligation under these Terms shall not affect a party's ability to exercise such right or require such performance at any time thereafter nor shall the waiver of a breach constitute a waiver of any subsequent breach.

"},{"location":"legal/terms-of-service/#translation-interpretation","title":"Translation Interpretation","text":"

These Terms and Conditions may have been translated if We have made them available to You on our Service. You agree that the original English text shall prevail in the case of a dispute.

"},{"location":"legal/terms-of-service/#changes-to-these-terms-and-conditions","title":"Changes to These Terms and Conditions","text":"

We reserve the right, at Our sole discretion, to modify or replace these Terms at any time. If a revision is material We will make reasonable efforts to provide at least 30 days' notice prior to any new terms taking effect. What constitutes a material change will be determined at Our sole discretion.

By continuing to access or use Our Service after those revisions become effective, You agree to be bound by the revised terms. If You do not agree to the new terms, in whole or in part, please stop using the website and the Service.

"},{"location":"legal/terms-of-service/#contact-us","title":"Contact Us","text":"

If you have any questions about these Terms and Conditions, You can contact us:

"},{"location":"setup/","title":"Setup","text":"

All the configurations and customisation related to Data Caterer can be found under here.

"},{"location":"setup/#guide","title":"Guide","text":"

If you want a guided tour of using the Java or Scala API, you can follow one of the guides found here.

"},{"location":"setup/#specific-configuration","title":"Specific Configuration","text":""},{"location":"setup/#high-level-run-configurations","title":"High Level Run Configurations","text":""},{"location":"setup/configuration/","title":"Configuration","text":"

A number of configurations can be made and customised within Data Caterer to help control what gets run and/or where any metadata gets saved.

These configurations are defined from within your Java or Scala class via configuration or for YAML file setup, application.conf file as seen here.

"},{"location":"setup/configuration/#flags","title":"Flags","text":"

Flags are used to control which processes are executed when you run Data Caterer.

Config Default Paid Description enableGenerateData true N Enable/disable data generation enableCount true N Count the number of records generated. Can be disabled to improve performance enableFailOnError true N Whilst saving generated data, if there is an error, it will stop any further data from being generated enableSaveReports true N Enable/disable HTML reports summarising data generated, metadata of data generated (if enableSinkMetadata is enabled) and validation results (if enableValidation is enabled). Sample here enableSinkMetadata true N Run data profiling for the generated data. Shown in HTML reports if enableSaveSinkMetadata is enabled enableValidation false N Run validations as described in plan. Results can be viewed from logs or from HTML report if enableSaveSinkMetadata is enabled. Sample here enableGeneratePlanAndTasks false Y Enable/disable plan and task auto generation based off data source connections enableRecordTracking false Y Enable/disable which data records have been generated for any data source enableDeleteGeneratedRecords false Y Delete all generated records based off record tracking (if enableRecordTracking has been set to true) enableGenerateValidations false Y If enabled, it will generate validations based on the data sources defined. JavaScalaapplication.conf
configuration()\n.enableGenerateData(true)\n.enableCount(true)\n.enableFailOnError(true)\n.enableSaveReports(true)\n.enableSinkMetadata(true)\n.enableValidation(false)\n.enableGeneratePlanAndTasks(false)\n.enableRecordTracking(false)\n.enableDeleteGeneratedRecords(false)\n.enableGenerateValidations(false);\n
configuration\n.enableGenerateData(true)\n.enableCount(true)\n.enableFailOnError(true)\n.enableSaveReports(true)\n.enableSinkMetadata(true)\n.enableValidation(false)\n.enableGeneratePlanAndTasks(false)\n.enableRecordTracking(false)\n.enableDeleteGeneratedRecords(false)\n.enableGenerateValidations(false)\n
flags {\n  enableCount = false\n  enableCount = ${?ENABLE_COUNT}\n  enableGenerateData = true\n  enableGenerateData = ${?ENABLE_GENERATE_DATA}\n  enableFailOnError = true\n  enableFailOnError = ${?ENABLE_FAIL_ON_ERROR}\n  enableGeneratePlanAndTasks = false\n  enableGeneratePlanAndTasks = ${?ENABLE_GENERATE_PLAN_AND_TASKS}\n  enableRecordTracking = false\n  enableRecordTracking = ${?ENABLE_RECORD_TRACKING}\n  enableDeleteGeneratedRecords = false\n  enableDeleteGeneratedRecords = ${?ENABLE_DELETE_GENERATED_RECORDS}\n  enableGenerateValidations = false\n  enableGenerateValidations = ${?ENABLE_GENERATE_VALIDATIONS}\n}\n
"},{"location":"setup/configuration/#folders","title":"Folders","text":"

Depending on which flags are enabled, there are folders that get used to save metadata, store HTML reports or track the records generated.

These folder pathways can be defined as a cloud storage pathway (i.e. s3a://my-bucket/task).

Config Default Paid Description planFilePath /opt/app/plan/customer-create-plan.yaml N Plan file path to use when generating and/or validating data taskFolderPath /opt/app/task N Task folder path that contains all the task files (can have nested directories) validationFolderPath /opt/app/validation N Validation folder path that contains all the validation files (can have nested directories) generatedReportsFolderPath /opt/app/report N Where HTML reports get generated that contain information about data generated along with any validations performed generatedPlanAndTaskFolderPath /tmp Y Folder path where generated plan and task files will be saved recordTrackingFolderPath /opt/app/record-tracking Y Where record tracking parquet files get saved JavaScalaapplication.conf
configuration()\n.planFilePath(\"/opt/app/custom/plan/postgres-plan.yaml\")\n.taskFolderPath(\"/opt/app/custom/task\")\n.validationFolderPath(\"/opt/app/custom/validation\")\n.generatedReportsFolderPath(\"/opt/app/custom/report\")\n.generatedPlanAndTaskFolderPath(\"/opt/app/custom/generated\")\n.recordTrackingFolderPath(\"/opt/app/custom/record-tracking\");\n
configuration\n.planFilePath(\"/opt/app/custom/plan/postgres-plan.yaml\")\n.taskFolderPath(\"/opt/app/custom/task\")\n.validationFolderPath(\"/opt/app/custom/validation\")\n.generatedReportsFolderPath(\"/opt/app/custom/report\")\n.generatedPlanAndTaskFolderPath(\"/opt/app/custom/generated\")\n.recordTrackingFolderPath(\"/opt/app/custom/record-tracking\")\n
folders {\n  planFilePath = \"/opt/app/custom/plan/postgres-plan.yaml\"\n  planFilePath = ${?PLAN_FILE_PATH}\n  taskFolderPath = \"/opt/app/custom/task\"\n  taskFolderPath = ${?TASK_FOLDER_PATH}\n  validationFolderPath = \"/opt/app/custom/validation\"\n  validationFolderPath = ${?VALIDATION_FOLDER_PATH}\n  generatedReportsFolderPath = \"/opt/app/custom/report\"\n  generatedReportsFolderPath = ${?GENERATED_REPORTS_FOLDER_PATH}\n  generatedPlanAndTaskFolderPath = \"/opt/app/custom/generated\"\n  generatedPlanAndTaskFolderPath = ${?GENERATED_PLAN_AND_TASK_FOLDER_PATH}\n  recordTrackingFolderPath = \"/opt/app/custom/record-tracking\"\n  recordTrackingFolderPath = ${?RECORD_TRACKING_FOLDER_PATH}\n}\n
"},{"location":"setup/configuration/#metadata","title":"Metadata","text":"

When metadata gets generated, there are some configurations that can be altered to help with performance or accuracy related issues. Metadata gets generated from two processes: 1) if enableGeneratePlanAndTasks or 2) if enableSinkMetadata are enabled.

During the generation of plan and tasks, data profiling is used to create the metadata for each of the fields defined in the data source. You may face issues if the number of records in the data source is large as data profiling is an expensive task. Similarly, it can be expensive when analysing the generated data if the number of records generated is large.

Config Default Paid Description numRecordsFromDataSource 10000 Y Number of records read in from the data source that could be used for data profiling numRecordsForAnalysis 10000 Y Number of records used for data profiling from the records gathered in numRecordsFromDataSource oneOfMinCount 1000 Y Minimum number of records required before considering if a field can be of type oneOf oneOfDistinctCountVsCountThreshold 0.2 Y Threshold ratio to determine if a field is of type oneOf (i.e. a field called status that only contains open or closed. Distinct count = 2, total count = 10, ratio = 2 / 10 = 0.2 therefore marked as oneOf) numGeneratedSamples 10 N Number of sample records from generated data to take. Shown in HTML report JavaScalaapplication.conf
configuration()\n.numRecordsFromDataSourceForDataProfiling(10000)\n.numRecordsForAnalysisForDataProfiling(10000)\n.oneOfMinCount(1000)\n.oneOfDistinctCountVsCountThreshold(1000)\n.numGeneratedSamples(10);\n
configuration\n.numRecordsFromDataSourceForDataProfiling(10000)\n.numRecordsForAnalysisForDataProfiling(10000)\n.oneOfMinCount(1000)\n.oneOfDistinctCountVsCountThreshold(1000)\n.numGeneratedSamples(10)\n
metadata {\n  numRecordsFromDataSource = 10000\n  numRecordsForAnalysis = 10000\n  oneOfMinCount = 1000\n  oneOfDistinctCountVsCountThreshold = 0.2\n  numGeneratedSamples = 10\n}\n
"},{"location":"setup/configuration/#generation","title":"Generation","text":"

When generating data, you may have some limitations such as limited CPU or memory, large number of data sources, or data sources prone to failure under load. To help alleviate these issues or speed up performance, you can control the number of records that get generated in each batch.

Config Default Paid Description numRecordsPerBatch 100000 N Number of records across all data sources to generate per batch numRecordsPerStep N Overrides the count defined in each step with this value if defined (i.e. if set to 1000, for each step, 1000 records will be generated) ScalaScalaapplication.conf
configuration()\n.numRecordsPerBatch(100000)\n.numRecordsPerStep(1000);\n
configuration\n.numRecordsPerBatch(100000)\n.numRecordsPerStep(1000)\n
generation {\n  numRecordsPerBatch = 100000\n  numRecordsPerStep = 1000\n}\n
"},{"location":"setup/configuration/#runtime","title":"Runtime","text":"

Given Data Caterer uses Spark as the base framework for data processing, you can configure the job as to your specifications via configuration as seen here.

JavaScalaapplication.conf
configuration()\n.master(\"local[*]\")\n.runtimeConfig(Map.of(\"spark.driver.cores\", \"5\"))\n.addRuntimeConfig(\"spark.driver.memory\", \"10g\");\n
configuration\n.master(\"local[*]\")\n.runtimeConfig(Map(\"spark.driver.cores\" -> \"5\"))\n.addRuntimeConfig(\"spark.driver.memory\" -> \"10g\")\n
runtime {\n  master = \"local[*]\"\n  master = ${?DATA_CATERER_MASTER}\n  config {\n    \"spark.driver.cores\" = \"5\"\n    \"spark.driver.memory\" = \"10g\"\n  }\n}\n
"},{"location":"setup/advanced/advanced/","title":"Advanced use cases","text":""},{"location":"setup/advanced/advanced/#special-data-formats","title":"Special data formats","text":"

There are many options available for you to use when you have a scenario when data has to be a certain format.

  1. Create expression datafaker
    1. Can be used to create names, addresses, or anything that can be found under here
  2. Create regex
"},{"location":"setup/advanced/advanced/#foreign-keys-across-data-sets","title":"Foreign keys across data sets","text":"

Details for how you can configure foreign keys can be found here.

"},{"location":"setup/advanced/advanced/#edge-cases","title":"Edge cases","text":"

For each given data type, there are edge cases which can cause issues when your application processes the data. This can be controlled at a column level by including the following flag in the generator options:

JavaScalaYAML
field()\n.name(\"amount\")\n.type(DoubleType.instance())\n.enableEdgeCases(true)\n.edgeCaseProbability(0.1)\n
field\n.name(\"amount\")\n.`type`(DoubleType)\n.enableEdgeCases(true)\n.edgeCaseProbability(0.1)\n
fields:\n- name: \"amount\"\ntype: \"double\"\ngenerator:\ntype: \"random\"\noptions:\nenableEdgeCases: \"true\"\nedgeCaseProb: 0.1\n

If you want to know all the possible edge cases for each data type, can check the documentation here.

"},{"location":"setup/advanced/advanced/#scenario-testing","title":"Scenario testing","text":"

You can create specific scenarios by adjusting the metadata found in the plan and tasks to your liking. For example, if you had two data sources, a Postgres database and a parquet file, and you wanted to save account data into Postgres and transactions related to those accounts into a parquet file. You can alter the status column in the account data to only generate open accounts and define a foreign key between Postgres and parquet to ensure the same account_id is being used. Then in the parquet task, define 1 to 10 transactions per account_id to be generated.

Postgres account generation example task Parquet transaction generation example task Plan

"},{"location":"setup/advanced/advanced/#cloud-storage","title":"Cloud storage","text":""},{"location":"setup/advanced/advanced/#data-source","title":"Data source","text":"

If you want to save the file types CSV, JSON, Parquet or ORC into cloud storage, you can do so via adding extra configurations. Below is an example for S3.

JavaScalaYAML
var csvTask = csv(\"my_csv\", \"s3a://my-bucket/csv/accounts\")\n.schema(\nfield().name(\"account_id\"),\n...\n);\n\nvar s3Configuration = configuration()\n.runtimeConfig(Map.of(\n\"spark.hadoop.fs.s3a.directory.marker.retention\", \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\", \"true\",\n\"spark.hadoop.fs.defaultFS\", \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\", \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\", \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\", \"secret_key\"\n));\n\nexecute(s3Configuration, csvTask);\n
val csvTask = csv(\"my_csv\", \"s3a://my-bucket/csv/accounts\")\n.schema(\nfield.name(\"account_id\"),\n...\n)\n\nval s3Configuration = configuration\n.runtimeConfig(Map(\n\"spark.hadoop.fs.s3a.directory.marker.retention\" -> \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" -> \"true\",\n\"spark.hadoop.fs.defaultFS\" -> \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" -> \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\" -> \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\" -> \"secret_key\"\n))\n\nexecute(s3Configuration, csvTask)\n
folders {\ngeneratedPlanAndTaskFolderPath = \"s3a://my-bucket/data-caterer/generated\"\nplanFilePath = \"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\"\ntaskFolderPath = \"s3a://my-bucket/data-caterer/generated/task\"\n}\n\nruntime {\nconfig {\n...\n#S3\n\"spark.hadoop.fs.s3a.directory.marker.retention\" = \"keep\"\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" = \"true\"\n\"spark.hadoop.fs.defaultFS\" = \"s3a://my-bucket\"\n#can change to other credential providers as shown here\n#https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" = \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\"\n\"spark.hadoop.fs.s3a.access.key\" = \"access_key\"\n\"spark.hadoop.fs.s3a.secret.key\" = \"secret_key\"\n}\n}\n
"},{"location":"setup/advanced/advanced/#storing-plantasks","title":"Storing plan/task(s)","text":"

You can generate and store the plan/task files inside either AWS S3, Azure Blob Storage or Google GCS. This can be controlled via configuration set in the application.conf file where you can set something like the below:

JavaScalaYAML
configuration()\n.generatedReportsFolderPath(\"s3a://my-bucket/data-caterer/generated\")\n.planFilePath(\"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\")\n.taskFolderPath(\"s3a://my-bucket/data-caterer/generated/task\")\n.runtimeConfig(Map.of(\n\"spark.hadoop.fs.s3a.directory.marker.retention\", \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\", \"true\",\n\"spark.hadoop.fs.defaultFS\", \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\", \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\", \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\", \"secret_key\"\n));\n
configuration\n.generatedReportsFolderPath(\"s3a://my-bucket/data-caterer/generated\")\n.planFilePath(\"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\")\n.taskFolderPath(\"s3a://my-bucket/data-caterer/generated/task\")\n.runtimeConfig(Map(\n\"spark.hadoop.fs.s3a.directory.marker.retention\" -> \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" -> \"true\",\n\"spark.hadoop.fs.defaultFS\" -> \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" -> \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\" -> \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\" -> \"secret_key\"\n))\n
folders {\ngeneratedPlanAndTaskFolderPath = \"s3a://my-bucket/data-caterer/generated\"\nplanFilePath = \"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\"\ntaskFolderPath = \"s3a://my-bucket/data-caterer/generated/task\"\n}\n\nruntime {\nconfig {\n...\n#S3\n\"spark.hadoop.fs.s3a.directory.marker.retention\" = \"keep\"\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" = \"true\"\n\"spark.hadoop.fs.defaultFS\" = \"s3a://my-bucket\"\n#can change to other credential providers as shown here\n#https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" = \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\"\n\"spark.hadoop.fs.s3a.access.key\" = \"access_key\"\n\"spark.hadoop.fs.s3a.secret.key\" = \"secret_key\"\n}\n}\n
"},{"location":"setup/connection/connection/","title":"Data Source Connections","text":"

Details of all the connection configuration supported can be found in the below subsections for each type of connection.

These configurations can be done via API or from configuration. Examples of both are shown for each data source below.

"},{"location":"setup/connection/connection/#supported-data-connections","title":"Supported Data Connections","text":"Data Source Type Data Source Paid Database Postgres, MySQL, Cassandra N (Postgres), Y (rest) File CSV, JSON, ORC, Parquet N Messaging Kafka, Solace Y HTTP REST API Y Metadata Marquez, OpenMetadata, OpenAPI/Swagger Y"},{"location":"setup/connection/connection/#api","title":"API","text":"

All connection details require a name. Depending on the data source, you can define additional options which may be used by the driver or connector for connecting to the data source.

"},{"location":"setup/connection/connection/#configuration-file","title":"Configuration file","text":"

All connection details follow the same pattern.

<connection format> {\n    <connection name> {\n        <key> = <value>\n    }\n}\n

Overriding configuration

When defining a configuration value that can be defined by a system property or environment variable at runtime, you can define that via the following:

url = \"localhost\"\nurl = ${?POSTGRES_URL}\n

The above defines that if there is a system property or environment variable named POSTGRES_URL, then that value will be used for the url, otherwise, it will default to localhost.

"},{"location":"setup/connection/connection/#data-sources","title":"Data sources","text":"

To find examples of a task for each type of data source, please check out this page.

"},{"location":"setup/connection/connection/#file","title":"File","text":"

Linked here is a list of generic options that can be included as part of your file data source configuration if required. Links to specific file type configurations can be found below.

"},{"location":"setup/connection/connection/#csv","title":"CSV","text":"JavaScalaapplication.conf
csv(\"customer_transactions\", \"/data/customer/transaction\")\n
csv(\"customer_transactions\", \"/data/customer/transaction\")\n
csv {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?CSV_PATH}\n  }\n}\n

Other available configuration for CSV can be found here

"},{"location":"setup/connection/connection/#json","title":"JSON","text":"JavaScalaapplication.conf
json(\"customer_transactions\", \"/data/customer/transaction\")\n
json(\"customer_transactions\", \"/data/customer/transaction\")\n
json {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?JSON_PATH}\n  }\n}\n

Other available configuration for JSON can be found here

"},{"location":"setup/connection/connection/#orc","title":"ORC","text":"JavaScalaapplication.conf
orc(\"customer_transactions\", \"/data/customer/transaction\")\n
orc(\"customer_transactions\", \"/data/customer/transaction\")\n
orc {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?ORC_PATH}\n  }\n}\n

Other available configuration for ORC can be found here

"},{"location":"setup/connection/connection/#parquet","title":"Parquet","text":"JavaScalaapplication.conf
parquet(\"customer_transactions\", \"/data/customer/transaction\")\n
parquet(\"customer_transactions\", \"/data/customer/transaction\")\n
parquet {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?PARQUET_PATH}\n  }\n}\n

Other available configuration for Parquet can be found here

"},{"location":"setup/connection/connection/#delta-not-supported-yet","title":"Delta (not supported yet)","text":"JavaScalaapplication.conf
delta(\"customer_transactions\", \"/data/customer/transaction\")\n
delta(\"customer_transactions\", \"/data/customer/transaction\")\n
delta {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?DELTA_PATH}\n  }\n}\n
"},{"location":"setup/connection/connection/#rmdbs","title":"RMDBS","text":"

Follows the same configuration used by Spark as found here. Sample can be found below

JavaScalaapplication.conf
postgres(\n\"customer_postgres\",                            #name\n\"jdbc:postgresql://localhost:5432/customer\",    #url\n\"postgres\",                                     #username\n\"postgres\"                                      #password\n)\n
postgres(\n\"customer_postgres\",                            #name\n\"jdbc:postgresql://localhost:5432/customer\",    #url\n\"postgres\",                                     #username\n\"postgres\"                                      #password\n)\n
jdbc {\n    customer_postgres {\n        url = \"jdbc:postgresql://localhost:5432/customer\"\n        url = ${?POSTGRES_URL}\n        user = \"postgres\"\n        user = ${?POSTGRES_USERNAME}\n        password = \"postgres\"\n        password = ${?POSTGRES_PASSWORD}\n        driver = \"org.postgresql.Driver\"\n    }\n}\n

Ensure that the user has write permission, so it is able to save the table to the target tables.

SQL Permission Statements
GRANT INSERT ON <schema>.<table> TO <user>;\n
"},{"location":"setup/connection/connection/#postgres","title":"Postgres","text":"

Can see example API or Config definition for Postgres connection above.

"},{"location":"setup/connection/connection/#permissions","title":"Permissions","text":"

Following permissions are required when generating plan and tasks:

SQL Permission Statements
GRANT SELECT ON information_schema.tables TO < user >;\nGRANT SELECT ON information_schema.columns TO < user >;\nGRANT SELECT ON information_schema.key_column_usage TO < user >;\nGRANT SELECT ON information_schema.table_constraints TO < user >;\nGRANT SELECT ON information_schema.constraint_column_usage TO < user >;\n
"},{"location":"setup/connection/connection/#mysql","title":"MySQL","text":"JavaScalaapplication.conf
mysql(\n\"customer_mysql\",                       #name\n\"jdbc:mysql://localhost:3306/customer\", #url\n\"root\",                                 #username\n\"root\"                                  #password\n)\n
mysql(\n\"customer_mysql\",                       #name\n\"jdbc:mysql://localhost:3306/customer\", #url\n\"root\",                                 #username\n\"root\"                                  #password\n)\n
jdbc {\n    customer_mysql {\n        url = \"jdbc:mysql://localhost:3306/customer\"\n        user = \"root\"\n        password = \"root\"\n        driver = \"com.mysql.cj.jdbc.Driver\"\n    }\n}\n
"},{"location":"setup/connection/connection/#permissions_1","title":"Permissions","text":"

Following permissions are required when generating plan and tasks:

SQL Permission Statements
GRANT SELECT ON information_schema.columns TO < user >;\nGRANT SELECT ON information_schema.statistics TO < user >;\nGRANT SELECT ON information_schema.key_column_usage TO < user >;\n
"},{"location":"setup/connection/connection/#cassandra","title":"Cassandra","text":"

Follows same configuration as defined by the Spark Cassandra Connector as found here

JavaScalaapplication.conf
cassandra(\n\"customer_cassandra\",   #name\n\"localhost:9042\",       #url\n\"cassandra\",            #username\n\"cassandra\",            #password\nMap.of()                #optional additional connection options\n)\n
cassandra(\n\"customer_cassandra\",   #name\n\"localhost:9042\",       #url\n\"cassandra\",            #username\n\"cassandra\",            #password\nMap()                #optional additional connection options\n)\n
org.apache.spark.sql.cassandra {\n    customer_cassandra {\n        spark.cassandra.connection.host = \"localhost\"\n        spark.cassandra.connection.host = ${?CASSANDRA_HOST}\n        spark.cassandra.connection.port = \"9042\"\n        spark.cassandra.connection.port = ${?CASSANDRA_PORT}\n        spark.cassandra.auth.username = \"cassandra\"\n        spark.cassandra.auth.username = ${?CASSANDRA_USERNAME}\n        spark.cassandra.auth.password = \"cassandra\"\n        spark.cassandra.auth.password = ${?CASSANDRA_PASSWORD}\n    }\n}\n
"},{"location":"setup/connection/connection/#permissions_2","title":"Permissions","text":"

Ensure that the user has write permission, so it is able to save the table to the target tables.

CQL Permission Statements
GRANT INSERT ON <schema>.<table> TO <user>;\n

Following permissions are required when enabling configuration.enableGeneratePlanAndTasks(true) as it will gather metadata information about tables and columns from the below tables.

CQL Permission Statements
GRANT SELECT ON system_schema.tables TO <user>;\nGRANT SELECT ON system_schema.columns TO <user>;\n
"},{"location":"setup/connection/connection/#kafka","title":"Kafka","text":"

Define your Kafka bootstrap server to connect and send generated data to corresponding topics. Topic gets set at a step level. Further details can be found here

JavaScalaapplication.conf
kafka(\n\"customer_kafka\",   #name\n\"localhost:9092\"    #url\n)\n
kafka(\n\"customer_kafka\",   #name\n\"localhost:9092\"    #url\n)\n
kafka {\n    customer_kafka {\n        kafka.bootstrap.servers = \"localhost:9092\"\n        kafka.bootstrap.servers = ${?KAFKA_BOOTSTRAP_SERVERS}\n    }\n}\n

When defining your schema for pushing data to Kafka, it follows a specific top level schema. An example can be found here . You can define the key, value, headers, partition or topic by following the linked schema.

"},{"location":"setup/connection/connection/#jms","title":"JMS","text":"

Uses JNDI lookup to send messages to JMS queue. Ensure that the messaging system you are using has your queue/topic registered via JNDI otherwise a connection cannot be created.

JavaScalaapplication.conf
solace(\n\"customer_solace\",                                      #name\n\"smf://localhost:55554\",                                #url\n\"admin\",                                                #username\n\"admin\",                                                #password\n\"default\",                                              #vpn name\n\"/jms/cf/default\",                                      #connection factory\n\"com.solacesystems.jndi.SolJNDIInitialContextFactory\"   #initial context factory\n)\n
solace(\n\"customer_solace\",                                      #name\n\"smf://localhost:55554\",                                #url\n\"admin\",                                                #username\n\"admin\",                                                #password\n\"default\",                                              #vpn name\n\"/jms/cf/default\",                                      #connection factory\n\"com.solacesystems.jndi.SolJNDIInitialContextFactory\"   #initial context factory\n)\n
jms {\n    customer_solace {\n        initialContextFactory = \"com.solacesystems.jndi.SolJNDIInitialContextFactory\"\n        connectionFactory = \"/jms/cf/default\"\n        url = \"smf://localhost:55555\"\n        url = ${?SOLACE_URL}\n        user = \"admin\"\n        user = ${?SOLACE_USER}\n        password = \"admin\"\n        password = ${?SOLACE_PASSWORD}\n        vpnName = \"default\"\n        vpnName = ${?SOLACE_VPN}\n    }\n}\n
"},{"location":"setup/connection/connection/#http","title":"HTTP","text":"

Define any username and/or password needed for the HTTP requests. The url is defined in the tasks to allow for generated data to be populated in the url.

JavaScalaapplication.conf
http(\n\"customer_api\", #name\n\"admin\",        #username\n\"admin\"         #password\n)\n
http(\n\"customer_api\", #name\n\"admin\",        #username\n\"admin\"         #password\n)\n
http {\n    customer_api {\n        user = \"admin\"\n        user = ${?HTTP_USER}\n        password = \"admin\"\n        password = ${?HTTP_PASSWORD}\n    }\n}\n
"},{"location":"setup/deployment/deployment/","title":"Deployment","text":"

Two main ways to deploy and run Data Caterer:

"},{"location":"setup/deployment/deployment/#docker","title":"Docker","text":"

To package up your class along with the Data Caterer base image, you can follow the Dockerfile that is created for you here.

Then you can run the following:

./gradlew clean build\ndocker build -t <my_image_name>:<my_image_tag> .\n
"},{"location":"setup/deployment/deployment/#helm","title":"Helm","text":"

Link to sample helm on GitHub here

Update the configuration to your own data connections and configuration or own image created from above.

git clone git@github.com:pflooky/data-caterer-example.git\nhelm install data-caterer ./data-caterer-example/helm/data-caterer\n
"},{"location":"setup/foreign-key/foreign-key/","title":"Foreign Keys","text":"

Foreign keys can be defined to represent the relationships between datasets where values are required to match for particular columns.

"},{"location":"setup/foreign-key/foreign-key/#single-column","title":"Single column","text":"

Define a column in one data source to match against another column. Below example shows a postgres data source with two tables, accounts and transactions that have a foreign key for account_id.

JavaScalaYAML
var postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\n...\n);\nvar postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\n...\n);\n\nplan().addForeignKeyRelationship(\npostgresAcc, \"account_id\",\nList.of(Map.entry(postgresTxn, \"account_id\"))\n);\n
val postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\n...\n)\nval postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\n...\n)\n\nplan.addForeignKeyRelationship(\npostgresAcc, \"account_id\",\nList(postgresTxn -> \"account_id\")\n)\n
---\nname: \"postgres_data\"\nsteps:\n- name: \"accounts\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.accounts\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"name\"\n- name: \"transactions\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.transactions\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"full_name\"\n---\nname: \"customer_create_plan\"\ndescription: \"Create customers in JDBC\"\ntasks:\n- name: \"postgres_data\"\ndataSourceName: \"my_postgres\"\n\nsinkOptions:\nforeignKeys:\n\"postgres.accounts.account_id\":\n- \"postgres.transactions.account_id\"\n
"},{"location":"setup/foreign-key/foreign-key/#multiple-columns","title":"Multiple columns","text":"

You may have a scenario where multiple columns need to be aligned. From the same example, we want account_id and name from accounts to match with account_id and full_name to match in transactions respectively.

JavaScalaYAML
var postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\n...\n);\nvar postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\n...\n);\n\nplan().addForeignKeyRelationship(\npostgresAcc, List.of(\"account_id\", \"name\"),\nList.of(Map.entry(postgresTxn, List.of(\"account_id\", \"full_name\")))\n);\n
val postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\n...\n)\nval postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\n...\n)\n\nplan.addForeignKeyRelationship(\npostgresAcc, List(\"account_id\", \"name\"),\nList(postgresTxn -> List(\"account_id\", \"full_name\"))\n)\n
---\nname: \"postgres_data\"\nsteps:\n- name: \"accounts\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.accounts\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"name\"\n- name: \"transactions\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.transactions\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"full_name\"\n---\nname: \"customer_create_plan\"\ndescription: \"Create customers in JDBC\"\ntasks:\n- name: \"postgres_data\"\ndataSourceName: \"my_postgres\"\n\nsinkOptions:\nforeignKeys:\n\"my_postgres.accounts.account_id,name\":\n- \"my_postgres.transactions.account_id,full_name\"\n
"},{"location":"setup/foreign-key/foreign-key/#nested-column","title":"Nested column","text":"

Your schema structure can have nested fields which can also be referenced as foreign keys. But to do so, you need to create a proxy field that gets omitted from the final saved data.

In the example below, the nested customer_details.name field inside the json task needs to match with name from postgres. A new field in the json called _txn_name is used as a temporary column to facilitate the foreign key definition.

JavaScalaYAML
var postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\n...\n);\nvar jsonTask = json(\"my_json\", \"/tmp/json\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").sql(\"_txn_name\"), #nested field will get value from '_txn_name'\n...\n),\nfield().name(\"_txn_name\").omit(true)       #value will not be included in output\n);\n\nplan().addForeignKeyRelationship(\npostgresAcc, List.of(\"account_id\", \"name\"),\nList.of(Map.entry(jsonTask, List.of(\"account_id\", \"_txn_name\")))\n);\n
val postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\n...\n)\nvar jsonTask = json(\"my_json\", \"/tmp/json\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").sql(\"_txn_name\"), #nested field will get value from '_txn_name'\n...\n), field.name(\"_txn_name\").omit(true)       #value will not be included in output\n)\n\nplan.addForeignKeyRelationship(\npostgresAcc, List(\"account_id\", \"name\"),\nList(jsonTask -> List(\"account_id\", \"_txn_name\"))\n)\n
---\n#postgres task yaml\nname: \"postgres_data\"\nsteps:\n- name: \"accounts\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.accounts\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"name\"\n---\n#json task yaml\nname: \"json_data\"\nsteps:\n- name: \"transactions\"\ntype: \"json\"\noptions:\ndbtable: \"account.transactions\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"_txn_name\"\ngenerator:\noptions:\nomit: true\n- name: \"cusotmer_details\"\nschema:\nfields:\nname: \"name\"\ngenerator:\ntype: \"sql\"\noptions:\nsql: \"_txn_name\"\n\n---\n#plan yaml\nname: \"customer_create_plan\"\ndescription: \"Create customers in JDBC\"\ntasks:\n- name: \"postgres_data\"\ndataSourceName: \"my_postgres\"\n- name: \"json_data\"\ndataSourceName: \"my_json\"\n\nsinkOptions:\nforeignKeys:\n\"my_postgres.accounts.account_id,name\":\n- \"my_json.transactions.account_id,_txn_name\"\n
"},{"location":"setup/generator/count/","title":"Record Count","text":"

There are options related to controlling the number of records generated that can help in generating the scenarios or data required.

"},{"location":"setup/generator/count/#record-count_1","title":"Record Count","text":"

Record count is the simplest as you define the total number of records you require for that particular step. For example, in the below step, it will generate 1000 records for the CSV file

JavaScalaYAML
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(1000);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(1000)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\nrecords: 1000\n
"},{"location":"setup/generator/count/#generated-count","title":"Generated Count","text":"

As like most things in Data Caterer, the count can be generated based on some metadata. For example, if I wanted to generate between 1000 and 2000 records, I could define that by the below configuration:

JavaScalaYAML
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(generator().min(1000).max(2000));\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(generator.min(1000).max(2000))\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\ngenerator:\ntype: \"random\"\noptions:\nmin: 1000\nmax: 2000\n
"},{"location":"setup/generator/count/#per-column-count","title":"Per Column Count","text":"

When defining a per column count, this allows you to generate records \"per set of columns\". This means that for a given set of columns, it will generate a particular amount of records per combination of values for those columns.

One example of this would be when generating transactions relating to a customer, a customer may be defined by columns account_id, name. A number of transactions would be generated per account_id, name.

You can also use a combination of the above two methods to generate the number of records per column.

"},{"location":"setup/generator/count/#records","title":"Records","text":"

When defining a base number of records within the perColumn configuration, it translates to creating (count.records * count.recordsPerColumn) records. This is a fixed number of records that will be generated each time, with no variation between runs.

In the example below, we have count.records = 1000 and count.recordsPerColumn = 2. Which means that 1000 * 2 = 2000 records will be generated in total.

JavaScalaYAML
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount()\n.records(1000)\n.recordsPerColumn(2, \"account_id\", \"name\")\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount\n.records(1000)\n.recordsPerColumn(2, \"account_id\", \"name\")\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\nrecords: 1000\nperColumn:\nrecords: 2\ncolumnNames:\n- \"account_id\"\n- \"name\"\n
"},{"location":"setup/generator/count/#generated","title":"Generated","text":"

You can also define a generator for the count per column. This can be used in scenarios where you want a variable number of records per set of columns.

In the example below, it will generate between (count.records * count.perColumnGenerator.generator.min) = (1000 * 1) = 1000 and (count.records * count.perColumnGenerator.generator.max) = (1000 * 2) = 2000 records.

JavaScalaYAML
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount()\n.records(1000)\n.recordsPerColumnGenerator(generator().min(1).max(2), \"account_id\", \"name\")\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount\n.records(1000)\n.recordsPerColumnGenerator(generator.min(1).max(2), \"account_id\", \"name\")\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\nrecords: 1000\nperColumn:\ncolumnNames:\n- \"account_id\"\n- \"name\"\ngenerator:\ntype: \"random\"\noptions:\nmin: 1\nmax: 2\n
"},{"location":"setup/generator/generator/","title":"Data Generators","text":""},{"location":"setup/generator/generator/#data-types","title":"Data Types","text":"

Below is a list of all supported data types for generating data:

Data Type Spark Data Type Options Description string StringType minLen, maxLen, expression, enableNull integer IntegerType min, max, stddev, mean long LongType min, max, stddev, mean short ShortType min, max, stddev, mean decimal(precision, scale) DecimalType(precision, scale) min, max, stddev, mean double DoubleType min, max, stddev, mean float FloatType min, max, stddev, mean date DateType min, max, enableNull timestamp TimestampType min, max, enableNull boolean BooleanType binary BinaryType minLen, maxLen, enableNull byte ByteType array ArrayType arrayMinLen, arrayMaxLen, arrayType _ StructType Implicitly supported when a schema is defined for a field"},{"location":"setup/generator/generator/#options","title":"Options","text":""},{"location":"setup/generator/generator/#all-data-types","title":"All data types","text":"

Some options are available to use for all types of data generators. Below is the list along with example and descriptions:

Option Default Example Description enableEdgeCase false enableEdgeCase: \"true\" Enable/disable generated data to contain edge cases based on the data type. For example, integer data type has edge cases of (Int.MaxValue, Int.MinValue and 0) edgeCaseProbability 0.0 edgeCaseProb: \"0.1\" Probability of generating a random edge case value if enableEdgeCase is true isUnique false isUnique: \"true\" Enable/disable generated data to be unique for that column. Errors will be thrown when it is unable to generate unique data seed seed: \"1\" Defines the random seed for generating data for that particular column. It will override any seed defined at a global level sql sql: \"CASE WHEN amount < 10 THEN true ELSE false END\" Define any SQL statement for generating that columns value. Computation occurs after all non-SQL fields are generated. This means any columns used in the SQL cannot be based on other SQL generated columns. Data type of generated value from SQL needs to match data type defined for the field"},{"location":"setup/generator/generator/#string","title":"String","text":"Option Default Example Description minLen 1 minLen: \"2\" Ensures that all generated strings have at least length minLen maxLen 10 maxLen: \"15\" Ensures that all generated strings have at most length maxLen expression expression: \"#{Name.name}\"expression:\"#{Address.city}/#{Demographic.maritalStatus}\" Will generate a string based on the faker expression provided. All possible faker expressions can be found here Expression has to be in format #{<faker expression name>} enableNull false enableNull: \"true\" Enable/disable null values being generated nullProbability 0.0 nullProb: \"0.1\" Probability to generate null values if enableNull is true

Edge cases: (\"\", \"\\n\", \"\\r\", \"\\t\", \" \", \"\\u0000\", \"\\ufff\", \"\u0130yi g\u00fcnler\", \"\u0421\u043f\u0430\u0441\u0438\u0431\u043e\", \"\u039a\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1\", \"\u0635\u0628\u0627\u062d \u0627\u0644\u062e\u064a\u0631\", \" F\u00f6rl\u00e5t\", \"\u4f60\u597d\u5417\", \"Nh\u00e0 v\u1ec7 sinh \u1edf \u0111\u00e2u\", \"\u3053\u3093\u306b\u3061\u306f\", \"\u0928\u092e\u0938\u094d\u0924\u0947\", \"\u0532\u0561\u0580\u0565\u0582\", \"\u0417\u0434\u0440\u0430\u0432\u0435\u0439\u0442\u0435\")

"},{"location":"setup/generator/generator/#sample","title":"Sample","text":"JavaScalaYAML
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield()\n.name(\"name\")\n.type(StringType.instance())\n.expression(\"#{Name.name}\")\n.enableNull(true)\n.nullProbability(0.1)\n.minLength(4)\n.maxLength(20)\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield\n.name(\"name\")\n.`type`(StringType)\n.expression(\"#{Name.name}\")\n.enableNull(true)\n.nullProbability(0.1)\n.minLength(4)\n.maxLength(20)\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\nschema:\nfields:\n- name: \"name\"\ntype: \"string\"\ngenerator:\noptions:\nexpression: \"#{Name.name}\"\nenableNull: true\nnullProb: 0.1\nminLength: 4\nmaxLength: 20\n
"},{"location":"setup/generator/generator/#numeric","title":"Numeric","text":"

For all the numeric data types, there are 4 options to choose from: min, max and maxValue. Generally speaking, you only need to define one of min or minValue, similarly with max or maxValue. The reason why there are 2 options for each is because of when metadata is automatically gathered, we gather the statistics of the observed min and max values. Also, it will attempt to gather any restriction on the min or max value as defined by the data source (i.e. max value as per database type).

"},{"location":"setup/generator/generator/#integerlongshort","title":"Integer/Long/Short","text":"Option Default Example Description min 0 min: \"2\" Ensures that all generated values are greater than or equal to min max 1000 max: \"25\" Ensures that all generated values are less than or equal to max stddev 1.0 stddev: \"2.0\" Standard deviation for normal distributed data mean max - min mean: \"5.0\" Mean for normal distributed data

Edge cases Integer: (2147483647, -2147483648, 0) Edge cases Long: (9223372036854775807, -9223372036854775808, 0) Edge cases Short: (32767, -32768, 0)

"},{"location":"setup/generator/generator/#sample_1","title":"Sample","text":"JavaScalaYAML
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"year\").type(IntegerType.instance()).min(2020).max(2023),\nfield().name(\"customer_id\").type(LongType.instance()),\nfield().name(\"customer_group\").type(ShortType.instance())\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"year\").`type`(IntegerType).min(2020).max(2023),\nfield.name(\"customer_id\").`type`(LongType),\nfield.name(\"customer_group\").`type`(ShortType)\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"year\"\ntype: \"integer\"\ngenerator:\noptions:\nmin: 2020\nmax: 2023\n- name: \"customer_id\"\ntype: \"long\"\n- name: \"customer_group\"\ntype: \"short\"\n
"},{"location":"setup/generator/generator/#decimal","title":"Decimal","text":"Option Default Example Description min 0 min: \"2\" Ensures that all generated values are greater than or equal to min max 1000 max: \"25\" Ensures that all generated values are less than or equal to max stddev 1.0 stddev: \"2.0\" Standard deviation for normal distributed data mean max - min mean: \"5.0\" Mean for normal distributed data numericPrecision 10 precision: \"25\" The maximum number of digits numericScale 0 scale: \"25\" The number of digits on the right side of the decimal point (has to be less than or equal to precision)

Edge cases Decimal: (9223372036854775807, -9223372036854775808, 0)

"},{"location":"setup/generator/generator/#sample_2","title":"Sample","text":"JavaScalaYAML
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"balance\").type(DecimalType.instance()).numericPrecision(10).numericScale(5)\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"balance\").`type`(DecimalType).numericPrecision(10).numericScale(5)\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"balance\"\ntype: \"decimal\"\ngenerator:\noptions:\nprecision: 10\nscale: 5\n
"},{"location":"setup/generator/generator/#doublefloat","title":"Double/Float","text":"Option Default Example Description min 0.0 min: \"2.1\" Ensures that all generated values are greater than or equal to min max 1000.0 max: \"25.9\" Ensures that all generated values are less than or equal to max stddev 1.0 stddev: \"2.0\" Standard deviation for normal distributed data mean max - min mean: \"5.0\" Mean for normal distributed data

Edge cases Double: (+infinity, 1.7976931348623157e+308, 4.9e-324, 0.0, -0.0, -1.7976931348623157e+308, -infinity, NaN) Edge cases Float: (+infinity, 3.4028235e+38, 1.4e-45, 0.0, -0.0, -3.4028235e+38, -infinity, NaN)

"},{"location":"setup/generator/generator/#sample_3","title":"Sample","text":"JavaScalaYAML
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"amount\").type(DoubleType.instance())\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"amount\").`type`(DoubleType)\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"amount\"\ntype: \"double\"\n
"},{"location":"setup/generator/generator/#date","title":"Date","text":"Option Default Example Description min now() - 365 days min: \"2023-01-31\" Ensures that all generated values are greater than or equal to min max now() max: \"2023-12-31\" Ensures that all generated values are less than or equal to max enableNull false enableNull: \"true\" Enable/disable null values being generated nullProbability 0.0 nullProb: \"0.1\" Probability to generate null values if enableNull is true

Edge cases: (0001-01-01, 1582-10-15, 1970-01-01, 9999-12-31) (reference)

"},{"location":"setup/generator/generator/#sample_4","title":"Sample","text":"JavaScalaYAML
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"created_date\").type(DateType.instance()).min(java.sql.Date.valueOf(\"2020-01-01\"))\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"created_date\").`type`(DateType).min(java.sql.Date.valueOf(\"2020-01-01\"))\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"created_date\"\ntype: \"date\"\ngenerator:\noptions:\nmin: \"2020-01-01\"\n
"},{"location":"setup/generator/generator/#timestamp","title":"Timestamp","text":"Option Default Example Description min now() - 365 days min: \"2023-01-31 23:10:10\" Ensures that all generated values are greater than or equal to min max now() max: \"2023-12-31 23:10:10\" Ensures that all generated values are less than or equal to max enableNull false enableNull: \"true\" Enable/disable null values being generated nullProbability 0.0 nullProb: \"0.1\" Probability to generate null values if enableNull is true

Edge cases: (0001-01-01 00:00:00, 1582-10-15 23:59:59, 1970-01-01 00:00:00, 9999-12-31 23:59:59)

"},{"location":"setup/generator/generator/#sample_5","title":"Sample","text":"JavaScalaYAML
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"created_time\").type(TimestampType.instance()).min(java.sql.Timestamp.valueOf(\"2020-01-01 00:00:00\"))\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"created_time\").`type`(TimestampType).min(java.sql.Timestamp.valueOf(\"2020-01-01 00:00:00\"))\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"created_time\"\ntype: \"timestamp\"\ngenerator:\noptions:\nmin: \"2020-01-01 00:00:00\"\n
"},{"location":"setup/generator/generator/#binary","title":"Binary","text":"Option Default Example Description minLen 1 minLen: \"2\" Ensures that all generated array of bytes have at least length minLen maxLen 20 maxLen: \"15\" Ensures that all generated array of bytes have at most length maxLen enableNull false enableNull: \"true\" Enable/disable null values being generated nullProbability 0.0 nullProb: \"0.1\" Probability to generate null values if enableNull is true

Edge cases: (\"\", \"\\n\", \"\\r\", \"\\t\", \" \", \"\\u0000\", \"\\ufff\", -128, 127)

"},{"location":"setup/generator/generator/#sample_6","title":"Sample","text":"JavaScalaYAML
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"payload\").type(BinaryType.instance())\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"payload\").`type`(BinaryType)\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"payload\"\ntype: \"binary\"\n
"},{"location":"setup/generator/generator/#array","title":"Array","text":"Option Default Example Description arrayMinLen 0 arrayMinLen: \"2\" Ensures that all generated arrays have at least length arrayMinLen arrayMaxLen 5 arrayMaxLen: \"15\" Ensures that all generated arrays have at most length arrayMaxLen arrayType arrayType: \"double\" Inner data type of the array. Optional when using Java/Scala API. Allows for nested data types to be defined like struct enableNull false enableNull: \"true\" Enable/disable null values being generated nullProbability 0.0 nullProb: \"0.1\" Probability to generate null values if enableNull is true"},{"location":"setup/generator/generator/#sample_7","title":"Sample","text":"JavaScalaYAML
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"last_5_amounts\").type(ArrayType.instance()).arrayType(\"double\")\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"last_5_amounts\").`type`(ArrayType).arrayType(\"double\")\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"last_5_amounts\"\ntype: \"array<double>\"\n
"},{"location":"setup/generator/report/","title":"Report","text":"

Data Caterer can be configured to produce a report of the data generated to help users understand what was run, how much data was generated, where it was generated, validation results and any associated metadata.

"},{"location":"setup/generator/report/#sample","title":"Sample","text":"

Once run, it will produce a report like this.

"},{"location":"setup/guide/","title":"Guides","text":"

Below are a list of guides you can follow to create your data generation for your use case.

For any of the paid tier guides, you can use the trial version fo the app to try it out. Details on how to get the trial can be found here.

"},{"location":"setup/guide/#scenarios","title":"Scenarios","text":""},{"location":"setup/guide/#data-sources","title":"Data Sources","text":""},{"location":"setup/guide/#yaml-files","title":"YAML Files","text":""},{"location":"setup/guide/#base-concept","title":"Base Concept","text":"

The execution of the data generator is based on the concept of plans and tasks. A plan represent the set of tasks that need to be executed, along with other information that spans across tasks, such as foreign keys between data sources. A task represent the component(s) of a data source and its associated metadata so that it understands what the data should look like and how many steps (sub data sources) there are (i.e. tables in a database, topics in Kafka). Tasks can define one or more steps.

"},{"location":"setup/guide/#plan","title":"Plan","text":""},{"location":"setup/guide/#foreign-keys","title":"Foreign Keys","text":"

Define foreign keys across data sources in your plan to ensure generated data can match Link to associated task 1 Link to associated task 2

"},{"location":"setup/guide/#task","title":"Task","text":"Data Source Type Data Source Sample Task Notes Database Postgres Sample Database MySQL Sample Database Cassandra Sample File CSV Sample File JSON Sample Contains nested schemas and use of SQL for generated values File Parquet Sample Partition by year column Kafka Kafka Sample Specific base schema to be used, define headers, key, value, etc. JMS Solace Sample JSON formatted message HTTP PUT Sample JSON formatted PUT body"},{"location":"setup/guide/#configuration","title":"Configuration","text":"

Basic configuration

"},{"location":"setup/guide/#docker-compose","title":"Docker-compose","text":"

To see how it runs against different data sources, you can run using docker-compose and set DATA_SOURCE like below

./gradlew build\ncd docker\nDATA_SOURCE=postgres docker-compose up -d datacaterer\n

Can set it to one of the following:

"},{"location":"setup/guide/data-source/cassandra/","title":"Cassandra","text":"

Info

Writing data to Cassandra is a paid feature. Try the free trial here.

Creating a data generator for Cassandra. You will build a Docker image that will be able to populate data in Cassandra for the tables you configure.

"},{"location":"setup/guide/data-source/cassandra/#requirements","title":"Requirements","text":""},{"location":"setup/guide/data-source/cassandra/#get-started","title":"Get Started","text":"

First, we will clone the data-caterer-example repo which will already have the base project setup required.

git clone git@github.com:pflooky/data-caterer-example.git\n

If you already have a Cassandra instance running, you can skip to this step.

"},{"location":"setup/guide/data-source/cassandra/#cassandra-setup","title":"Cassandra Setup","text":"

Next, let's make sure you have an instance of Cassandra up and running in your local environment. This will make it easy for us to iterate and check our changes.

cd docker\ndocker-compose up -d cassandra\n
"},{"location":"setup/guide/data-source/cassandra/#permissions","title":"Permissions","text":"

Let's make a new user that has the required permissions needed to push data into the Cassandra tables we want.

CQL Permission Statements
GRANT INSERT ON <schema>.<table> TO data_caterer_user;\n

Following permissions are required when enabling configuration.enableGeneratePlanAndTasks(true) as it will gather metadata information about tables and columns from the below tables.

CQL Permission Statements
GRANT SELECT ON system_schema.tables TO data_caterer_user;\nGRANT SELECT ON system_schema.columns TO data_caterer_user;\n
"},{"location":"setup/guide/data-source/cassandra/#plan-setup","title":"Plan Setup","text":"

Create a new Java or Scala class.

Make sure your class extends PlanRun.

JavaScala
import com.github.pflooky.datacaterer.java.api.PlanRun;\n\npublic class MyAdvancedCassandraJavaPlan extends PlanRun {\n}\n
import com.github.pflooky.datacaterer.api.PlanRun\n\nclass MyAdvancedCassandraPlan extends PlanRun {\n}\n

This class defines where we need to define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.

"},{"location":"setup/guide/data-source/cassandra/#connection-configuration","title":"Connection Configuration","text":"

Within our class, we can start by defining the connection properties to connect to Cassandra.

JavaScala
var accountTask = cassandra(\n\"customer_cassandra\",   //name\n\"localhost:9042\",       //url\n\"cassandra\",            //username\n\"cassandra\",            //password\nMap.of()                //optional additional connection options\n)\n

Additional options such as SSL configuration, etc can be found here.

val accountTask = cassandra(\n\"customer_cassandra\",   //name\n\"localhost:9042\",       //url\n\"cassandra\",            //username\n\"cassandra\",            //password\nMap()                   //optional additional connection options\n)\n

Additional options such as SSL configuration, etc can be found here.

"},{"location":"setup/guide/data-source/cassandra/#schema","title":"Schema","text":"

Let's create a task for inserting data into the account.accounts and account.account_status_history tables as defined underdocker/data/cql/customer.cql. This table should already be setup for you if you followed this step. We can check if the table is setup already via the following command:

docker exec host.docker.internal cqlsh -e 'describe account.accounts; describe account.account_status_history;'\n

Here we should see some output that looks like the below. This tells us what schema we need to follow when generating data. We need to define that alongside any metadata that is useful to add constraints on what are possible values the generated data should contain.

CREATE TABLE account.accounts (\naccount_id text PRIMARY KEY,\n    amount double,\n    created_by text,\n    name text,\n    open_time timestamp,\n    status text\n)...\n\nCREATE TABLE account.account_status_history (\naccount_id text,\n    eod_date date,\n    status text,\n    updated_by text,\n    updated_time timestamp,\n    PRIMARY KEY (account_id, eod_date)\n)...\n

Trimming the connection details to work with the docker-compose Cassandra, we have a base Cassandra connection to define the table and schema required. Let's define each field along with their corresponding data type. You will notice that the text fields do not have a data type defined. This is because the default data type is StringType which corresponds to text in Cassandra.

JavaScala
{\nvar accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"amount\").type(DoubleType.instance()),\nfield().name(\"created_by\"),\nfield().name(\"name\"),\nfield().name(\"open_time\").type(TimestampType.instance()),\nfield().name(\"status\")\n);\n}\n
val accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"amount\").`type`(DoubleType),\nfield.name(\"created_by\"),\nfield.name(\"name\"),\nfield.name(\"open_time\").`type`(TimestampType),\nfield.name(\"status\")\n)\n
"},{"location":"setup/guide/data-source/cassandra/#field-metadata","title":"Field Metadata","text":"

We could stop here and generate random data for the accounts table. But wouldn't it be more useful if we produced data that is closer to the structure of the data that would come in production? We can do this by defining various metadata that add guidelines that the data generator will understand when generating data.

"},{"location":"setup/guide/data-source/cassandra/#account_id","title":"account_id","text":"

account_id follows a particular pattern that where it starts with ACC and has 8 digits after it. This can be defined via a regex like below. Alongside, we also mention that it is the primary key to prompt ensure that unique values are generated.

JavaScala
field().name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\n
field.name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\n
"},{"location":"setup/guide/data-source/cassandra/#amount","title":"amount","text":"

amount the numbers shouldn't be too large, so we can define a min and max for the generated numbers to be between 1 and 1000.

JavaScala
field().name(\"amount\").type(DoubleType.instance()).min(1).max(1000),\n
field.name(\"amount\").`type`(DoubleType).min(1).max(1000),\n
"},{"location":"setup/guide/data-source/cassandra/#name","title":"name","text":"

name is a string that also follows a certain pattern, so we could also define a regex but here we will choose to leverage the DataFaker library and create an expression to generate real looking name. All possible faker expressions can be found here

JavaScala
field().name(\"name\").expression(\"#{Name.name}\"),\n
field.name(\"name\").expression(\"#{Name.name}\"),\n
"},{"location":"setup/guide/data-source/cassandra/#open_time","title":"open_time","text":"

open_time is a timestamp that we want to have a value greater than a specific date. We can define a min date by using java.sql.Date like below.

JavaScala
field().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\n
field.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\n
"},{"location":"setup/guide/data-source/cassandra/#status","title":"status","text":"

status is a field that can only obtain one of four values, open, closed, suspended or pending.

JavaScala
field().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n
field.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n
"},{"location":"setup/guide/data-source/cassandra/#created_by","title":"created_by","text":"

created_by is a field that is based on the status field where it follows the logic: if status is open or closed, then it is created_by eod else created_by event. This can be achieved by defining a SQL expression like below.

JavaScala
field().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n
field.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n

Putting it all the fields together, our class should now look like this.

JavaScala
var accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n
val accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield.name(\"account_id\").primaryKey(true),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n
"},{"location":"setup/guide/data-source/cassandra/#additional-configurations","title":"Additional Configurations","text":"

At the end of data generation, a report gets generated that summarises the actions it performed. We can control the output folder of that report via configurations. We will also enable the unique check to ensure any unique fields will have unique values generated.

JavaScala
var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n
val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n
"},{"location":"setup/guide/data-source/cassandra/#execute","title":"Execute","text":"

To tell Data Caterer that we want to run with the configurations along with the accountTask, we have to call execute . So our full plan run will look like this.

JavaScala
public class MyAdvancedCassandraJavaPlan extends PlanRun {\n{\nvar accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nexecute(config, accountTask);\n}\n}\n
class MyAdvancedCassandraPlan extends PlanRun {\nval accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield.name(\"account_id\").primaryKey(true),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n\nexecute(config, accountTask)\n}\n
"},{"location":"setup/guide/data-source/cassandra/#run","title":"Run","text":"

Now we can run via the script ./run.sh that is in the top level directory of the data-caterer-example to run the class we just created.

./run.sh\n#input class MyAdvancedCassandraJavaPlan or MyAdvancedCassandraPlan\n#after completing\ndocker exec docker-cassandraserver-1 cqlsh -e 'select count(1) from account.accounts;select * from account.accounts limit 10;'\n

Your output should look like this.

 count\n-------\n  1000\n\n(1 rows)\n\nWarnings :\nAggregation query used without partition key\n\n\n account_id  | amount    | created_by         | name                   | open_time                       | status\n-------------+-----------+--------------------+------------------------+---------------------------------+-----------\n ACC13554145 | 917.00418 | zb CVvbBTTzitjo5fK |          Jan Sanford I | 2023-06-21 21:50:10.463000+0000 | suspended\n ACC19154140 |  46.99177 |             VH88H9 |       Clyde Bailey PhD | 2023-07-18 11:33:03.675000+0000 |      open\n ACC50587836 |  774.9872 |         GENANwPm t |           Sang Monahan | 2023-03-21 00:16:53.308000+0000 |    closed\n ACC67619387 | 452.86706 |       5msTpcBLStTH |         Jewell Gerlach | 2022-10-18 19:13:07.606000+0000 | suspended\n ACC69889784 |  14.69298 |           WDmOh7NT |          Dale Schulist | 2022-10-25 12:10:52.239000+0000 | suspended\n ACC41977254 |  51.26492 |          J8jAKzvj2 |           Norma Nienow | 2023-08-19 18:54:39.195000+0000 | suspended\n ACC40932912 | 349.68067 |   SLcJgKZdLp5ALMyg | Vincenzo Considine III | 2023-05-16 00:22:45.991000+0000 |    closed\n ACC20642011 | 658.40713 |          clyZRD4fI |  Lannie McLaughlin DDS | 2023-05-11 23:14:30.249000+0000 |      open\n ACC74962085 | 970.98218 |       ZLETTSnj4NpD |          Ima Jerde DVM | 2023-05-07 10:01:56.218000+0000 |   pending\n ACC72848439 | 481.64267 |                 cc |        Kyla Deckow DDS | 2023-08-16 13:28:23.362000+0000 | suspended\n\n(10 rows)\n

Also check the HTML report, found at docker/sample/report/index.html, that gets generated to get an overview of what was executed.

"},{"location":"setup/guide/data-source/http/","title":"HTTP Source","text":"

Info

Generating data based on OpenAPI/Swagger document and pushing to HTTP endpoint is a paid feature. Try the free trial here.

Creating a data generator based on an OpenAPI/Swagger document.

"},{"location":"setup/guide/data-source/http/#requirements","title":"Requirements","text":""},{"location":"setup/guide/data-source/http/#get-started","title":"Get Started","text":"

First, we will clone the data-caterer-example repo which will already have the base project setup required.

git clone git@github.com:pflooky/data-caterer-example.git\n
"},{"location":"setup/guide/data-source/http/#http-setup","title":"HTTP Setup","text":"

We will be using the http-bin docker image to help simulate a service with HTTP endpoints.

Start it via:

cd docker\ndocker-compose up -d http\ndocker ps\n
"},{"location":"setup/guide/data-source/http/#plan-setup","title":"Plan Setup","text":"

Create a new Java or Scala class.

Make sure your class extends PlanRun.

JavaScala
import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedHttpJavaPlanRun extends PlanRun {\n{\nvar conf = configuration().enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n}\n}\n
import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedHttpPlanRun extends PlanRun {\nval conf = configuration.enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n}\n

We will enable generate plan and tasks so that we can read from external sources for metadata and save the reports under a folder we can easily access.

"},{"location":"setup/guide/data-source/http/#schema","title":"Schema","text":"

We can point the schema of a data source to a OpenAPI/Swagger document or URL. For this example, we will use the OpenAPI document found under docker/mount/http/petstore.json in the data-caterer-example repo. This is a simplified version of the original OpenAPI spec that can be found here.

We have kept the following endpoints to test out:

JavaScala
var httpTask = http(\"my_http\")\n.schema(metadataSource().openApi(\"/opt/app/mount/http/petstore.json\"))\n.count(count().records(2));\n
val httpTask = http(\"my_http\")\n.schema(metadataSource.openApi(\"/opt/app/mount/http/petstore.json\"))\n.count(count.records(2))\n

The above defines that the schema will come from an OpenAPI document found on the pathway defined. It will then generate 2 requests per request method and endpoint combination.

"},{"location":"setup/guide/data-source/http/#run","title":"Run","text":"

Let's try run and see what happens.

cd ..\n./run.sh\n#input class MyAdvancedHttpJavaPlanRun or MyAdvancedHttpPlanRun\n#after completing\ndocker logs -f docker-http-1\n

It should look something like this.

172.21.0.1 [06/Nov/2023:01:06:53 +0000] GET /anything/pets?tags%3DeXQxFUHVja+EYm%26limit%3D33895 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:06:53 +0000] GET /anything/pets?tags%3DSXaFvAqwYGF%26tags%3DjdNRFONA%26limit%3D40975 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:06:56 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:06:56 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:00 +0000] GET /anything/pets/kbH8D7rDuq HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:00 +0000] GET /anything/pets/REsa0tnu7dvekGDvxR HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:03 +0000] DELETE /anything/pets/EqrOr1dHFfKUjWb HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:03 +0000] DELETE /anything/pets/7WG7JHPaNxP HTTP/1.1 200 Host: host.docker.internal}\n

Looks like we have some data now. But we can do better and add some enhancements to it.

"},{"location":"setup/guide/data-source/http/#foreign-keys","title":"Foreign keys","text":"

The four different requests that get sent could have the same id passed across to each of them if we define a foreign key relationship. This will make it more realistic to a real life scenario as pets get created and queried by a particular id value. We note that the id value is first used when a pet is created in the body of the POST request. Then it gets used as a path parameter in the DELETE and GET requests.

To link them all together, we must follow a particular pattern when referring to request body, query parameter or path parameter columns.

HTTP Type Column Prefix Example Request Body bodyContent bodyContent.id Path Parameter pathParam pathParamid Query Parameter queryParam queryParamid Header header headerContent_Type

Also note, that when creating a foreign field definition for a HTTP data source, to refer to a specific endpoint and method, we have to follow the pattern of {http method}{http path}. For example, POST/pets. Let's apply this knowledge to link all the id values together.

JavaScala
var myPlan = plan().addForeignKeyRelationship(\nforeignField(\"my_http\", \"POST/pets\", \"bodyContent.id\"),     //source of foreign key value\nforeignField(\"my_http\", \"DELETE/pets/{id}\", \"pathParamid\"),\nforeignField(\"my_http\", \"GET/pets/{id}\", \"pathParamid\")\n);\n\nexecute(myPlan, conf, httpTask);\n
val myPlan = plan.addForeignKeyRelationship(\nforeignField(\"my_http\", \"POST/pets\", \"bodyContent.id\"),     //source of foreign key value\nforeignField(\"my_http\", \"DELETE/pets/{id}\", \"pathParamid\"),\nforeignField(\"my_http\", \"GET/pets/{id}\", \"pathParamid\")\n)\n\nexecute(myPlan, conf, httpTask)\n

Let's test it out by running it again

./run.sh\n#input class MyAdvancedHttpJavaPlanRun or MyAdvancedHttpPlanRun\ndocker logs -f docker-http-1\n
172.21.0.1 [06/Nov/2023:01:33:59 +0000] GET /anything/pets?limit%3D45971 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:00 +0000] GET /anything/pets?limit%3D62015 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:04 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:05 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:09 +0000] DELETE /anything/pets/5e HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:09 +0000] DELETE /anything/pets/IHPm2 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:14 +0000] GET /anything/pets/IHPm2 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:14 +0000] GET /anything/pets/5e HTTP/1.1 200 Host: host.docker.internal}\n

Now we have the same id values being produced across the POST, DELETE and GET requests! What if we knew that the id values should follow a particular pattern?

"},{"location":"setup/guide/data-source/http/#custom-metadata","title":"Custom metadata","text":"

So given that we have defined a foreign key where the root of the foreign key values is from the POST request, we can update the metadata of the id column for the POST request and it will proliferate to the other endpoints as well. Given the id column is a nested column as noted in the foreign key, we can alter its metadata via the following:

JavaScala
var httpTask = http(\"my_http\")\n.schema(metadataSource().openApi(\"/opt/app/mount/http/petstore.json\"))\n.schema(field().name(\"bodyContent\").schema(field().name(\"id\").regex(\"ID[0-9]{8}\")))\n.count(count().records(2));\n
val httpTask = http(\"my_http\")\n.schema(metadataSource.openApi(\"/opt/app/mount/http/petstore.json\"))\n.schema(field.name(\"bodyContent\").schema(field.name(\"id\").regex(\"ID[0-9]{8}\")))\n.count(count.records(2))\n

We first get the column bodyContent, then get the nested schema and get the column id and add metadata stating that id should follow the patter ID[0-9]{8}.

Let's try run again, and hopefully we should see some proper ID values.

./run.sh\n#input class MyAdvancedHttpJavaPlanRun or MyAdvancedHttpPlanRun\ndocker logs -f docker-http-1\n
172.21.0.1 [06/Nov/2023:01:45:45 +0000] GET /anything/pets?tags%3D10fWnNoDz%26limit%3D66804 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:46 +0000] GET /anything/pets?tags%3DhyO6mI8LZUUpS HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:50 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:51 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:52 +0000] DELETE /anything/pets/ID55185420 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:52 +0000] DELETE /anything/pets/ID20618951 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:57 +0000] GET /anything/pets/ID55185420 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:57 +0000] GET /anything/pets/ID20618951 HTTP/1.1 200 Host: host.docker.internal}\n

Great! Now we have replicated a production-like flow of HTTP requests.

"},{"location":"setup/guide/data-source/http/#ordering","title":"Ordering","text":"

If you wanted to change the ordering of the requests, you can alter the order from within the OpenAPI/Swagger document. This is particularly useful when you want to simulate the same flow that users would take when utilising your application (i.e. create account, query account, update account).

"},{"location":"setup/guide/data-source/http/#rows-per-second","title":"Rows per second","text":"

By default, Data Caterer will push requests per method and endpoint at a rate of around 5 requests per second. If you want to alter this value, you can do so via the below configuration. The lowest supported requests per second is 1.

JavaScala
import com.github.pflooky.datacaterer.api.model.Constants;\n\n...\nvar httpTask = http(\"my_http\", Map.of(Constants.ROWS_PER_SECOND(), \"1\"))\n...\n
import com.github.pflooky.datacaterer.api.model.Constants.ROWS_PER_SECOND\n\n...\nval httpTask = http(\"my_http\", options = Map(ROWS_PER_SECOND -> \"1\"))\n...\n

Check out the full example under AdvancedHttpPlanRun in the example repo.

"},{"location":"setup/guide/data-source/kafka/","title":"Kafka","text":"

Info

Writing data to Kafka is a paid feature. Try the free trial here.

Creating a data generator for Kafka. You will build a Docker image that will be able to populate data in kafka for the topics you configure.

"},{"location":"setup/guide/data-source/kafka/#requirements","title":"Requirements","text":""},{"location":"setup/guide/data-source/kafka/#get-started","title":"Get Started","text":"

First, we will clone the data-caterer-example repo which will already have the base project setup required.

git clone git@github.com:pflooky/data-caterer-example.git\n

If you already have a Kafka instance running, you can skip to this step.

"},{"location":"setup/guide/data-source/kafka/#kafka-setup","title":"Kafka Setup","text":"

Next, let's make sure you have an instance of Kafka up and running in your local environment. This will make it easy for us to iterate and check our changes.

cd docker\ndocker-compose up -d kafka\n
"},{"location":"setup/guide/data-source/kafka/#plan-setup","title":"Plan Setup","text":"

Create a new Java or Scala class.

Make sure your class extends PlanRun.

JavaScala
import com.github.pflooky.datacaterer.java.api.PlanRun;\n\npublic class MyAdvancedKafkaJavaPlan extends PlanRun {\n}\n
import com.github.pflooky.datacaterer.api.PlanRun\n\nclass MyAdvancedKafkaPlan extends PlanRun {\n}\n

This class defines where we need to define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.

"},{"location":"setup/guide/data-source/kafka/#connection-configuration","title":"Connection Configuration","text":"

Within our class, we can start by defining the connection properties to connect to Kafka.

JavaScala
var accountTask = kafka(\n\"my_kafka\",       //name\n\"localhost:9092\", //url\nMap.of()          //optional additional connection options\n);\n

Additional options can be found here.

val accountTask = kafka(\n\"my_kafka\",       //name\n\"localhost:9092\", //url\nMap()             //optional additional connection options\n)\n

Additional options can be found here.

"},{"location":"setup/guide/data-source/kafka/#schema","title":"Schema","text":"

Let's create a task for inserting data into the account-topic that is already defined underdocker/data/kafka/setup_kafka.sh. This topic should already be setup for you if you followed this step. We can check if the topic is set up already via the following command:

docker exec docker-kafkaserver-1 kafka-topics --bootstrap-server localhost:9092 --list\n

Trimming the connection details to work with the docker-compose Kafka, we have a base Kafka connection to define the topic we will publish to. Let's define each field along with their corresponding data type. You will notice that the text fields do not have a data type defined. This is because the default data type is StringType.

JavaScala
{\nvar kafkaTask = kafka(\"my_kafka\", \"kafkaserver:29092\")\n.topic(\"account-topic\")\n.schema(\nfield().name(\"key\").sql(\"content.account_id\"),\nfield().name(\"value\").sql(\"TO_JSON(content)\"),\n//field().name(\"partition\").type(IntegerType.instance()),  can define partition here\nfield().name(\"headers\")\n.type(ArrayType.instance())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n),\nfield().name(\"content\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"year\").type(IntegerType.instance()),\nfield().name(\"amount\").type(DoubleType.instance()),\nfield().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n),\nfield().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n),\nfield().name(\"tmp_year\").sql(\"content.year\").omit(true),\nfield().name(\"tmp_name\").sql(\"content.details.name\").omit(true)\n)\n}\n
val kafkaTask = kafka(\"my_kafka\", \"kafkaserver:29092\")\n.topic(\"account-topic\")\n.schema(\nfield.name(\"key\").sql(\"content.account_id\"),\nfield.name(\"value\").sql(\"TO_JSON(content)\"),\n//field.name(\"partition\").type(IntegerType),  can define partition here\nfield.name(\"headers\")\n.`type`(ArrayType)\n.sql(\n\"\"\"ARRAY(\n          |  NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n          |  NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n          |)\"\"\".stripMargin\n),\nfield.name(\"content\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"year\").`type`(IntegerType).min(2021).max(2023),\nfield.name(\"amount\").`type`(DoubleType),\nfield.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n),\nfield.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n),\nfield.name(\"tmp_year\").sql(\"content.year\").omit(true),\nfield.name(\"tmp_name\").sql(\"content.details.name\").omit(true)\n)\n
"},{"location":"setup/guide/data-source/kafka/#fields","title":"Fields","text":"

The schema defined for Kafka has a format that needs to be followed as noted above. Specifically, the required fields are: - value

Whilst, the other fields are optional: - key - partition - headers

"},{"location":"setup/guide/data-source/kafka/#headers","title":"headers","text":"

headers follows a particular pattern that where it is of type array<struct<key: string,value: binary>>. To be able to generate data for this data type, we need to use an SQL expression like the one below. You will notice that in the value part, it refers to content.account_id where content is another field defined at the top level of the schema. This allows you to reference other values that have already been generated.

JavaScala
field().name(\"headers\")\n.type(ArrayType.instance())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n)\n
field.name(\"headers\")\n.`type`(ArrayType)\n.sql(\n\"\"\"ARRAY(\n      |  NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n      |  NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n      |)\"\"\".stripMargin\n)\n
"},{"location":"setup/guide/data-source/kafka/#transactions","title":"transactions","text":"

transactions is an array that contains an inner structure of txn_date and amount. The size of the array generated can be controlled via arrayMinLength and arrayMaxLength.

JavaScala
field().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n
field.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n
"},{"location":"setup/guide/data-source/kafka/#details","title":"details","text":"

details is another example of a nested schema structure where it also has a nested structure itself in updated_by. One thing to note here is the first_txn_date field has a reference to the content.transactions array where it will sort the array by txn_date and get the first element.

JavaScala
field().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n)\n
field.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n)\n
"},{"location":"setup/guide/data-source/kafka/#additional-configurations","title":"Additional Configurations","text":"

At the end of data generation, a report gets generated that summarises the actions it performed. We can control the output folder of that report via configurations.

JavaScala
var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n
val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n
"},{"location":"setup/guide/data-source/kafka/#execute","title":"Execute","text":"

To tell Data Caterer that we want to run with the configurations along with the kafkaTask, we have to call execute .

"},{"location":"setup/guide/data-source/kafka/#run","title":"Run","text":"

Now we can run via the script ./run.sh that is in the top level directory of the data-caterer-example to run the class we just created.

./run.sh\n#input class AdvancedKafkaJavaPlanRun or AdvancedKafkaPlanRun\n#after completing\ndocker exec docker-kafkaserver-1 kafka-console-consumer --bootstrap-server localhost:9092 --topic account-topic --from-beginning\n

Your output should look like this.

{\"account_id\":\"ACC56292178\",\"year\":2022,\"amount\":18338.627721151555,\"details\":{\"name\":\"Isaias Reilly\",\"first_txn_date\":\"2021-01-22\",\"updated_by\":{\"user\":\"FgYXbKDWdhHVc3\",\"time\":\"2022-12-30T13:49:07.309Z\"}},\"transactions\":[{\"txn_date\":\"2021-01-22\",\"amount\":30556.52125487579},{\"txn_date\":\"2021-10-29\",\"amount\":39372.302259554635},{\"txn_date\":\"2021-10-29\",\"amount\":61887.31389495968}]}\n{\"account_id\":\"ACC37729457\",\"year\":2022,\"amount\":96885.31758764731,\"details\":{\"name\":\"Randell Witting\",\"first_txn_date\":\"2021-06-30\",\"updated_by\":{\"user\":\"HCKYEBHN8AJ3TB\",\"time\":\"2022-12-02T02:05:01.144Z\"}},\"transactions\":[{\"txn_date\":\"2021-06-30\",\"amount\":98042.09647765031},{\"txn_date\":\"2021-10-06\",\"amount\":41191.43564742036},{\"txn_date\":\"2021-11-16\",\"amount\":78852.08184809204},{\"txn_date\":\"2021-10-09\",\"amount\":13747.157653571106}]}\n{\"account_id\":\"ACC23127317\",\"year\":2023,\"amount\":81164.49304198896,\"details\":{\"name\":\"Jed Wisozk\",\"updated_by\":{\"user\":\"9MBFZZ\",\"time\":\"2023-07-12T05:56:52.397Z\"}},\"transactions\":[]}\n

Also check the HTML report, found at docker/sample/report/index.html, that gets generated to get an overview of what was executed.

"},{"location":"setup/guide/data-source/marquez-metadata-source/","title":"Metadata Source","text":"

Info

Generating data based on an external metadata source is a paid feature. Try the free trial here.

Creating a data generator for Postgres tables and CSV file based on metadata stored in Marquez ( follows OpenLineage API).

"},{"location":"setup/guide/data-source/marquez-metadata-source/#requirements","title":"Requirements","text":""},{"location":"setup/guide/data-source/marquez-metadata-source/#get-started","title":"Get Started","text":"

First, we will clone the data-caterer-example repo which will already have the base project setup required.

git clone git@github.com:pflooky/data-caterer-example.git\n
"},{"location":"setup/guide/data-source/marquez-metadata-source/#marquez-setup","title":"Marquez Setup","text":"

You can follow the README found here to help with setting up Marquez in your local environment. This comes with an instance of Postgres which we will also be using as a data store for generated data.

The command that was run for this example to help with setup of dummy data was ./docker/up.sh -a 5001 -m 5002 --seed.

Check that the following url shows some data like below once you click on food_delivery from the ns drop down in the top right corner.

"},{"location":"setup/guide/data-source/marquez-metadata-source/#postgres-setup","title":"Postgres Setup","text":"

Since we will also be using the Marquez Postgres instance as a data source, we will set up a separate database to store the generated data in via:

docker exec marquez-db psql -Upostgres -c 'CREATE DATABASE food_delivery'\n
"},{"location":"setup/guide/data-source/marquez-metadata-source/#plan-setup","title":"Plan Setup","text":"

Create a new Java or Scala class.

Make sure your class extends PlanRun.

JavaScala
import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedMetadataSourceJavaPlanRun extends PlanRun {\n{\nvar conf = configuration().enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n}\n}\n
import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedMetadataSourcePlanRun extends PlanRun {\nval conf = configuration.enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n}\n

We will enable generate plan and tasks so that we can read from external sources for metadata and save the reports under a folder we can easily access.

"},{"location":"setup/guide/data-source/marquez-metadata-source/#schema","title":"Schema","text":"

We can point the schema of a data source to our Marquez instance. For the Postgres data source, we will point to a namespace, which in Marquez or OpenLineage, represents a set of datasets. For the CSV data source, we will point to a specific namespace and dataset.

"},{"location":"setup/guide/data-source/marquez-metadata-source/#single-schema","title":"Single Schema","text":"JavaScala
var csvTask = csv(\"my_csv\", \"/tmp/data/csv\", Map.of(\"saveMode\", \"overwrite\", \"header\", \"true\"))\n.schema(metadataSource().marquez(\"http://localhost:5001\", \"food_delivery\", \"public.delivery_7_days\"))\n.count(count().records(10));\n
val csvTask = csv(\"my_csv\", \"/tmp/data/csv\", Map(\"saveMode\" -> \"overwrite\", \"header\" -> \"true\"))\n.schema(metadataSource.marquez(\"http://localhost:5001\", \"food_delivery\", \"public.delivery_7_days\"))\n.count(count.records(10))\n

The above defines that the schema will come from Marquez, which is a type of metadata source that contains information about schemas. Specifically, it points to the food_delivery namespace and public.categories dataset to retrieve the schema information from.

"},{"location":"setup/guide/data-source/marquez-metadata-source/#multiple-schemas","title":"Multiple Schemas","text":"JavaScala
var postgresTask = postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/food_delivery\", \"postgres\", \"password\", Map.of())\n.schema(metadataSource().marquez(\"http://host.docker.internal:5001\", \"food_delivery\"))\n.count(count().records(10));\n
val postgresTask = postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/food_delivery\", \"postgres\", \"password\")\n.schema(metadataSource.marquez(\"http://host.docker.internal:5001\", \"food_delivery\"))\n.count(count.records(10))\n

We now have pointed this Postgres instance to produce multiple schemas that are defined under the food_delivery namespace. Also note that we are using database food_delivery in Postgres to push our generated data to, and we have set the number of records per sub data source (in this case, per table) to be 10.

"},{"location":"setup/guide/data-source/marquez-metadata-source/#run","title":"Run","text":"

Let's try run and see what happens.

cd ..\n./run.sh\n#input class MyAdvancedMetadataSourceJavaPlanRun or MyAdvancedMetadataSourcePlanRun\n#after completing\ndocker exec marquez-db psql -Upostgres -d food_delivery -c 'SELECT * FROM public.delivery_7_days'\n

It should look something like this.

 order_id |     order_placed_on     |   order_dispatched_on   |   order_delivered_on    |         customer_email         |                     customer_address                     | menu_id | restaurant_id |                        restaurant_address\n   | menu_item_id | category_id | discount_id | city_id | driver_id\n----------+-------------------------+-------------------------+-------------------------+--------------------------------+----------------------------------------------------------+---------+---------------+---------------------------------------------------------------\n---+--------------+-------------+-------------+---------+-----------\n    38736 | 2023-02-05 06:05:23.755 | 2023-09-08 04:29:10.878 | 2023-09-03 23:58:34.285 | april.skiles@hotmail.com       | 5018 Lang Dam, Gaylordfurt, MO 35172                     |   59841 |         30971 | Suite 439 51366 Bartoletti Plains, West Lashawndamouth, CA 242\n42 |        55697 |       36370 |       21574 |   88022 |     16569\n4376 | 2022-12-19 14:39:53.442 | 2023-08-30 07:40:06.948 | 2023-03-15 20:38:26.11  | adelina.balistreri@hotmail.com | Apt. 340 9146 Novella Motorway, East Troyhaven, UT 34773 |   66195 |         42765 | Suite 670 8956 Rob Fork, Rennershire, CA 04524\n|        26516 |       81335 |       87615 |   27433 |     45649\n11083 | 2022-10-30 12:46:38.692 | 2023-06-02 13:05:52.493 | 2022-11-27 18:38:07.873 | johnny.gleason@gmail.com       | Apt. 385 99701 Lemke Place, New Irvin, RI 73305          |   66427 |         44438 | 1309 Danny Cape, Weimanntown, AL 15865\n|        41686 |       36508 |       34498 |   24191 |     92405\n58759 | 2023-07-26 14:32:30.883 | 2022-12-25 11:04:08.561 | 2023-04-21 17:43:05.86  | isabelle.ohara@hotmail.com     | 2225 Evie Lane, South Ardella, SD 90805                  |   27106 |         25287 | Suite 678 3731 Dovie Park, Port Luigi, ID 08250\n|        94205 |       66207 |       81051 |   52553 |     27483\n

You can also try query some other tables. Let's also check what is in the CSV file.

$ head docker/sample/csv/part-0000*\nmenu_item_id,category_id,discount_id,city_id,driver_id,order_id,order_placed_on,order_dispatched_on,order_delivered_on,customer_email,customer_address,menu_id,restaurant_id,restaurant_address\n72248,37098,80135,45888,5036,11090,2023-09-20T05:33:08.036+08:00,2023-05-16T23:10:57.119+08:00,2023-05-01T22:02:23.272+08:00,demetrice.rohan@hotmail.com,\"406 Harmony Rue, Wisozkburgh, MD 12282\",33762,9042,\"Apt. 751 0796 Ellan Flats, Lake Chetville, WI 81957\"\n41644,40029,48565,83373,89919,58359,2023-04-18T06:28:26.194+08:00,2022-10-15T18:17:48.998+08:00,2023-02-06T17:02:04.104+08:00,joannie.okuneva@yahoo.com,\"Suite 889 022 Susan Lane, Zemlakport, OR 56996\",27467,6216,\"Suite 016 286 Derick Grove, Dooleytown, NY 14664\"\n49299,53699,79675,40821,61764,72234,2023-07-16T21:33:48.739+08:00,2023-02-14T21:23:10.265+08:00,2023-09-18T02:08:51.433+08:00,ina.heller@yahoo.com,\"Suite 600 86844 Heller Island, New Celestinestad, DE 42622\",48002,12462,\"5418 Okuneva Mountain, East Blairchester, MN 04060\"\n83197,86141,11085,29944,81164,65382,2023-01-20T06:08:25.981+08:00,2023-01-11T13:24:32.968+08:00,2023-09-09T02:30:16.890+08:00,lakisha.bashirian@yahoo.com,\"Suite 938 534 Theodore Lock, Port Caitlynland, LA 67308\",69109,47727,\"4464 Stewart Tunnel, Marguritemouth, AR 56791\"\n

Looks like we have some data now. But we can do better and add some enhancements to it.

What if we wanted the same records in Postgres public.delivery_7_days to also show up in the CSV file? That's where we can use a foreign key definition.

"},{"location":"setup/guide/data-source/marquez-metadata-source/#foreign-key","title":"Foreign Key","text":"

We can take a look at the report (under docker/sample/report/index.html) to see what we need to do to create the foreign key. From the overview, you should see under Tasks there is a my_postgres task which has food_delivery_public.delivery_7_days as a step. Click on the link for food_delivery_public.delivery_7_days and it will take us to a page where we can find out about the columns used in this table. Click on the Fields button on the far right to see.

We can copy all of a subset of fields that we want matched across the CSV file and Postgres. For this example, we will take all the fields.

JavaScala
var myPlan = plan().addForeignKeyRelationship(\npostgresTask, List.of(\"key\", \"tmp_year\", \"tmp_name\", \"value\"),\nList.of(Map.entry(csvTask, List.of(\"account_number\", \"year\", \"name\", \"payload\")))\n);\n\nvar conf = ...\n\nexecute(myPlan, conf, postgresTask, csvTask);\n
val foreignCols = List(\"order_id\", \"order_placed_on\", \"order_dispatched_on\", \"order_delivered_on\", \"customer_email\",\n\"customer_address\", \"menu_id\", \"restaurant_id\", \"restaurant_address\", \"menu_item_id\", \"category_id\", \"discount_id\",\n\"city_id\", \"driver_id\")\n\nval myPlan = plan.addForeignKeyRelationships(\ncsvTask, foreignCols,\nList(foreignField(postgresTask, \"food_delivery_public.delivery_7_days\", foreignCols))\n)\n\nval conf = ...\n\nexecute(myPlan, conf, postgresTask, csvTask)\n

Notice how we have defined the csvTask and foreignCols as the main foreign key but for postgresTask, we had to define it as a foreignField. This is because postgresTask has multiple tables within it, and we only want to define our foreign key with respect to the public.delivery_7_days table. We use the step name (can be seen from the report) to specify the table to target.

To test this out, we will truncate the public.delivery_7_days table in Postgres first, and then try run again.

docker exec marquez-db psql -Upostgres -d food_delivery -c 'TRUNCATE public.delivery_7_days'\n./run.sh\n#input class MyAdvancedMetadataSourceJavaPlanRun or MyAdvancedMetadataSourcePlanRun\ndocker exec marquez-db psql -Upostgres -d food_delivery -c 'SELECT * FROM public.delivery_7_days'\n
 order_id |     order_placed_on     |   order_dispatched_on   |   order_delivered_on    |        customer_email        |\ncustomer_address                     | menu_id | restaurant_id |                   restaurant_address                   | menu\n_item_id | category_id | discount_id | city_id | driver_id\n----------+-------------------------+-------------------------+-------------------------+------------------------------+-------------\n--------------------------------------------+---------+---------------+--------------------------------------------------------+-----\n---------+-------------+-------------+---------+-----------\n    53333 | 2022-10-15 08:40:23.394 | 2023-01-23 09:42:48.397 | 2023-08-12 08:50:52.397 | normand.aufderhar@gmail.com  | Apt. 036 449\n27 Wilderman Forge, Marvinchester, CT 15952 |   40412 |         70130 | Suite 146 98176 Schaden Village, Grahammouth, SD 12354 |\n90141 |       44210 |       83966 |   78614 |     77449\n

Let's grab the first email from the Postgres table and check whether the same record exists in the CSV file.

$ cat docker/sample/csv/part-0000* | grep normand.aufderhar\n90141,44210,83966,78614,77449,53333,2022-10-15T08:40:23.394+08:00,2023-01-23T09:42:48.397+08:00,2023-08-12T08:50:52.397+08:00,normand.aufderhar@gmail.com,\"Apt. 036 44927 Wilderman Forge, Marvinchester, CT 15952\",40412,70130,\"Suite 146 98176 Schaden Village, Grahammouth, SD 12354\"\n

Great! Now we have the ability to get schema information from an external source, add our own foreign keys and generate data.

Check out the full example under AdvancedMetadataSourcePlanRun in the example repo.

"},{"location":"setup/guide/data-source/open-metadata-source/","title":"OpenMetadata Source","text":"

Info

Generating data based on an external metadata source is a paid feature. Try the free trial here.

Creating a data generator for a JSON file based on metadata stored in OpenMetadata.

"},{"location":"setup/guide/data-source/open-metadata-source/#requirements","title":"Requirements","text":""},{"location":"setup/guide/data-source/open-metadata-source/#get-started","title":"Get Started","text":"

First, we will clone the data-caterer-example repo which will already have the base project setup required.

git clone git@github.com:pflooky/data-caterer-example.git\n
"},{"location":"setup/guide/data-source/open-metadata-source/#openmetadata-setup","title":"OpenMetadata Setup","text":"

You can follow the local docker setup found here to help with setting up OpenMetadata in your local environment.

If that page becomes outdated or the link doesn't work, below are the commands I used to run it:

mkdir openmetadata-docker && cd openmetadata-docker\ncurl -sL https://github.com/open-metadata/OpenMetadata/releases/download/1.2.0-release/docker-compose.yml > docker-compose.yml\ndocker compose -f docker-compose.yml up --detach\n

Check that the following url works and login with admin:admin. Then you should see some data like below:

"},{"location":"setup/guide/data-source/open-metadata-source/#plan-setup","title":"Plan Setup","text":"

Create a new Java or Scala class.

Make sure your class extends PlanRun.

JavaScala
import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedOpenMetadataSourceJavaPlanRun extends PlanRun {\n{\nvar conf = configuration().enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n}\n}\n
import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedOpenMetadataSourcePlanRun extends PlanRun {\nval conf = configuration.enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n}\n

We will enable generate plan and tasks so that we can read from external sources for metadata and save the reports under a folder we can easily access.

"},{"location":"setup/guide/data-source/open-metadata-source/#schema","title":"Schema","text":"

We can point the schema of a data source to our OpenMetadata instance. We will use a JSON data source so that we can show how nested data types are handled and how we could customise it.

"},{"location":"setup/guide/data-source/open-metadata-source/#single-schema","title":"Single Schema","text":"JavaScala
import com.github.pflooky.datacaterer.api.model.Constants;\n...\n\nvar jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map.of(\"saveMode\", \"overwrite\"))\n.schema(metadataSource().openMetadataJava(\n\"http://localhost:8585/api\",                                                              //url\nConstants.OPEN_METADATA_AUTH_TYPE_OPEN_METADATA(),                                        //auth type\nMap.of(                                                                                   //additional options (including auth options)\nConstants.OPEN_METADATA_JWT_TOKEN(), \"abc123\",                                        //get from settings/bots/ingestion-bot\nConstants.OPEN_METADATA_TABLE_FQN(), \"sample_data.ecommerce_db.shopify.raw_customer\"  //table fully qualified name\n)\n))\n.count(count().records(10));\n
import com.github.pflooky.datacaterer.api.model.Constants.{OPEN_METADATA_AUTH_TYPE_OPEN_METADATA, OPEN_METADATA_JWT_TOKEN, OPEN_METADATA_TABLE_FQN, SAVE_MODE}\n...\n\nval jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map(\"saveMode\" -> \"overwrite\"))\n.schema(metadataSource.openMetadata(\n\"http://localhost:8585/api\",                                                  //url\nOPEN_METADATA_AUTH_TYPE_OPEN_METADATA,                                        //auth type\nMap(                                                                          //additional options (including auth options)\nOPEN_METADATA_JWT_TOKEN -> \"abc123\",                                        //get from settings/bots/ingestion-bot\nOPEN_METADATA_TABLE_FQN -> \"sample_data.ecommerce_db.shopify.raw_customer\"  //table fully qualified name\n)\n))\n.count(count.records(10))\n

The above defines that the schema will come from OpenMetadata, which is a type of metadata source that contains information about schemas. Specifically, it points to the sample_data.ecommerce_db.shopify.raw_customer table. You can check out the schema here to see what it looks like.

"},{"location":"setup/guide/data-source/open-metadata-source/#run","title":"Run","text":"

Let's try run and see what happens.

cd ..\n./run.sh\n#input class MyAdvancedOpenMetadataSourceJavaPlanRun or MyAdvancedOpenMetadataSourcePlanRun\n#after completing\ncat docker/sample/json/part-00000-*\n

It should look something like this.

{\n\"comments\": \"Mh6jqpD5e4M\",\n\"creditcard\": \"6771839575926717\",\n\"membership\": \"Za3wCQUl9E  EJj712\",\n\"orders\": [\n{\n\"product_id\": \"Aa6NG0hxfHVq\",\n\"price\": 16139,\n\"onsale\": false,\n\"tax\": 58134,\n\"weight\": 40734,\n\"others\": 45813,\n\"vendor\": \"Kh\"\n},\n{\n\"product_id\": \"zbHBY \",\n\"price\": 17903,\n\"onsale\": false,\n\"tax\": 39526,\n\"weight\": 9346,\n\"others\": 52035,\n\"vendor\": \"jbkbnXAa\"\n},\n{\n\"product_id\": \"5qs3gakppd7Nw5\",\n\"price\": 48731,\n\"onsale\": true,\n\"tax\": 81105,\n\"weight\": 2004,\n\"others\": 20465,\n\"vendor\": \"nozCDMSXRPH Ev\"\n},\n{\n\"product_id\": \"CA6h17ANRwvb\",\n\"price\": 62102,\n\"onsale\": true,\n\"tax\": 96601,\n\"weight\": 78849,\n\"others\": 79453,\n\"vendor\": \" ihVXEJz7E2EFS\"\n}\n],\n\"platform\": \"GLt9\",\n\"preference\": {\n\"key\": \"nmPmsPjg C\",\n\"value\": true\n},\n\"shipping_address\": [\n{\n\"name\": \"Loren Bechtelar\",\n\"street_address\": \"Suite 526 293 Rohan Road, Wunschshire, NE 25532\",\n\"city\": \"South Norrisland\",\n\"postcode\": \"56863\"\n}\n],\n\"shipping_date\": \"2022-11-03\",\n\"transaction_date\": \"2023-02-01\",\n\"customer\": {\n\"username\": \"lance.murphy\",\n\"name\": \"Zane Brakus DVM\",\n\"sex\": \"7HcAaPiO\",\n\"address\": \"594 Loida Haven, Gilland, MA 26071\",\n\"mail\": \"Un3fhbvK2rEbenIYdnq\",\n\"birthdate\": \"2023-01-31\"\n}\n}\n

Looks like we have some data now. But we can do better and add some enhancements to it.

"},{"location":"setup/guide/data-source/open-metadata-source/#custom-metadata","title":"Custom metadata","text":"

We can see from the data generated, that it isn't quite what we want. The metadata is not sufficient for us to produce production-like data yet. Let's try to add some enhancements to it.

Let's make the platform field a choice field that can only be a set of certain values and the nested field customer.sex is also from a predefined set of values.

JavaScala
var jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map.of(\"saveMode\", \"overwrite\"))\n.schema(\nmetadata...\n))\n.schema(\nfield().name(\"platform\").oneOf(\"website\", \"mobile\"),\nfield().name(\"customer\").schema(field().name(\"sex\").oneOf(\"M\", \"F\", \"O\"))\n)\n.count(count().records(10));\n
val jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map(\"saveMode\" -> \"overwrite\"))\n.schema(\nmetadata...\n))\n.schema(\nfield.name(\"platform\").oneOf(\"website\", \"mobile\"),\nfield.name(\"customer\").schema(field.name(\"sex\").oneOf(\"M\", \"F\", \"O\"))\n)\n.count(count.records(10))\n

Let's test it out by running it again

./run.sh\n#input class MyAdvancedMetadataSourceJavaPlanRun or MyAdvancedMetadataSourcePlanRun\ncat docker/sample/json/part-00000-*\n
{\n\"comments\": \"vqbPUm\",\n\"creditcard\": \"6304867705548636\",\n\"membership\": \"GZ1xOnpZSUOKN\",\n\"orders\": [\n{\n\"product_id\": \"rgOokDAv\",\n\"price\": 77367,\n\"onsale\": false,\n\"tax\": 61742,\n\"weight\": 87855,\n\"others\": 26857,\n\"vendor\": \"04XHR64ImMr9T\"\n}\n],\n\"platform\": \"mobile\",\n\"preference\": {\n\"key\": \"IB5vNdWka\",\n\"value\": true\n},\n\"shipping_address\": [\n{\n\"name\": \"Isiah Bins\",\n\"street_address\": \"36512 Ross Spurs, Hillhaven, IA 18760\",\n\"city\": \"Averymouth\",\n\"postcode\": \"75818\"\n},\n{\n\"name\": \"Scott Prohaska\",\n\"street_address\": \"26573 Haley Ports, Dariusland, MS 90642\",\n\"city\": \"Ashantimouth\",\n\"postcode\": \"31792\"\n},\n{\n\"name\": \"Rudolf Stamm\",\n\"street_address\": \"Suite 878 0516 Danica Path, New Christiaport, ID 10525\",\n\"city\": \"Doreathaport\",\n\"postcode\": \"62497\"\n}\n],\n\"shipping_date\": \"2023-08-24\",\n\"transaction_date\": \"2023-02-01\",\n\"customer\": {\n\"username\": \"jolie.cremin\",\n\"name\": \"Fay Klein\",\n\"sex\": \"O\",\n\"address\": \"Apt. 174 5084 Volkman Creek, Hillborough, PA 61959\",\n\"mail\": \"BiTmzb7\",\n\"birthdate\": \"2023-04-07\"\n}\n}\n

Great! Now we have the ability to get schema information from an external source, add our own metadata and generate data.

"},{"location":"setup/guide/data-source/open-metadata-source/#data-validation","title":"Data validation","text":"

Another aspect of OpenMetadata that can be leveraged is the definition of data quality rules. These rules can be incorporated into your Data Caterer job as well by enabling data validations via enableGenerateValidations in configuration.

JavaScala
var conf = configuration().enableGeneratePlanAndTasks(true)\n.enableGenerateValidations(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(conf, jsonTask);\n
val conf = configuration.enableGeneratePlanAndTasks(true)\n.enableGenerateValidations(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(conf, jsonTask)\n

Check out the full example under AdvancedOpenMetadataSourcePlanRun in the example repo.

"},{"location":"setup/guide/data-source/solace/","title":"Solace","text":"

Info

Writing data to Solace is a paid feature. Try the free trial here.

Creating a data generator for Solace. You will build a Docker image that will be able to populate data in Solace for the queues/topics you configure.

"},{"location":"setup/guide/data-source/solace/#requirements","title":"Requirements","text":""},{"location":"setup/guide/data-source/solace/#get-started","title":"Get Started","text":"

First, we will clone the data-caterer-example repo which will already have the base project setup required.

git clone git@github.com:pflooky/data-caterer-example.git\n

If you already have a Solace instance running, you can skip to this step.

"},{"location":"setup/guide/data-source/solace/#solace-setup","title":"Solace Setup","text":"

Next, let's make sure you have an instance of Solace up and running in your local environment. This will make it easy for us to iterate and check our changes.

cd docker\ndocker-compose up -d solace\n

Open up localhost:8080 and login with admin:admin and check there is the default VPN like below. Notice there is 2 queues/topics created. If you do not see 2 created, try to run the script found under docker/data/solace/setup_solace.sh and change the host to localhost.

"},{"location":"setup/guide/data-source/solace/#plan-setup","title":"Plan Setup","text":"

Create a new Java or Scala class.

Make sure your class extends PlanRun.

JavaScala
import com.github.pflooky.datacaterer.java.api.PlanRun;\n\npublic class MyAdvancedSolaceJavaPlan extends PlanRun {\n}\n
import com.github.pflooky.datacaterer.api.PlanRun\n\nclass MyAdvancedSolacePlan extends PlanRun {\n}\n

This class defines where we need to define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.

"},{"location":"setup/guide/data-source/solace/#connection-configuration","title":"Connection Configuration","text":"

Within our class, we can start by defining the connection properties to connect to Solace.

JavaScala
var accountTask = solace(\n\"my_solace\",                        //name\n\"smf://host.docker.internal:55554\", //url\nMap.of()                            //optional additional connection options\n);\n

Additional connection options can be found here.

val accountTask = solace(\n\"my_solace\",                        //name\n\"smf://host.docker.internal:55554\", //url\nMap()                               //optional additional connection options\n)\n

Additional connection options can be found here.

"},{"location":"setup/guide/data-source/solace/#schema","title":"Schema","text":"

Let's create a task for inserting data into the rest_test_queue or rest_test_topic that is already created for us from this step.

Trimming the connection details to work with the docker-compose Solace, we have a base Solace connection to define the JNDI destination we will publish to. Let's define each field along with their corresponding data type. You will notice that the text fields do not have a data type defined. This is because the default data type is StringType.

JavaScala
{\nvar solaceTask = solace(\"my_solace\", \"smf://host.docker.internal:55554\")\n.destination(\"/JNDI/Q/rest_test_queue\")\n.schema(\nfield().name(\"value\").sql(\"TO_JSON(content)\"),\n//field().name(\"partition\").type(IntegerType.instance()),   //can define message JMS priority here\nfield().name(\"headers\")                                     //set message properties via headers field\n.type(HeaderType.getType())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n),\nfield().name(\"content\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"year\").type(IntegerType.instance()).min(2021).max(2023),\nfield().name(\"amount\").type(DoubleType.instance()),\nfield().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n),\nfield().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n)\n)\n.count(count().records(10));\n}\n
val solaceTask = solace(\"my_solace\", \"smf://host.docker.internal:55554\")\n.destination(\"/JNDI/Q/rest_test_queue\")\n.schema(\nfield.name(\"value\").sql(\"TO_JSON(content)\"),\n//field.name(\"partition\").`type`(IntegerType),  //can define message JMS priority here\nfield.name(\"headers\")                           //set message properties via headers field\n.`type`(HeaderType.getType)\n.sql(\n\"\"\"ARRAY(\n          |  NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n          |  NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n          |)\"\"\".stripMargin\n),\nfield.name(\"content\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"year\").`type`(IntegerType).min(2021).max(2023),\nfield.name(\"amount\").`type`(DoubleType),\nfield.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n),\nfield.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n),\n).count(count.records(10))\n
"},{"location":"setup/guide/data-source/solace/#fields","title":"Fields","text":"

The schema defined for Solace has a format that needs to be followed as noted above. Specifically, the required fields are:

Whilst, the other fields are optional:

"},{"location":"setup/guide/data-source/solace/#headers","title":"headers","text":"

headers follows a particular pattern that where it is of type HeaderType.getType which behind the scenes, translates toarray<struct<key: string,value: binary>>. To be able to generate data for this data type, we need to use an SQL expression like the one below. You will notice that in thevalue part, it refers to content.account_id where content is another field defined at the top level of the schema. This allows you to reference other values that have already been generated.

JavaScala
field().name(\"headers\")\n.type(HeaderType.getType())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n)\n
field.name(\"headers\")\n.`type`(HeaderType.getType)\n.sql(\n\"\"\"ARRAY(\n      |  NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n      |  NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n      |)\"\"\".stripMargin\n)\n
"},{"location":"setup/guide/data-source/solace/#transactions","title":"transactions","text":"

transactions is an array that contains an inner structure of txn_date and amount. The size of the array generated can be controlled via arrayMinLength and arrayMaxLength.

JavaScala
field().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n
field.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n
"},{"location":"setup/guide/data-source/solace/#details","title":"details","text":"

details is another example of a nested schema structure where it also has a nested structure itself in updated_by. One thing to note here is the first_txn_date field has a reference to the content.transactions array where it will sort the array by txn_date and get the first element.

JavaScala
field().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n)\n
field.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n)\n
"},{"location":"setup/guide/data-source/solace/#additional-configurations","title":"Additional Configurations","text":"

At the end of data generation, a report gets generated that summarises the actions it performed. We can control the output folder of that report via configurations.

JavaScala
var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n
val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n
"},{"location":"setup/guide/data-source/solace/#execute","title":"Execute","text":"

To tell Data Caterer that we want to run with the configurations along with the kafkaTask, we have to call execute.

"},{"location":"setup/guide/data-source/solace/#run","title":"Run","text":"

Now we can run via the script ./run.sh that is in the top level directory of the data-caterer-example to run the class we just created.

./run.sh\n#input class AdvancedSolaceJavaPlanRun or AdvancedSolacePlanRun\n#after completing, check http://localhost:8080 from browser\n

Your output should look like this.

Unfortunately, there is no easy way to see the message content. You can check the message content from your application or service that consumes these messages.

Also check the HTML report, found at docker/sample/report/index.html, that gets generated to get an overview of what was executed. Or view the sample report found here.

"},{"location":"setup/guide/scenario/auto-generate-connection/","title":"Auto Generate From Data Connection","text":"

Info

Auto data generation from data connection is a paid feature. Try the free trial here.

Creating a data generator based on only a data connection to Postgres.

"},{"location":"setup/guide/scenario/auto-generate-connection/#requirements","title":"Requirements","text":""},{"location":"setup/guide/scenario/auto-generate-connection/#get-started","title":"Get Started","text":"

First, we will clone the data-caterer-example repo which will already have the base project setup required.

git clone git@github.com:pflooky/data-caterer-example.git\n
"},{"location":"setup/guide/scenario/auto-generate-connection/#plan-setup","title":"Plan Setup","text":"

Create a new Java or Scala class.

Make sure your class extends PlanRun.

JavaScala
import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedAutomatedJavaPlanRun extends PlanRun {\n{\nvar autoRun = configuration()\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\")  (1)\n.enableGeneratePlanAndTasks(true)                                                 (2)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\")                        (3)\n.enableUniqueCheck(true)                                                          (4)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(autoRun);\n}\n}\n
import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedAutomatedPlanRun extends PlanRun {\n\nval autoRun = configuration\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\")  (1)\n.enableGeneratePlanAndTasks(true)                                                 (2)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\")                        (3)\n.enableUniqueCheck(true)                                                          (4)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(configuration = autoRun)\n}\n

In the above code, we note the following:

  1. Data source configuration to a Postgres data source called my_postgres
  2. We have enabled the flag enableGeneratePlanAndTasks which tells Data Caterer to go to my_postgres and generate data for all the tables found under the database customer (which is defined in the connection string).
  3. The config generatedPlanAndTaskFolderPath defines where the metadata that is gathered from my_postgres should be saved at so that we could re-use it later.
  4. enableUniqueCheck is set to true to ensure that generated data is unique based on primary key or foreign key definitions.

Note

Unique check will only ensure generated data is unique. Any existing data in your data source is not taken into account, so generated data may fail to insert depending on the data source restrictions

"},{"location":"setup/guide/scenario/auto-generate-connection/#postgres-setup","title":"Postgres Setup","text":"

If you don't have your own Postgres up and running, you can set up and run an instance configured in the docker folder via.

cd docker\ndocker-compose up -d postgres\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c '\\dt+ account.*'\n

This will create the tables found under docker/data/sql/postgres/customer.sql. You can change this file to contain your own tables. We can see there are 4 tables created for us, accounts, balances, transactions and mapping.

"},{"location":"setup/guide/scenario/auto-generate-connection/#run","title":"Run","text":"

Let's try run.

cd ..\n./run.sh\n#input class MyAdvancedAutomatedJavaPlanRun or MyAdvancedAutomatedPlanRun\n#after completing\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select * from account.accounts limit 1;'\n

It should look something like this.

   id   | account_number  | account_status | created_by | created_by_fixed_length | customer_id_int | customer_id_smallint | customer_id_bigint |   customer_id_decimal    | customer_id_real | customer_id_double | open_date  |     open_timestamp      | last_opened_time |                                                           payload_bytes\n--------+-----------------+----------------+------------+-------------------------+-----------------+----------------------+--------------------+--------------------------+------------------+--------------------+------------+-------------------------+------------------+------------------------------------------------------------------------------------------------------------------------------------\n 100414 | 5uROOVOUyQUbubN | h3H            | SfA0eZJcTm | CuRw                    |              13 |                   42 |               6041 | 76987.745612542900000000 |         91866.78 |  66400.37433202339 | 2023-03-05 | 2023-08-14 11:33:11.343 | 23:58:01.736     | \\x604d315d4547616e6a233050415373317274736f5e682d516132524f3d23233c37463463322f342d34376d597e665d6b3d395b4238284028622b7d6d2b4f5042\n(1 row)\n

The data that gets inserted will follow the foreign keys that are defined within Postgres and also ensure the insertion order is correct.

Also check the HTML report that gets generated under docker/sample/report/index.html. You can see a summary of what was generated along with other metadata.

You can now look to play around with other tables or data sources and auto generate for them.

"},{"location":"setup/guide/scenario/auto-generate-connection/#additional-topics","title":"Additional Topics","text":""},{"location":"setup/guide/scenario/auto-generate-connection/#learn-from-existing-data","title":"Learn From Existing Data","text":"

If you have any existing data within your data source, Data Caterer will gather metadata about the existing data to help guide it when generating new data. There are configurations that can help tune the metadata analysis found here.

"},{"location":"setup/guide/scenario/auto-generate-connection/#filter-out-schematables","title":"Filter Out Schema/Tables","text":"

As part of your connection definition, you can define any schemas and/or tables your don't want to generate data for. In the example below, it will not generate any data for any tables under the history and audit schemas. Also, any table with the name balances or transactions in any schema will also not have data generated.

JavaScala
var autoRun = configuration()\n.postgres(\n\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\",\nMap.of(\n\"filterOutSchema\", \"history, audit\",\n\"filterOutTable\", \"balances, transactions\")\n)\n)\n
val autoRun = configuration\n.postgres(\n\"my_postgres\",\n\"jdbc:postgresql://host.docker.internal:5432/customer\",\nMap(\n\"filterOutSchema\" -> \"history, audit\",\n\"filterOutTable\" -> \"balances, transactions\")\n)\n)\n
"},{"location":"setup/guide/scenario/auto-generate-connection/#define-record-count","title":"Define record count","text":"

You can control the record count per sub data source via numRecordsPerStep.

JavaScala
var autoRun = configuration()\n...\n.numRecordsPerStep(100)\n\nexecute(autoRun)\n
val autoRun = configuration\n...\n.numRecordsPerStep(100)\n\nexecute(configuration = autoRun)\n
"},{"location":"setup/guide/scenario/batch-and-event/","title":"Generate Batch and Event Data","text":"

Info

Generating event data is a paid feature. Try the free trial here.

Creating a data generator for Kafka topic with matching records in a CSV file.

"},{"location":"setup/guide/scenario/batch-and-event/#requirements","title":"Requirements","text":""},{"location":"setup/guide/scenario/batch-and-event/#get-started","title":"Get Started","text":"

First, we will clone the data-caterer-example repo which will already have the base project setup required.

git clone git@github.com:pflooky/data-caterer-example.git\n
"},{"location":"setup/guide/scenario/batch-and-event/#kafka-setup","title":"Kafka Setup","text":"

If you don't have your own Kafka up and running, you can set up and run an instance configured in the docker folder via.

cd docker\ndocker-compose up -d kafka\ndocker exec docker-kafkaserver-1 kafka-topics --bootstrap-server localhost:9092 --list\n

Let's create a task for inserting data into the account-topic that is already defined underdocker/data/kafka/setup_kafka.sh.

"},{"location":"setup/guide/scenario/batch-and-event/#plan-setup","title":"Plan Setup","text":"

Create a new Java or Scala class.

Make sure your class extends PlanRun.

JavaScala
import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedBatchEventJavaPlanRun extends PlanRun {\n{\nvar kafkaTask = new AdvancedKafkaJavaPlanRun().getKafkaTask();\n}\n}\n
import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedBatchEventPlanRun extends PlanRun {\nval kafkaTask = new AdvancedKafkaPlanRun().kafkaTask\n}\n

We will borrow the Kafka task that is already defined under the class AdvancedKafkaPlanRun or AdvancedKafkaJavaPlanRun. You can go through the Kafka guide here for more details.

"},{"location":"setup/guide/scenario/batch-and-event/#schema","title":"Schema","text":"

Let us set up the corresponding schema for the CSV file where we want to match the values that are generated for the Kafka messages.

JavaScala
var kafkaTask = new AdvancedKafkaJavaPlanRun().getKafkaTask();\n\nvar csvTask = csv(\"my_csv\", \"/opt/app/data/csv/account\")\n.schema(\nfield().name(\"account_number\"),\nfield().name(\"year\"),\nfield().name(\"name\"),\nfield().name(\"payload\")\n);\n
val kafkaTask = new AdvancedKafkaPlanRun().kafkaTask\n\nval csvTask = csv(\"my_csv\", \"/opt/app/data/csv/account\")\n.schema(\nfield.name(\"account_number\"),\nfield.name(\"year\"),\nfield.name(\"name\"),\nfield.name(\"payload\")\n)\n

This is a simple schema where we want to use the values and metadata that is already defined in the kafkaTask to determine what the data will look like for the CSV file. Even if we defined some metadata here, it would be overridden when we define our foreign key relationships.

"},{"location":"setup/guide/scenario/batch-and-event/#foreign-keys","title":"Foreign Keys","text":"

From the above CSV schema, we see note the following against the Kafka schema:

Our foreign keys are therefore defined like below. Order is important when defining the list of columns. The index needs to match with the corresponding column in the other data source.

JavaScala
var myPlan = plan().addForeignKeyRelationship(\nkafkaTask, List.of(\"key\", \"tmp_year\", \"tmp_name\", \"value\"),\nList.of(Map.entry(csvTask, List.of(\"account_number\", \"year\", \"name\", \"payload\")))\n);\n\nvar conf = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(myPlan, conf, kafkaTask, csvTask);\n
val myPlan = plan.addForeignKeyRelationship(\nkafkaTask, List(\"key\", \"tmp_year\", \"tmp_name\", \"value\"),\nList(csvTask -> List(\"account_number\", \"year\", \"name\", \"payload\"))\n)\n\nval conf = configuration.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(myPlan, conf, kafkaTask, csvTask)\n
"},{"location":"setup/guide/scenario/batch-and-event/#run","title":"Run","text":"

Let's try run.

cd ..\n./run.sh\n#input class MyAdvancedBatchEventJavaPlanRun or MyAdvancedBatchEventPlanRun\n#after completing\ndocker exec docker-kafkaserver-1 kafka-console-consumer --bootstrap-server localhost:9092 --topic account-topic --from-beginning\n

It should look something like this.

{\"account_id\":\"ACC03093143\",\"year\":2023,\"amount\":87990.37196728592,\"details\":{\"name\":\"Nadine Heidenreich Jr.\",\"first_txn_date\":\"2021-11-09\",\"updated_by\":{\"user\":\"YfEyJCe8ohrl0j IfyT\",\"time\":\"2022-09-26T20:47:53.404Z\"}},\"transactions\":[{\"txn_date\":\"2021-11-09\",\"amount\":97073.7914706189}]}\n{\"account_id\":\"ACC08764544\",\"year\":2021,\"amount\":28675.58758765888,\"details\":{\"name\":\"Delila Beer\",\"first_txn_date\":\"2021-05-19\",\"updated_by\":{\"user\":\"IzB5ksXu\",\"time\":\"2023-01-26T20:47:26.389Z\"}},\"transactions\":[{\"txn_date\":\"2021-10-01\",\"amount\":80995.23818711648},{\"txn_date\":\"2021-05-19\",\"amount\":92572.40049217848},{\"txn_date\":\"2021-12-11\",\"amount\":99398.79832225188}]}\n{\"account_id\":\"ACC62505420\",\"year\":2023,\"amount\":96125.3125884202,\"details\":{\"name\":\"Shawn Goodwin\",\"updated_by\":{\"user\":\"F3dqIvYp2pFtena4\",\"time\":\"2023-02-11T04:38:29.832Z\"}},\"transactions\":[]}\n

Let's also check if there is a corresponding record in the CSV file.

$ cat docker/sample/csv/account/part-0000* | grep ACC03093143\nACC03093143,2023,Nadine Heidenreich Jr.,\"{\\\"account_id\\\":\\\"ACC03093143\\\",\\\"year\\\":2023,\\\"amount\\\":87990.37196728592,\\\"details\\\":{\\\"name\\\":\\\"Nadine Heidenreich Jr.\\\",\\\"first_txn_date\\\":\\\"2021-11-09\\\",\\\"updated_by\\\":{\\\"user\\\":\\\"YfEyJCe8ohrl0j IfyT\\\",\\\"time\\\":\\\"2022-09-26T20:47:53.404Z\\\"}},\\\"transactions\\\":[{\\\"txn_date\\\":\\\"2021-11-09\\\",\\\"amount\\\":97073.7914706189}]}\"\n

Great! The account, year, name and payload look to all match up.

"},{"location":"setup/guide/scenario/batch-and-event/#additional-topics","title":"Additional Topics","text":""},{"location":"setup/guide/scenario/batch-and-event/#order-of-execution","title":"Order of execution","text":"

You may notice that the events are generated first, then the CSV file. This is because as part of the execute function, we passed in the kafkaTask first, before the csvTask. You can change the order of execution by passing in csvTask before kafkaTask into the execute function.

"},{"location":"setup/guide/scenario/delete-generated-data/","title":"Delete Generated Data","text":"

Info

Delete generated data is a paid feature. Try the free trial here.

Creating a data generator for Postgres and delete the generated data after using it.

"},{"location":"setup/guide/scenario/delete-generated-data/#requirements","title":"Requirements","text":""},{"location":"setup/guide/scenario/delete-generated-data/#get-started","title":"Get Started","text":"

First, we will clone the data-caterer-example repo which will already have the base project setup required.

git clone git@github.com:pflooky/data-caterer-example.git\n
"},{"location":"setup/guide/scenario/delete-generated-data/#plan-setup","title":"Plan Setup","text":"

Create a new Java or Scala class.

Make sure your class extends PlanRun.

JavaScala
import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedDeleteJavaPlanRun extends PlanRun {\n{\nvar autoRun = configuration()\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\")  (1)\n.enableGeneratePlanAndTasks(true)                                                 (2)\n.enableRecordTracking(true)                                                       (3)\n.enableDeleteGeneratedRecords(false)                                              (4)\n.enableUniqueCheck(true)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\")                        (5)\n.recordTrackingFolderPath(\"/opt/app/data/recordTracking\")                         (6)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(autoRun);\n}\n}\n
import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedDeletePlanRun extends PlanRun {\n\nval autoRun = configuration\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\")  (1)\n.enableGeneratePlanAndTasks(true)                                                 (2)\n.enableRecordTracking(true)                                                       (3)\n.enableDeleteGeneratedRecords(false)                                              (4)\n.enableUniqueCheck(true)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\")                        (5)\n.recordTrackingFolderPath(\"/opt/app/data/recordTracking\")                         (6)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(configuration = autoRun)\n}\n

In the above code we note the following:

  1. We have defined a Postgres connection called my_postgres
  2. enableGeneratePlanAndTasks is enabled to auto generate data for all tables under customer database
  3. enableRecordTracking is enabled to ensure that all generated records are tracked. This will get used when we want to delete data afterwards
  4. enableDeleteGeneratedRecords is disabled for now. We want to see the generated data first and delete sometime after
  5. generatedPlanAndTaskFolderPath is the folder path where we saved the metadata we have gathered from my_postgres
  6. recordTrackingFolderPath is the folder path where record tracking is maintained. We need to persist this data to ensure it is still available when we want to delete data
"},{"location":"setup/guide/scenario/delete-generated-data/#postgres-setup","title":"Postgres Setup","text":"

If you don't have your own Postgres up and running, you can set up and run an instance configured in the docker folder via.

cd docker\ndocker-compose up -d postgres\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c '\\dt+ account.*'\n

This will create the tables found under docker/data/sql/postgres/customer.sql. You can change this file to contain your own tables. We can see there are 4 tables created for us, accounts, balances, transactions and mapping.

"},{"location":"setup/guide/scenario/delete-generated-data/#run","title":"Run","text":"

Let's try run.

cd ..\n./run.sh\n#input class MyAdvancedDeleteJavaPlanRun or MyAdvancedDeletePlanRun\n#after completing\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select * from account.accounts limit 1'\n

It should look something like this.

   id   | account_number  | account_status | created_by | created_by_fixed_length | customer_id_int | customer_id_smallint | customer_id_bigint |   customer_id_decimal    | customer_id_real | customer_id_double | open_date  |     open_timestamp      | last_opened_time |                                                           payload_bytes\n--------+-----------------+----------------+------------+-------------------------+-----------------+----------------------+--------------------+--------------------------+------------------+--------------------+------------+-------------------------+------------------+------------------------------------------------------------------------------------------------------------------------------------\n 100414 | 5uROOVOUyQUbubN | h3H            | SfA0eZJcTm | CuRw                    |              13 |                   42 |               6041 | 76987.745612542900000000 |         91866.78 |  66400.37433202339 | 2023-03-05 | 2023-08-14 11:33:11.343 | 23:58:01.736     | \\x604d315d4547616e6a233050415373317274736f5e682d516132524f3d23233c37463463322f342d34376d597e665d6b3d395b4238284028622b7d6d2b4f5042\n(1 row)\n

The data that gets inserted will follow the foreign keys that are defined within Postgres and also ensure the insertion order is correct.

Check the number of records via:

docker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select count(1) from account.accounts'\n#open report under docker/sample/report/index.html\n
"},{"location":"setup/guide/scenario/delete-generated-data/#delete","title":"Delete","text":"

We are now at a stage where we want to delete the data that was generated. All we need to do is flip two flags.

.enableDeleteGeneratedRecords(true)\n.enableGenerateData(false)  //we need to explicitly disable generating data\n

Enable delete generated records and disable generating data.

Before we run again, let us insert a record manually to see if that data will survive after running the job to delete the generated data.

docker exec docker-postgresserver-1 psql -Upostgres -d customer -c \"insert into account.accounts (account_number) values ('my_account_number')\"\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c \"select count(1) from account.accounts\"\n

We now should have 1001 records in our account.accounts table. Let's delete the generated data now.

./run.sh\n#input class MyAdvancedDeleteJavaPlanRun or MyAdvancedDeletePlanRun\n#after completing\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select * from account.accounts limit 1'\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select count(1) from account.accounts'\n

You should see that only 1 record is left, the one that we manually inserted. Great, now we can generate data reliably and also be able to clean it up.

"},{"location":"setup/guide/scenario/delete-generated-data/#additional-topics","title":"Additional Topics","text":""},{"location":"setup/guide/scenario/delete-generated-data/#one-class-for-generating-another-for-deleting","title":"One class for generating, another for deleting?","text":"

Yes, this is possible. There are two requirements: - the connection names used need to be the same across both classes - recordTrackingFolderPath needs to be set to the same value

"},{"location":"setup/guide/scenario/delete-generated-data/#define-record-count","title":"Define record count","text":"

You can control the record count per sub data source via numRecordsPerStep.

JavaScala
var autoRun = configuration()\n...\n.numRecordsPerStep(100)\n\nexecute(autoRun)\n
val autoRun = configuration\n...\n.numRecordsPerStep(100)\n\nexecute(configuration = autoRun)\n
"},{"location":"setup/guide/scenario/first-data-generation/","title":"First Data Generation","text":"

Creating a data generator for a CSV file.

"},{"location":"setup/guide/scenario/first-data-generation/#requirements","title":"Requirements","text":""},{"location":"setup/guide/scenario/first-data-generation/#get-started","title":"Get Started","text":"

First, we will clone the data-caterer-example repo which will already have the base project setup required.

git clone git@github.com:pflooky/data-caterer-example.git\n
"},{"location":"setup/guide/scenario/first-data-generation/#plan-setup","title":"Plan Setup","text":"

Create a new Java or Scala class.

Make sure your class extends PlanRun.

JavaScala
import com.github.pflooky.datacaterer.java.api.PlanRun;\n\npublic class MyCsvJavaPlan extends PlanRun {\n}\n
import com.github.pflooky.datacaterer.api.PlanRun\n\nclass MyCsvPlan extends PlanRun {\n}\n

This class defines where we need to define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.

"},{"location":"setup/guide/scenario/first-data-generation/#connection-configuration","title":"Connection Configuration","text":"

When dealing with CSV files, we need to define a path for our generated CSV files to be saved at, along with any other high level configurations.

JavaScala
csv(\n\"customer_accounts\",              //name\n\"/opt/app/data/customer/account\", //path\nMap.of(\"header\", \"true\")          //optional additional options\n)\n

Other additional options for CSV can be found here

csv(\n\"customer_accounts\",              //name\n\"/opt/app/data/customer/account\", //path\nMap(\"header\" -> \"true\")           //optional additional options\n)\n

Other additional options for CSV can be found here

"},{"location":"setup/guide/scenario/first-data-generation/#schema","title":"Schema","text":"

Our CSV file that we generate should adhere to a defined schema where we can also define data types.

Let's define each field along with their corresponding data type. You will notice that the string fields do not have a data type defined. This is because the default data type is StringType.

JavaScala
var accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"balance\").type(DoubleType.instance()),\nfield().name(\"created_by\"),\nfield().name(\"name\"),\nfield().name(\"open_time\").type(TimestampType.instance()),\nfield().name(\"status\")\n);\n
val accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"balance\").`type`(DoubleType),\nfield.name(\"created_by\"),\nfield.name(\"name\"),\nfield.name(\"open_time\").`type`(TimestampType),\nfield.name(\"status\")\n)\n
"},{"location":"setup/guide/scenario/first-data-generation/#field-metadata","title":"Field Metadata","text":"

We could stop here and generate random data for the accounts table. But wouldn't it be more useful if we produced data that is closer to the structure of the data that would come in production? We can do this by defining various metadata attributes that add guidelines that the data generator will understand when generating data.

"},{"location":"setup/guide/scenario/first-data-generation/#account_id","title":"account_id","text":" JavaScala
field().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\n
field.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\n
"},{"location":"setup/guide/scenario/first-data-generation/#balance","title":"balance","text":" JavaScala
field().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\n
field.name(\"balance\").`type`(DoubleType).min(1).max(1000),\n
"},{"location":"setup/guide/scenario/first-data-generation/#name","title":"name","text":" JavaScala
field().name(\"name\").expression(\"#{Name.name}\"),\n
field.name(\"name\").expression(\"#{Name.name}\"),\n
"},{"location":"setup/guide/scenario/first-data-generation/#open_time","title":"open_time","text":" JavaScala
field().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\n
field.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\n
"},{"location":"setup/guide/scenario/first-data-generation/#status","title":"status","text":" JavaScala
field().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n
field.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n
"},{"location":"setup/guide/scenario/first-data-generation/#created_by","title":"created_by","text":" JavaScala
field().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n
field.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n

Putting it all the fields together, our class should now look like this.

JavaScala
var accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n
val accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield.name(\"balance\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n
"},{"location":"setup/guide/scenario/first-data-generation/#record-count","title":"Record Count","text":"

We only want to generate 100 records, so that we can see what the output looks like. This is controlled at the accountTask level like below. If you want to generate more records, set it to the value you want.

JavaScala
var accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().records(100));\n
val accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -> \"true\"))\n.schema(\n...\n)\n.count(count.records(100))\n
"},{"location":"setup/guide/scenario/first-data-generation/#additional-configurations","title":"Additional Configurations","text":"

At the end of data generation, a report gets generated that summarises the actions it performed. We can control the output folder of that report via configurations. We will also enable the unique check to ensure any unique fields will have unique values generated.

JavaScala
var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n
val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n
"},{"location":"setup/guide/scenario/first-data-generation/#execute","title":"Execute","text":"

To tell Data Caterer that we want to run with the configurations along with the accountTask, we have to call execute . So our full plan run will look like this.

JavaScala
public class MyCsvJavaPlan extends PlanRun {\n{\nvar accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nexecute(config, accountTask);\n}\n}\n
class MyCsvPlan extends PlanRun {\n\nval accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield.name(\"balance\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n\nexecute(config, accountTask)\n}\n
"},{"location":"setup/guide/scenario/first-data-generation/#run","title":"Run","text":"

Now we can run via the script ./run.sh that is in the top level directory of the data-caterer-example to run the class we just created.

./run.sh\n#input class MyCsvJavaPlan or MyCsvPlan\n#after completing\nhead docker/sample/customer/account/part-00000*\n

Your output should look like this.

account_id,balance,created_by,name,open_time,status\nACC06192462,853.9843359645766,eod,Hoyt Kertzmann MD,2023-07-22T11:17:01.713Z,closed\nACC15350419,632.5969895326234,eod,Dr. Claude White,2022-12-13T21:57:56.840Z,open\nACC25134369,592.0958847218986,eod,Fabian Rolfson,2023-04-26T04:54:41.068Z,open\nACC48021786,656.6413439322964,eod,Dewayne Stroman,2023-05-17T06:31:27.603Z,open\nACC26705211,447.2850352884595,event,Garrett Funk,2023-07-14T03:50:22.746Z,pending\nACC03150585,750.4568929015996,event,Natisha Reichel,2023-04-11T11:13:10.080Z,suspended\nACC29834210,686.4257811608622,event,Gisele Ondricka,2022-11-15T22:09:41.172Z,suspended\nACC39373863,583.5110618128994,event,Thaddeus Ortiz,2022-09-30T06:33:57.193Z,suspended\nACC39405798,989.2623959059525,eod,Shelby Reinger,2022-10-23T17:29:17.564Z,open\n

Also check the HTML report, found at docker/sample/report/index.html, that gets generated to get an overview of what was executed.

"},{"location":"setup/guide/scenario/first-data-generation/#join-with-another-csv","title":"Join With Another CSV","text":"

Now that we have generated some accounts, let's also try to generate a set of transactions for those accounts in CSV format as well. The transactions could be in any other format, but to keep this simple, we will continue using CSV.

We can define our schema the same way along with any additional metadata.

JavaScala
var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(100),\nfield().name(\"time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"date\").type(DateType.instance()).sql(\"DATE(time)\")\n);\n
val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(100),\nfield.name(\"time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"date\").`type`(DateType).sql(\"DATE(time)\")\n)\n
"},{"location":"setup/guide/scenario/first-data-generation/#records-per-column","title":"Records Per Column","text":"

Usually, for a given account_id, full_name, there should be multiple records for it as we want to simulate a customer having multiple transactions. We can achieve this through defining the number of records to generate in the count function.

JavaScala
var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumn(5, \"account_id\", \"full_name\"));\n
val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumn(5, \"account_id\", \"full_name\"))\n
"},{"location":"setup/guide/scenario/first-data-generation/#random-records-per-column","title":"Random Records Per Column","text":"

Above, you will notice that we are generating 5 records per account_id, full_name. This is okay but still not quite reflective of the real world. Sometimes, people have accounts with no transactions in them, or they could have many. We can accommodate for this via defining a random number of records per column.

JavaScala
var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumnGenerator(generator().min(0).max(5), \"account_id\", \"full_name\"));\n
val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumnGenerator(generator.min(0).max(5), \"account_id\", \"full_name\"))\n

Here we set the minimum number of records per column to be 0 and the maximum to 5.

"},{"location":"setup/guide/scenario/first-data-generation/#foreign-key","title":"Foreign Key","text":"

In this scenario, we want to match the account_id in account to match the same column values in transaction. We also want to match name in account to full_name in transaction. This can be done via plan configuration like below.

JavaScala
var myPlan = plan().addForeignKeyRelationship(\naccountTask, List.of(\"account_id\", \"name\"), //the task and columns we want linked\nList.of(Map.entry(transactionTask, List.of(\"account_id\", \"full_name\"))) //list of other tasks and their respective column names we want matched\n);\n
val myPlan = plan.addForeignKeyRelationship(\naccountTask, List(\"account_id\", \"name\"),  //the task and columns we want linked\nList(transactionTask -> List(\"account_id\", \"full_name\"))  //list of other tasks and their respective column names we want matched\n)\n

Now, stitching it all together for the execute function, our final plan should look like this.

JavaScala
public class MyCsvJavaPlan extends PlanRun {\n{\nvar accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n.count(count().records(100));\n\nvar transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(100),\nfield().name(\"time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"date\").type(DateType.instance()).sql(\"DATE(time)\")\n)\n.count(count().recordsPerColumnGenerator(generator().min(0).max(5), \"account_id\", \"full_name\"));\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nvar myPlan = plan().addForeignKeyRelationship(\naccountTask, List.of(\"account_id\", \"name\"),\nList.of(Map.entry(transactionTask, List.of(\"account_id\", \"full_name\")))\n);\n\nexecute(myPlan, config, accountTask, transactionTask);\n}\n}\n
class MyCsvPlan extends PlanRun {\n\nval accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield.name(\"balance\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n.count(count.records(100))\n\nval transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(100),\nfield.name(\"time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"date\").`type`(DateType).sql(\"DATE(time)\")\n)\n.count(count.recordsPerColumnGenerator(generator.min(0).max(5), \"account_id\", \"full_name\"))\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n\nval myPlan = plan.addForeignKeyRelationship(\naccountTask, List(\"account_id\", \"name\"),\nList(transactionTask -> List(\"account_id\", \"full_name\"))\n)\n\nexecute(myPlan, config, accountTask, transactionTask)\n}\n

Let's try run again.

#clean up old data\nrm -rf docker/sample/customer/account\n./run.sh\n#input class MyCsvJavaPlan or MyCsvPlan\n#after completing, let's pick an account and check the transactions for that account\naccount=$(tail -1 docker/sample/customer/account/part-00000* | awk -F \",\" '{print $1 \",\" $4}')\necho $account\ncat docker/sample/customer/transaction/part-00000* | grep $account\n

It should look something like this.

ACC29117767,Willodean Sauer\nACC29117767,Willodean Sauer,84.99145871948083,2023-05-14T09:55:51.439Z,2023-05-14\nACC29117767,Willodean Sauer,58.89345733567232,2022-11-22T07:38:20.143Z,2022-11-22\n

Congratulations! You have now made a data generator that has simulated a real world data scenario. You can check the DocumentationJavaPlanRun.java or DocumentationPlanRun.scala files as well to check that your plan is the same.

We can now look to consume this CSV data from a job or service. Usually, once we have consumed the data, we would also want to check and validate that our consumer has correctly ingested the data.

"},{"location":"setup/guide/scenario/first-data-generation/#validate","title":"Validate","text":"

In this scenario, our consumer will read in the CSV file, do some transformations, and then save the data to Postgres. Let's try to configure data validations for the data that gets pushed into Postgres.

"},{"location":"setup/guide/scenario/first-data-generation/#postgres-setup","title":"Postgres setup","text":"

First, we define our connection properties for Postgres. You can check out the full options available here.

JavaScala
var postgresValidateTask = postgres(\n\"my_postgres\",                                          //connection name\n\"jdbc:postgresql://host.docker.internal:5432/customer\", //url\n\"postgres\",                                             //username\n\"password\"                                              //password\n).table(\"account\", \"transactions\");\n
val postgresValidateTask = postgres(\n\"my_postgres\",                                          //connection name\n\"jdbc:postgresql://host.docker.internal:5432/customer\", //url\n\"postgres\",                                             //username\n\"password\"                                              //password\n).table(\"account\", \"transactions\")\n

We can connect and access the data inside the table account.transactions. Now to define our data validations.

"},{"location":"setup/guide/scenario/first-data-generation/#validations","title":"Validations","text":"

For full information about validation options and configurations, check here. Below, we have an example that should give you a good understanding of what validations are possible.

JavaScala
var postgresValidateTask = postgres(...)\n.table(\"account\", \"transactions\")\n.validations(\nvalidation().col(\"account_id\").isNotNull(),\nvalidation().col(\"name\").matches(\"[A-Z][a-z]+ [A-Z][a-z]+\").errorThreshold(0.2).description(\"Some names have different formats\"),\nvalidation().col(\"balance\").greaterThanOrEqual(0).errorThreshold(10).description(\"Account can have negative balance if overdraft\"),\nvalidation().expr(\"CASE WHEN status == 'closed' THEN isNotNull(close_date) ELSE isNull(close_date) END\"),\nvalidation().unique(\"account_id\", \"name\"),\nvalidation().groupBy(\"account_id\", \"name\").max(\"login_retry\").lessThan(10)\n);\n
val postgresValidateTask = postgres(...)\n.table(\"account\", \"transactions\")\n.validations(\nvalidation.col(\"account_id\").isNotNull,\nvalidation.col(\"name\").matches(\"[A-Z][a-z]+ [A-Z][a-z]+\").errorThreshold(0.2).description(\"Some names have different formats\"),\nvalidation.col(\"balance\").greaterThanOrEqual(0).errorThreshold(10).description(\"Account can have negative balance if overdraft\"),\nvalidation.expr(\"CASE WHEN status == 'closed' THEN isNotNull(close_date) ELSE isNull(close_date) END\"),\nvalidation.unique(\"account_id\", \"name\"),\nvalidation.groupBy(\"account_id\", \"name\").max(\"login_retry\").lessThan(10)\n)\n
"},{"location":"setup/guide/scenario/first-data-generation/#name_1","title":"name","text":"

For all values in the name column, we check if they match the regex [A-Z][a-z]+ [A-Z][a-z]+. As we know in the real world, names do not always follow the same pattern, so we allow for an errorThreshold before marking the validation as failed. Here, we define the errorThreshold to be 0.2, which means, if the error percentage is greater than 20%, then fail the validation. We also append on a helpful description so other developers/users can understand the context of the validation.

"},{"location":"setup/guide/scenario/first-data-generation/#balance_1","title":"balance","text":"

We check that all balance values are greater than or equal to 0. This time, we have a slightly different errorThreshold as it is set to 10, which means, if the number of errors is greater than 10, then fail the validation.

"},{"location":"setup/guide/scenario/first-data-generation/#expr","title":"expr","text":"

Sometimes, we may need to include the values of multiple columns to validate a certain condition. This is where we can use expr to define a SQL expression that returns a boolean. In this scenario, we are checking if the status column has value closed, then the close_date should be not null, otherwise, close_date is null.

"},{"location":"setup/guide/scenario/first-data-generation/#unique","title":"unique","text":"

We check whether the combination of account_id and name are unique within the dataset. You can define one or more columns for unique validations.

"},{"location":"setup/guide/scenario/first-data-generation/#groupby","title":"groupBy","text":"

There may be some business rule that states the number of login_retry should be less than 10 for each account. We can check this via a group by validation where we group by the account_id, name, take the maximum value for login_retry per account_id,name combination, then check if it is less than 10.

You can now look to play around with other configurations or data sources to meet your needs. Also, make sure to explore the docs further as it can guide you on what can be configured.

"},{"location":"setup/guide/scenario/records-per-column/","title":"Multiple Records Per Column","text":"

Creating a data generator for a CSV file where there are multiple records per column values.

"},{"location":"setup/guide/scenario/records-per-column/#requirements","title":"Requirements","text":""},{"location":"setup/guide/scenario/records-per-column/#get-started","title":"Get Started","text":"

First, we will clone the data-caterer-example repo which will already have the base project setup required.

git clone git@github.com:pflooky/data-caterer-example.git\n
"},{"location":"setup/guide/scenario/records-per-column/#plan-setup","title":"Plan Setup","text":"

Create a new Java or Scala class.

Make sure your class extends PlanRun.

JavaScala
import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyMultipleRecordsPerColJavaPlan extends PlanRun {\n{\nvar transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(100),\nfield().name(\"time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"date\").type(DateType.instance()).sql(\"DATE(time)\")\n)\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nexecute(config, transactionTask);\n}\n}\n
import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyMultipleRecordsPerColPlan extends PlanRun {\n\nval transactionTask: ConnectionTaskBuilder[FileBuilder] = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"), field.name(\"full_name\").expression(\"#{Name.name}\"), field.name(\"amount\").`type`(DoubleType.instance).min(1).max(100),\nfield.name(\"time\").`type`(TimestampType.instance).min(java.sql.Date.valueOf(\"2022-01-01\")), field.name(\"date\").`type`(DateType.instance).sql(\"DATE(time)\")\n)\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(config, transactionTask)\n}\n
"},{"location":"setup/guide/scenario/records-per-column/#record-count","title":"Record Count","text":"

By default, tasks will generate 1000 records. You can alter this value via the count configuration which can be applied to individual tasks. For example, in Scala, csv(...).count(count.records(100)) to generate only 100 records.

"},{"location":"setup/guide/scenario/records-per-column/#records-per-column","title":"Records Per Column","text":"

In this scenario, for a given account_id, full_name, there should be multiple records for it as we want to simulate a customer having multiple transactions. We can achieve this through defining the number of records to generate in the count function.

JavaScala
var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumn(5, \"account_id\", \"full_name\"));\n
val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumn(5, \"account_id\", \"full_name\"))\n

This will generate 1000 * 5 = 5000 records as the default number of records is set (1000) and per account_id, full_name from the initial 1000 records, 5 records will be generated.

"},{"location":"setup/guide/scenario/records-per-column/#random-records-per-column","title":"Random Records Per Column","text":"

Generating 5 records per column is okay but still not quite reflective of the real world. Sometimes, people have accounts with no transactions in them, or they could have many. We can accommodate for this via defining a random number of records per column.

JavaScala
var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumnGenerator(generator().min(0).max(5), \"account_id\", \"full_name\"));\n
val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumnGenerator(generator.min(0).max(5), \"account_id\", \"full_name\"))\n

Here we set the minimum number of records per column to be 0 and the maximum to 5. This will follow a uniform distribution so the average number of records per account is 2.5. We could also define other metadata, just like we did with fields, when defining the generator. For example, we could set standardDeviation and mean for the number of records generated per column to follow a normal distribution.

"},{"location":"setup/guide/scenario/records-per-column/#run","title":"Run","text":"

Let's try run.

#clean up old data\nrm -rf docker/sample/customer/account\n./run.sh\n#input class MyMultipleRecordsPerColJavaPlan or MyMultipleRecordsPerColPlan\n#after completing\nhead docker/sample/customer/transaction/part-00000*\n

It should look something like this.

ACC29117767,Willodean Sauer\nACC29117767,Willodean Sauer,84.99145871948083,2023-05-14T09:55:51.439Z,2023-05-14\nACC29117767,Willodean Sauer,58.89345733567232,2022-11-22T07:38:20.143Z,2022-11-22\n

You can now look to play around with other count configurations found here.

"},{"location":"setup/validation/basic-validation/","title":"Basic Validations","text":"

Run validations on a column to ensure the values adhere to your requirement. Can be set to complex validation logic via SQL expression as well if needed (see here).

"},{"location":"setup/validation/basic-validation/#equal","title":"Equal","text":"

Ensure all data in column is equal to certain value. Value can be of any data type.

JavaScalaYAML
validation().col(\"year\").isEqual(2021)\n
validation.col(\"year\").isEqual(2021)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"year == 2021\"\n
"},{"location":"setup/validation/basic-validation/#not-equal","title":"Not Equal","text":"

Ensure all data in column is not equal to certain value. Value can be of any data type.

JavaScalaYAML
validation().col(\"year\").isNotEqual(2021)\n
validation.col(\"year\").isNotEqual(2021)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"year != 2021\"\n
"},{"location":"setup/validation/basic-validation/#null","title":"Null","text":"

Ensure all data in column is null.

JavaScalaYAML
validation().col(\"year\").isNull()\n
validation.col(\"year\").isNull\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"ISNULL(year)\"\n
"},{"location":"setup/validation/basic-validation/#not-null","title":"Not Null","text":"

Ensure all data in column is not null.

JavaScalaYAML
validation().col(\"year\").isNotNull()\n
validation.col(\"year\").isNotNull\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"ISNOTNULL(year)\"\n
"},{"location":"setup/validation/basic-validation/#contains","title":"Contains","text":"

Ensure all data in column is contains certain string. Column has to have type string.

JavaScalaYAML
validation().col(\"name\").contains(\"peter\")\n
validation.col(\"name\").contains(\"peter\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"CONTAINS(name, 'peter')\"\n
"},{"location":"setup/validation/basic-validation/#not-contains","title":"Not Contains","text":"

Ensure all data in column does not contain certain string. Column has to have type string.

JavaScalaYAML
validation().col(\"name\").notContains(\"peter\")\n
validation.col(\"name\").notContains(\"peter\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!CONTAINS(name, 'peter')\"\n
"},{"location":"setup/validation/basic-validation/#unqiue","title":"Unqiue","text":"

Ensure all data in column is unique.

JavaScalaYAML
validation().unique(\"account_id\", \"name\")\n
validation.unique(\"account_id\", \"name\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- unique: [\"account_id\", \"name\"]\n
"},{"location":"setup/validation/basic-validation/#less-than","title":"Less Than","text":"

Ensure all data in column is less than certain value.

JavaScalaYAML
validation().col(\"amount\").lessThan(100)\n
validation.col(\"amount\").lessThan(100)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount < 100\"\n
"},{"location":"setup/validation/basic-validation/#less-than-or-equal","title":"Less Than Or Equal","text":"

Ensure all data in column is less than or equal to certain value.

JavaScalaYAML
validation().col(\"amount\").lessThanOrEqual(100)\n
validation.col(\"amount\").lessThanOrEqual(100)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount <= 100\"\n
"},{"location":"setup/validation/basic-validation/#greater-than","title":"Greater Than","text":"

Ensure all data in column is greater than certain value.

JavaScalaYAML
validation().col(\"amount\").greaterThan(100)\n
validation.col(\"amount\").greaterThan(100)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount > 100\"\n
"},{"location":"setup/validation/basic-validation/#greater-than-or-equal","title":"Greater Than Or Equal","text":"

Ensure all data in column is greater than or equal to certain value.

JavaScalaYAML
validation().col(\"amount\").greaterThanOrEqual(100)\n
validation.col(\"amount\").greaterThanOrEqual(100)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount >= 100\"\n
"},{"location":"setup/validation/basic-validation/#between","title":"Between","text":"

Ensure all data in column is between two values.

JavaScalaYAML
validation().col(\"amount\").between(100, 200)\n
validation.col(\"amount\").between(100, 200)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount BETWEEN 100 AND 200\"\n
"},{"location":"setup/validation/basic-validation/#not-between","title":"Not Between","text":"

Ensure all data in column is not between two values.

JavaScalaYAML
validation().col(\"amount\").notBetween(100, 200)\n
validation.col(\"amount\").notBetween(100, 200)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount NOT BETWEEN 100 AND 200\"\n
"},{"location":"setup/validation/basic-validation/#in","title":"In","text":"

Ensure all data in column is in set of defined values.

JavaScalaYAML
validation().col(\"status\").in(\"open\", \"closed\")\n
validation.col(\"status\").in(\"open\", \"closed\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"status IN ('open', 'closed')\"\n
"},{"location":"setup/validation/basic-validation/#matches","title":"Matches","text":"

Ensure all data in column matches certain regex expression.

JavaScalaYAML
validation().col(\"account_id\").matches(\"ACC[0-9]{8}\")\n
validation.col(\"account_id\").matches(\"ACC[0-9]{8}\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"REGEXP(account_id, ACC[0-9]{8})\"\n
"},{"location":"setup/validation/basic-validation/#not-matches","title":"Not Matches","text":"

Ensure all data in column does not match certain regex expression.

JavaScalaYAML
validation().col(\"account_id\").notMatches(\"^acc.*\")\n
validation.col(\"account_id\").notMatches(\"^acc.*\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!REGEXP(account_id, '^acc.*')\"\n
"},{"location":"setup/validation/basic-validation/#starts-with","title":"Starts With","text":"

Ensure all data in column starts with certain string. Column has to have type string.

JavaScalaYAML
validation().col(\"account_id\").startsWith(\"ACC\")\n
validation.col(\"account_id\").startsWith(\"ACC\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"STARTSWITH(account_id, 'ACC')\"\n
"},{"location":"setup/validation/basic-validation/#not-starts-with","title":"Not Starts With","text":"

Ensure all data in column does not start with certain string. Column has to have type string.

JavaScalaYAML
validation().col(\"account_id\").notStartsWith(\"ACC\")\n
validation.col(\"account_id\").notStartsWith(\"ACC\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!STARTSWITH(account_id, 'ACC')\"\n
"},{"location":"setup/validation/basic-validation/#ends-with","title":"Ends With","text":"

Ensure all data in column ends with certain string. Column has to have type string.

JavaScalaYAML
validation().col(\"account_id\").endsWith(\"ACC\")\n
validation.col(\"account_id\").endsWith(\"ACC\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"ENDWITH(account_id, 'ACC')\"\n
"},{"location":"setup/validation/basic-validation/#not-ends-with","title":"Not Ends With","text":"

Ensure all data in column does not end with certain string. Column has to have type string.

JavaScalaYAML
validation().col(\"account_id\").notEndsWith(\"ACC\")\n
validation.col(\"account_id\").notEndsWith(\"ACC\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!ENDWITH(account_id, 'ACC')\"\n
"},{"location":"setup/validation/basic-validation/#size","title":"Size","text":"

Ensure all data in column has certain size. Column has to have type array or map.

JavaScalaYAML
validation().col(\"transactions\").size(5)\n
validation.col(\"transactions\").size(5)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions, 5)\"\n
"},{"location":"setup/validation/basic-validation/#not-size","title":"Not Size","text":"

Ensure all data in column does not have certain size. Column has to have type array or map.

JavaScalaYAML
validation().col(\"transactions\").notSize(5)\n
validation.col(\"transactions\").notSize(5)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) != 5\"\n
"},{"location":"setup/validation/basic-validation/#less-than-size","title":"Less Than Size","text":"

Ensure all data in column has size less than certain value. Column has to have type array or map.

JavaScalaYAML
validation().col(\"transactions\").lessThanSize(5)\n
validation.col(\"transactions\").lessThanSize(5)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) < 5\"\n
"},{"location":"setup/validation/basic-validation/#less-than-or-equal-size","title":"Less Than Or Equal Size","text":"

Ensure all data in column has size less than or equal to certain value. Column has to have type array or map.

JavaScalaYAML
validation().col(\"transactions\").lessThanOrEqualSize(5)\n
validation.col(\"transactions\").lessThanOrEqualSize(5)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) <= 5\"\n
"},{"location":"setup/validation/basic-validation/#greater-than-size","title":"Greater Than Size","text":"

Ensure all data in column has size greater than certain value. Column has to have type array or map.

JavaScalaYAML
validation().col(\"transactions\").greaterThanSize(5)\n
validation.col(\"transactions\").greaterThanSize(5)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) > 5\"\n
"},{"location":"setup/validation/basic-validation/#greater-than-or-equal-size","title":"Greater Than Or Equal Size","text":"

Ensure all data in column has size greater than or equal to certain value. Column has to have type array or map.

JavaScalaYAML
validation().col(\"transactions\").greaterThanOrEqualSize(5)\n
validation.col(\"transactions\").greaterThanOrEqualSize(5)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) >= 5\"\n
"},{"location":"setup/validation/basic-validation/#luhn-check","title":"Luhn Check","text":"

Ensure all data in column passes luhn check. Luhn check is used to validate credit card numbers and certain identification numbers (see here for more details).

JavaScalaYAML
validation().col(\"credit_card\").luhnCheck()\n
validation.col(\"credit_card\").luhnCheck\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"LUHN_CHECK(credit_card)\"\n
"},{"location":"setup/validation/basic-validation/#has-type","title":"Has Type","text":"

Ensure all data in column has certain data type.

JavaScalaYAML
validation().col(\"id\").hasType(\"string\")\n
validation.col(\"id\").hasType(\"string\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"TYPEOF(id) == 'string'\"\n
"},{"location":"setup/validation/basic-validation/#expression","title":"Expression","text":"

Ensure all data in column adheres to SQL expression defined that returns back a boolean. You can define complex logic in here that could combine multiple columns.

For example, CASE WHEN status == 'open' THEN balance > 0 ELSE balance == 0 END would check all rows with status open to have balance greater than 0, otherwise, check the balance is 0.

JavaScalaYAML
var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validations(\nvalidation().expr(\"amount < 100\"),\nvalidation().expr(\"year == 2021\").errorThreshold(0.1),  //equivalent to if error percentage is > 10%, then fail\nvalidation().expr(\"REGEXP_LIKE(name, 'Peter .*')\").errorThreshold(200)  //equivalent to if number of errors is > 200, then fail\n);\n\nvar conf = configuration().enableValidation(true);\n
val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validations(\nvalidation.expr(\"amount < 100\"),\nvalidation.expr(\"year == 2021\").errorThreshold(0.1),  //equivalent to if error percentage is > 10%, then fail\nvalidation.expr(\"REGEXP_LIKE(name, 'Peter .*')\").errorThreshold(200)  //equivalent to if number of errors is > 200, then fail\n)\n\nval conf = configuration.enableValidation(true)\n
---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nvalidations:\n- expr: \"amount < 100\"\n- expr: \"year == 2021\"\nerrorThreshold: 0.1   #equivalent to if error percentage is > 10%, then fail\n- expr: \"REGEXP_LIKE(name, 'Peter .*')\"\nerrorThreshold: 200   #equivalent to if number of errors is > 200, then fail\ndescription: \"Should be lots of Peters\"\n\n#enableValidation inside application.conf\n
"},{"location":"setup/validation/group-by-validation/","title":"Group By Validation","text":"

If you want to run aggregations based on a particular set of columns, you can do so via group by validations. An example would be checking that the sum of amount is less than 1000 per account_id, year. The validations applied can be one of the validations from above.

"},{"location":"setup/validation/group-by-validation/#sum","title":"Sum","text":"

Check the sum of a columns values for each group adheres to validation.

JavaScala
validation().groupBy(\"account_id\", \"year\").sum(\"amount\").lessThan(1000)\n
validation.groupBy(\"account_id\", \"year\").sum(\"amount\").lessThan(1000)\n
"},{"location":"setup/validation/group-by-validation/#count","title":"Count","text":"

Check the count for each group adheres to validation.

JavaScala
validation().groupBy(\"account_id\", \"year\").count(\"amount\").lessThan(10)\n
validation.groupBy(\"account_id\", \"year\").count(\"amount\").lessThan(10)\n
"},{"location":"setup/validation/group-by-validation/#min","title":"Min","text":"

Check the min for each group adheres to validation.

JavaScala
validation().groupBy(\"account_id\", \"year\").min(\"amount\").greaterThan(0)\n
validation.groupBy(\"account_id\", \"year\").min(\"amount\").greaterThan(0)\n
"},{"location":"setup/validation/group-by-validation/#max","title":"Max","text":"

Check the max for each group adheres to validation.

JavaScala
validation().groupBy(\"account_id\", \"year\").max(\"amount\").lessThanOrEqual(100)\n
validation.groupBy(\"account_id\", \"year\").max(\"amount\").lessThanOrEqual(100)\n
"},{"location":"setup/validation/group-by-validation/#average","title":"Average","text":"

Check the average for each group adheres to validation.

JavaScala
validation().groupBy(\"account_id\", \"year\").avg(\"amount\").between(40, 60)\n
validation.groupBy(\"account_id\", \"year\").avg(\"amount\").between(40, 60)\n
"},{"location":"setup/validation/validation/","title":"Validations","text":"

Validations can be used to run data checks after you have run the data generator or even as a standalone task. A report summarising the success or failure of the validations is produced and can be examined for further investigation.

Currently, SQL expression validations are supported (can see here for reference what other expressions are valid), but will later be extended out to supported other validations such as aggregates (group by account_number, sum of amounts should be greater than 100), ordering (transaction dates should be in descending order), relationships (at least one transaction per account_number) or data profiling (how close produced data profile is to expected data profile).

"},{"location":"setup/validation/validation/#define-validations","title":"Define Validations","text":"

Full example validation can be found below. For more details, check out each of the subsections defined further below.

JavaScalaYAML
var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validations(\nvalidation().col(\"amount\").lessThan(100),\nvalidation().col(\"year\").isEqual(2021).errorThreshold(0.1),  //equivalent to if error percentage is > 10%, then fail\nvalidation().col(\"name\").matches(\"Peter .*\").errorThreshold(200)  //equivalent to if number of errors is > 200, then fail\n)\n.validationWait(waitCondition().pause(1));\n\nvar conf = configuration().enableValidation(true);\n
val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validations(\nvalidation.col(\"amount\").lessThan(100),\nvalidation.col(\"year\").isEqual(2021).errorThreshold(0.1),  //equivalent to if error percentage is > 10%, then fail\nvalidation.col(\"name\").matches(\"Peter .*\").errorThreshold(200)  //equivalent to if number of errors is > 200, then fail\n)  .validationWait(waitCondition.pause(1))\n\nval conf = configuration.enableValidation(true)\n
---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nvalidations:\n- expr: \"amount < 100\"\n- expr: \"year == 2021\"\nerrorThreshold: 0.1   #equivalent to if error percentage is > 10%, then fail\n- expr: \"REGEXP_LIKE(name, 'Peter .*')\"\nerrorThreshold: 200   #equivalent to if number of errors is > 200, then fail\ndescription: \"Should be lots of Peters\"\nwaitCondition:\npauseInSeconds: 1\n
"},{"location":"setup/validation/validation/#wait-condition","title":"Wait Condition","text":"

Once data has been generated, you may want to wait for a certain condition to be met before starting the data validations. This can be via:

"},{"location":"setup/validation/validation/#pause","title":"Pause","text":"JavaScalaYAML
var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().pause(1));\n
val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validationWait(waitCondition.pause(1))\n
---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\npauseInSeconds: 1\n
"},{"location":"setup/validation/validation/#data-exists","title":"Data exists","text":"JavaScalaYAML
var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWaitDataExists(\"updated_date > DATE('2023-01-01')\");\n
val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validationWaitDataExists(\"updated_date > DATE('2023-01-01')\")\n
---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\ndataSourceName: \"transactions\"\noptions:\npath: \"/tmp/csv\"\nexpr: \"updated_date > DATE('2023-01-01')\"\n
"},{"location":"setup/validation/validation/#webhook","title":"Webhook","text":"JavaScalaYAML
var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().webhook(\"http://localhost:8080/finished\")); //by default, GET request successful when 200 status code\n\n//or\n\nvar csvTxnsWithStatusCodes = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().webhook(\"http://localhost:8080/finished\", \"GET\", 200, 202));  //successful if 200 or 202 status code\n\n//or\n\nvar csvTxnsWithExistingHttpConnection = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().webhook(\"my_http\", \"http://localhost:8080/finished\"));  //use connection configuration from existing 'my_http' connection definition\n
val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validationWait(waitCondition.webhook(\"http://localhost:8080/finished\"))  //by default, GET request successful when 200 status code\n\n//or\n\nval csvTxnsWithStatusCodes = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validationWait(waitCondition.webhook(\"http://localhost:8080/finished\", \"GET\", 200, 202)) //successful if 200 or 202 status code\n\n//or\n\nval csvTxnsWithExistingHttpConnection = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validationWait(waitCondition.webhook(\"my_http\", \"http://localhost:8080/finished\")) //use connection configuration from existing 'my_http' connection definition\n
---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\nurl: \"http://localhost:8080/finished\" #by default, GET request successful when 200 status code\n\n#or\n\n---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\nurl: \"http://localhost:8080/finished\"\nmethod: \"GET\"\nstatusCodes: [200, 202] #successful if 200 or 202 status code\n\n#or\n\n---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\ndataSourceName: \"my_http\" #use connection configuration from existing 'my_http' connection definition\nurl: \"http://localhost:8080/finished\"\n
"},{"location":"setup/validation/validation/#file-available","title":"File available","text":"JavaScalaYAML
var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().file(\"/tmp/json\"));\n
val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition.file(\"/tmp/json\"))\n
---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\npath: \"/tmp/json\"\n
"},{"location":"setup/validation/validation/#report","title":"Report","text":"

Once run, it will produce a report like this.

"},{"location":"use-case/business-value/","title":"Business Value","text":"

Below is a list of the business related benefits from using Data Caterer which may be applicable for your use case.

Problem Data Caterer Solution Resources Effects Reliable test data creation - Profile existing data- Create scenarios- Generate data Software Engineers, QA, Testers Cost reduction in labor, more time spent on development, more bugs caught before production Faster development cycles - Generate data in local, test, UAT, pre-prod- Run different scenarios Software Engineers, QA, Testers More defects caught in lower environments, features pushed to production faster, common framework used across all environments Data compliance - Profiling existing data- Generate based on metadata- No complex masking- No production data used in lower environments Audit and compliance No chance for production data breaches Storage costs - Delete generated data- Test specific scenarios Infrastructure Lower data storage costs, less time spent on data management and clean up Schema evolution - Create metadata from data sources- Generate data based off fresh metadata Software Engineers, QA, Testers Less time spent altering tests due to schema changes, ease of use between environments and application versions"},{"location":"use-case/comparison/","title":"Comparison to similar tools","text":"

I have tried to include all the companies found in the list here from Mostly AI blog post and used information that is publicly available.

The companies/products not shown below either have:

"},{"location":"use-case/comparison/#data-generation","title":"Data Generation","text":"Tool Description Cost Pros Cons Clearbox AI Python based data generation tool via ML Unclear Python SDK UI interface Detect private data Report generation Batch data only No data clean up Limited/no documentation Curiosity Software Platform solution for test data management Unclear Extensive documentation Generate data based off test cases UI interface Web/API/UI/mobile testing No quick start No SDK Many components that may not be required No event generation support DataCebo Synthetic Data Vault Python based data generation tool via ML Unclear Python SDK Report generation Data quality checks Business logic constraints No data connection support No data clean up No foreign key support Datafaker Realistic data generation library Free SDK for many languages Simple, easy to use Extensible Open source Generate realistic values No data connection support No data clean up No validation No foreign key support DBLDatagen Python based data generation tool Free Python SDK Open source Good documentation Customisable scenarios Customisable column generation Generate from existing data/schemas Plugin third-party libraries Limited support if issues Code required No data clean up No data validation Gatling HTTP API load testing tool Free (Open Source)Gatling Enterprise, usage based, starts from \u20ac89 per month, 1 user, 6.25 hours of testing Kotlin, Java & Scala SDK Widely used Open source Clear documentation Extensive testing/validation support Customisable scenarios Report generation Only supports HTTP, JMS and JDBC No data clean up Data feeders not based off metadata Gretel Python based data generation tool via ML Usage based, starts from $295 per month, $2.20 per credit, assumed USD CLI & Python SDK UI interface Training and re-use of models Detect private data Customisable scenarios Batch data only No relationships between data sources Only simple foreign key relations defined No data clean up Charge by usage Howso Python based data generation tool via ML Unclear Python SDK Playground to try Open source library Customisable scenarios No support for data sources No data validation No data clean up Mostly AI Python based data generation tool via ML Usage based, Enterprise 1 user, 100 columns, 100K rows $3,100 per month, assumed USD Report generation Non-technical users can use UI Customisable scenarios Charge by usage Batch data only No data clean up Confusing use of 'smart select' for multiple foreign keys Limited custom column generation logic Multiple deployment components No SDK Octopize Python based data generation tool via ML Unclear Python & R SDK Report generation API for metadata Customisable scenarios Input data source is only CSV Multiple manual steps before starting Quickstart is not a quickstart Documentation lacks code examples Synthesized Python based data generation tool via ML Unclear CLI & Python SDK API for metadata IDE setup Data quality checks Not sure what is SDK & TDK Charge by usage No report of what was generated No relationships between data sources Tonic Platform solution for generating data Unclear UI interface Good documentation Detect private data Support for encrypted columns Report generation Alerting Batch data only Multiple deployment components No relationships between data sources No data validation No data clean up No SDK (only API) Difficult to embed complex business logic YData Python based data generation tool via ML. Platform solution as well Unclear Python SDK Open source Detect private data Compare datasets Report generation No data connection support Batch data only No data clean up Separate data generation and data validation No foreign key support"},{"location":"use-case/comparison/#use-of-ml-models","title":"Use of ML models","text":"

You may notice that the majority of data generators use machine learning (ML) models to learn from your existing datasets to generate new data. Below are some pros and cons to the approach.

Pros

Cons

"},{"location":"use-case/roadmap/","title":"Roadmap","text":""},{"location":"use-case/use-case/","title":"Use cases","text":""},{"location":"use-case/use-case/#replicate-production-in-lower-environment","title":"Replicate production in lower environment","text":"

Having a stable and reliable test environment is a challenge for a number of companies, especially where teams are asynchronously deploying and testing changes at faster rates. Data Caterer can help alleviate these issues by doing the following:

  1. Generates data with the latest schema changes and production like field values
  2. Run as a job on a daily/regular basis to replicate production traffic or data flows
  3. Validate data to ensure your system runs as expected
  4. Clean up data to avoid build up of generated data

"},{"location":"use-case/use-case/#local-development","title":"Local development","text":"

Similar to the above, being able to replicate production like data in your local environment can be key to developing more reliable code as you can test directly against data in your local computer. This has a number of benefits including:

  1. Fewer assumptions or ambiguities when the developer codes
  2. Direct feedback loop in local computer rather than waiting for test environment for more reliable test data
  3. No domain expertise required to understand the data
  4. Easy for new developers to be onboarded and developing/testing code for jobs/services
"},{"location":"use-case/use-case/#systemintegration-testing","title":"System/integration testing","text":"

When working with third-party, external or internal data providers, it can be difficult to have all setup ready to produce reliable data that abides by relationship contracts between each of the systems. You have to rely on these data providers in order for you to run your tests which may not align to their priorities. With Data Caterer, you can generate the same data that they would produce, along with maintaining referential integrity across the data providers, so that you can run your tests without relying on their systems being up and reliable in their corresponding lower environments.

"},{"location":"use-case/use-case/#scenario-testing","title":"Scenario testing","text":"

If you want to set up particular data scenarios, you can customise the generated data to fit your scenario. Once the data gets generated and is consumed, you can also run validations to ensure your system has consumed the data correctly. These scenarios can be put together from existing tasks or data sources can be enabled/disabled based on your requirement. Built into Data Caterer and controlled via feature flags, is the ability to test edge cases based on the data type of the fields used for data generation (enableEdgeCases flag within <field>.generator.options, see more here).

"},{"location":"use-case/use-case/#data-debugging","title":"Data debugging","text":"

When data related issues occur in production, it may be difficult to replicate in a lower or local environment. It could be related to specific fields not containing expected results, size of data is too large or missing corresponding referenced data. This becomes key to resolving the issue as you can directly code against the exact data scenario and have confidence that your code changes will fix the problem. Data Caterer can be used to generate the appropriate data in whichever environment you want to test your changes against.

"},{"location":"use-case/use-case/#data-profiling","title":"Data profiling","text":"

When using Data Caterer with the feature flag enableGeneratePlanAndTasks enabled (see here), metadata relating all the fields defined in the data sources you have configured will be generated via data profiling. You can run this as a standalone job (can disable enableGenerateData) so that you can focus on the profile of the data you are utilising. This can be run against your production data sources to ensure the metadata can be used to accurately generate data in other environments. This is a key feature of Data Caterer as no direct production connections need to be maintained to generate data in other environments (which can lead to serious concerns about data security as seen here).

"},{"location":"use-case/use-case/#schema-gathering","title":"Schema gathering","text":"

When using Data Caterer with the feature flag enableGeneratePlanAndTasks enabled (see here), all schemas of the data sources defined will be tracked in a common format (as tasks). This data, along with the data profiling metadata, could then feed back into your schema registries to help keep them up to date with your system.

"}]} \ No newline at end of file diff --git a/site/setup/guide/index.html b/site/setup/guide/index.html index 02352d9d..713fa157 100644 --- a/site/setup/guide/index.html +++ b/site/setup/guide/index.html @@ -1911,52 +1911,32 @@

Guides

For any of the paid tier guides, you can use the trial version fo the app to try it out. Details on how to get the trial can be found here.

Scenarios

-
-

Free tier scenarios

-
+
-
-
-
-

Paid tier scenarios

-
-
-

Data Sources

-
-

Free tier data sources

-
+
-
-
-
-

Paid tier data sources

-
-
  • Cassandra - Cassandra tables
  • Kafka - Kafka topics
  • Solace - Solace messages
  • Marquez - Generate data based on metadata in Marquez
  • OpenMetadata - Generate data based on metadata in OpenMetadata
  • HTTP - HTTP requests
  • +
  • Files (Fixed width) - (Soon to document) A variant of CSV but with no separator
  • MySql - (Soon to document) JDBC MySql tables
-

YAML Files

Base Concept

The execution of the data generator is based on the concept of plans and tasks. A plan represent the set of tasks that diff --git a/site/sitemap.xml.gz b/site/sitemap.xml.gz index 0e3af4bf7511923db39aefb3583d4e5dfe86b124..b8cef87da9dda8c27195870489fca26e85714f0e 100644 GIT binary patch delta 15 WcmeBS>0x1$@8;l8J-(5RjR^oAKLd;a delta 15 WcmeBS>0x1$@8;lO-Mf*EjR^o9O#@;8