This guide will help you get started setting up a fully operational PNDA cluster. The main tasks can be grouped as follows:
- Selecting a Hadoop distribution
- Creating PNDA
- Basic data exploration
- Producer integration
- Packages and applications
Some of these tasks may be performed by different people in your organization. For example, enterprise IT staff may be responsible for provisioning a cluster, software developers may write producers and consumers to process data, and data scientists may analyze collected data.
- Carefully follow the PNDA creation guide for your target environment.
- Launch the console to make sure that everything is running. Users can connect to the console by directing a browser to a URL of the form `https://<knox FQDN>:8443/gateway/pnda/console`, where `<knox FQDN>` is the FQDN defined during the provisioning of the security material (the FQDN is stored in the `.yaml` file located in the `pnda-cli/platform-certificates/knox/` directory). Authentication is based on Linux PAM, so local cluster users (including the PNDA admin user) are recognized, as well as LDAP users, depending on the LDAP properties defined during the PNDA configuration steps. For more details, refer to the UIs In PNDA section of the guide.
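If you want to sanity-check the console endpoint from a terminal before opening a browser, a minimal sketch along these lines can help. It assumes the `requests` library is available and that the hostname shown is replaced with your actual Knox FQDN; certificate verification is relaxed here only because clusters provisioned with self-signed security material will not validate against public CAs.

```python
# Minimal sketch: check that the PNDA console answers behind the Knox gateway.
# Replace KNOX_FQDN with the FQDN used when provisioning the security material.
import requests

KNOX_FQDN = "knox.example.pnda.local"  # hypothetical value; use your own FQDN
CONSOLE_URL = "https://{}:8443/gateway/pnda/console".format(KNOX_FQDN)

# verify=False only because self-signed certificates are common in lab set-ups;
# point verify at your CA bundle for anything beyond a quick smoke test.
response = requests.get(CONSOLE_URL, verify=False, timeout=10)
print(CONSOLE_URL, "->", response.status_code)
```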
In this tutorial, you will learn how to generate a set of sample data, bulk upload it into the PNDA platform, and explore the data using notebooks.
- Open the console, and click the Jupyter link. Upload the example notebook, and follow the instructions in the notebook.
- In the example notebook, follow the instructions section and use the data generation tool or the embedded cell to generate test data sets (a minimal sketch of such a cell follows this list).
- Use the bulk ingest tool to import the test data sets.
- Use the sample notebook to analyze the data.
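To give a feel for what the data-generation step produces, here is a minimal sketch of a notebook cell that writes a batch of synthetic records to a local file. The field names (`timestamp`, `host`, `value`) and the output path are illustrative assumptions, not the schema used by the official example notebook; follow the notebook's own instructions for the real tool.

```python
# Minimal sketch of a data-generation cell: emits synthetic JSON records.
# Field names and output path are illustrative only, not the official schema.
import json
import random
import time

def generate_records(count):
    now = int(time.time() * 1000)
    for i in range(count):
        yield {
            "timestamp": now + i,                       # millisecond timestamp
            "host": "host-{}".format(random.randint(1, 10)),
            "value": round(random.uniform(0.0, 100.0), 2),
        }

with open("sample_data.json", "w") as out:
    for record in generate_records(1000):
        out.write(json.dumps(record) + "\n")

print("wrote 1000 sample records to sample_data.json")
```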
For an in-depth tutorial, see the Interactive exploratory data analytics lab.
In the basic data exploration tutorial you generated sample data in a single step. In a real-world scenario, however, the data of interest is likely being generated in real time. This data needs to be captured using producers that direct it to topics configured on Kafka. For this tutorial, we will use a producer based on Logstash.
For the purposes of this guide, we'll use the test data source for the example Spark Streaming application.
- Select a host to act as the producer. It needs to have network connectivity with Kafka.
- Create a topic on Kafka using the Kafka Manager. Once created, the topic will appear in the console.
- Install and configure Logstash on the chosen host, then install the Kafka output plugin so that Logstash can publish events to the topic created above.
- Run the test data script. In the `data-source` directory of the example-spark-streaming repository, there is a Python script that can send a stream of data over TCP into Logstash (a minimal sketch of such a sender appears after this list).
- In the console, verify that data is flowing into the topics. In the left-hand column, you should see activity on your Kafka topic.
- After a period of time (up to 30 minutes), the master dataset will be created automatically by Gobblin, and you will be able to see it on the Datasets page in the console.
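As a rough illustration of what the test data script does, the sketch below opens a TCP connection to Logstash and writes newline-delimited events. The host, port, and message format are assumptions for illustration; use the actual script from the `data-source` directory of the example-spark-streaming repository when following the tutorial.

```python
# Minimal sketch of a TCP sender feeding Logstash; the real tutorial uses the
# script shipped in the data-source directory of example-spark-streaming.
import socket
import time

LOGSTASH_HOST = "localhost"  # assumed: Logstash runs on the producer host
LOGSTASH_PORT = 20518        # assumed port; match your Logstash tcp input config

with socket.create_connection((LOGSTASH_HOST, LOGSTASH_PORT)) as conn:
    for i in range(100):
        # Newline-delimited events; Logstash forwards them to the Kafka topic.
        event = "test event {}\n".format(i)
        conn.sendall(event.encode("utf-8"))
        time.sleep(0.1)
```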
- A prerequisite to working with packages is to ensure that the application package repository is correctly configured as described in the PREPARE phase for your infrastructure: AWS, OpenStack or server clusters.
- Build the spark-streaming app and upload it to the package repository: upload the file to the object store via the platform-package-repository.
- In the console, deploy the package. On the packages page, look for the package you have just uploaded in the list of available packages. (If you don't see it, try clicking the refresh button.) Click the deploy button next to the package, and you'll see it added to the list of deployed packages.
- Create an application from the package. On the applications page, click "Create New Application", select the package you have just deployed, and follow the prompts to create a new application.
- Start the application. On the applications page, click the Start button next to the application. When the application is running, the button will change to Pause.
- If you click on the application, and then click on the metrics tab, you should see metrics appear that indicate the time taken to process batches of data.
- Open Impala and see if your data is there. `ssh` to the edge node (indicated on the console home page) and run `impala-shell` to query the data. The default table name is `example_table`, unless you changed it when creating the application. Impala follows the SQL-92 standard, with some extensions, detailed in the Impala SQL reference. For example, try `select count(*) from example_table;`.
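If you would rather query Impala programmatically than from `impala-shell`, a minimal sketch using the `impyla` client is shown below. The edge node hostname is a placeholder assumption; 21050 is Impala's usual HiveServer2-compatible port, but check your cluster's configuration.

```python
# Minimal sketch: query example_table through Impala using the impyla client.
# The hostname is a placeholder; 21050 is the usual impalad HS2 port.
from impala.dbapi import connect

conn = connect(host="edge-node.example.pnda.local", port=21050)
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM example_table")
print("rows in example_table:", cursor.fetchone()[0])
cursor.close()
conn.close()
```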
For in-depth tutorials on Spark Streaming, see the Spark Streaming and HBase and Spark Streaming and OpenTSDB tutorials.
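For a flavour of what such a streaming application looks like, here is a minimal PySpark sketch that consumes a Kafka topic and counts records per batch. It uses the `KafkaUtils.createDirectStream` API from the Spark 1.x/2.x streaming Kafka integration; the broker address and topic name are assumptions, and the deployed example application does considerably more than this (for instance, landing data that can be queried through Impala).

```python
# Minimal sketch of a Spark Streaming consumer: counts Kafka records per batch.
# Broker and topic are placeholders; the example app does more than this.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="example-kafka-batch-count")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second batches

stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["example.topic"],                                # assumed topic name
    kafkaParams={"metadata.broker.list": "kafka-node:9092"}  # assumed broker
)

# Each element is a (key, value) pair; count the values in every batch.
stream.count().pprint()

ssc.start()
ssc.awaitTermination()
```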