Kite Views API

Most of the time, you don't need to work with all of the records stored in a dataset. It is common to work with subsets, such as last month's events rather than all events. The Views API is a way to express constraints for the records that Kite loads.

If your dataset is partitioned, Kite determines which partitions to draw from based on a view's constraints. You don't have to specify the partitions yourself because Kite automatically filters out partitions that cannot contain matching records.

Kite filters records so that you can express requirements for your data and have Kite enforce them. For example, events.with("type") is a view of events where each record loaded by the view will have a non-null value for the type field.

Views

Kite's View interface represents a logical collection of records in a dataset. It might seem as though a view is a subset of a dataset, but it is more accurate to think of a dataset as a view with no constraints applied.

You can use a view as the input for a MapReduce job or read its content directly by using View#newReader to get a DatasetReader that returns only records in the view.

View instances are immutable. You can pass a view to other operations knowing that they cannot change it.
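The following is a minimal sketch of what immutability means for refinement. `ImmutableViewSketch` and its methods are hypothetical stand-ins written for this illustration, not Kite's `View` interface; the sketch only models equality constraints and assumes they can be held in a map.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for illustration only; this is not Kite's View interface.
final class ImmutableViewSketch {
    // Equality constraints keyed by field name; never modified after construction.
    private final Map<String, Object> constraints;

    private ImmutableViewSketch(Map<String, Object> constraints) {
        this.constraints = Collections.unmodifiableMap(constraints);
    }

    // A dataset is modeled as a view with no constraints applied.
    static ImmutableViewSketch unconstrained() {
        return new ImmutableViewSketch(new LinkedHashMap<String, Object>());
    }

    // Returns a new view with one more equality constraint; this view is untouched.
    ImmutableViewSketch with(String field, Object value) {
        Map<String, Object> copy = new LinkedHashMap<String, Object>(constraints);
        copy.put(field, value);
        return new ImmutableViewSketch(copy);
    }

    Map<String, Object> constraints() {
        return constraints;
    }

    public static void main(String[] args) {
        ImmutableViewSketch events = unconstrained();
        ImmutableViewSketch fatal = events.with("level", "FATAL");
        System.out.println(events.constraints()); // the parent still has no constraints
        System.out.println(fatal.constraints());  // the child carries the new constraint
    }
}
```

Because refinement copies the constraints instead of modifying them, code that holds `events` is unaffected by whatever views are later derived from it.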

Refining a View: Selecting Records

You create a view by adding a constraint to an existing view or dataset using one of the following methods.

| Method | Definition | Example |
|--------|------------|---------|
| with | Add a non-null constraint for a field | events.with("level") |
| with | Add an equality constraint for a field | events.with("level", "FATAL") |
| with | Add a set-inclusion constraint for a field | events.with("level", "WARNING", "ERROR") |
| from | Add a >= constraint for a field | events.from("day", 1) |
| fromAfter | Add a > constraint for a field | events.fromAfter("day", 4) |
| to | Add a <= constraint for a field | events.to("year", 2014) |
| toBefore | Add a < constraint for a field | events.toBefore("year", 2015) |

Each method returns a new View with the additional constraint added to the parent view (see note 1).

For example, if you want to work with only the records in the ratings dataset that have a numeric rating of 5, you would use the with method.

ratings.with("rating", 5);

Kite inspects each record and applies this constraint before passing records to your application. Only ratings with the value 5 are returned.

The object you pass as a constraint must match the data type of the field. For example, if the rating field is a String data type, passing the integer value 5 as a constraint will throw an exception.

You can chain refinement method calls to create a more complicated view all at once. For example, if you've defined start and end variables, you can select a range of times during which ratings are submitted by chaining from and to for the same record field.
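A minimal sketch of this kind of type check follows. It assumes a hypothetical schema held as a map from field name to Java class. The text does not say which exception Kite throws, so `IllegalArgumentException` is used here purely as a placeholder, and `ConstraintTypeCheckSketch` is not part of Kite.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of validating a constraint value against a field's type.
// The schema map and the exception type are assumptions, not Kite's implementation.
final class ConstraintTypeCheckSketch {
    private final Map<String, Class<?>> fieldTypes = new HashMap<String, Class<?>>();

    // Registers a field and the Java type its values must have.
    ConstraintTypeCheckSketch field(String name, Class<?> type) {
        fieldTypes.put(name, type);
        return this;
    }

    // Throws if the field is unknown or the value is not an instance of the field's type.
    void check(String field, Object value) {
        Class<?> expected = fieldTypes.get(field);
        if (expected == null) {
            throw new IllegalArgumentException("Unknown field: " + field);
        }
        if (!expected.isInstance(value)) {
            String actual = value == null ? "null" : value.getClass().getSimpleName();
            throw new IllegalArgumentException("Field " + field + " expects "
                + expected.getSimpleName() + " but got " + actual);
        }
    }

    public static void main(String[] args) {
        ConstraintTypeCheckSketch schema =
            new ConstraintTypeCheckSketch().field("rating", String.class);
        schema.check("rating", "5"); // accepted: a String for a String field
        try {
            schema.check("rating", 5); // rejected: an Integer for a String field
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```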

ratings.from("time", start).to("time", end).with("rating", 5);

If the ratings dataset is partitioned by time, the view automatically takes advantage of that partitioning: Kite draws only from the partitions that can contain records between start and end. See Partitioned Datasets.
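The idea of skipping partitions that cannot contain matching records can be sketched as follows. This is an illustrative model only: the `Partition` class, the `day=N` paths, and the numeric time ranges are assumptions made for the example and do not reflect how Kite stores or selects partitions internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of partition pruning; names and layout are assumptions,
// not Kite's implementation.
final class PartitionPruningSketch {
    // A partition that holds records whose "time" lies in [minTime, maxTime].
    static final class Partition {
        final String path;
        final long minTime;
        final long maxTime;

        Partition(String path, long minTime, long maxTime) {
            this.path = path;
            this.minTime = minTime;
            this.maxTime = maxTime;
        }
    }

    // Keeps only partitions whose time range overlaps the inclusive range [start, end].
    static List<String> prune(List<Partition> partitions, long start, long end) {
        List<String> selected = new ArrayList<String>();
        for (Partition p : partitions) {
            boolean canContainMatch = p.maxTime >= start && p.minTime <= end;
            if (canContainMatch) {
                selected.add(p.path);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Partition> partitions = Arrays.asList(
            new Partition("day=1", 0L, 99L),
            new Partition("day=2", 100L, 199L),
            new Partition("day=3", 200L, 299L));
        // Similar in spirit to from("time", 150).to("time", 250).
        System.out.println(prune(partitions, 150L, 250L));
    }
}
```

Pruning only narrows where records are read from. Records inside the selected partitions still have to be checked individually against the remaining constraints, such as a rating of 5.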

Using Different Classes and Schemas: Selecting Columns

By default, views will use Avro's GenericRecord type when returning records. You can set the type that will be constructed by calling View#asType(Class) and passing a class that is compatible with the dataset's schema. Kite will automatically set the read schema based on the type you pass.

You can also set the read schema and still use generic records by calling View#asSchema(Schema) with your read schema.

Both asType and asSchema will load only the requested data fields from the dataset. Selecting fields avoids spending extra time deserializing some fields in Avro and enables Parquet to skip large portions of the underlying data. This can be used to drastically improve read speeds.

Working with Views

Loading a View

In addition to creating a view with the API, you can load a view from a view URI. A view URI is analogous to a dataset URI, except that the scheme is view: instead of dataset: and constraints are added as query arguments.

The following code snippet creates a view for a dataset of movie ratings submitted by the critic with user_id 125.

View<Record> ratings = Datasets.load("view:hive:ratings?user_id=125");
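To make the URI structure concrete, here is a minimal sketch that splits a view URI into the corresponding dataset URI and its constraints. `ViewUriSketch` is a hypothetical helper written for this illustration. It ignores URL encoding and any syntax beyond simple `name=value` pairs, and it is not how Kite parses URIs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser that splits a view URI into its dataset part and constraints.
// It ignores URL encoding and any richer constraint syntax; it is not Kite's URI handling.
final class ViewUriSketch {
    final String datasetUri;
    final Map<String, String> constraints = new LinkedHashMap<String, String>();

    ViewUriSketch(String viewUri) {
        if (!viewUri.startsWith("view:")) {
            throw new IllegalArgumentException("Not a view URI: " + viewUri);
        }
        String rest = viewUri.substring("view:".length());
        int q = rest.indexOf('?');
        String location = q < 0 ? rest : rest.substring(0, q);
        // The same location with the dataset: scheme names the underlying dataset.
        this.datasetUri = "dataset:" + location;
        if (q >= 0 && q + 1 < rest.length()) {
            // Each query argument is treated as one name=value constraint.
            for (String pair : rest.substring(q + 1).split("&")) {
                int eq = pair.indexOf('=');
                if (eq > 0) {
                    constraints.put(pair.substring(0, eq), pair.substring(eq + 1));
                }
            }
        }
    }

    public static void main(String[] args) {
        ViewUriSketch parsed = new ViewUriSketch("view:hive:ratings?user_id=125");
        System.out.println(parsed.datasetUri);
        System.out.println(parsed.constraints);
    }
}
```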

Inspecting a View

In some use cases, you don't need to return a set of values, only to verify whether matching values exist. For example, you might want to submit a MapReduce job only if there are values for it to process. These methods allow you to inspect a view at runtime.

isEmpty

The isEmpty method returns whether your View contains any records at all.

getUri

The getUri method returns a URI for a View that can be passed to Datasets.load.

includes

The includes method returns whether an entity matches a view's constraints, that is, whether the record would be included in this View if it were present in the Dataset.
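One way to picture includes is as a test of a record against the view's constraints alone, with no data being read. The sketch below models constraints as composed predicates over a record represented as a map. `IncludesSketch`, its methods, and the map-based record are assumptions for illustration, not Kite's classes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical model of includes: constraints are predicates over a record,
// and a record is included only if every predicate accepts it.
final class IncludesSketch {
    private final Predicate<Map<String, Object>> constraint;

    private IncludesSketch(Predicate<Map<String, Object>> constraint) {
        this.constraint = constraint;
    }

    static IncludesSketch unconstrained() {
        return new IncludesSketch(record -> true);
    }

    // Equality constraint, in the spirit of with(field, value).
    IncludesSketch with(String field, Object value) {
        return new IncludesSketch(constraint.and(r -> value.equals(r.get(field))));
    }

    // Inclusive lower bound on a numeric field, in the spirit of from(field, value).
    IncludesSketch from(String field, long lower) {
        return new IncludesSketch(constraint.and(r -> {
            Object v = r.get(field);
            return v instanceof Number && ((Number) v).longValue() >= lower;
        }));
    }

    // Tests the record against the constraints; no dataset is read.
    boolean includes(Map<String, Object> record) {
        return constraint.test(record);
    }

    public static void main(String[] args) {
        IncludesSketch view = unconstrained().with("level", "ERROR").from("day", 4L);
        Map<String, Object> record = new HashMap<String, Object>();
        record.put("level", "ERROR");
        record.put("day", 7);
        System.out.println(view.includes(record)); // matches both constraints
        record.put("day", 2);
        System.out.println(view.includes(record)); // fails the lower bound on day
    }
}
```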

Working with Records in a View

You can interact with records in a view the way you would work with records in a full dataset. You can use newReader and newWriter to get the same reader or writer objects, but they are restricted to operations on the view.

newReader

The newReader method creates an appropriate DatasetReader that returns only records that match the view's constraints.

newWriter

The newWriter method creates an appropriate DatasetWriter that will write only records that match the view's constraints. For more information, see Writing to Views.

See Restricted Views.

deleteAll

The deleteAll method deletes all entities in the dataset that match the view's constraints.

If the delete cannot be completed cleanly, then the method throws an UnsupportedOperationException. In the FileSystem implementation, for example, individual records cannot be deleted, only entire files. That means that Kite only allows you to delete an entire partition directory.

This method will delete records in a dataset, and will not delete the dataset itself. When called on a dataset, all records in the dataset will be removed. To delete a dataset in addition to the data stored in that dataset, use Datasets.delete.
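The clean-delete rule for file-backed storage can be sketched as follows. The model assumes a dataset partitioned by a single `day` field, with one directory per day, and a delete request that constrains one field to one value. `DeleteAllSketch` is hypothetical and far simpler than a real implementation; only the use of `UnsupportedOperationException` is taken from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a file-backed dataset partitioned by a single "day" field.
// It illustrates the rule only; it is not Kite's FileSystem implementation.
final class DeleteAllSketch {
    // One entry per partition directory, identified by its day value.
    private final List<Integer> dayPartitions;

    DeleteAllSketch(List<Integer> days) {
        this.dayPartitions = new ArrayList<Integer>(days);
    }

    // Removes the partition for the given day. A constraint on any other field would
    // require removing individual records from files, so it is rejected.
    void deleteAll(String constrainedField, int value) {
        if (!"day".equals(constrainedField)) {
            throw new UnsupportedOperationException(
                "Constraint on " + constrainedField
                    + " does not align with whole partition directories");
        }
        dayPartitions.remove(Integer.valueOf(value));
    }

    List<Integer> remainingPartitions() {
        return dayPartitions;
    }

    public static void main(String[] args) {
        DeleteAllSketch dataset = new DeleteAllSketch(Arrays.asList(1, 2, 3));
        dataset.deleteAll("day", 2); // removes a whole partition directory
        System.out.println(dataset.remainingPartitions());
        try {
            dataset.deleteAll("level", 5); // would need record-level deletes
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
    }
}
```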


Notes:

  1. Views are created by refining other views because views are immutable and cannot be changed. This works like Java's String methods, which always return new strings.