Document, Executor, and Flow are the three fundamental concepts in Jina.

  • Document is the basic data type in Jina;
  • Executor is how Jina processes Documents;
  • Flow is how Jina streamlines and scales Executors.

Learn these three concepts and you are good to go.


Cookbook on Document/DocumentArray 2.0 API

Document is the basic data type that Jina operates with. Text, image, video, audio or 3D mesh: they are all Documents in Jina.

DocumentArray is a sequence container of Documents. It is the first-class citizen of Executor, serving as the Executor's input and output.

You could say Document is to Jina what np.float is to NumPy, and DocumentArray is to Jina what np.ndarray is to NumPy.

Minimum working example

from jina import Document

d = Document() 

Document API

Document Attributes

A Document object has the following attributes, which can be put into the following categories:

Content attributes .buffer, .blob, .text, .uri, .content, .embedding
Meta attributes .id, .weight, .mime_type, .location, .tags, .offset, .modality, .siblings
Recursive attributes .chunks, .matches, .granularity, .adjacency
Relevance attributes .score, .evaluations

Set & Unset Attributes

Set an attribute:

from jina import Document

d = Document()
d.text = 'hello world'
<jina.types.document.Document id=9badabb6-b9e9-11eb-993c-1e008a366d49 mime_type=text/plain text=hello world at 4444621648>

Unset an attribute:

d.pop('text')
<jina.types.document.Document id=cdf1dea8-b9e9-11eb-8fd8-1e008a366d49 mime_type=text/plain at 4490447504>

Unset multiple attributes:

d.pop('text', 'id', 'mime_type')
<jina.types.document.Document at 5668344144>

Construct Document

Content Attributes
doc.buffer The raw binary content of this Document
doc.blob The ndarray of the image/audio/video Document
doc.text The text info of the Document
doc.uri The URI of the Document: a local file path, a remote URL starting with http or https, or a data URI
doc.content Whichever of the above fields is non-empty
doc.embedding The embedding ndarray of this Document

You can assign str, ndarray, buffer or uri to a Document.

from jina import Document
import numpy as np

d1 = Document(content='hello')
d2 = Document(content=b'\f1')
d3 = Document(content=np.array([1, 2, 3]))
d4 = Document(content='https://static.jina.ai/logo/core/notext/light/logo.png')
<jina.types.document.Document id=2ca74b98-aed9-11eb-b791-1e008a366d48 mimeType=text/plain text=hello at 6247702096>
<jina.types.document.Document id=2ca74f1c-aed9-11eb-b791-1e008a366d48 buffer=DDE= mimeType=text/plain at 6247702160>
<jina.types.document.Document id=2caab594-aed9-11eb-b791-1e008a366d48 blob={'dense': {'buffer': 'AQAAAAAAAAACAAAAAAAAAAMAAAAAAAAA', 'shape': [3], 'dtype': '<i8'}} at 6247702416>
<jina.types.document.Document id=4c008c40-af9f-11eb-bb84-1e008a366d49 uri=https://static.jina.ai/logo/core/notext/light/logo.png mimeType=image/png at 6252395600>

The content will be automatically assigned to either the text, buffer, blob, or uri fields. id and mime_type are auto-generated when not given.

You can get a visualization of a Document object in Jupyter Notebook or by calling .plot().

Exclusivity of doc.content

Note that one Document can only contain one type of content: it is either text, buffer, blob or uri. Setting text first and then setting uri will clear the text field.

d = Document(text='hello world')
d.uri = 'https://jina.ai/'
assert not d.text  # True

d = Document(content='https://jina.ai')
assert d.uri == 'https://jina.ai'  # True
assert not d.text  # True
d.text = 'hello world'

assert d.content == 'hello world'  # True
assert not d.uri  # True
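
One way to picture this exclusivity is a single slot that remembers which content field was set last. The class below is a toy stand-in written for illustration only (the name OneOfContent and its methods are hypothetical and are not part of Jina); it merely sketches why setting one field makes the others read as empty.

```python
class OneOfContent:
    """Toy stand-in (not Jina's Document): holds at most one content field."""

    _FIELDS = ('text', 'buffer', 'blob', 'uri')

    def __init__(self):
        self._kind = None   # name of the field that is currently set
        self._value = None  # the value stored for that field

    def set(self, kind, value):
        if kind not in self._FIELDS:
            raise ValueError(f'unknown content field: {kind}')
        # There is only one slot, so writing a new field replaces the old one.
        self._kind, self._value = kind, value

    def get(self, kind):
        # Any field other than the active one reads as empty (None here).
        return self._value if kind == self._kind else None

    @property
    def content(self):
        # `content` is simply whatever the single slot currently holds.
        return self._value


c = OneOfContent()
c.set('text', 'hello world')
c.set('uri', 'https://jina.ai/')
print(c.get('text'), c.get('uri'), c.content)
```

After the second call to set(), the text field reads as empty and content returns the URI, which mirrors the assertions above.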

Conversion between doc.content

You can use the following methods to convert between .uri, .text, .buffer and .blob:

doc.convert_buffer_to_blob()
doc.convert_blob_to_buffer()
doc.convert_uri_to_buffer()
doc.convert_buffer_to_uri()
doc.convert_text_to_uri()
doc.convert_uri_to_text()

You can convert a URI to a data URI (a data in-line URI scheme) using doc.convert_uri_to_datauri(). This will fetch the resource and make it inline.
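
To make the idea of an inline data URI concrete, here is a minimal sketch using only the Python standard library. The helper names to_datauri and from_datauri are hypothetical and are not Jina functions; the sketch only handles the base64 form of the data URI scheme.

```python
import base64


def to_datauri(raw: bytes, mime_type: str = 'application/octet-stream') -> str:
    """Inline raw bytes as a base64 data URI."""
    encoded = base64.b64encode(raw).decode('ascii')
    return f'data:{mime_type};base64,{encoded}'


def from_datauri(datauri: str) -> bytes:
    """Recover the raw bytes from a base64 data URI."""
    header, _, payload = datauri.partition(',')
    if not header.startswith('data:') or not header.endswith(';base64'):
        raise ValueError('only base64 data URIs are handled in this sketch')
    return base64.b64decode(payload)


uri = to_datauri(b'hello', 'text/plain')
print(uri)  # data:text/plain;base64,aGVsbG8=
```

The resource travels inside the URI itself, so no further fetch is needed to read it back.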

In particular, when you work with an image Document, there are some extra helpers that enable more conversion:

doc.convert_image_buffer_to_blob()
doc.convert_image_blob_to_uri()
doc.convert_image_uri_to_blob()
doc.convert_image_datauri_to_blob()

Set Embedding

An embedding is a high-dimensional representation of a Document. You can assign any NumPy ndarray as a Document's embedding.

import numpy as np
from jina import Document

d1 = Document(embedding=np.array([1, 2, 3]))
d2 = Document(embedding=np.array([[1, 2, 3], [4, 5, 6]]))

Construct with Multiple Attributes

Meta Attributes
doc.tags A structured data value, consisting of fields which map to dynamically typed values
doc.id A hexdigest that represents a unique Document ID
doc.weight The weight of the Document
doc.mime_type The mime type of the Document
doc.location The position of the Document. This could be start and end index of a string; x,y (top, left) coordinates of an image crop; timestamp of an audio clip, etc
doc.offset The offset of the Document in the previous granularity Document
doc.modality An identifier of the modality the Document belongs to

You can assign multiple attributes in the constructor via:

from jina import Document

d = Document(uri='https://jina.ai',
             mime_type='text/plain',
             granularity=1,
             adjacency=3,
             tags={'foo': 'bar'})
<jina.types.document.Document id=e01a53bc-aedb-11eb-88e6-1e008a366d48 uri=https://jina.ai mimeType=text/plain tags={'foo': 'bar'} granularity=1 adjacency=3 at 6317309200>

Construct from Dict or JSON String

You can build a Document from a dict or JSON string:

from jina import Document
import json

d = {'id': 'hello123', 'content': 'world'}
d1 = Document(d)

d = json.dumps({'id': 'hello123', 'content': 'world'})
d2 = Document(d)

Parsing Unrecognized Fields

Unrecognized fields in a dict/JSON string are automatically put into the Document's .tags field:

from jina import Document

d1 = Document({'id': 'hello123', 'foo': 'bar'})
<jina.types.document.Document id=hello123 tags={'foo': 'bar'} at 6320791056>

You can use field_resolver to map external field names to Document attributes:

from jina import Document

d1 = Document({'id': 'hello123', 'foo': 'bar'}, field_resolver={'foo': 'content'})
<jina.types.document.Document id=hello123 mimeType=text/plain text=bar at 6246985488>
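
The routing of unknown keys can be sketched with plain dictionaries. This is an illustration of the behaviour described above, not Jina's implementation: split_fields and KNOWN_FIELDS are hypothetical names, the set of known fields is a small illustrative subset, and the resolver maps to 'text' directly rather than to 'content'.

```python
KNOWN_FIELDS = {'id', 'text', 'uri', 'mime_type'}  # illustrative subset only


def split_fields(raw, field_resolver=None):
    """Keep known keys as attributes; route everything else into 'tags'."""
    resolver = field_resolver or {}
    doc = {'tags': {}}
    for key, value in raw.items():
        key = resolver.get(key, key)  # rename external field names first
        if key == 'tags':
            doc['tags'].update(value)
        elif key in KNOWN_FIELDS:
            doc[key] = value
        else:
            doc['tags'][key] = value  # unrecognized field
    return doc


plain = split_fields({'id': 'hello123', 'foo': 'bar'})
resolved = split_fields({'id': 'hello123', 'foo': 'bar'},
                        field_resolver={'foo': 'text'})
print(plain)
print(resolved)
```

Without a resolver, 'foo' ends up under tags; with the resolver, it is renamed before the lookup and lands in the text field.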

Construct from Another Document

Assigning a Document object to another variable does not create a new Document; both names refer to the same object:

from jina import Document

d = Document(content='hello, world!')
d1 = d

assert id(d) == id(d1)  # True

To make a deep copy, use copy=True:

d1 = Document(d, copy=True)

assert id(d) != id(d1)  # True

You can partially update a Document according to another source Document:

from jina import Document

s = Document(
    id='🐲',
    content='hello-world',
    tags={'a': 'b'},
    chunks=[Document(id='🐢')],
)
d = Document(
    id='🐦',
    content='goodbye-world',
    tags={'c': 'd'},
    chunks=[Document(id='🐯')],
)

# only update `id` field
d.update(s, fields=['id'])

# update all fields. `tags` field as `dict` will be merged.
d.update(s)
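
The two update modes can be pictured with ordinary dictionaries. The function below is a sketch under stated assumptions, not Jina's code: partial_update is a hypothetical name, Documents are modelled as dicts, and only the documented behaviour is reproduced (selected fields are overwritten, and tags are merged rather than replaced). How nested chunks are handled is not covered here.

```python
def partial_update(dst, src, fields=None):
    """Copy `fields` (default: all fields of src) from src into dst."""
    names = fields if fields is not None else list(src)
    for name in names:
        if name not in src:
            continue
        if name == 'tags':
            # dict-valued tags are merged instead of being replaced
            dst.setdefault('tags', {}).update(src['tags'])
        else:
            dst[name] = src[name]
    return dst


s = {'id': 'source', 'text': 'hello-world', 'tags': {'a': 'b'}}
d = {'id': 'dest', 'text': 'goodbye-world', 'tags': {'c': 'd'}}

partial_update(d, s, fields=['id'])   # only `id` changes
after_id_only = dict(d)

partial_update(d, s)                  # everything, with merged tags
print(after_id_only)
print(d)
```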

Construct from JSON, CSV, ndarray and Files

The jina.types.document.generators module lets you construct Documents from common file types such as JSON, CSV, ndarray and text files. Each of the following functions returns a generator of Documents, where each Document corresponds to one line, row or file in the source:

from_ndjson() Yield Documents from a line-based JSON file. Each line is a Document object
from_csv() Yield Documents from a CSV file. Each line is a Document object
from_files() Yield Documents from files matching a glob pattern. Each file is a Document object
from_ndarray() Yield Documents from an ndarray. Each row (depending on axis) is a Document object

Using a generator is sometimes less memory-demanding, as it does not load/build all Document objects in one shot.
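
The memory benefit comes from laziness: nothing is parsed until it is requested. The following standard-library sketch shows the same pattern on line-based JSON. The function from_ndjson_lines is a hypothetical stand-in that yields plain dicts rather than Documents, and the in-memory StringIO stands in for a file on disk.

```python
import io
import itertools
import json


def from_ndjson_lines(fp):
    """Lazily yield one parsed record per non-empty line."""
    for line in fp:
        line = line.strip()
        if line:
            yield json.loads(line)


fp = io.StringIO(
    '{"id": "a", "text": "hello"}\n'
    '{"id": "b", "text": "world"}\n'
    '{"id": "c", "text": "!"}\n'
)

# Only the first two lines are read and parsed here.
first_two = list(itertools.islice(from_ndjson_lines(fp), 2))

# The third line is still unread and can be consumed later.
rest = list(from_ndjson_lines(fp))
print(first_two, rest)
```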

To convert the generator to DocumentArray use:

from jina import DocumentArray
from jina.types.document.generators import from_files

DocumentArray(from_files('/*.png'))

Serialize Document

You can serialize a Document into a JSON string, a Python dict or a binary string:

from jina import Document

d = Document(content='hello world')
d.json()
{
  "id": "6a1c7f34-aef7-11eb-b075-1e008a366d48",
  "mimeType": "text/plain",
  "text": "hello world"
}
d.dict()
{'id': '6a1c7f34-aef7-11eb-b075-1e008a366d48', 'mimeType': 'text/plain', 'text': 'hello world'}
d.binary_str()
b'\n$6a1c7f34-aef7-11eb-b075-1e008a366d48R\ntext/plainj\x0bhello world'

Add Recursion to Document

Recursive Attributes

Document can be recursed both horizontally and vertically:

doc.chunks The list of sub-Documents of this Document. They have granularity + 1 but same adjacency
doc.matches The list of matched Documents of this Document. They have adjacency + 1 but same granularity
doc.granularity The recursion "depth" of the recursive chunks structure
doc.adjacency The recursion "width" of the recursive match structure

You can add chunks (sub-Document) and matches (neighbour-Document) to a Document:

  • Add in constructor:

    d = Document(chunks=[Document(), Document()], matches=[Document(), Document()])
  • Add to existing Document:

    d = Document()
    d.chunks = [Document(), Document()]
    d.matches = [Document(), Document()]
  • Add to existing doc.chunks or doc.matches:

    d = Document()
    d.chunks.append(Document())
    d.matches.append(Document())

Note that both doc.chunks and doc.matches return DocumentArray, which we will introduce later.
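
The relationship between chunks, matches, granularity and adjacency can be sketched with a small tree structure. The Node class below is a toy model written for this illustration (it is not Jina's Document, and add_chunk/add_match are hypothetical helpers); it only encodes the rule stated above: chunks get granularity + 1 with the same adjacency, matches get adjacency + 1 with the same granularity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    id: str
    granularity: int = 0
    adjacency: int = 0
    chunks: List['Node'] = field(default_factory=list)
    matches: List['Node'] = field(default_factory=list)

    def add_chunk(self, child: 'Node') -> 'Node':
        # one level deeper, same width
        child.granularity = self.granularity + 1
        child.adjacency = self.adjacency
        self.chunks.append(child)
        return child

    def add_match(self, other: 'Node') -> 'Node':
        # same depth, one step wider
        other.granularity = self.granularity
        other.adjacency = self.adjacency + 1
        self.matches.append(other)
        return other


root = Node('root')
chunk = root.add_chunk(Node('chunk'))
chunk_of_chunk = chunk.add_chunk(Node('chunk-of-chunk'))
match = root.add_match(Node('match'))

print(chunk_of_chunk.granularity, chunk_of_chunk.adjacency)  # 2 0
print(match.granularity, match.adjacency)                    # 0 1
```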

Represent Document as Dictionary or JSON

Any Document can be converted into a Python dictionary or a JSON string by calling its .dict() or .json() method.

import pprint
import numpy as np

from jina import Document

d0 = Document(id='🐲identifier', text='I am a Jina Document', tags={'cool': True}, embedding=np.array([0, 0]))
pprint.pprint(d0.dict())
pprint.pprint(d0.json())
{'embedding': {'dense': {'buffer': 'AAAAAAAAAAAAAAAAAAAAAA==',
                         'dtype': '<i8',
                         'shape': [2]}},
 'id': '🐲identifier',
 'mime_type': 'text/plain',
 'tags': {'cool': True},
 'text': 'I am a Jina Document'}
('{\n'
 '  "embedding": {\n'
 '    "dense": {\n'
 '      "buffer": "AAAAAAAAAAAAAAAAAAAAAA==",\n'
 '      "dtype": "<i8",\n'
 '      "shape": [\n'
 '        2\n'
 '      ]\n'
 '    }\n'
 '  },\n'
 '  "id": "identifier",\n'
 '  "mime_type": "text/plain",\n'
 '  "tags": {\n'
 '    "cool": true\n'
 '  },\n'
 '  "text": "I am a Jina Document"\n'
 '}')

As you can see, the output is quite noisy where the embedding is concerned. This is because a Jina Document stores embeddings in an inner structure backed by protobuf. To get a nicer representation of the embedding and of any other ndarray field, call dict and json with the option prettify_ndarrays=True.

import pprint
import numpy as np

from jina import Document

d0 = Document(id='🐲identifier', text='I am a Jina Document', tags={'cool': True}, embedding=np.array([0, 0]))
pprint.pprint(d0.dict(prettify_ndarrays=True))
pprint.pprint(d0.json(prettify_ndarrays=True))
{'embedding': [0, 0],
 'id': '🐲identifier',
 'mime_type': 'text/plain',
 'tags': {'cool': True},
 'text': 'I am a Jina Document'}

('{"embedding": [0, 0], "id": "identifier", "mime_type": '
 '"text/plain", "tags": {"cool": true}, "text": "I am a Jina Document"}')

This can be useful to understand the contents of the Document and to send to backends that can process vectors as lists of values.
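
If you are curious what the noisy dense structure holds, it can be decoded by hand. The sketch below assumes that 'buffer' is the base64-encoded raw memory of the array and that 'dtype' follows NumPy's notation ('<i8' meaning little-endian 64-bit integers); decode_dense is a hypothetical helper that handles only one-dimensional '<i8' arrays.

```python
import base64
import struct


def decode_dense(dense):
    """Decode a 1-D '<i8' dense ndarray dict into a Python list."""
    if dense['dtype'] != '<i8' or len(dense['shape']) != 1:
        raise ValueError('this sketch only handles 1-D <i8 arrays')
    count = dense['shape'][0]
    raw = base64.b64decode(dense['buffer'])
    # '<' = little-endian, 'q' = signed 64-bit integer
    return list(struct.unpack(f'<{count}q', raw))


embedding = decode_dense(
    {'buffer': 'AAAAAAAAAAAAAAAAAAAAAA==', 'dtype': '<i8', 'shape': [2]}
)
blob = decode_dense(
    {'buffer': 'AQAAAAAAAAACAAAAAAAAAAMAAAAAAAAA', 'dtype': '<i8', 'shape': [3]}
)
print(embedding, blob)
```

The first value is the embedding from the example above; the second is the blob printed earlier for Document(content=np.array([1, 2, 3])).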

Visualize Document

To better see a Document's recursive structure, you can use the .plot() method. If you are using JupyterLab/Notebook, all Document objects are rendered automatically:

import numpy as np
from jina import Document

d0 = Document(id='🐲', embedding=np.array([0, 0]))
d1 = Document(id='🐦', embedding=np.array([1, 0]))
d2 = Document(id='🐢', embedding=np.array([0, 1]))
d3 = Document(id='🐯', embedding=np.array([1, 1]))

d0.chunks.append(d1)
d0.chunks[0].chunks.append(d2)
d0.matches.append(d3)

d0.plot()  # simply `d0` on JupyterLab

Add Relevancy to Documents

Relevance Attributes

doc.score The relevance information of this Document
doc.evaluations The evaluation information of this Document

You can add a relevance score to a Document object via:

from jina import Document

d = Document()
d.score.value = 0.96
d.score.description = 'cosine similarity'
d.score.op_name = 'cosine()'
<jina.types.document.Document id=0a986c50-aeff-11eb-84c1-1e008a366d48 score={'value': 0.96, 'opName': 'cosine()', 'description': 'cosine similarity'} at 6281686928>

Score information is often used jointly with matches. For example, you often see the indexer adding matches as follows:

from jina import Document

# some query Document
q = Document()
# get match Document `m`
m = Document()
m.score.value = 0.96
q.matches.append(m)

DocumentArray API

A DocumentArray is a list of Document objects. You can construct, delete, insert, sort and traverse a DocumentArray like a Python list.

Methods supported by DocumentArray:

Python list-like interface __getitem__, __setitem__, __delitem__, __len__, insert, append, reverse, extend, __iadd__, __add__, __iter__, clear, sort
Persistence save, load
Advanced getters get_attributes, get_attributes_with_docs, traverse_flat, traverse

Construct DocumentArray

You can construct a DocumentArray from an iterable of Documents:

from jina import DocumentArray, Document

# from list
da1 = DocumentArray([Document(), Document()])

# from generator
da2 = DocumentArray((Document() for _ in range(10)))

# from another `DocumentArray`
da3 = DocumentArray(da2)

Persistence via save()/load()

To save all elements in a DocumentArray in a JSON line format:

from jina import DocumentArray, Document

da = DocumentArray([Document(), Document()])

da.save('data.json')
da1 = DocumentArray.load('data.json')

A DocumentArray can also be stored in binary format, which is much faster and yields a smaller file:

from jina import DocumentArray, Document

da = DocumentArray([Document(), Document()])

da.save('data.bin', file_format='binary')
da1 = DocumentArray.load('data.bin', file_format='binary')

Access Element

You can access a Document in the DocumentArray via integer index, string id or slice indices:

from jina import DocumentArray, Document

da = DocumentArray([Document(id='hello'), Document(id='world'), Document(id='goodbye')])

da[0]
# <jina.types.document.Document id=hello at 5699749904>

da['world']
# <jina.types.document.Document id=world at 5736614992>

da[1:2]
# <jina.types.arrays.document.DocumentArray length=1 at 5705863632>

Sort Elements

DocumentArray is a subclass of MutableSequence, so you can sort its elements in place with .sort(), just as you would a Python list:

from jina import DocumentArray, Document

da = DocumentArray(
    [
        Document(tags={'id': 1}),
        Document(tags={'id': 2}),
        Document(tags={'id': 3})
    ]
)

da.sort(key=lambda d: d.tags['id'], reverse=True)
print(da)

The call above sorts the elements of da in place by their tags['id'] value, in descending order:

<jina.types.arrays.document.DocumentArray length=3 at 5701440528>

{'id': '6a79982a-b6b0-11eb-8a66-1e008a366d49', 'tags': {'id': 3.0}},
{'id': '6a799744-b6b0-11eb-8a66-1e008a366d49', 'tags': {'id': 2.0}},
{'id': '6a799190-b6b0-11eb-8a66-1e008a366d49', 'tags': {'id': 1.0}}

Filter Elements

You can use Python's built-in filter() to filter elements in a DocumentArray object:

from jina import DocumentArray, Document

da = DocumentArray([Document() for _ in range(6)])

for j in range(6):
    da[j].score.value = j

for d in filter(lambda d: d.score.value > 2, da):
    print(d)
<jina.types.document.Document id=c5e588f4-b6b0-11eb-af83-1e008a366d49 score={'value': 3.0} at 5696708048>
<jina.types.document.Document id=c5e58958-b6b0-11eb-af83-1e008a366d49 score={'value': 4.0} at 5696705040>
<jina.types.document.Document id=c5e589b2-b6b0-11eb-af83-1e008a366d49 score={'value': 5.0} at 5696708048>

You can build a DocumentArray object from the filtered results:

from jina import DocumentArray, Document

da = DocumentArray([Document(weight=j) for j in range(6)])
da2 = DocumentArray(list(filter(lambda d: d.weight > 2, da)))

print(da2)
DocumentArray has 3 items:
{'id': '3bd0d298-b6da-11eb-b431-1e008a366d49', 'weight': 3.0},
{'id': '3bd0d324-b6da-11eb-b431-1e008a366d49', 'weight': 4.0},
{'id': '3bd0d392-b6da-11eb-b431-1e008a366d49', 'weight': 5.0}

Use itertools on DocumentArray

As DocumentArray is an Iterable, you can also use Python's built-in itertools module on it. This enables advanced "iterator algebra" on the DocumentArray.

For instance, you can group a DocumentArray by parent_id:

from jina import DocumentArray, Document
from itertools import groupby

da = DocumentArray([Document(parent_id=f'{i % 2}') for i in range(6)])
groups = groupby(sorted(da, key=lambda d: d.parent_id), lambda d: d.parent_id)
for key, group in groups:
    print((key, len(list(group))))
('0', 3)
('1', 3)
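
Note that the sorted() call in the example is essential: itertools.groupby only groups consecutive items with equal keys. The following sketch uses plain strings in place of Documents to show the difference.

```python
from itertools import groupby

# Same key pattern as above: '0', '1', '0', '1', '0', '1'
parent_ids = [str(i % 2) for i in range(6)]

# Without sorting, every change of key starts a new group.
unsorted_groups = [(k, len(list(g))) for k, g in groupby(parent_ids)]

# After sorting, equal keys are adjacent and form one group each.
sorted_groups = [(k, len(list(g))) for k, g in groupby(sorted(parent_ids))]

print(unsorted_groups)
print(sorted_groups)  # [('0', 3), ('1', 3)]
```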

Get Attributes in Bulk

DocumentArray implements powerful getters that let you fetch multiple attributes from the Documents it contains in one shot:

import numpy as np

from jina import DocumentArray, Document

da = DocumentArray([Document(id=1, text='hello', embedding=np.array([1, 2, 3])),
                    Document(id=2, text='goodbye', embedding=np.array([4, 5, 6])),
                    Document(id=3, text='world', embedding=np.array([7, 8, 9]))])

da.get_attributes('id', 'text', 'embedding')
[('1', '2', '3'), ('hello', 'goodbye', 'world'), (array([1, 2, 3]), array([4, 5, 6]), array([7, 8, 9]))]

This can be very useful when extracting a batch of embeddings:

import numpy as np

np.stack(da.get_attributes('embedding'))
[[1 2 3]
 [4 5 6]
 [7 8 9]]

Access nested attributes from tags

A Document has a tags field that holds a map-like structure of arbitrary, possibly nested, values.

from jina import Document

doc = Document(tags={'dimensions': {'height': 5.0, 'weight': 10.0}})

doc.tags['dimensions']
{'weight': 10.0, 'height': 5.0}

To make nested fields easy to reach, a Document lets you access them by joining the parts of the qualified attribute name with double underscores (__):

from jina import Document

doc = Document(tags={'dimensions': {'height': 5.0, 'weight': 10.0}})

doc.tags__dimensions__weight
10.0

This also lets you access nested metadata attributes in bulk from a DocumentArray.

from jina import Document, DocumentArray

da = DocumentArray([Document(tags={'dimensions': {'height': 5.0, 'weight': 10.0}}) for _ in range(10)]) 

da.get_attributes('tags__dimensions__height', 'tags__dimensions__weight')
[[5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0], [10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0]]
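
The double-underscore convention amounts to walking a nested mapping one key at a time. Here is a minimal sketch on plain dictionaries; get_nested and get_attributes are hypothetical helpers written for this illustration and do not reflect how Jina resolves the names internally.

```python
def get_nested(obj, qualified_name, sep='__'):
    """Resolve 'a__b__c' to obj['a']['b']['c']."""
    current = obj
    for part in qualified_name.split(sep):
        current = current[part]
    return current


def get_attributes(docs, *names):
    """Return one list of values per requested name."""
    return [[get_nested(d, name) for d in docs] for name in names]


doc = {'tags': {'dimensions': {'height': 5.0, 'weight': 10.0}}}
weight = get_nested(doc, 'tags__dimensions__weight')

docs = [doc for _ in range(3)]
heights, weights = get_attributes(
    docs, 'tags__dimensions__height', 'tags__dimensions__weight'
)
print(weight, heights, weights)
```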

DocumentArrayMemmap API

When your DocumentArray object contains a large number of Documents, holding it in memory can be very demanding. You may want to use DocumentArrayMemmap to alleviate this issue. A DocumentArrayMemmap stores all Documents directly on disk and keeps only a small lookup table in memory. This lookup table contains the offset and length of each Document, so it is much smaller than the full DocumentArray. Elements are loaded into memory on demand when they are accessed.

The following table shows the speed and memory consumption when writing and reading 50,000 Documents.

                 DocumentArrayMemmap   DocumentArray
Write to disk    0.62s                 0.71s
Read from disk   0.11s                 0.20s
Memory usage     20MB                  342MB
Disk storage     14.3MB                12.6MB
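
The offset-and-length lookup table can be illustrated in a few lines. The class below is a simplified sketch and not DocumentArrayMemmap itself: TinyOnDiskList is a hypothetical name, records are JSON-encoded dicts rather than Documents, and everything is kept in a single file.

```python
import json
import os
import tempfile


class TinyOnDiskList:
    """Append-only list: payloads live on disk, only the index is in memory."""

    def __init__(self, path):
        self._path = path
        self._index = []  # one (offset, length) pair per stored item
        open(path, 'ab').close()  # make sure the file exists

    def append(self, item):
        payload = json.dumps(item).encode('utf-8')
        with open(self._path, 'ab') as fp:
            offset = fp.seek(0, os.SEEK_END)  # current end of file
            fp.write(payload)
        self._index.append((offset, len(payload)))

    def __getitem__(self, i):
        offset, length = self._index[i]
        with open(self._path, 'rb') as fp:
            fp.seek(offset)
            return json.loads(fp.read(length))  # loaded on demand

    def __len__(self):
        return len(self._index)


with tempfile.TemporaryDirectory() as tmp:
    store = TinyOnDiskList(os.path.join(tmp, 'docs.bin'))
    store.append({'text': 'hello'})
    store.append({'text': 'world'})
    second = store[1]
    index_snapshot = list(store._index)

print(second, index_snapshot)
```

Only the small list of pairs stays in memory; each item is read from disk when it is indexed.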

Create DocumentArrayMemmap object

from jina.types.arrays.memmap import DocumentArrayMemmap

dam = DocumentArrayMemmap('./my-memmap')

Add Documents to DocumentArrayMemmap object

from jina.types.arrays.memmap import DocumentArrayMemmap
from jina import Document

d1 = Document(text='hello')
d2 = Document(text='world')

dam = DocumentArrayMemmap('./my-memmap')
dam.extend([d1, d2])

The dam object stores all future Documents in ./my-memmap, so there is no need to call save/load manually. In fact, the save/load methods are not available in DocumentArrayMemmap.

Clear a DocumentArrayMemmap object

To clear all contents in a DocumentArrayMemmap object, simply call .clear(). It will clean all content on disk.

Pruning

You may notice another method, .prune(), with similar semantics. The .prune() method is designed for "post-optimizing" the on-disk data structure of a DocumentArrayMemmap object; it can reduce on-disk usage.

Mutable sequence with "read-only" elements

The biggest caveat of DocumentArrayMemmap is that you cannot modify an element's attributes in place. Although the DocumentArrayMemmap itself is mutable, its elements are not. For example:

from jina.types.arrays.memmap import DocumentArrayMemmap
from jina import Document

d1 = Document(text='hello')
d2 = Document(text='world')

dam = DocumentArrayMemmap('./my-memmap')
dam.extend([d1, d2])

dam[0].text = 'goodbye'

print(dam[0].text)
hello

One can see the text field has not changed!

To update an existing Document in a DocumentArrayMemmap, you need to assign a new Document object to its position.

from jina.types.arrays.memmap import DocumentArrayMemmap
from jina import Document

d1 = Document(text='hello')
d2 = Document(text='world')

dam = DocumentArrayMemmap('./my-memmap')
dam.extend([d1, d2])

dam[0] = Document(text='goodbye')

for d in dam:
    print(d)
{'id': '44a74b56-c821-11eb-8522-1e008a366d48', 'mime_type': 'text/plain', 'text': 'goodbye'}
{'id': '44a73562-c821-11eb-8522-1e008a366d48', 'mime_type': 'text/plain', 'text': 'world'}
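
One way to understand this behaviour: if every access rebuilds the element from its stored bytes, then edits land on a temporary copy and are lost, while assignment rewrites the stored bytes. The sketch below illustrates that reasoning with JSON strings held in a list; it assumes elements are deserialized on each access, and SerializedList is a hypothetical class rather than Jina's implementation.

```python
import json


class SerializedList:
    """Stores items in serialized form; each access returns a fresh copy."""

    def __init__(self):
        self._rows = []

    def append(self, item):
        self._rows.append(json.dumps(item))

    def __getitem__(self, i):
        return json.loads(self._rows[i])  # new object on every access

    def __setitem__(self, i, item):
        self._rows[i] = json.dumps(item)  # replaces the stored form


rows = SerializedList()
rows.append({'text': 'hello'})

rows[0]['text'] = 'goodbye'       # modifies a temporary copy only
after_inplace = rows[0]['text']

rows[0] = {'text': 'goodbye'}     # rewrites the stored element
after_assign = rows[0]['text']

print(after_inplace, after_assign)  # hello goodbye
```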

Side-by-side vs. DocumentArray

Accessing elements in a DocumentArrayMemmap is almost the same as in a DocumentArray: you can use an integer or string index to access an element, loop over a DocumentArrayMemmap to get all Documents, and use get_attributes or traverse_flat for advanced traversal and bulk getters.

This table summarizes the interfaces of DocumentArrayMemmap and DocumentArray:

DocumentArrayMemmap DocumentArray
__getitem__, __setitem__, __delitem__ (int)
__getitem__, __setitem__, __delitem__ (string)
__getitem__, __setitem__, __delitem__ (slice)
__iter__
__contains__
__len__
append
extend
traverse_flat, traverse
get_attributes, get_attributes_with_docs
insert
reverse (inplace)
sort (inplace)
__add__, __iadd__
__bool__
__eq__
save, load ❌ unnecessary

Convert between DocumentArray and DocumentArrayMemmap

from jina import Document, DocumentArray
from jina.types.arrays.memmap import DocumentArrayMemmap

da = DocumentArray([Document(text='hello'), Document(text='world')])

# convert DocumentArray to DocumentArrayMemmap
dam = DocumentArrayMemmap('./my-memmap')
dam.extend(da)

# convert DocumentArrayMemmap to DocumentArray
da = DocumentArray(dam)

Maintaining Consistency via .reload()

Consider two DocumentArrayMemmap objects that share the same on-disk storage (./my-memmap in the example below) but sit in different processes or threads. After one of them writes, the other's view can become stale, because each DocumentArrayMemmap object keeps its own copy of the lookup table in memory. The .reload() method re-synchronizes the in-memory lookup table with what is on disk:

from jina.types.arrays.memmap import DocumentArrayMemmap
from jina import Document

d1 = Document(text='hello')
d2 = Document(text='world')

dam = DocumentArrayMemmap('./my-memmap')
dam2 = DocumentArrayMemmap('./my-memmap')

dam.extend([d1, d2])
assert len(dam) == 2
assert len(dam2) == 0

dam2.reload()
assert len(dam2) == 2

dam.clear()
assert len(dam) == 0
assert len(dam2) == 2

dam2.reload()
assert len(dam2) == 0
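
The staleness problem and its fix can be reproduced with two handles over one file. This is a simplified sketch, not DocumentArrayMemmap: Handle is a hypothetical class, records are stored as JSON lines, and reload() rebuilds the in-memory index by rescanning the file.

```python
import json
import os
import tempfile


class Handle:
    """One view of a shared JSON-lines file with a private in-memory index."""

    def __init__(self, path):
        self._path = path
        open(path, 'ab').close()  # make sure the file exists
        self._index = []
        self.reload()

    def reload(self):
        # Rebuild the (offset, length) table from what is on disk now.
        index, offset = [], 0
        with open(self._path, 'rb') as fp:
            for line in fp:
                index.append((offset, len(line)))
                offset += len(line)
        self._index = index

    def append(self, item):
        line = (json.dumps(item) + '\n').encode('utf-8')
        with open(self._path, 'ab') as fp:
            offset = fp.seek(0, os.SEEK_END)
            fp.write(line)
        self._index.append((offset, len(line)))  # only this handle knows

    def clear(self):
        open(self._path, 'wb').close()  # truncate the shared file
        self._index = []

    def __len__(self):
        return len(self._index)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'shared.jsonl')
    a, b = Handle(path), Handle(path)

    a.append({'text': 'hello'})
    a.append({'text': 'world'})
    stale = (len(a), len(b))      # b has not seen the writes yet

    b.reload()
    synced = (len(a), len(b))

    a.clear()
    stale_after_clear = (len(a), len(b))

    b.reload()
    synced_after_clear = (len(a), len(b))

print(stale, synced, stale_after_clear, synced_after_clear)
```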