JAPL bytecode specification

Productive2 edited this page Feb 26, 2021 · 7 revisions

Unfinished specification attempt for the JAPL bytecode.

Goals

  1. Well documented
  2. Fast to create and read in a single pass, so the whole file never needs to be in memory at once
  3. When a malloc would be necessary to implement a (de)serializer, always know the size first
  4. Should not prioritize portability - it is not meant to be a way of distributing code
  5. Should allow easy caching - unchanged source should not need recompiling
  6. This document should follow a rough grammar specification. A more precise grammar specification can be proposed later.

Lists

Every list is to be serialized the same way:

  1. A word indicating that it is a list
  2. Length of a single element of the list in number of bytes (type: uint32 memcopied)
    • The actual type depends on context, since the bytecode is well defined
  3. Maximum number of elements of the list (type: uint32 memcopied)
  4. All elements of the list (there can be fewer than the maximum number, but memory must be pre-allocated for the maximum number)
  5. A word indicating end of a list
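
The list layout above can be sketched as follows. This is only an illustration under stated assumptions: "memcopied" is taken to mean native byte order (hence the `=` prefix in `struct`), the word values 0x02 and 0x03 come from the Special words section, and the function names `write_list` and `read_list_header` are hypothetical. The reader only parses the header, because this page does not say how a reader detects that fewer elements than the maximum are present.

```python
import struct

START_OF_LIST = 0x02  # "Start of list" word from the Special words section
END_OF_LIST = 0x03    # "End of list" word from the Special words section


def write_list(elements, element_size, max_count):
    """Serialize fixed-size elements (bytes objects) in the list layout."""
    if len(elements) > max_count:
        raise ValueError("more elements than the declared maximum")
    out = bytearray([START_OF_LIST])
    # "=I" is a uint32 in native byte order with no padding, which is
    # assumed here to match a memcpy of the value.
    out += struct.pack("=I", element_size)
    out += struct.pack("=I", max_count)
    for element in elements:
        if len(element) != element_size:
            raise ValueError("element has the wrong size")
        out += element
    out.append(END_OF_LIST)
    return bytes(out)


def read_list_header(data):
    """Return (element_size, max_count, bytes_to_preallocate)."""
    if data[0] != START_OF_LIST:
        raise ValueError("not a list")
    element_size, max_count = struct.unpack_from("=II", data, 1)
    # Both sizes are known before any element is read, so a deserializer
    # could allocate the full buffer up front (goal 3).
    return element_size, max_count, element_size * max_count
```

For example, `read_list_header(write_list([b"a", b"b"], 1, 4))` reports an element size of 1, a maximum of 4 elements and 4 bytes to pre-allocate.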

Sets

The type of the set comes from the scope containing the set. A set is a list, where every element is malloced on its own, so variable length objects can be serialized. It is serialized this way:

  1. A word indicating that it is a set
  2. All elements in the set, each in the scope "set element"
  3. A word indicating end of the set
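
A minimal sketch of the set layout, assuming each element has already been serialized to bytes. The words 0x00, 0x01, 0x04 and 0x05 come from the Special words section. This page assigns no word to the "set element" scope, so `SET_ELEMENT = 0x9F` is a placeholder chosen only for illustration, and `write_set` is a hypothetical name.

```python
OPEN_SCOPE = 0x00
CLOSE_SCOPE = 0x01
START_OF_SET = 0x04
END_OF_SET = 0x05
SET_ELEMENT = 0x9F  # hypothetical: the page defines no word for "set element"


def write_set(serialized_elements):
    """Wrap each pre-serialized element in its own "set element" scope."""
    out = bytearray([START_OF_SET])
    for payload in serialized_elements:
        out += bytes([OPEN_SCOPE, SET_ELEMENT])
        out += payload  # elements may differ in length
        out += bytes([CLOSE_SCOPE, SET_ELEMENT])
    out.append(END_OF_SET)
    return bytes(out)
```

Unlike a list, no element size or count is written up front, which is what allows variable-length elements.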

Hashes

When "hash" is defined, it represents a sha256 hash (32 bytes of data).

SHA256 hashes are computed incrementally (during reading from the disk) and checked after the reading is finished.
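
Incremental hashing of this kind could look like the sketch below, which uses Python's standard `hashlib`. The function name `read_and_verify` and the `block_size` parameter are assumptions made for the example, not part of the format.

```python
import hashlib
import io


def read_and_verify(stream, expected_digest, block_size=4096):
    """Read a stream block by block, hashing as we go, then check the digest."""
    hasher = hashlib.sha256()
    blocks = []
    while True:
        block = stream.read(block_size)
        if not block:
            break
        hasher.update(block)  # the hash is updated during reading
        blocks.append(block)
    # The comparison only happens once reading has finished.
    if hasher.digest() != expected_digest:
        raise ValueError("SHA-256 mismatch")
    return b"".join(blocks)
```

A caller would pass an open binary file (or an `io.BytesIO` in tests) together with the 32-byte digest it expects.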

Scopes

Scopes are opened with the word "open", followed by what is being opened. They can then contain other scopes, lists, hashes or other data. They are closed with the word "close", followed by what is being closed. The information on what is being closed is only a sanity check for the (de)serializer, which should verify it and crash if it does not match, much like mismatched opening and closing tags in XML.

Scopes can be of two types. High level scopes can only contain other scopes. Low level scopes can only contain a single piece of data, which has a length and type defined by the scope word. Note that low level scopes can contain high level scopes indirectly, when their data type is a set.
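
The open/close pairing and its sanity check can be sketched as below. The function names are hypothetical, and `unwrap_scope` assumes the buffer holds exactly one complete scope; a streaming reader would instead derive the payload length from the scope word.

```python
OPEN_SCOPE = 0x00   # "Open scope" word
CLOSE_SCOPE = 0x01  # close-scope word


def wrap_scope(kind, payload):
    """Emit: open word, scope kind, payload, close word, scope kind."""
    return bytes([OPEN_SCOPE, kind]) + payload + bytes([CLOSE_SCOPE, kind])


def unwrap_scope(data, kind):
    """Return the payload, failing if open and close do not both name `kind`."""
    if len(data) < 4 or data[0] != OPEN_SCOPE or data[1] != kind:
        raise ValueError("scope does not open as expected")
    if data[-2] != CLOSE_SCOPE or data[-1] != kind:
        raise ValueError("closing word does not match the opened scope")
    return data[2:-2]
```

Nesting falls out naturally: `wrap_scope(0x20, wrap_scope(0x41, payload))` places a low level scope (0x41) inside the high level header scope (0x20).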

Special words

A word is a single byte indicating what will follow.

Data type delimiters (0x00-0x1F):

  • 0x00: Open scope
  • 0x01: Close scope
  • 0x02: Start of list
  • 0x03: End of list
  • 0x04: Start of set
  • 0x05: End of set

High level scopes (0x20-0x3F):

  • 0x20 Header
  • 0x21 Chunk
  • 0x22 Object
  • 0x23 Body

Low level scopes (0x40-0x9F):

Header low level scopes (0x40-0x5F)

  • 0x40 JAPL git commit
  • 0x41 JAPL version number
  • 0x42 Full path of the source file
  • 0x43 Hash of the source file
  • 0x44 Timestamp of bytecode serialization

Other low level scopes (0x60-0x9F)

  • 0x60 Object type (Object)
  • 0x61 Object identifier (Object)
  • 0x62 Chunk identifier (Chunk)
  • 0x63 Code (Chunk)
  • 0x64 Constants (Chunk)
  • 0x65 Line information (Chunk)
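
For reference, the words listed above can be collected into a single enumeration. The values and ranges are taken from the lists in this section; the identifier names and the `classify` helper are assumptions made for the sketch.

```python
from enum import IntEnum


class Word(IntEnum):
    # Data type delimiters (0x00-0x1F)
    OPEN_SCOPE = 0x00
    CLOSE_SCOPE = 0x01
    START_OF_LIST = 0x02
    END_OF_LIST = 0x03
    START_OF_SET = 0x04
    END_OF_SET = 0x05
    # High level scopes (0x20-0x3F)
    HEADER = 0x20
    CHUNK = 0x21
    OBJECT = 0x22
    BODY = 0x23
    # Header low level scopes (0x40-0x5F)
    GIT_COMMIT = 0x40
    VERSION = 0x41
    SOURCE_PATH = 0x42
    SOURCE_HASH = 0x43
    TIMESTAMP = 0x44
    # Other low level scopes (0x60-0x9F)
    OBJECT_TYPE = 0x60
    OBJECT_ID = 0x61
    CHUNK_ID = 0x62
    CODE = 0x63
    CONSTANTS = 0x64
    LINE_INFO = 0x65


def classify(byte):
    """Map a word byte to the range it falls in."""
    if 0x00 <= byte <= 0x1F:
        return "delimiter"
    if 0x20 <= byte <= 0x3F:
        return "high level scope"
    if 0x40 <= byte <= 0x5F:
        return "header low level scope"
    if 0x60 <= byte <= 0x9F:
        return "other low level scope"
    return "unassigned"
```

Because the ranges do not overlap, a reader can tell from a single byte whether a scope may contain other scopes or only one piece of data.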

Header

Should be contained in the scope "header". Should contain the following information (each is in its own scope):

  1. JAPL git commit (inserted during compilation of the compiler/VM; 20 bytes)
    • for debugging purposes only, not automatically compared.
  2. JAPL version number (list of char)
    • different version number means that the bytecode should be discarded.
  3. full path of the source file (list of char)
    • only as debugging information.
  4. hash of source file
    • a good way of checking for changed source files (hashing can be done while reading the source file from the disk).
    • can also identify a source file when the timestamp is recent, without relying on the name or full path, which can change without affecting the resulting bytecode (besides the header).
  5. timestamp of hash creation
    • if too old, it will not be used for caching.
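
The caching rules implied by the header fields could be combined as in the sketch below. Several details are assumptions: the timestamp is treated as seconds since the epoch, `max_age_seconds` is a hypothetical threshold (this page does not say how old is "too old"), and the `Header` class and `cache_is_usable` function are illustrative names only.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Header:
    git_commit: bytes   # 20 bytes, debugging only, never compared
    version: str
    source_path: str    # debugging only
    source_hash: bytes  # SHA-256 of the source file, 32 bytes
    timestamp: float    # assumed: seconds since the epoch


def cache_is_usable(header, current_version, source_bytes, now, max_age_seconds):
    """Decide whether cached bytecode may be reused for this source."""
    if header.version != current_version:
        return False  # a different version means the bytecode is discarded
    if now - header.timestamp > max_age_seconds:
        return False  # too old to be used for caching
    # Unchanged source is detected by hash, not by name or path.
    return header.source_hash == hashlib.sha256(source_bytes).digest()
```

Note that the git commit and the source path take no part in the decision, in line with their debugging-only role.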

Object

An object should be in the scope "object", which is a high level scope. It can contain the following scopes:

  • Object type scope (contains a single byte, which translates to the object types enum)
  • Object identifier (The length is equal to the length of a pointer on the platform)
  • TODO
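
A sketch of the two object scopes defined so far. The pointer length is obtained with `struct.calcsize("P")`, and the identifier is written in native byte order; both choices are assumptions, as are the function names. The values of the object types enum are not given on this page, so `type_tag` is just an arbitrary byte here.

```python
import struct
import sys

OPEN_SCOPE, CLOSE_SCOPE = 0x00, 0x01
OBJECT, OBJECT_TYPE, OBJECT_ID = 0x22, 0x60, 0x61
POINTER_SIZE = struct.calcsize("P")  # length of a pointer on this platform


def scope(kind, payload):
    return bytes([OPEN_SCOPE, kind]) + payload + bytes([CLOSE_SCOPE, kind])


def write_object(type_tag, identifier):
    """Serialize an object's type (one byte) and identifier (pointer-sized)."""
    ident = identifier.to_bytes(POINTER_SIZE, sys.byteorder)
    return scope(OBJECT,
                 scope(OBJECT_TYPE, bytes([type_tag])) +
                 scope(OBJECT_ID, ident))
```

Since the identifier length depends on the platform, bytecode written this way is not portable between pointer sizes, which is consistent with goal 4.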

Chunk

A chunk should be in the scope "chunk". It is a high level scope.

The scopes encoded are:

  • Chunk identifier
  • Constants
    • list of object identifiers (see Object)
  • Code
    • list of instructions (list of bytes)
  • Line information
    • list of lines for each instruction (list of bytes, FIXME)
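
Putting lists and scopes together, a chunk could be written as in the sketch below. It is deliberately incomplete: the line information scope is left out because its encoding is still marked FIXME, the chunk identifier is treated as opaque bytes since its format is not specified here, and the maximum element count of each list is simply set to the actual count. All function names are hypothetical.

```python
import struct

OPEN_SCOPE, CLOSE_SCOPE = 0x00, 0x01
START_OF_LIST, END_OF_LIST = 0x02, 0x03
CHUNK, CHUNK_ID, CODE, CONSTANTS = 0x21, 0x62, 0x63, 0x64
POINTER_SIZE = struct.calcsize("P")  # object identifiers are pointer-sized


def scope(kind, payload):
    return bytes([OPEN_SCOPE, kind]) + payload + bytes([CLOSE_SCOPE, kind])


def fixed_list(elements, element_size):
    """List layout: word, element size, maximum count, elements, word."""
    for element in elements:
        if len(element) != element_size:
            raise ValueError("element has the wrong size")
    header = struct.pack("=II", element_size, len(elements))  # native order
    return (bytes([START_OF_LIST]) + header +
            b"".join(elements) + bytes([END_OF_LIST]))


def write_chunk(identifier, constants, code):
    """constants: pointer-sized object identifiers; code: instruction bytes."""
    instructions = [bytes([op]) for op in code]
    return scope(CHUNK,
                 scope(CHUNK_ID, identifier) +
                 scope(CONSTANTS, fixed_list(constants, POINTER_SIZE)) +
                 scope(CODE, fixed_list(instructions, 1)))
```

The constants list stores only object identifiers; the objects themselves would live in the body's objects scope.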

Body

Should be in the scope "body". Must contain:

  • Objects scope (contains a set of objects)
  • Chunks scope (contains a set of chunks)

Bytecode

The main scope. Must contain:

  • Header
  • Body