We define four main categories of coordinate systems:
- World coordinates: this is a fixed frame, anchored to a specific position in the 3D world and shared across the whole dataset. World X, Y coordinates can be directly translated to geo-referenced EPSG:6498 coordinates by adding a specific offset vector stored in `aerial.json`.
- Vehicle coordinates: this is a moving frame, anchored to the vehicle that captured the sensor data. The X, Y and Z axes point right, forward and up, respectively. Vehicle coordinates are stored in `ego_pose.json`.
- Sensor coordinates: these are sensor-specific moving frames, also anchored to the vehicle that captured the sensor data, but generally different from the vehicle coordinates. Transformations between sensor coordinates and vehicle coordinates are stored in `calibrated_sensor.json`.
  - For cameras, the sensor coordinate system follows the "OpenCV" convention, i.e. X, Y and Z point right, down and forward, respectively.
  - For point clouds, there is no general convention. The user should always interpret the stored point cloud files in the context of their specific sensor frame defined in `calibrated_sensor.json`.
- Object coordinates: these are object-specific frames, used to represent the 3D bounding box annotations stored in `sample_annotation.json`. For objects with a well-defined orientation (e.g. cars), the X, Y and Z axes point right, forward and up, respectively. The Z axis always points up, even for objects that have some form of central symmetry (e.g. support poles). The bounding box corners have coordinates [±W/2, ±L/2, ±H/2] in the object's frame of reference (see the sketch after this list).
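As a minimal illustration of the corner convention above (a sketch with our own function name and example sizes; width W, length L and height H are taken along X, Y and Z, respectively, as stated in the list):

```python
import numpy as np
from itertools import product

def box_corners_object_frame(W, L, H):
    """All eight corners [±W/2, ±L/2, ±H/2] of a box in its object frame."""
    return np.array(list(product([-W / 2, W / 2],
                                 [-L / 2, L / 2],
                                 [-H / 2, H / 2])))  # shape (8, 3)

# Example: a 2 m wide, 4.5 m long, 1.6 m tall car.
print(box_corners_object_frame(2.0, 4.5, 1.6))
```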
Transformations between coordinate systems are given as quaternion-vector pairs $(q, t)$, and always represent the transformation from the local frame to a more global frame, i.e. from object to world, from sensor to vehicle, from vehicle to world.

Given a roto-translation $(q_{A \to B}, t_{A \to B})$ from frame $A$ to frame $B$, we can transform points $p_A$ in $A$ coordinates to points $p_B$ in $B$ coordinates as:

$$p_B = R(q_{A \to B})\, p_A + t_{A \to B},$$

where the rotation matrix for a quaternion $q = (q_w, q_x, q_y, q_z)$ is given by:

$$R(q) = \left\lfloor Q(q)\, \bar{Q}(q)^\top \right\rfloor_{3 \times 3}, \qquad
Q(q) = \begin{bmatrix} q_w & -q_x & -q_y & -q_z \\ q_x & q_w & -q_z & q_y \\ q_y & q_z & q_w & -q_x \\ q_z & -q_y & q_x & q_w \end{bmatrix}, \qquad
\bar{Q}(q) = \begin{bmatrix} q_w & -q_x & -q_y & -q_z \\ q_x & q_w & q_z & -q_y \\ q_y & -q_z & q_w & q_x \\ q_z & q_y & -q_x & q_w \end{bmatrix},$$

where $Q(q)$ and $\bar{Q}(q)$ are the left and right quaternion multiplication matrices, and $\lfloor M \rfloor_{R \times C}$ is the bottom-right $R \times C$ sub-matrix of $M$.
The `sensor.json` table provides meta-data about the sensors used in Mapillary Metropolis. In particular, different sensors are described by their `modality` and `channel`. In the following we provide additional information on each modality, and list all channels available for each.
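As an illustration, a short sketch that groups channels by modality; it assumes `sensor.json` is a JSON list of records exposing `modality` and `channel` fields, which may differ from the actual file layout:

```python
import json
from collections import defaultdict

# Group channels by modality, assuming a list of {"modality": ..., "channel": ...} records.
with open("sensor.json") as f:
    sensors = json.load(f)

channels_by_modality = defaultdict(list)
for sensor in sensors:
    channels_by_modality[sensor["modality"]].append(sensor["channel"])

for modality, channels in channels_by_modality.items():
    print(modality, "->", ", ".join(sorted(channels)))
```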
Camera sensors produce RGB images, stored as JPGs in the `sweeps` folder. Each `sample` is guaranteed to have one equirectangular image, which should be regarded as the main source of truth for annotations. Possible channels are:
- `CAM_EQUIRECTANGULAR`: the main equirectangular image.
- `CAM_LEFT`, `CAM_RIGHT`, `CAM_FRONT`, `CAM_BACK`: optional perspective images, pointing in the four cardinal directions w.r.t. the ego-vehicle. These are obtained by warping the equirectangular image.
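When relating the equirectangular image to the warped perspective views, it helps to map equirectangular pixels to viewing directions. The sketch below uses one common parameterization (columns spanning 360° of longitude, rows spanning 180° of latitude) expressed in an OpenCV-style frame; the dataset's exact equirectangular convention is not documented here, so treat this purely as an assumption.

```python
import numpy as np

def equirect_pixel_to_ray(u, v, width, height):
    """Unit viewing direction for pixel (u, v) of an equirectangular image.

    Assumes columns cover longitudes [-pi, pi) (0 at the image center, i.e.
    straight ahead) and rows cover latitudes [-pi/2, pi/2], in an OpenCV-style
    frame: X right, Y down, Z forward. This parameterization is an assumption.
    """
    lon = (u / width - 0.5) * 2.0 * np.pi   # + to the right of the image center
    lat = (v / height - 0.5) * np.pi        # + towards the bottom of the image
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)
```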
Depth sensors produce depth maps, stored as 16-bit PNGs in the `samples` folder. These are obtained by re-projecting the multi-view stereo reconstruction. Possible channels are:
- `DEPTH_LEFT`, `DEPTH_RIGHT`, `DEPTH_FRONT`, `DEPTH_BACK`: depth maps corresponding to the perspective images defined in the previous section.
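Reading a depth map might look like the sketch below; the file path is hypothetical, and the conversion from 16-bit integer values to metric depth (the `DEPTH_SCALE` factor and the treatment of zeros as invalid) is an assumption, not a documented property of the format.

```python
import numpy as np
from PIL import Image

DEPTH_SCALE = 1.0 / 256.0  # metres per raw unit (assumed, not documented)

# Hypothetical path; use the actual file name referenced by the dataset metadata.
raw = np.asarray(Image.open("samples/DEPTH_FRONT/example.png")).astype(np.float32)
depth_m = raw * DEPTH_SCALE
depth_m[raw == 0] = np.nan  # treat zero as "no depth" (assumed convention)
```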
The multi-view stereo (MVS) sensor produces point clouds, stored as NPY files in the `samples` folder. Each sensor reading contains a slice of a large MVS reconstruction, centered around the corresponding ego-vehicle location. The only channel for this sensor is named `MVS`.
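Loading a slice and expressing it in world coordinates could look like the sketch below, reusing the `transform()` helper from the coordinate-systems section; the file path, the assumption that XYZ occupies the first three columns, and the placeholder poses are ours, with the real quaternion-vector pairs coming from `calibrated_sensor.json` and `ego_pose.json`.

```python
import numpy as np

# Hypothetical file path; use the actual file name referenced by the dataset metadata.
points_sensor = np.load("samples/MVS/example.npy")[:, :3]  # assumed XYZ in the first 3 columns

# Placeholder identity poses; the real quaternion-vector pairs come from
# calibrated_sensor.json (sensor -> vehicle) and ego_pose.json (vehicle -> world).
q_sensor_to_vehicle, t_sensor_to_vehicle = (1.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0)
q_vehicle_to_world, t_vehicle_to_world = (1.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0)

# transform() is the helper sketched in the coordinate-systems section above.
points_vehicle = transform(points_sensor, q_sensor_to_vehicle, t_sensor_to_vehicle)
points_world = transform(points_vehicle, q_vehicle_to_world, t_vehicle_to_world)
```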
Lidar sensors produce point clouds, stored as NPY files in the `samples` folder. Each sensor reading contains a slice of a large lidar scan, centered around the corresponding ego-vehicle location and re-aligned to match the corresponding multi-view stereo slice. Possible channels are:
- `LIDAR_MX2`: ground-level lidar data, captured by the same vehicle that collected the equirectangular images.
- `LIDAR_AERIAL`: aerial lidar data.
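Since the lidar slices are re-aligned to the MVS reconstruction, a quick sanity check is to measure nearest-neighbor distances between the two in a common frame. The sketch below assumes both slices have already been transformed to world coordinates (see the MVS sketch above) and saved to hypothetical intermediate files:

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical intermediate files holding (N, 3) world-frame points.
lidar_world = np.load("lidar_world.npy")
mvs_world = np.load("mvs_world.npy")

# Median distance from each lidar point to its nearest MVS point: a rough
# check of the re-alignment described above.
dist, _ = cKDTree(mvs_world).query(lidar_world, k=1)
print("median lidar->MVS distance [m]:", np.median(dist))
```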