Language: Go
MapleJuice is a MapReduce-style system built on two components: a membership protocol and a distributed file system. A centralized master server handles the maple and juice commands, maintains the system's file structure, and tracks the progress of maple (map) and juice (reduce) tasks. The system is designed to tolerate worker-node failures during either phase (maple or juice).
The nodes chosen for the specific maple or juice task.
- node2mapleJob: Tracks the maple jobs assigned to nodes, including which files are assigned for maple jobs and the keys generated by the node for aggregation. This data structure is crucial for handling failures.
- keyMapleIdMap: Tracks the files containing the specific keys.
- keyStatus: Tracks the status of a given key with states ONGOING, DONE, FAILED, indicating the completion status of the maple task.
- mapleBarrier: Indicates whether the file processing and key extraction stage of the maple task has been completed.
- keyTimeStamp: Records the submission time of the job for this key, aiding in the detection of soft failures in the system.
- node2juiceJob: Tracks the juice jobs (per key jobs) assigned to nodes, aiding in failure management.
- juiceCompMap: Indicates the status of the juiceID task for each intermediate file.
- juiceTimestamp: Records the submission time of the juice task, helping manage soft failures.
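As a rough illustration, the tracking structures above could be grouped into a single master-side struct. This is only a sketch: the field names follow the list, but every key and value type (node IDs as strings, mapleIDs as ints, the `MapleJob` record, the `Status` enum) is an assumption and not taken from the actual implementation.

```go
package main

import (
	"fmt"
	"time"
)

// Status mirrors the ONGOING / DONE / FAILED states named in the text.
type Status int

const (
	Ongoing Status = iota
	Done
	Failed
)

// MapleJob is a hypothetical record of the work handed to one node.
type MapleJob struct {
	Files []string // input files assigned for the maple phase
	Keys  []string // keys the node generated for aggregation
}

// MasterState groups the tracking maps. All types here are assumptions.
type MasterState struct {
	Node2MapleJob  map[string]*MapleJob // node ID -> maple work
	KeyMapleIDMap  map[string][]int     // key -> mapleIDs whose output holds the key
	KeyStatus      map[string]Status    // key -> aggregation status
	MapleBarrier   bool                 // true once file processing and key extraction finish
	KeyTimeStamp   map[string]time.Time // key -> submission time of its aggregation job
	Node2JuiceJob  map[string][]string  // node ID -> keys assigned for juice
	JuiceCompMap   map[string]Status    // intermediate file -> juice status
	JuiceTimestamp map[string]time.Time // juice task -> submission time
}

// NewMasterState allocates every map so callers can write to them directly.
func NewMasterState() *MasterState {
	return &MasterState{
		Node2MapleJob:  make(map[string]*MapleJob),
		KeyMapleIDMap:  make(map[string][]int),
		KeyStatus:      make(map[string]Status),
		KeyTimeStamp:   make(map[string]time.Time),
		Node2JuiceJob:  make(map[string][]string),
		JuiceCompMap:   make(map[string]Status),
		JuiceTimestamp: make(map[string]time.Time),
	}
}

func main() {
	s := NewMasterState()
	s.Node2MapleJob["node-1"] = &MapleJob{Files: []string{"input_0"}}
	s.KeyStatus["apple"] = Ongoing
	s.KeyTimeStamp["apple"] = time.Now()
	fmt.Println(len(s.Node2MapleJob), s.KeyStatus["apple"] == Ongoing)
}
```

A real master would also need a mutex (or a single owning goroutine) around this state, since updates arrive concurrently from many workers.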
- The master checks all files matching the input prefix and compiles them into a list.
- The master assigns each file a mapleID, assigns the task to nodes, and tracks the number of files processed.
- Upon receiving a file, the worker node runs the mapleExe, generates all keys, and updates the master.
- The master collects all keys from the worker nodes, begins key aggregation, assigns a key to a node for aggregation, and tracks the key's status.
- For key aggregation, the worker fetches the relevant files as directed by the master, aggregates them into a single file, adds that file to SDFS, and reports the key's status to the master.
- Upon receiving all keys from all nodes after aggregation, the master completes the MAPLE job.
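The first two maple steps (collecting files by prefix and handing each one a mapleID and a node) can be sketched as below. The text does not say how nodes are chosen, so the round-robin policy, the `MapleTask` type, and the function name `planMaple` are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// MapleTask is a hypothetical record pairing one input file with a worker.
type MapleTask struct {
	MapleID int
	File    string
	Node    string
}

// planMaple keeps the files that match prefix, numbers them with a mapleID,
// and spreads them over nodes round-robin (an assumed policy).
func planMaple(allFiles, nodes []string, prefix string) []MapleTask {
	if len(nodes) == 0 {
		return nil // no workers available
	}
	var inputs []string
	for _, f := range allFiles {
		if strings.HasPrefix(f, prefix) {
			inputs = append(inputs, f)
		}
	}
	sort.Strings(inputs) // deterministic mapleIDs regardless of listing order

	tasks := make([]MapleTask, 0, len(inputs))
	for i, f := range inputs {
		tasks = append(tasks, MapleTask{MapleID: i, File: f, Node: nodes[i%len(nodes)]})
	}
	return tasks
}

func main() {
	files := []string{"in_b", "other", "in_a", "in_c"}
	for _, t := range planMaple(files, []string{"n1", "n2"}, "in_") {
		fmt.Printf("mapleID=%d file=%s node=%s\n", t.MapleID, t.File, t.Node)
	}
}
```

The length of the returned slice gives the master the number of files it must see processed before key aggregation can begin.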
- Upon receiving the Juice task, the master finds all relevant files for the task.
- The master assigns each worker node a specific file to process (key-specific).
- The worker runs JuiceExe on the key-specific file and sends the output to the master.
- Upon receiving updates from all Juice tasks, the master completes the juice task.
- If a failure occurs before all files are processed, the master selects another node as a worker and reassigns the files of the failed nodes.
- If a required keyfile handled by a crashed node has not been aggregated, the master either:
  - Runs the complete maple task again, or
  - Reassigns the keys for aggregation to other nodes.
- Silent Failure: If no key-aggregation result is received before the timeout expires, the master reassigns the key to other nodes.
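A minimal sketch of the reassignment step on a node crash, assuming the master keeps a per-node record of assigned files and keys (as node2mapleJob is described as doing). The `MapleJob` type, the `reassign` function, and the round-robin choice of replacement nodes are hypothetical; the caller is assumed to pass a list of live nodes that excludes the failed one.

```go
package main

import "fmt"

// MapleJob is a hypothetical per-node record of assigned files and keys.
type MapleJob struct {
	Files []string
	Keys  []string
}

// reassign moves the failed node's files and keys onto the surviving nodes,
// round-robin. It returns false when no survivor can take the work.
func reassign(jobs map[string]*MapleJob, failed string, alive []string) bool {
	job, ok := jobs[failed]
	if !ok {
		return true // nothing was assigned to the failed node
	}
	if len(alive) == 0 {
		return false
	}
	delete(jobs, failed)

	// target returns (creating if needed) the job record of the i-th pick.
	target := func(i int) *MapleJob {
		node := alive[i%len(alive)]
		if jobs[node] == nil {
			jobs[node] = &MapleJob{}
		}
		return jobs[node]
	}
	for i, f := range job.Files {
		t := target(i)
		t.Files = append(t.Files, f)
	}
	for i, k := range job.Keys {
		t := target(i)
		t.Keys = append(t.Keys, k)
	}
	return true
}

func main() {
	jobs := map[string]*MapleJob{
		"n1": {Files: []string{"in_a", "in_b"}, Keys: []string{"apple"}},
		"n2": {Files: []string{"in_c"}},
	}
	ok := reassign(jobs, "n1", []string{"n2", "n3"})
	fmt.Println(ok, jobs["n2"].Files, jobs["n3"].Files, jobs["n2"].Keys)
}
```

Keeping the per-node record is what makes this cheap: the master knows exactly which work was lost without rescanning the whole job.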
- If a node fails, another node is selected to run its JuiceExe.
- If a keyJuice times out, the master reassigns the key to other nodes.
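The timeout-based reassignment described above (for both key aggregation and juice tasks) amounts to recording a submission time per task and periodically sweeping for tasks that are overdue. The sketch below assumes a hypothetical `Tracker` type; the actual system uses keyTimeStamp and juiceTimestamp for this, but their exact types and the sweep logic are not given in the text.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Tracker is a hypothetical helper that remembers when each task was
// submitted and whether it has finished.
type Tracker struct {
	timeout   time.Duration
	submitted map[string]time.Time
	done      map[string]bool
}

func NewTracker(timeout time.Duration) *Tracker {
	return &Tracker{
		timeout:   timeout,
		submitted: make(map[string]time.Time),
		done:      make(map[string]bool),
	}
}

// Submit records (or resets) the submission time of a task.
func (t *Tracker) Submit(key string, at time.Time) {
	t.submitted[key] = at
	t.done[key] = false
}

// MarkDone is called when a worker reports the task as finished.
func (t *Tracker) MarkDone(key string) { t.done[key] = true }

// Expired returns, in sorted order, the unfinished tasks whose age at
// time now exceeds the timeout. These are candidates for reassignment.
func (t *Tracker) Expired(now time.Time) []string {
	var late []string
	for key, at := range t.submitted {
		if !t.done[key] && now.Sub(at) > t.timeout {
			late = append(late, key)
		}
	}
	sort.Strings(late)
	return late
}

func main() {
	tr := NewTracker(5 * time.Second)
	start := time.Now()
	tr.Submit("apple", start)
	tr.Submit("pear", start)
	tr.MarkDone("pear")

	later := start.Add(6 * time.Second) // simulate the passage of time
	for _, key := range tr.Expired(later) {
		fmt.Println("reassigning", key)
		tr.Submit(key, later) // restart the clock for the new worker
	}
}
```

Passing the current time in as a parameter, instead of calling `time.Now()` inside `Expired`, keeps the sweep deterministic and easy to test.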
- Limit on Open Connections/Files: High parallelism caused crashes because too many file descriptors were open at once. This was mitigated by requiring a token from a fixed-size pool before opening a file or connection.
- Too Much Context Switching: An excessive number of goroutines caused the system to stall. This was resolved by capping parallelism with the same kind of limited token pool.
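Token-limited parallelism of the kind described above is commonly written in Go with a buffered channel acting as a counting semaphore. The following is a sketch of that general pattern, not the project's actual code; `runLimited` and its parameters are hypothetical names, and the peak-concurrency counter exists only to make the bound observable.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runLimited calls fn once per item while allowing at most maxTokens calls
// to run at the same time. It returns the peak concurrency it observed.
func runLimited(items []string, maxTokens int, fn func(string)) int {
	if maxTokens < 1 {
		maxTokens = 1 // a zero-capacity channel would block forever
	}
	tokens := make(chan struct{}, maxTokens)
	var wg sync.WaitGroup
	var mu sync.Mutex
	active, peak := 0, 0

	for _, it := range items {
		// Acquire before spawning, so the number of live goroutines
		// is bounded too, not just the number doing work.
		tokens <- struct{}{}
		wg.Add(1)
		go func(item string) {
			defer wg.Done()
			defer func() { <-tokens }() // release the token

			mu.Lock()
			active++
			if active > peak {
				peak = active
			}
			mu.Unlock()

			fn(item)

			mu.Lock()
			active--
			mu.Unlock()
		}(it)
	}
	wg.Wait()
	return peak
}

func main() {
	files := []string{"f0", "f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8", "f9"}
	peak := runLimited(files, 3, func(name string) {
		time.Sleep(10 * time.Millisecond) // stand-in for opening a file or connection
	})
	fmt.Println("peak concurrency:", peak)
}
```

Acquiring the token in the loop instead of inside the goroutine addresses both issues at once: open descriptors and runnable goroutines are each capped at `maxTokens`.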
For large numbers of keys, MapleJuice's performance matches Hadoop's, since both face resource contention and must batch requests. MapleJuice is generally faster, however, because its startup overhead is lower than Hadoop's; the difference is most visible on smaller datasets.