
Load Data

Vivian Chu edited this page Nov 22, 2016 · 4 revisions

About

The utilities referenced in this tutorial take an HDF5 file and convert it into a Python dictionary, with values converted into NumPy structures for easy plotting and further use.

Warning: if not called with care (for example, without restricting which directories are loaded), these functions will try to load the entire HDF5 file into memory. With an extremely large file, this could use up all of your computer's RAM.

Using the functions

The main way to use this utility is to import the load_data function:

from data_logger_bag.load_h5_dataset import load_data
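Once imported, a minimal call might look like the sketch below. It assumes the data_logger_bag package is importable and that load_data follows the signature listed under Loading Data. The wrapper name load_h5_as_dict and its path check are illustrative only and are not part of the package.

```python
import os


def load_h5_as_dict(h5_path):
    """Load an .h5 file into a dictionary without writing a .pkl file.

    Hypothetical convenience wrapper; only load_data comes from the package.
    """
    # The loader expects a full path ending in .h5, so fail early otherwise.
    if not h5_path.endswith(".h5"):
        raise ValueError("expected a path ending in .h5, got: %s" % h5_path)
    full_path = os.path.abspath(os.path.expanduser(h5_path))

    # Imported here so the sketch can be defined without the package installed.
    from data_logger_bag.load_h5_dataset import load_data

    # output_filename must be given even when not saving; "" is acceptable.
    return load_data(full_path, "", False)
```

Note that this call loads the whole file, so the memory warning above applies.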

You can also run python load_h5_dataset.py <Path to h5 file> to see an example of the data being loaded. Warning: this will load the entire file. Line 211 contains import pdb; pdb.set_trace(), which pauses execution after the data has been loaded into the variable data. You can then inspect the data variable, which should be a Python dictionary. See external tutorials on Python dictionaries if you are unfamiliar with the data structure.
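When inspecting the loaded dictionary, a small helper that lists the key hierarchy can be easier to read than printing the whole structure. The sketch below is not part of load_h5_dataset.py. The function name summarize and the example keys are made up, and it assumes only that the result is a nested dictionary whose leaves may be NumPy arrays.

```python
def summarize(data, indent=0, max_depth=3):
    """Return one line per key describing the nested dictionary layout."""
    lines = []
    for key in sorted(data, key=str):
        value = data[key]
        pad = "  " * indent
        if isinstance(value, dict):
            lines.append("%s%s/" % (pad, key))
            # Only descend while we are above the requested depth.
            if indent + 1 < max_depth:
                lines.extend(summarize(value, indent + 1, max_depth))
        else:
            # NumPy arrays expose .shape; fall back to the type name otherwise.
            shape = getattr(value, "shape", None)
            if shape is not None:
                detail = "shape=%s" % (shape,)
            else:
                detail = type(value).__name__
            lines.append("%s%s: %s" % (pad, key, detail))
    return lines


# Stand-in for the dictionary returned by load_data (keys are made up).
example = {"defaultTask": {"defaultSkill": {"time": [0.0, 0.1], "force": [1, 2]}}}
print("\n".join(summarize(example)))
```

At the pdb prompt, the same helper could be applied to the real result with print("\n".join(summarize(data))).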

Loading Data

To load data, import or run load_h5_dataset.py. The file has a few specific functions, but the one that loads the data is load_data.

  • load_data takes several arguments:

    def load_data(input_filename, output_filename, save_to_file, directories=None, max_level=None)

Required inputs

  • input_filename is the HDF5 file to load. It expects the full path to the file, and the filename must end in .h5

  • output_filename is the .pkl file to which the loaded data is written in Python dictionary format. Saving is optional, but this argument must still be set to something for the function to run, even when not saving, so passing an empty string is perfectly fine

  • save_to_file is a boolean flag that controls whether the output .pkl file is written. True will write, False will not
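If save_to_file is True, the saved dictionary can later be read back without reopening the HDF5 file. The sketch below assumes the .pkl file was written with Python's standard pickle module, which this page does not state explicitly. The function name read_saved_dict is hypothetical. If the dictionary contains NumPy arrays, NumPy must be installed for unpickling to succeed.

```python
import pickle


def read_saved_dict(pkl_path):
    """Read back a dictionary that was previously written to a .pkl file."""
    with open(pkl_path, "rb") as handle:
        try:
            return pickle.load(handle)
        except UnicodeDecodeError:
            # Pickles written under Python 2 may need this when read in Python 3.
            handle.seek(0)
            return pickle.load(handle, encoding="latin1")
```

As with any pickle file, only load files from a source you trust.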

Optional inputs

There are optional inputs that can be used to speed up the loading process on the h5 files.

  • directories: This takes a list of strings naming the directories that the loader should search for. For example, directories=["defaultTask", "defaultSkill"] would search for and load only the data hierarchies in those folders. This reduces memory usage

  • max_level: The maximum number of directory levels to search. This reduces the time spent looking for the directories specified in directories. If you know that defaultTask and defaultSkill are located in the top two levels, then max_level can be set to 2. (TODO: verify whether the exact value should be x or x+1.)
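To illustrate how the two optional inputs interact, the sketch below applies a name filter and a depth limit to a plain nested dictionary standing in for the HDF5 hierarchy. It is not the loader's implementation. The function name find_groups and the example tree are made up, and it assumes top-level keys count as level 1, which may differ from how load_data counts levels.

```python
def find_groups(tree, wanted, max_level=None, level=1):
    """Collect sub-dictionaries whose key is in `wanted`, down to max_level.

    Assumed convention: top-level keys are level 1.
    """
    found = {}
    if max_level is not None and level > max_level:
        return found  # deeper than allowed: stop searching this branch
    for key, value in tree.items():
        if not isinstance(value, dict):
            continue  # leaves (datasets) are not directories
        if key in wanted:
            found[key] = value  # keep the whole matched hierarchy
            continue
        # Not a match: keep looking one level further down.
        deeper = find_groups(value, wanted, max_level, level + 1)
        for sub_path, sub_tree in deeper.items():
            found[key + "/" + sub_path] = sub_tree
    return found


# Made-up hierarchy standing in for the groups of an .h5 file.
tree = {
    "defaultTask": {"defaultSkill": {"time": [0.0, 0.1]}},
    "otherTask": {"defaultSkill": {"time": [0.2]}, "otherSkill": {"time": [0.3]}},
}
wanted = ["defaultTask", "defaultSkill"]
shallow = find_groups(tree, wanted, max_level=1)
deep = find_groups(tree, wanted, max_level=2)
print(sorted(shallow))
print(sorted(deep))
```

In this sketch a smaller max_level stops the search earlier, at the cost of missing matches that sit deeper in the tree. With the real loader, the equivalent call following the signature above would be load_data(input_filename, "", False, directories=["defaultTask", "defaultSkill"], max_level=2), subject to the unverified level counting.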
