BiocamIO: support sparse event based recordings #1446

Open · wants to merge 6 commits into base: master
133 changes: 126 additions & 7 deletions neo/rawio/biocamrawio.py
@@ -197,25 +197,32 @@ def open_biocam_file_header(filename):
     min_digital = experiment_settings["ValueConverter"]["MinDigitalValue"]
     scale_factor = experiment_settings["ValueConverter"]["ScaleFactor"]
     sampling_rate = experiment_settings["TimeConverter"]["FrameRate"]
+    num_frames = rf['TOC'][-1,-1]
+
+    wellID = None
     for key in rf:
         if key[:5] == "Well_":
+            wellID = key
             num_channels = len(rf[key]["StoredChIdxs"])
-            if len(rf[key]["Raw"]) % num_channels:
-                raise RuntimeError(f"Length of raw data array is not multiple of channel number in {key}")
-            num_frames = len(rf[key]["Raw"]) // num_channels
+            if "Raw" in rf[key]:
+                if len(rf[key]["Raw"]) % num_channels:
+                    raise RuntimeError(f"Length of raw data array is not multiple of channel number in {key}")
+                if num_frames != len(rf[key]["Raw"]) // num_channels:
+                    raise RuntimeError(f"Estimated number of frames from TOC does not match length of raw data array in {key}")
             break
-    try:
-        num_channels_x = num_channels_y = int(np.sqrt(num_channels))
-    except NameError:
+    if not wellID:
Contributor: In most of this code you just use key, so why do you need to set wellID = key? To be honest, if wellID is the better name, I would prefer to just write for wellID in rf and then use wellID wherever key appears.

         raise RuntimeError("No Well found in the file")
+    num_channels_x = num_channels_y = int(np.sqrt(num_channels))
     if num_channels_x * num_channels_y != num_channels:
         raise RuntimeError(f"Cannot determine structure of the MEA plate with {num_channels} channels")
     channels = 1 + np.concatenate(np.transpose(np.meshgrid(range(num_channels_x), range(num_channels_y))))
 
     gain = scale_factor * (max_uv - min_uv) / (max_digital - min_digital)
     offset = min_uv
-    read_function = readHDF5t_brw4
+    if "Raw" in rf[wellID]:
+        read_function = readHDF5t_brw4
+    elif "EventsBasedSparseRaw" in rf[wellID]:
+        read_function = readHDF5t_brw4_sparse
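As an aside on the channel-grid expression above, here is a minimal standalone sketch of what the meshgrid construction produces for a hypothetical 2x2 layout (toy size, not a real MEA):

```python
import numpy as np

num_channels_x = num_channels_y = 2  # hypothetical 2x2 MEA layout

# same expression as in open_biocam_file_header: build (x, y) index pairs,
# then shift to 1-based channel coordinates
channels = 1 + np.concatenate(
    np.transpose(np.meshgrid(range(num_channels_x), range(num_channels_y)))
)

print(channels.tolist())  # [[1, 1], [1, 2], [2, 1], [2, 2]]
```

Each row is one channel's 1-based (x, y) position on the plate, which is why the square-root check above must succeed first.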
Contributor: We should probably add a warning here that it is not good practice to fill in the gaps with random synthetic noise.

Contributor: Couldn't non-filled data interfere with spike detection algorithms based on peak detection? Since 3Brain data is stored unsigned, filling the gaps with zeros means they will be read as -2048 µV instead of sitting at the baseline.

Contributor: I think the best solution would be to fill with a value that, when scaled, returns 0 volts/microvolts. So this is a good point that we should keep in mind.

Contributor (author): I agree that the default should return something scaled to 0 volts, but I suggest changing it from the current value of 2048 to something like the mean of the known reads. In particular, the value 2048 fails for channels used as trackers (usually channel 1-1 for calibration events), because that channel is only zeroes and sometimes 4095 when there is an event.

Contributor: Yeah, I don't know the channels well since I don't use this. This is the problem with trying to decide how to fill gaps. It sounds reasonable that this needs to be addressed on a per-stream basis.

Contributor: I'm not exactly sure how the 3Brain compression algorithm handles the calibration channel, or whether it's treated differently, but in the current configuration and with default parameters channel 1-1 reads as something like [2048], [2048], [2048], [2048], [4095], [4095], [4095], [2048], [2048], [2048], [2048], and once signed as [0], [0], [0], [0], [2047], [2047], [2047], [0], [0], [0], [0]. I don't know if any pipelines retain that channel downstream for spike sorting, but even if they do, it's scaled with the rest of the channels.

Seeing as the 12-bit signed conversion effectively applies a flat -2048 to everything, I think having something recording-independent might be more reliable.
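Building on the suggestion in this thread, here is a sketch of deriving a recording-independent fill value that scales to 0 µV, using the gain/offset already computed in open_biocam_file_header. It assumes the usual physical = digital * gain + offset mapping; the header numbers below are illustrative, not taken from a real file:

```python
# hypothetical header values for a 12-bit unsigned BioCAM recording
max_uv, min_uv = 4125.0, -4125.0
max_digital, min_digital = 4095, 0
scale_factor = 1.0

# same conversion as in open_biocam_file_header
gain = scale_factor * (max_uv - min_uv) / (max_digital - min_digital)
offset = min_uv

# digital value whose scaled voltage is 0 uV: solve digital * gain + offset == 0
zero_uv_fill = round(-offset / gain)
print(zero_uv_fill)  # 2047 or 2048: the true midpoint is 2047.5, one LSB away
```

With a symmetric voltage range this lands within one LSB of the 2048 discussed above, but computing it from the header rather than hard-coding it keeps the fill correct for non-symmetric ranges too.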


     return dict(
         file_handle=rf,

@@ -249,3 +256,115 @@ def readHDF5t_brw4(rf, t0, t1, nch):
    for key in rf:
        if key[:5] == "Well_":
            return rf[key]["Raw"][nch * t0 : nch * t1].reshape((t1 - t0, nch), order="C")


def readHDF5t_brw4_sparse(rf, t0, t1, nch):
    useSyntheticNoise = True
Contributor: I'm not quite sure what the benefit of this is. If it is always true, why do you need the if statement? You're not actually checking anything.

Contributor: use_synthetic_noise (default False) should be exposed as an argument.

Contributor: 3Brain's code example is meant as a simple demo script, not to be used as an API, which is why a lot is done explicitly like this. I'm not sure how far up the chain this argument would need to go; I haven't personally used a reader that had such options before.

Contributor: From a neo perspective, especially for something like this, we would want the user to have a say (although, as Alessio said, a default of False should be used, because Neo is not really meant to create data; our goal is just to supply the underlying data).

    noiseStdDev = None
    startFrame = t0
    numFrames = t1 - t0
    for key in rf:
        if key[:5] == "Well_":
            wellID = key
            break
    # initialize an empty (fill with zeros) data collection
    data = np.zeros((nch, numFrames), dtype=np.int16)
    # fill the data collection with Gaussian noise if requested
    if useSyntheticNoise:
        generateSyntheticNoise(rf, data, wellID, startFrame, numFrames, stdDev=noiseStdDev)
Contributor: I don't know what others think, but I'm personally not a huge fan of mutating the variable in place (even though I do it sometimes myself). I think it is harder to debug.

Contributor: I agree, especially with a name as generic as data floating around in the namespace. From my own fork (for reference):

    if synth:
        arr = GenerateSyntheticNoise_arr(rf, wellID, t0, t1, numFrames, nch)
    else:
        arr = np.zeros((numFrames, nch))

Contributor: I agree that I like something like this better (although see the note about zeros below as well).

    # fill the data collection with the decoded event based sparse raw data
    decodeEventBasedRawData(rf, data, wellID, startFrame, numFrames)
    return data.T
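A note on the shape contract here: the array is assembled channel-major as (nch, numFrames) and transposed on return, so the sparse reader yields the same frame-major (t1 - t0, nch) shape as readHDF5t_brw4. A toy illustration with made-up numbers:

```python
import numpy as np

t0, t1, nch = 10, 14, 3  # toy request: 4 frames, 3 channels
data = np.zeros((nch, t1 - t0), dtype=np.int16)  # filled channel by channel
data[1, 2] = 42  # e.g. one decoded sample on channel 1 at frame t0 + 2

out = data.T  # frame-major, as returned by the reader
print(out.shape)  # (4, 3)
print(out[2, 1])  # 42
```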


def decodeEventBasedRawData(rf, data, wellID, startFrame, numFrames):
Contributor: Also private, right? Or is this one public? I would argue that public functions should aim for PEP8 compliance (snake_case, not camelCase).

    # Source: Documentation by 3Brain
    # https://gin.g-node.org/NeuralEnsemble/ephy_testing_data/src/master/biocam/documentation_brw_4.x_bxr_3.x_bcmp_1.x_in_brainwave_5.x_v1.1.3.pdf
    # collect the TOCs
Contributor: @mahlzahn, although this is in the 3Brain docs, we think the code could really use a more pythonic rewrite! Would you have time to give it a go?

Contributor: @b-grimaud any interest in helping rewrite this portion in a pythonic way?

Contributor: Are you referring to the function as a whole?

I'm pretty convinced there's a more readable and efficient way to parse bytes than one big while loop, but the way the data is structured makes it tricky.

I gave it a try last week but couldn't get anything functional out of it yet; I'll try again this week.
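As one possible direction for the rewrite discussed above (a sketch, not part of this PR; the name iter_channel_ranges is hypothetical): the paired int.from_bytes calls can become struct.unpack_from, and the run of int16 samples can be read with np.frombuffer, which removes the innermost Python loop while keeping the same byte layout:

```python
import struct
import numpy as np

def iter_channel_ranges(binaryData):
    """Yield (chIdx, fromInclusive, toExclusive, samples) from an
    EventsBasedSparseRaw-style buffer (same layout as the while loop below)."""
    buf = bytes(binaryData)
    pos, end = 0, len(buf)
    while pos < end:
        # channel header: two little-endian int32 (channel index, payload length)
        chIdx, chDataLength = struct.unpack_from("<ii", buf, pos)
        pos += 8
        chEnd = pos + chDataLength
        while pos < chEnd:
            # range header: two little-endian int64 (frame bounds)
            fromInclusive, toExclusive = struct.unpack_from("<qq", buf, pos)
            pos += 16
            n = toExclusive - fromInclusive
            # the int16 sample run in one call instead of a per-sample loop
            samples = np.frombuffer(buf, dtype="<i2", count=n, offset=pos)
            pos += 2 * n
            yield chIdx, fromInclusive, toExclusive, samples
```

The caller would then slice each yielded range against [startFrame, startFrame + numFrames) and write it into the output array, as the loop below does sample by sample.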

    toc = np.array(rf["TOC"])
    eventsToc = np.array(rf[wellID]["EventsBasedSparseRawTOC"])
    # from the given start position and duration in frames, localize the corresponding
    # event positions using the TOC
    tocStartIdx = np.searchsorted(toc[:, 1], startFrame)
    tocEndIdx = min(np.searchsorted(toc[:, 1], startFrame + numFrames, side="right") + 1,
                    len(toc) - 1)
    eventsStartPosition = eventsToc[tocStartIdx]
    eventsEndPosition = eventsToc[tocEndIdx]
    # decode all data for the given well ID and time interval
    binaryData = rf[wellID]["EventsBasedSparseRaw"][eventsStartPosition:eventsEndPosition]
    binaryDataLength = len(binaryData)
    pos = 0
    while pos < binaryDataLength:
        chIdx = int.from_bytes(binaryData[pos:pos + 4], byteorder="little", signed=True)
        pos += 4
        chDataLength = int.from_bytes(binaryData[pos:pos + 4], byteorder="little", signed=True)
        pos += 4
        chDataPos = pos
        while pos < chDataPos + chDataLength:
            fromInclusive = int.from_bytes(binaryData[pos:pos + 8], byteorder="little", signed=True)
            pos += 8
            toExclusive = int.from_bytes(binaryData[pos:pos + 8], byteorder="little", signed=True)
            pos += 8
            rangeDataPos = pos
            for j in range(fromInclusive, toExclusive):
                if j >= startFrame + numFrames:
                    break
                if j >= startFrame:
                    data[chIdx][j - startFrame] = int.from_bytes(
                        binaryData[rangeDataPos:rangeDataPos + 2], byteorder="little", signed=True)
                rangeDataPos += 2
            pos += (toExclusive - fromInclusive) * 2
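The TOC lookup at the top of this function can be illustrated with toy arrays (values invented for the example, not from a real file): column 1 of TOC holds cumulative frame counts per chunk, so searchsorted locates the chunks overlapping the requested window, and EventsBasedSparseRawTOC maps those chunk indices to byte offsets in the events dataset:

```python
import numpy as np

# toy TOC: one row per chunk, column 1 = cumulative frame count at chunk end
toc = np.array([[0, 100], [0, 200], [0, 300], [0, 400]])
eventsToc = np.array([0, 64, 128, 256])  # byte offset of each chunk's events

startFrame, numFrames = 150, 100  # request frames [150, 250)

# same localization as in decodeEventBasedRawData
tocStartIdx = np.searchsorted(toc[:, 1], startFrame)
tocEndIdx = min(np.searchsorted(toc[:, 1], startFrame + numFrames, side="right") + 1,
                len(toc) - 1)

print(tocStartIdx, tocEndIdx)                        # 1 3
print(eventsToc[tocStartIdx], eventsToc[tocEndIdx])  # 64 256
```

Reading one extra chunk on each side (the +1 and the side="right") is what guarantees that ranges straddling a chunk boundary are still decoded.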


def generateSyntheticNoise(rf, data, wellID, startFrame, numFrames, stdDev=None):
Contributor: This seems like a private function?

Contributor: For the sake of readability on my own fork, I declared both this and the main parsing function at the base level of the file, then wrote a bare-minimum function like the others before:

    def readHDF5t_brw4_event_based(rf, t0, t1, nch):
        for key in rf:
            if key[:5] == "Well_":
                return DecodeEventBasedRawData_arr(rf, key, t0, t1, nch)

    # Source: Documentation by 3Brain
    # https://gin.g-node.org/NeuralEnsemble/ephy_testing_data/src/master/biocam/documentation_brw_4.x_bxr_3.x_bcmp_1.x_in_brainwave_5.x_v1.1.3.pdf
    # collect the TOCs
    toc = np.array(rf["TOC"])
    noiseToc = np.array(rf[wellID]["NoiseTOC"])
    # from the given start position in frames, localize the corresponding noise positions
    # using the TOC
    tocStartIdx = np.searchsorted(toc[:, 1], startFrame)
    noiseStartPosition = noiseToc[tocStartIdx]
    noiseEndPosition = noiseStartPosition
    for i in range(tocStartIdx + 1, len(noiseToc)):
        nextPosition = noiseToc[i]
        if nextPosition > noiseStartPosition:
            noiseEndPosition = nextPosition
            break
    if noiseEndPosition == noiseStartPosition:
        for i in range(tocStartIdx - 1, 0, -1):
            previousPosition = noiseToc[i]
            if previousPosition < noiseStartPosition:
                noiseEndPosition = noiseStartPosition
                noiseStartPosition = previousPosition
                break
    # obtain the noise info at the start position
    noiseChIdxs = rf[wellID]["NoiseChIdxs"][noiseStartPosition:noiseEndPosition]
    noiseMean = rf[wellID]["NoiseMean"][noiseStartPosition:noiseEndPosition]
    if stdDev is None:
Contributor: I don't see this being exposed? If the user will never be allowed to set it, why do we need this check?

Contributor: Also curious about this; I can't find it in the original 3Brain docs.

        noiseStdDev = rf[wellID]["NoiseStdDev"][noiseStartPosition:noiseEndPosition]
    else:
        noiseStdDev = np.repeat(stdDev, noiseEndPosition - noiseStartPosition)
    noiseLength = noiseEndPosition - noiseStartPosition
    noiseInfo = {}
    meanCollection = []
    stdDevCollection = []
    for i in range(1, noiseLength):
        noiseInfo[noiseChIdxs[i]] = [noiseMean[i], noiseStdDev[i]]
        meanCollection.append(noiseMean[i])
        stdDevCollection.append(noiseStdDev[i])
    # calculate the median mean and standard deviation of all channels to be used for
    # invalid channels
    dataMean = np.median(meanCollection)
    dataStdDev = np.median(stdDevCollection)
Contributor: This is a bit hard to parse (for the future: the median of the mean). Why not median_data_mean, so we don't do a double take later? I'm assuming that's just how the code you're porting was written, so this is less important; it just made me do a quick double take.

    # fill with Gaussian noise
    for chIdx in range(len(data)):
        if chIdx in noiseInfo:
            data[chIdx] = np.array(np.random.normal(noiseInfo[chIdx][0], noiseInfo[chIdx][1],
                                                    numFrames), dtype=np.int16)
        else:
            data[chIdx] = np.array(np.random.normal(dataMean, dataStdDev, numFrames),
                                   dtype=np.int16)
