Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Early return FileMappedDict initialisation when in readonly mode. #5

Open
wants to merge 5 commits into
base: gc_production_branch_do_not_push
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
28 changes: 12 additions & 16 deletions lib/prometheus/client/data_stores/direct_file_store.rb
Original file line number Diff line number Diff line change
Expand Up @@ -198,11 +198,7 @@ def initialize(filename, readonly = false)

if @used > 0

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea. However, it make usage of this object dangerous.
Basically, this means that if you init with readonly, the only method you can call is all_values
If you call anything else, you'll corrupt the file.

This happens to be fine right now, because in the only case where we use readonly, that's the only method we calll. But i'm worried of not being aware of this if we (or anyone else) ever modify code down the line, this makes it quite brittle / dangerous...

Can we have some sort of @file_map_loaded? instance var that gets set by read_all_values? Then, all the other methods would have to check it and call read_all_values if that var is false.

It's quite dirty, but it'd help with the memory allocation.
Alternatively, we can change all_values so that it doesn't call read_all_values, and uses @positions instead? It'll make its code more complex but accomplish what we're trying to do here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Was focused on the code path followed by exporting/formatting and didn't realise that other methods would break if @positions is not indexed at creation time.

On that point, usage the readonly flag might be a bit confusing here in that it still allows write operations to go through, but that's outside the topic of this PR.

So, I pushed a second commit that reverts my previous change (@positions is again indexed at creation time) and uses @positions in all_values to seek across all values. Things look good, but the peak memory reduction I got in the benchmark attached to this PR is not as much exciting as it was before:

# gc_production_branch_do_not_push
# Total allocated: 127.99 MB (2231891 objects)
# Total retained:  139.68 kB (609 objects)

# Second commit in this PR
Total allocated: 95.93 MB (1566004 objects)
Total retained:  139.51 kB (608 objects)

This is because, while not reading file content twice, we're still creating more short-living objects.

However, this second change leaves read_all_values used only at creation time, so it opens the doors to another optimisation: I pushed a 3rd commit that rewrites read_all_values in order to yield key-position pairs and avoid reading metric values at creation time.

This seems to bring numbers back to the original figures:

# gc_production_branch_do_not_push
# Total allocated: 127.99 MB (2231891 objects)
# Total retained:  139.68 kB (609 objects)

# Second commit in this PR
Total allocated: 95.93 MB (1566004 objects)
Total retained:  139.51 kB (608 objects)

# Third commit in this PR
Total allocated: 86.27 MB (1334209 objects)
Total retained:  140.18 kB (612 objects)

Copy link

@dmagliola dmagliola Apr 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this new version!
I have one further proposal. read_all_values is now only called from one place... Instead of accepting a block and yielding to it for each value... You could just populate @positions from inside read_all_values and remove that indirection.

(sorry for my super late review)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dmagliola pushed another commit with the changes you suggested. Would you mind having another pass?

CC @Sinjo

# File already has data. Read the existing values
with_file_lock do
read_all_values.each do |key, _, pos|
@positions[key] = pos
end
end
with_file_lock { populate_positions }
else
# File is empty. Init the `used` counter, if we're in write mode
if !readonly
Expand All @@ -213,10 +209,14 @@ def initialize(filename, readonly = false)
end
end

# Yield (key, value, pos). No locking is performed.
# Return a list of key-value pairs
def all_values
with_file_lock do
read_all_values.map { |k, v, p| [k, v] }
@positions.map do |key, pos|
@f.seek(pos)
value = @f.read(8).unpack('d')[0]
[key, value]
end
end
end

Expand Down Expand Up @@ -292,22 +292,18 @@ def init_value(key)
@positions[key] = @used - 8
end

# Yield (key, value, pos). No locking is performed.
def read_all_values
# Read position of all keys. No locking is performed.
def populate_positions
@f.seek(8)
values = []
while @f.pos < @used
padded_len = @f.read(4).unpack('l')[0]
encoded = @f.read(padded_len).unpack("A#{padded_len}")[0]
value = @f.read(8).unpack('d')[0]
values << [encoded.strip, value, @f.pos - 8]
key = @f.read(padded_len).unpack("A#{padded_len}")[0].strip
@positions[key] = @f.pos
@f.seek(8, :CUR)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooooh, this is so much better
Can we name this method populate_positions, or something like that?

Other than that, 👍 from me
I'd check with @Sinjo, though, it's pretty much his repo now :D

Btw, i'm assuming you've updated this repo to the latest master on the main repo, now that prometheus#95 is merged?
If not, probably a good idea to do that.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, i'm assuming you've updated this repo to the latest master on the main repo, now that prometheus#95 is merged? If not, probably a good idea to do that.

Nope, this PR is based on gc_production_branch_do_not_push of this forked repo (which in turn is just a copy of prometheus#95 I think?). Probably better to catchup with @Sinjo to understand if we want to get these changes tested in production.

end
values
end
end
end
end
end
end