Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: Make has_time_series faster #416

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

daniel-thom
Copy link
Contributor

@daniel-thom daniel-thom commented Dec 6, 2024

This PR should remain in draft until the performance in PSI is judged acceptable.

The existing code had a few issues that caused slowness when calling has_time_series or get_time_series:

  • SQLite indexes were not present when de-serializing a PSB case. I'm actually still not sure why. They were correctly set after to_json(sys) followed by System(file.json). The new code ensures that they are always set. This could have caused latency of up to 30-40 us on each query.
  • Calling SQLite.DBInterface.execute(db, "a sql query as string") has an overhead of ~3-4 us. The delay seems to come from compiling the SQL statement. I'll note that Python does not suffer from this latency. Interpolating strings causes additional latency. This new code caches the statements. It also avoids string interpolations, but only for the has_time_series code path. This is perhaps over-engineered. If this still isn't fast enough for PSI, we need to re-think the entire idea and allow the application to defined what time series attributes should be cached.
  • Calling get_time_series results in a SQL query to retrieve the time series metadata, which is stored as a JSON string, and then deserialize it into an object. That takes ~30 us. In most cases, this is harmless. Reading the data from the disk takes a lot longer than that. Some users might want to use the in-memory time series storage, and so might notice this delay. To solve that I split the metadata-as-json data into a separate table and added caching of the real metadata objects. De-serialization will only occur on the first read request.
  • The existing code has a bug that causes has_time_series(component, Deterministic) to return false if the component has DeterministicSingleTimeSeries. This PR fixes the bug.

The fact that we expect has_time_series(component, Deterministic) to return true when only DeterministicSingleTimeSeries is present causes extra latency. We could remove that latency by changing the requirement (make the user ask for the exact type stored) or add another column to the database table.

@daniel-thom daniel-thom force-pushed the fix/faster-has-time-series branch from 6c3dbab to e2b7959 Compare December 7, 2024 00:24
Copy link

codecov bot commented Dec 7, 2024

Codecov Report

Attention: Patch coverage is 93.24895% with 16 lines in your changes missing coverage. Please review.

Project coverage is 77.82%. Comparing base (cd78bac) to head (297a698).

Files with missing lines Patch % Lines
src/time_series_metadata_store.jl 93.17% 14 Missing ⚠️
src/utils/sqlite.jl 75.00% 2 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #416      +/-   ##
==========================================
+ Coverage   77.75%   77.82%   +0.06%     
==========================================
  Files          69       69              
  Lines        5481     5636     +155     
==========================================
+ Hits         4262     4386     +124     
- Misses       1219     1250      +31     
Flag Coverage Δ
unittests 77.82% <93.24%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/supplemental_attribute_associations.jl 92.99% <100.00%> (-0.30%) ⬇️
src/system_data.jl 91.28% <ø> (ø)
src/time_series_interface.jl 95.58% <100.00%> (ø)
src/time_series_manager.jl 96.69% <100.00%> (+0.82%) ⬆️
src/utils/sqlite.jl 73.17% <75.00%> (+0.44%) ⬆️
src/time_series_metadata_store.jl 93.50% <93.17%> (+<0.01%) ⬆️

... and 4 files with indirect coverage changes

@daniel-thom daniel-thom force-pushed the fix/faster-has-time-series branch from e2b7959 to 7329536 Compare December 7, 2024 01:08
@daniel-thom daniel-thom force-pushed the fix/faster-has-time-series branch from 85e42ee to bb3a3b3 Compare December 9, 2024 23:36
@daniel-thom daniel-thom force-pushed the fix/faster-has-time-series branch from bb3a3b3 to bfe1415 Compare December 10, 2024 00:42
@daniel-thom daniel-thom marked this pull request as draft December 10, 2024 01:24
@@ -567,7 +567,7 @@ function _transform_single_time_series!(
horizon = params.horizon,
time_series_type = DeterministicSingleTimeSeries,
scaling_factor_multiplier = get_scaling_factor_multiplier(metadata),
internal = get_internal(metadata),
internal = InfrastructureSystemsInternal(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous behavior was incorrect, but was harmless. It caused the metadata of the transformed object to have the same UUID, but no code ever looked at that.

@@ -110,7 +110,11 @@ function add_time_series!(mgr::TimeSeriesManager, batch::Vector{TimeSeriesAssoci
for uuid in new_ts_uuids
serialize_time_series!(mgr.data_store, time_series_uuids[uuid])
end
add_metadata!(mgr.metadata_store, owners, all_metadata)
if length(all_metadata) == 1
add_metadata!(mgr.metadata_store, owners[1], all_metadata[1])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is faster than calling the the array version.

@@ -23,14 +23,17 @@ const SUPPLEMENTAL_ATTRIBUTE_TABLE_NAME = "supplemental_attributes"

mutable struct SupplementalAttributeAssociations
db::SQLite.DB
# If we don't cache SQL statements, there is a cost of 3-4 us on every query.
cached_statements::Dict{String, SQLite.Stmt}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made some minor changes to the supplemental attribute associations, but did not profile them. If the time series changes work out, I can pursue these in greater detail.

# Caching compiled SQL statements saves 3-4 us per query query.
# DBInterface.jl does something similar with @prepare.
# We need this to be tied to our connection.
cached_statements::Dict{String, SQLite.Stmt}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This applies to selective queries that we run frequently.

# This caching allows the code to skip some string interpolations.
# It is experimental for PowerSimulations, which calls has_metadata frequently.
# It may not be necessary. Savings are less than 1 us.
has_metadata_statements::Dict{HasMetadataQueryKey, SQLite.Stmt}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cached applies only to the has_time_series path, which is the most important to optimize for PSI. And is possibly over-engineered.

@daniel-thom daniel-thom force-pushed the fix/faster-has-time-series branch from 47db714 to 297a698 Compare December 10, 2024 18:04
@jd-lara
Copy link
Member

jd-lara commented Dec 11, 2024

@daniel-thom now there is a workaround in PowerSimulations to handle this case "The fact that we expect has_time_series(component, Deterministic) to return true when only DeterministicSingleTimeSeries is present causes extra latency." and any impact of that can be addressed.

@daniel-thom
Copy link
Contributor Author

@daniel-thom now there is a workaround in PowerSimulations to handle this case "The fact that we expect has_time_series(component, Deterministic) to return true when only DeterministicSingleTimeSeries is present causes extra latency." and any impact of that can be addressed.

Currently, get_time_series(Deterministic, ...) will return a DeterministicSingleTimeSeries. has_time_series doesn't, and that is a bug. We need to make the behaviors consistent. Changing get_time_series could be a big API breakage for users. Let's test the performance on a large system before we make a decision.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants