Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Counted Set #132

Draft
wants to merge 9 commits into
base: main
Choose a base branch
from
Prev Previous commit
Next Next commit
Customize implementation of MultiSets.CountedSet.count
The custom implementation of count adds the multiplicities of the elements together, which is far more efficient than the default approach of iterating through the set manually.

underestimatedCount has also been customized: it exposes the count of the underlying dictionary, which is guaranteed to be less than or equal to the cardinality of the counted set itself. This may be used to avoid trapping for sets with very large multiplicities.
  • Loading branch information
Saklad5 committed Dec 4, 2021
commit 1c567ea60cb49826b4754ce80d6c1d184db85266
18 changes: 18 additions & 0 deletions Sources/MultiSets/CountedSet/CountedSet+Collection.swift
Original file line number Diff line number Diff line change
Expand Up @@ -75,6 +75,15 @@ extension CountedSet: Collection {
return keyPair.key
}

/// The number of elements in the set.
///
/// - Complexity: O(*k*), where *k* is the number of unique elements in the
/// set.
@inlinable
public var count: Int {
Int(rawValue.values.reduce(.zero, +))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine per the Collection protocol, but it seems like a missed opportunity.

What if we kept a running count of items in this set, updating it on every mutation?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storing the count would mean this is no longer an implicit data structure. I’m not really sure that’s a worthwhile tradeoff.

Is there precedent for doing that with similar data structures?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh definitely -- Set and Dictionary do it, for example, to prevent having to scan their bitmap to figure out their element count.

I don't think having to store one extra integer matters much in this case -- the hash table in the underlying dictionary already comes with way too much memory overhead to call this a truly implicit data structure. 🙈

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking it was an implicit data structure for a dictionary, but okay then.

}

@inlinable
public var startIndex: Index {
Index(storageIndex: rawValue.startIndex, position: 0)
Expand All @@ -84,4 +93,13 @@ extension CountedSet: Collection {
public var endIndex: Index {
Index(storageIndex: rawValue.endIndex, position: 0)
}

/// A value equal to the number of unique elements in the set.
///
/// - Complexity: O(*k*), where *k* is the number of unique elements in the
/// set.
@inlinable
public var underestimatedCount: Int {
rawValue.count
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is neat, but if we decide to add a running total instead of using reduce above, then we should return it here, too.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

underestimatedCount has always been incredibly unclear to me, in terms of its intended role in Collections. I don’t believe anything actually says it should be equivalent to count in such cases, so I figured it’d be a useful way to express the (less computationally expensive) number of unique elements.

If we change it to reference count, we should add another property with this value.

By the way, would it be possible to add a more efficient conditional implementation of Swift Algorithms’ Unique methods for CountedSet, in case anyone ends up using them together? I’m guessing the answer is no, but it’d be a nice-to-have.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

underestimatedCount has always been incredibly unclear to me, in terms of its intended role in Collections. I don’t believe anything actually says it should be equivalent to count in such cases, so I figured it’d be a useful way to express the (less computationally expensive) number of unique elements.

This is a good idea in general, and Collection does allow this! underestimatedCount is typically only used in contexts that are generic over Sequence, but that doesn't mean it's a bad idea to make it diverge from count -- it can still be quite beneficial.

By the way, would it be possible to add a more efficient conditional implementation of Swift Algorithms’ Unique methods for CountedSet, in case anyone ends up using them together? I’m guessing the answer is no, but it’d be a nice-to-have.

The way to do this right now would require Algorithms to import Collections, which would be a bad idea.

However, Swift does have support for cross-module overlays, which are modules that automatically get loaded whenever some module A and module B are both imported to a source file. (Apple's SDKs use this to add, e.g., MapKit-related functionality to SwiftUI when a file imports both of these modules.) If this functionality was cleaned up & promoted to a full-blown language feature, including SwiftPM support for defining cross-import modules, then we could use that to provide uniquing overloads that take CountedSets.

Another alternative is to define a set of additional collection protocols in a separate package, and have swift-collections and swift-algorithms both depend on that. This would let swift-algorithms provide generic overloads for specific algorithms to speed them up when possible.

I don't think we need to do this for CountedMultiset right now, but it's something to think about for later. (FWIW, I think we'll eventually want to define some sort of a dictionary protocol at least, and possibly a UniqueCollection protocol.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, the easiest way is to simply provide the uniquing methods directly on the counted multiset type, without importing Algorithms. This would work, but perhaps it would be even better to provide a uniqueMembers view instead, exposing the underlying dictionary's keys view.

(I'm not sure we need to do this though -- we are already exposing the storage dictionary, after all, so folks can simply access keys directly.)

}
}