Vectorization for prime fields #852
Conversation
great idea, we should do it because this change is needed on its own even if the vectorization strategy changes
I think this is the right direction
src/secret_sharing/array.rs
Outdated
};

#[derive(Clone, Debug, Eq, PartialEq)]
pub struct StdArray<V, const N: usize>([V; N]);
is it worth exploring the idea of using one array type for both, and adding interim structs that provide semantics to it? apologies for the stupid names, but something like `ByteValue<V, N>(Array<V, N>)` and `BitValue<WORDS>(Array<usize, WORDS>)`. It feels cleaner to implement `Add`/`Mul` on those rather than having two different array types
Are you proposing to implement `Add` and `Mul` on `Array`, or on `BitValue`/`ByteValue`? The problem with implementing them on `Array` is that we want to specialize them to bitwise operations for GF2 and arithmetic operations for any other field.

If they are on `BitValue` and `ByteValue`, what functionality do you see `Array` having? The other significant functionality on arrays right now is serialization and `FromRandom`. Serialization could be done generically, but I'm not sure deserialization can be (if we want to reduce values mod the prime), nor `FromRandom`.
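To illustrate the asymmetry, a minimal sketch with toy stand-in types (`Fp` and a hypothetical 31-element field, not the real codebase types): serializing an array can be written once by concatenating element encodings, while deserialization here needs field-specific knowledge to reduce out-of-range wire values mod the prime.

```rust
const P: u32 = 31; // toy prime, a stand-in for e.g. Fp32BitPrime's modulus

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Fp(u32);

// Generic shape: concatenate fixed-size element encodings.
fn serialize(arr: &[Fp]) -> Vec<u8> {
    arr.iter().flat_map(|x| x.0.to_le_bytes()).collect()
}

// Field-specific shape: each decoded value is reduced mod the prime,
// which is what makes a fully generic deserializer awkward.
fn deserialize(bytes: &[u8]) -> Vec<Fp> {
    bytes
        .chunks_exact(4)
        .map(|c| {
            let raw = u32::from_le_bytes(c.try_into().unwrap());
            Fp(raw % P) // reduce rather than trust the wire value
        })
        .collect()
}

fn main() {
    let arr = [Fp(3), Fp(30)];
    assert_eq!(deserialize(&serialize(&arr)), arr.to_vec());
    // An out-of-range wire value is reduced: 33 mod 31 = 2.
    assert_eq!(deserialize(&33u32.to_le_bytes()), vec![Fp(2)]);
    println!("ok");
}
```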
I'm actually not sure that we need both Gf2Array and the boolean arrays -- but I haven't tried to reconcile them yet.
for some reason I missed this message. The core idea was to avoid confusion and accidental misuse of `StdArray` as a trait bound for things that do not require all the arithmetic, just because it is convenient and used everywhere. Keeping the semantics of arrays as is (i.e. implementing traits for wrappers) feels cleaner, but I don't think I strongly oppose the proposed approach
src/secret_sharing/mod.rs
Outdated
const BITS: u32;

pub trait Vectorized<const N: usize>: SharedValue {
    type Message: Message + Clone;
curious what drives the requirement for it to be `Clone`?
Probably I had a `Clone` somewhere in one of my draft implementations of vectorized multiply. I'll try to get rid of this.
A couple of things make it hard to remove `Clone`:
- Currently the `send` function takes its argument by value, not by reference.
- We `clone` one of the arguments when what we really want is a ref-times-ref multiply.

Both of these are fixable, but I'm not sure if we want to combine them with these changes.
would it make it easier for now for `send` to take `B: Borrow<Message>`?
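For illustration, a minimal sketch of what a `Borrow`-based `send` could look like. The `Message` trait and `Channel` type here are hypothetical stand-ins, not the real IPA infrastructure; the point is just that `B: Borrow<M>` lets callers pass either an owned value or a reference, so no clone is forced.

```rust
use std::borrow::Borrow;

// Hypothetical stand-in for the real `Message` machinery.
trait Message {
    fn to_bytes(&self) -> Vec<u8>;
}

struct Channel {
    sent: Vec<Vec<u8>>,
}

impl Channel {
    // Taking `B: Borrow<M>` accepts either an owned `M` or `&M`,
    // so vectorized callers can send a reference without cloning.
    fn send<M: Message, B: Borrow<M>>(&mut self, msg: B) {
        self.sent.push(msg.borrow().to_bytes());
    }
}

#[derive(Clone)]
struct Share(u32);

impl Message for Share {
    fn to_bytes(&self) -> Vec<u8> {
        self.0.to_le_bytes().to_vec()
    }
}

fn main() {
    let mut ch = Channel { sent: Vec::new() };
    let s = Share(7);
    ch.send::<Share, _>(&s); // by reference: no clone needed
    ch.send::<Share, _>(s); // by value still works
    assert_eq!(ch.sent.len(), 2);
    assert_eq!(ch.sent[0], ch.sent[1]);
    println!("ok");
}
```

The turbofish is needed because `Borrow` alone does not pin down `M` (e.g. `&Share` is also `Borrow<&Share>`); in the real API the channel's message type would already be fixed.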
I have not reviewed everything. I first wanted to understand the general approach better. I got a bit confused about `SharedValueArray` and why we implement our own array type with operations on it. I was also wondering whether `Gf2Array` would play a similar role to e.g. the BA40 or a `Gf40Bit` type. See details below.
src/ff/curve_points.rs
Outdated
type Storage = CompressedRistretto;
type Array<const N: usize> = StdArray<RP25519, N>;
Is it necessary to define our own array type? Can't we use something like a generic array?
In the MPK meet up document we have specified:
`async fn convert_to_25519(..., input: [AdditiveShare<Gf2>; Fp25519::BITS]) -> AdditiveShare<Fp25519, N>`
I was hoping to use just a standard array over additive shares of a BA type, where the array length would be e.g. `Fp25519::BITS = 252`.
We have to use our own array type in order to provide an implementation of the arithmetic operation traits -- the coherence rules don't let us implement them for `[RP25519; N]`.

I think the restriction imposed by the coherence rules here is actually a good thing -- the danger in being able to define a fundamental operation like `Add` for arrays is that not everyone may agree on what the operation means for arrays. If we wrap the array, then the behavior of the operations is part of the definition of our wrapper type.
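A minimal sketch of the coherence constraint and the newtype workaround (simplified; the real `StdArray` carries more trait bounds and element types): `impl Add for [V; N]` is rejected because both `Add` and the array type are foreign to the crate, but a local wrapper makes the impl legal and pins down what `+` means.

```rust
use std::ops::Add;

// The orphan rule forbids `impl Add for [V; N]` here, because both `Add`
// and `[V; N]` are foreign. Wrapping the array in a local newtype makes
// the impl legal, and fixes the meaning of `+` for our arrays.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct StdArray<V, const N: usize>([V; N]);

impl<V: Add<Output = V> + Copy, const N: usize> Add for StdArray<V, N> {
    type Output = Self;

    fn add(self, rhs: Self) -> Self {
        // Elementwise addition, as in the PR's implementation.
        StdArray(std::array::from_fn(|i| self.0[i] + rhs.0[i]))
    }
}

fn main() {
    let a = StdArray([1u32, 2, 3]);
    let b = StdArray([10u32, 20, 30]);
    assert_eq!(a + b, StdArray([11, 22, 33]));
    println!("ok");
}
```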
src/ff/galois_field.rs
Outdated
@@ -131,7 +131,7 @@ fn clmul<GF: GaloisField>(a: GF, b: GF) -> u128 {
}
I think we only use Galois fields in the previous version of IPA; do we plan to parallelize this as well? I thought we just wanted to maintain compatibility, not add new features to it.
Maintaining compatibility is what I was going for here -- I don't foresee any immediate vectorization of the stuff that uses Galois fields larger than GF2, but if we want to keep the types in the codebase as `SharedValue`s, then they need to fulfill the trait definition.

(In my latest version, the `Array` associated type is no longer attached to `SharedValue`, for unrelated reasons -- but I still expect small changes to some of the types that new IPA doesn't use.)
src/ff/galois_field.rs
Outdated
@@ -567,6 +568,7 @@ bit_array_impl!(
    bit_array_40,
    Gf40Bit,
    U8_5,
    crate::secret_sharing::StdArray<Gf40Bit, N>,
I think if we want to perform efficient operations on Boolean values, we need `<GfNBit, 40>` rather than `<Gf40Bit, N>`. This would allow us to simultaneously access the first through the 40th bit of all N records and do e.g. an addition or conversion operation efficiently. Is this just for conversion purposes?
Implementing a parameterized `GfNBit` type with bit-level resolution of `N` is hard -- which is why the Galois fields and boolean arrays are implemented with macros.

But more fundamentally, I agree with the point that there are use cases for both (N, 40) and (40, N). This is the horizontal vs. vertical vectorization distinction.
    .recv_channel(role.peer(Direction::Left))
    .receive(record_id)
    .await?;
let left_d: F::Array<N> = <F as Vectorized<N>>::from_message(
Could the vector get too large for a single message? E.g., in the extreme case where we vectorize over 256 or 512 records and each record consists of 32-byte values, that would result in 8-16 kbyte messages.
I don't think there's anything about 8-16 kB messages that makes them unusable. But I do think that the incremental gains of vectorization will mostly disappear at some fraction of that.
Right now, it looks like each supported vectorization config (data type and width) will need to be listed explicitly, at least in a macro invocation, which means it won't be possible for derived parameters to inadvertently drift outside the sensible range.
src/secret_sharing/array.rs
Outdated
fn add(self, rhs: &'b StdArray<V, N>) -> Self::Output {
    //StdArray(self.0.iter().zip(rhs.0.iter()).map(|(a, b)| *a + *b).collect::<Vec<_>>().try_into().unwrap())
    StdArray(array::from_fn(|i| self.0[i] + rhs.0[i]))
Do we really need operations on an array? Could you elaborate on this?
This has the same role as the corresponding operation implemented on the BA* types. The BA types implement elementwise operations on an array of bits using bitwise AND and XOR. This is an implementation of elementwise operations on an array of Fp32BitPrime or Fp25519 or similar.
I assumed that we would pass an array without boolean operations to a function, and if we needed to add within the function, we would implement this explicitly. I am wondering: if I e.g. add x, y, z together, this would need two iterations. Would the compiler be able to optimize this and merge it into a single iteration? At least for Fp25519 types, there are actually not that many places where we need basic operations like addition. However, there are other operations, like mapping them to curve points, where we would also need to explicitly iterate over the vector within the function.
related, does the compiler currently unroll all the loops and rearrange assembly instructions for cache optimization?
Yes, the compiler will unroll these, map them to SIMD intrinsics, etc.
src/secret_sharing/array.rs
Outdated
// Introducing the `Eq` bound here is a wart. For some reason AdditiveShare is `Eq`, but
// `SharedValue` is not. Probably all of these should be Eq, or none of them.
impl<V: SharedValue + Eq, const N: usize> SharedValueArray<V> for StdArray<V, N> {
I thought that this was what we didn't want to do (and what I had proposed originally); the only difference seems to me that `SharedValue` is now `SharedValueArray`. Could you elaborate on the reason we need this?
There are a bunch of related issues here, and I'm not sure `SharedValueArray` is the key element of all of them -- I thought the major design decision was between something like `AdditiveShare<[T; N]>` and `AdditiveShare<T, N>`.

The purpose of `SharedValueArray` is to serve as the trait bound for the `Array` associated type. The associated type `SharedValue::Array` can't be constrained as a `SharedValue`, since that creates a cycle when trying to determine if something is a `SharedValue`.

In the current version, the `Array` associated type is no longer on `SharedValue`, so I guess it might be possible to merge the `SharedValueArray` type with `SharedValue`. I'm not sure I want to do that though -- it seems nice to have a distinction between "I'm working with a value" and "I'm working with a vector of values". There's also an analogous set of traits `Field` and `FieldArray` -- in that case there is the added concern that the array is not a field.

There is a broader issue of whether we want the APIs used to implement protocols to make vectorization explicit (`SharedValue<V, N>` and `*Array` traits that need to be referenced in vectorized code), or implicit (`SharedValue<[V; N]>` and the regular operations on `SharedValue` and `Field` become vectorized). I believe that making it explicit is less error-prone. Any time you use the API vectorization, it's definitely vectorization across records (vertical) -- and you can apply this vectorization to anything. Vectorization across values within a record (horizontal) needs to be coded explicitly in the protocol, since it requires knowledge of the structure of the data.

Some of this is also related to the question of what the differences are between BA*, Gf2Array, and GfNBit. I am hoping to have a proposal for that soon.
For the BA type, I was adding an Array trait (`CustomArray`, `ArrayAccess`). Would it be enough to distinguish a shared value of a single value from an array by checking whether the array trait has been implemented?
This is a circuitous answer, but we might be able to merge `ArrayAccess` and `SharedValueArray` into a single trait, and then I would say that yes, whatever the merged trait is can distinguish arrays from single values.

A difference between `ArrayAccess` and `SharedValueArray` is that `SharedValueArray` inherits a bunch of the arithmetic operation traits. Right now I believe some arithmetic operations are invoked on BAs via `Field` -- if the BAs stop implementing `Field`, what trait do you think they should get the arithmetic operations from? (There are also some explicit `std::ops::Not` bounds in the PRF IPA protocols, but naming individual arithmetic operations everywhere we need them gets tedious.)
Almost all arithmetic operations come from `SharedValue`, except multiplication (which currently comes from `Field`) and `std::ops::Not`. I think it makes sense to bundle them all together in an existing trait. I am fine moving forward with `SharedValueArray` (kind of as a merge of the previous `SharedValue` and `ArrayAccess`/`CustomArray`). For multiplication, and maybe the not operation, we might still need separate solutions, since we might not want to implement them for all `SharedValueArray`s. On the other hand, we could also optimize for bit arrays, since most of the operations are boolean arithmetic (so we could ask for `SharedValueArray` to implement `std::ops::Not`). Do you think there is anything that would still use `SharedValue` rather than `SharedValueArray<1>`?
src/secret_sharing/gf2_array.rs
Outdated
};

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Gf2Array<const N: usize>([usize; N]);
I was hoping to use the BA type for this, since it allows easy addition by just XORing the whole bit array. I saw you use bitslices; however, `Gf2Array` seems to be a usize slice. Couldn't we just use a bitslice here?
I think it ought to be possible to use BAs rather than have a separate `Gf2Array` type. I didn't put particular thought into using `[usize; N]` here; I probably did it because it was closest to the `[T; N]` used by `StdArray`. Bitslices ought to do.
Do you plan to store a single bit in each usize then? Or do you pack them and split them into usize chunks? We need to access each bit individually, but at the same time we want a compact representation in memory and network messages.
Codecov Report

@@ Coverage Diff @@
##             main     #852      +/-   ##
==========================================
+ Coverage   89.24%   90.03%   +0.79%
==========================================
  Files         181      182       +1
  Lines       24936    25460     +524
==========================================
+ Hits        22255    22924     +669
+ Misses       2681     2536     -145
This is pretty much good to go imo, I would just love to see a bit more documentation around arrays (`SharedValue`, `Field`).
AdditiveShare(self.0 + rhs.0, self.1 + rhs.1)
fn add(self, rhs: &'b AdditiveShare<V, N>) -> Self::Output {
    AdditiveShare(
        Add::add(self.0.clone(), &rhs.0),
I remember you mentioned it somewhere but I can't find it -- is there a reason why we need to clone here?
You may be thinking of the discussion about the clones in the `multiply` implementation, and the issue is the same -- this is dispatching via the `SharedValueArray` trait, which requires `Array: Sub<Array>` and `Array: Sub<&Array>`, but can't require `&Array: Sub<&Array>`, because it's not possible to express a bound on a reference to the type in a supertrait bound.

If we really wanted to avoid these clones, we could implement a custom trait that provides ref-by-ref arithmetic operations as an inherent method accepting two arguments (`trait MyAdd { fn add(a: &Self, b: &Self) -> Self; }`). I feel like the compiler is probably doing a pretty good job of optimizing these clones, so it may not be worth the trouble to do that, but I don't have solid data to back that up.

We could also drop the ref-by-ref impls here, which would make it more explicit in the caller when a clone is taking place.
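A sketch of the `MyAdd` idea mentioned above, using a toy array type; the real `StdArray` and its bounds are more involved. Because the method takes both operands by reference, a generic caller can add borrowed arrays without cloning either one, sidestepping the inability to write `&Array: Add<&Array>` as a supertrait bound.

```rust
// Custom trait providing ref-by-ref addition, as sketched in the comment.
trait MyAdd {
    fn add(a: &Self, b: &Self) -> Self;
}

#[derive(Clone, Debug, PartialEq, Eq)]
struct StdArray<const N: usize>([u32; N]);

impl<const N: usize> MyAdd for StdArray<N> {
    fn add(a: &Self, b: &Self) -> Self {
        // Elementwise add over borrowed operands; nothing is cloned.
        StdArray(std::array::from_fn(|i| a.0[i].wrapping_add(b.0[i])))
    }
}

// A generic caller can accumulate without cloning the addend.
fn sum_in_place<T: MyAdd>(acc: &mut T, x: &T) {
    *acc = T::add(acc, x);
}

fn main() {
    let mut acc = StdArray([1, 2, 3]);
    let x = StdArray([10, 10, 10]);
    sum_in_place(&mut acc, &x);
    assert_eq!(acc, StdArray([11, 12, 13]));
    println!("ok");
}
```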
//! There are two sets of traits related to vectorization.
//!
//! If you are writing protocols, the trait of interest is `FieldSimd<N>`, which can be specified in
//! a trait bound, something like `F: Field + FieldSimd<N>`.
why `Field + FieldSimd<N>` and not just `FieldSimd<N>`?
It's related to the issue described on line 157 of this file.

Requiring `Field: FieldVectorizable<1>` means that existing protocols with a parameter `F: Field` can invoke the width-1 version of the vectorized multiply, without needing to update the trait bounds. But that means we can't require that `FieldSimd<N>: Field` or `FieldVectorizable<N>: Field`, because that would introduce a cycle.

I may be trying too hard to avoid touching all the trait bounds -- there is a similar issue around extracting u128 conversions from the `Field` trait, which I was going to address by creating a separate `ArbitrarilyLargeField` trait, but in the past few days I've come around to thinking it's cleaner and not really a bigger change to add a `U128Conversions` trait bound wherever necessary. Similar may be true here.
// We could define a `SharedValueSimd` trait that is the analog of this for `SharedValue`s, but
// there are not currently any protocols that need it.
pub trait FieldSimd<const N: usize>:
I am not sure I understand the purpose of this trait; it seems to exist for the purpose of grouping various trait bounds. Is that correct?
Exactly. It is like a convenience trait alias, meaning something like `trait MyTrait: LongListOfTraits {}` / `impl<T: LongListOfTraits> MyTrait for T`, which lets you write `T: MyTrait` instead of `T: LongListOfTraits`.

But in this case `LongListOfTraits` is not so much a long list of traits as a short list of traits with a complicated associated type bound. But the point is the same -- without `FieldSimd`, lots of protocols would need to say `F: Field + Vectorizable<N, Array = <Self as FieldVectorizable<N>>::ArrayAlias> + FieldVectorizable<N>`.
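A minimal, self-contained illustration of the trait-alias pattern described above, using the placeholder names from the comment (not real IPA traits): an empty trait with a supertrait list plus a blanket impl, so callers write one bound instead of repeating the whole list.

```rust
use std::fmt::Debug;
use std::ops::Add;

// The "long list" stands in for FieldSimd's supertraits and associated
// type bounds; the pattern is the same regardless of what's in the list.
trait LongListOfTraits: Clone + Debug + PartialEq + Add<Output = Self> + Sized {}

// Blanket impl: anything satisfying the list automatically gets the alias.
impl<T: Clone + Debug + PartialEq + Add<Output = T>> LongListOfTraits for T {}

// Protocol code can now use the short form in its bounds.
fn double<T: LongListOfTraits>(x: T) -> T {
    x.clone() + x
}

fn main() {
    assert_eq!(double(21u32), 42);
    println!("ok");
}
```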
// Supported vectorizations

impl FieldSimd<32> for Fp32BitPrime {}
clarifying question - is it for modulus conversion to be able to process all bits in parallel?
We discussed today, but to capture the answer: no, it's not to support any particular application; it's just the vectorization dimension I picked for the initial test cases in this PR. And the vectorization dimension const parameter of `FieldSimd` being 32 is unrelated to the fact that it's `Fp32BitPrime` -- the dimension could be anything within reason.
ipa-core/src/secret_sharing/mod.rs
Outdated
+ for<'a> Sub<&'a Self, Output = Self>
+ SubAssign<Self>
+ for<'a> SubAssign<&'a Self>
+ Message
I would imagine the reverse to be a better model to avoid cyclic dependencies. Secret sharing is a core module and infrastructure stuff already depends on it.
It's an overlapping trait impl problem. There can't be wildcard `Message` impls for both `V: SharedValue` and `A: SharedValueArray`. (There's also a problem with the type parameter of `SharedValueArray` being unconstrained in that impl, but that might be resolvable by changing it to an associated type.)

One strategy would be to implement `Message` for anything that is serializable, i.e. `impl<T: Debug + Send + Serializable + 'static> Message for T {}`. That makes it easier to inadvertently create a channel for some type that wasn't really meant to be sent over the network -- not sure how serious a concern that is.

Another possibility would be moving the `Message` trait down to the bottom layer and implementing it explicitly on each applicable type (fields, BAs, etc.), like `Vectorizable`.

Or maybe implementing `SharedValue` for `SharedValueArray`, if we're okay getting rid of trait `Block` and associated type `SharedValue::Store`, neither of which is used very extensively.
do you mind if I just create another trait, `Sendable`, and implement it for `SharedValue` and `SharedValueArray`? Then we can implement `Message` just for `Sendable` and be done with it. Any other hierarchy I feel may be confusing.
I'm not sure that is sufficient? `SharedValue` and `SharedValueArray` are traits, so `Sendable` would need to be implemented for `T: SharedValue(Array)?`, and the compiler will reject those impls as potentially overlapping, like it does with `Message`.

If the idea is to add `Sendable` to core rather than moving `Message` to core, then maybe implementing it explicitly on fields, BAs, etc. rather than for a type parameter helps?
I did some experiments with `Sendable` here: f1a3b52. I think we can make it work relatively easily, but the purpose of the `Message` trait in this case becomes vague.
That seems like a reasonable approach to me. I don't think `Sendable` and `Message` are distinguished by purpose, but one is used in the protocol code and the other is used in infrastructure code. I tweaked your commit a little bit and added it to this branch.
It looks good to me.
@@ -95,6 +101,8 @@ macro_rules! boolean_array_impl_small {

// TODO(812): remove this impl; BAs are not field elements.
Are we still considering resolving issue #812? Or should we consider it resolved, since we work with `FieldVectorizable` now anyway?
It will be resolved by the two branches we looked at on Friday:
https://github.com/andyleiserson/ipa/compare/bit-vectorization...andyleiserson:ipa:u128?expand=1
https://github.com/andyleiserson/ipa/compare/u128...andyleiserson:ipa:field-cleanup?expand=1
Semi-honest AdditiveShare can hold a vector of sharings instead of just one sharing. The semi-honest multiply can operate on these vectors.
Updated based on review comments.
Semi-honest AdditiveShare can hold a vector of sharings instead of just one sharing. The semi-honest multiply can operate on these vectors.
There are some comments in src/secret_sharing/mod.rs that may help provide an overall picture.

Notes
- `AdditiveShare` members #908; once that is merged the diff will be smaller (but still big).
- `Gf2Array` type, I have removed it in favor of using the BA types.
- `SharedValueArray` with `ArrayAccess`/`CustomArray`, because there's actually a significant difference between the two: protocol code shouldn't need to access individual elements of a `SharedValueArray` since the computation for each record is exactly the same.

Performance

Results from `cargo bench --no-default-features -F enable-benches,descriptive-gate --bench criterion_arithmetic` (with additional benchmark changes not in this PR). Throughput measured in Melem/s (millions of multiplies per second).

Configs are Width x Depth x Active work.
- xN: Vectorized N-wide
- ser: Serial
- parN: Parallel w/ N tokio worker threads
- tri: TotalRecords::Indeterminate (no buffering in OrderingSender)