Field operations for small prime moduli #1982

atrieu · 2024-11-05T16:32:50Z

I'm trying to use fiat-crypto to synthesize (bedrock2) C code for NTT operations as used in some PQ lattice-based cryptography, where the prime modulus fits in one word.
How difficult would it be to output "call-by-value" field operations, i.e., without the load/store opening/ending in functions ?
I tried looking into how to modify the pipeline, though I have a bit of a hard time understanding, so I would appreciate some help to get started.

JasonGross · 2024-11-06T02:35:46Z

The relevant entry-point for the bedrock2 pipeline is

fiat-crypto/src/Bedrock/Field/Stringification/Stringification.v

Lines 200 to 202 in 9bf41bc

    
           let out := translate_func 
        
                        e innames inlengths insizes outnames outsizes in 
        
           let f : string*func := (("internal_" ++ name)%string, fst out) in

fiat-crypto/src/Bedrock/Field/Stringification/Stringification.v

Lines 212 to 220 in 9bf41bc

    
           => inl ((header ++ ["static"] ++ bedrock_func_to_lines f) 
        
                     ++ ["/* NOTE: The following wrapper function is not covered by Coq proofs */" 
        
                         ; (((if inline then "__inline__ " else "") ++ (if static then "static " else "") ++ "void " ++ name ++ "(") 
        
                              ++ (String.concat ", " (C.to_retarg_list prefix outdata ++ C.to_arg_list_for_each_lhs_of_arrow prefix indata)) 
        
                              ++ ") {") 
        
                         ; "  " ++ (wrap_call indata outdata f insizes outsizes) ++ ";" 
        
                         ; "}" 
        
                         ; "" 
        
                        ]%string%list,

. As far as making translate_func avoid load/store when we only have one limb, I think @jadephilipoom or @andres-erbsen is better-placed to answer about the difficulty, since I haven't touched the bedrock2 code much.
There seems to be some documentation in https://github.com/mit-plv/fiat-crypto/blob/master/src/Bedrock/Field/README.md. translate_func itself is implemented at

fiat-crypto/src/Bedrock/Field/Translation/Func.v

Lines 45 to 89 in 9bf41bc

    
             (* Translates functions in 3 steps: 
        
                1) load any lists from the arguments 
        
                2) call translate_cmd to translate the function body and get the 
        
                   return values 
        
                3) store the return values in the designated variables (with 
        
                   lists being written into memory) 
        
               The reason for the load/translate/store pattern is so that, for 
        
               the inductive proof of translate_cmd, there is no need to reason 
        
               about the memory. Since fiat-crypto doesn't do any list 
        
               manipulations in the middle of functions, but only uses lists in 
        
               arguments/return values, it's a convenient formalization. *) 
        
             Definition translate_func {t} 
        
                        (e : API.Expr t) 
        
                        (* argument variables *) 
        
                        (argnames : type.for_each_lhs_of_arrow ltype t) 
        
                        (* lengths of argument lists *) 
        
                        (lengths : type.for_each_lhs_of_arrow list_lengths t) 
        
                        (* integer sizes of argument lists *) 
        
                        (argsizes : type.for_each_lhs_of_arrow access_sizes t) 
        
                        (* return variables *) 
        
                        (rets : base_ltype (type.final_codomain t)) 
        
                        (* integer sizes of return lists *) 
        
                        (retsizes : base_access_sizes (type.final_codomain t)) 
        
               : list string * list string * cmd (* bedrock function *) 
        
                 * list_lengths (type.base (type.final_codomain t)) (* output list lengths *) := 
        
               (* load arguments *) 
        
               let load_args_out := load_arguments 0%nat argnames lengths argsizes in 
        
               let nextn := fst (fst load_args_out) in 
        
               let args := snd (fst load_args_out) in 
        
               let load_args_cmd := snd load_args_out in 
        
               (* translate *) 
        
               let out := translate_func' (e _) nextn args in 
        
               (* store return values *) 
        
               let store_rets := store_return_values (snd (fst out)) rets retsizes in 
        
               (* make new arguments for pointers to returned lists *) 
        
               let part := extract_listnames rets in 
        
               let out_ptrs := flatten_listonly_base_ltype (fst part) in 
        
               let innames := out_ptrs ++ flatten_argnames argnames in 
        
               let outnames := flatten_listexcl_base_ltype (snd part) in 
        
               (* assemble executable body: load arguments, function body, store rets *) 
        
               let body := cmd.seq (cmd.seq load_args_cmd (snd out)) (snd store_rets) in 
        
               (* assemble func (arg varnames, return varnames, executable body), 
        
                  also add in output list lengths *) 
        
               (innames, outnames, body, fst store_rets).

Adjusting the wrapper code shouldn't be too hard; it's at

fiat-crypto/src/Bedrock/Field/Stringification/Stringification.v

Lines 116 to 147 in 9bf41bc

    
           Definition wrap_call 
        
             {width BW word mem locals ext_spec varname_gen error} 
        
             `{parameters_sentinel : @parameters width BW word mem locals ext_spec varname_gen error} 
        
                      {t} 
        
                      (indata : type.for_each_lhs_of_arrow var_data t) 
        
                      (outdata : base_var_data (type.final_codomain t)) 
        
                      (f : string*func) 
        
                      (insizes : type.for_each_lhs_of_arrow access_sizes t) 
        
                      (outsizes : base_access_sizes (type.final_codomain t)) 
        
             : string 
        
             := let part := extract_list_var_data outdata in 
        
                let out_ptrs := flatten_listonly_base_var_data (fst part) in 
        
                let innames := out_ptrs ++ flatten_argnames_of_var_data indata in 
        
                let outnames := flatten_listexcl_base_var_data (snd part) in 
        
                let '(name, (args, rets, _body)) := f in 
        
                (* slightly modified from https://github.com/mit-plv/bedrock2/blob/13365e8113131601a60dd6b6ffddc5a0b92aed58/bedrock2/src/bedrock2/ToCString.v#L151-L154 *) 
        
                let '(precall, all_args) 
        
                    := match rets, outnames with 
        
                       | nil, _ | _, nil => (name, outnames ++ innames) 
        
                       | cons _ _, cons r0 _ 
        
                         => let r0 := List.last outnames r0 in 
        
                            let outnames' := List.removelast outnames in 
        
                            (("*" ++ print_name r0 ++ " = " ++ print_cast r0 ++ name)%string, outnames' ++ innames) 
        
                       end%list in 
        
                ((precall ++ "(") 
        
                   ++ (String.concat 
        
                         ", " 
        
                         (List.map 
        
                            (** TODO: Is there a better way to do this, e.g., pulling from insizes? *) 
        
                            (fun n => "(uintptr_t)" ++ print_name n) 
        
                            all_args)) 
        
                   ++ ")")%string.

If you want to play with this all using tactics like cbv and see it interactively, there's some debugging examples in https://github.com/mit-plv/fiat-crypto/blob/master/src/StandaloneDebuggingExamples.v, https://github.com/mit-plv/fiat-crypto/blob/master/src/CompilersTestCases.v, https://github.com/mit-plv/fiat-crypto/blob/master/src/SlowPrimeSynthesisExamples.v

atrieu · 2024-11-07T10:13:50Z

Thank you for the pointers!
I'm not sure I understand whether there is a difference in using the Stringification stuff, vs using the C pretty printer from bedrock?

I was just planning to use the pretty printer, since I will be using Rupicola to synthesize the NTT code after I manage to teach it how to compile arrays of field elements.

To make sure I understand the translation pipeline:

The files in PushButtonSynthesis/ transform the Gallina code in Arithmetic/ into some API.Expr t (this kinda looks like magic to me for now)
The API.Expr t stuff is later consumed by translate_func to produce a bedrock func.

So I would need to modify/copy translate_func so that it avoids the load/stores when the number of limbs is 1, which I tried to do by removing some fluff around the call to translate_func', but this is harder to make Coq typecheck than I expected :/

jadephilipoom · 2024-11-16T10:14:24Z

The existing translate_func should work the way you want if you present the field element as an integer instead of a list; it assumes that lists are "arrays in memory" and that integers are "registers". If I were you, I'd try to approach it by adjusting Arithmetic/ or maybe writing a pass to transform the API.Expr t expression itself. The former is probably easier, since it sounds like you have some new Gallina definitions anyway -- if you make these operate on Z instead of list Z, everything should work as expected.

JasonGross · 2024-11-16T12:01:12Z

I think you want to add a reification of encode (?) (the inverse of eval, I believe) a la

fiat-crypto/src/PushButtonSynthesis/UnsaturatedSolinasReificationCache.v

Lines 200 to 210 in 704fe3d

    
             Derive reified_one_gen 
        
                    SuchThat (is_reification_of reified_one_gen onemod) 
        
                    As reified_one_gen_correct. 
        
             Proof. Time cache_reify (). Time Qed. 
        
           #[global] 
        
             Hint Extern 1 (_ = _) => apply_cached_reification onemod (proj1 reified_one_gen_correct) : reify_cache_gen. 
        
           #[global] 
        
             Hint Immediate (proj2 reified_one_gen_correct) : wf_gen_cache. 
        
           #[global] 
        
             Hint Rewrite (proj1 reified_one_gen_correct) : interp_gen_cache. 
        
             Local Opaque reified_one_gen. (* needed for making [autorewrite] not take a very long time *)

or

fiat-crypto/src/PushButtonSynthesis/UnsaturatedSolinasReificationCache.v

Lines 224 to 234 in 704fe3d

    
             Derive reified_eval_gen 
        
                    SuchThat (is_reification_of reified_eval_gen evalmod) 
        
                    As reified_eval_gen_correct. 
        
             Proof. Time cache_reify (). Time Qed. 
        
           #[global] 
        
             Hint Extern 1 (_ = _) => apply_cached_reification evalmod (proj1 reified_eval_gen_correct) : reify_cache_gen. 
        
           #[global] 
        
             Hint Immediate (proj2 reified_eval_gen_correct) : wf_gen_cache. 
        
           #[global] 
        
             Hint Rewrite (proj1 reified_eval_gen_correct) : interp_gen_cache. 
        
             Local Opaque reified_eval_gen. (* needed for making [autorewrite] not take a very long time *)

cf also

fiat-crypto/src/PushButtonSynthesis/WordByWordMontgomeryReificationCache.v

Lines 251 to 256 in 704fe3d

    
           Derive reified_encode_gen 
        
                  SuchThat (is_reification_of reified_encode_gen encodemod) 
        
                  As reified_encode_gen_correct. 
        
           Proof. 
        
             Time cache_reify (). 
        
             (* we would do something faster, but it breaks extraction COQBUG(https://github.com/coq/coq/issues/7954) *)

Then we want versions of all the operations that compose with encode on the input side and eval on the output side (possibly we want to figure out how to work around coq/coq#7954 so the reification cache isn't two times slower; probably we can do this just by adding reduction at

fiat-crypto/src/PushButtonSynthesis/WordByWordMontgomeryReificationCache.v

Lines 67 to 68 in 704fe3d

    
           instantiate (1:=ltac:(let r := Reify rv in 
        
                                 refine (r @ reified_op_gen)%Expr))

), and we can emit these operations in, e.g., https://github.com/mit-plv/fiat-crypto/blob/master/src/PushButtonSynthesis/WordByWordMontgomery.v in cases where the number of limbs is 1 (possibly gated by some top-level flag). The bedrock2 code can presumably do something similar.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Field operations for small prime moduli #1982

Field operations for small prime moduli #1982

atrieu commented Nov 5, 2024

JasonGross commented Nov 6, 2024

atrieu commented Nov 7, 2024

jadephilipoom commented Nov 16, 2024

JasonGross commented Nov 16, 2024

Field operations for small prime moduli #1982

Field operations for small prime moduli #1982

Comments

atrieu commented Nov 5, 2024

JasonGross commented Nov 6, 2024

atrieu commented Nov 7, 2024

jadephilipoom commented Nov 16, 2024

JasonGross commented Nov 16, 2024