CurveForge deep dive: Rust type system (ab)use
Robrecht Blancquaert, Sam Van de Velde, Ruben De Smet
- 10 minutes read - 1963 words

CurveForge includes some (nasty) Rust type system (ab)use to ensure fast and safe code! Well, fast to run at least… the compiler is a bit less happy. If you missed our introduction post to CurveForge, you might want to read that one first.
Integers as a Type
Rust’s type system is powerful and expressive, allowing developers to encode complex invariants and constraints directly into the type system. CurveForge (ab)uses this system a bit by using (unsigned) integers at the type level.
The original inspiration came from the following comment in the code of Curve25519 Dalek:
fn sub(self, _rhs: &'a FieldElement51) -> FieldElement51 {
// To avoid underflow, first add a multiple of p.
// Choose 16*p = p << 4 to be larger than 54-bit _rhs.
//
// If we could statically track the bitlengths of the limbs
// of every FieldElement51, we could choose a multiple of p
// just bigger than _rhs and avoid having to do a reduction.
//
// Since we don't yet have type-level integers to do this,
// we have to add an explicit reduction call here.
...
}
But we can statically track bitlengths of the limbs; we just need to define some type bounds using typenum, and we get the totally not so bad:
impl<R, Bl, Br> Sub<BoundedField<R, Br>> for BoundedField<R, Bl>
where
R: FiniteFieldRepr,
Br: Add<U1> + Unsigned,
Bl: Add<Br> + Unsigned,
op!(Bl + Br): Add<U1> + Unsigned,
op!(Bl + Br + U1): IsGreaterOrEqual<R::BOUND> + IsLess<R::BOUND>,
op!(Bl + Br + U1): Mul<Le<op!(Bl + Br + U1), R::BOUND>> + Unsigned,
op!(Br + U1): Mul<GrEq<op!(Bl + Br + U1), R::BOUND>> + Unsigned,
GrEq<op!(Bl + Br + U1), R::BOUND>: Reduce<R>,
Prod<op!(Bl + Br + U1), Le<op!(Bl + Br + U1), R::BOUND>>: Unsigned,
Prod<op!(Br + U1), GrEq<op!(Bl + Br + U1), R::BOUND>>:
Unsigned + Add<Prod<op!(Bl + Br + U1), Le<op!(Bl + Br + U1), R::BOUND>>>,
Sum<
Prod<op!(Br + U1), GrEq<op!(Bl + Br + U1), R::BOUND>>,
Prod<op!(Bl + Br + U1), Le<op!(Bl + Br + U1), R::BOUND>>,
>: Unsigned,
{
type Output = BoundedField<
R,
Sum<
Prod<op!(Br + U1), GrEq<op!(Bl + Br + U1), R::BOUND>>,
Prod<op!(Bl + Br + U1), Le<op!(Bl + Br + U1), R::BOUND>>,
>,
>;
#[inline(always)]
fn sub(self, rhs: BoundedField<R, Br>) -> Self::Output {
...
}
}
Ok, maybe saying we add “some” type bounds is underselling it a bit,
especially since compile times noticeably increase once you start using this sub everywhere.
However, this does statically track the bits used by each limb,
without affecting runtime performance!
To understand what is happening here, let us first review the basics of the elements this function operates over.
Finite field element limbs: drawn and quartered
Elliptic curve cryptography is built on elegant mathematics, but real-world implementations live in a far less idealized environment: computers with fixed-size registers, limited word sizes, and strict security constraints.
On most modern machines, the largest native integer fits in 64 bits. On embedded systems, 32 bits or fewer is common. Meanwhile, elliptic curve cryptography uses fields of 256 bits or larger. Since no single CPU register can hold such values, we must represent them as multiple machine-sized chunks.

Luckily, since we operate over finite fields, we have the guarantee that when we work over an $n$-bit finite field, the integers will not grow (much) larger than $n$ bits. This is because, for a prime field $ \mathbb{F}_p $, all arithmetic is performed modulo $ p $. Thus, technically, we never have to represent values larger than $ p - 1 $, since $ p \equiv 0 \pmod{p} $. In practice, we allow the numbers to grow slightly larger than $ p $, up to some bound. This boundedness is what makes efficient finite-field implementations possible.
Formally, we write a number $ n $ as:
$$ n = n_0 b^0 + n_1 b^1 + n_2 b^2 + \dots $$

where:
- $ b $ is the radix, which is usually a power of two;
- each $ n_i $ fits in a machine word and is called a limb.
Choosing $ b = 2^k $ makes this especially convenient: we can split the binary representation directly into fixed-size chunks.
For example, Curve25519 has as field modulus $ p=2^{255} - 19 $, so field elements require 255 bits. Since 255 bits do not fit into a 64-bit integer, we split the number into limbs. Using 51-bit limbs, the 255-bit value $ 2^{255} - 20 $ can be split neatly into five limbs. Each limb fits comfortably inside a 64-bit integer.
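As a concrete, scaled-down illustration, the splitting and recombining can be sketched as follows. We use a `u128` and three limbs so the example stays self-contained; real 255-bit values need five limbs and multi-word arithmetic:

```rust
// Sketch of radix-2^51 limb splitting, scaled down to a u128 value.
const LIMB_BITS: u32 = 51;
const LIMB_MASK: u128 = (1u128 << LIMB_BITS) - 1;

fn to_limbs(mut n: u128) -> [u64; 3] {
    let mut limbs = [0u64; 3];
    for limb in limbs.iter_mut() {
        *limb = (n & LIMB_MASK) as u64; // take the low 51 bits
        n >>= LIMB_BITS;                // move on to the next limb
    }
    limbs
}

fn from_limbs(limbs: [u64; 3]) -> u128 {
    limbs
        .iter()
        .rev()
        .fold(0u128, |acc, &l| (acc << LIMB_BITS) | u128::from(l))
}

fn main() {
    let n: u128 = (1u128 << 100) - 20;
    let limbs = to_limbs(n);
    // the split round-trips, and every limb leaves 13 bits of headroom
    assert_eq!(from_limbs(limbs), n);
    assert!(limbs.iter().all(|&l| l >> LIMB_BITS == 0));
    println!("limbs = {:?}", limbs);
}
```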
At first glance, it might seem natural to use the full width of a machine word. However, consider computing the sum of two field elements. If each limb already uses all 64 bits, then adding two limbs will overflow the register, forcing us to deal with carries and special cases everywhere. Instead, cryptographic libraries deliberately leave headroom in each limb. This unused space is called the bound.
For example, a 64-bit machine word with 51-bit limbs leaves a 13-bit bound. These extra bits allow us to perform several operations without performing a reduction: a reduction would clear the bounds, but is a costly computation.
When we perform arithmetic operations, intermediate values may exceed the field modulus. That is perfectly fine, as long as they remain within the bounds of the limb representation. For example, both addition and subtraction slightly increase the bound. Some operations also care about the bound of the input: for addition, the input bound does not matter as long as the output bound fits, whereas operations like equality and multiplication expect fully reduced input. If the output bound does not fit, a reduction needs to happen before the operation can take place.
We can imagine two field elements where each limb has a random spread of 51 1’s and 0’s,
followed by 13 0’s.
Adding the two elements limb by limb results in each limb having 52 1’s and 0’s, followed by 12 0’s.
Of course, the number of trailing 0’s can be greater, but it is guaranteed to never be smaller.
This resulting number may be greater than the modulus,
but this is a perfectly valid representation.
Since the 13 bound bits of each limb overlap the first 13 bits of the next limb,
it is relatively easy to shift over the bounds to “clear” them.
The final bound can then be seen as an extra limb,
which is handled and cleared by the reduction process.
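A minimal sketch of this carry-shifting step, assuming 51-bit limbs; the full reduction that folds the final carry back into the low limbs is omitted here:

```rust
// Sketch of "clearing" the bounds by carry propagation over 51-bit limbs.
// The returned final carry plays the role of the "extra limb" that a full
// reduction (e.g. Barrett) would fold back in.
const LIMB_BITS: u32 = 51;
const LIMB_MASK: u64 = (1 << LIMB_BITS) - 1;

fn propagate_carries(limbs: &mut [u64; 5]) -> u64 {
    let mut carry = 0u64;
    for limb in limbs.iter_mut() {
        let v = *limb + carry; // fits in 64 bits as long as the bound holds
        *limb = v & LIMB_MASK; // keep the low 51 bits
        carry = v >> LIMB_BITS; // the bound bits become the next limb's carry
    }
    carry
}

fn main() {
    // each limb uses one bit of its bound (52-bit values in 64-bit words)
    let mut limbs = [(1u64 << 52) - 1; 5];
    let extra = propagate_carries(&mut limbs);
    // afterwards, every limb is back below 2^51
    assert!(limbs.iter().all(|&l| l >> LIMB_BITS == 0));
    println!("limbs = {:?}, extra limb = {}", limbs, extra);
}
```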
CurveForge uses Barrett reduction,
although we also have Montgomery representation and reduction.
Ideally, the bound is spread evenly across all limbs, as in the Curve25519 example. But not all field sizes divide so neatly. When the field size does not align nicely with the machine word size, CurveForge chooses:
- a minimum bound that every limb must satisfy (decided by a compile-time variable),
- an as-even-as-possible limb split,
- possibly a smaller final limb.
The goal is always the same: maximize usable bound without increasing the number of limbs.
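An illustrative heuristic for such a split might look as follows; this is a sketch of the idea, not CurveForge's actual algorithm, and the function name and parameters are made up for illustration:

```rust
// Illustrative heuristic (not CurveForge's actual algorithm): pick the
// smallest number of limbs such that every limb keeps at least `min_bound`
// bits of headroom, then spread the field bits as evenly as possible.
fn limb_split(field_bits: u32, word_bits: u32, min_bound: u32) -> (u32, u32) {
    let usable = word_bits - min_bound; // payload bits per limb
    let n_limbs = field_bits.div_ceil(usable); // fewest limbs that fit
    let bits_per_limb = field_bits.div_ceil(n_limbs); // as-even-as-possible
    (n_limbs, bits_per_limb)
}

fn main() {
    // Curve25519 on a 64-bit machine: five limbs of 51 bits each
    assert_eq!(limb_split(255, 64, 13), (5, 51));
    // a 384-bit field on a 32-bit machine needs more, smaller limbs,
    // possibly with a smaller final limb (14 * 28 = 392 > 384)
    println!("{:?}", limb_split(384, 32, 4));
}
```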
Behold my APIs of Field
A programmer with knowledge of the inner workings of a finite field representation can, in their algorithm, manually keep track of the bounds and perform a reduction when necessary. However, the goal with CurveForge is to offer a public-facing API wrapper around the finite field representation that automatically performs reductions when, and only when, necessary.
One of the APIs we provide is DynamicField.
This is a wrapper around a finite field representation that, per element, keeps an extra byte that tracks how full the bound is.
One of the boons of implementing this in a cryptographic library is that we can assume we know nothing about the input except its bound.
Thus, even though 0 + 0 would not technically increase the bound, we assume it does,
because we do not know the input and cannot assume it is zero while remaining cryptographically safe.
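The idea can be sketched over a toy 31-bit field; all names, the modulus, and the one-bit-per-operation bound model are illustrative, not CurveForge's actual API:

```rust
// Hypothetical sketch of the DynamicField idea over a toy 31-bit field.
const P: u64 = (1 << 31) - 1; // toy Mersenne prime 2^31 - 1
const MAX_BOUND: u8 = 33; // headroom bits in a u64 above the 31-bit field

struct DynamicField {
    value: u64,
    bound: u8, // the extra byte: how many headroom bits may be in use
}

impl DynamicField {
    fn new(value: u64) -> Self {
        Self { value: value % P, bound: 0 }
    }

    fn reduce(&mut self) {
        self.value %= P;
        self.bound = 0;
    }

    fn add(mut self, mut rhs: Self) -> Self {
        // naive model from the post: every operation costs one bound bit,
        // regardless of the actual inputs (even 0 + 0 "grows" the bound)
        if self.bound.max(rhs.bound) + 1 >= MAX_BOUND {
            self.reduce();
            rhs.reduce();
        }
        Self {
            value: self.value + rhs.value,
            bound: self.bound.max(rhs.bound) + 1,
        }
    }

    fn canonical(mut self) -> u64 {
        self.reduce();
        self.value
    }
}

fn main() {
    let a = DynamicField::new(P - 1);
    let b = DynamicField::new(P - 1);
    let c = a.add(b); // value may exceed P, but stays within the bound
    assert_eq!(c.canonical(), P - 2);
    println!("ok");
}
```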
We promised a zero-cost abstraction for finite fields, but DynamicField keeps track of an extra byte.
Surely that is not zero-cost?
Well, we also provide BoundedField.
Instead of keeping an extra byte, the bound is tracked at the type-level at compile time.
A finite field element has a BoundedField wrapper with an extra type parameter that is a typenum number.
Using other typenum traits, we can then define how this type changes for certain operations.
Since the type changes for every operation, it is more difficult to work with BoundedField than with DynamicField.
For example, in-place addition is not possible, because the output type is distinct from the input types.
However, reductions are inserted entirely at compile time, and the extra byte is no longer necessary.
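The key property can be shown in a minimal stable sketch. Where CurveForge uses typenum to make the bound arithmetic fully generic, we hand-write two bound levels here for illustration; the names are made up:

```rust
use std::marker::PhantomData;
use std::ops::Add;

// Minimal sketch of the BoundedField idea: the bound lives in a type
// parameter, so every operation produces a *different* type than its inputs.
struct B0; // "bound is 0 bits"
struct B1; // "bound is 1 bit"

struct Bounded<B> {
    value: u64,
    _bound: PhantomData<B>, // zero-sized: no runtime cost
}

impl Add for Bounded<B0> {
    // the bound grows by one bit, tracked entirely at the type level
    type Output = Bounded<B1>;

    fn add(self, rhs: Bounded<B0>) -> Bounded<B1> {
        Bounded { value: self.value + rhs.value, _bound: PhantomData }
    }
}

fn main() {
    let a = Bounded::<B0> { value: 7, _bound: PhantomData };
    let b = Bounded::<B0> { value: 8, _bound: PhantomData };
    let c: Bounded<B1> = a + b; // `a += b` is impossible: the type changed
    assert_eq!(c.value, 15);
    // the marker is free at runtime: the wrapper is exactly one u64 wide
    assert_eq!(std::mem::size_of::<Bounded<B1>>(), std::mem::size_of::<u64>());
    println!("ok");
}
```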
The DynamicField and BoundedField trait implementations have knowledge of the internal representation of a finite field element.
For every operation, they ensure the bits are correctly distributed over the limbs.
This includes deciding when to reduce.
Thus, we arrive back at the original Sub bound example.
The first lines define that this is an operation where the left- and right-hand sides have some bounds Bl and Br,
over a finite field representation R:
impl<R, Bl, Br> Sub<BoundedField<R, Br>> for BoundedField<R, Bl>
where
R: FiniteFieldRepr,
Br: Add<U1> + Unsigned,
Bl: Add<Br> + Unsigned,
The rest of the type signature adds type-level requirements one by one until we can determine the bound of the output:
OutputBound = if Bl + Br + 1 < MAX_BOUND { Bl + Br + 1 } else { Br + 1 }.
I.e., we only reduce the right-hand side when it overflows MAX_BOUND.
Working backwards makes this easier to understand. We start by looking at the output bound.
Sum<
Prod<op!(Br + U1), GrEq<op!(Bl + Br + U1), R::BOUND>>,
Prod<op!(Bl + Br + U1), Le<op!(Bl + Br + U1), R::BOUND>>,
>
The sum of two products allows a sort of “compile time branch”;
either the result is Br + U1 (if GrEq<..> == U1),
or Bl + Br + U1 (if Le<..> == U1; the converse).
The comparisons resolve to either the number 1 or the number 0,
and the multiplication Prod allows us to discard or retain the respective left-hand sides.
This bound is thus equivalent to:

  // ↓ the output bound is the right bound + 1
  (Br + 1)      * ((Bl + Br + 1) >= MAX_BOUND)
  //            ↑ iff the output bound would be too large (i.e. a reduction is necessary)
+ // otherwise (one of the two sides is always 0)
  // ↓ the output bound is the sum of the two bounds + 1
  (Bl + Br + 1) * ((Bl + Br + 1) < MAX_BOUND)
  //            ↑ iff the output bound fits
For each of these operands we have to explicitly state the type bound which expresses that the operation is possible for those parameters. This results in the complex type bounds at the start of this post.
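The same select-by-multiplication trick can be checked with ordinary numbers in a const fn; here a fixed MAX_BOUND stands in for R::BOUND, and the names are illustrative:

```rust
// Value-level mirror of the Sum<Prod<..>, Prod<..>> "compile-time branch".
const MAX_BOUND: u32 = 13;

const fn output_bound(bl: u32, br: u32) -> u32 {
    // the comparisons resolve to 0 or 1 ...
    let reduce_needed = (bl + br + 1 >= MAX_BOUND) as u32;
    let fits = (bl + br + 1 < MAX_BOUND) as u32;
    // ... and multiplication keeps exactly one of the two branches
    (br + 1) * reduce_needed + (bl + br + 1) * fits
}

fn main() {
    assert_eq!(output_bound(3, 4), 8); // 3 + 4 + 1 fits below 13
    assert_eq!(output_bound(9, 4), 5); // 14 >= 13: a reduction is needed
    println!("ok");
}
```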
Hopefully, one day when consts can live at the same level as types in the Rust compiler, we can simplify this.
For now, we are stuck with complex types.
Or we could write another macro to generate these type bounds automatically…
Some remarks
Currently, we naively increase the bound by one bit for each operation, while technically the bound only grows in proportion to how much of the available bits the field size occupies. As the maximum bound is currently set to the number of free bits, some reductions may happen too soon. In the future, we will set the maximum bound as a function of the free bits and how large the field size is compared to the next power of two.
As you can see from the type bounds in Sub, if the right bound is already at its maximum,
the bound would technically grow ever larger.
We solve this by reducing the right-hand side argument anyway,
but a more elegant solution is possible in the future.