feMulGeneric method
Limb multiplication works like pen-and-paper columnar multiplication, but with 51-bit limbs instead of digits.
a4 a3 a2 a1 a0 x
b4 b3 b2 b1 b0 =
------------------------
a4b0 a3b0 a2b0 a1b0 a0b0 +
a4b1 a3b1 a2b1 a1b1 a0b1 +
a4b2 a3b2 a2b2 a1b2 a0b2 +
a4b3 a3b3 a2b3 a1b3 a0b3 +
a4b4 a3b4 a2b4 a1b4 a0b4 =
r8 r7 r6 r5 r4 r3 r2 r1 r0
We can then use the reduction identity (a * 2²⁵⁵ + b = a * 19 + b) to reduce the limbs that would overflow 255 bits. r5 * 2²⁵⁵ becomes 19 * r5, r6 * 2³⁰⁶ becomes 19 * r6 * 2⁵¹, etc.
Reduction can be carried out simultaneously to multiplication. For example, we do not compute r5: whenever the result of a multiplication belongs to r5, like a1b4, we multiply it by 19 and add the result to r0.
a4b0 a3b0 a2b0 a1b0 a0b0 +
a3b1 a2b1 a1b1 a0b1 19×a4b1 +
a2b2 a1b2 a0b2 19×a4b2 19×a3b2 +
a1b3 a0b3 19×a4b3 19×a3b3 19×a2b3 +
a0b4 19×a4b4 19×a3b4 19×a2b4 19×a1b4 =
--------------------------------------
r4 r3 r2 r1 r0
Finally we add up the columns into wide, overlapping limbs.
Implementation
void feMulGeneric(Element a, Element b) {
final a1_19 = a.l1 * bigInt19;
final a2_19 = a.l2 * bigInt19;
final a3_19 = a.l3 * bigInt19;
final a4_19 = a.l4 * bigInt19;
// r0 = a0×b0 + 19×(a1×b4 + a2×b3 + a3×b2 + a4×b1)
final r0 = Uint128.mul64(a.l0, b.l0)
..addMul64(a1_19, b.l4)
..addMul64(a2_19, b.l3)
..addMul64(a3_19, b.l2)
..addMul64(a4_19, b.l1);
// r1 = a0×b1 + a1×b0 + 19×(a2×b4 + a3×b3 + a4×b2)
final r1 = Uint128.mul64(a.l0, b.l1)
..addMul64(a.l1, b.l0)
..addMul64(a2_19, b.l4)
..addMul64(a3_19, b.l3)
..addMul64(a4_19, b.l2);
// r2 = a0×b2 + a1×b1 + a2×b0 + 19×(a3×b4 + a4×b3)
final r2 = Uint128.mul64(a.l0, b.l2)
..addMul64(a.l1, b.l1)
..addMul64(a.l2, b.l0)
..addMul64(a3_19, b.l4)
..addMul64(a4_19, b.l3);
// r3 = a0×b3 + a1×b2 + a2×b1 + a3×b0 + 19×a4×b4
final r3 = Uint128.mul64(a.l0, b.l3)
..addMul64(a.l1, b.l2)
..addMul64(a.l2, b.l1)
..addMul64(a.l3, b.l0)
..addMul64(a4_19, b.l4);
// r4 = a0×b4 + a1×b3 + a2×b2 + a3×b1 + a4×b0
final r4 = Uint128.mul64(a.l0, b.l4)
..addMul64(a.l1, b.l3)
..addMul64(a.l2, b.l2)
..addMul64(a.l3, b.l1)
..addMul64(a.l4, b.l0);
// After the multiplication, we need to reduce (carry) the five coefficients
// to obtain a result with limbs that are at most slightly larger than 2⁵¹,
// to respect the Element invariant.
//
// Overall, the reduction works the same as carryPropagate, except with
// wider inputs: we take the carry for each coefficient by shifting it right
// by 51, and add it to the limb above it. The top carry is multiplied by 19
// according to the reduction identity and added to the lowest limb.
//
// The largest coefficient (r0) will be at most 111 bits, which guarantees
// that all carries are at most 111 - 51 = 60 bits, which fits in a uint64.
//
// r0 = a0×b0 + 19×(a1×b4 + a2×b3 + a3×b2 + a4×b1)
// r0 < 2⁵²×2⁵² + 19×(2⁵²×2⁵² + 2⁵²×2⁵² + 2⁵²×2⁵² + 2⁵²×2⁵²)
// r0 < (1 + 19 × 4) × 2⁵² × 2⁵²
// r0 < 2⁷ × 2⁵² × 2⁵²
// r0 < 2¹¹¹
//
// Moreover, the top coefficient (r4) is at most 107 bits, so c4 is at most
// 56 bits, and c4 * 19 is at most 61 bits, which again fits in a uint64 and
// allows us to easily apply the reduction identity.
//
// r4 = a0×b4 + a1×b3 + a2×b2 + a3×b1 + a4×b0
// r4 < 5 × 2⁵² × 2⁵²
// r4 < 2¹⁰⁷
//
final c0 = r0.shiftRightBy51();
final c1 = r1.shiftRightBy51();
final c2 = r2.shiftRightBy51();
final c3 = r3.shiftRightBy51();
final c4 = r4.shiftRightBy51();
final rr0 = (r0.low & maskLow51Bits) + (c4 * bigInt19);
final rr1 = (r1.low & maskLow51Bits) + c0;
final rr2 = (r2.low & maskLow51Bits) + c1;
final rr3 = (r3.low & maskLow51Bits) + c2;
final rr4 = (r4.low & maskLow51Bits) + c3;
// Now all coefficients fit into 64-bit registers but are still too large to
// be passed around as a Element. We therefore do one last carry chain,
// where the carries will be small enough to fit in the wiggle room above 2⁵¹.
l0 = rr0;
l1 = rr1;
l2 = rr2;
l3 = rr3;
l4 = rr4;
carryPropagateGeneric();
}