feMulGeneric method

void feMulGeneric(
  1. Element a,
  2. Element b
)

Limb multiplication works like pen-and-paper columnar multiplication, but with 51-bit limbs instead of digits.

                     a4   a3   a2   a1   a0  x
                     b4   b3   b2   b1   b0  =
                    ------------------------
                   a4b0 a3b0 a2b0 a1b0 a0b0  +
              a4b1 a3b1 a2b1 a1b1 a0b1       +
         a4b2 a3b2 a2b2 a1b2 a0b2            +
    a4b3 a3b3 a2b3 a1b3 a0b3                 +

a4b4 a3b4 a2b4 a1b4 a0b4 =

 r8   r7   r6   r5   r4   r3   r2   r1   r0

We can then use the reduction identity (a * 2²⁵⁵ + b = a * 19 + b) to reduce the limbs that would overflow 255 bits. r5 * 2²⁵⁵ becomes 19 * r5, r6 * 2³⁰⁶ becomes 19 * r6 * 2⁵¹, etc.

Reduction can be carried out simultaneously to multiplication. For example, we do not compute r5: whenever the result of a multiplication belongs to r5, like a1b4, we multiply it by 19 and add the result to r0.

       a4b0    a3b0    a2b0    a1b0    a0b0  +
       a3b1    a2b1    a1b1    a0b1 19×a4b1  +
       a2b2    a1b2    a0b2 19×a4b2 19×a3b2  +
       a1b3    a0b3 19×a4b3 19×a3b3 19×a2b3  +
       a0b4 19×a4b4 19×a3b4 19×a2b4 19×a1b4  =
      --------------------------------------
         r4      r3      r2      r1      r0

Finally we add up the columns into wide, overlapping limbs.

Implementation

void feMulGeneric(Element a, Element b) {
  final a1_19 = a.l1 * bigInt19;
  final a2_19 = a.l2 * bigInt19;
  final a3_19 = a.l3 * bigInt19;
  final a4_19 = a.l4 * bigInt19;

  // r0 = a0×b0 + 19×(a1×b4 + a2×b3 + a3×b2 + a4×b1)
  final r0 = Uint128.mul64(a.l0, b.l0)
    ..addMul64(a1_19, b.l4)
    ..addMul64(a2_19, b.l3)
    ..addMul64(a3_19, b.l2)
    ..addMul64(a4_19, b.l1);

  // r1 = a0×b1 + a1×b0 + 19×(a2×b4 + a3×b3 + a4×b2)
  final r1 = Uint128.mul64(a.l0, b.l1)
    ..addMul64(a.l1, b.l0)
    ..addMul64(a2_19, b.l4)
    ..addMul64(a3_19, b.l3)
    ..addMul64(a4_19, b.l2);

  // r2 = a0×b2 + a1×b1 + a2×b0 + 19×(a3×b4 + a4×b3)
  final r2 = Uint128.mul64(a.l0, b.l2)
    ..addMul64(a.l1, b.l1)
    ..addMul64(a.l2, b.l0)
    ..addMul64(a3_19, b.l4)
    ..addMul64(a4_19, b.l3);

  // r3 = a0×b3 + a1×b2 + a2×b1 + a3×b0 + 19×a4×b4
  final r3 = Uint128.mul64(a.l0, b.l3)
    ..addMul64(a.l1, b.l2)
    ..addMul64(a.l2, b.l1)
    ..addMul64(a.l3, b.l0)
    ..addMul64(a4_19, b.l4);

  // r4 = a0×b4 + a1×b3 + a2×b2 + a3×b1 + a4×b0
  final r4 = Uint128.mul64(a.l0, b.l4)
    ..addMul64(a.l1, b.l3)
    ..addMul64(a.l2, b.l2)
    ..addMul64(a.l3, b.l1)
    ..addMul64(a.l4, b.l0);

  // After the multiplication, we need to reduce (carry) the five coefficients
  // to obtain a result with limbs that are at most slightly larger than 2⁵¹,
  // to respect the Element invariant.
  //
  // Overall, the reduction works the same as carryPropagate, except with
  // wider inputs: we take the carry for each coefficient by shifting it right
  // by 51, and add it to the limb above it. The top carry is multiplied by 19
  // according to the reduction identity and added to the lowest limb.
  //
  // The largest coefficient (r0) will be at most 111 bits, which guarantees
  // that all carries are at most 111 - 51 = 60 bits, which fits in a uint64.
  //
  //     r0 = a0×b0 + 19×(a1×b4 + a2×b3 + a3×b2 + a4×b1)
  //     r0 < 2⁵²×2⁵² + 19×(2⁵²×2⁵² + 2⁵²×2⁵² + 2⁵²×2⁵² + 2⁵²×2⁵²)
  //     r0 < (1 + 19 × 4) × 2⁵² × 2⁵²
  //     r0 < 2⁷ × 2⁵² × 2⁵²
  //     r0 < 2¹¹¹
  //
  // Moreover, the top coefficient (r4) is at most 107 bits, so c4 is at most
  // 56 bits, and c4 * 19 is at most 61 bits, which again fits in a uint64 and
  // allows us to easily apply the reduction identity.
  //
  //     r4 = a0×b4 + a1×b3 + a2×b2 + a3×b1 + a4×b0
  //     r4 < 5 × 2⁵² × 2⁵²
  //     r4 < 2¹⁰⁷
  //

  final c0 = r0.shiftRightBy51();
  final c1 = r1.shiftRightBy51();
  final c2 = r2.shiftRightBy51();
  final c3 = r3.shiftRightBy51();
  final c4 = r4.shiftRightBy51();

  final rr0 = (r0.low & maskLow51Bits) + (c4 * bigInt19);
  final rr1 = (r1.low & maskLow51Bits) + c0;
  final rr2 = (r2.low & maskLow51Bits) + c1;
  final rr3 = (r3.low & maskLow51Bits) + c2;
  final rr4 = (r4.low & maskLow51Bits) + c3;

  // Now all coefficients fit into 64-bit registers but are still too large to
  // be passed around as a Element. We therefore do one last carry chain,
  // where the carries will be small enough to fit in the wiggle room above 2⁵¹.
  l0 = rr0;
  l1 = rr1;
  l2 = rr2;
  l3 = rr3;
  l4 = rr4;

  carryPropagateGeneric();
}