/// How to split layers across multiple GPUs (array of size LLAMA_MAX_DEVICES).
external ffi.Pointer<ffi.Float> tensor_split;