Chang, Boon Chiao (2018) Implementation of High Speed Large Integer Multiplication Algorithm on Contemporary Architecture. Master dissertation/thesis, UTAR.
Abstract
Cryptosystem plays an important role in cyber security and users privacy protection. To achieve certain level of security, Public Key Cryptography algorithm is performed on large integer that is more than 64-bit, typical bit size supported by conventional Central Processing Unit (CPU) (e.g. 512-bit for Elliptic Curve Cryptography (ECC), 2048-bit for Rivest-Shamir-Adleman (RSA) and million bits for Fully Homomorphic Encryption (FHE)). The computation of cryptographic algorithm required a lot of large integer arithmetic computation, especially large integer exponentiation and modular operation. Hence, large integer multiplication as the core operation of large integer modular exponentiation is the key factor in determining the performance of the cryptosystem in term of computation time.The minimum bit size of encryption/decryption keys to achieve certain security level have increased over years. CPU implementations are becoming less efficient in handling the computation of crypto algorithms. Hence, other contemporary processor architectures such as GPU and FPGA have become popular alternative to speed up the computation in recent years.This dissertation first discusses several existing large integer multiplication algorithms and reviews different methods used to implement thediscussed algorithms done by other researchers in both Graphic Processing Unit (GPU) and Field Programmable Gate Array (FPGA).Compared to GPU, FPGA offered more low level design and development to the implementation. While GPU allowed users to configure each of the cores (processing units) to perform independent tasks, FPGA provides users a platform to design the processing units themselves to perform dedicated arithmetic and logic operation.In our GPU implementation, we present two different large integer algorithms implementation on two generation of NVIDIA GPU architectures, Kepler and Pascal. The former focuses on utilizing the shuffle instructions to reduce memory latency and bulk multiplication that is able to perform up to 900 multiplications with operands size of 1024, 2048, 4096 and 8192-bit. This implementation also proved that our proposed method of utilizing the shuffle instructions feature to share data between registers memory is able to achieve up to 11.45% and 36.54% speedup when compared with conventional methods of storing data in GPU global memory and shared memory respectively; The later implementation on Pascal architecture aims to achieve single high speed multiplication of larger size by utilizing the available GPU resources. We are able to eliminate the bottleneck, the CRT algorithm from our previous implementation by replacing it with another level of CTFNT and hence increased the multiplication operand size to 4096, 192K, 384K and 768K-bit.In our FPGA implementation, we focused on designing the processing unit ourselves that is capable of computing 3072-bit multiplication. We started with a preliminary design of a multiplier run with typical NTT module that perform computation in parallel-in-serial-out manner before we moved into design with radix-R FNT modules that are able to compute in parallel-in-parallel-out manner for better performance. Both radix-2 and radix-4 FNT modules are implemented and a further design of radix-4 FNT module, named radix-4 CTFNT module that is optimized to fit the multiplication algorithm we used is proposed. The proposed radix-4 design sacrificed the scalability for NTT with other parameters setup but is able to compute 16-points NTT with 10% faster performance and costs about 27% less resources to be implemented when compared to a typical radix-4 design.
Actions (login required)