More efficient oblivious transfer and extensions for faster secure computation

Protocols for secure computation enable parties to compute a joint function on their private inputs without revealing anything but the result. A foundation for secure computation is oblivious transfer (OT), which traditionally requires expensive public key cryptography. A more efficient way to perform many OTs is to extend a small number of base OTs using OT extensions based on symmetric cryptography. In this work we present optimizations and efficient implementations of OT and OT extensions in the semi-honest model. We propose a novel OT protocol with security in the standard model and improve OT extensions with respect to communication complexity, computation complexity, and scalability. We also provide specific optimizations of OT extensions that are tailored to the secure computation protocols of Yao and Goldreich-Micali-Wigderson and reduce the communication complexity even further. We experimentally verify the efficiency gains of our protocols and optimizations. By applying our implementation to current secure computation frameworks, we can securely compute a Levenshtein distance circuit with 1.29 billion AND gates at a rate of 1.2 million AND gates per second. Moreover, we demonstrate the importance of correctly implementing OT within secure computation protocols by presenting an attack on the FastGC framework.


Background
In the setting of secure two-party computation, two parties P0 and P1 with respective inputs x and y wish to compute a joint function f on their inputs without revealing anything but the output f(x, y). This captures a large variety of tasks, including privacy-preserving data mining, anonymous transactions, private database search, and many more. In this paper, we consider semi-honest adversaries, who follow the protocol but may attempt to learn more than allowed from the protocol communication. We focus on semi-honest security as this allows the construction of highly efficient protocols for many application scenarios. This model is justified, e.g., for computations between hospitals or companies that trust each other but need to run a secure protocol because of legal restrictions and/or in order to prevent inadvertent leakage (since only the output is revealed from the communication). Semi-honest security also protects against potential misuse by some insiders and future break-ins, and can be enforced with software attestation. Moreover, understanding the cost of semi-honest security is an important stepping stone towards efficient malicious security. We remark that in a large IARPA-funded project on secure computation on big data, IARPA stated that the semi-honest adversary model is suitable for their applications [27].
Practical secure computation. Secure computation has been studied since the mid 1980s, when powerful feasibility results demonstrated that any efficient function can be computed securely [15,51]. However, until recently, the bulk of research on secure computation was theoretical in nature. Indeed, many held the opinion that secure computation would never be practical, since carrying out cryptographic operations for every gate in a circuit computing the function (which is the way many protocols work) would never be fast enough to be of use. Due to many works that pushed secure computation further towards practical applications, e.g., [4,5,8,11,13,21,24,30,35-37,44,50], this conjecture has proven to be wrong, and it is possible to carry out secure computation of complex functions at speeds that five years ago would have been inconceivable. For example, in FastGC [24] it was shown that AES can be securely computed with 0.2 seconds of preprocessing time and just 0.008 seconds of online computation. This has applications to private database search and also to mitigating server breaches in the cloud by sharing the decryption key for sensitive data between two servers and never revealing it (thereby forcing an attacker to compromise the security of two servers instead of one). In addition, [24] carried out a secure computation of a circuit of size 1.29 billion AND gates, which until recently would have been thought impossible. Their computation took 223 minutes, which is arguably too long for most applications. However, it demonstrated that large-scale secure computation can be achieved. The FastGC framework was a breakthrough result regarding the practicality of secure computation and has been used in many subsequent works, e.g., [22,23,25,26,44]. However, it is still possible to do much better. The secure computation framework of [49] improved the results of FastGC [24] by a factor of 6-80, depending on the network latency.
Jumping ahead, we obtain additional speedups for both secure computation frameworks [24] and [49]. Most notably, when applying our improved OT implementation to the framework of [49], we are able to evaluate the 1.29 billion AND gate circuit in just 18 minutes. We conclude that significant efficiency improvements can still be made, considerably broadening the tasks that can be solved using secure computation in practice.
Oblivious transfer and extensions. In an oblivious transfer (OT) [48], a sender with a pair of input strings (x0, x1) interacts with a receiver who inputs a choice bit σ. The result is that the receiver learns xσ without learning anything about x1−σ, while the sender learns nothing about σ. Oblivious transfer is an extremely powerful tool and the foundation for almost all efficient protocols for secure computation. Notably, Yao's garbled-circuit protocol [51] (e.g., implemented in FastGC [24]) requires OT for every input bit of one party, and the GMW protocol [15] (e.g., implemented in [8,49]) requires OT for every AND gate of the circuit. Accordingly, the efficient instantiation of OT is of crucial importance, as is evident in many recent works that focus on efficiency, e.g., [8,16,19,22-24,26,34,37,43,49]. In the semi-honest case, the best known OT protocol is that of [40], which has a cost of approximately 3 exponentiations per 1-out-of-2 OT. However, if thousands, millions, or even billions of oblivious transfers need to be carried out, this becomes prohibitively expensive. In order to solve this problem, OT extensions [2,28] can be used. An OT extension protocol works by running a small number of base OTs (say, 80 or 128) that are used as a basis for obtaining many OTs via cheap symmetric cryptographic operations only. This is conceptually similar to hybrid encryption, where instead of encrypting a large message using RSA, which would be too expensive, a single RSA computation is carried out to encrypt a symmetric key and the long message is then encrypted using symmetric operations only. Such an OT extension can be achieved with extraordinary efficiency; specifically, the protocol of [28] requires only three hash function computations on a single block per oblivious transfer (beyond the initial base OTs).
Related Work. There is independent work on the efficiency of OT extension with security against stronger malicious adversaries [17,42,43]. In the semi-honest model, [20] improved the implementation of the OT extension protocol of [28] in FastGC [24]. They reduce the memory footprint by splitting the OT extension protocol sequentially into multiple rounds and obtain speedups by instantiating the pseudo-random generator with AES instead of SHA-1. Their implementation evaluates 400,000 OTs (of 80-bit strings, without precomputation) per second over WiFi; we propose additional optimizations, and our fastest implementation evaluates more than 700,000 OTs per second over WiFi, cf. Tab. 4.

Our Contributions and Outline
In this paper, we present more efficient protocols for OT extensions. This is somewhat surprising, since the protocol of [28] sounds optimal given that only three hash function computations are needed per transfer. Interestingly, our protocols do not lower the number of hash function operations. However, we observe that significant cost is incurred due to factors other than the hash function operations. We propose several algorithmic ( §4) and protocol ( §5) optimizations and obtain an OT extension protocol (General OT, G-OT, §5.3) that has lower communication, faster computation, and can be parallelized. Additionally, we propose two OT extension protocols that are specifically designed to be used in secure computation protocols and which reduce the communication and computation even further. The first of these protocols (Correlated OT, C-OT, §5.4) is suitable for secure computation protocols that require correlated inputs, such as Yao's garbled circuits protocol with the free-XOR technique [32,51]. The second protocol (Random OT, R-OT, §5.4) can be used in secure computation protocols where the inputs can be random, such as GMW with multiplication triples [1,15] (cf. §5.1). We apply our optimizations to the OT extension implementation of [49] (which is based on [8]) and demonstrate the improvements by extensive experiments ( §6); our implementation is available online at http://encrypto.de/code/OTExtension. A summary of the time complexity for 1-out-of-2 OTs on 80-bit strings is given in Fig. 1. While the original protocol of [28] as implemented in [49] evaluates 2^23 OTs in 18.0 s with one thread and in 14.5 s with two threads, our improved R-OT protocol requires only 8.4 s with one thread and 4.2 s with two threads, which demonstrates the scalability of our approach.
Secure random number generation. In §3 we emphasize that when OT protocols are used as a building block in a secure computation protocol, it is very important that random values are generated with a cryptographically strong random number generator. In fact, we show an attack on the latest version of the FastGC [24] implementation (version v0.1.1) of Yao's protocol, which uses a weak random number generator. Our attack allows the full recovery of the inputs of both parties. To protect against our attack, a cryptographically strong random number generator needs to be used (which results in an increased runtime).
Faster semi-honest base OT without random oracle. In the semi-honest model, the OT of [40] is the fastest known, with 2 + n exponentiations for the sender and 2n fixed-base exponentiations for the receiver, for n OTs. However, it is proven secure only in the random oracle model, which is why the authors of [40] also provide a slower semi-honest OT that relies on the DDH assumption and has a complexity of 4n fixed-base + 2n double exponentiations for the sender and 1 + 3n fixed-base + n exponentiations for the receiver. In §5.2 we construct a protocol secure under the Decisional Diffie-Hellman (DDH) assumption that is much faster when many transfers are run (as in the case of OT extension, where 80 base OTs are needed) and is only slightly slower than the fastest OT in the random oracle model ( §6.1).
Faster OT extensions. In §5.3 we present an improved version of the original OT extension protocol of [28] with reduced communication and computation complexity. Furthermore, we demonstrate how the OT extension protocol can be processed in independent blocks, allowing OT extension to be parallelized and yielding a much faster runtime ( §4.1). In addition, we show how to implement the matrix transpose operation using a cache-efficient algorithm that operates on multiple entries at once ( §4.2); this has a significant effect on the runtime of the protocol. Finally, we show how to reduce the communication by approximately one quarter (depending on the bit-length of the inputs). This is of great importance since local computations of the OT extension protocol are so fast that the communication is often the bottleneck, especially when running the protocol over the Internet or even wireless networks.
Extended OT functionality. Our improved protocol can be used in any setting in which regular OT can be used. However, with the application of secure computation in mind, we further optimize the protocol by taking into account its use in the protocols of Yao [51] and GMW [15] in §5.4. For Yao's garbled circuits protocol, we observe that the OT extension protocol can choose the first value randomly and output it to the sender, while the second value is computed as a function of the first value. For the GMW protocol, we observe that the OT extension protocol can choose both values randomly and output them to the sender. In both cases, the communication is reduced to half (or even less) of the original protocol of [28].
Experimental evaluation and applications. In §6 we experimentally verify the performance improvements of our proposed optimizations for OT and OT extension. In §7 we demonstrate their efficiency gains for faster secure computation by giving performance benchmarks for various application scenarios. For Yao's garbled circuits framework FastGC [24], we achieve an improvement of up to a factor of 9 for circuits with many inputs for the receiver, whereas we improve the runtime of the GMW implementation of [49] by a factor of 2; e.g., a Levenshtein distance circuit with 1.29 billion AND gates can now be evaluated at a rate of 1.2 million AND gates per second.

PRELIMINARIES
In the following, we summarize the security parameters used in our paper ( §2.1) and describe the OT extension protocol of [28] ( §2.2), Yao's garbled circuits protocol ( §2.3), and the GMW protocol ( §2.4) in more detail. Standard definitions of security are given in Appendix A.

Security Parameters
Throughout the paper, we denote the symmetric security parameter by κ. Tab. 1 lists usage time frames for different values of the symmetric security parameter κ (SYM) and the corresponding field sizes for finite field cryptography (FFC) and elliptic curve cryptography (ECC), as recommended by NIST [45]. For FFC we use a subgroup of prime order q with |q| = 2κ. For ECC we use Koblitz curves, which had the best performance in our experiments.

Oblivious Transfer and OT Extension
The m-times 1-out-of-2 OT functionality for ℓ-bit strings, denoted m×OT_ℓ, is defined as follows: the sender S inputs m pairs of ℓ-bit strings (x_j^0, x_j^1), the receiver R inputs a string of choice bits r = (r_1, ..., r_m) of length m, and R obtains x_j^{r_j} as output for every 1 ≤ j ≤ m. An OT extension protocol implements the m×OT_ℓ functionality using a small number of actual OTs, referred to as base OTs, and cheap symmetric cryptographic operations. In [28] it is shown how to implement the m×OT_ℓ functionality using a single call to κ×OT_m and 3m hash function computations. Note that κ×OT_m can be implemented via a single call to κ×OT_κ in order to obliviously transfer symmetric keys, and then using a pseudo-random generator G to obliviously transfer the actual inputs of length m (cf. [26,28]). In the first step of [28], S chooses a random string s ∈_R {0,1}^κ, and R chooses a random m×κ bit matrix T = [t^1 | ... | t^κ], where t^i ∈ {0,1}^m denotes the i-th column of T. The parties then invoke the κ×OT_m functionality, where R plays the sender with inputs (t^i, t^i ⊕ r) and S plays the receiver with input s. Let Q = [q^1 | ... | q^κ] denote the m×κ matrix received by S. Note that q^i = (s_i · r) ⊕ t^i and q_j = (r_j · s) ⊕ t_j (where t_j and q_j denote the j-th rows of T and Q, respectively). S sends (y_j^0, y_j^1), where y_j^0 = x_j^0 ⊕ H(q_j) and y_j^1 = x_j^1 ⊕ H(q_j ⊕ s), for 1 ≤ j ≤ m. R finally outputs z_j = y_j^{r_j} ⊕ H(t_j) for every j. The protocol is secure assuming that H : {0,1}^κ → {0,1}^ℓ is a random oracle, or a correlation robust function as in Definition A.2; see [28] for more details.
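To make the flow of the matrices concrete, the following Python sketch simulates the OT extension of [28] end-to-end with toy parameters: κ = 8, the base OTs replaced by their ideal functionality, and H instantiated with truncated SHA-1 that we additionally tweak with the row index j (as done in [28]). This is an illustration of the correctness argument only, not a secure implementation.

```python
import hashlib
import secrets

KAPPA = 8          # toy security parameter, for illustration only
M = 16             # number of extended OTs
ELL = 4            # byte length of the transferred strings

def H(j: int, row: int) -> bytes:
    """Correlation-robust function: truncated SHA-1, tweaked with row index j."""
    data = j.to_bytes(4, "big") + row.to_bytes(2, "big")
    return hashlib.sha1(data).digest()[:ELL]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Inputs: S holds m pairs of strings, R holds m choice bits
x = [(secrets.token_bytes(ELL), secrets.token_bytes(ELL)) for _ in range(M)]
r = [secrets.randbelow(2) for _ in range(M)]

# R picks T as kappa random m-bit columns; S picks the string s
t_cols = [secrets.randbits(M) for _ in range(KAPPA)]
s = [secrets.randbelow(2) for _ in range(KAPPA)]
r_int = sum(r[j] << j for j in range(M))

# Idealized base OTs: S receives column q^i = t^i xor (s_i * r)
q_cols = [t_cols[i] ^ (r_int if s[i] else 0) for i in range(KAPPA)]

def row(cols, j):
    """Assemble row j of a bit matrix stored column-wise."""
    return sum(((cols[i] >> j) & 1) << i for i in range(KAPPA))

s_int = sum(s[i] << i for i in range(KAPPA))

# S masks its inputs; note that row j of Q equals t_j xor (r_j * s)
y = [(xor(x[j][0], H(j, row(q_cols, j))),
      xor(x[j][1], H(j, row(q_cols, j) ^ s_int))) for j in range(M)]

# R unmasks the chosen value with H(j, t_j)
z = [xor(y[j][r[j]], H(j, row(t_cols, j))) for j in range(M)]

assert all(z[j] == x[j][r[j]] for j in range(M))
```

The final assertion checks exactly the correctness argument above: for r_j = 0 we have q_j = t_j, and for r_j = 1 we have q_j ⊕ s = t_j, so in both cases R strips the mask H(j, t_j).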

Yao's Garbled Circuits Protocol
Yao's garbled circuits protocol [51] allows two parties to securely compute an arbitrary function that is represented as a Boolean circuit. The sender S encrypts the Boolean gates of the circuit using symmetric keys and sends the encrypted function together with the keys that correspond to his input bits to the receiver R. R then uses 1-out-of-2 OTs to obliviously obtain the keys that correspond to his inputs and evaluates the encrypted function by decrypting it gate by gate. To obtain the output, R sends the resulting keys to S, or S provides a mapping from keys to output bits. We emphasize that Yao's garbled circuits protocol requires a 1-out-of-2 OT on κ-bit strings for each input bit of R. For our experiments we use the garbled circuits framework FastGC [24].
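The gate-by-gate encryption and decryption can be sketched for a single AND gate as follows. This is a bare-bones illustration under simplifying assumptions: one fresh key pair per wire, encryption as a SHA-256-derived one-time pad, and trial decryption against the known output keys instead of the point-and-permute and output-mapping machinery of real garbling schemes.

```python
import hashlib
import secrets

KEYLEN = 16

def enc(k_a: bytes, k_b: bytes, plaintext: bytes) -> bytes:
    """One-time-pad a wire key under a pair of input keys (SHA-256 as KDF)."""
    pad = hashlib.sha256(k_a + k_b).digest()[:KEYLEN]
    return bytes(x ^ y for x, y in zip(pad, plaintext))

def garble_and_gate():
    """Garble one AND gate: a key pair per wire, four shuffled ciphertexts."""
    keys = {w: (secrets.token_bytes(KEYLEN), secrets.token_bytes(KEYLEN))
            for w in ("a", "b", "out")}
    table = [enc(keys["a"][va], keys["b"][vb], keys["out"][va & vb])
             for va in (0, 1) for vb in (0, 1)]
    secrets.SystemRandom().shuffle(table)   # hide which row is which
    return keys, table

def evaluate(table, k_a, k_b, out_keys):
    """Trial-decrypt each row; the XOR pad is its own inverse."""
    for ct in table:
        cand = enc(k_a, k_b, ct)
        if cand in out_keys:
            return out_keys.index(cand)
    raise ValueError("no entry decrypted")

keys, table = garble_and_gate()
for va in (0, 1):
    for vb in (0, 1):
        assert evaluate(table, keys["a"][va], keys["b"][vb], keys["out"]) == (va & vb)
```

In the actual protocol, R of course holds only one key per wire (obtained via OT for his own inputs) and never sees both output keys; the final lookup here stands in for the output mapping provided by S.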

The GMW Protocol
The protocol of Goldreich, Micali, and Wigderson (GMW) [15] also represents the function to be computed as a Boolean circuit. Both parties secret-share their inputs using the XOR operation and evaluate the Boolean circuit as follows. An XOR gate is computed by locally XORing the shares, while an AND gate is evaluated interactively with the help of a multiplication triple [1,49], which can be precomputed by two random 1-out-of-2 OTs on bits (cf. §5.1). To reconstruct the outputs, the parties exchange their output shares. The performance of GMW depends on the number of OTs and on the depth of the evaluated circuit, since the evaluation of AND gates requires interaction. For our experiments we use the GMW framework of [49], which is an optimization of the framework of [8] for the two-party case.
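As background for how a precomputed triple is consumed, the following sketch evaluates one AND gate on XOR-shared bits in the style of Beaver [1]: the parties publicly reconstruct the masked values d = x ⊕ a and e = y ⊕ b and then derive shares of z = xy locally. The triple generation is idealized here; §5.1 describes how triples are obtained from R-OTs.

```python
import secrets

def share(bit):
    """XOR-share a bit between the two parties."""
    s0 = secrets.randbelow(2)
    return s0, bit ^ s0

def random_triple():
    """Idealized multiplication triple: shares of random a, b and of c = a AND b."""
    a, b = secrets.randbelow(2), secrets.randbelow(2)
    a0, a1 = share(a)
    b0, b1 = share(b)
    c0, c1 = share(a & b)
    return (a0, b0, c0), (a1, b1, c1)

def and_gate(x_sh, y_sh, triple):
    """Evaluate one AND gate on shared bits using one multiplication triple."""
    (a0, b0, c0), (a1, b1, c1) = triple
    # each party masks its shares; d and e are then publicly reconstructed
    d = (x_sh[0] ^ a0) ^ (x_sh[1] ^ a1)
    e = (y_sh[0] ^ b0) ^ (y_sh[1] ^ b1)
    # local share computation; the d*e term is added by one party only
    z0 = (d & e) ^ (d & b0) ^ (e & a0) ^ c0
    z1 = (d & b1) ^ (e & a1) ^ c1
    return z0, z1

for x in (0, 1):
    for y in (0, 1):
        z0, z1 = and_gate(share(x), share(y), random_triple())
        assert z0 ^ z1 == x & y
```

Revealing d and e leaks nothing because a and b are uniformly random one-time masks; this is what makes the triple consumable for exactly one AND gate.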

RANDOM NUMBER GENERATION
The correct instantiation of primitives in implementations of cryptographic protocols is a challenging task, since various security properties have to be met. For instance, an important security property of a pseudo-random generator (PRG) is its unpredictability, i.e., given a sequence of pseudo-random bits x_1, ..., x_n, the next bit x_{n+1} should not be predictable. If the security property of the primitive is not met, the security of the overall scheme can be compromised. We found that this is the case for the FastGC framework in version v0.1.1 [24], which uses the standard Java Random class in order to generate the random values used in the base OTs, the random choices of the vector s and the matrix T in the OT extension, and the input keys of the garbled circuit. Overall, this enables an attack that allows each party to recover the inputs of the respective other party, as we describe next.

The Java Random Class
The Java Random class implements a so-called truncated linear congruential generator (T-LCG) with secret seed ψ ∈ {0,1}^48. Random numbers are generated by invoking the next method of a Java Random object, which takes as input the requested number of random bits b (for 1 ≤ b ≤ 32), updates the seed as ψ = (αψ + β) mod m, and returns the topmost b bits of ψ, where α = 0x5DEECE66D, β = 0xB, and m = 2^48 are public constants. If more than 32 random bits are needed, next is called repeatedly until a sufficient number of bits has been generated. The security of T-LCGs has been widely studied and they were shown to be predictable [18], even if the generated sequence is not directly output [3]. In the case of the Java Random class, each iteration reveals b bits of the seed, leaving a remaining entropy of 48 − b bits. Furthermore, consecutive values can be used to build linear equations. For our analysis, we assume that the generated random value has length at least 64 bits, i.e., it was generated by two consecutive calls to the next method with b = 32. This holds for the FastGC framework [24], which uses a Java Random object to generate symmetric keys and the columns of the T matrix (we use the first 64 bits only). To predict the output of the Java Random object, we recover its secret seed ψ = ψ_1...ψ_48 from the 64-bit output d = d_1...d_64. Since the topmost 32 bits of the seed are directly used as output, we have ψ_17...ψ_48 = d_1...d_32 for the seed after the first call, and ψ'_17...ψ'_48 = d_33...d_64 for the updated seed ψ' = (αψ + β) mod m after the second call. Now, the remaining lower 16 bits ψ_1...ψ_16 can be recovered using this linear equation. Specifically, for each of the 2^16 possible values of ψ_1...ψ_16 we compute ((αψ + β) mod m) − (ψ'_17...ψ'_48) · 2^16. For the correct value of ψ the result is zero in the 32 most significant bits and thus smaller than 2^16, whereas for all other values it is larger (with high probability). In practice, this suffices for finding the entire seed ψ in 2^16 steps, which takes under a second.
The recovered secret seed ψ can then be used to predict the output of the Java Random object.
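The seed recovery can be reproduced in a few lines of Python. The sketch below models next(32), brute-forces the 16 hidden low bits of the state from two consecutive outputs, and, as a slight variant of the check described above, uses a third observed output to rule out accidental matches before predicting the next value.

```python
import secrets

A, C, M = 0x5DEECE66D, 0xB, 1 << 48   # public constants of java.util.Random

def java_next32(state):
    """One call to next(32): update the 48-bit state, output its top 32 bits."""
    state = (A * state + C) % M
    return state, state >> 16

def recover_state(d1, d2, d3):
    """Brute-force the 16 hidden low bits of the state behind d1 (2^16 steps);
    d3 is only used to rule out accidental 32-bit collisions."""
    for low in range(1 << 16):
        s1 = (d1 << 16) | low          # candidate full state after the 1st output
        s2 = (A * s1 + C) % M
        if s2 >> 16 != d2:
            continue
        s3 = (A * s2 + C) % M
        if s3 >> 16 == d3:
            return s3                  # full state after the 3rd output
    return None

# Demo: observe three outputs of a "secret" generator, then predict the fourth
state = secrets.randbits(48)
outs = []
for _ in range(4):
    state, d = java_next32(state)
    outs.append(d)

rec = recover_state(outs[0], outs[1], outs[2])
_, prediction = java_next32(rec)
assert prediction == outs[3]
```

The loop over 2^16 candidates finishes in well under a second, matching the attack cost stated above.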

Exploiting the Weak PRG in FastGC [24]
We demonstrate how the usage of the Java Random class in version v0.1.1 of the FastGC [24] framework can be exploited such that the sender can recover the input bits of the receiver using the T matrix generated in the OT extension protocol (cf. §2.2), and the receiver can recover the input bits of the sender using the sender's input keys to the garbled circuit. We implemented and verified both attacks on FastGC, which both run in less than a second. Note that both attacks are carried out on the honestly generated transcript, as required for the setting of semi-honest adversaries.
Recovering the Receiver's Inputs. The sender can recover the receiver's input bits using the T matrix, which is chosen randomly by the receiver in the OT extension (cf. §2.2). Upon receiving the matrix Q = [q^1 | ... | q^κ], the sender knows that q^i = t^i if s_i = 0, and q^i = t^i ⊕ r if s_i = 1. Hence, whenever s_i = 0, the sender obtains q^i = t^i and can recover an intermediate seed ψ of the Java Random object that was used to generate this column of T. Afterwards, the sender computes consecutive random outputs t^j for j > i until he obtains a column with q^j ≠ t^j, i.e., one where s_j = 1, which occurs with overwhelming probability 1 − (κ+1)/2^κ. Now, the sender can recover the receiver's input bits r by computing q^j ⊕ t^j = t^j ⊕ r ⊕ t^j = r.
Recovering the Sender's Inputs. The receiver can recover the sender's input bits using the sender's input keys to the garbled circuit. In FastGC, the sender generates a random symmetric key k_i ∈ {0,1}^κ for each of his input bits b_i ∈ {0,1} using the same Random object. If b_i = 0, he sends K_i = k_i to the receiver; else he sends K_i = k_i ⊕ (Δ||0), where Δ is the global offset of the free-XOR technique [32]. In order to recover the sender's ℓ input bits, the receiver iteratively computes a candidate seed with which K_i was generated, computes the next ℓ − i candidate keys k'_j (i < j ≤ ℓ), and checks whether the candidate seed generates a consistent view for the observed values K_j. If b_i = 0, then K_i = k_i and the receiver knows that he has recovered the correct seed by finding either k'_{i+1} ⊕ k'_{i+2} = K_{i+1} ⊕ K_{i+2} if there are at least two more input bits b_{i+1} = b_{i+2} = 1, or k'_j = K_j if another input bit is b_j = 0. Once the receiver has found such a b_i = 0, he can recover all subsequent input bits by checking whether k'_j = K_j (⇒ b_j = 0) or not (⇒ b_j = 1). If b_i = 1, then K_i = k_i ⊕ (Δ||0) and the receiver recovers a wrong seed, such that with very high probability neither k'_j = K_j nor k'_{i+1} ⊕ k'_{i+2} = K_{i+1} ⊕ K_{i+2} holds. Thus, the receiver knows that b_i = 1 and repeats the attack for i + 1. Note that this attack fails if the sender has fewer than three input bits or if all except the last two input bits of the sender are set to 1. In this case, however, the receiver can recover the input bits with high probability by using the remaining κ − 64 bits of the key to check whether the candidate seed is correct.
Securing FastGC [24]. Securing the FastGC framework is relatively easy, since Java also provides a cryptographically strong random number generator, called SecureRandom, which by default is implemented based on SHA-1. Replacing all usage of the Random class by SecureRandom increased the runtime of our experiments in §7 by around 0.5% to 4%, depending on the application. A complementary method to reduce the overhead in runtime is to use our correlated input OT extension of §5.4, which eliminates the need to generate a random T matrix, such that our attack for reconstructing the receiver's inputs no longer works. Nevertheless, all randomness that is needed (even for our method) must be generated using SecureRandom.

ALGORITHMIC OPTIMIZATIONS
In the following we describe algorithmic optimizations that improve the scalability and computational complexity of OT extension protocols. We identified computational bottlenecks in OT extension by micro-benchmarking the 1-out-of-2 OT extension implementation of [49]. We found that the combined computation time of S and R was mostly spent on three operations: the matrix transposition (43%), the evaluation of H, implemented with SHA-1 (33%), and the evaluation of G, implemented with AES (14%). To speed up OT extension, we propose to use parallelization ( §4.1) and an efficient algorithm for bit-matrix transposition ( §4.2). Note that these implementation optimizations are of a general nature and can be applied not only to our protocols, but also to other OT extension protocols with security against stronger active/malicious adversaries, e.g., [28,43]. As we will show later in our experiments in §6.2, both algorithmic improvements result in substantially faster runtimes, but only in settings where the computation is the bottleneck, i.e., over a fast network such as a LAN.

Blockwise Parallelized OT Extension
Previous OT extension implementations [8,49] improved the performance of OT extension by using a vertical pipelining approach, i.e., one thread is associated with each step of the protocol: the first thread evaluates the pseudo-random generator G and the second thread evaluates the correlation robust function H (cf. §2.2). However, as the evaluation of G is faster than the evaluation of H, the workload between the two threads is distributed unequally, causing idle time for the first thread. Additionally, this method for pipelining is designed to run exactly two threads and thus cannot easily be scaled to a larger number of threads. As observed in [20], a large number of OT extensions can be performed by sequentially running the OT extension protocol on blocks of fixed size. This reduces the total memory consumption at the expense of more communication rounds.
We propose to use a horizontal pipelining approach that splits the matrices processed in the OT extension protocol into independent blocks that can be processed in parallel using multiple threads with equal workload, i.e., each of the N threads evaluates the OT extension protocol for m/N inputs in parallel. Each thread uses a separate socket to communicate with its counterpart at the other party, such that network scheduling is done by the operating system.
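The block-splitting idea can be sketched as follows. This is a schematic illustration, not part of the paper's implementation: the per-block work is a stand-in hash loop instead of the real protocol steps, and the per-thread sockets are omitted.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib

def process_block(block_id, rows):
    """Stand-in for one independent OT-extension block: hash every row."""
    return [hashlib.sha1(f"{block_id}:{row}".encode()).digest() for row in rows]

m, N = 1024, 4                         # m OTs split across N equal blocks
inputs = list(range(m))
blocks = [inputs[i * (m // N):(i + 1) * (m // N)] for i in range(N)]

# each block is processed independently, so the threads have equal workload
with ThreadPoolExecutor(max_workers=N) as ex:
    results = list(ex.map(process_block, range(N), blocks))

flat = [digest for block in results for digest in block]
assert len(flat) == m
```

The key property carried over from the protocol is that the blocks share no state, so adding threads scales the throughput without the unequal G-vs-H workload of vertical pipelining.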

Efficient Bit-Matrix Transposition
The computational complexity of cryptographic protocols is often measured by counting the number of invocations of cryptographic primitives, since their evaluation often dominates the runtime. However, non-cryptographic operations can also have a high impact on the overall runtime, although they might seem insignificant in the protocol description. Matrix transposition is an example of such an operation. It is required during the OT extension protocol to transpose the m × κ bit matrix T (cf. §2.2), which is created column-wise but hashed row-wise. Although transposition is a trivial operation, it has to be performed individually for each entry in T, making it very costly. We propose to implement the matrix transposition efficiently using Eklundh's algorithm [10], which uses a divide-and-conquer approach to recursively swap elements of adjacent rows (cf. Fig. 2). This decreases the number of swap operations for transposing an n × n matrix from O(n^2) to O(n log_2 n). Additionally, since we process a bit matrix, we can perform multiple swap operations in parallel by loading multiple bits into one register. Thereby, we further reduce the number of swap operations from O(n log_2 n) to O((n/r) log_2 n), where r is the register size of the CPU (r = 64 for the machines used in our experiments). Jumping ahead to the evaluation in §6, this reduced the total time for the matrix transposition by approximately a factor of 9, from 7.1 s to 0.76 s per party.
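A recursive bit-level transpose in the spirit of Eklundh's algorithm can be sketched as follows: each row is packed into an n-bit integer, so every shift-and-XOR step swaps a whole block of entries at once. The implementation described above would operate on r-bit machine words; this Python version with arbitrary-precision integers is for illustration only.

```python
import secrets

def transpose_bits(rows):
    """Eklundh-style transpose of an n x n bit matrix, n a power of two.
    rows[i] holds row i as an n-bit integer; bit p of rows[i] is entry (i, p)."""
    rows = list(rows)
    n = len(rows)
    j = n // 2
    m = (1 << j) - 1                   # mask: low half of each 2j-bit block
    while j:
        for k in range(n):
            if k & j:
                continue               # visit each row pair (k, k + j) once
            # swap the high columns of rows[k] with the low columns of rows[k+j]
            t = ((rows[k] >> j) ^ rows[k + j]) & m
            rows[k] ^= t << j
            rows[k + j] ^= t
        j >>= 1
        m ^= m << j                    # refine mask for the next block size
    return rows

# check against the naive definition on a random 8 x 8 bit matrix
n = 8
a = [secrets.randbits(n) for _ in range(n)]
b = transpose_bits(a)
assert all((b[i] >> p) & 1 == (a[p] >> i) & 1
           for i in range(n) for p in range(n))
```

Each pass halves the block size, giving log_2 n passes of n/2 word operations, which is the O((n/r) log_2 n) swap count stated above when a word holds r bits.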

PROTOCOL OPTIMIZATIONS
In this section, we show how to efficiently base the GMW protocol on random 1-out-of-2 OTs ( §5.1), introduce a new OT protocol ( §5.2), outline an optimized OT extension protocol ( §5.3), and optimize OT extension for usage in secure computation protocols ( §5.4).
In the following, we present a different approach for generating multiplication triples using two random 1-out-of-2 OTs on bits (R-OT). The R-OT functionality is exactly the same as OT, except that the sender gets two random messages as output. Later, in §5.4, we will show that R-OT can be instantiated more efficiently than OT. In comparison to 1-out-of-4 bit OTs, using two R-OTs only slightly increases the computation complexity (one additional evaluation of G and H and two additional matrix transpositions), but improves the communication complexity by a factor of 2. In order to generate a multiplication triple, we first introduce the f_ab functionality, which is implemented in Algorithm 1 using R-OT. In the f_ab functionality, the parties hold no input and receive random bits ((a, u), (b, v)) under the constraint that ab = u ⊕ v. Now, note that for a multiplication triple c0 ⊕ c1 = (a0 ⊕ a1)(b0 ⊕ b1) = a0b0 ⊕ a0b1 ⊕ a1b0 ⊕ a1b1 holds. The parties can generate a multiplication triple by invoking the f_ab functionality twice: in the first invocation P0 acts as R to obtain (a0, u0) and P1 acts as S to obtain (b1, v1) with a0b1 = u0 ⊕ v1; in the second invocation P1 acts as R to obtain (a1, u1) and P0 acts as S to obtain (b0, v0) with a1b0 = u1 ⊕ v0. Finally, each party Pi sets ci = aibi ⊕ ui ⊕ vi. A proof sketch for security is given in Appendix B.
Algorithm 1 (f_ab from R-OT):
1: R chooses a random bit a ∈_R {0,1}.
2: S and R perform an R-OT with a as the choice bit of R. S obtains bits x_0, x_1 and R obtains bit x_a as output.
3: R sets u = x_a; S sets b = x_0 ⊕ x_1 and v = x_0.
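The following Python sketch illustrates Algorithm 1 and the triple generation, with the R-OT replaced by its ideal functionality (two uniformly random bits for S). The final assertion checks the multiplication-triple invariant c0 ⊕ c1 = (a0 ⊕ a1)(b0 ⊕ b1).

```python
import secrets

def f_ab():
    """Algorithm 1: one R-OT yields ((a, u), (b, v)) with a*b = u xor v."""
    a = secrets.randbelow(2)           # step 1: R's random choice bit
    x0, x1 = secrets.randbelow(2), secrets.randbelow(2)  # step 2: ideal R-OT
    u = x1 if a else x0                # R's output x_a
    b, v = x0 ^ x1, x0                 # step 3: S's outputs
    return (a, u), (b, v)

def mult_triple():
    """Two invocations of f_ab give one multiplication triple."""
    (a0, u0), (b1, v1) = f_ab()        # P0 acts as R, P1 as S
    (a1, u1), (b0, v0) = f_ab()        # P1 acts as R, P0 as S
    c0 = (a0 & b0) ^ u0 ^ v0           # P0 sets c0 = a0*b0 xor u0 xor v0
    c1 = (a1 & b1) ^ u1 ^ v1           # P1 sets c1 = a1*b1 xor u1 xor v1
    return (a0, b0, c0), (a1, b1, c1)

(a0, b0, c0), (a1, b1, c1) = mult_triple()
assert c0 ^ c1 == (a0 ^ a1) & (b0 ^ b1)
```

Correctness follows by XORing the two f_ab constraints into the expansion of (a0 ⊕ a1)(b0 ⊕ b1): the cross terms a0b1 and a1b0 are exactly u0 ⊕ v1 and u1 ⊕ v0.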

Optimized Oblivious Transfer
The best known protocols for oblivious transfer with security in the presence of semi-honest adversaries are those of Naor-Pinkas [40]. They present two protocols: a more efficient protocol that is secure in the random oracle model, and a less efficient protocol that is secure in the standard model under standard assumptions. In this section, we describe a new semi-honest OT protocol that is secure in the standard model and is essentially an optimized instantiation of the OT protocol of [12]. When implemented over elliptic curves, our protocol is about three times faster than the standard model OT of [40] and only two times slower than the random oracle OT of [40] (see §6.1 for a comparison of the protocol runtimes). Hence, our protocol is a good alternative for those who prefer not to rely on random oracles. Our n×OT_ℓ protocol is based on the DDH assumption and uses a key derivation function (KDF); see Definition A.1. We also assume that it is possible to sample a random element of the group such that the DDH assumption remains hard even when the coins used to sample the element are given to the distinguisher (i.e., (g, h, g^a, h^a) is indistinguishable from (g, h, g^a, g^b) for random a, b, even given the coins used to sample h). This holds for all known groups in which the DDH problem is assumed to be hard and can be implemented as described next. For finite fields, one can sample a random element h ∈ Z_p of order q by choosing a random x ∈_R Z_p and computing h = x^{(p−1)/q} until h ≠ 1. For elliptic curves, one chooses a random x-coordinate, obtains a quadratic equation for the y-coordinate, and randomly chooses one of the solutions as h (if no solution exists, one starts from the beginning).
The computational complexity of our protocol for n×OT_ℓ is 2n exponentiations for the sender S and 2n fixed-base exponentiations for the receiver R (in fixed-base exponentiations, the same base g is raised to many different exponents; more efficient exponentiation algorithms exist for this case [38, Sec. 14.6.3]). In addition, S computes the KDF 2n times and R computes it n times. R samples n random group elements as described above. See Protocol 5.1 for a detailed description of the protocol. Inputs: S holds n pairs (x_i^0, x_i^1) of ℓ-bit strings, for every 1 ≤ i ≤ n. R holds the selection bits σ = (σ_1, ..., σ_n). The parties agree on a group (G, q, g) for which the DDH assumption holds, and on a key derivation function KDF. First Round (Receiver): Choose a random exponent α_i ∈_R Z_q and a random group element h_i ∈_R G for every 1 ≤ i ≤ n. Then, for every i, set (h_i^0, h_i^1) as follows: h_i^{σ_i} = g^{α_i} and h_i^{1−σ_i} = h_i, and send the n pairs (h_i^0, h_i^1) to S. Second Round (Sender): Choose a random r ∈_R Z_q and compute u = g^r. For every 1 ≤ i ≤ n, compute v_i^0 = x_i^0 ⊕ KDF((h_i^0)^r) and v_i^1 = x_i^1 ⊕ KDF((h_i^1)^r), and send u and the n pairs (v_i^0, v_i^1) to R. Output: For every 1 ≤ i ≤ n, R outputs x_i^{σ_i} = v_i^{σ_i} ⊕ KDF(u^{α_i}); S has no output.
The protocol is secure in the presence of a semi-honest adversary (see Definition A.3). The view of a corrupted sender consists of the pairs {(h^0_i, h^1_i)}_{i=1}^n, which are completely independent of the receiver's inputs and can therefore be simulated perfectly. For a corrupted receiver, we need to show the existence of a simulator S1 that produces a computationally indistinguishable view, given the inputs and outputs of the receiver, i.e., σ and (x^{σ_1}_1, ..., x^{σ_n}_n), without knowing the other sender values (x^{1−σ_1}_1, ..., x^{1−σ_n}_n). S1 works by running an execution of the protocol, playing an honest S that uses the known values x^{σ_i}_i in the positions selected by σ and arbitrary (say, all-zero) strings in the other positions. By the DDH assumption, the values (KDF((h^{1−σ_1}_1)^r), ..., KDF((h^{1−σ_n}_n)^r)) are indistinguishable from n uniform strings z_1, ..., z_n, each of length ℓ (even when the distinguisher sees G, q, g, u = g^r). This implies that the values in the real execution are computationally indistinguishable from those in the simulation.
An additional optimization for random OT. When constructing OT extensions (see §2.2), the parties first run κ×OTκ on random inputs (this holds for our optimized OT extension protocol, and also for the original protocol of [28] if κ×OTm is implemented via κ×OTκ as described in §2.2). Observe that in this case, the sender only needs to send u = g^r to the receiver R; the parties can then derive the values locally (S by computing x^0_i = KDF((h^0_i)^r) and x^1_i = KDF((h^1_i)^r), and R by computing x^{σ_i}_i = KDF(u^{α_i})). This reduces the communication since the elements v^0_i and v^1_i do not have to be sent. In addition, the messages sent by S and R are now independent of each other, and so the protocol consists of a single round of communication. (As pointed out in [43], this optimization can also be applied to the protocols of Naor-Pinkas [40]. However, those protocols still require two rounds of communication, which can be a drawback in high-latency networks.) The timings in §7 are for an implementation that uses this additional optimization.
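The single-round random-OT flow can be sketched end-to-end. This is a toy sketch under illustrative assumptions: a tiny prime-order subgroup (p = 23, q = 11, g = 4), SHA-256 standing in for the KDF, and the random group element h sampled via a random exponent for brevity; the helper names are ours, not the paper's.

```python
# Toy single-round random OT: R sends (h^0_i, h^1_i), S sends only u = g^r,
# and both sides derive the transferred strings locally via the KDF.
import hashlib, random

p, q, g = 23, 11, 4                      # g generates the order-11 subgroup of Z_23^*
kdf = lambda v: hashlib.sha256(str(v).encode()).hexdigest()[:16]

def receiver_msg(sigma):
    """R's message: g^alpha in position sigma, a random element elsewhere."""
    alpha = random.randrange(1, q)
    h = pow(g, random.randrange(1, q), p)
    pair = (pow(g, alpha, p), h) if sigma == 0 else (h, pow(g, alpha, p))
    return pair, alpha

def sender_derive(pair, r):
    """S's two random output strings, derived locally from R's message."""
    h0, h1 = pair
    return kdf(pow(h0, r, p)), kdf(pow(h1, r, p))

sigma = random.randrange(2)
pair, alpha = receiver_msg(sigma)        # R -> S (independent of S's message)
r = random.randrange(1, q)
u = pow(g, r, p)                         # S -> R: only u is sent
x0, x1 = sender_derive(pair, r)
assert kdf(pow(u, alpha, p)) == (x0 if sigma == 0 else x1)   # R gets x^sigma
```

Since neither message depends on the other, both can be sent in parallel, which is exactly the single-round property noted above.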

Optimized General OT Extension
In the following, we optimize the m×OTℓ extension protocol of [28], described in §2.2. Recall that in the first step of the protocol in [28], R chooses a large m × κ matrix T = [t^1 | ... | t^κ] while S waits idly. The parties then engage in a κ×OTm protocol, where the inputs of the receiver are (t^i, t^i ⊕ r), where r is its input in the outer m×OTℓ protocol (m selection bits). After the OT, S holds t^i ⊕ (s_i · r) for every 1 ≤ i ≤ κ. As described in the appendices of [26,28], the protocol can be modified such that R only needs to choose two small κ × κ matrices K^0 = [k^0_1 | ... | k^0_κ] and K^1 = [k^1_1 | ... | k^1_κ] of seeds. These seeds are used as input to κ×OTκ; specifically, R's input as sender in the i-th OT is (k^0_i, k^1_i) and, as in [28], the input of S is s_i. To transfer the m-bit tuple (t^i, t^i ⊕ r) in the i-th OT, R expands k^0_i and k^1_i using a pseudorandom generator G. Our main observation is that, instead of choosing t^i randomly, we can set t^i = G(k^0_i). Now, R needs to send only one m-bit element u^i = G(k^0_i) ⊕ G(k^1_i) ⊕ r to S (whereas in the previous protocols of [26,28] two m-bit elements were sent). Observe that if S had input s_i = 0 in the i-th OT, then it can just define its output q^i to be G(k^0_i) = G(k^{s_i}_i) = t^i. In contrast, if S had input s_i = 1 in the i-th OT, then it can define its output q^i to be u^i ⊕ G(k^1_i) = G(k^0_i) ⊕ r = t^i ⊕ r. The full description of our protocol is given in Protocol 5.2. This optimization is significant in applications of m×OTℓ extension where m is very large and ℓ is short, such as in GMW. In typical use-cases for GMW (cf. §7), m ranges from several millions to a billion, while ℓ is one. Thereby, the communication complexity of GMW is almost reduced by half. In addition, as in [26], observe that, unlike in [28], the initial OT phase in Protocol 5.2 is completely independent of the actual inputs of the parties.
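The observation above can be checked at the bit level for a single column i. This is an illustrative sketch, not the paper's code: G is modeled by SHA-256 in counter mode (a stand-in for a PRG), and the column length m is a toy value measured in bytes.

```python
# Sketch: with t_i = G(k_i^0) and u_i = G(k_i^0) XOR G(k_i^1) XOR r,
# the sender reconstructs q_i = t_i XOR (s_i * r) from one m-bit message.
import hashlib, os

def G(seed, m):
    """Expand a seed to m bytes (toy PRG: SHA-256 in counter mode)."""
    out, ctr = b'', 0
    while len(out) < m:
        out += hashlib.sha256(seed + ctr.to_bytes(4, 'big')).digest()
        ctr += 1
    return out[:m]

xor = lambda a, b: bytes(x ^ y for x, y in zip(a, b))

m = 32                                   # toy column length (in bytes)
r = os.urandom(m)                        # receiver's selection string
k0, k1 = os.urandom(16), os.urandom(16)  # the two seeds for column i
t_i = G(k0, m)                           # receiver sets t_i = G(k_i^0)
u_i = xor(xor(G(k0, m), G(k1, m)), r)    # the single message R -> S

for s_i, k in ((0, k0), (1, k1)):        # sender holds only k = k_i^{s_i}
    q_i = G(k, m) if s_i == 0 else xor(u_i, G(k, m))
    assert q_i == (t_i if s_i == 0 else xor(t_i, r))   # q_i = t_i XOR (s_i * r)
```

The sender never learns the other seed, so from its point of view u_i is masked by the pseudorandom string G(k^{1−s_i}_i).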
Thus, the parties can perform the initial OT phase before their inputs are determined. (We remark that, in order to prove the security of the random-OT optimization above in the standard model, i.e., without a random oracle, we need to change the ideal functionality for the random OT such that for every i, the output of the sender is (β^0_i, x^0_i = KDF(g^{β^0_i})) and (β^1_i, x^1_i = KDF(g^{β^1_i})), and the output of the receiver is (σ_i, β^{σ_i}_i, KDF(g^{β^{σ_i}_i})). That is, in addition to receiving their input and output from the random OT functionality, the parties receive the "discrete log" of the pertinent values. This additional information is of no consequence in our applications of random OT.) Finally, another problem that arises in the original protocol of [28] is that the entire m × κ matrix is transmitted and processed at once. This means that the number of OTs must be predetermined and, if m is very large, this results in considerable latency as well as memory management issues. As in [20], our optimization enables us to process small blocks of the matrix at a time, reducing latency, computation time, and memory management problems. In addition, it is possible to continually extend OTs, with no a priori bound on m. This is very useful in a secure computation setting, where parties may interact many times with no a priori bound.
The parties invoke the κ×OTκ functionality, where S plays the receiver with input s and R plays the sender with inputs (k^0_i, k^1_i); S thus obtains k^{s_i}_i for every 1 ≤ i ≤ κ. Let Q = [q^1 | ... | q^κ] denote the matrix held by S in which the i-th column is q^i, and let q_j denote the j-th row of Q. (Note that q_j = (r_j · s) ⊕ t_j.) S sends (y^0_j, y^1_j) for every 1 ≤ j ≤ m, where y^0_j = x^0_j ⊕ H(q_j) and y^1_j = x^1_j ⊕ H(q_j ⊕ s). R computes x^{r_j}_j = y^{r_j}_j ⊕ H(t_j) for every 1 ≤ j ≤ m. Output: R outputs (x^{r_1}_1, ..., x^{r_m}_m); S has no output.
(This phase can be iterated. Specifically, R can compute the next κ bits of each t^i and u^i (by applying G to obtain the next κ bits from the PRG for each of the seeds, and using the next κ bits of its input r) and send the block of κ×κ bits to S (κ bits from each of u^1, ..., u^κ).)
Theorem 5.3. Assuming that G is a pseudorandom generator and H is a correlation-robust function (as in Definition A.2), Protocol 5.2 privately computes the m×OTℓ functionality in the presence of semi-honest adversaries, in the κ×OTκ-hybrid model.

Proof:
We first show that the protocol indeed implements the m×OTℓ functionality. Then, we prove that the protocol is secure when the sender is corrupted, and finally that it is secure when the receiver is corrupted.
Correctness. We show that the output of the receiver is (x^{r_1}_1, ..., x^{r_m}_m) in an execution of the protocol where the inputs of the sender are ((x^0_1, x^1_1), ..., (x^0_m, x^1_m)) and the input of the receiver is r = (r_1, ..., r_m). Let 1 ≤ j ≤ m; we show that z_j = x^{r_j}_j. We have two cases: 1. r_j = 0: Recall that q_j = (r_j · s) ⊕ t_j, and so q_j = t_j. Thus, z_j = y^0_j ⊕ H(t_j) = x^0_j ⊕ H(q_j) ⊕ H(t_j) = x^0_j. 2. r_j = 1: In this case q_j = s ⊕ t_j, and so z_j = y^1_j ⊕ H(t_j) = x^1_j ⊕ H(q_j ⊕ s) ⊕ H(t_j) = x^1_j. Corrupted Sender. The view of the sender during the protocol contains the output from the κ×OTκ invocation and the messages u^1, ..., u^κ. The simulator S0 simply outputs a uniform string s ∈ {0,1}^κ (which is the only randomness that S chooses in the protocol, and therefore w.l.o.g. can be interpreted as the random tape of the adversary), κ random seeds k^{s_1}_1, ..., k^{s_κ}_κ chosen uniformly from {0,1}^κ, and κ random strings u^1, ..., u^κ chosen uniformly from {0,1}^m. In the real execution, (s, k^{s_1}_1, ..., k^{s_κ}_κ) are chosen in exactly the same way. Since each value k^{1−s_i}_i is unknown to S (by the security of the κ×OTκ functionality), G(k^{1−s_i}_i) is indistinguishable from uniform, and so each u^i is indistinguishable from uniform. Therefore, the view of the corrupted sender in the simulation is indistinguishable from its view in a real execution.
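The two correctness cases can be verified mechanically for a single row j. This is a toy sketch with random row values; SHA-1 stands in for H (as in the paper's instantiation, though here applied to raw bytes for simplicity).

```python
# Sketch: for both values of r_j, z_j = y_j^{r_j} XOR H(t_j) = x_j^{r_j},
# given q_j = (r_j * s) XOR t_j and the masked messages y_j^0, y_j^1.
import hashlib, os

H = lambda v: hashlib.sha1(v).digest()[:16]
xor = lambda a, b: bytes(x ^ y for x, y in zip(a, b))

s = os.urandom(16)
x0, x1 = os.urandom(16), os.urandom(16)
t_j = os.urandom(16)

for r_j in (0, 1):
    q_j = t_j if r_j == 0 else xor(t_j, s)     # q_j = (r_j * s) XOR t_j
    y0 = xor(x0, H(q_j))                       # sender's two messages
    y1 = xor(x1, H(xor(q_j, s)))
    z_j = xor(y0 if r_j == 0 else y1, H(t_j))  # receiver's computation
    assert z_j == (x0 if r_j == 0 else x1)
```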
Corrupted Receiver. The view of the corrupted receiver consists of its random tape and the messages ((y^0_1, y^1_1), ..., (y^0_m, y^1_m)) only. The simulator S1 is invoked with the inputs and outputs of the receiver, i.e., r = (r_1, ..., r_m) and (x^{r_1}_1, ..., x^{r_m}_m). S1 chooses a random tape ρ for the adversary (which determines the k^0_i, k^1_i values), defines the matrix T, and computes y^{r_j}_j = x^{r_j}_j ⊕ H(t_j). Then, it chooses each y^{1−r_j}_j uniformly and independently at random from {0,1}^ℓ. Finally, it outputs (ρ, (y^0_1, y^1_1), ..., (y^0_m, y^1_m)) as the view of the corrupted receiver. We now show that the output of the simulator is indistinguishable from the view of the receiver in a real execution. If r_j = 0, then q_j = t_j and thus (y^0_j, y^1_j) = (x^0_j ⊕ H(t_j), x^1_j ⊕ H(t_j ⊕ s)). If r_j = 1, then q_j = t_j ⊕ s and therefore (y^0_j, y^1_j) = (x^0_j ⊕ H(t_j ⊕ s), x^1_j ⊕ H(t_j)). In the simulation, the values y^{r_j}_j are computed as x^{r_j}_j ⊕ H(t_j) and are therefore identical to the real execution. It thus remains to show that the values (y^{1−r_1}_1, ..., y^{1−r_m}_m) as computed in the real execution are indistinguishable from the random strings output in the simulation. As we have seen, in the real execution each y^{1−r_j}_j equals x^{1−r_j}_j ⊕ H(t_j ⊕ s). Since H is a correlation-robust function, it holds that (t_1, ..., t_m, H(t_1 ⊕ s), ..., H(t_m ⊕ s)) is computationally indistinguishable from (t_1, ..., t_m, U_ℓ, ..., U_ℓ) for random s, t_1, ..., t_m ∈ {0,1}^κ, where U_a denotes the uniform distribution over {0,1}^a (see Definition A.2). In the protocol we derive the values t_1, ..., t_m by applying a pseudorandom generator G to the seeds k^0_1, ..., k^0_κ and transposing the resulting matrix. We need to show that the values H(t_1 ⊕ s), ..., H(t_m ⊕ s) are still indistinguishable from uniform in this case. This follows from a straightforward hybrid argument (namely, replacing truly random t_i values in the input to H with pseudorandom values preserves the correlation robustness of H). We conclude that the ideal and real distributions are computationally indistinguishable.

Optimized OT Extension in Yao & GMW
The protocol described in §5.3 implements the m×OTℓ functionality. In the following, we present further optimizations that are specifically tailored to the use of OT extensions in the secure computation protocols of Yao and GMW.
Correlated OT (C-OT) for Yao. Before proceeding to the optimization, let us focus for a moment on Yao's protocol [51] with the free-XOR [32] and point-and-permute [37] techniques. Using these techniques, the sender does not choose the keys for all wires independently. Rather, it chooses a global random value δ ∈R {0,1}^{κ−1}, sets Δ = δ||1, and for every wire w it chooses a random key k^0_w ∈R {0,1}^κ and sets k^1_w = k^0_w ⊕ Δ. Later in the protocol, the parties invoke OT extension to let the receiver obliviously obtain the keys associated with its inputs. This effectively means that, instead of having to obliviously transfer two fixed independent bit strings, the sender needs to transfer two random bit strings with a fixed correlation. We can utilize this constraint on the inputs to save additional bandwidth in the OT extension protocol. Recall that in the last step of Protocol 5.2 for OT extension, S computes and sends the messages y^0_j = x^0_j ⊕ H(q_j) and y^1_j = x^1_j ⊕ H(q_j ⊕ s). In the case of Yao, we have x^0_j = k^0_w and x^1_j = k^1_w = k^0_w ⊕ Δ. Since k^0_w is just a random value, S can set k^0_w = H(q_j) and send the single value y_j = Δ ⊕ H(q_j) ⊕ H(q_j ⊕ s). R defines its output as H(t_j) if r_j = 0, or as y_j ⊕ H(t_j) if r_j = 1. Observe that if r_j = 0, then t_j = q_j and R outputs H(q_j) = x^0_j = k^0_w, as required. In contrast, when r_j = 1, it holds that t_j = q_j ⊕ s, and thus y_j ⊕ H(q_j ⊕ s) = Δ ⊕ H(q_j) = Δ ⊕ k^0_w = k^1_w, as required. Thus, in the setting of Yao's protocol with the free-XOR technique, it is possible to save bandwidth. As the keys k^0_w, k^1_w used in Yao's protocol are also of length κ, the bandwidth is reduced from 3κ bits transmitted in every iteration of the extension phase to 2κ bits, effectively reducing the bandwidth by one third. Proving the security of this optimization requires assuming that H is a random oracle, in order to "program" the output to be as derived from the OT extension.
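The C-OT derivation above can be checked for a single wire. This is a toy sketch with random values; SHA-1 models H, and the variable names mirror the text (k0, k1 for k^0_w, k^1_w; delta for Δ).

```python
# Sketch: the sender sends one value y_j, yet both parties end up with
# free-XOR-correlated keys (k_w^0, k_w^1 = k_w^0 XOR Delta).
import hashlib, os

H = lambda v: hashlib.sha1(v).digest()[:16]
xor = lambda a, b: bytes(x ^ y for x, y in zip(a, b))

s, t_j, delta = os.urandom(16), os.urandom(16), os.urandom(16)

for r_j in (0, 1):
    q_j = t_j if r_j == 0 else xor(t_j, s)
    k0 = H(q_j)                          # sender sets k_w^0 = H(q_j)
    k1 = xor(k0, delta)                  # free-XOR: k_w^1 = k_w^0 XOR Delta
    y_j = xor(delta, xor(H(q_j), H(xor(q_j, s))))   # single message S -> R
    out = H(t_j) if r_j == 0 else xor(y_j, H(t_j))  # receiver's derivation
    assert out == (k0 if r_j == 0 else k1)          # R holds k_w^{r_j}
```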
In addition, we define a different OT functionality, called correlated OT (C-OT), that receives Δ and chooses the sender's inputs uniformly under the constraint that their XOR equals Δ. Since Yao's protocol uses random keys under the same constraint, its security remains unchanged when using this optimized OT extension. Note that when using the correlated OT extension protocol, the sender needs to garble the circuit after performing the OT extension; this order is also needed for the pipelining approach used in many implementations, e.g., [24,34,36]. We remark that this optimization can be used in the more general case where in each pair one of the inputs is chosen uniformly at random and the other input is computed as a function of the first. Specifically, the sender has a different function f_j for every 1 ≤ j ≤ m, and receives random values x^0_j as output from the extension protocol, which defines x^1_j = f_j(x^0_j). E.g., for Yao's garbled circuits protocol, we have x^1_j = f_j(x^0_j) = Δ ⊕ x^0_j. Random OT (R-OT) for GMW. When using OT extensions to implement the GMW protocol, efficiency can be improved even further. In this case, the inputs for S in every OT are independent random bits b^0 and b^1 (see §5.1 for how to evaluate AND gates using two random OTs).
Thus, the sender can let the random OT extension functionality R-OT determine both of its inputs randomly. This is achieved in the OT extension protocol by having S define b^0 = H(q_j) and b^1 = H(q_j ⊕ s). Then, R computes b^{r_j} simply as H(t_j). The receiver's output is correct because q_j = (r_j · s) ⊕ t_j, and thus H(t_j) = H(q_j) when r_j = 0, and H(t_j) = H(q_j ⊕ s) when r_j = 1. With this optimization, the entire communication in the OT extension protocol consists only of the initial base OTs together with the messages u^1, ..., u^κ; there are no y_j messages at all. This is a dramatic improvement in bandwidth. As above, proving the security of this optimization requires assuming that H is a random oracle, in order to "program" the output to be as derived from the OT extension. In addition, the OT functionality is changed such that the sender receives both of its inputs from the functionality, and the receiver just inputs r (see [43, Fig. 26]).
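The R-OT case is even simpler to check: no message from S to R is needed at all. A toy sketch, with SHA-1 truncated to a single bit standing in for H on 1-bit strings (ℓ = 1, as in GMW).

```python
# Sketch: the sender derives both random bits from q_j, the receiver
# derives its bit from t_j, and no y_j message is sent.
import hashlib, os

Hbit = lambda v: hashlib.sha1(v).digest()[0] & 1   # H truncated to one bit
xor = lambda a, b: bytes(x ^ y for x, y in zip(a, b))

s, t_j = os.urandom(16), os.urandom(16)

for r_j in (0, 1):
    q_j = t_j if r_j == 0 else xor(t_j, s)
    b0 = Hbit(q_j)                  # sender's two random bits
    b1 = Hbit(xor(q_j, s))
    b_r = Hbit(t_j)                 # receiver's bit, derived locally
    assert b_r == (b0 if r_j == 0 else b1)
```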
Summary. The original OT extension protocol of [28] and our proposed improvements for m×OTℓ are summarized in Tab. 2. We compare the communication complexity of R and S for m parallel 1-out-of-2 OT extensions of ℓ-bit strings, with security parameter κ (we omit the cost of the initial κ×OTκ). We also compare the assumption on the function H needed in each protocol, where CR denotes correlation-robustness and RO denotes random oracle.

EXPERIMENTAL EVALUATION
In the following, we evaluate the performance of our proposed optimizations. In §6.1 we compare our base OT protocol (§5.2) to the protocols of [40], and in §6.2 we evaluate the performance of our algorithmic (§4) and protocol optimizations (§5.3 and §5.4) for OT extension.
Benchmarking Environment. We build upon the C++ OT extension implementation of [49] which implements the OT extension protocol of [28] and is based on the implementation of [8]. We use SHA-1 to instantiate the random oracle and the correlation robust function and AES-128 in counter mode to instantiate the pseudo-random generator and the key derivation function. Our benchmarking environment consists of two 2.5 GHz Intel Core2Quad CPU (Q8300) Desktop PCs with 4 GB RAM, running Ubuntu 10.10 and OpenJDK 6, connected by a Gigabit LAN.

Base OTs
In the following, we compare the performance of the OT protocols of Naor and Pinkas [40] in the random oracle (RO) and standard (STD) model to our STD-model OT protocol of §5.2 for different libraries. We use either finite field cryptography (FFC) (based on the GNU Multi-Precision library v.5.0.5) or elliptic curve cryptography (ECC) (based on the Miracl library v.5.6.1). We measure the time for performing κ 1-out-of-2 base OTs on κ-bit strings, for symmetric security parameter κ, using the key sizes from Tab. 1. The runtimes are shown in Tab. 3. For the short-term security parameter, FFC using GMP outperforms ECC using Miracl by a factor of 2 for all protocols. However, starting from the medium-term security parameter, ECC becomes increasingly more efficient and outperforms FFC by more than a factor of 2 for the long-term security parameter. For ECC, we observe that [40]-RO is about 5-6 times faster than [40]-STD but only 2 times faster than our §5.2-STD protocol. For FFC, our §5.2-STD protocol becomes less efficient with increasing security parameter, since the random sampling requires nearly full-range exponentiations, as opposed to the subgroup exponentiations in [40].

OT Extension
To evaluate the performance of OT extension, we measure the time for generating the random inputs and for the overall OT extension protocol execution on 10,000,000 1-out-of-2 OTs on 80-bit strings for the short-term security setting, excluding the time for the base OTs. Tab. 4 summarizes the resulting runtimes for the original version without (Orig [49] (1 T)) and with pipelining (Orig [49] (2 T)), the efficient matrix transposition (EMT §4.2), the general protocol optimization (G-OT §5.3), the correlated OT extension protocol (C-OT §5.4), the random OT extension protocol (R-OT §5.4), as well as two- and four-threaded versions of R-OT (2 T and 4 T, cf. §4.1). The label (x T) denotes the number of threads running on each party. Since our optimizations target both the runtime and the amount of data transferred, we consider two bandwidth scenarios: LAN (Gigabit Ethernet with 1 GBit bandwidth) and WiFi (simulated by limiting the available bandwidth to 54 MBit and the latency to 2 ms). As our experiments in Tab. 4 show, the LAN setting benefits from computation optimizations (as computation is the bottleneck), whereas the WiFi setting benefits from communication optimizations (as the network is the bottleneck). All timings are the average of 100 executions with one party acting as sender and the other as receiver. Note that each version includes all previously listed optimizations. LAN setting. The original OT extension implementation of [49] has a runtime of 20.61 s without pipelining, which is reduced to 80% (16.57 s) when using pipelining. Implementing the efficient matrix transposition of §4.2 decreases the runtime to 70% of the one-threaded original version (14.43 s) and already outperforms the pipelined version even though only one thread is used. The general improved OT extension protocol of §5.3 removes the need to send the second m-bit string per base OT, reducing the runtime further. WiFi setting.
In the WiFi setting, we observe that the one- and two-threaded original implementations are already slower than in the LAN setting. Moreover, all optimizations that purely target the runtime have little effect, since the network has become the bottleneck. We therefore focus on the optimizations for the communication complexity. The G-OT optimization of §5.3 only slightly decreases the runtime, since both parties have the same up- and download bandwidth and the channel from sender to receiver becomes the bottleneck (cf. Tab. 2). The C-OT extension of §5.4 reduces the runtime by a factor of 2, corresponding to the reduced communication from sender to receiver, which is now equal to the communication in the opposite direction. The R-OT extension of §5.4 only slightly decreases the runtime, since now the channel from receiver to sender has become the bottleneck. Finally, the multi-threading optimization of §4.1 does not reduce the runtime, as the network is the bottleneck.

APPLICATION SCENARIOS
OT extension is the foundation for efficient implementations of many secure computation protocols, including Yao's garbled circuits implemented in the FastGC framework [24] and GMW implemented in the framework of [8,49]. To demonstrate how both protocols benefit from our improved OT extensions, we apply our implementations to both frameworks and consider the following secure computation use-cases: Hamming distance (§7.1), set-intersection (§7.2), minimum (§7.3), and Levenshtein distance (§7.4). The overall performance results are summarized in Tab. 5 and discussed in §7.5. All experiments were performed under the same conditions as in §6 (LAN setting), using the random-oracle protocol of [40] as base OT. We extended the FastGC framework [24] to call our C++ OT implementation using the Java Native Interface (JNI). We stress that the goal of our performance measurements is to highlight the efficiency gains of our improved OT protocols, not to provide a comparison between Yao's garbled circuits and the GMW protocol. (For shorter strings, or if the channel had a higher bandwidth from sender to receiver (e.g., a DSL link), the runtime would already decrease with the G-OT optimization.)

Hamming Distance
The Hamming distance between two ℓ-bit strings is the number of positions in which the strings differ. Applications of secure Hamming distance computation include privacy-preserving face recognition [46] and private matching for cardinality threshold [29]. As shown in [24,49], a circuit-based approach is a very efficient way to securely compute the face recognition algorithm of [46], which uses ℓ = 900. We use the compact Hamming distance circuit of [6] with ℓ − HW(ℓ) AND gates and ℓ input bits for the client, where HW(ℓ) is the Hamming weight of ℓ.
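The size formula above is easy to evaluate for the face-recognition parameter. A quick check (the helper name `and_gates` is ours, for illustration):

```python
# Sketch: the circuit of [6] has ell - HW(ell) AND gates, where HW(ell)
# is the Hamming weight of ell; evaluated here for ell = 900.
def and_gates(ell):
    return ell - bin(ell).count('1')

assert bin(900).count('1') == 4     # 900 = 0b1110000100, four set bits
assert and_gates(900) == 896        # 900 - 4 AND gates
```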

Set-Intersection
Privacy-preserving set-intersection allows two parties, each holding a set of σ-bit elements, to learn the elements they have in common. Applications include governmental law enforcement [9], sharing location data [41], and botnet detection [39]. Several Boolean circuits for computing the set-intersection were described and evaluated in [23]. The authors of [23] state that for small σ (up to σ = 20 in their experiments), the bitwise-AND (BWA) circuit achieves the best performance. This circuit treats each element e ∈ {0,1}^σ as an index into a bit-sequence {0,1}^{2^σ} and denotes the presence of e by setting the respective bit to 1. The parties then compute the set-intersection as the bitwise AND of their bit-sequences. We build the BWA circuit for σ = 20, resulting in a circuit with 2^σ = 1,048,576 AND gates and input bits for the client. To reduce the memory footprint of the FastGC framework [24], we split the overall circuit and the OTs on the input bits into blocks of size 2^16 = 65,536.
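The BWA encoding can be illustrated in plaintext (the secure circuit evaluates the same bitwise AND gate-by-gate). A toy sketch with σ = 4 and illustrative sets:

```python
# Sketch: each party encodes its sigma-bit elements as a 2^sigma-bit
# characteristic vector; the intersection is the bitwise AND of the vectors.
sigma = 4

def encode(elements):
    """Characteristic bit-vector of a set of sigma-bit elements."""
    v = 0
    for e in elements:
        v |= 1 << e                 # set the bit indexed by e
    return v                        # a 2^sigma-bit vector

a = encode({1, 3, 7, 9})
b = encode({3, 4, 9, 15})
both = a & b                        # one AND gate per bit position: 2^sigma gates
intersection = {i for i in range(2 ** sigma) if both >> i & 1}
assert intersection == {3, 9}
```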

Secure Minimum
Securely computing the minimum of a set of values is a common building block in privacy-preserving protocols and is used to find best matches, e.g., for face recognition [11] or online marketplaces [8]. We use the scenario considered in [36] that securely computes the minimum of N = 1,000,000 ℓ = 20-bit values, where each party holds 500,000 values. Using the minimum circuit construction of [31], our circuit has 2ℓ(N − 1) ≈ 40,000,000 AND gates and the client has (N/2) · ℓ = 10,000,000 input bits. We note that the performance of the garbled circuit implementation of [36] is about the same as that of FastGC [24]: their circuit has twice the size and takes about twice as long to evaluate. For the FastGC framework we again evaluate the overall circuit by iteratively computing the minimum of at most 2,048 values.

Levenshtein Distance
Table 5: Performance results for the frameworks of [24] and [49] with and without our optimized OT implementation. The time spent in the OT extensions is given in parentheses.

The Levenshtein distance denotes the number of operations needed to transform a string a into another string b over an alphabet of bit-size σ. It can be used for privacy-preserving matching of DNA and protein sequences [24]. We use the same circuit and setting as [24] with σ = 2 to compare strings a and b of size |a| = 2,000 and |b| = 10,000. The resulting circuit has 1.29 billion AND gates and σ|a| = 4,000 input bits for the client. The GMW framework of [49] was not able to evaluate the Levenshtein circuit, since its OT extension implementation tries to process all OTs at once and the framework tries to store the whole circuit in memory, thereby exceeding the available memory of our benchmarking environment. Hence, we changed the underlying circuit structure to support large-scale circuits by deleting gates that have already been used and building the circuit iteratively.
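For reference, the function that the Levenshtein circuit evaluates is the standard edit-distance dynamic program; the secure circuit computes the same recurrence gate-by-gate over the (|a|+1) × (|b|+1) table. A plaintext sketch (ours, for illustration):

```python
# Sketch: row-by-row Levenshtein dynamic program; entry (i, j) is the
# distance between the first i characters of a and the first j of b.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))          # row 0: distance from empty prefix
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

assert levenshtein("kitten", "sitting") == 3
```

Keeping only two rows in memory mirrors the blockwise evaluation strategy the text describes for large-scale circuits.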

Discussion
We discuss the results of our experiments in Tab. 5 next. For the FastGC framework [24], our improved OT extension implementation, written in C++ and using 4 threads, is more than an order of magnitude faster than the corresponding single-threaded Java routine of the original implementation. The improvement in total time depends on the ratio between the number of client inputs and the circuit size: for circuits with many client inputs (§7.1, §7.2, §7.3), we obtain a speedup by a factor of 2 to 9, whereas for large circuits with few inputs (§7.4) the improvement in the OTs has a negligible effect on the total runtime. To further improve the runtime for large circuits, a faster engine for circuit garbling, e.g., [4], could be combined with our improved OT implementation. For the GMW framework [49], the total runtime is dominated by the time for performing OT extension, which we reduce by a factor of 2.