Monday, February 27, 2017

Side-channel analysis

For many years, cryptographers and industry implemented cryptographic algorithms on microcontrollers and ASICs simply by translating them into assembly or RTL. The only thing that mattered was the functional correctness of the design. However, in the mid to late nineties, some interesting attacks on cryptographic implementations were shown that completely broke their security and revealed secret keys, while the underlying cryptographic algorithm remained mathematically unbroken. First, the timing attack was introduced: the attacker was able to discover information from the time a server took to respond to a query, since this time depended on the secret key that was used. The simplest illustration is the square-and-multiply algorithm used for RSA exponentiation. For every bit of the exponent, either a square or a square followed by a multiply is performed, depending on the value of the bit. This results in different execution times for different exponents.
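To make the timing dependence concrete, here is a minimal sketch of left-to-right square-and-multiply that counts modular operations as a stand-in for execution time. The function name, the toy modulus, and the operation counter are assumptions made for illustration; a real attack measures time rather than counting operations.

```python
def square_and_multiply(base, exponent, modulus):
    """Left-to-right binary exponentiation that also counts modular operations."""
    result = 1
    operations = 0
    for bit in bin(exponent)[2:]:                # most significant bit first
        result = (result * result) % modulus     # a square for every exponent bit
        operations += 1
        if bit == "1":                           # an extra multiply only for 1-bits
            result = (result * base) % modulus
            operations += 1
    return result, operations


MODULUS = 3233  # toy modulus 61 * 53, far too small for real use
for exponent in (0b10000001, 0b11111111):
    value, operations = square_and_multiply(42, exponent, MODULUS)
    print(f"exponent={exponent:#010b} operations={operations}")
```

Both exponents have the same bit length, yet the one with more 1-bits needs more operations, which is exactly the kind of key-dependent difference a timing attack exploits.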

Kocher et al. introduced a new type of side-channel attack by showing that information obtained from power measurements can also be used to extract secret values from devices such as smart cards. We start by assuming that the power signal $P$ consists of random noise $R$ and a deterministic part $L(x)$ that depends on the processed value $x$:

               $ P = L(x) + R$

The random noise $R$ is modeled as a Gaussian distribution with zero mean, and the deterministic, key-dependent part $L(x)$ is usually the Hamming weight or Hamming distance of the value $x$. Next, we choose the intermediate value to target. A good target for symmetric designs is the S-box output, since a small Hamming weight difference at the S-box input leads to a considerably larger Hamming weight difference at the output. In the original work, the target was simply a one-bit value. The simplest example therefore uses one output bit of the S-box, under the assumption that the power consumption differs between the values one and zero. For example, take one S-box operation in the first round of AES. It is calculated using the equation below:

$SO_i = S(pt_i \oplus key_i)$

Since we do not know the correct key value, we construct key guesses and calculate the targeted value, e.g. the first bit of the S-box output. Now, for every key guess, we separate the power traces into two groups: one containing the traces for which the key guess produces target value one, and the other containing those for which the target value is zero. Lastly, we compute the average of each group, which reduces the influence of the noise, and then subtract these averages. If our key guess is wrong, the difference of means will be negligible, since the grouping does not correspond to the values actually processed and is thus uncorrelated with the power traces. For the correct key guess, however, there will be a noticeable spike at the point in time where the first bit of the S-box output is computed.
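The difference-of-means procedure can be tried out on simulated data. The following is a sketch under stated assumptions, not a reproduction of the original attack: each "trace" is a single sample equal to the Hamming weight of the S-box output plus Gaussian noise, the target bit is taken to be the least significant output bit, and all function names, the key byte 0x2B, and the trace count are chosen for illustration.

```python
import numpy as np


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def build_sbox():
    """Build the AES S-box: field inversion followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):           # x^254 is the inverse of x
                inv = gf_mul(inv, x)
        out = inv
        for shift in range(1, 5):          # XOR with four left rotations
            out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(out ^ 0x63)
    return np.array(sbox, dtype=np.uint8)


SBOX = build_sbox()
HAMMING_WEIGHT = np.array([bin(v).count("1") for v in range(256)])


def simulate_traces(key_byte, n_traces, sigma, rng):
    """Simulated leakage P = L(x) + R with L = Hamming weight of the S-box output."""
    plaintexts = rng.integers(0, 256, size=n_traces)
    signal = HAMMING_WEIGHT[SBOX[plaintexts ^ key_byte]]
    return plaintexts, signal + rng.normal(0.0, sigma, n_traces)


def dom_scores(plaintexts, leakage):
    """Absolute difference of means for each of the 256 key guesses."""
    scores = np.zeros(256)
    for guess in range(256):
        target_bit = SBOX[plaintexts ^ guess] & 1      # predicted S-box output bit
        ones = leakage[target_bit == 1].mean()
        zeros = leakage[target_bit == 0].mean()
        scores[guess] = abs(ones - zeros)
    return scores


rng = np.random.default_rng(0)
plaintexts, leakage = simulate_traces(0x2B, 20000, 1.0, rng)
scores = dom_scores(plaintexts, leakage)
print("best key guess:", hex(int(np.argmax(scores))))
```

In this model the correct guess should yield a difference of means close to 1 (one bit of Hamming weight), while wrong guesses produce noticeably smaller values.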

This very simple yet very effective attack opened a new area of research. Soon, more elaborate attacks were devised that target bytes instead of bits using correlation. More advanced attacks include correlation with the variance instead of the mean, or with even higher statistical moments. These attacks are usually impractical for orders higher than two, since noise makes it increasingly hard for the attacker to retrieve the secret value. Devices are sometimes profiled in order to devise a more accurate model of the deterministic leakage function. Such attacks are known as template attacks. Designers also came up with various ways of trying to thwart these attacks, such as increasing the noise level by adding redundant circuitry or operations, and shuffling the order in which operations are executed. Other methods include secure logic design styles such as WDDL, where logic cells are designed to consume the same amount of power regardless of their output. Methods on the algorithmic level, e.g. Threshold Implementations, try to ensure that no information leaks during the execution of the algorithm, regardless of the underlying leakage model of the design technology.

Since classical DPA and correlation analysis involve constructing guesses for all possible key bytes/nibbles, they can be very time consuming. This is where leakage tests became prominent. They tell whether a set of traces with randomly chosen inputs can be distinguished from a set of traces with a fixed input value. The test works in a very similar way to one-bit DPA. Two sets of traces are obtained, one with random inputs and one with a fixed input. Welch's t-test is used to measure whether the two sets can be distinguished from one another. If $\mu$, $s^2$, and $n$ are the mean, variance, and cardinality of a set, the t-value is calculated as follows:

              $  t = \frac{\mu_0 - \mu_1}{\sqrt{\frac{s_0^2}{n_0} + \frac{s_1^2}{n_1}}}$

Approximating the statistical degrees of freedom with $n = n_0 + n_1$, we can claim with confidence 0.99999 that the two sets are distinguishable when $|t| > 4.5$. This type of leakage test allows designers to be more efficient, since it reduces the time needed to test a design. The downside of the approach is that leakage tests do not reveal how the secret values can be extracted, only that there is a possibility for an attacker to do so.
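As a rough illustration of such a fixed-versus-random test, the sketch below computes Welch's t-value on simulated single-sample traces. The leakage model (Hamming weight plus Gaussian noise), the fixed value 0xFE, and all names are assumptions for the example. Note that a fixed value with Hamming weight 4 would have the same mean as random bytes and would not be flagged by this first-order test.

```python
import numpy as np


def welch_t(set0, set1):
    """Welch's t-value from the means, sample variances and sizes of two sets."""
    mean0, mean1 = np.mean(set0), np.mean(set1)
    var0, var1 = np.var(set0, ddof=1), np.var(set1, ddof=1)
    return (mean0 - mean1) / np.sqrt(var0 / len(set0) + var1 / len(set1))


HAMMING_WEIGHT = np.array([bin(v).count("1") for v in range(256)])


def fixed_vs_random(fixed_value, n_traces, sigma, rng):
    """Simulate one leakage sample per trace for a fixed and a random input set."""
    fixed = HAMMING_WEIGHT[fixed_value] + rng.normal(0.0, sigma, n_traces)
    random_inputs = rng.integers(0, 256, size=n_traces)
    random_set = HAMMING_WEIGHT[random_inputs] + rng.normal(0.0, sigma, n_traces)
    return fixed, random_set


rng = np.random.default_rng(0)
fixed, random_set = fixed_vs_random(0xFE, 2000, 1.0, rng)
t_value = welch_t(fixed, random_set)
print("leakage detected" if abs(t_value) > 4.5 else "no leakage detected")
```

With real measurements the same statistic is computed independently for every time sample of the traces, and any sample exceeding the threshold is flagged.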

In the best paper of CHES 2016, "Differential Computation Analysis: Hiding Your White-Box Designs is Not Enough", side-channel analysis was successfully used to break many white-box implementations. This time the analysis was performed on the stack layout instead of power traces, but the principle remained the same. Given that white-box cryptography is currently being widely considered for many software-based designs, and given the plethora of IoT devices that require protection, side-channel analysis will continue to be an important aspect of the design process of cryptographic implementations. Thus, researchers will continue to work on improving leakage detection and key recovery methods to further reduce the cost and time to market of new devices.

Sunday, February 19, 2017

Differential privacy

These days one reads a lot about the right to privacy. But what is it, and how does it differ between the real world and the digital one? Briefly, it is a person's right to have and, more importantly, maintain control over information about oneself, which is fundamental to a human's freedom of self-determination. In the digital context, one has seemingly given up this right in favor of free services and handy tools that satisfy people's urge to communicate with each other and to stay in touch and up to date at all times. Since more and more people realize that there are naturally business models behind such apps and web pages, society increasingly demands to reclaim control over collected personal data, which has been provided, unsolicited or involuntarily, to online merchants or telephony service providers at the time of use, for example. On the other hand, as the corporations offering these services see it, collecting user information in databases is crucial for making good products and tailored recommendations and, especially in the age of big data and machine learning, for making precise predictions by evaluating various functions on the data.

Statistical database queries have been studied for quite some time now, and it turns out that it is often sufficient to allow query access only to a population's aggregate data, not to individual records, in order to derive useful statistics and achieve the desired functionality. A common approach is to allow only aggregate queries (e.g. range, histogram, average, standard deviation, ...) and, rather than returning exact answers about sensitive data fields, to specify intervals or give imprecise, statistically noisy counts.

You might have asked yourself: don't cryptographic techniques that have been omnipresent in this blog, such as FHE (Fully Homomorphic Encryption) or secure MPC (Multi-Party Computation), solve this problem? Wouldn't it be possible to encrypt the user data and still let the service provider compute useful statistics on it?
Indeed, this could be realized with the general FHE/MPC toolkit, but it is currently too inefficient to operate these tools at the scale needed in practice, where statistics are computed over huge quantities of data in order to infer useful statements about a given database. Hence more specific, lightweight tools have been designed to close this gap. Whereas FHE avoids a trusted third party when computing (e.g. arbitrary functions or sophisticated statistics) on users' sensitive data, here one typically explicitly allows a trusted third party to collect and aggregate data in a privacy-preserving fashion. Users might agree to this, e.g. when installing an app, because they expect an advantage for themselves, such as good default settings or an overall performance gain; or sharing information might be a requirement for using the service for free in the first place.

Differential privacy (often abbreviated DP) is a framework for formalizing privacy in statistical databases. It can protect against so-called de-anonymization techniques that try to identify an individual record by linking two separately released databases that have been stripped of (quasi-)identifiers and look innocuous. In particular, a priori knowledge or known partial history can be leveraged to derive more information from a released "anonymized dataset" than the purpose it was originally intended to serve.

Let's look at a mathematical definition that captures and formalizes the notion of privacy and which has been studied in cryptography in the past 10 years. Let $d,n$ be positive integers and $f: X^n \rightarrow \mathbb {R} ^{d}$ some statistic on a database consisting of $n$ records.

A randomized algorithm $\mathcal {A}: X^n \rightarrow \mathbb {R} ^{d}$ that computes $f$ is said to be $(\epsilon, 0)$-differentially private (or $\mathcal {A}$ is $\epsilon$-DP, for short) if for all neighboring databases $x_{1} \neq x_{2}$ with $x_{1} \sim x_{2} := x_{1} \sim_1 x_{2}$ (they differ in just 1 element), and all subsets $S \subseteq \mathbb {R} ^{d}$,
$$\mathbb{P}[\mathcal{A}(x_{1})\in S]\leq e^{\epsilon } \cdot \mathbb{P}[\mathcal{A}(x_{2})\in S]$$
holds. Looking more closely at this definition, $\forall x_{1}, x_{2} \in X^n, x_{1} \sim x_{2}:$
$$\mathbb{P}[{\mathcal{A}}(x_{1})\in S]
\leq e^{\epsilon } \mathbb{P}[{\mathcal{A}}(x_{2})\in S] \Leftrightarrow \frac{\mathbb{P}[{\mathcal{A}}(x_{1})\in S]}{\mathbb{P}[\mathcal{A}(x_{2})\in S]} \leq e^{\epsilon }\\ \Leftrightarrow \log \left(\frac{\mathbb P[\mathcal{A}(x_{1})\in S]}{\mathbb{P}[{\mathcal{A}}(x_{2})\in S]}\right) \leq {\epsilon}$$
we can identify the so-called "privacy loss" of an algorithm (in this context often called a mechanism) $\mathcal {A}$. In this setting $\epsilon$ can be called the privacy budget. In less exact terms: by specifying the privacy budget, it is possible to control the level of privacy and make an algorithm respect this additional constraint using the techniques introduced below.
For those familiar with the concept of max-divergence, the definition of privacy loss is in fact the definition of $$D_\infty( \mathcal{A}(x_1) \,\|\, \mathcal{A}(x_2)) := \max_{S\subseteq \mathrm{supp}(\mathcal{A}(x_1))} \log \left(\frac{\mathbb P[\mathcal{A}(x_{1})\in S]}{\mathbb{P}[{\mathcal{A}}(x_{2})\in S]} \right).$$
Furthermore, the multiplicative factor $e^\epsilon$ can be viewed as $1+\epsilon$, using a common approximation for small $\epsilon<1$:
$$e^\epsilon = \exp(\epsilon) = 1 + \epsilon + \frac{\epsilon^2}{2!} + \dots \approx 1 + \epsilon.$$
In less formal terms this definition says that a given result is approximately the same whether it is computed using the first or the second, neighboring database.
A more general definition, which adds flexibility -- but also makes the proofs less elegant and more technical -- is $(\epsilon, \delta)$-differential privacy, which requires
$$\mathbb P[{\mathcal {A}}(x_{1})\in S]\leq e^{\epsilon } \cdot \mathbb P[{\mathcal {A}}(x_{2})\in S] + \delta.$$
Interpreting the definition, the goal of DP is that the risk of violating one's privacy should not substantially increase as a result of appearing in a statistical database. Thus an analyst should not be able to learn any information about a record (e.g. from participation in an online questionnaire) that couldn't have been learned if the person had opted not to participate, or had answered the questions randomly by rolling a die or flipping a coin rather than answering truthfully.
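The coin-flipping idea can be made concrete with classic randomized response, which is brought in here as an illustrative assumption rather than a construction from the text above. A participant answers a yes/no question truthfully with probability 1/2 and otherwise reports the outcome of a fair coin. The sketch computes the exact output probabilities and the resulting privacy loss; the function names are hypothetical.

```python
import math


def report_probabilities(p_truth=0.5):
    """Return (P[report yes | truth is yes], P[report yes | truth is no])."""
    p_yes_given_yes = p_truth + (1.0 - p_truth) * 0.5   # truthful, or a lucky coin
    p_yes_given_no = (1.0 - p_truth) * 0.5              # only via the random coin
    return p_yes_given_yes, p_yes_given_no


def privacy_loss(p_a, p_b):
    """Worst-case log-ratio over both possible reports and both directions."""
    loss_yes = abs(math.log(p_a / p_b))
    loss_no = abs(math.log((1.0 - p_a) / (1.0 - p_b)))
    return max(loss_yes, loss_no)


p_yes, p_no = report_probabilities()
epsilon = privacy_loss(p_yes, p_no)
print(f"P[yes|yes]={p_yes}, P[yes|no]={p_no}, epsilon={epsilon:.4f}")
```

The two "neighboring databases" here are the two possible true answers of a single participant. Since any report is at most three times more likely under one truth than under the other, this mechanism is $\ln 3$-DP.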

To overcome the fundamental challenge -- the trade-off between utility of data or accuracy of returned answers and privacy of records -- the goal is to learn as much as possible about a group's data while revealing as little as possible about any individual within the group. Transforming an algorithm into a DP algorithm requires probabilistic tools. The sensitivity of a function $f$ is a good measure of how much statistical noise is needed to mask an answer:$$\Delta f=\max_{x_1 \sim x_2} ||f(x_{1})-f(x_{2})||_{1} = \max_{x_1 \sim x_2} \sum_{i=1}^d |f(x_{1})_i-f(x_{2})_i|.$$Low sensitivity of a function, i.e. a small change in output between two neighboring inputs, allows adding statistical noise to achieve privacy without losing much utility. The Laplace mechanism adds noise from the Laplace distribution $\mathcal L(\lambda)$, i.e. noise $\eta(x)\propto \exp(-|x|/\lambda)$, which has mean 0 and scale parameter $\lambda$ (standard deviation $\sqrt{2}\lambda$). Substituting Laplace noise with other probability distributions, such as a zero-mean Gaussian with standard deviation $\lambda$, would be possible, but changes the details of the proofs.
A typical construction now is, instead of computing $f(x)$ directly, to compute $\mathcal A(x) = f(x) + \eta(x)$ and obtain an $\epsilon = \frac{\Delta f}{\lambda} $-DP algorithm, since the ratio of the noise densities for two neighboring databases does not exceed $e^{\epsilon}$ even in the worst case. In terms of composability, sequential composition yields the sum of all occurring $\epsilon_i$, whereas parallel composition (on disjoint parts of the data) yields the maximum of all occurring $\epsilon_i$, over the differentially private steps within the composed mechanism. This makes it possible to efficiently turn algorithms into DP algorithms.
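A minimal sketch of the Laplace mechanism for a scalar query might look as follows. It assumes a counting query with sensitivity $\Delta f = 1$ and two neighboring databases whose true counts are 100 and 101; the function names and these numbers are illustrative. The helper `max_log_ratio` checks numerically, on a grid of outputs, that the log-ratio of the two output densities stays within $\epsilon$.

```python
import numpy as np


def laplace_mechanism(true_answer, sensitivity, epsilon, rng, size=None):
    """Return f(x) + eta with Laplace noise of scale lambda = sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return true_answer + rng.laplace(loc=0.0, scale=scale, size=size)


def laplace_pdf(y, center, scale):
    """Density of the mechanism's output when the true answer is `center`."""
    return np.exp(-np.abs(y - center) / scale) / (2.0 * scale)


def max_log_ratio(answer1, answer2, scale, grid):
    """Largest absolute log density ratio between two neighboring databases."""
    ratio = laplace_pdf(grid, answer1, scale) / laplace_pdf(grid, answer2, scale)
    return float(np.max(np.abs(np.log(ratio))))


def sequential_epsilon(epsilons):
    return sum(epsilons)      # budgets add up when steps touch the same data


def parallel_epsilon(epsilons):
    return max(epsilons)      # only the largest budget counts on disjoint data


rng = np.random.default_rng(0)
epsilon, sensitivity = 0.5, 1.0
noisy_count = laplace_mechanism(100, sensitivity, epsilon, rng)
grid = np.linspace(80.0, 120.0, 4001)
loss = max_log_ratio(100, 101, sensitivity / epsilon, grid)
print(f"noisy count: {noisy_count:.2f}, observed privacy loss: {loss:.3f}")
```

The observed loss matches $\epsilon = \Delta f / \lambda$ because the two densities are shifted copies of each other, and shifting a Laplace density by $\Delta f$ changes its logarithm by at most $\Delta f / \lambda$.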

These basic constructions, including detailed proofs, and much more were covered during the 7th BIU Winter School on Cryptography, named "Differential Privacy: From Theory to Practice", featuring speakers who defined the field and began contributing to it roughly 10 years ago. Slides are already online, and video recordings are about to appear on the webpage.
Furthermore, the relationships of DP to various related scientific fields, ranging from statistics to machine learning and finally game theory, were explored.

To conclude: in the wake of growing awareness of privacy issues in the digital domain, together with a stricter interpretation of legislation and, finally, the possibility of satisfying most interests with anonymized data anyway, several big players strive to provide differentially private data collection. Some companies market themselves as quasi-pioneers in privacy topics for several reasons: it pays off to be perceived as the first one; they would face various problems in the near future anyway if they did not respect these issues; and, most importantly, they can continue their business model of creating value from users' data. The more information is queried from the database, the more statistical noise is needed to mask the correct answer in order to stay within a predefined privacy budget. This total allowed, justifiable privacy leakage can be specified in terms of the number of admissible queries or the answer accuracy.

Provable security avoids the situation of mere obfuscation that can be undone by a clever enough attacker or strategy -- provided the security assumption holds -- and provides bounds and thus a guideline on how to choose parameters to guarantee a desired level of privacy. Algorithms spend a given privacy budget at privacy-critical steps. Differential privacy is thus an additional design paradigm to keep in mind for cryptographic algorithms and protocols.

I'd like to end this cryptologic point of view on achieving privacy goals on the internet as I started, with a fundamental sociological question. One thought that stands out is: should we collect so much data in the first place? Is it really necessary for online merchants to predict individuals? Do we want this ubiquitous tracking? As with other advanced technologies whose long-term effects cannot be predicted, maybe in aggregating big data and tracking, too, the only winning move is not to collect data in the first place.