## Randomized response

(a) The simplest version of randomized response involves flipping a single fair coin (50% probability of heads and 50% probability of tails). Suppose an individual is asked a potentially incriminating question and flips a coin before answering. If the coin comes up tails, they answer truthfully; otherwise, they answer “yes”. Is this mechanism differentially private? If so, what epsilon value does it achieve? Carefully justify your answer.
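A useful first step is to tabulate, for each possible true answer, how often each reported answer occurs, and then check those probabilities against the definition of epsilon-differential privacy. The sketch below only simulates the mechanism as described above (tails: truthful answer, heads: “yes”). The function names `randomized_response` and `estimate_response_rates` are illustrative and not part of any provided code, and the privacy analysis itself is left to you.

```python
import random


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Return the reported answer (True means "yes") for one respondent."""
    heads = rng.random() < 0.5  # fair coin flip
    if heads:
        return True  # heads: always report "yes"
    return truth  # tails: report the true answer


def estimate_response_rates(trials: int = 100_000, seed: int = 0) -> dict:
    """Estimate P(report "yes" | true answer) for both possible true answers."""
    rng = random.Random(seed)
    rates = {}
    for truth in (True, False):
        yes_count = sum(randomized_response(truth, rng) for _ in range(trials))
        rates[truth] = yes_count / trials
    return rates


if __name__ == "__main__":
    for truth, rate in estimate_response_rates().items():
        print(f"true answer = {truth}: estimated P(report yes) = {rate:.3f}")
```

Remember that the definition must hold for every possible output, so consider the probability of reporting “no” as well as the probability of reporting “yes”.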

## Privacy-preserving synthetic data

In this problem, you will take on the role of a data owner who owns two sensitive datasets, called hw_compas and hw_fake, and is preparing to release differentially private synthetic versions of these datasets.

The first dataset, hw_compas, is a subset of the dataset released by ProPublica as part of their COMPAS investigation. The hw_compas dataset has attributes age, sex, score, and race, with the following domains of values: age is an integer between 18 and 96, sex is one of ‘Male’ or ‘Female’, score is an integer between -1 and 10, and race is one of ‘Other’, ‘Caucasian’, ‘African-American’, ‘Hispanic’, ‘Asian’, or ‘Native American’.

The second dataset, hw_fake, is a synthetically generated dataset. We call this dataset “fake” rather than “synthetic” because you will be using it as input to a privacy-preserving data generator. We will use the term “synthetic” to refer to privacy-preserving datasets that are produced as output of a data generator.

We generated the hw_fake dataset by sampling from the following Bayesian network:

In this Bayesian network, parent_1, parent_2, child_1, and child_2 are random variables. Each of these variables takes on one of three values {0, 1, 2}.

• Variables parent_1 and parent_2 take on each of the possible values with an equal probability. Values are assigned to these random variables independently.
• Variables child_1 and child_2 take on the value of one of their parents. Which parent’s value the child takes on is chosen with an equal probability.
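The sampling process described in the two bullets above can be sketched as follows. This is a minimal illustration, not the code used to produce hw_fake. It assumes that both children have parent_1 and parent_2 as their parents (the network diagram is not reproduced here) and that each child picks its parent independently of the other child. The names `sample_row` and `sample_dataset` are hypothetical.

```python
import random

VALUES = (0, 1, 2)


def sample_row(rng: random.Random) -> dict:
    """Draw one record from the assumed two-parent, two-child network."""
    # Parents are independent and uniform over {0, 1, 2}.
    parent_1 = rng.choice(VALUES)
    parent_2 = rng.choice(VALUES)
    # Each child copies the value of one parent, chosen with equal probability.
    # Assumption: both children share the same two parents.
    child_1 = rng.choice((parent_1, parent_2))
    child_2 = rng.choice((parent_1, parent_2))
    return {
        "parent_1": parent_1,
        "parent_2": parent_2,
        "child_1": child_1,
        "child_2": child_2,
    }


def sample_dataset(n: int, seed: int = 0) -> list:
    """Draw n independent records with a fixed seed for reproducibility."""
    rng = random.Random(seed)
    return [sample_row(rng) for _ in range(n)]


if __name__ == "__main__":
    for row in sample_dataset(5):
        print(row)
```

Under these assumptions, each child is strongly correlated with both parents, while the two parents are independent of each other.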

To start, use the Data Synthesizer library to generate 4 synthetic datasets for each sensitive dataset hw_compas and hw_fake (8 synthetic datasets in total), each of size N=10,000, using the following settings:

• A: random mode
• B: independent attribute mode with epsilon = 0.1.
• C: correlated attribute mode with epsilon = 0.1, with Bayesian network degree k=1
• D: correlated attribute mode with epsilon = 0.1, with Bayesian network degree k=2

For guidance, you can use the HW2_Template here. We have provided the code to generate the 4 synthetic datasets for you. Please make sure to duplicate this file rather than writing your code directly in the original.

(a) Execute the following queries on synthetic datasets and compare the results to those on the corresponding real datasets:

• Q1 (hw_compas only): Execute basic statistical queries over synthetic datasets.
The hw_compas dataset has numerical attributes age and score. Calculate the median, mean, min, and max of age and score for the synthetic datasets generated with settings A, B, C, and D (described above). Compare to the ground truth values, as computed over hw_compas. Present results in a table. Discuss the accuracy of the different methods in your report. Which methods are accurate and which are less accurate? If there are substantial differences in accuracy between methods, explain these differences.
• Q2 (hw_compas only): Compare how well random mode (A) and independent attribute mode (B) replicate the original distribution.
Plot the distributions of values of age and sex attributes in hw_compas and in synthetic datasets generated under settings A and B. Compare the histograms visually and explain the results in your report.
Next, compute cumulative measures that quantify the difference between the probability distributions over age and sex in hw_compas vs. in privacy-preserving synthetic data.
To do so, use the two-sample Kolmogorov-Smirnov test (KS test) for the numerical attribute and Kullback-Leibler divergence (KL-divergence) for the categorical attribute, using the provided functions ks_test and kl_test. Discuss the relative difference in performance under A and B in your report.
• Q3 (hw_fake only): Compare the accuracy of correlated attribute mode with k=1 (C) and with k=2 (D).
Display the pairwise mutual information matrices as heatmaps, showing mutual information between all pairs of attributes, in hw_fake and in the two synthetic datasets (generated under C and D). Discuss your observations in your report, noting how well or how poorly mutual information is preserved in the synthetic data.
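To make the quantity in Q3 concrete, here is a minimal sketch of empirical mutual information between two discrete columns, and of a pairwise matrix over all columns. It uses only the standard library and computes the unnormalized value in nats. The template's heatmap code may use a normalized variant, so absolute values need not match. The function names are illustrative.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys) -> float:
    """Empirical mutual information (in nats) of two equal-length sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log(p(x,y) / (p(x) p(y))), written in terms of raw counts
        mi += (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
    return mi


def pairwise_mutual_information(columns: dict) -> dict:
    """Map each ordered pair of column names to its mutual information."""
    names = list(columns)
    # Diagonal entries: MI of a column with itself equals its entropy.
    matrix = {(a, a): mutual_information(columns[a], columns[a]) for a in names}
    for a, b in combinations(names, 2):
        value = mutual_information(columns[a], columns[b])
        matrix[(a, b)] = value
        matrix[(b, a)] = value  # mutual information is symmetric
    return matrix


if __name__ == "__main__":
    toy = {"u": [0, 0, 1, 1], "v": [0, 1, 0, 1], "w": [0, 0, 1, 1]}
    for pair, value in pairwise_mutual_information(toy).items():
        print(pair, round(value, 4))
```

Two independent columns give a value near zero, while two identical columns give the entropy of the column, which is the largest value the pair can reach.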

(b) (hw_compas only): Study the variability in the mean and median of age for synthetic datasets generated under settings A, B, and C.

To do this, fix epsilon = 0.1, and generate 10 synthetic datasets for each setting (by specifying different seeds).

Calculate the mean and median of age for each of the 10 datasets. Plot the 10 median values and the 10 mean values using a box-and-whiskers plot. Compare these metrics to the ground truth median and mean from the real data. Carefully explain your observations: which mode gives more accurate results and why? In which cases do we see more or less variability?

Specifically for the box-and-whiskers plots, we expect to see two subplots: one for the mean and one for the median, with the three settings (A, B and C) along the X-axis and age on the Y-axis. You should include these plots in your report.
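The layout described above could be produced along these lines. This is a rough sketch, not the template's code: it assumes you have already collected the age column of each synthetic dataset into a dictionary keyed by setting, with one list of ages per seed. The names `summarize_age` and `plot_age_variability` are hypothetical, and the plotting helper assumes matplotlib is installed.

```python
import statistics


def summarize_age(ages_by_setting: dict) -> dict:
    """Compute per-seed mean and median of age for each setting.

    ages_by_setting maps a setting label (e.g. "A") to a list of age
    columns, one column per seed.
    """
    summary = {}
    for setting, runs in ages_by_setting.items():
        summary[setting] = {
            "mean": [statistics.mean(ages) for ages in runs],
            "median": [statistics.median(ages) for ages in runs],
        }
    return summary


def plot_age_variability(summary: dict, true_mean: float, true_median: float):
    """Draw two subplots (mean, median) with one box per setting."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    settings = list(summary)
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    truths = {"mean": true_mean, "median": true_median}
    for ax, stat in zip(axes, ("mean", "median")):
        ax.boxplot([summary[s][stat] for s in settings])
        ax.set_xticklabels(settings)
        # Dashed reference line for the value computed on the real data.
        ax.axhline(truths[stat], linestyle="--", color="gray", label="ground truth")
        ax.set_xlabel("setting")
        ax.set_ylabel("age")
        ax.set_title(f"{stat} of age across seeds")
        ax.legend()
    fig.tight_layout()
    return fig
```

Drawing the ground truth as a horizontal line makes it easy to read both bias (distance of the box from the line) and variability (height of the box) from the same plot.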

(c) (hw_compas only): Study how well statistical properties of the data are preserved as a function of the privacy budget, epsilon. To see robust results, execute your experiment with 10 different synthetic datasets (with different seeds) for each value of epsilon, for each data generation setting (B, C, and D). Specifically, you should:

• Compute the KL-divergence over the attribute race in hw_compas. For each setting (B, C, and D), vary epsilon from 0.02 to 0.1 in increments of 0.02. Specifically, the epsilons are [0.02, 0.04, 0.06, 0.08, 0.1]. In total, you should generate 3*10*5 = 150 synthetic datasets and calculate the KL-divergence for race in each dataset. Create three box-and-whiskers plots, one for each setting (B, C, D). Each plot should have epsilon on the X-axis and KL-divergence on the Y-axis. Discuss your findings in the report and include your plots.
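For reference, the KL-divergence between the real and synthetic distributions of a categorical attribute can be sketched as below. This is not the provided kl_test function, whose handling of missing categories and choice of logarithm base may differ. The additive `smoothing` parameter and the helper name `kl_by_epsilon` are assumptions made for illustration.

```python
import math
from collections import Counter


def kl_divergence(real_values, synthetic_values, smoothing: float = 1e-9) -> float:
    """KL(real || synthetic) in nats over the union of observed categories."""
    categories = sorted(set(real_values) | set(synthetic_values))
    real_counts = Counter(real_values)
    synthetic_counts = Counter(synthetic_values)
    # Additive smoothing keeps every probability strictly positive, so a
    # category missing from one dataset does not cause a division by zero.
    real_total = len(real_values) + smoothing * len(categories)
    synthetic_total = len(synthetic_values) + smoothing * len(categories)
    kl = 0.0
    for category in categories:
        p = (real_counts[category] + smoothing) / real_total
        q = (synthetic_counts[category] + smoothing) / synthetic_total
        kl += p * math.log(p / q)
    return kl


def kl_by_epsilon(real_values, synthetic_by_epsilon: dict) -> dict:
    """Map each epsilon to the list of KL values across its seeded datasets."""
    return {
        epsilon: [kl_divergence(real_values, column) for column in columns]
        for epsilon, columns in synthetic_by_epsilon.items()
    }
```

The dictionary returned by `kl_by_epsilon` has one list of values per epsilon, which is the shape a box-and-whiskers plot needs for one setting.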