The final result we want to achieve is to extend the function of Scalica so that it can automatically match two users for a dinner date.
In general the project could be divided into three subsections:
- Requesting and storing user input
- Analyzing user input and generating a matching rank
- Presenting matching information to user
Firstly, we will want to augment the functionality in Scalica. We would like to ask the users some questions. There are 20 rating questions such as “Do you like spicy food?” “Do you like French food?” each question with a scale from 1 to 5 (1: dislike – 5: Like a lot). The service randomly chooses 3 questions, and ask the user to indicate their preference. These data will be stored in a table with userID as the primary key, and each column representing a question storing the rating value. All these data are considered to be private data. So we will also have to restrict the viewing or changing of them to unauthorized users. For example, some pervert who is trying to setup a date for himself and another person he fancies will not be able to do that.
A is responsible for this part (augmenting Scalica (and its data models) to store additional info), and he would like to use either Java or Python and Django for this part. More on database design: to make the database scalable, we will use horizontal partitioning to partition the user info data. Even though we only get one physical node for database, we will create some logical shards so that it will be easy when the amount of data grow and we need to redistribute them across different physical nodes. Each time a new entry is created, its ID will contain an identifier marking the logical shard which it’s in.
Secondly, we will want to use these data to match these people. We will be using the latent factor model to achieve this. Normally, the latent factor model will have a data table that store each user and their preference level to multiple labels. In our ca se, this matrix A will be a matrix of users vs. all the different labels created by users. These labels are the favourite books, users’ jobs, their favourite movies and etc. In the actual matrix, each value of the matrix element is the rating we got from first step corresponding to (user, label). Since it is unrealistic for users to rate their preference level for each of the label in the matrix, there will be many blanks in the matrix. We will then compute the latent factor for each user to the other labels to substitute these blanks. To compute the latent factors. We would need to create two matrix out of the original matrix A. One would be the matrix that represents users vs. flavors. The other one would be flavors vs. labels. By doing a gradient descent on the product of these two matrices, we would be able to get the latent factor matrix. We could then use the latent factor matrix to get labels that were not rated by the user but now it is. For the completed matrix, each row would be the interpreted preference array for the user against the labels. We then find out for each user, whose preference array is most alike. We would do this by computing the dot product of the difference array between two users by itself. The users with lowest three such values will be recommended to each other. Depending on the number of users we have and the frequency they choose to update their preference information, we will need to update the preference matrix in different windows of time. B will be responsible for the second part. For the language used here. He is considering either Python or C++. Former for the reason that it is easy to do matrix arithmetic with Python. The later been that it might be faster to use C++. All the above service will be run on a cloud server from Google’s Cloud Engine. We will be using Redis to store the list of suggestions for it is better supported than memcached.
Thirdly, we will have two different database, one is for the user information and the other is for the result of survey. Both of them will be horizontal partitioned and logical sharded for the expansion in the future. The host of matching algorithm will analyze the data from the survey database. At the end of every finished calculation, it will update all changes to the suggestion list in Redis. The algorithm will run every time when there is a change in survey database. C is responsible to construct survey database.
Optional: We will need to be presenting these informations to the users. We will need to guarantee that both user’s personal information is not leaked in anyway to the other prior to their meeting. First, we will present the suggested restaurant, the price range, and the dining time to both users in the match. If both users agree to meet, the system will provide them more detailed and personal information, such as interests, favorite food and etc. If one of the user in the match decides to not meet, the system will hide all the information presented before from both users and suggest a new match to both users according to the next higher rank in the system. The third part require techniques in visualizing the data. After achieving basic function mentioned above, another main focus is beautifying the user interface and the data visualization. C is responsible for this part, and he would like to use python and Django for this part. The packages I might use are Bokeh for visualization with Python.