Data Mining代写:CS626 Apriori Algorithm

代写数据挖掘里面的经典推荐算法Apriori,从给定的数据里面,根据规则挖掘特征。

Objective

The purpose of the project is to implement the apriori algorithm for generating all association rules whose support is greater than a user-supplied minimum support and whose confidence is greater than a user supplied minimum confidence.

Your program should take as command line option six parameters: (i) the minimum support, (ii) the minimum confidence, (iii) the name of the input file, (iv) the name of the output file, (v) the range of the hash function (i.e., the branching factor of the hash tree), and (vi) the maximum leaf size of the hash table. The specific parameter sequence along with the name of the executable are as follows:

hcrminer minsup minconf inputfile outputfile hfrange maxleafsize

NOTE: For those using Java, the command line should look like the following:

java hcrminer minsup minconf inputfile outputfile hfrange maxleafsize

Your program must be written in either C, C++, or Java.
You can assume that you can store the database and the frequent itemsets in memory.
For hash function, you can just simply use a “ID % hfrange” (i.e., a mod function).

Input file format

The input file consists of a set of lines, each line containing two numbers. The first number is the transaction ID and the second number is the item ID. The lines in the file are ordered in increasing transaction ID order. Note that the set of items that make up the transaction will be derived by combining the item IDs of all the lines that correspond to the same transaction ID.

Two input files are provided. The first is a small one that you can use during initial code development. The second is larger and will be the one on which you need to report performance results.

Output file format

The output file will contain as many lines as the number of high-confidence frequent rules that you found. The format of each line will be as follows:

LHS|RHS|SUPPORT|CONFIDENCE

Both LHS and RHS will contain the items that make up the left- and right-hand side of the rule in a space delimited fashion.

Due to the large number of the generated association rules, for support values of 20 and 15 do not generate the rules and just output the frequent itemsets. but you need to generate them for the other support values. The format of the output file when no association rules are generated will be of the form:

LHS|{}|SUPPORT|-1

where LHS is the frequent item set.

Report

Along with your code, you need to submit a report that contains the following: Run your code for each combination of the following sets of values for the minimum support, the minimum confidence, and option:

minsup: {15, 20, 30, 50, 100, 500, 1000} [Note that this is the support count (act ual frequency)]
minconf: {0.8, 0.9, 0.95}
hfrange: {5, 10, 20}
maxleafsize: {10, 50, 100}

For each combination of minconf, hfrange, and maxleafsize, generate two bar-charts. The first will show the amount of time required to find the frequent itemsets and to find the rules (a set of bars for each of these combinations) for the different values of minimum support. The second will show the number of frequent itemsets that were found and the number of high confidence rules that were generated (a set of bars for each of these combinations) for the different values of minimum support. Make sure that each bar is also annotated with the quantity that is plotting (at the top of the bar).

Here is a table of the approximate amount of time taken for a couple of minsup values for just the frequent itemset generation (the time corresponds to C implementation, the hfrange is set to 20 and maxleafsize is set to 100, no optimization flags are used):

minsup time (secs)
500    29.56
30     3756.69

A discussion as to which values of the hfrange and maxleafsize parameters resulted in faster execution time and why.