Data Mining代写:CS412 Text Mining

代写数据挖掘作业,根据算法对文本文件进行特征提取。作业详细给了每一步的步骤,比开放性的作业要容易很多。

Objective

  • Explore how frequent pattern mining can be applied to text mining to discover meaningful phrases.
  • In this assignment, you will first run LDA on a corpus made up of titles from 5 domains’ conference papers. Based on the results from LDA, a topic (representing a particular domain) is assigned to each word in each title. Then you write a frequent mining algorithm to mine frequent patterns from each topic to get meaningful phrases. The mined frequent patterns may not necessarily be meaningful phrases for the topic. So you will consider the question how to extract meaningful ones out of all the frequent patterns. The final goal is to output highly representative phrases for each topic.

Step 1: Get to Know the Data

We collect paper titles from conferences in computer science of 5 domains: Data Mining(DM), Machine Learning(ML), Database(DB), Information Retrieval(IR) and Theory(TH). You can download the raw data here paper_raw.txt. (Note that we will not use this file directly for this assignment. But you can look at this file to see what the original titles look like.) Each line contains two columns, PaperID and Title of a paper, separated by Tab(‘\t’). Recall the example in class. For each line in the file you can consider it as one transaction. Each word in the title is equivalent to an item in a transaction. Note that PaperID is unique in the whole data set. However, since this data is a subset of all the paper titles from a large corpus, PaperIDs are not starting from 0 and not consecutive. The raw data looks like this:

PaperID Title
7600 The Automatic Acquisition of Proof Methods
85825 Frequent pattern discovery with memory constraint

For this assignment, we pre-processed the raw data by removing stop words, converting the words to lower cases and lemmatization. You can download the processed data here paper.txt. This is the actual dataset we will use for this assignment. In this file, each line is a PaperID followed by a list of terms.The format is PaperID Tab(‘\t’) term1 Space(‘ ‘) term2 Space(‘ ‘) … paper.txt looks like this:

PaperID [term1] [term2] [term3] [term4] [term5] …… [termN]
7600 automatic acquisition proof method
85825 frequent pattern discovery memory constraint

Step 2: Preprocessing

This step prepares input for LDA. You will generate two files based on paper.txt.

Generate a Dictionary

First, you need to generate a vocabulary from paper.txt. Please name your vocabulary file as vocab.txt. Each line in this file is a unique term extracted from paper.txt; each term should appear exactly once. Here is an example of the first five lines in vocab.txt. Note that the sequence of terms may be different.

automatic 
acquisition 
proof 
method 
philosophical 
...

Tokenize Plain Text by Dictionary

In this step, we represent each title as a sparse vector of word counts. We ask you to transform each title (i.e., each line in paper.txt) into the following format.

Note: if you use any other lda package rather than the one mentioned in next step, please make sure the format in title.txt match what your lda package requires.

Step 3: Partitioning

Recall that we collect paper titles from conferences of 5 domains in computer science. If we run frequent pattern mining algorithms directly on paper titles, the patterns would be independent of topics. Thus, if we want to mine frequent patterns in each domain, titles and terms should be partitioned into 5 domains. Note that the domain knowledge is not known to you. Instead, we apply topic modeling to uncover the hidden topics behind titles and terms automatically. Specifically, we apply LDA (you are not required to understand how exactly topic modeling works) to assign a topic to each term.

Assign a Topic to Each Term

  1. Download the LDA package here. Unzip it and you could see a list of source code. You can refer to this page for a comprehensive introduction to the package.
  2. Open a terminal and go to the directory of the source code. Type make. Then an executable lda is generated.
  3. In lda-c-dist directory, there is a settings.txt file. You could use the following settings, but feel free to tune the parameters if you know how inference works for LDA.
  4. Run LDA with the following command.
  5. Check your output.

Re-organize the Terms by Topic

You are asked to re-organize the terms by 5 topics. For the i-th topic, you should create a file named topic-i.txt. Separate each line in word-assignment.dat by topics assigned to them. For example, the lines in word-assignment.dat can be considered as the following form ( Note that here we replace the integers with the real terms for better illustration):

004 automatic:02 acquisition:02 proof:02 method:02 
005 classification:03 noun:02 phrase:03 concept:01 individual:03

Step 4: Mining Frequent Patterns for Each Topic

In this step, you need to implement a frequent pattern mining algorithm. You can choose whatever frequent pattern mining algorithms you like, such as Apriori, FP-Growth, ECLAT, etc.

(Hint: you need to figure out min_sup by yourself.)

Question to ponder A: How you choose min_sup for this task? Explain how you choose the min_sup in your report. Any reasonable choice will be fine.

Step 5: Mining Maximal/Closed Patterns

In this step, you need to implement an algorithm to mine maximal patterns and closed patterns. You could write the code based on the output of Step 4, or implement a specific algorithm to mine maximal/closed patterns, such as CLOSET, MaxMiner, etc.

The output should be of the same form as the output of Step 4. Max patterns are put into max directory. The i-th file is named as max-i.txt.Closed patterns are put into closed directory. The i-th file is named as closed-i.txt.

Question to ponder B: Can you figure out which topic corresponds to which domain based on patterns you mine? Write your observations in the report.

Question to ponder C: Compare the result of frequent patterns, maximal patterns and closed patterns, is the result satisfying? Write down your analysis.

Step 6: Re-rank by Purity of Patterns

In this paper, purity is introduced as one of the ranking measures for phrases. A phrase is pure in topic t if it is only frequent in documents ( here documents refer to titles) about topic t and not frequent in documents about other topics. For example, ‘query processing’ is a more pure phrase than ‘query’ in the Database topic. We measure the purity of a pattern by comparing the probability of seeing a phrase in the topic-t collection D(t) and the probability of seeing it in any other topic-t’ collection.

Step7: Bonus

Could you come up with other filtering/ranking criteria to improve the quality of your mined phrase lists? Implement your algorithm and put your analysis in your report.

(Hint: Some related papers describe strategies to deal with this problem by balancing max patterns and closed patterns.

Step8: Report

Now you are ready to write your report.

You should include the following sections in your report:

  1. Briefly explain what algorithms you use in Step4~Step6.
  2. Answer all the questions in Question to ponder.
  3. List your source file names and their corresponding Steps.