Report For Data Mining Project
Xiumo Zhan xiumoz@sfu.ca
Bowen Sun bsa58@sfu.ca
Abstract
This project uses MapReduce programming to find all frequent itemsets among the transactions in
a given file in two passes. We use Java as the programming language and Eclipse with the Hadoop
plugin as the development environment. The two passes implement the SON algorithm, with the
Apriori algorithm applied inside each chunk. Given the input file, a parameter k as the number
of subfiles, and a parameter s as the support threshold, our MapReduce program achieves the
expected result.
Implementation
Our implementation is based on the SON algorithm. The algorithm consists of two passes, each of
which requires one Map function and one Reduce function. The SON algorithm lends itself well
to a parallel-computing environment: each of the chunks can be processed in parallel, and the
frequent itemsets from each chunk are combined to form the candidates. To simulate the
parallel-computing environment, we built a pseudo-distributed Hadoop environment on an
Ubuntu system running as a virtual machine under VMware Workstation on our own laptop.
Pass 1
In Pass 1, we first divide the entire file into k subfiles, and the input of each mapper is one of
the k subfiles. We then run the Apriori algorithm on each subfile. Applying the Apriori
algorithm, we read the whole subfile and split it into lines, each of which represents a basket.
We then split each line into items and use a List<String[]> to store the distinct items, which
form the candidate frequent 1-itemsets C1. Next we compute the support of each item in C1 to
generate L1, and use L1 to form the candidate pairs C2. Each pair in C2 is checked against the
threshold s to generate L2.
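Outside the MapReduce framework, this first step can be sketched in plain Java as follows (the helper below is illustrative and simplified relative to the appendix code):

```java
import java.util.*;

public class FrequentSingletons {
    // Count each distinct item's support and keep those meeting threshold s.
    // baskets: each basket is an array of item strings; s: fractional support threshold.
    public static List<String> frequent1Itemsets(List<String[]> baskets, double s) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] basket : baskets) {
            // Use a set so an item repeated inside one basket is counted once.
            for (String item : new HashSet<>(Arrays.asList(basket))) {
                counts.merge(item, 1, Integer::sum);
            }
        }
        List<String> l1 = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() * 1.0 / baskets.size() >= s) {
                l1.add(e.getKey());
            }
        }
        Collections.sort(l1);
        return l1;
    }

    public static void main(String[] args) {
        List<String[]> baskets = Arrays.asList(
                new String[]{"1", "2", "3"},
                new String[]{"1", "2"},
                new String[]{"1", "4"});
        // With s = 0.5, items "1" and "2" are frequent; "3" and "4" are not.
        System.out.println(frequent1Itemsets(baskets, 0.5)); // prints [1, 2]
    }
}
```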
For any k ≥ 3, to self-join Lk−1 into Ck we compare the first k − 2 elements of every two
itemsets in Lk−1. For instance, given the two 3-itemsets “234” and “235” in L3, we check
whether their first two elements are the same. The pseudocode of this procedure is as follows:
Combine(itemset1, itemset2)
set point=0;
set key={};
for i=1 to (length of both itemsets - 1)
if itemset1[i]==itemset2[i]
point=point+1;
key=key+itemset1[i];
else
break;
endif
endfor
if point==(length of both itemsets - 1)
if itemset1[point+1]>itemset2[point+1]
key=key+itemset2[point+1]+itemset1[point+1];
else
key=key+itemset1[point+1]+itemset2[point+1];
endif
endif
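The Combine step above can be sketched in Java as follows (assuming, as in the appendix code, that an itemset is a comma-separated string of items kept in ascending order; the class name is illustrative):

```java
public class CandidateJoin {
    // Join two sorted (k-1)-itemsets that agree on their first k-2 items
    // into one k-itemset candidate; return null if they do not agree.
    public static String combine(String itemset1, String itemset2) {
        String[] a = itemset1.split(",");
        String[] b = itemset2.split(",");
        if (a.length != b.length) return null;
        StringBuilder key = new StringBuilder();
        // The first k-2 items must match exactly.
        for (int i = 0; i < a.length - 1; i++) {
            if (!a[i].equals(b[i])) return null;
            key.append(a[i]).append(',');
        }
        int last1 = Integer.parseInt(a[a.length - 1]);
        int last2 = Integer.parseInt(b[b.length - 1]);
        if (last1 == last2) return null; // identical itemsets, nothing to join
        // Append the two distinct last items in ascending order.
        if (last1 < last2) key.append(last1).append(',').append(last2);
        else key.append(last2).append(',').append(last1);
        return key.toString();
    }

    public static void main(String[] args) {
        System.out.println(combine("2,3,4", "2,3,5")); // prints 2,3,4,5
        System.out.println(combine("2,3,4", "2,4,5")); // prints null (prefixes differ)
    }
}
```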
Using “234” and “235” we can form “2345”, but we must check whether it qualifies to stay in
C4. Since “234” and “235” are already known to be in L3, we only need to check whether “245”
and “345” are in L3 instead of checking all four 3-subsets, which avoids unnecessary checks. In
general, the subsets that need to be checked for a candidate in Ck are those that contain the
last two items and omit one of the first k − 2 items. We continue this self-join process on Lk
until the generated Ck+1 is empty. The pseudocode of the checking procedure is as follows:
Check(itemset, Lk−1)
set flag=1;
for i=1 to (length of itemset - 2)
set subitemset[i]=the itemset with itemset[i] deleted;
if subitemset[i] does not exist in Lk−1
set flag=0;
break;
endif
endfor
if flag==0
delete this itemset from Ck;
else
keep this itemset in Ck;
endif
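The Check step can likewise be sketched in Java (names are illustrative; Lk−1 is represented as a set of comma-joined itemset strings):

```java
import java.util.*;

public class CandidatePrune {
    // A candidate k-itemset survives only if every (k-1)-subset obtained by
    // deleting one of its first k-2 items is already in lkMinus1; the two
    // subsets missing one of the last two items are its parents by construction.
    public static boolean isValidCandidate(String[] candidate, Set<String> lkMinus1) {
        for (int skip = 0; skip < candidate.length - 2; skip++) {
            StringBuilder sub = new StringBuilder();
            for (int i = 0; i < candidate.length; i++) {
                if (i == skip) continue; // delete the skip-th item
                if (sub.length() > 0) sub.append(',');
                sub.append(candidate[i]);
            }
            if (!lkMinus1.contains(sub.toString())) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> l3 = new HashSet<>(Arrays.asList("2,3,4", "2,3,5", "2,4,5", "3,4,5"));
        // "2345" is kept because "3,4,5" and "2,4,5" are both in L3.
        System.out.println(isValidCandidate(new String[]{"2", "3", "4", "5"}, l3)); // prints true
    }
}
```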
The result produced by each mapper is the set of candidate frequent itemsets of its subfile. The
reducer then uses our first Reduce function to prune duplicated itemsets from the mappers'
output. Note that the first Reduce function ignores the support value of each itemset; the
computation of the actual support of each itemset is deferred to Pass 2.
After producing all the candidate itemsets with the Apriori algorithm in Pass 1, we output a file
of all candidate itemsets in the format <”itemset”, ”value”>, where ”value” is set to 1 since we
only need to collect all distinct candidate frequent itemsets. Storing all candidate itemsets is
necessary for the algorithm to carry on, because it ensures that the candidates produced in
Pass 1 are passed to the next pass.
The pseudocode of the whole procedure of Pass 1 can be written as follows:
Class FirstMapper
Method FirstMapper(inputfile)
set i=1;
while(|Ci| > 0)
for each Cik in Ci
Cik=<Cik.key, Cik.support=computesupport(inputfile, Cik)>;
endfor
Li=cut(Ci);
Result=Result+Li;
Ci+1=self-join(Li);
i++;
endwhile
Output <Result.key, 1>;
Class FirstReducer
Method FirstReducer(keys, values)
Collect all distinct candidate frequent itemsets;
Pass 2
In Pass 2, we first read the output file produced in Pass 1. The task of the mapper in Pass 2 is
to count the number of appearances of each candidate frequent itemset in each subfile.
To do this, we reuse the subfiles and implement the second Map and Reduce functions. The
second mapper produces the number of appearances of each candidate frequent itemset in its
subfile and transmits the result in the format <”key”, ”value”> to the second reducer.
The second reducer sums the values for the same key and generates a new pair for each
candidate frequent itemset. The reducer then eliminates the itemsets whose support,
sum(values) / (total number of baskets),
is smaller than s. The pseudocode can be written as follows:
Class SecondMapper
Method SecondMapper(result of Pass 1, subfile)
Result2={};
for each key[i] in the result of Pass 1
count=number of baskets in the subfile in which key[i] appears;
Result2=Result2+<key[i], count>;
endfor
Return Result2;
Class SecondReducer
Method SecondReducer(keys, values)
for each keys[i] in the input keys
values=sum of values for the same key;
ComputeAndCut(<keys[i], values>, s);
endfor
Output <keys, values>;
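Stripped of Hadoop details, the second reducer's sum-and-filter logic amounts to the following sketch (class and parameter names are illustrative; each candidate maps to its per-subfile counts):

```java
import java.util.*;

public class SupportFilter {
    // Sum the per-subfile counts for each candidate itemset and keep those
    // whose global support (count / totalBaskets) reaches the threshold s.
    public static Map<String, Integer> frequentItemsets(
            Map<String, int[]> countsPerSubfile, int totalBaskets, double s) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, int[]> e : countsPerSubfile.entrySet()) {
            int sum = 0;
            for (int c : e.getValue()) sum += c;
            if (sum >= s * totalBaskets) result.put(e.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, int[]> counts = new HashMap<>();
        counts.put("1,2", new int[]{3, 2}); // appears in 5 of 10 baskets
        counts.put("1,3", new int[]{1, 0}); // appears in 1 of 10 baskets
        // With s = 0.3, only "1,2" survives the cut.
        System.out.println(frequentItemsets(counts, 10, 0.3)); // prints {1,2=5}
    }
}
```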
Test
We tried many kinds of input data to test our program. We used data with a small number of
transactions and small baskets to verify the correctness of its results. If the input has few
transactions but very large baskets, the running time is still huge; if the baskets are short,
the running time can be small. We also tried different values of k and s and found that, for the
same s, the running time grows if k is too large or too small. For the same k, increasing s
shortens the running time, and a very small s can make it very long.
So far, our program can finish processing example.dat with parameters k=30 and s=0.02
within 3 minutes. This is much better than what we achieved at the beginning, which was
longer than one hour.
Discussion
Running time and memory consumption are the critical factors affecting the efficiency of the
program. At first, our program worked well only when the input file had small baskets.
While checking the performance of our program, we found a fatal flaw: the way we read the
file in the Map function of Pass 1. The default Hadoop Mapper reads its input one line at a
time, but the Apriori algorithm needs to read the entire input file to know the size of the
transaction set. With line-at-a-time input, the support threshold cannot prune Ck at all,
because every itemset trivially has 100% support within its single line. The Apriori algorithm
then does no real work: it merely enumerates all subsets of each basket and outputs them to
the reducer as candidate frequent itemsets to be counted in the next pass, which requires huge
memory and time. After searching on the Internet, we found that overriding the InputFormat
and RecordReader classes solves this problem.
Moreover, our program is very sensitive to the number of subfiles and the value of the threshold.
If we produce too many subfiles, each subfile is so small that the support of each itemset within
it is relatively large. In that case the support threshold again prunes almost nothing, and Pass 1
generates a large set of candidate itemsets, which demands huge time and memory. For the
same reason, if s is too small, the number of candidate itemsets is also large. And if k is very
small, each subfile is very big and takes a long time to process, since a large number of baskets
yields a large C1, with the corresponding cost in time and memory.
While we have already made some progress in improving the efficiency of our program,
computing C1, C2, L1 and L2 still takes a lot of time.
Generating C1 is slow because the program splits every basket to extract items, which
processes the file extensively; reading and processing the file this way is costly. Generating
L1 is slow whenever C1 is large, because the algorithm traverses all the baskets to count the
appearances of every item in C1. Generating C2 from L1 is also expensive, since if |L1| = n,
then |C2| = O(n²). Generating L2 is slow for the same reason: the size of C2. For k ≥ 3 the
process is much faster, because the program can effectively prune Ck using the monotonicity
of frequent itemsets.
Our future work on improving the program consists of the following four aspects:
1. Change the data structure used to store the candidate frequent itemsets.
2. Remove redundant operations and data structures from our Pass 1 algorithm.
3. Combine our current algorithm with other algorithms such as PCY or Multihash to improve efficiency.
4. Find ways to determine a proper k and s for a given file.
Appendix
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.TreeSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class FrequentItemset_MapReduce {
static double s = 0.0;
static int total = 0;
static int partition = 1;
public static final String STRING_SPLIT = ",";
static List<String> FirstResult = new ArrayList<String>();
public static IntWritable one = new IntWritable(1);
public static boolean contain(String[] src, String[] dest) {
for (int i = 0; i < dest.length; i++) {
int j = 0;
for (; j < src.length; j++) {
if (src[j].equals(dest[i])) {
break;
}
}
if (j == src.length) {
return false;// can not find
}
}
return true;
}
public static class CandidateItemsetMapper extends MapReduceBase implements
Mapper<LongWritable, Text, Text, IntWritable> {
@Override
public void map(LongWritable arg0, Text value,
OutputCollector<Text, IntWritable> output, Reporter arg3)
throws IOException {
List<String[]> data = null;
try {
data = loadChessData(value);
} catch (Exception e) {
e.printStackTrace();
}
Map<String, Double> result = compute(data, s, null, null);
for (String key : result.keySet()) {
output.collect(new Text(key), one);
}
}
public Map<String, Double> compute(List<String[]> data,
Double minSupport, Integer maxLoop, String[] containSet) {
if (data == null || data.size() <= 0) {
return null;
}
Map<String, Double> result = new TreeMap<String, Double>();
Map<String, Double> tempresult = new HashMap<String, Double>();
String[] itemSet = getDataUnitSet(data);
int loop = 0;
// loop1
Set<String> keys = combine(tempresult.keySet(), itemSet);
tempresult.clear();
for (String key : keys) {
tempresult.put(key,
computeSupport(data, key.split(STRING_SPLIT)));
}
cut(tempresult, minSupport);
result.putAll(tempresult);
loop++;
String[] strSet = new String[tempresult.size()];
tempresult.keySet().toArray(strSet);
while (true) {
keys = combine(tempresult.keySet(), strSet);
tempresult.clear();
for (String key : keys) {
tempresult.put(key,
computeSupport(data, key.split(STRING_SPLIT)));
}
cut(tempresult, minSupport);
strSet = new String[tempresult.size()];
tempresult.keySet().toArray(strSet);
result.putAll(tempresult);
loop++;
if (tempresult.size() <= 0) {
break;
}
if (maxLoop != null && maxLoop > 0 && loop >= maxLoop) {
break;
}
}
return result;
}
public Double computeSupport(List<String[]> data, String[] subSet) {
Integer value = 0;
for (int i = 0; i < data.size(); i++) {
if (contain(data.get(i), subSet)) {
value++;
}
}
return value * 1.0 / data.size();
}
public String[] getDataUnitSet(List<String[]> data) {
List<String> uniqueKeys = new ArrayList<String>();
for (String[] dat : data) {
for (String da : dat) {
if (!uniqueKeys.contains(da)) {
uniqueKeys.add(da);
}
}
}
// String[] toBeStored = list.toArray(new String[list.size()]);
String[] result = uniqueKeys.toArray(new String[uniqueKeys.size()]);
return result;
}
public Set<String> combine(Set<String> src, String[] target) {
Set<String> dest = new TreeSet<String>();
if (src == null || src.size() <= 0) {
for (String t : target) {
dest.add(t.toString());
}
return dest;
}
for (String s : src) {
for (String t : target) {
String[] itemset1 = s.split(STRING_SPLIT);
String[] itemset2 = t.split(STRING_SPLIT);
int i = 0;
for (i = 0; i < itemset1.length - 1
&& i < itemset2.length - 1; i++) {
int a = Integer.parseInt(itemset1[i]);
int b = Integer.parseInt(itemset2[i]);
if (a != b)
break;
else
continue;
}
int a = Integer.parseInt(itemset1[i]);
int b = Integer.parseInt(itemset2[i]);
if (i == itemset2.length - 1 && a != b) {
String keys = s + STRING_SPLIT + itemset2[i];
String key[] = keys.split(STRING_SPLIT);
String Checkkeys = null;
if (a > b) {
String temp;
temp = key[key.length - 1];
key[key.length - 1] = key[key.length - 2];
key[key.length - 2] = temp;
keys = key[0];
for (int j = 0; j < key.length - 1; j++) {
keys = keys + STRING_SPLIT + key[j + 1];
}
}
if (key.length > 2) {
int k = 0;
for (k = 0; k < key.length - 2; k++) {
int end1 = keys.indexOf(key[k]);
int start2 = keys.indexOf(key[k + 1]);
Checkkeys = keys.substring(0, end1)
+ keys.substring(start2, keys.length());
if (!src.contains(Checkkeys))
break;
else
continue;
}
if (k == key.length - 2)
dest.add(keys);
}
if (Checkkeys == null) {
if (!dest.contains(keys)) {
dest.add(keys);
}
}
}
}
}
return dest;
}
public Map<String, Double> cut(Map<String, Double> tempresult,
Double minSupport) {
for (Object key : tempresult.keySet().toArray()) {
if (minSupport != null && minSupport > 0 && minSupport < 1
&& tempresult.get(key) < minSupport) {
tempresult.remove(key);
}
}
return tempresult;
}
public static List<String[]> loadChessData(Text value) throws Exception {
List<String[]> result = new ArrayList<String[]>();
StringTokenizer baskets = new StringTokenizer(value.toString(),
"\n");
while (baskets.hasMoreTokens()) {
String[] items = baskets.nextToken().split(" ");
result.add(items);
}
return result;
}
}
public static class CandidateItemsetReducer extends MapReduceBase implements
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
// The summed value is unused: Pass 1 only collects distinct candidates,
// so every candidate is emitted with value 1; real counts come in Pass 2.
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(1));
}
}
public static void preprocessingphase1(String[] args) throws Exception {
String originalfilepath = getLocation(args[0]);
System.out.println(originalfilepath);
if (originalfilepath == null)
return;
List<String> lines = readFile(originalfilepath);
if (lines == null)
return;
total = lines.size();
partition = Integer.parseInt(args[1]);
int m = (int) total / partition;
double m_d = total * 1.0 / partition;
if (m_d > m)
m = m + 1;
mkdir("input_temp");
for (int i = 0; i < partition; i++) {
String newpath = "input_temp/" + i + ".dat";
String input_temp = "";
for (int j = 0; j < m && total - i * m - j > 0; j++) {
input_temp += lines.get(i * m + j) + "\n";
}
createFile(newpath, input_temp.getBytes());
}
}
public static void preprocessingphase2() throws Exception {
List<String> lines = readFile("output_temp/part-00000");
Iterator<String> itr = lines.iterator();
while (itr.hasNext()) {
String basket = (String) itr.next();
String itemset = basket.substring(0, basket.indexOf("\t"));
FirstResult.add(itemset);
}
System.out.println("Pre processing for phase 2 finished.");
}
public static class FrequentItemsetMapper extends MapReduceBase implements
Mapper<LongWritable, Text, Text, IntWritable> {
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
String data = value.toString();
String[] baskets = data.split("\n");
for (int i = 0; i < FirstResult.size(); i++) {
int number = 0;
String[] items = FirstResult.get(i).split(STRING_SPLIT);
for (int j = 0; j < baskets.length; j++) {
int k = 0;
for (k = 0; k < items.length; k++) {
String[] basketsitemset = baskets[j].split(" ");
if (contain(basketsitemset, items))
continue;
else
break;
}
if (k == items.length) {
number = number + 1;
}
}
output.collect(new Text(FirstResult.get(i)), new IntWritable(
number));
}
}
}
public static class FrequentItemsetReducer extends MapReduceBase implements
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
if (sum >= s * total)
output.collect(key, new IntWritable(sum));
}
}
public static List<String> readFile(String filePath) throws IOException {
Path f = new Path(filePath);
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(filePath), conf);
FSDataInputStream dis = fs.open(f);
InputStreamReader isr = new InputStreamReader(dis, "utf-8");
BufferedReader br = new BufferedReader(isr);
List<String> lines = new ArrayList<String>();
String str = "";
while ((str = br.readLine()) != null) {
lines.add(str);
}
br.close();
isr.close();
dis.close();
System.out.println("Original file reading complete.");
return lines;
}
public static String getLocation(String path) throws Exception {
Configuration conf = new Configuration();
FileSystem hdfs = FileSystem.get(conf);
Path listf = new Path(path);
FileStatus stats[] = hdfs.listStatus(listf);
String FilePath = stats[0].getPath().toString();
hdfs.close();
System.out.println("Find input file.");
return FilePath;
}
public static void mkdir(String path) throws IOException {
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path srcPath = new Path(path);
boolean isok = fs.mkdirs(srcPath);
if (isok) {
System.out.println("create dir ok.");
} else {
System.out.println("create dir failure.");
}
fs.close();
}
public static void createFile(String dst, byte[] contents)
throws IOException {
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path dstPath = new Path(dst);
FSDataOutputStream outputStream = fs.create(dstPath);
outputStream.write(contents);
outputStream.close();
fs.close();
System.out.println("file " + dst + " create complete.");
}
public static void phase1(String[] args) throws Exception {
s = Double.parseDouble(args[2]);
JobConf conf = new JobConf(FrequentItemset_MapReduce.class);
conf.setJobName("Find frequent candidate");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(CandidateItemsetMapper.class);
conf.setReducerClass(CandidateItemsetReducer.class);
conf.setInputFormat(WholeFileInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path("input_temp"));
FileOutputFormat.setOutputPath(conf, new Path("output_temp"));
JobClient.runJob(conf);
}
// phase 2
public static void phase2(String[] args) throws Exception {
JobConf conf = new JobConf(FrequentItemset_MapReduce.class);
conf.setJobName("Frequent Itemsets Count");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(FrequentItemsetMapper.class);
conf.setReducerClass(FrequentItemsetReducer.class);
FileInputFormat.setInputPaths(conf, new Path("input_temp"));
FileOutputFormat.setOutputPath(conf, new Path("output"));
JobClient.runJob(conf);
}
public static class WholeFileRecordReader implements
RecordReader<LongWritable, Text> {
private FileSplit fileSplit;
private Configuration conf;
private boolean processed = false;
public WholeFileRecordReader(FileSplit fileSplit, Configuration conf)
throws IOException {
this.fileSplit = fileSplit;
this.conf = conf;
}
@Override
public boolean next(LongWritable key, Text value) throws IOException {
if (!processed) {
byte[] contents = new byte[(int) fileSplit.getLength()];
Path file = fileSplit.getPath();
String fileName = file.getName();
FileSystem fs = file.getFileSystem(conf);
FSDataInputStream in = null;
try {
in = fs.open(file);
IOUtils.readFully(in, contents, 0, contents.length);
value.set(contents, 0, contents.length);
} finally {
IOUtils.closeStream(in);
}
processed = true;
return true;
}
return false;
}
@Override
public LongWritable createKey() {
return new LongWritable();
}
@Override
public Text createValue() {
return new Text();
}
@Override
public long getPos() throws IOException {
return processed ? fileSplit.getLength() : 0;
}
@Override
public float getProgress() throws IOException {
return processed ? 1.0f : 0.0f;
}
@Override
public void close() throws IOException {
// do nothing
}
}
public static class WholeFileInputFormat extends
FileInputFormat<LongWritable, Text> {
@Override
protected boolean isSplitable(FileSystem fs, Path filename) {
return false;
}
@Override
public RecordReader<LongWritable, Text> getRecordReader(
InputSplit split, JobConf job, Reporter reporter)
throws IOException {
return new WholeFileRecordReader((FileSplit) split, job);
}
}
public static void main(String[] args) throws Exception {
if (args.length < 3) {
System.out.println("The number of arguments is less than three.");
return;
}
preprocessingphase1(args);
phase1(args);
preprocessingphase2();
phase2(args);
List<String> lines = readFile("output/part-00000");
Iterator<String> itr = lines.iterator();
File filename = new File("/home/hadoop/Desktop/result.txt");
filename.createNewFile();
try {
BufferedWriter out = new BufferedWriter(new FileWriter(filename));
String firstline = Integer.toString(lines.size()) + "\n";
out.write(firstline);
while (itr.hasNext()) {
String basket = (String) itr.next();
String itemset = basket.substring(0, basket.indexOf("\t"));
String number = basket.substring(basket.indexOf("\t") + 1,
basket.length());
out.write(itemset + "(" + number + ")" + "\n");
}
out.flush();
out.close();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}

More Related Content

PPTX
Unit 3 lecture-2
PPTX
PDF
The Ring programming language version 1.4.1 book - Part 29 of 31
PPT
Inroduction to r
PPT
Hadoop_Pennonsoft
PDF
Reactive Programming for a demanding world: building event-driven and respons...
PPTX
Algorithm analysis and design
ODP
Functional programming in Javascript
Unit 3 lecture-2
The Ring programming language version 1.4.1 book - Part 29 of 31
Inroduction to r
Hadoop_Pennonsoft
Reactive Programming for a demanding world: building event-driven and respons...
Algorithm analysis and design
Functional programming in Javascript

What's hot (20)

PDF
The Ring programming language version 1.2 book - Part 78 of 84
PDF
Lecture 01 variables scripts and operations
PDF
(2) c sharp introduction_basics_part_i
PDF
How to Think in RxJava Before Reacting
PPTX
Apache Flink Training: DataStream API Part 2 Advanced
PPTX
Flink Batch Processing and Iterations
PPTX
PPTX
Apache Flink Training: DataSet API Basics
PPTX
Pa2 session 1
PPTX
Programming in Python
PPTX
Python programming
PDF
A Reflective Implementation of an Actor-based Concurrent Context-Oriented System
PDF
The Ring programming language version 1.3 book - Part 82 of 88
PDF
Functions and modules in python
DOCX
Parallel Programming With Dot Net
PDF
Python Programming - IX. On Randomness
PDF
cb streams - gavin pickin
PPT
Programming in Computational Biology
ODP
Biopython
PDF
Memory Management In Python The Basics
The Ring programming language version 1.2 book - Part 78 of 84
Lecture 01 variables scripts and operations
(2) c sharp introduction_basics_part_i
How to Think in RxJava Before Reacting
Apache Flink Training: DataStream API Part 2 Advanced
Flink Batch Processing and Iterations
Apache Flink Training: DataSet API Basics
Pa2 session 1
Programming in Python
Python programming
A Reflective Implementation of an Actor-based Concurrent Context-Oriented System
The Ring programming language version 1.3 book - Part 82 of 88
Functions and modules in python
Parallel Programming With Dot Net
Python Programming - IX. On Randomness
cb streams - gavin pickin
Programming in Computational Biology
Biopython
Memory Management In Python The Basics
Ad

Similar to DataMiningReport (20)

PDF
Hadoop implementation for algorithms apriori, pcy, son
PDF
An improvised tree algorithm for association rule mining using transaction re...
PDF
An Improved Frequent Itemset Generation Algorithm Based On Correspondence
PPT
Lecture20
PPT
Cs501 mining frequentpatterns
PPTX
Association rule mining
PPT
The comparative study of apriori and FP-growth algorithm
PPT
My6asso
PDF
B0950814
PPT
Apriori algorithm
PPT
association(BahanAR-4) data mining apriori.ppt
PPTX
Data Mining Lecture_4.pptx
PPT
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
PDF
Discovering Frequent Patterns with New Mining Procedure
PPSX
Frequent itemset mining methods
PDF
Assocrules
PPTX
Chapter 01 Introduction DM.pptx
PDF
IRJET- Hadoop based Frequent Closed Item-Sets for Association Rules form ...
PDF
5 parallel implementation 06299286
PPT
FP growth algorithm, data mining, data analystics
Hadoop implementation for algorithms apriori, pcy, son
An improvised tree algorithm for association rule mining using transaction re...
An Improved Frequent Itemset Generation Algorithm Based On Correspondence
Lecture20
Cs501 mining frequentpatterns
Association rule mining
The comparative study of apriori and FP-growth algorithm
My6asso
B0950814
Apriori algorithm
association(BahanAR-4) data mining apriori.ppt
Data Mining Lecture_4.pptx
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Discovering Frequent Patterns with New Mining Procedure
Frequent itemset mining methods
Assocrules
Chapter 01 Introduction DM.pptx
IRJET- Hadoop based Frequent Closed Item-Sets for Association Rules form ...
5 parallel implementation 06299286
FP growth algorithm, data mining, data analystics
Ad

DataMiningReport

  • 1. Report For Data Mining Project Xiumo Zhan [email protected] Bowen Sun [email protected] Abstract This project use Mapreduce programming to find all frequent itemsets among the transaction in the given file in two passes. We use java as programming language and Eclipse with Hadoop pluggin as the development environment. In this project, we use two passes to implement Mapreduce with the SON algorithm and Apriori algorithm. Finally, our Mapreduce program can achieve the expected result according to the given file, a parameter k as the number of subfiles and a parameter s as support threshold. Implementation Our implementation is applied by SON algorithm. This algorithm consists of two passes, each of which requires one Map function and one Reduce function. The SON algorithm lends itself well to a parallel-computing environment: each of the chunks can be processed in parallel, and the frequent itemsets from each chunk combined to form the candidates. So in order to simulate the parallel computing environment, we build a pseudo distributed model Hadoop environment on the Ubuntu system running as a virtual machine using Vmware workstation in our own laptop. Pass 1 In Pass 1, we first divide the entire big file into k subfiles, and the input of each mapper is one of the k subfiles. Then we implement the Apriori algorithm on each subfile. While applying the Apriori algorithm, we read the entire input file and then divide them to lines, which represent the baskets. We then split the items of each line and use the data structure list<String[]> to store the distinct items, which is our candidate frequent 1-itemset 𝐶𝐶1. Then we compute the support of each items in 𝐶𝐶1 to generate 𝐿𝐿1 and use this 𝐿𝐿1 to form the pairs 𝐶𝐶2. For 𝐶𝐶2, we have to check if it can reach the threshold 𝑠𝑠 and then generate 𝐿𝐿2. For any 𝑘𝑘 ≥ 3, if we want to self join 𝐿𝐿𝑘𝑘−1 to form 𝐶𝐶𝑘𝑘 we have to compare the first 𝑘𝑘 − 2 elements for each two itemsets in 𝐿𝐿𝑘𝑘−1. 
For instance, there are two 3-itemsets “234” and “235” in L3, so we will check if the first two elements in these two itemsets are the same. The pseudo code of this procedure is described in the following: Combine(itemset1, itemset2)
  • 2. set point=0; set key={}; for i=1 to the length of both itemsets if itemset1[i]==itemset2[i] point=point+1; key=key+itemset1[i]; else break; endif endfor if point==length of both itemsets if itemset1[point+1]>itemset2[point+1] key=key+itemset2[point+1]+itemset1[point+1]; else key=key+itemset1[point+1]+itemset2[point+1]; endif endif We can use these “234” and “235” to form “2345”, we have to check if it is qualified to stay in the 𝐶𝐶4. We have known that “234” and “235” are already in the 𝐿𝐿3, so we just need to check if both “245” and ”345” are in the 𝐿𝐿3 instead of checking all four 3-itemsets, which will avoid unnecessary check. In practical programming, we notice that the itemsets that needs to be checked are the set of itemsets containing the last two items and without one of arbitrary k − 2 items for the 𝐶𝐶𝑘𝑘 . So we will continue the self join process using 𝐿𝐿𝑘𝑘 until the the generated 𝐶𝐶𝑘𝑘+1 is empty. The pseudo code of our checking procedure can be written as the following: Check(itemset, 𝐿𝐿𝑘𝑘−1) set flag=1; for i=1 to length of itemset-3 set subitemset[i]=delete itemset[i] from the itemset; if subitemset[i] not exists in 𝐿𝐿𝑘𝑘−1 set flag=0; break; endif endfor if flag=0 delete this itemset from 𝐿𝐿𝑘𝑘; else keep this itemset in 𝐿𝐿𝑘𝑘; endif The result produced by the mapper is the candidate frequent itemsets of each subfile. Then reducer use our first reduce function to prune the duplicated itemsets in the output of the mapper. We notice that the first reduce function will ignore the value of support for each
  • 3. itemsets, the task of computation of the actual support of each itemset will be assigned to Pass 2. After we produce all the candidate itemsets using Apriori algorithm in Pass 1, we will output a file of all the candidate itemsets in the format of <”itemset”, ”value”>, “value” is set to be 1 since we need to collect all distinct candidate frequent itemset. Storing all candidate itemsets is necessary for our algorithm to carry on because it will ensure that the candidate itemsets that are produced in Pass 1 will be passed to the next pass. The pseudo code of the whole procedure of Pass 1 can be represented as the following: Class FirstMapper Method FirstMapper( inputfile) set i=1; while(|𝐶𝐶𝑖𝑖| > 0) for each 𝐶𝐶𝑖𝑖𝑘𝑘 in 𝐶𝐶𝑖𝑖 𝐶𝐶𝑖𝑖𝑘𝑘 =<Cik.key, 𝐶𝐶𝑖𝑖𝑘𝑘. 𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠=computesupport(inputfile,Cik)>; endfor 𝐿𝐿𝑖𝑖=cut(𝐶𝐶𝑖𝑖); Result=Result+𝐿𝐿𝑖𝑖; 𝐶𝐶𝑖𝑖+1=self-join(𝐿𝐿𝑖𝑖); i++; Output <Result.key, 1>; Class FirstReducer Method FirstReducer(keys, values) Collect all distinct candidate frequent itemset; Pass 2 In Pass 2, we will first read the output file produced in Pass 1. In Pass 2, the task of mapper is to compute the number of appearance of the candidate frequent itemsets in each subfile. To finish this, we use the subfiles and implement the second Map and Reduce function. The second mapper produces the number of appearance of each candidate frequent itemset in each subfile and transmit the result in the format <”key”,”value”> to the second reducer. The second reducer will sum the values for the same key and will generate a new pair for each candidate frequent itemsets. The reducer then eliminates the itemsets whose value of support ( 𝑠𝑠𝑠𝑠𝑠𝑠(𝑣𝑣𝑣𝑣𝑣𝑣𝑣𝑣 𝑣𝑣𝑣𝑣) 𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛 𝑜𝑜𝑜𝑜 𝑏𝑏𝑏𝑏𝑏𝑏𝑘𝑘𝑘𝑘𝑘𝑘𝑘𝑘 ) is smaller than s. The pseudo code can be written as the following:
    Class SecondMapper
        Method SecondMapper(result of Pass 1, subfile)
            Result2={};
            for each key[i] in the result of Pass 1
                count=number of baskets in the subfile in which key[i] appears;
                Result2=Result2+<key[i], count>;
            endfor
            return Result2;

    Class SecondReducer
        Method SecondReducer(keys, values)
            for each keys[i] in the input keys
                values=sum of values for the same key;
                ComputeAndCut(<keys[i], values>, s);
            endfor
            Output <keys, values>;

Test

We have tried many kinds of input data to test our program. We used data with small transactions and small baskets to verify the correctness of its results. If the input has few transactions but very large baskets, the running time is still huge; if the baskets are short, the running time is short. We also tried different values of k and s and found that, for the same s, the running time is relatively longer when k is either too large or too small. For the same k, increasing s shortens the running time, while a very small s can make it very long. So far, our program finishes processing the given example.dat with parameters k=30 and s=0.02 within 3 minutes, a large improvement over what we achieved at the beginning, which was more than one hour.

Discussion

Running time and memory consumption are the critical factors affecting the efficiency of the program. At first, our program worked well only when the input file had small baskets. While checking its performance, we found a fatal flaw: the way we read the file in the Map function of Pass 1. In the Hadoop Mapper class, the default behavior is to read the input just one line at a time, but the Apriori algorithm needs to read the entire input chunk to know the size of the transaction set. With line-at-a-time input, the support threshold we define cannot prune Ck at all, because every itemset then has a support of 100%.
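To make this failure mode concrete: if each map() call sees only a single basket, the "transaction set" has size one, so every itemset occurring in that basket has support 1.0 and the threshold prunes nothing. A minimal standalone illustration (plain Java with names of our own choosing, not taken from our Hadoop code):

```java
import java.util.Arrays;
import java.util.List;

public class SupportScope {
    // Fraction of baskets that contain every item of the candidate itemset.
    static double support(List<String[]> baskets, String[] itemset) {
        int hits = 0;
        for (String[] basket : baskets) {
            if (Arrays.asList(basket).containsAll(Arrays.asList(itemset))) {
                hits++;
            }
        }
        return hits * 1.0 / baskets.size();
    }

    public static void main(String[] args) {
        List<String[]> wholeChunk = Arrays.asList(
                new String[] { "2", "3", "4" },
                new String[] { "2", "3", "5" },
                new String[] { "1", "6" });
        String[] candidate = { "2", "3" };
        // Whole chunk: {2,3} occurs in 2 of 3 baskets, so the threshold can prune.
        System.out.println(support(wholeChunk, candidate));
        // One line at a time: the lone basket trivially contains its own subsets,
        // so the support is always 1.0 and no candidate is ever pruned.
        List<String[]> oneLine = Arrays.asList(
                new String[][] { { "2", "3", "4" } });
        System.out.println(support(oneLine, candidate));
    }
}
```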
The Apriori algorithm, in this case, is actually not working: it merely enumerates all the subsets of each basket and outputs them to the reducer as candidate frequent itemsets whose actual support must then be counted in the next pass. This demands huge memory space and far too much time. After searching on the internet, we found that overriding
the InputFormat and RecordReader classes can solve this problem. Moreover, our program is very sensitive to the number of subfiles and to the value of the threshold. If we produce too many subfiles, each subfile is so small that the support of each itemset within it is relatively large; the support threshold again becomes useless, and Pass 1 generates a large set of candidate itemsets, with a correspondingly huge demand for time and memory. For the same reason, if s is too small, the number of candidate itemsets is also large. Conversely, if k is very small, each subfile is very large and takes a long time to process, since a large number of baskets yields a large C1, again at great cost in time and memory.

While we have already made some progress in improving the efficiency of our program, computing C1, C2, L1 and L2 still takes most of the time. Generating C1 is slow because the program must split every basket to extract its items, which means processing the file extensively; reading and processing the file this way can be costly. Generating L1 is slow whenever C1 is large, because our algorithm traverses all the baskets to count the appearances of every item in C1. Generating C2 from L1 is also expensive, since if |L1| = n then |C2| = O(n^2), and generating L2 is slow because of the size of C2. For k >= 3 the process is much faster, because the program can effectively prune itemsets in Ck using the monotonicity of frequent itemsets.

Our future work on improving the program consists of the following four aspects.
1. Change the data structure used to store the candidate frequent itemsets.
2. Prune redundant operations and data structures in our Pass 1 algorithm.
3. Combine our current algorithm with other algorithms, such as PCY and Multihash, to improve efficiency.
4. Also we need to find ways to determine the proper k for the given file and s.

Appendix

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.TreeSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class FrequentItemset_MapReduce {

    static double s = 0.0;
    static int total = 0;
    static int partition = 1;
    public static final String STRING_SPLIT = ",";
    static List<String> FirstResult = new ArrayList<String>();
    public static IntWritable one = new IntWritable(1);

    // Returns true iff every item of dest occurs somewhere in src.
    public static boolean contain(String[] src, String[] dest) {
        for (int i = 0; i < dest.length; i++) {
            int j = 0;
            for (; j < src.length; j++) {
                if (src[j].equals(dest[i])) {
                    break;
                }
            }
            if (j == src.length) {
                return false; // can not find
            }
        }
        return true;
    }

    public static class CandidateItemsetMapper extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, IntWritable> {

        @Override
        public void map(LongWritable arg0, Text value,
                OutputCollector<Text, IntWritable> output, Reporter arg3)
                throws IOException {
            List<String[]> data = null;
            try {
                data = loadChessData(value);
            } catch (Exception e) {
                e.printStackTrace();
            }
            Map<String, Double> result = compute(data, s, null, null);
            for (String key : result.keySet()) {
                output.collect(new Text(key), one);
            }
        }

        public Map<String, Double> compute(List<String[]> data,
                Double minSupport, Integer maxLoop, String[] containSet) {
            if (data == null || data.size() <= 0) {
                return null;
            }
            Map<String, Double> result = new TreeMap<String, Double>();
            Map<String, Double> tempresult = new HashMap<String, Double>();
            String[] itemSet = getDataUnitSet(data);
            int loop = 0;
            // loop 1
            Set<String> keys = combine(tempresult.keySet(), itemSet);
            tempresult.clear();
            for (String key : keys) {
                tempresult.put(key,
                        computeSupport(data, key.split(STRING_SPLIT)));
            }
            cut(tempresult, minSupport);
            result.putAll(tempresult);
            loop++;
            String[] strSet = new String[tempresult.size()];
            tempresult.keySet().toArray(strSet);
            while (true) {
                keys = combine(tempresult.keySet(), strSet);
                tempresult.clear();
                for (String key : keys) {
                    tempresult.put(key,
                            computeSupport(data, key.split(STRING_SPLIT)));
                }
                cut(tempresult, minSupport);
                strSet = new String[tempresult.size()];
                tempresult.keySet().toArray(strSet);
                result.putAll(tempresult);
                loop++;
                if (tempresult.size() <= 0) {
                    break;
                }
                if (maxLoop != null && maxLoop > 0 && loop >= maxLoop) {
                    break;
                }
            }
            return result;
        }

        public Double computeSupport(List<String[]> data, String[] subSet) {
            Integer value = 0;
            for (int i = 0; i < data.size(); i++) {
                if (contain(data.get(i), subSet)) {
                    value++;
                }
            }
            return value * 1.0 / data.size();
        }

        public String[] getDataUnitSet(List<String[]> data) {
            List<String> uniqueKeys = new ArrayList<String>();
            for (String[] dat : data) {
                for (String da : dat) {
                    if (!uniqueKeys.contains(da)) {
                        uniqueKeys.add(da);
                    }
                }
            }
            String[] result = uniqueKeys.toArray(new String[uniqueKeys.size()]);
            return result;
        }

        public Set<String> combine(Set<String> src, String[] target) {
            Set<String> dest = new TreeSet<String>();
            if (src == null || src.size() <= 0) {
                for (String t : target) {
                    dest.add(t.toString());
                }
                return dest;
            }
            for (String s : src) {
                for (String t : target) {
                    String[] itemset1 = s.split(STRING_SPLIT);
                    String[] itemset2 = t.split(STRING_SPLIT);
                    int i = 0;
                    for (i = 0; i < itemset1.length - 1
                            && i < itemset2.length - 1; i++) {
                        int a = Integer.parseInt(itemset1[i]);
                        int b = Integer.parseInt(itemset2[i]);
                        if (a != b)
                            break;
                        else
                            continue;
                    }
                    int a = Integer.parseInt(itemset1[i]);
                    int b = Integer.parseInt(itemset2[i]);
                    if (i == itemset2.length - 1 && a != b) {
                        String keys = s + STRING_SPLIT + itemset2[i];
                        String key[] = keys.split(STRING_SPLIT);
                        String Checkkeys = null;
                        if (a > b) {
                            String temp;
                            temp = key[key.length - 1];
                            key[key.length - 1] = key[key.length - 2];
                            key[key.length - 2] = temp;
                            keys = key[0];
                            for (int j = 0; j < key.length - 1; j++) {
                                keys = keys + STRING_SPLIT + key[j + 1];
                            }
                        }
                        if (key.length > 2) {
                            int k = 0;
                            for (k = 0; k < key.length - 2; k++) {
                                int end1 = keys.indexOf(key[k]);
                                int start2 = keys.indexOf(key[k + 1]);
                                Checkkeys = keys.substring(0, end1)
                                        + keys.substring(start2, keys.length());
                                if (!src.contains(Checkkeys))
                                    break;
                                else
                                    continue;
                            }
                            if (k == key.length - 2)
                                dest.add(keys);
                        }
                        if (Checkkeys == null) {
                            if (!dest.contains(keys)) {
                                dest.add(keys);
                            }
                        }
                    }
                }
            }
            return dest;
        }

        public Map<String, Double> cut(Map<String, Double> tempresult,
                Double minSupport) {
            for (Object key : tempresult.keySet().toArray()) {
                if (minSupport != null && minSupport > 0 && minSupport < 1
                        && tempresult.get(key) < minSupport) {
                    tempresult.remove(key);
                }
            }
            return tempresult;
        }

        public static List<String[]> loadChessData(Text value) throws Exception {
            List<String[]> result = new ArrayList<String[]>();
            StringTokenizer baskets = new StringTokenizer(value.toString(), "\n");
            while (baskets.hasMoreTokens()) {
                String[] items = baskets.nextToken().split(" ");
                result.add(items);
            }
            return result;
        }
    }

    public static class CandidateItemsetReducer extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(1));
        }
    }

    public static void preprocessingphase1(String[] args) throws Exception {
        String originalfilepath = getLocation(args[0]);
        System.out.println(originalfilepath);
        if (originalfilepath == null)
            return;
        List<String> lines = readFile(originalfilepath);
        if (lines == null)
            return;
        total = lines.size();
        partition = Integer.parseInt(args[1]);
        int m = (int) total / partition;
        double m_d = total * 1.0 / partition;
        if (m_d > m)
            m = m + 1;
        mkdir("input_temp");
        for (int i = 0; i < partition; i++) {
            String newpath = "input_temp/" + i + ".dat";
            String input_temp = "";
            for (int j = 0; j < m && total - i * m - j > 0; j++) {
                input_temp += lines.get(i * m + j) + "\n";
            }
            createFile(newpath, input_temp.getBytes());
        }
    }

    public static void preprocessingphase2() throws Exception {
        List<String> lines = readFile("output_temp/part-00000");
        Iterator<String> itr = lines.iterator();
        while (itr.hasNext()) {
            String basket = (String) itr.next();
            String itemset = basket.substring(0, basket.indexOf("\t"));
            FirstResult.add(itemset);
        }
        System.out.println("Pre processing for phase 2 finished.");
    }

    public static class FrequentItemsetMapper extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String data = value.toString();
            String[] baskets = data.split("\n");
            for (int i = 0; i < FirstResult.size(); i++) {
                int number = 0;
                String[] items = FirstResult.get(i).split(STRING_SPLIT);
                for (int j = 0; j < baskets.length; j++) {
                    int k = 0;
                    for (k = 0; k < items.length; k++) {
                        String[] basketsitemset = baskets[j].split(" ");
                        if (contain(basketsitemset, items))
                            continue;
                        else
                            break;
                    }
                    if (k == items.length) {
                        number = number + 1;
                    }
                }
                output.collect(new Text(FirstResult.get(i)),
                        new IntWritable(number));
            }
        }
    }

    public static class FrequentItemsetReducer extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            if (sum >= s * total)
                output.collect(key, new IntWritable(sum));
        }
    }

    public static List<String> readFile(String filePath) throws IOException {
        Path f = new Path(filePath);
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(filePath), conf);
        FSDataInputStream dis = fs.open(f);
        InputStreamReader isr = new InputStreamReader(dis, "utf-8");
        BufferedReader br = new BufferedReader(isr);
        List<String> lines = new ArrayList<String>();
        String str = "";
        while ((str = br.readLine()) != null) {
            lines.add(str);
        }
        br.close();
        isr.close();
        dis.close();
        System.out.println("Original file reading complete.");
        return lines;
    }

    public static String getLocation(String path) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        Path listf = new Path(path);
        FileStatus stats[] = hdfs.listStatus(listf);
        String FilePath = stats[0].getPath().toString();
        hdfs.close();
        System.out.println("Find input file.");
        return FilePath;
    }

    public static void mkdir(String path) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path srcPath = new Path(path);
        boolean isok = fs.mkdirs(srcPath);
        if (isok) {
            System.out.println("create dir ok.");
        } else {
            System.out.println("create dir failure.");
        }
        fs.close();
    }

    public static void createFile(String dst, byte[] contents)
            throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path dstPath = new Path(dst);
        FSDataOutputStream outputStream = fs.create(dstPath);
        outputStream.write(contents);
        outputStream.close();
        fs.close();
        System.out.println("file " + dst + " create complete.");
    }

    public static void phase1(String[] args) throws Exception {
        s = Double.parseDouble(args[2]);
        JobConf conf = new JobConf(FrequentItemset_MapReduce.class);
        conf.setJobName("Find frequent candidate");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(CandidateItemsetMapper.class);
        conf.setReducerClass(CandidateItemsetReducer.class);
        conf.setInputFormat(WholeFileInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path("input_temp"));
        FileOutputFormat.setOutputPath(conf, new Path("output_temp"));
        JobClient.runJob(conf);
    }

    // phase 2
    public static void phase2(String[] args) throws Exception {
        JobConf conf = new JobConf(FrequentItemset_MapReduce.class);
        conf.setJobName("Frequent Itemsets Count");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(FrequentItemsetMapper.class);
        conf.setReducerClass(FrequentItemsetReducer.class);
        FileInputFormat.setInputPaths(conf, new Path("input_temp"));
        FileOutputFormat.setOutputPath(conf, new Path("output"));
        JobClient.runJob(conf);
    }

    public static class WholeFileRecordReader implements
            RecordReader<LongWritable, Text> {
        private FileSplit fileSplit;
        private Configuration conf;
        private boolean processed = false;

        public WholeFileRecordReader(FileSplit fileSplit, Configuration conf)
                throws IOException {
            this.fileSplit = fileSplit;
            this.conf = conf;
        }

        @Override
        public boolean next(LongWritable key, Text value) throws IOException {
            if (!processed) {
                byte[] contents = new byte[(int) fileSplit.getLength()];
                Path file = fileSplit.getPath();
                String fileName = file.getName();
                FileSystem fs = file.getFileSystem(conf);
                FSDataInputStream in = null;
                try {
                    in = fs.open(file);
                    IOUtils.readFully(in, contents, 0, contents.length);
                    value.set(contents, 0, contents.length);
                } finally {
                    IOUtils.closeStream(in);
                }
                processed = true;
                return true;
            }
            return false;
        }

        @Override
        public LongWritable createKey() {
            return new LongWritable();
        }

        @Override
        public Text createValue() {
            return new Text();
        }

        @Override
        public long getPos() throws IOException {
            return processed ? fileSplit.getLength() : 0;
        }

        @Override
        public float getProgress() throws IOException {
            return processed ? 1.0f : 0.0f;
        }

        @Override
        public void close() throws IOException {
            // do nothing
        }
    }

    public static class WholeFileInputFormat extends
            FileInputFormat<LongWritable, Text> {
        @Override
        protected boolean isSplitable(FileSystem fs, Path filename) {
            return false;
        }

        @Override
        public RecordReader<LongWritable, Text> getRecordReader(
                InputSplit split, JobConf job, Reporter reporter)
                throws IOException {
            return new WholeFileRecordReader((FileSplit) split, job);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.out.println("The number of arguments is less than three.");
            return;
        }
        preprocessingphase1(args);
        phase1(args);
        preprocessingphase2();
        phase2(args);
        List<String> lines = readFile("output/part-00000");
        Iterator<String> itr = lines.iterator();
        File filename = new File("/home/hadoop/Desktop/result.txt");
        filename.createNewFile();
        try {
            BufferedWriter out = new BufferedWriter(new FileWriter(filename));
            String firstline = Integer.toString(lines.size()) + "\n";
            out.write(firstline);
            while (itr.hasNext()) {
                String basket = (String) itr.next();
                String itemset = basket.substring(0, basket.indexOf("\t"));
                String number = basket.substring(basket.indexOf("\t") + 1,
                        basket.length());
                out.write(itemset + "(" + number + ")" + "\n");
            }
            out.flush();
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
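As a quick sanity check of the subset test at the heart of both passes, the contain() helper from the listing above behaves as follows (reimplemented here as a standalone sketch so it can be run without Hadoop):

```java
public class ContainCheck {
    // True iff every element of dest occurs somewhere in src
    // (mirrors the contain() helper in the appendix listing).
    static boolean contain(String[] src, String[] dest) {
        for (String d : dest) {
            boolean found = false;
            for (String s : src) {
                if (s.equals(d)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] basket = { "2", "3", "4", "7" };
        System.out.println(contain(basket, new String[] { "2", "4" })); // true
        System.out.println(contain(basket, new String[] { "2", "5" })); // false
    }
}
```

Note that this is set containment, not a substring or prefix match, which is why the itemsets in FirstResult can be counted against raw space-separated baskets.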