Rd1 r17a19 datawarehousing and mining_cap617t_cap617

HOME WORK TITLE-DESIGN PROBLEM

COURSE CODE-CAP 617T

COURSE INSTRUCTOR-Lect.Neha Malhotra Mam

COURSE TUTOR-DO

ALLOCATION DATE-01/09/12

SUBMISSION DATE-01/11/12

STUDENT ROLL NO. – A19

SECTION-D1R17

DECLARATION,

I declare that this design problem is my individual work. I have not copied from any
other student's work or from any other source except where due acknowledgement is
made explicitly in the text, nor has any part been written for me by another
person.

EVALUATOR COMMENT……… STUDENT SIGN

MARKS OBTAINED………. PRIYA RANJAN

Q.1 What kind of data mining algorithms are used for mining the database?
Justify the importance of the tool.

Ans: The data mining algorithms used for mining the database, and the importance
of the tool, are described below:-

When student mistakes are recorded, association rule algorithms can be used to
find mistakes that are often associated together. Combined with a genetic algorithm,
concepts mastered together can be identified using student scores. The teacher may
use these findings to reflect on his/her teaching and re-design the course material.
Data mining also comprises data exploration and visualization to present results in a
convenient way to users. New variables may be calculated and used in algorithms,
such as the average number of mistakes made per attempted exercise.
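As an illustration of such a calculated variable, the following is a minimal sketch of how the average number of mistakes per attempted exercise could be computed. The record layout (student, exercise, number of mistakes) and the sample values are assumptions made for the example, not the actual format of the data we mined.

```python
from collections import defaultdict

# Hypothetical log records: (student, exercise, number of mistakes in that attempt).
attempts = [
    ("s1", "ex1", 3),
    ("s1", "ex2", 1),
    ("s2", "ex1", 0),
    ("s2", "ex3", 4),
    ("s2", "ex4", 5),
]

def average_mistakes_per_exercise(records):
    """Return {student: total mistakes / number of distinct attempted exercises}."""
    mistakes = defaultdict(int)
    exercises = defaultdict(set)
    for student, exercise, n_mistakes in records:
        mistakes[student] += n_mistakes
        exercises[student].add(exercise)
    return {s: mistakes[s] / len(exercises[s]) for s in mistakes}

averages = average_mistakes_per_exercise(attempts)
print(averages)
```

The resulting value per student can then be used as one more attribute in clustering or classification.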

Tools: We used a range of tools. Initially we worked with Excel and Access to
perform simple SQL queries and visualization.

Data exploration and visualization: Raw data and algorithm results can be
visualized through tables and graphics such as graphs and histograms, as well as
through more specific techniques such as symbolic data analysis. The aim is to
display data along certain attributes and make extreme points, trends and clusters
obvious to the human eye.

Clustering algorithms aim at finding homogeneous groups in data. The task is to
identify groups of records that are similar to each other but different
from the rest of the data. Often the variables providing the best clustering should
be identified as well. We used k-means clustering and its combination with
hierarchical clustering. Both methods rest on a concept of distance between
individuals. We used Euclidean distance.
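To make the idea concrete, here is a minimal sketch of k-means with Euclidean distance in plain Python. It is not the TADA-Ed implementation; the initialisation (taking the first k points as centroids) and the sample data (number of mistakes, number of correct steps per student) are assumptions chosen only to keep the example short and repeatable.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, k, iterations=100):
    """Return (cluster label per point, list of centroids)."""
    centroids = [tuple(p) for p in points[:k]]  # naive init: first k points (assumption)
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids

# Hypothetical students described by (mistakes made, correct steps entered).
students = [(1, 20), (2, 18), (15, 5), (14, 6), (1, 19), (16, 4)]
labels, centers = k_means(students, k=2)
print(labels, centers)
```

In practice the result depends on the initial centroids, so real tools usually try several random starts.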

Classification is used to predict values for some variable. The task is to find a
function that maps records into one of several discrete classes.
For example, given all the work done by a student, one may want to predict
whether the student will perform well in the final exam. We used the C4.5 decision
tree from TADA-Ed, which relies on the concept of entropy. The tree can be
represented by a set of rules such as: if x = v1 and y > v2 then t = v3. Thus, depending
on the values an individual takes for, say, the variables x and y, one can predict its
value for t. The tree is built from a representative population and is then used to
predict values for new individuals.
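The role of entropy can be sketched as follows. The code computes the entropy of a set of class labels and the information gain of splitting on one attribute; C4.5 itself goes one step further and normalises the gain (gain ratio), which is not shown here. The attribute names and the four sample records are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

# Hypothetical training records and exam outcomes.
rows = [
    {"exercises_done": "many", "used_hints": "yes"},
    {"exercises_done": "many", "used_hints": "no"},
    {"exercises_done": "few", "used_hints": "yes"},
    {"exercises_done": "few", "used_hints": "no"},
]
labels = ["pass", "pass", "fail", "fail"]

for attribute in ("exercises_done", "used_hints"):
    print(attribute, information_gain(rows, labels, attribute))
```

The attribute with the larger gain separates the classes better, so a tree builder would test it first.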

Association rules find relations between items. Rules have the following form:
X -> Y, support 40%, confidence 66%, which could mean 'if students get X
incorrectly, then they also get Y incorrectly', with a support of 40% and a
confidence of 66%. Support is the frequency in the population of individuals that
contain both X and Y. Confidence is the percentage of the instances that contain Y
amongst those which contain X. We implemented a variant of the standard Apriori
algorithm in TADA-Ed that takes temporality into account. Taking temporality into
account produces a rule X -> Y only if exercise X occurred before Y.
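The two measures, and the effect of temporality, can be sketched as below. This is not the TADA-Ed variant of Apriori; it only evaluates one candidate rule. The way temporality is applied (a sequence counts for X -> Y only if the first X comes before the last Y) is one possible reading and is an assumption, as are the sample sequences.

```python
def support_confidence(sequences, x, y, temporal=False):
    """Return (support, confidence) of the rule x -> y over ordered item sequences."""
    def matches(seq):
        if x not in seq or y not in seq:
            return False
        if not temporal:
            return True
        first_x = seq.index(x)
        last_y = len(seq) - 1 - seq[::-1].index(y)
        return first_x < last_y  # assumption: x must occur before some y

    both = sum(1 for s in sequences if matches(s))
    with_x = sum(1 for s in sequences if x in s)
    support = both / len(sequences)
    confidence = both / with_x if with_x else 0.0
    return support, confidence

# Hypothetical ordered sequences, one per student.
sequences = [
    ["X", "Y"],
    ["Y", "X"],
    ["X", "Z"],
    ["Z"],
    ["X", "Y", "Z"],
]

print(support_confidence(sequences, "X", "Y"))
print(support_confidence(sequences, "X", "Y", temporal=True))
```

With temporality switched on, the second sequence no longer counts, so both support and confidence drop.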

Q.2 What preprocessing steps are carried out for effective mining? Comment on the
same.
Ans: The preprocessing steps carried out for effective mining are described
below:-

TADA-Ed provides a pre-processing facility which makes the data minable.
We need to specify two aspects:
(1) What element we want to cluster or classify: students, exercises, mistakes?

(2) Which attributes and distance do we want to retain to compare these elements?

Example:- An example could be to cluster students, using the number of mistakes
they made and the number of correct steps they entered.
The data, which was maintained in different tables, was joined into a single table.
After we integrated the data into one file, we discretized the attributes into
categorical ones to improve interpretation and comprehensibility. For example, we
grouped all grades into five groups: excellent, very good, good, average and poor.
In this step the fields used in the study were determined and transformed if
necessary. Using the normal distribution method, we categorized the value of each
item in the questionnaire as High, Medium or Low.
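The discretization step can be sketched as follows. The grade cut-offs are hypothetical, and reading the "normal distribution method" as cutting at one standard deviation below and above the mean is our assumption for this example; other cut points are equally possible.

```python
import statistics

def grade_group(mark):
    """Map a 0-100 mark to one of five groups (cut-offs are hypothetical)."""
    if mark >= 85:
        return "excellent"
    if mark >= 70:
        return "very good"
    if mark >= 55:
        return "good"
    if mark >= 40:
        return "average"
    return "poor"

def high_medium_low(values):
    """Label each value relative to mean +/- one standard deviation (assumption)."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    labels = []
    for v in values:
        if v > mean + sd:
            labels.append("High")
        elif v < mean - sd:
            labels.append("Low")
        else:
            labels.append("Medium")
    return labels

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical questionnaire item values
levels = high_medium_low(scores)
print(grade_group(90), grade_group(35))
print(levels)
```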

Data Preparation:-

Q.3 In which way are the Association rule mining algorithms and clustering
algorithms useful during the analysis and interpretation?

Ans:- Association Rules:
Spatial and non-spatial association rules are of the form X --> Y (c%), where X
and Y are sets of spatial or non-spatial predicates and c% is the confidence of the
rule. For example:
- Is_a (X, origin of CSU student) --> close to (X, railway stations) (70%): this
rule states that 70% of CSU students originally live close to railway stations.
- Is_a (X, CSU student) --> from (X, middle education level areas) (80%): this
rule states that 80% of CSU students are from areas which have a middle level of
education.
For a large database with a large set of objects and attributes, there may exist a
large number of associations between them (Koperski, 1998). Some rules may only
apply to a small number of objects. For example, less than 5% of CSU's Park
Recreation and Heritage students are associated with a disability code, so this
association may not be of interest for further study, whereas 75% of students live
within 10 km of railway stations, so that association attracted further study. A
minimum support threshold needs to be specified along with a minimum confidence
threshold to filter out the uninteresting associations.
We used association rules to find mistakes often occurring together while solving
exercises.
The purpose of looking for these associations is for the teacher to ponder and,
maybe, to review the course material or emphasize subtleties while explaining
concepts to students.
Thus, it makes sense to have a support that is not too low.

Association rules for Year 2004.

M11 ==> M12 [sup: 77%, conf: 89%]
M12 ==> M11 [sup: 77%, conf: 87%]
M11 ==> M10 [sup: 74%, conf: 86%]
M10 ==> M12 [sup: 78%, conf: 93%]
M12 ==> M10 [sup: 78%, conf: 89%]
M10 ==> M11 [sup: 74%, conf: 88%]
M10: Premise set incorrect
M11: Rule can be applied, but deduction incorrect
M12: Wrong number of line reference given
The first association rule says that if students make the mistake 'Rule can be
applied, but deduction incorrect' while solving an exercise, then they also make the
mistake 'Wrong number of line references given' while solving the same exercise.
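Filtering rules by minimum support and minimum confidence can be sketched as below, using the first three rules listed above as input. The threshold values (75% and 85%) are hypothetical and chosen only to show the effect; the text format parsed is the one used in the listing.

```python
import re

# Matches lines such as: M11 ==> M12 [sup: 77%, conf: 89%]
RULE_PATTERN = re.compile(r"(\w+) ==> (\w+) \[sup: (\d+)%, conf: (\d+)%\]")

def parse_rule(line):
    """Turn one listed rule into a dict with integer percentages."""
    match = RULE_PATTERN.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a rule line: {line!r}")
    lhs, rhs, sup, conf = match.groups()
    return {"lhs": lhs, "rhs": rhs, "support": int(sup), "confidence": int(conf)}

def filter_rules(rules, min_support, min_confidence):
    """Keep only rules that reach both thresholds (in percent)."""
    return [
        r for r in rules
        if r["support"] >= min_support and r["confidence"] >= min_confidence
    ]

lines = [
    "M11 ==> M12 [sup: 77%, conf: 89%]",
    "M12 ==> M11 [sup: 77%, conf: 87%]",
    "M11 ==> M10 [sup: 74%, conf: 86%]",
]
rules = [parse_rule(line) for line in lines]
kept = filter_rules(rules, min_support=75, min_confidence=85)  # hypothetical thresholds
for r in kept:
    print(r["lhs"], "==>", r["rhs"], r["support"], r["confidence"])
```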

Clustering
Clustering groups similar data items together and so provides a high-level view of
the database. It combines techniques from different disciplines such as mathematics,
physics, mathematical programming, statistics, computer science, artificial
intelligence and databases. A variety of clustering algorithms exist, belonging to
several different categories. The technique helps to create new groups and classes
based on the study of patterns and relationships between the values of data in a
data bank. In our case study we consider student data held in three separate
clusters: student academic record, student residence record and student personal
record. Each of these clusters in turn consists of several attributes of its own.
By applying this technique we can easily categorize the record of a student in
detail. For example, the CGPA grade summarizes the marks in the student
academic record cluster. On the basis of these marks we can easily identify the actual
performance or level of the student. This can be shown in the figure below.

Different numbers of clusters were tried, and successful partitioning was achieved
with 5 clusters. In our case study, figure 3 gives the cluster graph as a picture of
the student groups according to their performance. The RapidMiner software was used
for the graphs. Figure 4 is a deviation plot of the five clusters of students. Using
these results we can divide students into five groups and guide them according to
their behavior.

We performed clustering on this subpopulation, using both
(i) k-means in TADA-Ed, and
(ii) a combination of k-means and hierarchical clustering in Clementine. Because
there is neither a fixed number nor a fixed set of exercises on which to compare
students, determining a distance between individuals was not obvious. We calculated
and used a new variable: the total number of mistakes made per student in an exercise.

As a result, students with a similar frequency of mistakes were put in the same
group. Histograms showing the different clusters revealed interesting patterns.
Consider the histogram shown in Figure 1, obtained with TADA-Ed. There are
three clusters: 0 (red, on the left), 1 (green, in the middle) and 4 (purple, on the
right). From other windows (not shown) we know that students in cluster 0 made
many mistakes per exercise not finished, students in cluster 1 made few mistakes
and students in cluster 4 made an intermediate number of mistakes. Students
making many mistakes also use many different logic rules while solving exercises;
this is shown by the vertical, almost solid lines.

Figure 1: Histogram showing, for each cluster of students, the rules incorrectly
used per student

Q.4 How were classification rule mining algorithms helpful to the teachers?
Justify your actions.
Ans: The ways in which classification rule mining algorithms were helpful to the
teachers, and the justification of our actions, are described below:-

Classification –
We built decision trees to try to predict exam marks (for the question related to
formal proofs). A decision tree is a procedure on the basis of which we can decide
whether a specific value is accepted or rejected. In effect it provides a mapping
from the current state to the future state, and in this way it helps decisions to be
taken efficiently. This follows the theory of dynamic optimization. Decision trees
use If-Then statements. The major advantage of this technique is that results can be
displayed graphically, so that the user understands them easily.
To evaluate decision trees we can use various factors such as dataset, data type,
scalability, accuracy and robustness.

Figure: Diagrammatical representation of classification rule mining algorithm
(description of teacher activity; labels: Mistakes, Rules, Learn)

The information extracted greatly assisted us as teachers to better understand the
cohort of learners. Whilst SQL queries and various histograms were used during the
course of the teaching semester to focus the following lecture on problem areas, the
more complex mining was left for reflection between semesters.
Symbolic data analysis revealed that if students attempt at least two exercises, they
are more likely to do more (probably overcoming the initial barrier of use) and
complete their exercises. In subsequent years we required students to do at least 2
exercises as part of their assessment.
Mistakes that were associated together indicated to us that the very concept of
formal proofs was a problem. In 2003, that portion of the course was redesigned to
take this problem into account and the role of each part of the proof was emphasized.
After the end of the semester, mining for mistake associations was conducted again.
Surprisingly, results did not change much (a slight decrease in support and
confidence levels in 2003 followed by a slight increase in 2004).

Q.5 Based on your analysis, list out 5 best DM queries for the given context.

Ans: The 5 best DM queries for the given context are as follows:-
1. Symbolic data analysis revealed that if students attempt at least two exercises,
they are more likely to do more and complete their exercises. In subsequent years we
required students to do at least 2 exercises as part of their assessment.

2. Mistakes that were associated together indicated to us that the very concept of
formal proofs was a problem.
However, marks in the final exam continued increasing. This leads us to think that
making mistakes, especially while using a training tool, is simply part of the
learning process, a view supported by the number of completed exercises.

3. The level of prediction seems to be much better when the prediction is based on
exercises (number, length, variety of rules) rather than on mistakes made. This also
supports the idea that mistakes are part of the learning process, especially in a
practice tool where mistakes are not penalised.

4. Using data exploration and results from the decision tree, one can infer that if
students successfully do 2 to 3 exercises for the topic, then they seem to have
grasped the concept of formal proof and are likely to perform well in the exam
question related to that topic.
This finding is coherent with correlations calculated between marks in the final
exam and activity with the Logic Tutor, and with the general, human perception of
tutors in this course. Therefore, a sensible warning system could look as follows.
Report to the lecturer in charge students who have successfully completed fewer than
3 exercises. For those students, display the histogram of rules used. Be proactive
towards these students, distinguishing those who use the pop-up menu for logic rules
from the others.
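Such a warning system could be sketched as follows. The record layout, the field names `completed` and `uses_popup_menu`, and the sample students are hypothetical; only the threshold of 3 successfully completed exercises comes from the finding above.

```python
def warning_report(students, min_completed=3):
    """Return (flagged students who use the pop-up menu, flagged students who do not)."""
    flagged = {
        name: info for name, info in students.items()
        if info["completed"] < min_completed
    }
    uses_menu = sorted(n for n, i in flagged.items() if i["uses_popup_menu"])
    no_menu = sorted(n for n, i in flagged.items() if not i["uses_popup_menu"])
    return uses_menu, no_menu

# Hypothetical activity summary per student.
students = {
    "s1": {"completed": 5, "uses_popup_menu": False},
    "s2": {"completed": 2, "uses_popup_menu": True},
    "s3": {"completed": 0, "uses_popup_menu": False},
    "s4": {"completed": 3, "uses_popup_menu": True},
}

with_menu, without_menu = warning_report(students)
print("Flagged, uses pop-up menu:", with_menu)
print("Flagged, does not use pop-up menu:", without_menu)
```

The two lists let the lecturer approach the two groups differently, as suggested above.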

5. Feedback functionality is available for association rules: the teacher can
design his/her own proactive feedback for a particular sequence of mistakes. The
content of the page is up to the teacher. For instance, for the pattern of mistakes A,
B -> C, the teacher may want to provide explanations about mistakes A and B
(which the current student has made) and review the underlying concepts of mistake C.
Each such rule has a support and a confidence.
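A minimal sketch of how such feedback could be looked up is given below. The rule structure, the page text and the matching logic (show the page once the student has made every mistake on the left-hand side) are assumptions for illustration and do not describe the actual tool.

```python
# Hypothetical teacher-defined feedback attached to the rule A, B -> C.
feedback_rules = [
    {
        "antecedent": frozenset({"A", "B"}),
        "consequent": "C",
        "page": "Explanation of mistakes A and B; review of the concepts behind C.",
    },
]

def feedback_for(mistakes_made, rules):
    """Return the feedback pages whose antecedent mistakes the student has all made."""
    made = set(mistakes_made)
    return [rule["page"] for rule in rules if rule["antecedent"] <= made]

pages = feedback_for(["B", "A"], feedback_rules)
print(pages)
print(feedback_for(["A"], feedback_rules))
```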

Q.6 From the given context, suggest a few more areas where we can apply data
mining in a University.
Ans: Data mining can be applied in a University as described below:-

Student Academic Record

Attributes           Description                    Selected Attributes
Std_Reg_No           Student Registration Number
Enroll_Yr            Year of Enrollment
Completion_Yr        Year of Completion
CGPA_PG              Post-Graduation CGPA           yes
CGPA_G               Graduation CGPA
CGPA_HS              High School CGPA
CGPA_T               Tenth CGPA
E_ID                 E-Mail ID
MOB_NO               Mobile Number
Date_of_Birth        Date of Birth
Performanc_Grade     Overall Performance            yes

Student Residence Record

Attributes           Description
Permanent Add        Permanent Address
Correspondance_Add   Address For Correspondence
City_C               City
Location_L           Location
State_S              State

Student Personal Record

Attributes           Description
Gender_G             Male-M, Female-F
Place_of_Birth_POB   Student Birth Place
Nationality_N        Nationality
