- Non-trivial process of identifying implicit, valid, novel (measured by comparing to expected values), potentially useful and understandable patterns in data
- Step in the KDD process
Cross-Industry Standard Process for Data Mining (CRISP-DM)
- cheaper, faster and more reliable DM
- widespread adoption
- reduce skills required for data mining
- capture experience for reuse
- Non-proprietary
- application/industry neutral
- tool neutral
- focus on business issues
- framework for guidance
- experience based
- Business Understanding
- Determine Business Objectives, Assess Situation, Determine Data Mining Goals, Produce Project Plan
- Data Understanding
- Collect Initial Data, Describe Data, Explore Data, Verify Data Quality
- Data Preparation
- Select Data, Clean Data, Construct Data, Integrate Data, Format Data
- Modeling
- Select Modeling Technique, Generate Test Design, Build Model, Assess Model
- Evaluation
- Evaluate Results, Review Process, Determine Next Steps
- Deployment
- Plan Deployment, Plan Monitoring & Maintenance, Produce Final Report, Review Project
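
As a rough illustration, the six phases and their tasks listed above can be held in a small lookup structure, for example to generate a project checklist. This is only a sketch: the names `CRISP_DM`, `next_phase` and `checklist` are made up for this example, and the strictly linear `next_phase` is a simplification, since in practice CRISP-DM projects move back and forth between phases.

```python
# CRISP-DM phases mapped to their generic tasks, in the order listed above.
# Python dicts preserve insertion order, so the phase order is kept.
CRISP_DM = {
    "Business Understanding": [
        "Determine Business Objectives", "Assess Situation",
        "Determine Data Mining Goals", "Produce Project Plan",
    ],
    "Data Understanding": [
        "Collect Initial Data", "Describe Data",
        "Explore Data", "Verify Data Quality",
    ],
    "Data Preparation": [
        "Select Data", "Clean Data", "Construct Data",
        "Integrate Data", "Format Data",
    ],
    "Modeling": [
        "Select Modeling Technique", "Generate Test Design",
        "Build Model", "Assess Model",
    ],
    "Evaluation": [
        "Evaluate Results", "Review Process", "Determine Next Steps",
    ],
    "Deployment": [
        "Plan Deployment", "Plan Monitoring & Maintenance",
        "Produce Final Report", "Review Project",
    ],
}


def next_phase(current):
    """Return the phase listed after `current`, or None after the last one.

    Simplification: real projects often loop back to earlier phases.
    """
    phases = list(CRISP_DM)
    i = phases.index(current)
    return phases[i + 1] if i + 1 < len(phases) else None


def checklist(phase):
    """Return one unchecked checklist line per task of the given phase."""
    return [f"[ ] {phase}: {task}" for task in CRISP_DM[phase]]


if __name__ == "__main__":
    for line in checklist("Data Preparation"):
        print(line)
```
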
Scrum work management + CRISP-DM Data Mining
- 3 phases: Business Understanding, Sprint, Deployment
- 6 concepts: Product Owner (PO), Scrum Master (SM), Development Team, DM Story, Product Backlog (PBL), Sprint Backlog (SBL)
- A DM project should always start with an analysis of the data with traditional query tools
- About 80% of the interesting information can be extracted with SQL
- The remaining 20% (hidden information) requires more advanced techniques
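
A first pass with a traditional query tool might look like the sketch below, which uses Python's built-in `sqlite3` module. The `orders` table, its columns and its rows are invented purely for illustration. The query counts rows, counts non-missing values and averages an amount per region, which already says something about data quality and basic structure before any mining technique is applied.

```python
import sqlite3

# Hypothetical example data: an in-memory table of orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("alice", "north", 120.0),
        ("bob", "north", 80.0),
        ("carol", "south", 200.0),
        ("dave", "south", None),  # missing amount: a data quality issue
    ],
)

# Plain SQL summary: rows per region, rows with a known amount, mean amount.
# COUNT(amount) and AVG(amount) both ignore NULL values.
summary = conn.execute(
    """
    SELECT region,
           COUNT(*)      AS n_orders,
           COUNT(amount) AS n_with_amount,
           AVG(amount)   AS avg_amount
    FROM orders
    GROUP BY region
    ORDER BY region
    """
).fetchall()

for region, n_orders, n_with_amount, avg_amount in summary:
    print(region, n_orders, n_with_amount, avg_amount)
```

The gap between `n_orders` and `n_with_amount` exposes missing values directly, the kind of finding that belongs in the Verify Data Quality task before more advanced techniques are considered.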