Data Mining

Dennis Faas's picture

Data mining is the process of extracting patterns from data. It is becoming an increasingly important tool for transforming otherwise abstract data into usable information. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection, and scientific discovery.

Data mining can be used to uncover patterns in data but is often carried out only on samples of data. The mining process will be ineffective if the samples are not a good representation of the larger body of data.

Data mining cannot discover patterns that may be present in the larger body of data if those patterns are not present in the sample being "mined". An inability to find patterns may lead to disputes between customers and service providers. Data mining is therefore not foolproof, but it can be useful if sufficiently representative data samples are collected.
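The effect of an unrepresentative sample can be illustrated with a small sketch. The order records, the "refund" field, and the sample sizes below are all hypothetical and chosen only to make the point visible; this is an illustration, not a recommended sampling procedure.

```python
import random

def pattern_rate(records):
    """Fraction of records in which the pattern of interest appears."""
    return sum(1 for r in records if r["refund"]) / len(records)

# Hypothetical population: 1,000 orders stored in date order.
# In this made-up data, refunds occur only in the last 200 orders.
population = [{"order_id": i, "refund": i >= 800} for i in range(1000)]

# Unrepresentative sample: simply take the first 100 orders.
biased_sample = population[:100]

# More representative sample: draw 100 orders at random.
# The seed is fixed only so the sketch is repeatable.
rng = random.Random(0)
random_sample = rng.sample(population, 100)

true_rate = pattern_rate(population)
biased_rate = pattern_rate(biased_sample)
random_rate = pattern_rate(random_sample)

print("whole data set:", true_rate)
print("first 100 orders:", biased_rate)
print("random 100 orders:", random_rate)
```

The first 100 orders contain no refunds at all, so no mining technique applied to that sample could find the refund pattern, even though it affects a fifth of the full data set.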

Data Mining Tasks And Classes

Data mining commonly involves four classes of tasks:

Clustering

Clustering is the task of discovering groups and structures in the data that are in some way similar, without using known structures in the data.
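As a rough illustration, the sketch below groups a handful of numbers with a simple k-means procedure. K-means is only one of many clustering methods and is used here as an assumption; the function name, the values, and the starting centers are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Assign each value to its nearest center, then move each center
    to the mean of the values assigned to it. Repeat a fixed number of times."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            # Index of the center closest to this value.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # An empty group keeps its previous center.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical measurements with two visible groups and no labels attached.
values = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, groups = kmeans_1d(values, centers=[0.0, 5.0])

print(centers)
print(groups)
```

No labels are supplied: the two groups emerge only from how close the values are to one another.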

Classification

Classification is the task of generalizing known structure to apply to new data. For example, an email program might attempt to classify an email as legitimate or spam. Common algorithms include decision tree learning, nearest neighbor, naive Bayesian classification, neural networks and support vector machines.
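The spam example can be sketched with nearest neighbor, one of the algorithms listed above. The two features (number of links and number of exclamation marks in a message) and the training examples are hypothetical; a real filter would use far more data and richer features.

```python
def nearest_neighbor(training, features):
    """Return the label of the training example closest to `features`,
    measured by squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(training, key=lambda example: sq_dist(example[0], features))
    return label

# Hypothetical labelled emails: (links, exclamation marks) -> known label.
training = [
    ((0, 0), "legitimate"),
    ((1, 0), "legitimate"),
    ((6, 4), "spam"),
    ((8, 7), "spam"),
]

# New, unlabelled emails are given the label of their closest known example.
label_a = nearest_neighbor(training, (7, 5))
label_b = nearest_neighbor(training, (0, 1))

print(label_a)
print(label_b)
```

Unlike clustering, the known labels in the training data are what the new emails are matched against.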

Regression

Regression attempts to find a function which models the data with the least error.
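A minimal sketch of this idea is a straight-line fit. It assumes that "error" means the sum of squared differences between the line and the data (ordinary least squares), which is a common choice but not the only one; the function name and data points are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (not divided by n; the ratio is the same).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With noisy data the fitted line would not pass through every point; it would be the line whose squared error over all points is smallest.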

Association Rule Learning

Association rule learning searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
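The supermarket example can be sketched by counting how often pairs of products appear in the same basket. The sketch uses two standard measures: support (the share of all baskets containing both items) and confidence (the share of baskets containing one item that also contain the other). The baskets and function names are hypothetical, and practical algorithms handle much larger item sets than pairs.

```python
from collections import Counter
from itertools import combinations

def pair_support(baskets):
    """Share of baskets in which each pair of products appears together."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair is always stored in the same order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: count / len(baskets) for pair, count in counts.items()}

def confidence(baskets, antecedent, consequent):
    """Of the baskets containing `antecedent`, the share that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(1 for b in with_antecedent if consequent in b) / len(with_antecedent)

# Hypothetical purchase records, one list per customer basket.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["milk", "eggs"],
]

support = pair_support(baskets)
butter_then_bread = confidence(baskets, "butter", "bread")
bread_then_butter = confidence(baskets, "bread", "butter")

print(support[("bread", "butter")])
print(butter_then_bread, bread_then_butter)
```

In this made-up data every basket with butter also contains bread, but only two of the three baskets with bread contain butter, so the rule is stronger in one direction than the other.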

Privacy Concerns and Ethics

Some people believe that data mining itself is ethically neutral. However, the ways in which data mining can be used can raise questions regarding privacy, legality, and ethics. In particular, data mining government or commercial data sets for national security or law enforcement purposes, such as in the Total Information Awareness Program or in ADVISE, has raised privacy concerns.

Data mining requires data preparation, which can uncover information or patterns that may compromise confidentiality and privacy obligations. A common way for this to occur is through data aggregation. Data aggregation is when the data are accrued, possibly from various sources, and put together so that they can be analyzed.

This is not data mining per se, but a result of preparing the data before, and for the purposes of, the analysis. The threat to an individual's privacy arises when the data, once compiled, allow the data miner, or anyone who has access to the newly compiled data set, to identify specific individuals, especially when the data were originally anonymous.
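A small sketch shows how aggregation alone can undo anonymity. It assumes two hypothetical data sets: one with names removed, and one public list that shares the fields "zip" and "birth_year" with it. All names, fields, and values are invented for illustration.

```python
def link_records(anonymous_rows, public_rows, keys):
    """Match nameless rows to named rows that share the same values for `keys`.
    Only unambiguous matches (exactly one named candidate) are returned."""
    index = {}
    for row in public_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row["name"])

    linked = []
    for row in anonymous_rows:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:
            linked.append((names[0], row["diagnosis"]))
    return linked

# Hypothetical "anonymous" records: no names, but some descriptive fields remain.
anonymous_rows = [
    {"zip": "10001", "birth_year": 1980, "diagnosis": "asthma"},
    {"zip": "10002", "birth_year": 1975, "diagnosis": "diabetes"},
]

# Hypothetical public list that includes names and the same descriptive fields.
public_rows = [
    {"name": "A. Example", "zip": "10001", "birth_year": 1980},
    {"name": "B. Sample", "zip": "10002", "birth_year": 1975},
    {"name": "C. Placeholder", "zip": "10002", "birth_year": 1975},
]

linked = link_records(anonymous_rows, public_rows, keys=("zip", "birth_year"))
print(linked)
```

The first record is tied to a single named person as soon as the two sets are combined, while the second stays ambiguous because two people share the same field values. Neither data set identifies anyone's diagnosis on its own.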

It is recommended that an individual be made aware of the following before data are collected:

  • the purpose of the data collection and any data mining projects,
  • how the data will be used,
  • who will be able to mine the data and use them,
  • the security surrounding access to the data, and
  • how collected data can be updated.

This document is licensed under the GNU Free Documentation License (GFDL), which means that you can copy and modify it as long as the entire work (including additions) remains under this license.
