This repository collects methods and strategies for effectively tackling challenges encountered in big data processing and AI model analysis.
Case description:
- Low correlation coefficients between the target and the critical features identified by domain knowledge
- For features highly correlated with the target, the scatter plots reveal multiple clusters, yet the precise relationship between the features and the target remains unclear.
- Due to constraints on data availability, certain portions of the collected data fail to capture the genuine circumstances surrounding the features.
- Even though the collected data is insufficient to capture the genuine circumstances surrounding the features, we encountered overfitting problems. Three possible reasons: (1) certain features are irrelevant to the target but coincidentally follow a pattern that improves the training performance of the model; (2) (3)
Analysis:
- The target-feature relationships observed in the data analysis outcomes are inconsistent with domain knowledge
- Data cleaning, or identification of the genuinely critical features
- Detection of noise in critical features
- Decomposing complex feature information into multiple features, each containing specific details.
Method:
- Examine target variation using EDA methods and identify the sources of variation based on domain knowledge. If these sources interfere with prediction, reorganize the data so that it is suitable for accurate prediction.
- Identify a feature utility metric that demonstrates a target-feature relationship consistent with both the data analysis outcomes and domain knowledge. Furthermore, ensure that the relationship characterized by the metric remains consistent across various resampled data subsets (see the consistency-check sketch after this list).
- Mutual information serves as a valuable metric for quantifying the dependency between variables, making it particularly useful for selecting relevant features and reducing dimensionality in datasets.
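A minimal sketch of the resampling-consistency check described above, assuming a pandas DataFrame `df` with a numeric target column and missing values already handled; the helper name `mi_stability` is hypothetical. MI scores are recomputed on bootstrap resamples, and a feature whose mean score is large relative to its standard deviation has a stable relationship with the target.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def mi_stability(df: pd.DataFrame, target: str,
                 n_resamples: int = 20, seed: int = 0) -> pd.DataFrame:
    """Mean and standard deviation of MI scores across bootstrap resamples."""
    rng = np.random.default_rng(seed)
    X = df.drop(columns=[target])
    discrete = (X.dtypes == object).to_numpy()  # treat object columns as discrete
    # Encode object columns as integer codes so sklearn accepts them.
    X = X.apply(lambda s: s.astype("category").cat.codes if s.dtype == object else s)
    scores = []
    for _ in range(n_resamples):
        idx = rng.integers(0, len(df), size=len(df))  # bootstrap indices
        scores.append(mutual_info_regression(
            X.iloc[idx], df[target].iloc[idx],
            discrete_features=discrete, random_state=seed))
    scores = np.asarray(scores)
    return pd.DataFrame({"mi_mean": scores.mean(axis=0),
                         "mi_std": scores.std(axis=0)}, index=X.columns)
```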
Implementation:
- Data: Automobile Dataset from Kaggle
- Examine the target variation of price using a violin plot. The violin plot indicates that the distribution of price is unimodal for most car brands (a plotting sketch is given after the notes below).
- Explore the dependency between the target and each feature using mutual information. The bar plot indicates that the dependency between curb-weight and price is the strongest of all (a computation sketch is given after the notes below).
- Note01: sklearn.feature_selection provides two functions, mutual_info_regression and mutual_info_classif, for numerical and categorical target variables respectively. Ensure discrete features in the dataset are encoded as integers before calculating mutual information, and flag them via the discrete_features argument.
- Note02: It is good practice to investigate possible interaction effects before deciding which features are most relevant to the target based on mutual information alone; see the interaction sketch below for an example.
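A minimal plotting sketch for the brand-wise price distributions above. The file name `Automobile_data.csv` and the `make` column name are assumptions about the Kaggle CSV, not guaranteed by these notes.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("Automobile_data.csv", na_values="?")  # '?' marks missing values
df = df.dropna(subset=["price"])
df["price"] = df["price"].astype(float)

plt.figure(figsize=(12, 5))
sns.violinplot(data=df, x="make", y="price")
plt.xticks(rotation=90)  # brand names overlap otherwise
plt.title("Price distribution by car brand")
plt.tight_layout()
plt.show()
```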
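A minimal sketch of the mutual-information ranking, following Note01: the target (price) is numerical, so mutual_info_regression is used, and object-typed columns are encoded as integer codes first. File and column names are assumptions as above.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_selection import mutual_info_regression

df = pd.read_csv("Automobile_data.csv", na_values="?").dropna()
y = df.pop("price").astype(float)
X = df
discrete = (X.dtypes == object).to_numpy()  # object columns treated as discrete
X = X.apply(lambda s: s.astype("category").cat.codes if s.dtype == object else s)

mi = pd.Series(
    mutual_info_regression(X, y, discrete_features=discrete, random_state=0),
    index=X.columns).sort_values()
mi.plot.barh(figsize=(8, 8), title="Mutual information with price")
plt.tight_layout()
plt.show()
```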
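A minimal sketch of the interaction check from Note02: fitting one regression line per subgroup can reveal a relationship that per-feature MI scores hide. The choice of `horsepower` and `fuel-type` here is illustrative, not prescribed by the notes.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("Automobile_data.csv", na_values="?")
df = df.dropna(subset=["price", "horsepower", "fuel-type"])
df[["price", "horsepower"]] = df[["price", "horsepower"]].astype(float)

# One regression line per fuel type: clearly different slopes would indicate
# an interaction between horsepower and fuel-type that single-feature MI
# scores alone would not reveal.
sns.lmplot(data=df, x="horsepower", y="price", hue="fuel-type",
           height=5, aspect=1.4)
plt.show()
```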
Background (mutual information):
- Mutual information (MI) is a feature utility metric that quantifies the strength of the association (linear or non-linear) between two variables, say the target and a feature. The value of MI lies in $[0, \infty)$.
- In information theory terms, MI measures the amount of information that two variables provide about each other (symmetric).
- Mutual information is most clearly explained in terms of entropy: $I(X;Y) = H(X) - H(X \mid Y) = H(X) + H(Y) - H(X,Y)$, i.e., the reduction in uncertainty about one variable obtained by observing the other.
- Entropy, denoted by $H(X)$, is a measure of the average level of uncertainty of a random variable. Mathematically, $H(X)$ is defined as $H(X) = -\sum_{x} p(x)\log p(x)$. Clearly, $H(X)$ is the expected value of the information content of the random variable $X$, $-\log p(x)$.
- The information content of an event $x$ of the random variable $X$ is defined as $I(x) = -\log p(x)$. According to Shannon's axioms, the lower the probability of an event, the more "surprising" the event is; $-\log p(x)$ is the simplest suitable function that meets this requirement, being monotonically decreasing in $p(x)$.
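A quick worked illustration of these definitions (a coin-flip example added here, not part of the original notes), using log base 2 so the unit is bits:

```latex
% Entropy of a fair coin:
H(X) = -\sum_{x} p(x)\log_2 p(x)
     = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2}
     = 1 \ \text{bit}

% If Y is an exact copy of X, observing Y removes all uncertainty about X:
I(X;Y) = H(X) - H(X \mid Y) = 1 - 0 = 1 \ \text{bit}

% If Y is independent of X, observing Y removes none of it:
I(X;Y) = H(X) - H(X \mid Y) = 1 - 1 = 0 \ \text{bits}
```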