
BigData_AI_Analysis_Methods

This repository contains methods and strategies for effectively tackling challenges encountered during big data processing and AI model analysis.

Tackling Overfitting Problems

Case 1: Feature Selection Using Mutual Information

Case description:

  • Low correlation coefficients between the target and the critical features suggested by domain knowledge.
  • For features that correlate strongly with the target, the scatter plots reveal multiple clusters, yet the precise relationship between these features and the target still needs clarification.
  • Due to constraints in data availability, parts of the collected data do not fully reflect the true behavior of the features.
  • Even though the collected data is insufficient to reflect the true behavior of the features, we encountered overfitting problems. The following shows three possible reasons: (1) certain features are irrelevant to the target but coincidentally follow a pattern that improves the performance of the training model (2) (3)

Analysis:

  1. Inconsistency between the target-feature relationships observed in the data analysis and those expected from domain knowledge
  2. Data cleaning and identification of the genuinely critical features
  3. Detection of noise in critical features
  4. Decomposition of complex feature information into multiple features, each containing specific details

Method:

  1. Examine the variation of the target using EDA methods and identify the sources of variation based on domain knowledge. If a source of variation interferes with the prediction, reorganize the data so that it is suitable for accurate prediction.
  2. Identify a feature utility metric for which the target-feature relationships observed in the data analysis are consistent with domain knowledge. Furthermore, ensure that the relationship characterized by the metric remains consistent across various resampled data subsets (a sketch of this check follows the list).
  3. Mutual information serves as a valuable metric for quantifying the dependency between variables, making it particularly useful for selecting relevant features and reducing dimensionality in datasets.
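
The following is a minimal sketch of the resampling check in step 2, assuming a pandas DataFrame with a numeric target column and no missing values; the helper names and the number of bootstrap rounds are illustrative.

```python
# Sketch: check that mutual-information scores stay consistent across resampled subsets.
# Assumes a pandas DataFrame with a numeric target column and no missing values.
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def mi_scores(df: pd.DataFrame, target: str) -> pd.Series:
    """Mutual information between each feature and the target."""
    X = df.drop(columns=[target]).copy()
    discrete = (X.dtypes == object).values            # treat object columns as discrete
    for col in X.columns[discrete]:
        X[col] = X[col].astype("category").cat.codes  # encode categories as integers
    scores = mutual_info_regression(X, df[target], discrete_features=discrete, random_state=0)
    return pd.Series(scores, index=X.columns)

def mi_stability(df: pd.DataFrame, target: str, n_rounds: int = 20) -> pd.DataFrame:
    """Bootstrap the rows and recompute MI each round; a feature whose score
    fluctuates heavily is a weak candidate even if its full-sample MI is high."""
    rounds = [mi_scores(df.sample(frac=1.0, replace=True, random_state=i), target)
              for i in range(n_rounds)]
    scores = pd.concat(rounds, axis=1)
    return (pd.DataFrame({"mean_mi": scores.mean(axis=1), "std_mi": scores.std(axis=1)})
            .sort_values("mean_mi", ascending=False))
```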

Implementation:

  • Data: Automobile Dataset from Kaggle
  • Examine the variation of the target price using a violin plot. The violin plot indicates that the distribution of price is unimodal for most car brands.
  • Explore the dependency between the target and each feature using mutual information. The bar plot indicates that the dependency between curb-weight and price is the strongest of all features (a code sketch of these two steps follows this list).
  • Note01: sklearn.feature_selection provides two functions, mutual_info_regression and mutual_info_classif, for continuous and categorical target variables respectively. Ensure discrete features in the dataset are transformed into integer type before calculating mutual information.
  • Note02: It is good practice to investigate possible interaction effects before determining the features most relevant to the target based on mutual information.
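
A minimal sketch of the two steps above, assuming the dataset has been downloaded as automobile.csv with columns including make, curb-weight, and price (the file name is an assumption; adjust it to the actual download):

```python
# Sketch of the implementation above: violin plot of the target, then MI scores per feature.
# The file name "automobile.csv" and column names are assumptions about the downloaded data.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_selection import mutual_info_regression

df = pd.read_csv("automobile.csv").dropna()  # keep the sketch simple: drop rows with missing values

# Step 1: examine target variation -- distribution of price for each car brand.
sns.violinplot(data=df, x="make", y="price")
plt.xticks(rotation=90)
plt.show()

# Step 2: dependency between price and each feature via mutual information.
X = df.drop(columns=["price"])
discrete = (X.dtypes == object).values            # Note01: flag categorical columns as discrete
for col in X.columns[discrete]:
    X[col] = X[col].astype("category").cat.codes  # ...and encode them as integers
mi = pd.Series(
    mutual_info_regression(X, df["price"], discrete_features=discrete, random_state=0),
    index=X.columns,
).sort_values()

mi.plot.barh()                                    # curb-weight is expected to rank highest
plt.xlabel("Mutual information with price")
plt.show()
```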

What is Mutual Information?

  • A feature utility metric that quantifies the strength of the (linear or non-linear) association between two variables, say the target and a feature. The value of MI lies between 0 and infinity.
  • In information theory terms, MI measures the amount of information that two variables provide about each other (symmetric).
  • Mutual information is most clearly explained in terms of entropy.
    • Entropy, denoted by H(X), is a measure of the average level of uncertainty of a random variable. Mathematically, $H(X) = -\sum_{x \in X} p(x)\log p(x)$; that is, H(X) is the expected value of the information content $-\log(p(x))$ of the random variable X.
    • The information content of an outcome x with probability p(x) is defined as $-\log(p(x))$. According to Shannon's axioms, the higher the probability of an event, the less "surprising" it is; $-\log(p(x))$ is the simplest function satisfying these axioms.
    • $-\log(p(x))$ is monotonically decreasing for $p(x) \in [0, 1]$ (figure: information content illustration).
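
Written out with the standard discrete-case definitions (the joint entropy H(X, Y) is introduced here only to state the identity):

$$H(X) = -\sum_{x} p(x)\log p(x), \qquad H(X, Y) = -\sum_{x,y} p(x, y)\log p(x, y)$$

$$I(X; Y) = H(X) + H(Y) - H(X, Y) = \sum_{x,y} p(x, y)\log\frac{p(x, y)}{p(x)\,p(y)}$$

The second form makes the symmetry I(X; Y) = I(Y; X) explicit, and I(X; Y) = 0 exactly when X and Y are independent, i.e. p(x, y) = p(x) p(y).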

Properties of Mutual Information

  • Mutual information helps identify relations between a feature and the target that are not limited to linear or monotone relations. The MI of a periodic relationship such as y = sin(x) is close to that of linear and monotone relationships (figure: MI of several example functions).

  • Adding Gaussian noise to these functions weakens the dependency and lowers the estimated MI (figure: MI of the same functions with added noise). A sketch reproducing this comparison follows.
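
A minimal sketch of this comparison, estimating MI for synthetic linear, monotone, and periodic relationships with and without Gaussian noise (the sample size and noise level are illustrative):

```python
# Sketch: MI captures non-linear and periodic dependence; adding noise lowers it.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=2000)
relations = {
    "linear   y = 2x":     2 * x,
    "monotone y = x^3":    x ** 3,
    "periodic y = sin(x)": np.sin(x),
}

for name, y in relations.items():
    noisy = y + rng.normal(scale=0.5, size=y.shape)  # add Gaussian noise
    mi_clean = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]
    mi_noisy = mutual_info_regression(x.reshape(-1, 1), noisy, random_state=0)[0]
    print(f"{name:22s}  MI(clean) = {mi_clean:.2f}  MI(noisy) = {mi_noisy:.2f}")
```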

Reference

  1. https://www.kaggle.com/code/ryanholbrook/mutual-information

Case 2: Preventing Overfitting Using Permutation Importance
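
A minimal sketch of the idea using scikit-learn's permutation_importance; the model and dataset here are illustrative placeholders, not this case's actual data:

```python
# Sketch: permutation importance on a held-out split. Features whose shuffling barely
# hurts the validation score carry little genuine signal and are candidates for removal,
# which in turn helps guard against overfitting.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Shuffle each feature column in turn and measure the drop in validation score.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
ranked = sorted(zip(X.columns, result.importances_mean, result.importances_std),
                key=lambda t: -t[1])
for name, mean, std in ranked:
    print(f"{name:10s} {mean:+.3f} +/- {std:.3f}")
```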

Reference

  1. https://www.kaggle.com/code/dansbecker/permutation-importance
  2. https://christophm.github.io/interpretable-ml-book/

Tackling Imbalanced Data

Case 1: Oversampling

1.1 SMOTE (Synthetic Minority Oversampling Technique)

1.2 ADASYN (Adaptive Synthetic Sampling Approach for Imbalanced Learning)
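
A minimal sketch of both oversamplers using the imbalanced-learn package on synthetic data (the class ratio and parameters are illustrative):

```python
# Sketch: oversample the minority class with SMOTE or ADASYN (imbalanced-learn package).
from collections import Counter

from imblearn.over_sampling import ADASYN, SMOTE
from sklearn.datasets import make_classification

# Synthetic two-class data with roughly a 95:5 class ratio.
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0)
print("original:", Counter(y))

# 1.1 SMOTE: synthesize new minority samples by interpolating between a minority
#     sample and its nearest minority-class neighbours.
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)
print("SMOTE:   ", Counter(y_sm))

# 1.2 ADASYN: like SMOTE, but generates more synthetic samples in regions where the
#     minority class is harder to learn (near the decision boundary).
X_ada, y_ada = ADASYN(random_state=0).fit_resample(X, y)
print("ADASYN:  ", Counter(y_ada))

# Note: resample only the training split; keep validation/test data untouched.
```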
