
Add handle_missing_values call and switch to one-hot encoding for categorical features #6

Draft
Copilot wants to merge 5 commits into copilot/predict-income-on-census-data from copilot/handle-missing-values-processing

Conversation

Copilot AI commented Dec 17, 2025

Contributor

Adds explicit missing value handling in the training notebook and improves categorical feature encoding.

Changes Made

  • New notebook cell: After the missing values check cell, added a call to DataPreprocessor.handle_missing_values() for both train and test dataframes
  • Added .gitignore: Standard Python ignores for cache files, virtual envs, IDE files
  • Fixed handle_missing_values method: Replaced df[col].fillna(..., inplace=True) with df[col] = df[col].fillna(...) to properly handle pandas copy behavior and avoid FutureWarning
  • Changed categorical encoding to one-hot encoding: Replaced LabelEncoder with OneHotEncoder for categorical features. This avoids imposing ordinal relationships on categorical data and may improve model performance. Feature dimension increases from 14 to 104 due to the one-hot encoding.

Usage

# After checking for missing values
preprocessor.identify_feature_types(train_df)
train_df = preprocessor.handle_missing_values(train_df)
test_df = preprocessor.handle_missing_values(test_df)

The handle_missing_values method replaces '?' with NaN, then fills categorical columns with mode and numerical columns with median.
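
The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual method: the function name handle_missing_values_sketch, its explicit column-list parameters, and the toy dataframe are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def handle_missing_values_sketch(df, categorical_cols, numerical_cols):
    """Hypothetical stand-in for DataPreprocessor.handle_missing_values."""
    df = df.copy()
    # Treat the '?' placeholder as a real missing value.
    df = df.replace("?", np.nan)
    for col in categorical_cols:
        # mode() can return several values; take the first one.
        df[col] = df[col].fillna(df[col].mode().iloc[0])
    for col in numerical_cols:
        df[col] = df[col].fillna(df[col].median())
    return df


# Toy data (assumed for illustration), with one '?' and one NaN.
toy = pd.DataFrame({
    "workclass": ["Private", "?", "Private", "Local-gov"],
    "age": [39.0, np.nan, 50.0, 28.0],
})
cleaned = handle_missing_values_sketch(toy, ["workclass"], ["age"])
print(cleaned)
```

Each column is assigned back explicitly rather than filled in place, so the result does not depend on whether a column selection returns a view or a copy.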

Original prompt

In my notebook, after this cell:

# Check for missing values
missing_train = train_df.isna().sum()
missing_test = test_df.isna().sum()

# Check for ' ?' values
question_train = (train_df == '?').sum()
question_test = (test_df == '?').sum()

print("Missing values (NaN) in training data:")
print(missing_train[missing_train > 0])
print("\n'?' values in training data:")
print(question_train[question_train > 0])

add a call to handle_missing_values from data_processing to clean up this dirty data.



Copilot AI and others added 2 commits December 17, 2025 11:26
Co-authored-by: wuzhi456 <217564920+wuzhi456@users.noreply.github.com>
Co-authored-by: wuzhi456 <217564920+wuzhi456@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add data processing for handling missing values" to "Add handle_missing_values call after missing value check in notebook" Dec 17, 2025
Copilot AI requested a review from wuzhi456 December 17, 2025 11:29
@wuzhi456

Owner

@copilot Missing values (NaN) in training data:
Series([], dtype: int64)

'?' values in training data:
workclass 1278
occupation 1283
native.country 423
dtype: int64
Missing values handled successfully!

Remaining missing values in training data: 0
Remaining '?' values in training data: 2984

After running it, many '?' values still remain. Is something wrong with handle_missing_values?

…inplace

Co-authored-by: wuzhi456 <217564920+wuzhi456@users.noreply.github.com>

Copilot AI commented Dec 17, 2025

Contributor Author

@copilot Missing values (NaN) in training data:
Series([], dtype: int64)
...

Found and fixed the issue in handle_missing_values. The problem was df[col].fillna(..., inplace=True): this is chained assignment, so the in-place fill is applied to the intermediate object returned by df[col], which newer pandas versions treat as a copy, and the result may never reach df.

Changed to df[col] = df[col].fillna(...) which properly assigns the result back to the dataframe. This also removes the FutureWarning.

Fixed in commit 143022f.
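
As a small illustration of the assign-back pattern mentioned above, the sketch below fills one column on a toy dataframe. The data and the variable names (fixed, alt, fill_value) are assumptions for the example; the dict-based DataFrame.fillna call is shown only as one alternative that also avoids chained in-place calls.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (assumed data for illustration).
df = pd.DataFrame({"occupation": ["Sales", np.nan, "Sales", "Tech-support"]})
fill_value = df["occupation"].mode().iloc[0]

# Pattern used in the fix: compute the filled column, then assign it back.
fixed = df.copy()
fixed["occupation"] = fixed["occupation"].fillna(fill_value)

# Alternative: a per-column fill on the DataFrame itself via a dict.
alt = df.fillna({"occupation": fill_value})

print(fixed["occupation"].tolist())
```

Both forms return a new object that is bound to a name explicitly, so neither relies on mutating an intermediate selection.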

@wuzhi456

Owner

@copilot

def encode_categorical_features(self, df, is_training=True):
    """
    Encode categorical features using label encoding.

    Args:
        df (pd.DataFrame): Input dataframe
        is_training (bool): Whether this is training data

    Returns:
        pd.DataFrame: Dataframe with encoded categorical features
    """
    df = df.copy()

    for col in self.categorical_features:
        if col not in df.columns:
            continue

        if is_training:
            # Fit and transform for training data
            self.label_encoders[col] = LabelEncoder()
            df[col] = self.label_encoders[col].fit_transform(df[col].astype(str))
        else:
            # Transform for test data
            if col in self.label_encoders:
                # Handle unseen categories
                df[col] = df[col].astype(str)
                known_labels = set(self.label_encoders[col].classes_)
                df[col] = df[col].apply(lambda x: x if x in known_labels else self.label_encoders[col].classes_[0])
                df[col] = self.label_encoders[col].transform(df[col])

    return df

How does this encode the string-valued attributes?

Copilot AI commented Dec 17, 2025

Contributor Author

@copilot def encode_categorical_features(self, df, is_training=True):
"""
Encode categorical features using label encoding....

encode_categorical_features uses sklearn's LabelEncoder to convert string values to integers:

  1. Training data: For each categorical column, LabelEncoder().fit_transform() assigns a unique integer (0, 1, 2...) to each unique string value, numbered in sorted order of the labels. For example, ['Private', 'Self-emp', 'Local-gov'] becomes [1, 2, 0].

  2. Test data: Uses the same encoder to transform values. Unknown categories (not seen in training) are mapped to the first known category (classes_[0]) to avoid errors.

The encodings are stored in self.label_encoders[col] so they can be reused consistently between train and test data.
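
The two steps above can be reproduced on a few values. This is a minimal sketch with made-up sample labels; "Never-worked" plays the role of a category that was not seen during training, and the fallback mirrors the classes_[0] substitution in the quoted method.

```python
from sklearn.preprocessing import LabelEncoder

# Sample labels assumed for illustration.
train_values = ["Private", "Self-emp", "Local-gov", "Private"]
test_values = ["Local-gov", "Never-worked"]  # "Never-worked" is unseen

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_values)

# classes_ is sorted, so codes follow label order, not first appearance.
known = set(encoder.classes_)
fallback = encoder.classes_[0]
# Replace unseen labels with the first known class before transforming.
safe_test = [v if v in known else fallback for v in test_values]
test_codes = encoder.transform(safe_test)

print(list(encoder.classes_))
print(train_codes.tolist())
print(test_codes.tolist())
```

Note that with this fallback an unseen label becomes indistinguishable from the first known class, here 'Local-gov'.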

@wuzhi456

Owner

@copilot I'd like you to change the categorical encoding to one-hot encoding. Would that give better results?

Co-authored-by: wuzhi456 <217564920+wuzhi456@users.noreply.github.com>

Copilot AI commented Dec 17, 2025

Contributor Author

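@copilot I'd like you to change the categorical encoding to one-hot encoding. Would that give better results?

The encoding has been changed to one-hot encoding. The changes include:

  1. Use sklearn.preprocessing.OneHotEncoder instead of LabelEncoder
  2. Each category value is converted into its own binary column (for example, the 8 categories of workclass become 8 columns of 0/1 values)
  3. Feature dimension increases from 14 to 104

The advantage of one-hot encoding is that it does not introduce a spurious ordinal relationship, which may work better for neural networks.

Changed in commit a85b06d.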

Copilot AI changed the title from "Add handle_missing_values call after missing value check in notebook" to "Add handle_missing_values call and switch to one-hot encoding for categorical features" Dec 17, 2025

2 participants