Add handle_missing_values call and switch to one-hot encoding for categorical features by Copilot · Pull Request #6 · wuzhi456/AI_proj2

Copilot · 2025-12-17T11:23:00Z

Adds explicit missing value handling in the training notebook and improves categorical feature encoding.

Changes Made

New notebook cell: After the missing values check cell, added a call to DataPreprocessor.handle_missing_values() for both train and test dataframes
Added .gitignore: Standard Python ignores for cache files, virtual envs, IDE files
Fixed handle_missing_values method: Replaced df[col].fillna(..., inplace=True) with df[col] = df[col].fillna(...) to properly handle pandas copy behavior and avoid FutureWarning
Changed categorical encoding to one-hot encoding: Replaced LabelEncoder with OneHotEncoder for categorical features. This avoids imposing ordinal relationships on categorical data and may improve model performance. Feature dimension increases from 14 to 104 due to the one-hot encoding.

Usage

# After checking for missing values
preprocessor.identify_feature_types(train_df)
train_df = preprocessor.handle_missing_values(train_df)
test_df = preprocessor.handle_missing_values(test_df)

The handle_missing_values method replaces '?' with NaN, then fills categorical columns with mode and numerical columns with median.
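The behaviour described above can be sketched roughly as follows. This is a minimal standalone illustration, not the repository's implementation: the real method lives on DataPreprocessor and takes only the dataframe, so the explicit `categorical_cols` and `numerical_cols` arguments and the sample data here are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def handle_missing_values(df, categorical_cols, numerical_cols):
    """Replace '?' with NaN, then fill by mode (categorical) or median (numerical).

    Hypothetical standalone version; the column lists are passed in explicitly.
    """
    df = df.copy()  # leave the caller's dataframe untouched
    df = df.replace("?", np.nan)

    for col in categorical_cols:
        if df[col].isna().any():
            # mode() returns a Series; take the most frequent value
            df[col] = df[col].fillna(df[col].mode()[0])

    for col in numerical_cols:
        if df[col].isna().any():
            df[col] = df[col].fillna(df[col].median())

    return df


sample = pd.DataFrame({
    "workclass": ["Private", "?", "Private", "Local-gov"],
    "age": [39.0, 50.0, np.nan, 28.0],
})
cleaned = handle_missing_values(sample, ["workclass"], ["age"])
```

The function returns a new dataframe, so the caller has to rebind the result, as in the usage snippet above.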

Original prompt

In my notebook, after the following cell:

# Check for missing values
missing_train = train_df.isna().sum()
missing_test = test_df.isna().sum()

# Check for ' ?' values
question_train = (train_df == '?').sum()
question_test = (test_df == '?').sum()

print("Missing values (NaN) in training data:")
print(missing_train[missing_train > 0])
print("\n'?' values in training data:")
print(question_train[question_train > 0])

add handle_missing_values from data_processing to clean up this dirty data.


Co-authored-by: wuzhi456 <217564920+wuzhi456@users.noreply.github.com>

wuzhi456 · 2025-12-17T12:15:09Z

@copilot Missing values (NaN) in training data:
Series([], dtype: int64)

'?' values in training data:
workclass 1278
occupation 1283
native.country 423
dtype: int64
Missing values handled successfully!

Remaining missing values in training data: 0
Remaining '?' values in training data: 2984

There are still many '?' values after running it. Is there a problem with handle_missing_values?


Copilot · 2025-12-17T12:19:19Z

@copilot Missing values (NaN) in training data:
Series([], dtype: int64)
...

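Found and fixed the issue in handle_missing_values. The problem was df[col].fillna(..., inplace=True), which is a chained in-place call on the Series returned by df[col]. pandas 2.2 flags this pattern with a FutureWarning, and under Copy-on-Write (the default from pandas 3.0) that Series behaves as a copy, so the fill never reaches the dataframe.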

Changed to df[col] = df[col].fillna(...) which properly assigns the result back to the dataframe. This also removes the FutureWarning.

Fixed in commit 143022f.
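For illustration, here are two ways to write a fill so that the result is always stored back on the dataframe. This is a small sketch on made-up data, not the code from the commit: the first form is the assignment pattern described above, and the second passes a per-column dict to `DataFrame.fillna`, which is an alternative the PR does not use.

```python
import pandas as pd

df = pd.DataFrame({
    "occupation": ["Sales", None, "Sales", "Tech-support"],
    "hours": [40.0, None, 20.0, 60.0],
})

# Option 1: fill the column and assign the result back to the dataframe.
df["occupation"] = df["occupation"].fillna(df["occupation"].mode()[0])

# Option 2: one DataFrame-level call with a {column: fill value} mapping.
df = df.fillna({"hours": df["hours"].median()})

# Sanity check in the spirit of the notebook: count what is still missing.
remaining = int(df.isna().sum().sum())
print(f"Remaining missing values: {remaining}")
```

Neither form relies on an in-place update of an intermediate object, so both behave the same with or without Copy-on-Write.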

wuzhi456 · 2025-12-17T12:26:12Z

@copilot

def encode_categorical_features(self, df, is_training=True):
    """
    Encode categorical features using label encoding.

    Args:
        df (pd.DataFrame): Input dataframe
        is_training (bool): Whether this is training data
        
    Returns:
        pd.DataFrame: Dataframe with encoded categorical features
    """
    df = df.copy()
    
    for col in self.categorical_features:
        if col not in df.columns:
            continue
            
        if is_training:
            # Fit and transform for training data
            self.label_encoders[col] = LabelEncoder()
            df[col] = self.label_encoders[col].fit_transform(df[col].astype(str))
        else:
            # Transform for test data
            if col in self.label_encoders:
                # Handle unseen categories
                df[col] = df[col].astype(str)
                known_labels = set(self.label_encoders[col].classes_)
                df[col] = df[col].apply(lambda x: x if x in known_labels else self.label_encoders[col].classes_[0])
                df[col] = self.label_encoders[col].transform(df[col])
    
    return df

How does this encode the string attributes?

Copilot · 2025-12-17T12:27:17Z

@copilot def encode_categorical_features(self, df, is_training=True):
"""
Encode categorical features using label encoding....

encode_categorical_features uses sklearn's LabelEncoder to convert string values to integers:

Training data: For each categorical column, LabelEncoder().fit_transform() assigns a unique integer (0, 1, 2...) to each unique string value, numbered in sorted order of the values. For example, ['Private', 'Self-emp', 'Local-gov'] becomes [1, 2, 0], because the sorted classes are ['Local-gov', 'Private', 'Self-emp'].
Test data: Uses the same fitted encoder to transform values. Unknown categories (not seen in training) are replaced with classes_[0], the first known category in sorted order, to avoid errors.

The encodings are stored in self.label_encoders[col] so they can be reused consistently between train and test data.
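A short, self-contained sketch of this behaviour, using made-up category values rather than the project's data: LabelEncoder numbers classes in sorted order, and the fallback for unseen test values mirrors the classes_[0] substitution in the method quoted above.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

# Fit on training values; classes_ holds the unique values in sorted order.
train_codes = encoder.fit_transform(["Private", "Self-emp", "Local-gov"])
print(list(encoder.classes_))   # ['Local-gov', 'Private', 'Self-emp']
print(train_codes.tolist())     # [1, 2, 0]

# Test values may contain categories the encoder never saw.
test_values = ["Private", "Never-worked"]
known = set(encoder.classes_)
fallback = encoder.classes_[0]  # first class in sorted order

# Replace unseen values before transform(), which would otherwise raise.
safe_values = [v if v in known else fallback for v in test_values]
test_codes = encoder.transform(safe_values)
print(test_codes.tolist())      # [1, 0]
```

One consequence of the fallback is that an unseen category becomes indistinguishable from whichever known category happens to sort first.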

wuzhi456 · 2025-12-17T13:23:21Z

@copilot I'd like you to change the categorical encoding to one-hot encoding. Would that give better results?


Copilot · 2025-12-17T13:28:51Z

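@copilot I'd like you to change the categorical encoding to one-hot encoding. Would that give better results?

Changed the encoding to one-hot encoding. The changes include:

Uses sklearn.preprocessing.OneHotEncoder in place of LabelEncoder
Each category value becomes its own binary column (for example, the 8 categories of workclass become 8 0/1 columns)
Feature dimension increases from 14 to 104

The advantage of one-hot encoding is that it does not introduce a spurious ordinal relationship, which may work better for neural networks.

The change is in commit a85b06d.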

Initial plan

cdb097d

Copilot AI assigned Copilot and wuzhi456 Dec 17, 2025

Copilot started work on behalf of wuzhi456 December 17, 2025 11:23 View session

Copilot AI and others added 2 commits December 17, 2025 11:26

Add handle_missing_values call in notebook after missing values check

6628546

Co-authored-by: wuzhi456 <217564920+wuzhi456@users.noreply.github.com>

Add .gitignore and remove cached pycache files

d0065cd

Co-authored-by: wuzhi456 <217564920+wuzhi456@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Add data processing for handling missing values~~ Add handle_missing_values call after missing value check in notebook Dec 17, 2025

Copilot AI requested a review from wuzhi456 December 17, 2025 11:29

Copilot finished work on behalf of wuzhi456 December 17, 2025 11:29

Copilot started work on behalf of wuzhi456 December 17, 2025 12:15 View session

Fix handle_missing_values to use proper pandas assignment instead of …

143022f

…inplace Co-authored-by: wuzhi456 <217564920+wuzhi456@users.noreply.github.com>

Copilot finished work on behalf of wuzhi456 December 17, 2025 12:19

Copilot started work on behalf of wuzhi456 December 17, 2025 12:26 View session

Copilot finished work on behalf of wuzhi456 December 17, 2025 12:27

Copilot started work on behalf of wuzhi456 December 17, 2025 13:24 View session

Change categorical encoding from label encoding to one-hot encoding

a85b06d

Co-authored-by: wuzhi456 <217564920+wuzhi456@users.noreply.github.com>

Copilot AI changed the title ~~Add handle_missing_values call after missing value check in notebook~~ Add handle_missing_values call and switch to one-hot encoding for categorical features Dec 17, 2025

Copilot finished work on behalf of wuzhi456 December 17, 2025 13:34

Copilot wants to merge 5 commits into copilot/predict-income-on-census-data from copilot/handle-missing-values-processing
