cancer-prediction

This project involves building and evaluating models to predict cancer diagnoses based on a dataset of 1,500 patients. The dataset includes various medical and lifestyle factors, such as genetic risk, cancer history, smoking habits, and more. The project explores both k-nearest neighbor (KNN) and logistic regression models, analyzing the impact of different features on the model's accuracy.

Data

Features

Link to dataset: https://www.kaggle.com/datasets/rabieelkharoua/cancer-prediction-dataset/data

The dataset contains the following features:

Age
Gender
Body Mass Index (BMI)
Genetic Risk
Physical Activity
Alcohol Intake
Smoking Habits
Cancer History

Target

Diagnosis: The target variable indicating whether the patient was diagnosed with cancer.

Methods

Data Preprocessing

Feature Scaling: Standard scaling and min-max scaling was applied to normalize the features, as applicable to each type of model.
Train-Test Split: The dataset was split into 80% training and 20% validation sets.

Model Selection

K-Nearest Neighbors (KNN): The KNN model was trained using different combinations of features. The model with k=35 yielded the best accuracy of 0.77. However, the model's inability to directly compute feature importance led to the exploration of logistic regression.
Logistic Regression: Several logistic regression models were tested using different combinations of features to determine their importance. The best accuracy achieved was 0.80, with the most significant features identified as Cancer History, Genetic Risk, Gender, Smoking Habits, and Alcohol Intake.

Feature Importance and Model Interpretation

KNN: Initially, a combination of Age, Physical Activity, and Smoking Habits was used. After testing different feature sets, the combination of Genetic Risk, Alcohol Intake, and Cancer History provided better results.
Logistic Regression: Feature importance was evaluated by examining the coefficients of the logistic regression model. The final model included Genetic Risk, Cancer History, Smoking, Gender, and Alcohol Intake as the most significant features.

Results

Best Model: The Logistic Regression model using Genetic Risk, Cancer History, Smoking, Gender, and Alcohol Intake achieved the highest accuracy of 0.80.
Feature Ranking:
1. Cancer History
2. Genetic Risk
3. Gender
4. Smoking Habits
5. Alcohol Intake

Conclusion

This project demonstrates the effectiveness of both KNN and logistic regression models in predicting cancer diagnoses. Logistic regression provided better interpretability and slightly higher accuracy, making it the preferred model for this task. The analysis highlights the critical role of Cancer History and Genetic Risk in predicting cancer, along with the importance of careful feature selection.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
.DS_Store		.DS_Store
Cancer Prediction.ipynb		Cancer Prediction.ipynb
README.md		README.md
The_Cancer_data_1500_V2.csv		The_Cancer_data_1500_V2.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

cancer-prediction

Data

Features

Target

Methods

Data Preprocessing

Model Selection

Feature Importance and Model Interpretation

Results

Conclusion

About

Uh oh!

Releases

Packages

Languages

amanserhan/cancer-prediction

Folders and files

Latest commit

History

Repository files navigation

cancer-prediction

Data

Features

Target

Methods

Data Preprocessing

Model Selection

Feature Importance and Model Interpretation

Results

Conclusion

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages