Rex Meeting Minutes
rexshihaoren edited this page Nov 18, 2014 · 21 revisions
- Decide general directions of Rex's research: predict patient's EnjoyLife Score.
- Create private Github Repo.
- Learn R.
- Binarize EnjoyLife Score with R. Plot CDF: histogram of modfam2 (integers) and regular density graph of fam2 (floats)
- Use Linear Regression, KNN, Naive Bayes, SVM, and RandomForest classifiers for prediction. Use 10-fold cross-validation with roc_auc as the metric.
- Compare classifiers by mean roc_auc; use shuffle here so the CV result is not deterministic.
- In plot_pr: average precision-recall across folds instead of appending every fold.
- Use roc as the optimization metric (1. KNN: number of neighbors, e.g. a parameter sweep over 1, 2, 3, ...; 2. RandomForest parameter sweep). Plot a parameter sweeping curve for each classifier: x axis parameters, y axis metric. Maybe use GridSearch or RandomSearch.
- Take some points in PR curve and find something interesting, statistically.
- Prepare to discuss Link Prediction metrics paper, connect w/ our project and possible applications.
- Think about Naive_Bayes parameter optimization if have time.
- 10 10 Folds.
- Read about the details of KNN. Why does it behave this way?
- Change the dist of Bayes.
- Use built-in Param-Opt.
- design doc
- 10 iter inside
- scoring = callable
- different file name for opt
- fix bug
- frame_work, like what I did before in a more efficient way
- Plot name with opt or not especially compare clfs
- Polish codes
- build h5 data structure for clinical results
- try to predict some stuff other than EnjoyLife
- Store Grid Perf
- Change name, not CDF, but conditional pdf
- Plot fitted models
- Merge table3Ful + fam2 + modfam2
- SD for Grid Perf
- Visualization for Grid Perf, maybe heatmap + interpolate
- Beta, Poisson and NB for fitting
- EDSS rate compute gradient
- Bayes source code with Poisson
- One more column in modified EDSSR: ignore abs(dEDSS) <= 0.5; 2 classes, increase or others; do analysis
- Look at the parameter optimisation for RF + Tweak
- Make MixNB, based on goodness of fit estimators for discrete distributions
- histograms and conditional Density plots for merged_update
- Use own formula of Normal Log Likelihood to write a new Gaussian and Mix NB
- Mix NB with goodness Chi-Square
- During fit, output graph
- QOL(n) + EDSSRate(n-1) + EDSS(n-1) => ModEDSS(n)
- How to deal with missing data? e.g NA, ignore for now
- Precision Recall for new Bayes Code
- Rewrite Bayes, use NA
- Remove first visits, and change PreEDSSRate to 0 (imputation), then prediction.
- Maybe: treatment(Y/N) from time of treatment, type of treatment.
- Use the rate of everything to do prediction see what happens.
- Understand why GaussianNB2 is not as good and why linear models are good. Understand these models.
- Plot the Gaussian's fit on top of X.
- feature_importance
- probas = np.exp(self.predict_log_proba(X)); return probas #/ np.sum(probas, axis = 1)
- fam2 instead of modfam2 in ModEDSS Prediction, see what happens.
- Change the code for MixNB between Poisson and Gaussian and output the fit model.
- Create remote to push to the private UCSF repo
- Look into simple parallelization
- Look at the logistic regression coefficients
- Add into the model:
  - Patient specific
    - AgeOfOnset
    - Gender
    - DRB1_1501
    - OnsetToYr5RelapseCount
  - Previous year parameters:
    - DiseaseDuration
    - Siena_PBVC (remove the zeros) (+gradient)
    - New_T2_Lesions
    - meds: doesn't help
- For above, (i) prepare the data from R (ii) check the CDFs
- Shouldn't remove more than 10% of the dataset. Maybe remove Siena_PBVC, or figure out how to deal with NA.
- RandomForest, LogisticRegression, LinearRegression, Gaussian2, MixNB, BayesBernoulli: how to handle NA. Output feature-related stats: feature_importance for RandomForest, coefficients for the 2 regressions, fit plot on X for the 3 Bayes models. Couldn't handle NAs.
- Read about AIC, BIC.
- Logistic, elastic: ridge and lasso, C. Look at x= C, y = roc, two plots (depends on penalty L1, L2).
- Try different sets of features: e.g. MSSS = EDSS/DD. Disease Duration (DD) = AgeAtExam (AAE) - AgeOnSet (AOS). Start from a core set of features, try adding the rest one by one, and generate a table with the different algorithms' ROC.
- Impute Data before plotting, 0 for all PreXXRate, KNN (or maybe RandomForestRegressor) to impute NA's.
- Try different set of features. Question, although MSSS = EDSS/DD, we did some manipulation with EDSS, also we removed preEDSS of NA's.
- IMPORTANT, high-level summary of what we did. DUE ON TUESDAY
- From Last Time: Read about AIC, BIC, Look at x= C, y = roc, two plots (depends on penalty L1, L2).
- Impute only X
- diagnonoNA includes all features without NA not complete cases
- New Bayes Model with +- sample ratio if this columns has less than 5 discrete values; if not follow MixNB
- For Regressions: plot_coeffs.
- Change ModEDSS Remove imprecision to achieve balance
- Rerun the whole thing after today's modification
- Delete AgeOnSet, put AgeAtExam; replace DRB1 with DRB1 * PrevEDSS
- n_iter 50
- Different k Imputation with KNN maybe
- Get my ID next Tuesday, see if it works
- Train your imputation formula then use it in testing.
- Use the old EDSS
- Change the dataframe name in R
- Store ytrue ypred, and create plotting func(datasetname, clf)
- Read paper, give a short presentation
- NA NA Go away please come back another day
- Log
datasetName <- "datasetTest"
sink(file = paste0(datasetName, ".log"), append = FALSE, split = TRUE)
cat("### PART I \n")
cat("##", "This is my log with", 1, "file to test \n")
sink()
- Store ytrue ypred, and create plotting func(datasetname, clf)
- save_output
- plot_roc_com(datasets = [], models = [])
- plot_pr_com(datasets = [], models = [])
- SD plots use 10 times
- Change the modEDSS for more relevant notification of increase. (I want to allow 0.5 differences for EDSS>4, but I have to talk to people to be sure of what I'm doing) and create new folders (probably with "new" at the front)
- X X X on the heatmap for the 100 pts chosen
- PredDate_Impr0-4, output should be able to coexist with old one. e.g. data/PredData, data/PredDate_Impr0-4
- Thursday 2pm Presentation
- X X X on the heatmap for the 100 pts chosen
- different clf in same for sd plot
- sd for pr
- R code change to store h5 at different location of python folder
- Use ./PredData/PredData.h5; ./PredData/data/; ./PredData/plots/ structure
- Change code to comply with new format for gridscore
- gridData use customized with tol
- y_pred y_true use new format from Antoine's code, (e.g. in compare_obj_sd)
- New ROC PR with D1C1, D1C2, D2C1, D2C2.
- GUI!!!!!
- Create 7 more columns in R for treatment, delete the old ones.
- Class Project
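
One of the items above asks for a hand-written normal log-likelihood to build a new Gaussian naive Bayes, and another turns log probabilities back into probabilities. A minimal sketch of both steps, assuming per-class feature means and variances have already been estimated; the function names and the toy two-class parameters are illustrative and are not taken from the project code:

```python
import math

def normal_log_likelihood(x, mean, var):
    """Log density of a normal distribution N(mean, var) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def class_log_joint(row, prior, means, variances):
    """log P(class) plus the sum of per-feature Gaussian log-likelihoods."""
    total = math.log(prior)
    for x, m, v in zip(row, means, variances):
        total += normal_log_likelihood(x, m, v)
    return total

def predict_proba(row, params):
    """Normalize per-class joint log scores into probabilities (log-sum-exp)."""
    scores = [class_log_joint(row, *p) for p in params]
    top = max(scores)  # subtract the max before exponentiating for stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

# Hypothetical two-class model: (prior, feature means, feature variances)
params = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [2.0, 2.0], [1.0, 1.0])]
probas = predict_proba([1.0, 1.0], params)
```

Swapping `normal_log_likelihood` for a Poisson log-likelihood on selected columns would give a mixed model along the lines of MixNB, though how the project chooses the distribution per column is not specified here.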
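
The item "Train your imputation formula then use it in testing" can be sketched as follows. This is only an illustration: it uses column means as the fill values, whereas the minutes mention KNN (or maybe RandomForestRegressor) imputation, and the function names and toy rows are hypothetical. The point it shows is that the fill values are learned from the training rows only and then reused unchanged on the test rows:

```python
import math

def fit_mean_imputer(train_rows):
    """Learn one fill value per column using the training rows only."""
    n_cols = len(train_rows[0])
    fills = []
    for j in range(n_cols):
        observed = [r[j] for r in train_rows if not math.isnan(r[j])]
        # Fall back to 0.0 when a column has no observed training value.
        fills.append(sum(observed) / len(observed) if observed else 0.0)
    return fills

def apply_imputer(rows, fills):
    """Replace NaN entries with the fill values learned on the training set."""
    return [[fills[j] if math.isnan(v) else v for j, v in enumerate(r)]
            for r in rows]

nan = float("nan")
train = [[1.0, 10.0], [3.0, nan], [nan, 30.0]]
test = [[nan, nan], [5.0, 7.0]]

fills = fit_mean_imputer(train)            # statistics come from train only
train_filled = apply_imputer(train, fills)
test_filled = apply_imputer(test, fills)   # test rows never influence fills
```

Inside cross-validation the same split applies per fold: fit the imputer on the training folds and apply it to the held-out fold, so no information from the held-out rows leaks into the fill values.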