Rex Meeting Minutes

rexshihaoren edited this page Nov 18, 2014 · 21 revisions

Week 1

  1. Decide general directions of Rex's research: predict patient's EnjoyLife Score.
  2. Create private Github Repo.
  3. Learn R.
  4. Binarize EnjoyLife Score with R. Plot CDF: histogram of modfam2 (integers) and regular density graph of fam2 (floats)

Week 2

  1. Use Linear Regression, KNN, Naive Bayes, SVM, and RandomForest classifiers for prediction. Use 10-fold cross-validation with roc_auc as the metric.
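
A minimal sketch of the evaluation loop in item 1, in plain Python so it runs without any library. The helper names `roc_auc` and `kfold_indices` are hypothetical and are not the project's code; scikit-learn offers equivalents. The AUC is computed as the probability that a random positive outscores a random negative, with ties counted as half. The folds here are shuffled but not stratified, which is an assumption of this sketch.

```python
import random


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # A positive scoring above a negative counts 1, a tie counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def kfold_indices(n_samples, n_folds=10, seed=None):
    """Yield (train_indices, test_indices) for each of n_folds folds."""
    idx = list(range(n_samples))
    if seed is not None:
        random.Random(seed).shuffle(idx)  # shuffle so splits differ per seed
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


# Tiny check of the metric on four samples.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Each classifier would be fit on `train` and scored on `test` inside the loop, and the ten AUC values averaged.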

Week 3+4

  1. Compare classifiers by mean roc_auc; shuffle here so that the CV result is not deterministic.
  2. In plot_pr: average precision-recall across folds instead of appending every fold.
  3. Use roc as the optimization metric (1. KNN: # of neighbors, e.g. parameter sweeping curve over (1,2,3,...); 2. RandomForest: parameter sweeping). Plot a parameter sweeping curve for each classifier: x axis parameters, y axis metric. Maybe use GridSearch or RandomSearch.
  4. Take some points in PR curve and find something interesting, statistically.
  5. Prepare to discuss Link Prediction metrics paper, connect w/ our project and possible applications.
  6. Think about Naive_Bayes parameter optimization if time permits.
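
One way to do the averaging in item 2, as a plain-Python sketch: compute a PR curve per fold, read off the interpolated precision at a fixed grid of recall levels, and average across folds. The function names are hypothetical and this is not the actual `plot_pr` code. It assumes every fold contains at least one positive sample.

```python
def pr_points(y_true, scores):
    """Return (recall, precision) pairs, one per distinct score threshold."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(y_true)
    tp = fp = 0
    points = []
    for rank, i in enumerate(order):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a run of tied scores.
        is_last = rank + 1 == len(order)
        if is_last or scores[order[rank + 1]] != scores[i]:
            points.append((tp / n_pos, tp / (tp + fp)))
    return points


def precision_at(points, recall_level):
    """Interpolated precision: best precision at any recall >= recall_level."""
    candidates = [p for r, p in points if r >= recall_level]
    return max(candidates) if candidates else 0.0


def mean_pr_curve(folds, recall_grid):
    """Average the interpolated precision over folds at each recall level."""
    curves = [pr_points(y, s) for y, s in folds]
    return [sum(precision_at(c, r) for c in curves) / len(curves)
            for r in recall_grid]


folds = [([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]),
         ([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])]
print(mean_pr_curve(folds, [0.5, 1.0]))
```

Averaging on a shared recall grid gives one curve per classifier, which is easier to compare than a cloud of per-fold points.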

Week 5

  1. 10 10 Folds.
  2. Read up on the details of KNN. Why does it behave this way?
  3. Change the dist of Bayes.
  4. Use built-in Param-Opt.

April 23

  1. design doc
  2. 10 iter inside
  3. scoring = callable
  4. different file name for opt

April 30

  1. fix bug
  2. frame_work, like what I did before in a more efficient way

May 22

  1. Plot name with opt or not especially compare clfs
  2. Polish codes
  3. build h5 data structure for clinical results
  4. try to predict some stuff other than EnjoyLife

May 27

  1. Store Grid Perf
  2. Change name, not CDF, but conditional pdf
  3. Plot fitted models
  4. Merge table3Ful + fam2 + modfam2

Jun 3

  1. SD for Grid Perf
  2. Visualization for Grid Perf, maybe heatmap + interpolate
  3. Beta, Poisson and NB for fitting
  4. EDSS rate: compute gradient

Jun 5

  1. Bayes source code with Poisson
  2. One more col in modified EDSSR: ignore abs dEDSS <= 0.5; 2 classes, increase or others; do analysis
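
A possible reading of item 2, sketched in Python: a change in EDSS between consecutive visits counts as an increase only when it exceeds the 0.5 tolerance, and everything else (small changes, no change, decreases) falls in the other class. The function name `increase_label` and the exact class encoding (1 = increase, 0 = others) are assumptions, not the project's code.

```python
def increase_label(prev_edss, cur_edss, tol=0.5):
    """Return 1 if EDSS rose by more than tol since the previous visit, else 0."""
    return 1 if cur_edss - prev_edss > tol else 0


# Made-up EDSS values for one patient's consecutive visits (illustration only).
visits = [2.0, 2.5, 4.0, 3.0]
labels = [increase_label(a, b) for a, b in zip(visits, visits[1:])]
print(labels)  # [0, 1, 0]
```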

Jun 11

  1. Look at the parameter optimisation for RF + Tweak
  2. Make MixNB, based on goodness of fit estimators for discrete distributions
  3. histograms and conditional Density plots for merged_update

Jul 3

  1. Use own formula of Normal Log Likelihood to write a new Gaussian and Mix NB
  2. Mix NB with goodness Chi-Square
  3. During fit, output graph
  4. QOL(n) + EDSSRate(n-1) + EDSS(n-1) => ModEDSS(n)
  5. How to deal with missing data, e.g. NA? Ignore for now.
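
For item 1, the log-likelihoods involved can be written out directly. The sketch below gives the standard normal and Poisson log-likelihoods with their maximum-likelihood fits; the function names are hypothetical, and comparing the two fitted log-likelihoods per feature is only one plausible way to pick a distribution, not the chi-square goodness-of-fit approach named in item 2.

```python
import math


def fit_normal(xs):
    """Maximum-likelihood mean and standard deviation."""
    mu = sum(xs) / len(xs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
    return mu, sigma


def normal_loglik(xs, mu, sigma):
    """Sum over xs of log N(x | mu, sigma^2)."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * sigma ** 2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2))


def fit_poisson(ks):
    """Maximum-likelihood rate: the sample mean."""
    return sum(ks) / len(ks)


def poisson_loglik(ks, lam):
    """Sum over ks of log Poisson(k | lam); lgamma(k + 1) equals log(k!)."""
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in ks)


counts = [0, 1, 1, 2, 4]  # made-up count feature, for illustration only
mu, sigma = fit_normal(counts)
lam = fit_poisson(counts)
print(normal_loglik(counts, mu, sigma), poisson_loglik(counts, lam))
```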

Jul 11

  1. Precision Recall for new Bayes Code
  2. Rewrite Bayes, use NA
  3. Remove first visits, and change PreEDSSRate to 0 (imputation), then prediction.
  4. Maybe: treatment(Y/N) from time of treatment, type of treatment.

Jul 16

  1. Use the rate of everything to do prediction see what happens.
  2. Understand why GaussianNB2 is not as good and why linear models are good. Understand these models.
  3. Plot the Gaussian's fit on top of X.
  4. feature_importance
  5. probas = np.exp(self.predict_log_proba(X)); return probas #/ np.sum(probas, axis = 1)
  6. fam2 instead of modfam2 in ModEDSS Prediction, see what happens.
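
A note on the commented-out division in item 5, with a small sketch. If the per-class log scores are unnormalized, each row has to be divided by its own sum. In NumPy, dividing an (n, classes) array by `np.sum(probas, axis=1)` does not broadcast row-wise unless `keepdims=True` is passed. If `predict_log_proba` already returns normalized log-probabilities, the division changes nothing. The plain-Python version below (hypothetical helper name) also subtracts the row maximum before exponentiating, so very negative log scores do not underflow to zero.

```python
import math


def normalize_log_scores(log_rows):
    """Turn rows of unnormalized log scores into probabilities summing to 1."""
    out = []
    for row in log_rows:
        top = max(row)
        # Shifting by the row maximum leaves the ratios unchanged.
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


rows = [[math.log(1.0), math.log(3.0)], [-1000.0, -1001.0]]
for probs in normalize_log_scores(rows):
    print(probs, sum(probs))
```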

Aug 5

  1. Change the code for MixNB between Poisson and Gaussian and output the fit model.

  2. create remote to push to the private UCSF repo

  3. Look into simple parallelization

  4. Look at the logistic regression coefficients

  5. Add into the model:

  • Patient specific
    • AgeOfOnset
    • Gender
    • DRB1_1501
    • OnsetToYr5RelapseCount
  • Previous year parameters:
    • DiseaseDuration
    • Siena_PBVC (remove the zeros) (+gradient)
    • New_T2_Lesions
    • meds: doesn't help
  6. For the above: (i) prepare the data from R; (ii) check the CDFs

Aug 15

  1. Shouldn't remove more than 10% of the dataset. Maybe remove Siena_PBVC, or figure out how to deal with NA.
  2. RandomForest, LogisticRegression, LinearRegression, Gaussian2, MixNB, BayesBernoulli: how to handle NA? Output feature-related stats: feature_importance for RandomForest, coefficients for the 2 regressions, fit plot on X for the 3 Bayes models. Couldn't handle NAs.
  3. Read about AIC, BIC.
  4. Logistic, elastic: ridge and lasso, C. Look at x= C, y = roc, two plots (depends on penalty L1, L2).
  5. Try different set of features: e.g. MSSS = EDSS/DD. Disease Duration(DD) = AgeAtExam(AAE) - AgeOnSet(AOS). From a core set of features, and try add the rest one by one, and generate a table with different algorithms' ROC.
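
The derived features in item 5 can be sketched as below, following the definitions given in the notes (DD = AAE - AOS, MSSS = EDSS/DD). The function names and the choice to return `None` when the disease duration is not positive are assumptions made for illustration.

```python
def disease_duration(age_at_exam, age_of_onset):
    """DD = AgeAtExam - AgeOnSet, in years."""
    return age_at_exam - age_of_onset


def msss(edss, dd):
    """MSSS as defined in these notes: EDSS divided by disease duration."""
    if dd <= 0:
        return None  # undefined at or before onset; treated as missing here
    return edss / dd


dd = disease_duration(40.0, 30.0)
print(dd, msss(3.0, dd))
```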

Aug 22

  1. Impute Data before plotting, 0 for all PreXXRate, KNN (or maybe RandomForestRegressor) to impute NA's.
  2. Try different set of features. Question, although MSSS = EDSS/DD, we did some manipulation with EDSS, also we removed preEDSS of NA's.
  3. IMPORTANT, high-level summary of what we did. DUE ON TUESDAY
  4. From Last Time: Read about AIC, BIC, Look at x= C, y = roc, two plots (depends on penalty L1, L2).

Aug 26

  1. Impute only X
  2. diagnonoNA includes all features without NA not complete cases
  3. New Bayes Model with +- sample ratio if this columns has less than 5 discrete values; if not follow MixNB
  4. For regressions: plot_coeffs.

Sept 4

  1. Change ModEDSS: remove imprecision to achieve balance
  2. Rerun the whole thing after today's modification
  3. Delete AgeOnSet, put AgeAtExam; replace DRB1 with DRB1 * PrevEDSS
  4. n_iter 50
  5. Different k Imputation with KNN maybe
  6. Get my ID next Tuesday, see if it works
  7. Train your imputation formula then use it in testing.
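
Item 7 says the imputation should be learned on the training data and then reused unchanged on the test data. The sketch below shows that fit/transform split with a column-mean imputer, swapped in for the KNN imputation mentioned in item 5 only to keep the example short. The class name `MeanImputer` is hypothetical, and missing values are represented as `None`.

```python
class MeanImputer:
    """Fill missing values with column means learned from training rows only."""

    def fit(self, rows):
        n_cols = len(rows[0])
        self.means_ = []
        for j in range(n_cols):
            observed = [r[j] for r in rows if r[j] is not None]
            # Fall back to 0.0 when a column has no observed training values.
            self.means_.append(sum(observed) / len(observed) if observed else 0.0)
        return self

    def transform(self, rows):
        return [[self.means_[j] if v is None else v for j, v in enumerate(r)]
                for r in rows]


train = [[1.0, None], [3.0, 4.0]]
test = [[None, None]]
imputer = MeanImputer().fit(train)   # statistics come from the training fold
print(imputer.transform(test))       # test rows reuse the training statistics
```

Fitting on the training fold alone keeps information from the test fold out of the imputed values.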

Sept 9

  1. Use the old EDSS
  2. Change the dataframe name in R
  3. Store ytrue ypred, and create plotting func(datasetname, clf)
  4. Read paper, give a short presentation

Sept 23

  1. NA NA Go away please come back another day
  2. Log

    datasetName <- "datasetTest"
    sink(file = paste0(datasetName, ".log"), append = FALSE, split = TRUE)
    cat("### PART I \n")
    ### PART I
    cat("##", "This is my log with", 1, "file to test \n")
    ## This is my log with 1 file to test
    sink()

  3. Store ytrue ypred, and create plotting func(datasetname, clf)

Sept 30

  1. save_output
  2. plot_roc_com(datasets = [], models = [])
  3. plot_pr_com(datasets = [], models = [])

Oct 7

  1. SD plots use 10 times
  2. Change the modEDSS for more relevant notification of increase. (I want to allow 0.5 differences for EDSS>4, but I have to talk to people to be sure of what I'm doing) and create new folders (probably with "new" at the front)
  3. X X X on the heatmap for the 100 pts chosen

Oct 16

  1. PredDate_Impr0-4, output should be able to coexist with old one. e.g. data/PredData, data/PredDate_Impr0-4
  2. Thursday 2pm Presentation
  3. X X X on the heatmap for the 100 pts chosen
  4. different clf in same for sd plot
  5. sd for pr

Oct 22

  1. R code change to store h5 at different location of python folder
  2. Use ./PredData/PredData.h5; ./PredData/data/; ./PredData/plots/ structure
  3. Change code to comply with new format for grid scores

Oct 27

  1. gridData use customized with tol
  2. y_pred y_true use new format from Antoine's code, (e.g. in compare_obj_sd)

Nov 4

  1. New ROC PR with D1C1, D1C2, D2C1, D2C2.
  2. GUI!!!!!

Nov 18

  1. Create 7 more columns in R for treatment, delete the old ones.
  2. Class Project