Count every trip (the "Error bars" branch) #33
Open
allenmichael099 wants to merge 94 commits into e-mission:master from allenmichael099:error_bars
Conversation
- Add a quick-and-dirty clustering model that gives predictions in the correct new format
- Update clustering.py and mapping.py with additional helper functions and params
- Implement a random-forest-based classifier and an AdaBoost-based classifier
- Clean up the new clustering-only classifier
- Remove confidence from the old clustering-only classifier

Other minor changes:
- Update the single_cluster_purity function to accept custom names for the column containing labels
- In expand_coords, resolve an issue of improper indexing, and simplify function params
- Add TripGrouper, a helper class to generate trip-level clusters
- Add OneHotWrapper, a helper class to generate one-hot encodings
- Rename classifiers to be in camel case
- In ClusterForestPredictor:
  - add a method to get class probabilities
  - add options to use start clusters and trip-level clusters as input features for the ensemble algorithm
  - add an option to drop predictions for trips without an end cluster
  - refactor to use TripGrouper and OneHotWrapper
  - fix an incorrect default radius in set_params
  - calculate class probabilities in predict()
  - store predictions on the test set in a DataFrame instance variable
  - separate out the pre-processing for test data
  - update docstrings and comments
- In ClusterOnlyPredictor:
  - implement new prediction strategies (option 1: use the destination cluster only; option 2: use the trip-level cluster only; option 3: use a combination of trip-level and destination clusters)
  - pull out a helper function to get the distribution of labels in a cluster
  - implement set_params()
- In DBSCANSVM_Clustering:
  - add location_type to the column names containing clusters so that we can distinguish between start and end clusters if we want to use both
  - update docstrings and comments
- In OldClusteringPredictor:
  - add some instance variables so we can access results outside the class
  - update docstrings
- Update the AdaBoost-based predictor
- Add ClusterForestSlimPredictor, a version of ClusterForestPredictor using fewer trip features
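As a rough illustration of what a one-hot encoding helper such as OneHotWrapper might do (the actual class lives in this branch and may differ; the cluster labels and the "loc" prefix below are made up):

```python
# Rough illustration of a one-hot encoding helper like OneHotWrapper;
# the real class is in the branch and may differ. Cluster labels and
# the "loc" prefix are invented for this sketch.
import pandas as pd

clusters = pd.Series(["end_cluster_0", "end_cluster_1", "end_cluster_0"])
onehot = pd.get_dummies(clusters, prefix="loc")
print(list(onehot.columns))  # ['loc_end_cluster_0', 'loc_end_cluster_1']
```

Each categorical cluster label becomes its own indicator column, which is the form ensemble classifiers such as a random forest can consume directly.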
classification performance:
- add the ability to pass model parameters via a dictionary for cross-validation
- add functions to run cross-validation for all users and for all models
- handle the case where a model requires trip data to be in a list structure (as in OldClustering)
- handle the case where users input partial labels
- add a dictionary of predictors and parameters to evaluate
- In get_clf_metrics:
  - ensure that our list of labels is up to date
  - verify that predicted labels are valid
  - return the number of trips without predictions
  - remove unnecessary performance metrics, which are printed directly in print_clf_metrics

clustering performance:
- implement a function to evaluate clustering hyperparameters
- implement a modified H-score calculation

general:
- update default modes and purposes
- update comments
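The "parameters via a dictionary for cross-validation" idea can be sketched roughly as follows; `MajorityPredictor` and `run_cv` are hypothetical stand-ins, not code from this branch:

```python
# Illustrative sketch: cross-validate a dictionary of predictors whose
# parameters are passed as dicts. MajorityPredictor and run_cv are
# hypothetical stand-ins, not code from this repository.

class MajorityPredictor:
    """Toy classifier that always predicts the most common training label."""
    def __init__(self, smoothing=0):
        self.smoothing = smoothing  # illustrative parameter

    def fit(self, X, y):
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label_] * len(X)

def run_cv(model_cls, params, X, y, k=3):
    """Mean accuracy over k contiguous folds; params passed as a dict."""
    n, fold, scores = len(X), len(X) // k, []
    for i in range(k):
        lo = i * fold
        hi = (i + 1) * fold if i < k - 1 else n
        X_tr, y_tr = X[:lo] + X[hi:], y[:lo] + y[hi:]
        preds = model_cls(**params).fit(X_tr, y_tr).predict(X[lo:hi])
        scores.append(sum(p == t for p, t in zip(preds, y[lo:hi])) / (hi - lo))
    return sum(scores) / k

# Dictionary of predictors and their parameter dicts to evaluate
predictors = {"majority": (MajorityPredictor, {"smoothing": 0})}
X, y = list(range(9)), ["walk"] * 7 + ["bike"] * 2
for name, (cls, params) in predictors.items():
    print(name, round(run_cv(cls, params, X, y), 2))  # majority 0.78
```

Keeping the (model class, parameter dict) pairs in one dictionary makes it easy to loop the same cross-validation harness over every predictor being evaluated.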
- Also add parameters in mapping so we can vary the SVM threshold levels
…vate-data into hannah_clustering
Plot the results from the CSV as a nice plot
And identify the sources of error. This is a snapshot up to e-mission/e-mission-docs#798 (comment)
This represents the snapshot at e-mission/e-mission-docs#798 (comment)
- Create larger splits through bootstrapping
- Scale the input features before creating the models
- We are now at close to a 90% score for cross-validation
- However, we appear to be overfitting, because the score on the real data is only 62%

This is as of e-mission/e-mission-docs#798 (comment)
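A minimal sketch of the two changes described above, assuming plain Python lists for a feature column; the function names are illustrative, not from the notebooks:

```python
# Hedged sketch of the two steps above: enlarge the training split via
# bootstrap resampling (sampling rows with replacement), then
# standardize features before fitting a model. Names are illustrative.
import random
import statistics

def bootstrap_split(rows, size, seed=42):
    """Sample `rows` with replacement to build a split of length `size`."""
    rng = random.Random(seed)
    return [rows[rng.randrange(len(rows))] for _ in range(size)]

def standardize(column):
    """Scale a feature column to zero mean and unit variance."""
    mu = statistics.mean(column)
    sd = statistics.pstdev(column) or 1.0  # guard against constant columns
    return [(x - mu) / sd for x in column]

trips = [1.0, 2.0, 3.0, 4.0]              # toy feature column
bigger = bootstrap_split(trips, size=10)  # 10 rows drawn from 4
scaled = standardize(bigger)
print(len(bigger))  # 10
```

Note that a bootstrap sample repeats training rows, which can inflate cross-validation scores relative to truly held-out data; that is consistent with the overfitting gap (90% vs. 62%) reported above.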
So that we can verify results before making additional changes
Documented helper_functions.py and get_EC.py. Moved old functions to "extra_unused_functions.py"
Explore program variability
with different priors
Correlation_demonstration_with_fake_programs finds correlations between data characteristics and percent error, starting with arbitrary subsets of all ceo. store_expanded_labeled_trips.ipynb saves the expanded_labeled_trips dataframe with the Jupyter notebook store magic. Added a readme to the Error_bars folder
changed all instances of "Gas Car, sensed" to "Car, sensed"
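For illustration, a rename like this could be done with a pandas `replace`; the dataframe and column name below are assumptions, not taken from the notebooks:

```python
# Minimal sketch of the label rename described above; the dataframe and
# "mode_confirm" column are illustrative, not from the notebooks.
import pandas as pd

df = pd.DataFrame({"mode_confirm": ["Gas Car, sensed", "Walk"]})
df = df.replace("Gas Car, sensed", "Car, sensed")
print(df["mode_confirm"].tolist())  # ['Car, sensed', 'Walk']
```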
Added to the Error_bars folder readme
to find trips dataframes one participant at a time
Also documented more in the error bars folder README
Commented out the confusion matrix editing now that we have the MobilityNet ebike row.
shankari added a commit to JGreenlee/e-mission-server that referenced this pull request on Jun 18, 2023
…ed trip

This should make it easy to compute the metrics using "count every trip". @allenmichael099 implemented this in the "Error bars" branch e-mission/e-mission-eval-private-data#33 and uses it in his calculations.

This improves that computation by computing the distance, duration and count summaries, and doing so from a single pandas dataframe. It also computes the summaries while creating the confirmed trip, so that we don't need to recompute them manually after the fact, an operation which is very, very slow.

Testing done:
- new test cases pass
- existing TestUserInput on real data generates

```
"inferred_section_summary": {
    "distance": {},
    "duration": {},
    "count": {}
},
"cleaned_section_summary": {
    "distance": {
        "ON_FOOT": 1047.1630675866315
    },
    "duration": {
        "ON_FOOT": 792.4609999656677
    },
    "count": {
        "ON_FOOT": 1
    }
},
```

Note that the test subsequently fails with the following backtrace:

```
======================================================================
ERROR: testTripUserInput (__main__.TestUserInput)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/tests/analysisTests/userInputTests/TestUserInput.py", line 169, in testTripUserInput
    self.checkConfirmedTripsAndSections(dataFile, ld, preload=True,
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/tests/analysisTests/userInputTests/TestUserInput.py", line 127, in checkConfirmedTripsAndSections
    self.compare_confirmed_objs_result(confirmed_trips, expected_confirmed_trips, manual_keys=["trip_user_input"] if trip_user_inputs else None)
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/tests/analysisTests/userInputTests/TestUserInput.py", line 76, in compare_confirmed_objs_result
    self.assertEqual(len(rt.data['additions']), len(et.data['additions']))
KeyError: 'additions'
----------------------------------------------------------------------
```

This is because `emission/tests/data/real_examples/shankari_2016-06-20.expected_confirmed_trips.manual_trip_user_input` does not have any additions. But the code always adds additions:

```
confirmed_object_data["additions"] = \
    esdt.get_additions_for_timeline_entry_object(ts, tce)
```

And checking a production instance that does not use time-use, I still see:

```
>>> edb.get_analysis_timeseries_db().count_documents({"metadata.key": "analysis/confirmed_trip", "data.additions": {"$exists": True}})
107
>>> edb.get_analysis_timeseries_db().count_documents({"metadata.key": "analysis/confirmed_trip", "data.additions": {"$exists": False}})
0
```

So it seems like we need to fix the ground truth. However, this should then be failing on master as well, and it is not. Is this test not running by default? Let's push this obviously broken test and see if it fails. Otherwise, we need to fix the CI and then fix the ground truth.
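The per-mode distance/duration/count summary computed "from a single pandas dataframe" might look roughly like this; the `sensed_mode` column name and values are illustrative, not necessarily what e-mission-server uses:

```python
# Hedged sketch of per-mode distance/duration/count summaries computed
# from a single pandas dataframe of sections; column names are
# illustrative, not necessarily those used by e-mission-server.
import pandas as pd

sections = pd.DataFrame({
    "sensed_mode": ["ON_FOOT", "ON_FOOT", "BICYCLING"],
    "distance": [500.0, 547.16, 1200.0],  # meters
    "duration": [400.0, 392.46, 300.0],   # seconds
})

grouped = sections.groupby("sensed_mode")
summary = {
    "distance": grouped["distance"].sum().to_dict(),
    "duration": grouped["duration"].sum().to_dict(),
    "count": grouped.size().to_dict(),
}
print(summary["count"])  # {'BICYCLING': 1, 'ON_FOOT': 2}
```

A single `groupby` over the sections dataframe yields all three summaries in one pass, which is much cheaper than recomputing them per trip after the fact.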
This branch implements the ideas in the "Estimate energy consumption mean and variance" issue and tests them out. The current focus is on the sensing model. I still have to add some more files that I use. Beware: the paths for output files and the emission server location are specific to my computer.