
Count every trip (the "Error bars" branch) #33


Open: wants to merge 94 commits into base: master

Conversation

allenmichael099

This branch implements and tests the ideas in the "Estimate energy consumption mean and variance" issue. The current focus is on the sensing model. I still need to add some more of the files that I use. Beware of the paths specific to my computer for the output files and the e-mission server location.

hlu109 and others added 30 commits June 14, 2022 11:26
- add quick & dirty clustering model that gives predictions in the correct
new format
- update clustering.py & mapping.py with additional helper functions and
params
- Implement a random-forest-based classifier and an adaboost-based
classifier
- Clean up the new clustering-only classifier
- Remove confidence from the old clustering-only classifier

Other minor changes:
- Update the single_cluster_purity function to accept custom names for
the column containing labels
- In expand_coords, resolve an issue of improper indexing, and simplify
function params
- Add TripGrouper, a helper class to generate trip-level clusters
- Add OneHotWrapper, a helper class to generate one-hot encodings
- rename classifiers to be in camel case
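
As an illustration of the one-hot encoding mentioned above, here is a minimal sketch in plain Python. It is not the `OneHotWrapper` class from this branch; the function name `one_hot_encode` and its arguments are hypothetical, and the choice to encode unseen categories as all zeros is an assumption.

```python
def one_hot_encode(values, categories=None):
    """Return (categories, rows); each row is a 0/1 list aligned with categories.

    Hypothetical helper for illustration only, not the branch's OneHotWrapper.
    """
    if categories is None:
        # Learn the category list from the data, in a stable sorted order.
        categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        # Assumption: a category not seen when the list was built
        # (e.g. a cluster that only appears at test time) becomes all zeros.
        if value in index:
            row[index[value]] = 1
        rows.append(row)
    return categories, rows


cats, encoded = one_hot_encode(["home", "work", "home", "gym"])
# cats    -> ['gym', 'home', 'work']
# encoded -> [[0, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

Passing the `categories` learned on the training data when encoding test data keeps the columns aligned between the two sets.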

- in ClusterForestPredictor,
-- add method to get class probabilities
-- add options to use start clusters and trip-level clusters as input features for the ensemble algorithm
-- add option to drop predictions for trips without an end cluster
-- refactor to use TripGrouper and OneHotWrapper
-- fix incorrect default radius in set_params
-- calculate class probabilities in predict()
-- store predictions on the test set in a DataFrame instance variable
-- separate out the pre-processing for test data
-- update docstrings and comments

- In ClusterOnlyPredictor,
-- implement new prediction strategies (option 1: use destination cluster only, option 2: use trip-level cluster only, option 3: use a combination of trip-level and destination clusters)
-- pull out helper function to get distribution of labels in a cluster
-- implement set_params()
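
The destination-cluster strategy (option 1) and the label-distribution helper can be sketched roughly as follows. This is an assumption about the general idea, not the code in `ClusterOnlyPredictor`: the function names and the `end_cluster` and `purpose` keys are hypothetical.

```python
from collections import Counter


def label_distribution(labels):
    """Map each label to its share of the given labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def predict_from_end_cluster(train_trips, end_cluster, label_key="purpose"):
    """Predict the most common label among training trips ending in end_cluster.

    Returns (None, {}) when no training trip ends in that cluster.
    Hypothetical sketch; the key names are assumptions.
    """
    labels = [t[label_key] for t in train_trips if t["end_cluster"] == end_cluster]
    if not labels:
        return None, {}
    dist = label_distribution(labels)
    best = max(dist, key=dist.get)
    return best, dist


train_trips = [
    {"end_cluster": 0, "purpose": "work"},
    {"end_cluster": 0, "purpose": "work"},
    {"end_cluster": 0, "purpose": "shopping"},
    {"end_cluster": 1, "purpose": "home"},
]
best, dist = predict_from_end_cluster(train_trips, 0)
```

The returned distribution doubles as a set of class probabilities for the predicted trip.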

- In DBSCANSVM_Clustering,
-- add location_type to the column names containing clusters so that we can distinguish between start and end clusters if we want to use both
-- update docstrings and comments

- In OldClusteringPredictor,
-- Add some instance variables so we can access results outside the class
-- Update docstrings

- Update the AdaBoost-based predictor
- Add ClusterForestSlimPredictor, a version of ClusterForestPredictor using fewer trip features
classification performance:
- add ability to pass model parameters via a dictionary for cross-validation
- add functions to run cross-validation for all users and for all models
- handle case where model requires trip data to be in a list structure (as in OldClustering)
- handle case where users input partial labels
- add dictionary of predictors and parameters to evaluate
- In get_clf_metrics,
-- ensure that our list of labels is up to date
-- verify predicted labels are valid
-- return number of trips without predictions
-- remove unnecessary performance metrics, which are printed directly in print_clf_metrics

clustering performance:
- implement function to evaluate clustering hyperparameters
- implement modified H-score calculation

general:
- update default modes and purposes
- update comments
- Also add additional parameters in mapping so we can vary the SVM threshold levels
Plot the results from the csv into a nice plot
shankari and others added 22 commits November 23, 2022 17:04
- Create larger splits through bootstrapping
- Scale the input features before creating the models
- We are now at close to a 90% score for the cross-validation
- However, we appear to be overfitting because the score on the real data is only 62%
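
For readers unfamiliar with the two steps above, here is a minimal standard-library sketch of bootstrapping (sampling with replacement) and feature scaling (standardizing to zero mean and unit variance). The function names are hypothetical and this is not the code from this commit, which may well use a library scaler instead.

```python
import random
import statistics


def bootstrap_sample(rows, n, seed=0):
    """Draw n rows with replacement; the same row may appear several times."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in range(n)]


def standardize(column):
    """Scale a numeric column to zero mean and unit (population) variance."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


distances = [1200.0, 300.0, 4500.0, 800.0]
larger_split = bootstrap_sample(distances, 10, seed=42)
scaled = standardize(larger_split)
```

As a general caution, if resampling happens before the cross-validation split, copies of the same trip can land in both the training and the validation folds, which can inflate the cross-validation score. Whether that contributes to the gap reported here is not stated.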

This is as of
e-mission/e-mission-docs#798 (comment)
So that we can verify results before making additional changes
Documented helper_functions.py and get_EC.py
Moved old functions to "extra_unused_functions.py"
Correlation_demonstration_with_fake_programs finds correlations with
data characteristics and percent error,
starting with arbitrary subsets of all ceo.
store_expanded_labeled_trips.ipynb saves the expanded_labeled_trips dataframe
with the Jupyter notebook store magic.
Added a readme to the Error_bars folder
changed all instances of "Gas Car, sensed" to "Car, sensed"
Added to the Error_bars folder readme
to find trips dataframes 1 participant at a time
Also documented more in the error bars folder README
Commented out the confusion matrix editing
now that we have the MobilityNet ebike row.
shankari added a commit to JGreenlee/e-mission-server that referenced this pull request Jun 18, 2023
…ed trip

This should make it easy to compute the metrics using "count every trip".

@allenmichael099 implemented this in the "Error bars" branch
e-mission/e-mission-eval-private-data#33
and uses it in his calculations

This improves that computation by computing the distance, duration and count
summaries, and doing so from a single pandas dataframe. It also computes the
summaries while creating the confirmed trip, so that we don't need to recompute
manually after the fact, an operation which is very, very slow.

Testing done:
- new test cases pass
- existing TestUserInput on real data generates

```
        "inferred_section_summary": {
            "distance": {},
            "duration": {},
            "count": {}
        },
        "cleaned_section_summary": {
            "distance": {
                "ON_FOOT": 1047.1630675866315
            },
            "duration": {
                "ON_FOOT": 792.4609999656677
            },
            "count": {
                "ON_FOOT": 1
            }
        },
```
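
A rough sketch of how per-mode summaries shaped like the output above could be built. It uses plain dictionaries rather than the single pandas dataframe the commit describes, and the function name and the `sensed_mode`, `distance` and `duration` keys are assumptions made for illustration.

```python
from collections import defaultdict


def summarize_sections(sections, mode_key="sensed_mode"):
    """Sum distance and duration, and count sections, per mode.

    Hypothetical sketch; not the e-mission-server implementation.
    """
    distance = defaultdict(float)
    duration = defaultdict(float)
    count = defaultdict(int)
    for section in sections:
        mode = section[mode_key]
        distance[mode] += section["distance"]
        duration[mode] += section["duration"]
        count[mode] += 1
    # Convert to plain dicts so the result serializes like the output above.
    return {
        "distance": dict(distance),
        "duration": dict(duration),
        "count": dict(count),
    }


sections = [
    {"sensed_mode": "ON_FOOT", "distance": 100.0, "duration": 60.0},
    {"sensed_mode": "ON_FOOT", "distance": 50.0, "duration": 30.0},
    {"sensed_mode": "BICYCLING", "distance": 900.0, "duration": 240.0},
]
summary = summarize_sections(sections)
empty_summary = summarize_sections([])
```

With no sections, the three sub-dictionaries are empty, which matches the shape of `inferred_section_summary` in the output above.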

Note that the test subsequently fails with the following backtrace:

```
======================================================================
ERROR: testTripUserInput (__main__.TestUserInput)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/tests/analysisTests/userInputTests/TestUserInput.py", line 169, in testTripUserInput
    self.checkConfirmedTripsAndSections(dataFile, ld, preload=True,
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/tests/analysisTests/userInputTests/TestUserInput.py", line 127, in checkConfirmedTripsAndSections
    self.compare_confirmed_objs_result(confirmed_trips, expected_confirmed_trips, manual_keys=["trip_user_input"] if trip_user_inputs else None)
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/tests/analysisTests/userInputTests/TestUserInput.py", line 76, in compare_confirmed_objs_result
    self.assertEqual(len(rt.data['additions']), len(et.data['additions']))
KeyError: 'additions'

----------------------------------------------------------------------
```

This is because the ground truth file `emission/tests/data/real_examples/shankari_2016-06-20.expected_confirmed_trips.manual_trip_user_input` does not have any additions.

But the code always adds additions:

```
    confirmed_object_data["additions"] = \
        esdt.get_additions_for_timeline_entry_object(ts, tce)
```

And checking a production instance that does not use time-use, I still see that every confirmed trip has additions:

```
>>> edb.get_analysis_timeseries_db().count_documents({"metadata.key": "analysis/confirmed_trip", "data.additions": {"$exists": True}})
107
>>> edb.get_analysis_timeseries_db().count_documents({"metadata.key": "analysis/confirmed_trip", "data.additions": {"$exists": False}})
0
```

So it seems like we need to fix the ground truth.
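
One way to locate the ground truth entries that need fixing is to list those without the key. This is a minimal sketch that assumes each expected entry is a dictionary with a nested `data` dictionary (inferred from `et.data['additions']` in the backtrace); the function name is hypothetical.

```python
def entries_missing_key(entries, key="additions"):
    """Return the indices of entries whose 'data' dict lacks the given key."""
    return [i for i, entry in enumerate(entries) if key not in entry.get("data", {})]


# Made-up entries for illustration; not taken from the real ground truth file.
expected_confirmed_trips = [
    {"data": {"additions": [], "user_input": {}}},
    {"data": {"user_input": {}}},
    {"data": {"additions": []}},
]
missing = entries_missing_key(expected_confirmed_trips)
```

An empty list would indicate that the ground truth already matches what the code writes.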

However, this should then be failing on master as well, and it is not. Is this test not running by default?
Let's push this obviously broken test and see if it fails.
If it does not fail there, we need to fix the CI first and then fix the ground truth.
3 participants