
Time-Use Tab (Issue) #840

Closed
sebastianbarry opened this issue Dec 12, 2022 · 40 comments

@sebastianbarry

Creating an issue to keep track of the time-use PRs

  1. feat: a time-use UI e-mission-phone#913
  2. Time-Use tab (Sebastian) e-mission-phone#915
@shankari (Contributor) commented Jan 12, 2023

At this point, we have finished the end-to-end trip level changes on the phone (e-mission/e-mission-phone#917)

We need to start making place level changes. One high level design question is whether we should represent place additions in the confirmed_trip object or create a new confirmed_place object.

I have three main thoughts around this design:

  • as part of the underlying data model, we should definitely have a separate confirmed_place object that stores the place semantic information. We want the data model to be logical and consistent.
  • since we are expanding and generalizing the surveys anyway, we might want to add the option for a place_user_input survey as well.
    • The difference between *_user_input and *_addition objects is that there is 1 user_input object per trip/place and there are multiple addition objects per trip/place
    • The new place UI element would display a single "Add place details" and/or the "Add activity" button depending on how the surveys are configured
  • we could consider decoupling the server representation and the transmission to the client. That means that we have two approaches for implementation on the client:
    • we read the confirmed_place in addition to the confirmed_trip using the existing /datastreams/find_entries/<time_type> call. Note that we will then need to interleave the confirmed_place and confirmed_trip entries after we have received them
    • we create a new API (something like /datastreams/find_composite/<time_type>) which can return the full timeline with labels.
      • we could expand this to actually return a geojson object (similar to the old diary code!) except with downsampled trajectories and matching the time query instead of being hardcoded by day.
      • pop quiz: what are the tradeoffs between reading a geojson object and the individual objects? What might make one approach more performant than the other?
      • in the long term, we might actually want to compare the two approaches to each other directly by running some performance tests
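The *_user_input vs *_addition distinction above can be sketched as data. This is a minimal illustration only; apart from user_input and trip_addition, which appear later in this thread, the field names (place_addition, the timestamps) are assumptions rather than the actual OpenPATH schema:

```python
# Illustrative sketch of the proposed data model, NOT the actual schema.
# There is at most one *_user_input object per trip/place, but zero or
# more *_addition objects per trip/place.
confirmed_trip = {
    "start_ts": 1670000000,
    "end_ts": 1670003600,
    "user_input": {"mode_confirm": "bike"},  # single survey response
    "trip_addition": [                       # multiple activity additions
        {"activity": "grocery shopping"},
        {"activity": "picked up coffee"},
    ],
}

confirmed_place = {
    "enter_ts": 1670003600,
    "exit_ts": 1670010000,
    "user_input": {"place_type": "home"},    # "Add place details" survey
    "place_addition": [                      # "Add activity" entries
        {"activity": "cooking"},
    ],
}
```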

@JGreenlee

Do we anticipate that future projects will need place information? To what extent - and how much detail?

If we are prepared to have a new confirmed_place object, then I agree supporting user_input on places seems logical (even if we don't have a use case at the moment)

However, I am a little lost on when/why we may want to provide a geojson object for a place. Isn't this 'heavier' with more detail? I thought we were moving away from this; in what use case might it be useful? (Maybe eventually a details page / stats for places?)

To my understanding, the label screen currently benefits from being relatively lightweight. My only high-level concern is that we could suffer a network performance hit when places are shown in detail.
If that turns out to be a problem, we could consider lazy-loading places after trips (assuming that trips are still going to be the primary focus).

@sebastianbarry (Author) commented Jan 13, 2023

@shankari Jack and I decided that for this change we would rather go with the first option for implementing places on the server:

we read the confirmed_place in addition to the confirmed_trip using the existing /datastreams/find_entries/<time_type> call. Note that we will then need to interleave the confirmed_place and confirmed_trip entries after we have received them

We chose this because it will be very straightforward to add "analysis/confirmed_place" to the end of the CommHelper.getRawEntries call:
[Screenshot: CommHelper.getRawEntries code]

Because getRawEntries treats keyList as a "list", we can give it multiple keys and it will return multiple lists (ctList, and now cpList):
[Screenshot: getRawEntries keyList handling]

This would require us to edit the way that readAllConfirmedTrips is handled in infinite_scroll_list.js:
[Screenshot: readAllConfirmedTrips in infinite_scroll_list.js]

to include cpList, and then we would need an additional function in InfiniteDiaryListCtrl that binds these two types of objects together (the confirmed trips and their associated confirmed places). The complication we are worried about arises if there is not a "place" object for strictly every "trip" object. If that requirement is not met, we may end up with something like this in the Label list:

(imagine Place 2 does not exist, perhaps because Trip 3 immediately followed Trip 2, so no Place 2 got generated)
Trip 1
Place 1
Trip 2
Place 3 (should be Place 2, but there is no Place 2)
Trip 3
Place 4 (should be Place 3, and so on)
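The misalignment above is what you get from naively zipping the two lists. Sorting the combined entries by timestamp instead tolerates a missing place; a minimal sketch (in Python for illustration, field names are assumptions):

```python
def stitch_timeline(trips, places):
    """Interleave confirmed_trip and confirmed_place entries by time.

    Sorting the combined list by start/enter timestamp tolerates missing
    places (e.g. when one trip immediately follows another), unlike a
    naive alternating zip.
    """
    def start_of(entry):
        # trips carry start_ts, places carry enter_ts (illustrative names)
        return entry.get("start_ts", entry.get("enter_ts"))
    return sorted(trips + places, key=start_of)

trips = [{"id": "trip1", "start_ts": 100}, {"id": "trip2", "start_ts": 300},
         {"id": "trip3", "start_ts": 500}]
# note: no place between trip2 and trip3
places = [{"id": "place1", "enter_ts": 200}, {"id": "place3", "enter_ts": 600}]
timeline = [e["id"] for e in stitch_timeline(trips, places)]
# the remaining places still land next to the correct trips
```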

@sebastianbarry (Author)

@shankari I'll also add that if you have a reason why we should go with the other option:

we create a new API (something like /datastreams/find_composite/<time_type>) which can return the full timeline with labels.

We are not averse to going this route, but we believe it is too heavyweight an implementation to merit this small change for "places". We discussed that if you have larger plans for the "place" objects, then this may be a better way of doing it, but for now, and for the reasons listed above, we would rather go the other way.

@shankari (Contributor)

Do we anticipate that future projects will need place information? To what extent - and how much detail?

We've been working on e-mission/OpenPATH for several years now and this is the first time we have displayed the place. However, this is also the first time that we are able to support several different use cases for deployers - in the past, our answer was that "OpenPATH is open source", so deployers should write their own clients if they needed something different.

Having said that, having the place separate opens up the opportunity for a completely different, place-oriented UI. This would show a map for the place, and just a line indicating "travel" by "mode" to the next place. The old Moves app UI used to look like that.

However, I am a little lost on when/why we may want to provide a geojson object for a place. Isn't this 'heavier' with more detail? I thought we were moving away from this; in what use case might it be useful? (Maybe eventually a details page / stats for places?)

wrt the "combined" option, we are not constrained by the long-term data model, so we could copy the start and end place information into each trip, for example. The day geojson is heavier than the raw trip/place objects, but primarily due to the embedded trajectories. (You should actually measure this: retrieve the geojson for a day and see where the size comes from.)
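The suggested measurement could be sketched like this. The structure is an assumption for illustration: it assumes a FeatureCollection whose trajectories are LineString coordinates embedded in the features:

```python
import json

def size_breakdown(day_geojson):
    """Serialize the day's geojson with and without the embedded
    trajectory coordinates, to see how much of the payload the
    trajectories account for (illustrative structure, not the actual
    server output format)."""
    full = len(json.dumps(day_geojson))
    stripped = json.loads(json.dumps(day_geojson))  # cheap deep copy
    for feature in stripped.get("features", []):
        if feature.get("geometry", {}).get("type") == "LineString":
            feature["geometry"]["coordinates"] = []
    without_trajectories = len(json.dumps(stripped))
    return full, without_trajectories
```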

We chose this because it will be very straightforward to add "analysis/confirmed_place" to the end of the CommHelper.getRawEntries call:

We are not averse to going this route, but we believe it is too heavyweight an implementation to merit this small change for "places". We discussed that if you have larger plans for the "place" objects, then this may be a better way of doing it, but for now, and for the reasons listed above, we would rather go the other way.

If we are looking for the best design, we should theoretically pick the one that is best regardless of time and then plan the steps to get there. The steps may be sideways and may involve a temporary implementation initially as long as we have the final design clear and can work towards it when possible.

Having said that, and thinking through this in greater detail while writing this down, I think that there is enough ambiguity in the requirements for displaying the place object that I am fine with your proposal. But we should keep the other option in mind, and if this gets too clunky as we expand the diary, we should re-evaluate switching to the other implementation.

I will add the confirmed_place objects to the server data model this weekend, along with the code to match the inputs.

@shankari (Contributor) commented Jan 13, 2023

for the record

The complication we are worried about arises if there is not a "place" object for strictly every "trip" object.

There is a place-like object for every trip-like object. Note that we (aka OpenPATH) create both place and trip objects; they are not user inputs. If there is not one, we should fix the pipeline. The one caveat is that the trip-like object may be untracked time (section 5.3.1 of my thesis), so you should really retrieve that as well in the list and stitch together the place-like and trip-like objects.

Note that I have code to do that on the server - the starting point is here:
https://github.com/e-mission/e-mission-server/blob/master/emission/storage/decorations/timeline.py#L29
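A quick sanity check for the "place-like object for every trip-like object" invariant might look like this. This is purely illustrative; the `start_place`/`end_place` linkage fields are assumptions rather than the actual schema, and the real pipeline also has to account for untracked time:

```python
def check_pipeline_invariant(trips, places):
    """Return the trips that do not start and end at a known place.

    A non-empty result means the pipeline invariant is broken and the
    pipeline (not the UI) should be fixed. Field names are illustrative.
    """
    place_ids = {p["id"] for p in places}
    return [t for t in trips
            if t.get("start_place") not in place_ids
            or t.get("end_place") not in place_ids]

places = [{"id": "p1"}, {"id": "p2"}]
ok_trips = [{"id": "t1", "start_place": "p1", "end_place": "p2"}]
bad_trips = [{"id": "t2", "start_place": "p1", "end_place": "p_missing"}]
```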

@shankari (Contributor)

and if this gets too clunky as we expand the diary, we should re-evaluate switching to the other implementation.

A concrete example of clunkiness is that we have to reimplement both the trip -> geojson conversion and the timeline creation code on the phone. This is already implemented in the server, so if we returned the combined object, we would not have to maintain two copies of the same codebase.

@shankari (Contributor)

Although maybe the argument is that after the label <-> diary unification, maybe we can remove the geojson formatting code from the server. So the entire presentation layer will be on the phone.

@sebastianbarry @JGreenlee I hope this gives you an example of how to reason about which design is more principled. We also don't want to make decisions solely based on what is easiest to implement - that's how we end up with "ball of wax" code. And even as we choose an option, we want to keep in mind that we might switch to the other one as we make additional changes, and write code that is easy to switch in the future.

@shankari (Contributor) commented Jan 19, 2023

End to end integration testing:

  • delete all prior activities and create a new activity on the 9:20am trip
  • confirm that it is synced to the server
2023-01-18 22:30:59,875:DEBUG:123145437184000:Updated result for user = e20f451e-c619-46c8-a189-54310231bce5, key = manual/trip_addition_input, write_ts = 1674109850.1796498 = {'n': 1, 'nModified': 0, 'upserted': ObjectId('63c8e3a3025909f4cc968942'), 'ok': 1.0, 'updatedExisting': False}

There are three confirmed trips

2023-01-18 22:33:31,237:DEBUG:4505677312:finished querying values for ['analysis/confirmed_trip'], count = 3
2023-01-18 22:33:31,237:DEBUG:4505677312:orig_ts_db_matches = 0, analysis_ts_db_matches = 3

while matching the inputs

2023-01-18 22:33:31,241:DEBUG:4505677312:Comparing user input 1 Domestic, : 2016-08-11T14:05:09.694147-07:00 -> 2016-08-11T14:45:45.271000-07:00, trip 2016-08-11T09:20:14.817000-07:00 -> 2016-08-11T11:29:50.464000-07:00, start checks are (True && False) and end checks are (False || False)
2023-01-18 22:33:31,242:DEBUG:4505677312:Handling corner case where start check matches, but end check does not
2023-01-18 22:33:31,245:DEBUG:4505677312:Second level of end checks when the next trip is defined (1470951945.271 <= 1470940950.700465) = False

2023-01-18 22:33:31,246:DEBUG:4505677312:Comparing user input 1 Domestic, : 2016-08-11T14:05:09.694147-07:00 -> 2016-08-11T14:45:45.271000-07:00, trip 2016-08-11T11:42:30.700465-07:00 -> 2016-08-11T11:52:29.534000-07:00, start checks are (True && False) and end checks are (False || False)
2023-01-18 22:33:31,246:DEBUG:4505677312:Handling corner case where start check matches, but end check does not
2023-01-18 22:33:31,250:DEBUG:4505677312:Second level of end checks when the next trip is defined (1470951945.271 <= 1470949509.6941469) = False

2023-01-18 22:33:31,250:DEBUG:4505677312:Comparing user input 1 Domestic, : 2016-08-11T14:05:09.694147-07:00 -> 2016-08-11T14:45:45.271000-07:00, trip 2016-08-11T14:05:09.694147-07:00 -> 2016-08-11T14:45:45.271000-07:00, start checks are (True && True) and end checks are (True || True)
2023-01-18 22:33:31,251:DEBUG:4505677312:sorted candidates are [{'write_fmt_time': '2023-01-18T22:29:50.252379-08:00', 'detail': '2016-08-11T14:05:09.694147-07:00'}]
2023-01-18 22:33:31,251:DEBUG:4505677312:most recent entry is 2023-01-18T22:29:50.252379-08:00, 2016-08-11T14:05:09.694147-07:00

the matching entry has a write_ts of 2023-01-18T22:29:50.252379-08:00 and a start_ts of 2016-08-11T14:05:09.694147-07:00
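The start/end checks in the logs might correspond to interval comparisons roughly along these lines. This is an illustrative reconstruction, not the actual matcher code; the real checks are clearly more involved (note the "second level" end check against the next trip in the logs):

```python
def matches(trip, ui):
    """Rough illustration of matching a user input to a trip by time.

    A user input matches when its start falls within the trip and its
    end does not run past the trip's end. The logged '(True && False)'
    style output suggests each check is a conjunction/disjunction of
    two conditions; only a simplified version is shown here.
    """
    start_check = trip["start_ts"] <= ui["start_ts"] <= trip["end_ts"]
    end_check = ui["end_ts"] <= trip["end_ts"]
    return start_check and end_check
```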

And then handle_multi_non_deleted_match fails

2023-01-18 22:33:31,251:ERROR:4505677312:Error while matching incoming user inputs, timestamp is unchanged
Traceback (most recent call last):
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/analysis/userinput/matcher.py", line 19, in match_incoming_user_inputs
    last_user_input_done = match_incoming_inputs(user_id, time_query)
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/analysis/userinput/matcher.py", line 46, in match_incoming_inputs
    handle_multi_non_deleted_match(confirmed_trip, ui)
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/analysis/userinput/matcher.py", line 68, in handle_multi_non_deleted_match
    confirmed_trip["data"]["trip_addition"].append(ui)
AttributeError: 'NoneType' object has no attribute 'append'

@shankari (Contributor) commented Jan 19, 2023

OK, so this is pretty bizarre. I create the trip_addition array just before the append, AND I tested this earlier in e-mission/e-mission-server@2129075:

```
    if "trip_addition" not in confirmed_trip["data"]:
        confirmed_trip["data"]["trip_addition"] = []
    if "status" not in ui.data or ui.data.status == ecwtui.InputStatus.ACTIVE:
        confirmed_trip["data"]["trip_addition"].append(ui)
```

And I test the function in the unit tests, which currently pass

emission/tests//analysisTests/userInputTests/TestUserInputFakeData.py:        eaum.handle_multi_non_deleted_match(fake_ct, fake_match)
emission/tests//analysisTests/userInputTests/TestUserInputFakeData.py:            eaum.handle_multi_non_deleted_match(fake_ct, fm)
emission/tests//analysisTests/userInputTests/TestUserInputFakeData.py:            eaum.handle_multi_non_deleted_match(fake_ct, fm)
emission/tests//analysisTests/userInputTests/TestUserInputFakeData.py:            eaum.handle_multi_non_deleted_match(fake_ct, fm)
emission/tests//analysisTests/userInputTests/TestUserInputFakeData.py:            eaum.handle_multi_non_deleted_match(fake_ct, fm)
emission/tests//analysisTests/userInputTests/TestUserInputFakeData.py:            eaum.handle_multi_non_deleted_match(fake_ct, fm)

@shankari (Contributor)

Also, when I run the pipeline, I get an error which did not show up when @JGreenlee ran the pipeline, so I don't think this is the root cause of Jack's issue, but it needs to be fixed regardless.

2023-01-19T08:54:43.023948-08:00**********UUID e20f451e-c619-46c8-a189-54310231bce5: moving to long term**********
2023-01-19T08:54:43.055966-08:00**********UUID e20f451e-c619-46c8-a189-54310231bce5: updating incoming user inputs**********
Error while matching incoming user inputs, timestamp is unchanged
Traceback (most recent call last):
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/analysis/userinput/matcher.py", line 19, in match_incoming_user_inputs
    last_user_input_done = match_incoming_inputs(user_id, time_query)
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/analysis/userinput/matcher.py", line 46, in match_incoming_inputs
    handle_multi_non_deleted_match(confirmed_trip, ui)
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/analysis/userinput/matcher.py", line 69, in handle_multi_non_deleted_match
    confirmed_trip["data"]["trip_addition"].append(ui)
AttributeError: 'NoneType' object has no attribute 'append'
2023-01-19T08:54:43.218456-08:00**********UUID e20f451e-c619-46c8-a189-54310231bce5: filter accuracy if needed**********

@shankari (Contributor)

Hm, so the issue is that the trip addition is set to None, not missing

{'trip_addition': None}

Ok, so now I think that I understand it a bit better. As part of the change to add in the previously labeled entries while creating the confirmed trip, we are setting trip_additions to None, and on the input matching side, we only initialize it if it is not present. So this was a regression caused by e-mission/e-mission-server@a7ee02c

The fix is fairly trivial: we should also check for None in addition to presence.
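The two-part fix can be sketched as follows (simplified from the real matcher code, with `get_not_deleted_candidates` reduced to a stub for illustration):

```python
def get_not_deleted_candidates(candidates):
    # Fix 1: return [] instead of None when nothing survives filtering,
    # so downstream code never stores trip_addition = None.
    filtered = [c for c in candidates if not c.get("deleted")]
    return filtered  # [] rather than None when empty

def handle_multi_non_deleted_match(confirmed_trip, ui):
    data = confirmed_trip["data"]
    # Fix 2: check for None as well as absence, so that even if an older
    # record already stored trip_addition = None, we recover gracefully
    # instead of raising AttributeError on None.append(...).
    if data.get("trip_addition") is None:
        data["trip_addition"] = []
    data["trip_addition"].append(ui)
```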

shankari added a commit to shankari/e-mission-server that referenced this issue Jan 19, 2023
Matching for new user inputs broke (regression due to
e-mission@a7ee02c)

The main issue was that we were filling in the `trip_additions` with None but only checking for existence.
More details at:
e-mission/e-mission-docs#840 (comment)

Fixing it in two different ways:
- Returning `[]` instead of None from `get_not_deleted_candidates` if there are no filtered candidates
  This will ensure that we don't have a None value
- check for both None and existence. This will ensure that even if we do have a None value, we can recover gracefully

Testing done:

Original failure:

```
2023-01-19T09:58:57.198828-08:00**********UUID e20f451e-c619-46c8-a189-54310231bce5: updating incoming user inputs**********
Error while matching incoming user inputs, timestamp is unchanged
Traceback (most recent call last):
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/analysis/userinput/matcher.py", line 19, in match_incoming_user_inputs
    last_user_input_done = match_incoming_inputs(user_id, time_query)
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/analysis/userinput/matcher.py", line 46, in match_incoming_inputs
    handle_multi_non_deleted_match(confirmed_trip, ui)
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/analysis/userinput/matcher.py", line 70, in handle_multi_non_deleted_match
    confirmed_trip["data"]["trip_addition"].append(ui)
AttributeError: 'NoneType' object has no attribute 'append'
```

New failure (updating incoming), no `match_id`

```
2023-01-19T09:59:26.680014-08:00**********UUID e20f451e-c619-46c8-a189-54310231bce5: updating incoming user inputs**********
Error while matching incoming user inputs, timestamp is unchanged
Traceback (most recent call last):
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/analysis/userinput/matcher.py", line 19, in match_incoming_user_inputs
    last_user_input_done = match_incoming_inputs(user_id, time_query)
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/analysis/userinput/matcher.py", line 46, in match_incoming_inputs
    handle_multi_non_deleted_match(confirmed_trip, ui)
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/analysis/userinput/matcher.py", line 72, in handle_multi_non_deleted_match
    after_del_list = [ta for ta in confirmed_trip["data"]["trip_addition"] if ta["match_id"] != ui["match_id"]]
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/analysis/userinput/matcher.py", line 72, in <listcomp>
    after_del_list = [ta for ta in confirmed_trip["data"]["trip_addition"] if ta["match_id"] != ui["match_id"]]
KeyError: 'match_id'
```

New failure (creating objects), no `match_id`

```
2023-01-19T14:47:59.883681-08:00**********UUID e20f451e-c619-46c8-a189-54310231bce5: creating confirmed objects **********
Error while creating confirmed objects, timestamp is unchanged
Traceback (most recent call last):
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/analysis/userinput/matcher.py", line 81, in create_confirmed_objects
    last_expected_trip_done = create_confirmed_trips(user_id, time_query)
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/analysis/userinput/matcher.py", line 110, in create_confirmed_trips
    esdt.get_additions_for_trip_object(ts, tct)
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/storage/decorations/trip_queries.py", line 191, in get_additions_for_trip_object
    return get_not_deleted_candidates(valid_user_input(ts, trip_obj), potential_candidates)
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/storage/decorations/trip_queries.py", line 168, in get_not_deleted_candidates
    not_deleted_active = [efpc for efpc in all_active_list if efpc["match_id"] not in all_deleted_id]
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/storage/decorations/trip_queries.py", line 168, in <listcomp>
    not_deleted_active = [efpc for efpc in all_active_list if efpc["match_id"] not in all_deleted_id]
KeyError: 'match_id'
```
@shankari (Contributor)

Checked in the fix. Had some errors in testing from old-style trip addition inputs as outlined in e-mission/e-mission-server@fcf5a9c, but after I purged and reloaded the user, everything worked correctly. The actions were retained on reload.

@shankari (Contributor)

#840 (comment)
wrt retrieving data from the server for places, we had two options:
(1) retrieve confirmed_trip and confirmed_place objects
(2) retrieve a composite_trip object instead of confirmed_trip that includes the start and end place

I think you had preferred (1) because it is easier to implement and were concerned about performance with composite_trip, and I had said that performance might not be as much of an issue if we don't retrieve the entire trajectory. For performance, we should actually implement both, instrument them, and see which one is faster, including over a cellular network or with artificially induced latency.

We could just start with what is easier to implement and then do the performance evaluation later - there's a lot of other performance stuff we can do with caching for example.
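The later performance evaluation could be as simple as timing both retrieval paths over several runs; a generic sketch, where `fn` stands in for whichever retrieval call is being measured (separate confirmed_trip + confirmed_place requests vs. a single composite request):

```python
import time

def median_latency(fn, *args, repeats=5):
    """Time fn(*args) over several runs and return the median latency
    in seconds. Repeating and taking the median dampens the jitter you
    get over a cellular network or with artificially induced latency."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return sorted(samples)[len(samples) // 2]
```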

But while thinking about choosing between (1) and (2), we may want to look at some longer term considerations as well, if they are very close in terms of ease of implementation.

  • We do want to finish the diary and label unification. The big differences between diary and label right now are:
    (1) diary displays the inferences from the sensor data (5% walk, 80% train, 15% bike), and it displays the trajectory in different colors in the map to correspond to those
    (2) diary has draft trips
    (3) diary has date selection

Three high level considerations:
(a) For (1), the inferences from the sensor data are actually in sections if you remember the data model, which are part of a trip. So if we wanted to display the sensor data inferences, we would need to pull the sections as well, or incorporate section data into the trip in some way, at which point we are maybe getting towards a composite_trip-kind of implementation.

(b) You might also want to think about where the code lives and how many copies we have to keep in sync. If we got the composite trip, the trip -> geojson conversion could live on the server and we wouldn't have to do the trip2geojson conversion on the phone. Although maybe you could argue that after the diary is removed, we don't need to do the trip -> geojson conversion on the server and it can live only on the phone.

(c) If we do want to cache trips locally for better performance and to work even when there is no network, then again, would one option work better than the other? We can cache the previous request, but we can also periodically push the most recent trips to the phone at the same time as we are pushing up data from the phone to the server (as part of a "periodic sync"). If you look at the server code, the phone -> server transfer happens through usercache/put, but there is also a usercache/get which downloads the geojson for the past 3 days onto the phone. Again, this was designed to work with the diary, so it downloads one day at a time, and because we need to work with draft trips, it doesn't really work that well any more, but we can redesign it for the label screen and actually implement it properly.

@sebastianbarry (Author) commented Feb 1, 2023

@JGreenlee @shankari
WRT deciding on whether we will use:

  • confirmed_trip and confirmed_place

OR

  • composite_trip which contains info on the trip and place

| | confirmed_trip/place | composite_trip |
| --- | --- | --- |
| Pros | We can grab only the information we need: tailor the request to grab only trips, or trips and places at the same time. There is already an array of keys in the phone code, so if we only needed trips we could grab only trips. | In the future, as we unify features from the Diary into the Label screen, the "guessed" modes (bike%/walk%/drive%) come from the sections data type; sections are currently not part of confirmed_trip objects, so this could be a use for composite_trip. Another pro is that this approach relies on the server to put the required data inside composite_trip (for example confirmed_trip + inferred_sections + user_input), so e-mission-phone's job is simply to check whether the objects it needs exist and parse the data from there. |
| Cons | All of the trip/place matching needs to happen in the phone code, resulting in greater complexity in e-mission-phone. If we had to grab section information separately, it would need to be tied into the trips. E-mission-phone would need to explicitly ask e-mission-server for exactly the information it needs (confirmed_trip and/or sections and/or confirmed_place). | We may not use places in most studies/programs, so adding a whole new data type called composite_trip may be too much. Because composite_trip contains all the possible information about a trip in one object, the payload will be quite large and will require a lot of network usage to load for every single trip. |

Thoughts:

  • If the inferred_sections that the diary screen currently uses will not be too complex to add into confirmed_trip, then we should go with confirmed_trip/place
  • Otherwise, if adding the inferred_sections data to be a part of confirmed_trip will be too complicated, then we should opt for composite_trip instead

WE WILL MAKE THIS FINAL DECISION ON FRIDAY MORNING 9:00AM MST

@sebastianbarry (Author) commented Feb 3, 2023

| | Diary screen | Label screen |
| --- | --- | --- |
| In js/diary/services.js | Calls `CommHelper.getTimelineForDay(day).then(...` | Calls `timeline.readAllConfirmedTrips = function(endTs, deltaTs) {...` |
| Server key read (in CommHelper service) | Calls `window.cordova.plugins.BEMServerComm.getUserPersonalData("/timeline/getTrips/"+dateString, resolve, reject);` with key `manual/incident` | Calls `window.cordova.plugins.BEMServerComm.pushGetJSON("/datastreams/find_entries/timestamp", msgFiller, resolve, reject);` with key `analysis/confirmed_trip` |
| Server data model core/wrapper | Uses incident.py | Uses confirmedplace.py |

I can see the incident getting read in webserver.log, SPECIFICALLY FOR the diary screen (clear webserver.log, load only the diary screen, and see which POST requests are made):
[Screenshot: POST requests in webserver.log]

The problem is that I don't see why incident would return geojson files, because core/wrapper/entries.py says this:

# incidents (smiley/frownie) reported by the user from the phone
"manual/incident": "incident",

@shankari (Contributor) commented Feb 3, 2023

If the inferred_sections that the diary screen currently uses will not be too complex to add into confirmed_trip, then we should go with confirmed_trip/place

You should check with @allenmichael099 - I believe he plans to also store some summary of the section into the trip for the "count every trip" calculations. You should work with him on creating a standard representation that works both for "count every trip" and for displaying the sections in the UI

Look carefully at the diary screen; we don't just have the % at the top, we also color the trajectory in the map with different colors.
(1) We may want to design what a few merged diary/label screens might look like - e.g. the % with icons may not be the best option, and we may or may not want to have the colors in the trajectory. (2) We should make sure that the representation does support the range of final designs.

wrt #840 (comment)
the first two rows are correct, but I am not sure how you went from the second row to the third.
It is true that the call reads incidents, but is that the only call that it makes? Recall that the diary call constructs a full geojson on the server and returns it to the phone.

@shankari (Contributor) commented Feb 3, 2023

on the label screen: we retrieve data model objects directly and construct the geojson on the phone
on the diary: the server constructs the geojson and returns it directly

@shankari (Contributor) commented Feb 3, 2023

We will have confirmed_place in the data model. We do not want to tie the internal data model to the representation on the phone or to the communication between the server and the phone, because that does not give us sufficient flexibility to change if we need to in the future, and is bad from an architectural standpoint.

Note that we use geojson for the actual UI representation regardless of the options. The geojson can be complex or simple. The label geojson is simple, the diary geojson is complex.

what we are deciding now is whether

  • we will stop with confirmed_place and do the mapping from the raw data model -> representation on the phone
  • or create an additional composite_trip object on the server to represent the geojson. this would be similar to the current geojson returned from the server, but could potentially be simplified to include fewer properties and smaller trajectories
    • create the composite trip on the fly when requested OR
    • create the composite trip as part of the processing and store in the database for retrieval

Lots of options

but once we have the data model (confirmed_place), we can swap between these fairly easily
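For illustration, the swap is easy because a composite trip is derived data; a hypothetical on-the-fly construction from the underlying data-model objects (all field names here are assumptions, not the actual schema):

```python
def build_composite_trip(confirmed_trip, start_place, end_place, sections):
    """Hypothetical sketch: derive a composite_trip from the underlying
    confirmed_trip / confirmed_place / section objects. Because the
    composite is derived, we can later switch between building it on the
    fly when requested and precomputing it in the pipeline and storing
    it in the database, without touching the underlying data model."""
    composite = dict(confirmed_trip)
    composite["start_confirmed_place"] = start_place
    composite["end_confirmed_place"] = end_place
    composite["sections"] = [
        {"mode": s["mode"], "distance": s["distance"]} for s in sections
    ]
    return composite
```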

@shankari (Contributor) commented Feb 3, 2023

To test whether confirmed_place has been created properly, you can just reset the pipeline and re-run it to see if it is created (assuming you have run everything up to and including the pipeline).

@allenmichael099 commented Feb 3, 2023

My approach has been to store the section modes and distances as their own 'data' fields in an "analysis/confirmed_trip" document: 'data': {..., 'section_modes': ['bicycling', 'car'], 'section_distances': [4553.37, 8721.14]}.

I added these fields to my copy of the database by fetching all of the inferred sections ("analysis/inferred_section") for each trip. To fetch them, I used … I add up mean section energy intensities * section distances for a trip to get the trip-level mean energy consumption. I keep the distances in meters currently. When I find the variance of the energy consumption estimate for a trip or set of trips, I find total distances in each mode based on the section distances. Calculating other metrics will probably involve these section modes and distances as well.
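The calculation described here, sketched in Python (the intensity values below are made up for illustration; real per-mode energy intensities would come from elsewhere):

```python
def trip_energy(section_modes, section_distances, intensities):
    """Trip-level mean energy consumption: sum over sections of
    mean section energy intensity * section distance (in meters),
    matching the section_modes / section_distances fields above."""
    return sum(intensities[mode] * dist
               for mode, dist in zip(section_modes, section_distances))

# Using the example fields above; intensity values are illustrative only.
energy = trip_energy(["bicycling", "car"], [4553.37, 8721.14],
                     {"bicycling": 0.0, "car": 0.002})
```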

@sebastianbarry (Author) commented Feb 3, 2023

We have answered our first question (should we use confirmed_trip + confirmed_place, OR composite_trip?):

We will go with composite_trip and here's why:

[Diagram: proposed composite_trip data model]

Considering the diary/label screen combination:
Considering the diary/label screen combination:
For performance, the phone will grab only the composite_trip object (containing all the data currently contained in confirmed_trip), essentially copying the way the "Label" screen currently generates the geoJSON. The difference between the current "Label" screen confirmedTrip2Geojson() and this proposed solution is that we will use the section_distances values from the count-every-trip work to generate multiple geoJSON objects, one per detected mode, on e-mission-phone (performant because we are not generating these on the server, so less hang-up time on the server).


For example, if a composite_trip has 3 modes (bike for 1 mile, walk for 1.5 miles, car for 3 miles), then e-mission-phone will

  • grab the object shown in this data model (composite_trip) from e-mission-server,
  • then it will run a function compositeTrip2Geojson, which will follow the raw_trip_data until it reaches the first mile, combining that first mile of points into one geoJSON object, which will be colored GREEN for bike.
  • Then it will follow the next 1.5 miles of raw trip data and construct the second geoJSON object, which will be colored BLUE for walk.
  • Then it will finish with the remaining 3 miles of points and construct the final geoJSON object, colored RED for car.
  • Then, these 3 geoJSON objects will be displayed on the map with their respective colors on the Label screen
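The splitting step described above can be sketched like this, assuming the phone has the trip's ordered [lon, lat] coordinates and the per-section distances in meters (the splitByDistance function and its exact signature are hypothetical, not existing phone code):

```javascript
// Haversine distance in meters between two [lon, lat] points
function haversine([lon1, lat1], [lon2, lat2]) {
  const R = 6371000, toRad = (d) => d * Math.PI / 180;
  const dLat = toRad(lat2 - lat1), dLon = toRad(lon2 - lon1);
  const a = Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(a));
}

// Walk the trajectory, cutting a new LineString feature each time the
// cumulative distance crosses a section boundary
function splitByDistance(coords, sectionDistances, sectionModes) {
  const features = [];
  let secIdx = 0, cumulative = 0, boundary = sectionDistances[0];
  let current = [coords[0]];
  for (let i = 1; i < coords.length; i++) {
    cumulative += haversine(coords[i - 1], coords[i]);
    current.push(coords[i]);
    if (cumulative >= boundary && secIdx < sectionDistances.length - 1) {
      features.push({ type: 'Feature', properties: { mode: sectionModes[secIdx] },
        geometry: { type: 'LineString', coordinates: current } });
      current = [coords[i]]; // next section starts where this one ended
      boundary += sectionDistances[++secIdx];
    }
  }
  features.push({ type: 'Feature', properties: { mode: sectionModes[secIdx] },
    geometry: { type: 'LineString', coordinates: current } });
  return features;
}
```

Each returned feature carries its section's mode, so the map layer can color it (GREEN/BLUE/RED in the example above) without any server involvement.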

@shankari
Copy link
Contributor

shankari commented Feb 4, 2023

Can either @sebastianbarry or @JGreenlee list out what you understood by composite_trip versus cleaned_trip and cleaned_place? Because your update in #840 (comment) appears to be self-contradictory

@sebastianbarry
Copy link
Author

sebastianbarry commented Feb 6, 2023

list out what you understood by composite_trip versus cleaned_trip and cleaned_place?

@shankari my understanding is that composite_trip is a composition containing the important data of confirmed_trip, confirmed_place, and any other data that the label screen would need in the future, such as the count-every-trip data (like my diagram shows). Are you suggesting that composite_trip does not combine confirmed_trip and confirmed_place together?

@sebastianbarry
Copy link
Author

sebastianbarry commented Feb 6, 2023

After the call with @shankari and @JGreenlee today, we found out the following:

Regarding the way that the label screen currently generates Geojson

  • In the confirmedTrip2Geojson function, it makes a CommHelper call to the server for all "analysis/recreated_location" points in the range of timestamps between the start_ts and end_ts found in our confirmed_trip object
    • The getRawEntries() function receives a max_entries argument, which the server uses to return an equal dispersion of at most max_entries points throughout the trip
  • Then, it organizes all of these to match the standard GeoJSON format

What about trajectories (the line we display in the map, typically the line object in the geojson)?

  1. Having the server give us the trajectories as a part of the composite_trip object (one call to the server returns the geojson representation of the recreated_location points in between the start_ts/end_ts)

    • One server call, one composite trip object
  2. Generating the trajectories locally on the phone (composite_trip only contains start_ts/end_ts; a separate function calls getRawEntries to receive the raw data points and formats them into GeoJSON)

    • Composite trip object contains the start_ts and end_ts and all the information
    • 2 separate server calls in the code: 1 for getting composite_trip, 1 for getting the recreated_location points to get the trajectories to display on the label screen

Pros to # 1:

  • Performance benefit to the phone (because the phone will not need to format the points into Geojson)
  • Less complexity on the phone code (KEEP IN MIND, we need to do this on the phone anyway to read draft trips)
  • This is structured more clearly

Pros to # 2:

  • Performance benefit to the server
  • If we consider the current Label screen implementation, we can display the composite_trips
  • We need to do it on the phone anyway, to handle trips

How do we decide between #1 and #2? What should be the determining factor?

  • Is there any additional information we need? How do we get it? (setting up performance judgement on the phone and server)

Question on # 2: Do recreated_location points already contain information about what section/mode they belong to? If so, do we not need to put that in composite_trip? (as a part of figuring out the multi-modal sections to generate their respective trajectories)

  • The answer is that the recreated_location points that we grab in getRawEntries contain the section and its associated guessed mode:

core/wrapper/motionactivity.py

class MotionTypes(enum.Enum):
    IN_VEHICLE = 0
    BICYCLING = 1
    ON_FOOT = 2
    STILL = 3
    UNKNOWN = 4
    TILTING = 5
    WALKING = 7
    RUNNING = 8
  • Because we know that the recreated_location points contain this information, we no longer need to include the count_every_trip information in the composite_trip object returned from the server, because we will be able to generate the multi-modal trajectories using this information (and it will likely make it easier to do so)
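Given those per-point annotations, the multi-modal grouping could be sketched like this (the field names section, mode, longitude, and latitude are assumptions about the point schema, not the exact server format):

```javascript
// Group location points into one trajectory per section, assuming each
// point carries the section it belongs to and that section's guessed mode
// (field names here are hypothetical, not the exact server schema)
function groupBySection(points) {
  const sections = new Map();
  for (const p of points) {
    if (!sections.has(p.section)) {
      sections.set(p.section, { mode: p.mode, coords: [] });
    }
    sections.get(p.section).coords.push([p.longitude, p.latitude]);
  }
  return [...sections.values()];
}
```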

@sebastianbarry
Copy link
Author

Here is what we have decided:

We will go with # 2, because this is the currently implemented way on the Label screen, and the label screen has some key benefits over the way the Diary screen functions

  • Label screen loads much faster, because it does not have to wait for all the trajectories to be returned in order to display the infinite_scroll_trip_objects
  • It does not require us to specifically code the server to what the phone needs

With this, composite_trip will just be a way of linking trips together with their associated location points/trajectories, similar to how confirmed_trip handles this

The composite_trip object will be generated on the fly, meaning we do not need to store composite_trip as its own data object on the server. We will simply receive a call on the server for analysis/composite_trip and then return the information inside of confirmed_trip, confirmed_place, and count_every_trip_info

Then, later on in the phone code when we are ready to get the trajectories, we can make an additional asynchronous call to the server to get the raw entries, and then on the phone we will generate the multi-modal trajectories to be displayed on the maps
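A sketch of that two-call flow, with the helper and key names assumed from the snippets quoted later in this thread (the real phone code may differ):

```javascript
// First call: fetch the composite_trip objects for the time range.
// Second round of calls (asynchronous, per trip): fetch the downsampled raw
// points, from which the phone builds the multi-modal trajectories.
async function loadTimeline(commHelper, endTs, deltaTs) {
  const trips = await commHelper.getRawEntries(
    ['analysis/composite_trip'], endTs - deltaTs, endTs, 'data.end_ts');
  return Promise.all(trips.map(async (trip) => {
    const points = await commHelper.getRawEntries(
      ['analysis/recreated_location'],
      trip.data.start_ts, trip.data.end_ts, 'data.ts', 100);
    return { trip, points };
  }));
}
```

Note that this is the O(t) pattern discussed below: one call for the trip list, then one trajectory call per trip.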

@shankari
Copy link
Contributor

shankari commented Feb 6, 2023

@sebastianbarry @JGreenlee I don't think we are ready to make the decision yet. (This is at least the third decision we have made, and this is a flip from the first decision, #840 (comment)).

with these pros/cons, there are some assumptions. what are those assumptions and how can we validate them?

aka

(setting up performance judgement on the phone and server)

and even with the pros/cons, are we thinking through all of them?
For example: is the usability of the current label screen super good? The latency is kind of good, but are there drawbacks in other aspects?

since this is a long-term design that will form the foundation for diary/label unification, I would encourage you to not treat label screen as perfection. It has some drawbacks too. @sebastianbarry I am not sure if you remember the drawbacks from the user interviews that you attended and the email that I forwarded from Jeanne.

@shankari
Copy link
Contributor

shankari commented Feb 6, 2023

Not sure if this is going to help or confuse you further, but there is some related community discussion here:
e-mission/e-mission-server#899 (comment)

@sebastianbarry
Copy link
Author

My thoughts on comparing the magnitude of performance between the # 1 method of retrieving trip data and the # 2 method:

Time comparison

Typically, in my classes (such as Algorithms and Parallel Computing) we can measure the temporal performance of a method by putting some sort of executionTime = timer.start() at the beginning of the function you wish to measure, and then an executionTime = timer.end(), and log the result. Is this how we would want to measure the performance in this case too? (If so, do we have any examples of doing this in any Pull Requests or anything?)
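That timing pattern can be sketched in the phone code's language like this (timed is a hypothetical helper for illustration, not existing OpenPATH code; performance.now() is a standard browser/Node API):

```javascript
// Wrap a function call with wall-clock timing and log the elapsed time
function timed(label, fn) {
  const t0 = performance.now();
  const result = fn();
  const elapsedMs = performance.now() - t0;
  console.log(`${label} took ${elapsedMs.toFixed(1)} ms`);
  return result;
}

const sum = timed('sum', () => 2 + 3);
```

For asynchronous server calls, the same pattern works by taking the second timestamp inside the promise's .then handler instead of immediately after the call.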

Accuracy comparison

The other thing we would measure in my classes was accuracy: how accurate is the function? This can be done by inputting a set of data for which you already know the expected output, and comparing the received output with the expected output. From what I know from reading your thesis, e-mission-server will always give the same output for the same input, so I can trust that this is not a factor on the server. Where this might come into play is in how the received raw data from the server is handled on the phone. I am not sure how to test this in regards to OpenPATH, as in my classes we have only done accuracy comparisons on input and output sets of integer numbers, not latitude/longitude decimal points.

Features comparison(?)

The only other way I can think of to compare the 2 methods is to qualitatively compare their features. Here is my comparison table:

| | Method # 1 | Method # 2 |
| --- | --- | --- |
| Receives the raw trip trajectories from the server (truncated, where max_entries determines the truncation amount) | | ✓ |
| Receives the geojson trajectories from the server | ✓ | |
| Where does the geojson formatting happen? | Server | Phone |
| Number of calls to the server | 1 (updateForDay's getTimelineForDay) | 2 (readAllConfirmedTrips's getRawEntries with parameters for the confirmed_trip general data & confirmedTrip2Geojson's getRawEntries with different parameters for the raw trip trajectory data) |

@shankari
Do you have any other methods for measuring the difference between these 2 methods? Or can you confirm that I am on the right track with measuring the difference, and if so, do you know any examples of us doing this before?

@shankari
Copy link
Contributor

shankari commented Feb 8, 2023

@sebastianbarry this is on the right track.

But first, I am not sure you are measuring the same things: you say that you are comparing #1 vs. #2 from
#840 (comment)
but is that what your table is comparing? It looks like your table is comparing diary (getTimelineForDay) as one of the options.

wrt the time option, use the breakpoints and search through the code to find your answer. You have access to the full source code, and you should know how to search through it.

wrt the accuracy option, you are right that it is not relevant here.

@shankari
Copy link
Contributor

shankari commented Feb 8, 2023

Option 1: separate cleaned_trip and cleaned_place
Option 2a: composite trip with downsampled trajectories
Option 2b: composite trip without trajectories, trajectories are retrieved separately

For Option 2:
Option 2, impl 1: pre-created and pre-saved composite trip objects
Option 2, impl 2: create composite trips on the fly

Assumptions in your decision:

  • slowness is due to the size of the object being retrieved
  • making multiple calls is not a performance or complexity hit or UX hit
    • Note that in option 2a, we will make one call to get a set of c* trips and in option 2b, we then make a O(t) number of calls to retrieve the trajectories
      const readPromises = [
        CommHelper.getRawEntries(["analysis/confirmed_trip"],
            endTs - deltaTs, endTs, "data.end_ts"),
      ];

This reads all confirmed_trip objects in the range of (endTs - deltaTs, endTs)

        const fillPromises = [
            CommHelper.getRawEntries(["analysis/recreated_location"], trip.start_ts, trip.end_ts, "data.ts", 100)
        ];

So this is called for every trip

Even just having a bunch of server calls outstanding at the same time can be performance intensive on the phone as well. Complaints were "app is hanging" after trips load, so I do some bottlenecking on the phone.

  const placeLimiter = new Bottleneck({ maxConcurrent: 2, minTime: 500 });
  const mapLimiter = new Bottleneck({ maxConcurrent: 3, minTime: 100 });

My intuition is that the slowness in the diary is not just due to trajectory retrieval, but is primarily due to the database calls required to construct the geojson. Note that while constructing the geojson, we first get all the trips, and then we read all the sections for each trip, and then we read all the trajectories for each section.

So if you have 5 trips, each with 3 sections, you will end up with 5 + 5 + 15 = 25 calls to the database before you can construct the geojson and return it.
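The counting above can be written down as a small model (a sketch of the stated arithmetic, not code from the server):

```javascript
// DB-call count for constructing geojson, as counted above:
// one call per trip to fetch it, one per trip for its sections,
// and one per section for the trajectory points
function geojsonDbCalls(numTrips, sectionsPerTrip) {
  return numTrips + numTrips + numTrips * sectionsPerTrip;
}
```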
In the label screen, for the initial CommHelper.getRawEntries(["analysis/confirmed_trip"], call, we have one call to the DB.

If part of the performance issue is the trajectory size, that is easily fixable by including the downsampled trajectory in the composite trip object
If the bigger part of the performance issue is the number of database calls, we can go with 2 (a), but even with 2 (a), to construct the composite trip for each confirmed trip after you have the list of trips, you will need to make 4 calls (we can get this down to 2 calls by retrieving all sections and then all trajectory points for that trip and then matching them downstream). But an even better fix is Option 2, impl 1.

With option 2, impl 1:

  • how many network calls? 1 call to the server (also for option 2 in general both impl)
  • how many DB calls? 1 call (similar to the current confirmed_trip call in terms of speed, and it will just get everything that we need to display)

for the core functionality that you have (trip diary/label screen) you want to do as little as possible on the critical path before returning to the user; that is the motivation for option 2, impl 1 vs. option 2, impl 2

To downsample diary calls:
File: emission/analysis/plotting/geojson/geojson_feature_converter.py
Function: section_to_geojson after we read the locations, downsample them like we do in the cfc_webapp

Measurement:

  • downsampled extremely versus not-downsampled in the diary call: How much does it affect the response time? Could add some timing lines on the phone just to check response time on the phone as well because the server instrumentation will not capture the serialization/deserialization overhead.
  • look at the logs for both getTrips from the diary and trip find entries from the label screen (since they are the calls on the user visible latency path) and see where the time is going

@sebastianbarry
Copy link
Author

sebastianbarry commented Feb 10, 2023

For the Diary Screen (time before promise to .then / time in webserver.logs):
Using the Jul 26 2016 data

| | Not downsampled | Downsampled to 100 points | Downsampled to 5 points |
| --- | --- | --- | --- |
| Local machine speed (time from CommHelper.getTimelineForDay(day) to .then) | 5.067999999999 seconds | | |
| Slow internet speeds | | | |
| Fast internet speeds | | | |

@sebastianbarry you can look along a spectrum of downsampling instead of a binary (edited to clarify)

@shankari
Copy link
Contributor

FYI, AWS instance to DocumentDB is ~ 3x slower than laptop -> docker on laptop
#721 (comment)

@shankari
Copy link
Contributor

There are 3 collections we read from:

  • usercache
  • timeseries
  • analysis_timeseries

all are collections in the same database

this is when we start the call to the timeseries/analysis_timeseries

2023-02-10 17:00:38,175:DEBUG:140490587621120:curr_query = {'invalid': {'$exists': False}, 'user_id': UUID('71d6fa29-f596-4b3a-920e-67680f423537'), '$or': [{'metadata.key': 'manual/purpose_confirm'}], 'metadata.write_ts': {'$lte': 1676048447, '$gte': 1676044322.325}}, sort_key = metadata.write_ts

this is when the timeseries/analysis timeseries calls return (purpose_confirm is a user input so it will be in the timeseries and not in the analysis outputs)

2023-02-10 17:00:38,211:DEBUG:140490587621120:finished querying values for ['manual/purpose_confirm'], count = 0
2023-02-10 17:00:38,211:DEBUG:140490587621120:finished querying values for [], count = 0

This is when the usercache returns

2023-02-10 17:00:38,427:DEBUG:140490587621120:Found 0 messages in response to query {'user_id': UUID('71d6fa29-f596-4b3a-920e-67680f423537'), '$or': [{'metadata.key': 'manual/purpose_confirm'}], 'metadata.write_ts': {'$lte': 1676048447, '$gte': 1676044322.325}}

Note that if we haven't yet processed the user inputs (like purpose_confirm) they will be in the usercache, so we do have to check it

In the geojson code, where we are working with processed data, we should not have to check the usercache and we don't - for example:

2023-02-10 17:02:47,843:DEBUG:140490886870784:curr_query = {'invalid': {'$exists': False}, 'user_id': UUID('71d6fa29-f596-4b3a-920e-67680f423537'), '$or': [{'metadata.key': 'analysis/recreated_location'}], 'data.ts': {'$lte': 1675952702.563, '$gte': 1675950561.8190508}}, sort_key = data.ts
2023-02-10 17:02:47,843:DEBUG:140490886870784:orig_ts_db_keys = [], analysis_ts_db_keys = ['analysis/recreated_location']
2023-02-10 17:02:47,843:DEBUG:140490886870784:finished querying values for [], count = 0
2023-02-10 17:02:47,907:DEBUG:140490886870784:finished querying values for ['analysis/recreated_location'], count = 73
2023-02-10 17:02:47,907:DEBUG:140490886870784:orig_ts_db_matches = 0, analysis_ts_db_matches = 73

@JGreenlee
Copy link

JGreenlee commented Feb 11, 2023

using July 25 test data (2 trips) and no network throttling

label screen load

Total time to fulfill all requests: 523ms

  • Queries to timeseries/analysis_timeseries: 6
    -- 34ms total | avg <6ms per query
  • Queries to usercache: 6
    -- 29ms total | avg <5ms per query

Database queries are not taking a significant amount of time for the label screen flow.
The majority of time is actually spent waiting for the next request (since trajectories are requested by the phone only after trips are received)

diary screen load

Total time to fulfill all requests: 726ms

  • Queries to timeseries/analysis_timeseries: 8
    -- 27ms total | avg <4ms per query
  • Queries to usercache: 4
    -- 15ms total | avg <3ms per query
  • Queries for stops and sections: 6
    -- 419ms total | avg ~70ms per query

Similarly to the label screen, the queries for trips, user inputs, and locations do not take much time.
Most of the time comes from the queries for stops and sections (keys analysis/inferred_section, analysis/cleaned_stop, and analysis/cleaned_section were counted)

Conclusion

The database calls referenced in #840 (comment) do not significantly impact our performance on either the label screen or the diary screen.
Instead, the expensive queries (and a likely reason for diary screen slowness) are for analysis/inferred_section, analysis/cleaned_stop, and analysis/cleaned_section

Next discussion

Would pre-saving sections and stops in a composite_trip object help us limit these expensive calls?
Why do those queries take so long in the first place?

@shankari
Copy link
Contributor

@JGreenlee the database calls in #840 (comment) are examples.

  • We make user input calls for mode_confirm in addition to purpose_confirm
  • We make calls to analysis/cleaned_place, analysis/inferred_section, and analysis/cleaned_stop in addition to analysis/recreated_location

To answer your questions:

  • I am not sure why calls to retrieve the sections and stops are that much slower than the call to retrieve the locations.
  • Yes, because we will make the calls while generating the objects as part of the intake pipeline, which happens in the background. So the calls will still be expensive, but the user won't see the slowness.

@shankari
Copy link
Contributor

shankari commented Feb 15, 2023

wrt throttling, we are trying to determine what the network does with large object transfers
the diary is currently a large object transfer
so if this is a problem, then when we connect our real phones to the real server (not on WiFi), and access the diary, then the response time on the phone will be >>> the time taken on the server. In other words, there will be "unaccounted" time between what we see on the server and what we see on the phone. If possible, we could check this by looking at the difference between the END POST on the server and the corresponding Received response 200 (or similar) on the phone.

That will let us estimate the serialization/deserialization overhead on a realistic system. If that is small, I would just say that we don't want to worry about the object size right now.

We are currently leaning towards (#840 (comment)) option 2, impl 1

@MaliheTabasi
Copy link

Hi @shankari @sebastianbarry

Following our meeting today, I wanted to share a couple of UI-related comments:

1- We suggest that time-use activities be left-aligned instead of centered.

2- Currently, when tapping the delete button, the activity is instantly deleted. It would be better to have a pop-up confirmation message asking the person if they are sure they want to delete the activity. This would prevent accidental deletion of activities.

Thanks.

@shankari
Copy link
Contributor

@sebastianbarry can you deal with these minor UI issues tomorrow so that @JGreenlee can deal with the hanging UI once he returns?

@shankari
Copy link
Contributor

Closing this since we have merged the related change to master
e-mission/e-mission-phone#953

@github-project-automation github-project-automation bot moved this from Current two week sprint to Sprint tasks completed in OpenPATH Tasks Overview Apr 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants