-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rides that didn't happen - Issue linking siri rides to gtfs rides #10
Comments
Might not be related, but as another sanity, I tried a simple comparison of the counts, and the results are weird (sometimes there are more siri rides than gtfs)
Maybe the drivers had some spare time 😅 |
regarding linking - we prefer the journey_gtfs_ride_id as it's more exact based on the journey_ref... I don't remember why I added the other method, I guess I found some rides which didn't match based on journey_ref. |
this ETL depends on a lot of other data from SIRI and GTFS, so there could be different places which cause the mismatch in rides. I suggest to go back to the source raw SIRI/GTFS data and try to match those rides there. |
one more point to take into account when looking for missing data is if all the relevant data was loaded for the given times. If there was an error in the MOT interfaces or in our ETL you might see missing data for rides that did happen.. I started working on adding DB tables that show this but didn't finish. However, there are some tables which show this data -
|
So I took a specific siri ride that didn't match (26074926) and if we look at it's journey_ref (2145383_121222) we can see that it is not unique for some reason (we're looking in the raw data to find if we can find this duplication there as well) For some reason, the one that's related to the 12.12 siri_ride is the gtfs_ride from the 13.13, even though I see you do filter by date -
Do you have any idea why would it happen? |
If I try to manually join siri_rides to gtfs_ride using the same reformatting you do in the ETL, I see that only 10 out of 39 siri rides in that day link based on the journey_ref. Meaning that the majority are linked based on the route_gtfs_ride_id (which as you said is more heuristic).
If I take of of the journey refs from the siri rides that didn't match, they indeed don't match to any existing gtfs_ride |
very good work! I think it shows a lot of different problems here..
so, a lot of stuff to investigate and fix here :) |
Update: after more research, we found out that I'm currently trying to fix the ETL, these are the changes I'm doing -
@OriHoch would like your help here regarding how to test this (I will open a PR later on)-
|
I think the gtfs_date filter can be removed, we have other filters on scheduled_start_time / ride start_time which are enough (both contain also the date). The only logical problem I can think of is that we might have the same ride multiple times from different gtfs_dates (possibly with different ride start_times?). Regarding testing, I don't see a problem to test on production.. given that current data is also not accurate |
wdym by "same ride multiple times from different gtfs_dates"? aren't gtfs_rides created daily? Regarding testing, how about first adding the internal column, and then after more validations that I'll do on the entire dataset we'll move the FQ to point on my new column? |
BTW, do you know how the current syntax you used for updating works id there is more than 1 match? |
yes but the process of getting and processing the gtfs data is complex, it's not a simple download for a specific date, so need to try to make as few assumptions as possible about the correctness of the data as possible.. the gtfs date is just the date when we downloaded the data, so what happens if for example the MOT data was not updated at a given day? Or if we missed a download day? We try to handle all these situations but just need to take into account that there might be problems..
sounds good |
you need to make sure there is only 1 match.. if there is more then 1 match it will work but you can't know for sure which match will be used for the update |
I wanted to use the absence of a link between gtfs ride to a siri ride (
siri_ride.gtfs_ride_id is null
) to categorize this planned ride as a ride that did not happen.As far as I understand from the data model, this hypothesis makes sense (of course that there are other cases in which there is siri data and the ride did not happen \ finish successfully, but I wanted to start with the easiest case).
But when verifying the data, I noticed gaps between the siri ride counts to the siri rides I found unlinked.
This is the flow I checked on Eged's line 470:
Output -
Now I zoomed in to the 12.12.22 which seemed to have 5 missing rides-
Both returned 39, meaning that probably all the planned rides did happen (maybe in different timings than expected?).
Than, I went looking at the ETL code that updates
siri_ride.gtfs_ride_id
(open-bus-stride-etl/open_bus_stride_etl/siri/update_rides_gtfs.py
)I manually ran the query.
A subtraction of all the siri ride ids that did have gtfs relation from all the siri rides that day, gave me the unlinked ids -
{‘26074926’, ‘26077170’, ‘26094950’, ‘26100800’, ‘26106606’}
ETL Verification
Running the ETL query that links by
route_gtfs_ride_id
returns no results becausesiri_ride.journey_gtfs_ride_id
is not nullRunning the ETL query that links by
journey_gtfs_ride_id
returns no results either because for some reason in that case the gtfs_route.data is '2022-12-13'@OriHoch I'd very much like your help to understand the full flow of this ETL and specifically - what are the differences between
journey_gtfs_ride_id
androute_gtfs_ride_id
and why when we link only when one of them is valid?The text was updated successfully, but these errors were encountered: