Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Many siri_rides are missing a matching gtfs_ride_id #9

Open
ShayAdler opened this issue Dec 18, 2022 · 3 comments
Open

Many siri_rides are missing a matching gtfs_ride_id #9

ShayAdler opened this issue Dec 18, 2022 · 3 comments

Comments

@ShayAdler
Copy link
Contributor

It might have been reasonable in small numbers, but it is more than 50,000 rides out of 273,547 rides in the last week -

select count(*) from siri_ride where gtfs_ride_id is null
                        and scheduled_start_time < '2022-12-15 00:45:00.000000 -- outputs 52311' 

The relevant ETL should be running every hour.
@OriHoch where can I see its logs? Trying to understand if the ETL doesn't finish well or that the logic is broken.

@OriHoch
Copy link
Contributor

OriHoch commented Dec 18, 2022

sent you credentials to airflow in slack, you can see logs and etl runs there

regarding the problem, I ran the following query:

select date_trunc('day', scheduled_start_time), count(*)
from siri_ride where gtfs_ride_id is null
and scheduled_start_time < '2022-12-15' and scheduled_start_time > '2022-11-01'
group by date_trunc('day', scheduled_start_time)

it looks reasonable:

date_trunc count
2022-11-01 00:00:00.000000 1290
2022-11-02 00:00:00.000000 1325
2022-11-03 00:00:00.000000 5186
2022-11-04 00:00:00.000000 876
2022-11-05 00:00:00.000000 386
2022-11-06 00:00:00.000000 1607
2022-11-07 00:00:00.000000 1308
2022-11-08 00:00:00.000000 1451
2022-11-09 00:00:00.000000 1254
2022-11-10 00:00:00.000000 1668
2022-11-11 00:00:00.000000 720
2022-11-12 00:00:00.000000 359
2022-11-13 00:00:00.000000 411
2022-11-14 00:00:00.000000 1231
2022-11-15 00:00:00.000000 1195
2022-11-16 00:00:00.000000 1413
2022-11-17 00:00:00.000000 1198
2022-11-18 00:00:00.000000 732
2022-11-19 00:00:00.000000 411
2022-11-20 00:00:00.000000 1149
2022-11-21 00:00:00.000000 1298
2022-11-22 00:00:00.000000 1282
2022-11-23 00:00:00.000000 1220
2022-11-24 00:00:00.000000 1328
2022-11-25 00:00:00.000000 612
2022-11-26 00:00:00.000000 345
2022-11-27 00:00:00.000000 446
2022-11-28 00:00:00.000000 1112
2022-11-29 00:00:00.000000 1137
2022-11-30 00:00:00.000000 1053
2022-12-01 00:00:00.000000 1310
2022-12-02 00:00:00.000000 612
2022-12-03 00:00:00.000000 307
2022-12-04 00:00:00.000000 1093
2022-12-05 00:00:00.000000 1055
2022-12-06 00:00:00.000000 1076
2022-12-07 00:00:00.000000 1171
2022-12-08 00:00:00.000000 1155
2022-12-09 00:00:00.000000 621
2022-12-10 00:00:00.000000 344
2022-12-11 00:00:00.000000 964
2022-12-12 00:00:00.000000 1137
2022-12-13 00:00:00.000000 1457
2022-12-14 00:00:00.000000 1295

@OriHoch
Copy link
Contributor

OriHoch commented Dec 18, 2022

for reference, this is the number of rows where gtfs_ride_id is not null:

date_trunc count
2022-11-01 00:00:00.000000 84368
2022-11-02 00:00:00.000000 102566
2022-11-03 00:00:00.000000 102905
2022-11-04 00:00:00.000000 49700
2022-11-05 00:00:00.000000 22703
2022-11-06 00:00:00.000000 95849
2022-11-07 00:00:00.000000 93432
2022-11-08 00:00:00.000000 103404
2022-11-09 00:00:00.000000 103936
2022-11-10 00:00:00.000000 100097
2022-11-11 00:00:00.000000 49913
2022-11-12 00:00:00.000000 22987
2022-11-13 00:00:00.000000 102801
2022-11-14 00:00:00.000000 102954
2022-11-15 00:00:00.000000 103777
2022-11-16 00:00:00.000000 103419
2022-11-17 00:00:00.000000 103065
2022-11-18 00:00:00.000000 50217
2022-11-19 00:00:00.000000 22300
2022-11-20 00:00:00.000000 100661
2022-11-21 00:00:00.000000 102834
2022-11-22 00:00:00.000000 103020
2022-11-23 00:00:00.000000 102198
2022-11-24 00:00:00.000000 103254
2022-11-25 00:00:00.000000 50025
2022-11-26 00:00:00.000000 23164
2022-11-27 00:00:00.000000 104564
2022-11-28 00:00:00.000000 104452
2022-11-29 00:00:00.000000 104342
2022-11-30 00:00:00.000000 104482
2022-12-01 00:00:00.000000 102095
2022-12-02 00:00:00.000000 50354
2022-12-03 00:00:00.000000 23033
2022-12-04 00:00:00.000000 104298
2022-12-05 00:00:00.000000 104853
2022-12-06 00:00:00.000000 103888
2022-12-07 00:00:00.000000 104130
2022-12-08 00:00:00.000000 104083
2022-12-09 00:00:00.000000 49862
2022-12-10 00:00:00.000000 22726
2022-12-11 00:00:00.000000 103725
2022-12-12 00:00:00.000000 103929
2022-12-13 00:00:00.000000 102961
2022-12-14 00:00:00.000000 103381

@ShayAdler
Copy link
Contributor Author

Thanks :)

You are right, this is not a significant precent (1-2% per day), but still..
I actually got to this from the opposite - I'm trying to search if any gtfs_ride that doesn't have a matching siri_ride is indeed a ride that was never performed. I assumed there will be no siri rides without planned gtfs almost at all

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants