Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remake mart_ntd.dim_monthly_ntd_ridership_with_adjustments with new data #3519

Open
csuyat-dot opened this issue Oct 29, 2024 · 10 comments
Open
Assignees

Comments

@csuyat-dot
Copy link
Contributor

User story / feature request

As a data analyst, I want mart_ntd.dim_monthly_ntd_ridership_with_adjustments remade as a new table that uses data from external_ntd__ridership.historical__complete_monthly_ridership_with_adjustments_and_estimates__upt

mart_ntd.dim_monthly_ntd_ridership_with_adjustments data is used in the Monthly NTD Ridership Report. Previously, this table was manually updated when FTA publishes new data from the complete monthly ridership report. However, the FTA report is in a wide format (a new column is added every month). dim_monthly_ntd_ridership_with_adjustment converted the FTA data to a long format which works well with the report site scripts.

The new external_ntd__ridership.historical__complete_monthly_ridership_with_adjustments_and_estimates__upt ingest data in initial the wide format, which is not compatible with the report site scripts.

Acceptance Criteria

A model as close as possible to initial dim_monthly_ntd_ridership_with_adjustments model where each rows is an agency's UPT for that month-year

Notes

@erikamov
Copy link
Contributor

erikamov commented Nov 5, 2024

As discussed during weekly meeting, this work is important and urgent. So I moving to our current iteration to work on it.

erikamov added a commit that referenced this issue Nov 6, 2024
estimates to bring data for new available months (July and August)

[#3519]
@erikamov
Copy link
Contributor

erikamov commented Nov 7, 2024

The URL for the xlsx file changes every month, so we will need someone to get the new URL and update it.

  • If we keep the current way, we need to change complete_monthly_ridership_with_adjustments_and_estimates.yml and create a PR, get approval and then merge the change.
  • Other option that I am thinking, if we use a environmental variable on Airflow (like we use for bucket's name) we could just replace the URL there every month. It would be easier and faster, but the downside is the change is not visible as the PR.

For new month_year columns on new files, currently we just need to add them to the yml file. Depending on the path we use for the URL change, this is just part of the changes on PR... see any new column and add them.

If we want it to be automatic, I could remove the columns definition, so any new columns would be imported. It would impact on the order of the columns that gets a little messy on External and Staging tables, also we cannot pre-define the typecast for each column the process created the typecast based on the data.

@evansiroky @csuyat-dot Would love to hear if you have any thoughts or preferences.

@evansiroky
Copy link
Member

Cc @vevetron for comments too

@erikamov
Copy link
Contributor

erikamov commented Nov 7, 2024

Some notes:

  • DAG runs: schedule_interval: "0 10 * * *" # 10am UTC every day
    We don't need it to run everyday, so I think we can change it to once a week. Generating less data on the bucket and less time on the daily external table update.

  • Weird difference between NTD ID and Legacy NTD ID after the scrape process
    If NTD ID set to string on yml file it converts to "1.0" instead of "00001" like in I see on the xlsx file.
    Then to format we need to cast to NUMERIC and then to INT: format("%05d", CAST(CAST(ntd_id AS NUMERIC) AS INT64))

    If NTD ID set to NUMERIC on yml file it converts correctly to "1" and we can use only one CAST AS INT64.

    The Legacy NTD works correctly, converting to "0001", maybe because there are some cases with string values, then the conversion works correctly.
    I think the problem is on excel conversion, will create an issue for that and for date types that I noticed a problem when working on Agency Information.

  • Even though the information bellow come from the Complete Monthly Ridership with adjustments and estimates file, it contains yearly information:

    • Displays per year - Incrementing the amount every month of the year
      • historical__complete_monthly_ridership_with_adjustments_and_estimates__calendar_year_upt
      • historical__complete_monthly_ridership_with_adjustments_and_estimates__calendar_year_vrm
      • historical__complete_monthly_ridership_with_adjustments_and_estimates__upt_estimates
      • historical__complete_monthly_ridership_with_adjustments_and_estimates__vrm_estimates
    • Average for the fiscal year
      • historical__complete_monthly_ridership_with_adjustments_and_estimates__master

@erikamov
Copy link
Contributor

erikamov commented Nov 7, 2024

Cc @vevetron for comments too

Thank you! I was going to add him and end up forgetting to tag him.

erikamov added a commit that referenced this issue Nov 7, 2024
- Also Format SQL files and reorder columns

[#3519]
@erikamov
Copy link
Contributor

erikamov commented Nov 8, 2024

@csuyat-dot @vevetron I have some questions about Mode...

The Intermediate tables are currently returning only records with the following modes:

AG, AR, CB, CC, CR, DR, FB, HR, IP, LR, MB, MG, MO, OR*, PB, RB, SR, TB, TR, VP, YR

The dim_monthly_ridership_with_adjustments is created from those intermediate tables, but it has the rules for modes:

  • Case the mode IN (AG, AR, CB, CC, CR, FB, HR, IP, IP, LR, MB, MG, MO, RB, SR, TB, TR, YR) the service_type = Fixed Route
  • Case the mode IN (DR, DT, VP, JT, PB) the service_type = Demand Response
  • For any other service_type = Unknown

QUESTION 1 For the mode OR the service_type = Unknown.
There is no description for OR on ntd_modes_to_full_names.csv, so maybe this is the reason, but I want to confirm.

QUESTION 2 DT and JT are described as Demand Response on dim_monthly_ridership_with_adjustments but since those two are not part of the list on the intermediate tables they will never show up.
Also there is no records on external tables with those modes. So can we remove DT and JT from that Case clause?

The existing modes on Complete Monthly Ridership with adjustments and estimates are:
AG, AR, CB, CC, CR, DR, FB, HR, IP, LR, MB, MG, MO, OR, PB, RB, SR, TB, TR, VP, YR

QUESTION 3 There column year is used for anything?
We don't have this same column on the new external monthly tables.
Looks like this column year just displays 2024 that is the year of the file imported. So could be removed or getting the year from other available dates like _dt (date the data was imported).

erikamov added a commit that referenced this issue Nov 8, 2024
estimates to bring data for new available months (July and August)

[#3519]
erikamov added a commit that referenced this issue Nov 8, 2024
- Also Format SQL files and reorder columns

[#3519]
@csuyat-dot
Copy link
Contributor Author

@erikamov here are my thoughts, though im not too familiar with modes

Q1: there was an old thread on Slack regarding OR. Nobody could totally confirm what OR means, but we are guessing it means Other. When looking at the Complete Monthly Ridership with adjustments and estimates excel report, Mode: OR is accompanied by the column 3 mode and Other. So maybe change Unknown to Other to be more consistent?

Q2: im personally unsure about removing/keeping DT and JT. My gut feeling is to keep both in the rare chance an agency does use that mode.

Q3: i dont use that year column for my report site. I am unsure if year is used anywhere else, but others can weigh in on it.

@erikamov
Copy link
Contributor

Note for future references:

External monthly ridership tables have some records with empty NTD_ID for modes: Reduced Reporters, Rolling 12-Month Sum, Rolling 12-Month Sum with Reduced and Rural Reporter Estimates, Rolling 12-Month Sum with Reduced Reporters, Rural Reporters, Subtotal with Reduced and Rural Reporter Estimates, Subtotal with Reduced Reporters, Total.

They are lines from the summary added to the bottom of each monthly data worksheet inside the Complete Monthly Ridership with adjustments and estimates xlsx file for estimated monthly industry totals for Rural reporters.

Image

So I am adding filters to staging tables to only return records where NTD_ID IS NOT NULL.
Intermediate and Mart tables are based on those staging tables so they will be filtering out too.

erikamov added a commit that referenced this issue Nov 13, 2024
instead of every day.

There is no need to run every day since it is a
monthly file, but having the process
running once a week give us a better window to
add and process new files once it is released.

[#3519]
erikamov added a commit that referenced this issue Nov 13, 2024
estimates to bring data for new available months (July and August)

[#3519]
erikamov added a commit that referenced this issue Nov 13, 2024
- Also Format SQL files and reorder columns

[#3519]
erikamov added a commit that referenced this issue Nov 13, 2024
instead of every day.

There is no need to run every day since it is a
monthly file, but having the process
running once a week give us a better window to
add and process new files once it is released.

[#3519]
erikamov added a commit that referenced this issue Nov 13, 2024
estimates to bring data for new available months:
July, August, and September

[#3519]
erikamov added a commit that referenced this issue Nov 13, 2024
Mart tables

- Also Format SQL files and reorder columns
- Remove irrelevant year column from dim model
- Filter out null NTD_ids
- Edit description on documentation

[#3519]
erikamov added a commit that referenced this issue Nov 13, 2024
Mart tables

- Also Format SQL files and reorder columns
- Remove irrelevant year column from dim model
- Filter out null NTD_ids
- Edit description on documentation

[#3519]
erikamov added a commit that referenced this issue Nov 14, 2024
instead of every day.

There is no need to run every day since it is a
monthly file, but having the process
running once a week give us a better window to
add and process new files once it is released.

[#3519]
erikamov added a commit that referenced this issue Nov 14, 2024
estimates to bring data for new available months:
July, August, and September

[#3519]
erikamov added a commit that referenced this issue Nov 14, 2024
models.

- Add documentation for models and columns
- Add Test for NTD_ID

[#3519]
erikamov added a commit that referenced this issue Nov 14, 2024
the new process running on Airflow.

[#3519]
erikamov added a commit that referenced this issue Nov 14, 2024
and Mart tables

- Also Format SQL files and reorder columns
- Remove irrelevant year column from dim model
- Update documentation for models and columns

[#3519]
erikamov added a commit that referenced this issue Nov 14, 2024
and Mart tables

- Format SQL files and reorder columns
- Remove irrelevant year column from dim model
- Rename columns to follow the same pattern as 
  the other new NTD models
- Add test for not Null NTD_ID
- Update documentation for models and columns

[#3519]
erikamov added a commit that referenced this issue Nov 14, 2024
estimates to bring data for new available months:
July, August, and September

[#3519]
erikamov added a commit that referenced this issue Nov 14, 2024
models.

- Add documentation for models and columns
- Add Test for NTD_ID

[#3519]
erikamov added a commit that referenced this issue Nov 14, 2024
the new process running on Airflow.

[#3519]
erikamov added a commit that referenced this issue Nov 14, 2024
and Mart tables

- Format SQL files and reorder columns
- Remove irrelevant year column from dim model
- Rename columns to follow the same pattern as 
  the other new NTD models
- Add test for not Null NTD_ID
- Update documentation for models and columns

[#3519]
erikamov added a commit that referenced this issue Nov 16, 2024
estimates to bring data for new available months:
July, August, and September

[#3519]
erikamov added a commit that referenced this issue Nov 16, 2024
models.

- Add documentation for models and columns
- Add Test for NTD_ID

[#3519]
erikamov added a commit that referenced this issue Nov 16, 2024
the new process running on Airflow.

[#3519]
erikamov added a commit that referenced this issue Nov 16, 2024
- Format SQL files and reorder columns
- Remove irrelevant year column

[#3519]
erikamov added a commit that referenced this issue Nov 16, 2024
dim_monthly_ridership_with_adjustments (removed _ntd_).

- Format SQL file and reorder columns
- Remove irrelevant year column
- Rename columns to follow the same pattern as 
  the other new NTD models
- Add surrogate key

[#3519]
erikamov added a commit that referenced this issue Nov 16, 2024
dim_monthly_ridership_with_adjustments (removed _ntd_).

- Format SQL file and reorder columns
- Remove irrelevant year column
- Rename columns to follow the same pattern as 
  the other new NTD models
- Add surrogate key

[#3519]
@erikamov
Copy link
Contributor

Remake completed:

  • dim_monthly_NTD_ridership_with_adjustments was renamed to dim_monthly_ridership_with_adjustments
  • dim_monthly_ridership_with_adjustments is now using the new NTD ridership tables
  • Added documentation

Remaining work:

  • Change sync_ntd_data_xlsx schedule to once a week instead of every day
  • Drop deprecated models
    • external_ntd_data_products.monthly_ridership_with_adjustments_upt
    • external_ntd_data_products.monthly_ridership_with_adjustments_vrm
    • external_ntd_data_products.monthly_ridership_with_adjustments_vrh
    • external_ntd_data_products.monthly_ridership_with_adjustments_voms
    • staging.stg_ntd__monthly_ridership_with_adjustments_upt
    • staging.stg_ntd__monthly_ridership_with_adjustments_vrm
    • staging.stg_ntd__monthly_ridership_with_adjustments_vrh
    • staging.stg_ntd__monthly_ridership_with_adjustments_voms
    • dim_monthly_NTD_ridership_with_adjustments

erikamov added a commit that referenced this issue Nov 22, 2024
instead of every day.

There is no need to run every day since it is a
monthly file, but having the process
running once a week give us a better window to
add and process new files once it is released.

[#3519]
@erikamov
Copy link
Contributor

Schedule changed and dropped tables.
Waiting to check if the sync_ntd_data_xlsx DAG will run only next Monday.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants