
# Feature: Build the complete database #47

Open · J-glove opened this issue Nov 20, 2024 · 10 comments
Assignees: allfires
Labels: feature (new code/logic to be written), preprocessing (issue pertains to the preprocessing portion of the project)

Comments

J-glove (Collaborator) commented Nov 20, 2024

Upload all available preprocessed exposome files into the database. This will likely be a large file, so please create it on the on-prem exposome server.

Be wary of the persistent issue where zip_9 entries pick up a trailing .0.
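
The preprocessing code isn't shown in this thread, but a trailing .0 on zip entries is the classic symptom of pandas inferring a float dtype for the column (often triggered by missing values). A minimal sketch of the usual fix, assuming a pandas-based preprocessing step and the ZIP_9 column name used later in this thread; the 9-digit zero-padding is an assumption:

```python
import pandas as pd

# Read the zip column as a string so pandas never coerces it to float
# ("326110000" stays "326110000" instead of becoming 326110000.0).
df = pd.read_csv("preprocess_ucr_zip9.csv", dtype={"ZIP_9": str})

# If a column was already written out as float, strip the artifact and
# restore any leading zeros (9-digit width is an assumption here).
df["ZIP_9"] = (
    df["ZIP_9"].astype(str)
    .str.replace(r"\.0$", "", regex=True)
    .str.zfill(9)
)
```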

J-glove added the feature and preprocessing labels on Nov 20, 2024
J-glove assigned allfires (after briefly self-assigning) on Nov 20, 2024
J-glove (Collaborator, Author) commented Dec 4, 2024

The database file is being built, but the data is not being loaded; there are mismatches between the schema and the flat file.

The loader is not recognizing the UCR table.

I haven't spent much time on this due to the holidays; I'll continue the work in sprint 19.

Suggestion for mismatched columns: create a sub-dataframe containing exactly the headers in the schema, then load from that sub-dataframe.
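
A minimal sketch of that suggestion, assuming a pandas-based loader; schema_cols and the connection URL are placeholders standing in for whatever the schema actually defines, while the UCR table name comes from this thread:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///exposome.db")  # placeholder URL

# Hypothetical list of headers taken from the table schema
schema_cols = ["ZIP_9", "YEAR", "p_murder", "p_rob"]

df = pd.read_csv("preprocess_ucr.csv", dtype=str)

# Sub-dataframe with exactly the schema's headers; extra flat-file
# columns are dropped up front instead of breaking the load.
sub = df[schema_cols]
sub.to_sql("UCR", con=engine, if_exists="append", index=False)
```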

@allfires, please upload a screenshot of the error output about finding the correct database table.

allfires (Collaborator) commented Dec 4, 2024

@J-glove I no longer have the screenshot of the error, but I was able to determine the issue: I was not initially creating the tables correctly. The solution was to add this function to the db_session.py script:
```python
def create_tables():
    # Create all tables based on the models defined in models.py
    engine = get_engine()
    exposome_base.metadata.create_all(engine)
```

However, I am currently calling create_tables() from my test driver script, so I'll need to find a way to have it called automatically. There are some additional issues that I am working through; I will post any blockers that arise, especially with loading the other exposome data.

To clarify, I am able to load the test preprocess_ucr.csv data into the UCR table, so the script itself is functioning correctly.
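
One way to make the call automatic: metadata.create_all() is safe to run repeatedly (it skips tables that already exist), so the upload script's entry point can call create_tables() unconditionally before loading. A sketch under that assumption, reusing the function above; upload_ucr.py is a hypothetical file name:

```python
# upload_ucr.py (hypothetical entry point)
from db_session import create_tables

def main():
    # Safe on every run: create_all() only creates missing tables
    create_tables()
    # ... then load the preprocessed CSVs into the new tables ...

if __name__ == "__main__":
    main()
```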

J-glove (Collaborator, Author) commented Dec 10, 2024

Continuing to work on the upload. I had trouble locating the files to upload to the db; Chengrong has them. @chengrong-us, please upload these preprocessed files to the exposome server and tell Ali where they are located.

There is a good deal of updating that needs to be done to the scripts before the datasets can be uploaded.

chengrong-us (Collaborator) commented

The preprocess_ucr_zip9.csv file is saved on the exposome server at /data/exposome_csvs/preprocess_ucr_zip9.csv.

J-glove (Collaborator, Author) commented Dec 18, 2024

@allfires currently has to prioritize 17r1 Characterization ahead of the holiday, so this is moving into the next sprint.

J-glove (Collaborator, Author) commented Jan 8, 2025

Not finished yet; working on the upload scripts to match the data source to the defined model.

Only ucr is present in the db as of 1/8.

J-glove (Collaborator, Author) commented Jan 8, 2025

@allfires - Please move the database to the on-prem Exposome server.

allfires (Collaborator) commented Jan 8, 2025

> @allfires - Please move the database to the on-prem Exposome server.

@J-glove I have been working on moving the database to the exposome server; however, I am running into some issues. The biggest one is that I cannot run the latest version of SQLAlchemy on the server due to its current Python version: to access the SQLAlchemy APIs that include DeclarativeBase and mapped_column, I need SQLAlchemy 2.x, which requires Python 3.7 or higher. What would you suggest as a fix? Should we create a Docker container?
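
For reference, a sketch of the 2.0-style imports the models rely on, which is what pins the Python floor; the version guard and the exposome_base definition are illustrative assumptions, not code from the repo:

```python
import sys

# DeclarativeBase / Mapped / mapped_column only exist in SQLAlchemy 2.x,
# and SQLAlchemy 2.x itself requires Python 3.7 or higher.
if sys.version_info < (3, 7):
    raise RuntimeError("SQLAlchemy 2.x needs Python 3.7+ (e.g. via pyenv or a container)")

from sqlalchemy import Float, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class exposome_base(DeclarativeBase):
    """Shared declarative base for the exposome models (definition assumed)."""
```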

allfires (Collaborator) commented

@J-glove I wanted to give you an update on this. I figured out how to use pyenv to install a Python version compatible with the SQLAlchemy dependencies. I was able to get the script running; however, it looks like the headers in the preprocess_ucr.csv file do not follow the standard model for the zip code and time identifiers, so I am not able to load the ucr data.

allfires (Collaborator) commented Jan 10, 2025

@J-glove the model for ucr specifically is this:

```python
class ucr(exposome_base):
    __tablename__ = 'UCR'

    ZIP_9: Mapped[str] = mapped_column(String(9), primary_key=True)
    YEAR: Mapped[str] = mapped_column(String(4), primary_key=True)
    p_assualt: Mapped[float] = mapped_column(Float)
    p_burglary: Mapped[float] = mapped_column(Float)
    p_fso: Mapped[float] = mapped_column(Float)
    p_larceny: Mapped[float] = mapped_column(Float)
    p_murder: Mapped[float] = mapped_column(Float)
    p_mvt: Mapped[float] = mapped_column(Float)
    p_rob: Mapped[float] = mapped_column(Float)
```

If a preprocess file doesn't exactly match the model, it will not load. Do you want me to change the load scripts so that they handle extra columns, or should we change the preprocessing script to output only columns that read into the model correctly? Please let me know how to proceed.

preprocess_ucr_zip9.csv currently has these headers: `ZIP_9,YEAR,pop,murder,fso,robbery,assault,burglary,larceny,mvt,total,p_total,p_murder,p_fso,p_rob,p_assault,p_burglary,p_larceny,p_mvt`. (Note that the model spells p_assualt while the CSV has p_assault, so that column won't match either way.)
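
A hedged sketch of the first option (handle extra columns in the load script), assuming the model above and a pandas-based loader; get_engine comes from db_session.py as discussed earlier, while load_ucr and the models import path are hypothetical names:

```python
import pandas as pd
from sqlalchemy.orm import Session

from db_session import get_engine  # helper referenced earlier in the thread
from models import ucr             # the model shown above (import path assumed)

def load_ucr(csv_path: str) -> None:
    # Keep the key columns as strings so pandas doesn't coerce them to floats
    df = pd.read_csv(csv_path, dtype={"ZIP_9": str, "YEAR": str})

    # Keep only the columns the model defines; extras like pop, murder,
    # total, ... are dropped instead of breaking the load.
    model_cols = ucr.__table__.columns.keys()
    missing = set(model_cols) - set(df.columns)
    if missing:
        # e.g. {'p_assualt'}: the model's spelling differs from the CSV's p_assault
        print(f"CSV is missing model columns: {sorted(missing)}")

    keep = [c for c in model_cols if c in df.columns]
    records = df[keep].to_dict(orient="records")

    with Session(get_engine()) as session:
        session.bulk_insert_mappings(ucr, records)
        session.commit()
```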
