Sidewalk Data Quality #1

Open
aseemdeodhar opened this issue Jul 7, 2020 · 3 comments

@aseemdeodhar
Contributor

The sidewalk width field is a text field (!) and has to be cleaned up. The vast majority of rows are clean; only 20 out of ~24,000 rows have undecipherable text values. Some entries with widths in the 90s (widths are in feet) seem to be erroneous entries where a decimal point is missing.

@aseemdeodhar
Contributor Author

Did manual cleaning of widths data.

Example errors:

  • letter O instead of numeral 0
  • > sign instead of numeral 7
  • Manual checks were needed to decide whether a decimal point was missing, e.g. 30 ft sidewalks do exist, but some 3 ft sidewalks were entered as 30 with the decimal point missing
  • ',' as decimal separator instead of '.'

Used the sf package in R to convert SWK_WIDTH column from text to numeric:

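library(sf)     # read/write spatial data
library(dplyr)  # mutate() and %>%

# Read the manually cleaned shapefile
sidewalk_shp <- read_sf("sidewalks_clean/Sidewalk_Inventory.shp")
# Convert SWK_WIDTH from text to numeric; values that still fail to parse become NA with a warning
sidewalk_shp <- sidewalk_shp %>% mutate(SWK_WIDTH = as.numeric(SWK_WIDTH))
# Overwrite the shapefile in place with the numeric column
st_write(sidewalk_shp, "sidewalks_clean/Sidewalk_Inventory.shp", delete_layer = TRUE)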
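
For reference, some of the typo classes listed above could probably be scripted rather than fixed by hand. The following is only a rough sketch in base R, not what was actually run: `clean_width()` and the `max_plausible_ft` threshold are hypothetical names and values chosen for illustration, and the sample strings are made up. It fixes the letter O and comma cases, and leaves anything ambiguous (a stray `>`, or a suspiciously large width that may be missing a decimal point) flagged for manual review instead of guessing.

```r
# Hypothetical helper: normalise common typos in a text width column.
clean_width <- function(x) {
  x <- trimws(x)
  x <- gsub("[Oo]", "0", x)             # letter O typed instead of numeral 0
  x <- gsub(",", ".", x, fixed = TRUE)  # comma used as decimal separator
  # Anything that is still not a plain number becomes NA for manual review
  ok <- grepl("^[0-9]+(\\.[0-9]+)?$", x)
  out <- rep(NA_real_, length(x))
  out[ok] <- as.numeric(x[ok])
  out
}

# Made-up sample values covering the error types described above
raw <- c("5", "1O", "4,5", "30", "95", ">", "abc")
width <- clean_width(raw)

# Flag rows for manual review: unparseable, or wider than an assumed threshold
max_plausible_ft <- 20  # assumption for illustration, not taken from the data
needs_review <- is.na(width) | width > max_plausible_ft

print(data.frame(raw, width, needs_review))
```

Wide values are only flagged, never rescaled, since genuine 30 ft sidewalks exist and cannot be told apart from a 3.0 with a missing decimal point without looking at the record.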

@patwater

patwater commented Jul 7, 2020

O instead of 0 is a major facepalm! Great stuff

vr00n changed the title from "Sidewalk Data Qaulity" to "Sidewalk Data Quality" on Jul 8, 2020
@vr00n
Member

vr00n commented Jul 12, 2020

@aseemdeodhar can you install and run https://github.com/pandas-profiling/pandas-profiling on the sidewalk data file and upload the resulting HTML report here? Thanks!

vr00n pushed a commit that referenced this issue Oct 3, 2020