[New] Specify strftime format with CountryData.cleaned(date_format) when we use local dataset (Fix: Using Own Dataset Not Work Anymore) #856

Closed
subi10 opened this issue Jun 27, 2021 · 36 comments
Labels
bug Something isn't working documentation Improvements or additions to documentation enhancement New feature or request

Comments

@subi10

subi10 commented Jun 27, 2021

Hi, I'm Subi from Malaysia. Thank you very much for this outstanding package. For the last month I have been using it to load a dataset for a province in Malaysia, and it worked like a charm. Right now I am trying to follow similar steps, but the "scenario" instance returns an error; it looks like it didn't read my dataset properly.

This is what I get now:
image

This is what I got before:
image

Am I doing something wrong? This is how I do it:

image
image

@lisphilar lisphilar changed the title Using Own Dataset Not Work Anymore [FIx] Using Own Dataset Not Work Anymore Jun 27, 2021
@lisphilar lisphilar added this to the Release v2.22.0 milestone Jun 27, 2021
@lisphilar lisphilar added the bug Something isn't working label Jun 27, 2021
@lisphilar
Owner

Thank you for reaching out to us!
Could you check whether country_data.cleaned() has all the data you had in the CSV file? .head() is not used in In[10], but only five rows are shown in Out[10].

Additionally, please try auto_complement=False (which skips automatic data complement) when creating the Scenario instance, i.e. please replace

my_scenario = cs.Scenario(jhu_data, population_data, "Malaysia", "Selangor")

with

my_scenario = cs.Scenario(jhu_data, population_data, "Malaysia", "Selangor", auto_complement=False)

If that does not work, could you share the CSV file and the version numbers of Python and CovsirPhy? (Kindly use the "Request fixing a bug" issue template next time.)

@subi10
Author

subi10 commented Jun 27, 2021 via email

@subi10
Author

subi10 commented Jun 27, 2021 via email

@lisphilar
Owner

Dear @subi10 ,
Thank you for trying, but the images and CSV files did not reach me because attachments are removed when we reply to GitHub notification e-mails. Please return to GitHub Issues in your browser and attach them there :-)
#856

@lisphilar
Owner

You can move to GitHub Issues by clicking the "view it on GitHub" link at the bottom of the notification e-mails.
Screenshot

@subi10
Author

subi10 commented Jun 27, 2021

Hi,

Sorry for sending it via email. I tried the suggestion and got this:

1624776855508blob

Attached is the dataset file I'm working with.
Selangor.xlsx

@subi10
Author

subi10 commented Jun 27, 2021 via email

@lisphilar
Owner

Thank you for uploading!
Hmm... I tried it on Google Colab with CovsirPhy 2.21.0 (the latest stable version) and a CSV file converted from the Excel file you attached. It actually worked.
https://gist.github.com/lisphilar/e7697ae512bdb7220c4bccbf6c2beeb7

I noticed that the first date of the records you showed in the first comment was 2020-01-05 and the column names of the CSV file were "Confirmed", "Recovered" and "Death". However, the Excel file I received has 2020-04-20 as the first record, and its column names are "confirmed", "recovered" and "fatal".

@subi10
Author

subi10 commented Jun 27, 2021 via email

@lisphilar
Owner

lisphilar commented Jun 27, 2021

I'm not sure, but do you have both a CSV file and an Excel file in the directory where the code was executed?
If so, please confirm that both files have the same first date, 2020-04-20, and that the column names are "confirmed", "recovered" and "fatal".

@subi10
Author

subi10 commented Jun 27, 2021 via email

@subi10
Author

subi10 commented Jun 27, 2021

Hi,

These are the files in my directory. I did change the names to lowercase, thinking that it would fix my issue.

image

@subi10
Author

subi10 commented Jun 27, 2021

Yes,

The first date is indeed 20-4-2020

image

@subi10
Author

subi10 commented Jun 27, 2021

I still get this; it gives me a value of -1.

image

@lisphilar
Owner

Could you share Selangor_S-R.ipynb?

@subi10
Author

subi10 commented Jun 27, 2021 via email

@subi10
Author

subi10 commented Jun 27, 2021

Hi,

I uploaded it here:

https://github.com/subi10/Selangor

@subi10
Author

subi10 commented Jun 27, 2021

Also, I passed the Colab link you gave me to my colleague and asked them to run it; they got the same error as well.

@lisphilar
Owner

Thank you for creating the repository! Sorry for the trouble.

I noticed that the last date is "2021-06-14" in the CSV, but "2021-12-06" (a future date!) in Out[4] (country_data.cleaned().tail()). I will investigate this in the source code.

Could you add the following lines to the script?

print(covsirphy.__version__)
country_data._raw.tail()

If the output of country_data._raw.tail() is not the same as Out[4], something is wrong with data cleaning.
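
The symptom described above (a cleaned dataset containing dates later than the last record in the CSV) can also be checked directly. The following is a minimal sketch in plain pandas, not part of CovsirPhy: the helper name `dates_after` and the sample values are hypothetical, and only the "Date" column name is taken from this thread.

```python
import pandas as pd


def dates_after(df, reference, column="Date"):
    """Return the rows whose date is later than the reference date (hypothetical helper)."""
    return df.loc[df[column] > pd.Timestamp(reference)]


# Hypothetical cleaned records: the last row mimics a wrongly parsed "12/6/2021".
cleaned = pd.DataFrame({
    "Date": pd.to_datetime(["2021-06-13", "2021-06-14", "2021-12-06"]),
    "Confirmed": [140, 150, 125],
})

# The CSV is known to end on 2021-06-14, so anything later is suspicious.
suspicious = dates_after(cleaned, "2021-06-14")
print(suspicious)
```

Any row returned here would suggest that the date column was parsed with the wrong day/month order.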

@subi10
Author

subi10 commented Jun 27, 2021

This is the head and tail of country_data:

image

@subi10
Author

subi10 commented Jun 27, 2021

This is the version I am currently on:

image

@subi10
Author

subi10 commented Jun 27, 2021

Yes, it looks like the tail here has an issue with the last date.

@lisphilar
Owner

Thank you for sharing. It appears that "12/6/2021" is converted to "2021-12-06" (6 December 2021) on your PC.
Apart from CovsirPhy, please share the output of the following code.

import pandas as pd
pd.to_datetime("12/6/2021")

My PC (in Japan) returns Timestamp('2021-12-06 00:00:00').

@subi10
Author

subi10 commented Jun 27, 2021

Same goes here, my start date is April 20th 2020

image

@subi10
Author

subi10 commented Jun 27, 2021

Oh, this is the timestamp

image

@lisphilar
Owner

This is expected to be Timestamp('2021-06-12 00:00:00')...
To fix this issue, we may need to set the date format explicitly.

import pandas as pd
pd.to_datetime("12/6/2021", format="%d/%m/%Y")
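
For comparison, here is a minimal sketch of three ways pandas can read the same string. It assumes the string is meant as day/month/year, as in the Selangor dataset; the variable names are illustrative only.

```python
import pandas as pd

raw = "12/6/2021"  # intended as 12 June 2021 (day/month/year)

# Default parsing treats the first number as the month.
month_first = pd.to_datetime(raw)
# dayfirst=True hints that the first number is the day.
day_first = pd.to_datetime(raw, dayfirst=True)
# An explicit format removes the ambiguity entirely.
explicit = pd.to_datetime(raw, format="%d/%m/%Y")

print(month_first.date(), day_first.date(), explicit.date())
```

The first value is 6 December 2021, while the other two are 12 June 2021.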

@subi10
Author

subi10 commented Jun 27, 2021

I tried running it:

image

@lisphilar
Owner

lisphilar commented Jun 27, 2021

The reason it succeeded on Google Colab is not clear, but to test this, could you try the following?

import pandas as pd
# Remove cleaned data with wrong time format
country_data._cleaned_df = pd.DataFrame()
# Update raw dataframe with appropriate time format
country_data._raw["Date"] = pd.to_datetime(country_data._raw["Date"], format="%d/%m/%Y")
# Data cleaning
country_data.cleaned()
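
The conversion step of this workaround can be tried without CovsirPhy. The sketch below applies the same explicit format to a small, made-up DataFrame (the values are hypothetical; only the "Date" column name and the "%d/%m/%Y" format come from this thread) and then checks that the dates are in order and end where the CSV ends.

```python
import pandas as pd

# Hypothetical raw records with day/month/year strings.
raw = pd.DataFrame({
    "Date": ["10/6/2021", "11/6/2021", "12/6/2021", "13/6/2021", "14/6/2021"],
    "confirmed": [100, 110, 125, 140, 150],
})

# Parse with an explicit format so that "12/6/2021" becomes 12 June, not 6 December.
raw["Date"] = pd.to_datetime(raw["Date"], format="%d/%m/%Y")

# Daily records should be in chronological order after a correct parse.
in_order = raw["Date"].is_monotonic_increasing
last_date = raw["Date"].max()
print(in_order, last_date.date())
```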

@subi10
Author

subi10 commented Jun 27, 2021 via email

@subi10
Author

subi10 commented Jun 27, 2021

Great stuff!! Thank you so much. !!

image
image

@lisphilar
Owner

lisphilar commented Jun 27, 2021

Thank you for your cooperation!!!
I will add a date_format argument to CountryData.cleaned() later.

@subi10
Author

subi10 commented Jun 27, 2021

Thank you very much to you as well. You are a genius and great person.

@lisphilar lisphilar changed the title [FIx] Using Own Dataset Not Work Anymore [New] Specify strftime format with CountryData.cleaned(date_format) when we use local dataset (Fix: Using Own Dataset Not Work Anymore) Jun 27, 2021
@lisphilar lisphilar added enhancement New feature or request bug Something isn't working documentation Improvements or additions to documentation and removed bug Something isn't working labels Jun 27, 2021
@lisphilar
Owner

With #857, CountryData.cleaned(date_format=None) (default) was implemented in development version 2.21.0-delta. This will be included in the next stable version, 2.22.0 (planned for July 2021). Because we only use dates, the argument name is date_format, not time_format.

For the time being, please use the code (country_data._raw["Date"] = pd.to_datetime(country_data._raw["Date"], format="%d/%m/%Y")) with the latest stable version.
Alternatively, use country_data.cleaned(date_format="%d/%m/%Y") with the development version.
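
Before passing a format string as date_format, it may help to confirm that it matches the date strings in the file. The following is a rough sketch using only the standard library; the helper `unparsable` is hypothetical and not part of CovsirPhy, and the sample strings are modelled on the dates mentioned in this thread.

```python
from datetime import datetime


def unparsable(samples, date_format):
    """Return the samples that datetime.strptime cannot parse with date_format."""
    failed = []
    for text in samples:
        try:
            datetime.strptime(text, date_format)
        except ValueError:
            failed.append(text)
    return failed


samples = ["20/4/2020", "12/6/2021", "14/6/2021"]

day_month_failures = unparsable(samples, "%d/%m/%Y")  # day/month/year
month_day_failures = unparsable(samples, "%m/%d/%Y")  # month/day/year
print(day_month_failures)
print(month_day_failures)
```

A format that fails on some rows is clearly wrong. Note that an ambiguous string such as "12/6/2021" parses under both formats, so the check is only informative when the sample includes days greater than 12.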

New documentation will be deployed in a few hours.
https://lisphilar.github.io/covid19-sir/markdown/INSTALLATION.html#use-a-local-csv-file-which-has-the-number-of-cases

I will close this issue, thank you.

FYI:
With issue #851, LocalDataLoader may be created to read local datasets more easily. Date format should be considered there.

@subi10
Author

subi10 commented Jun 27, 2021 via email

@lisphilar
Owner

lisphilar commented Jul 31, 2021

Dear @subi10,
Hello again.
Stable version 2.22.0 has been released, and the DataLoader class has been improved.
https://lisphilar.github.io/covid19-sir/markdown/LOADING.html

With 2.22.0, we can use DataLoader to read local CSV files without CountryData.

import covsirphy as cs
loader = cs.DataLoader(update_interval=None)
loader.read_csv("Selangor.csv", parse_dates=["Date"], dayfirst=True)
# loader.local
loader.assign(country="Malaysia", state="Selangor", population=6_530_000)
loader.lock(
    date="Date", country="country", province="state",
    confirmed="confirmed", fatal="fatal", recovered="recovered", population="population")
# loader.locked
jhu_data = loader.jhu()
snl = cs.Scenario(country="Malaysia", province="Selangor")
snl.register(jhu_data)
snl.records()
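
The parse_dates and dayfirst arguments used above are also arguments of pandas.read_csv. As a minimal sketch of their effect in plain pandas, the example below reads a small in-memory CSV; the rows are made up, and only the column names follow the Selangor file.

```python
import io

import pandas as pd

# Hypothetical CSV content with day/month/year dates.
csv_text = """Date,confirmed,fatal,recovered
20/04/2020,10,0,2
12/06/2021,125,3,90
14/06/2021,150,4,110
"""

# parse_dates converts the "Date" column; dayfirst=True reads 12/06/2021 as 12 June.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"], dayfirst=True)
print(df["Date"].dt.date.tolist())
```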

@subi10
Author

subi10 commented Aug 28, 2021 via email


2 participants