
Experiencing extremely slow reads on TickStore -- fully executable example included. #895

Open
jeffneuen opened this issue Mar 20, 2021 · 19 comments

Comments

@jeffneuen

Arctic Version

arctic==1.79.4
pandas=1.1.5

Arctic Store

TickStore

Platform and version

Ubuntu Linux 20.04, Python 3.8.8 (Anaconda), running JupyterLab
Modern CPU w/ NVMe

Description of problem and/or code sample that reproduces the issue

I am experiencing very slow TickStore reads. In my sample code below, the write operation clocks in at 1.2 s for 5 million rows, which seems good. However, when I read the data back, the read operation clocks in at 59 s.

import arctic
import pandas as pd

arctic_host = 'localhost:27017'
test_library_name = 'dev_speed_testing_library'
test_store = arctic.Arctic(arctic_host)
test_store.delete_library(test_library_name)
test_store.initialize_library(test_library_name, 'TickStoreV3')

test_library = test_store[test_library_name]
test_library._chunk_size = 1000000
test_library.list_symbols()

[]

from numpy.random import default_rng

data_length = 5000000
sample_index = pd.date_range(start='1990-01-01', periods=data_length, freq='1ms', tz='UTC')
rng = default_rng()
sample_data = rng.standard_normal(data_length)
sample_data = sample_data * sample_data  # square to remove negative values
test_df = pd.DataFrame(sample_data, index=sample_index, columns=['price'])
test_df.dtypes

price float64
dtype: object

%%time #don't use this if you're not running in jupyter
test_library.write('testsymbol', test_df)

CPU times: user 1.16 s, sys: 64.1 ms, total: 1.22 s
Wall time: 1.26 s

%%time #don't use this if you're not running in jupyter
tmp = test_library.read('testsymbol')

CPU times: user 59.3 s, sys: 3.14 s, total: 1min 2s
Wall time: 59.7 s

On the read operation, the process seems to be CPU-bound, with a single Python thread pegged at 100%.

Not sure if I'm missing something obvious here, like using the wrong data types, but writes that are that many multiples faster than reads seem odd.

@jeffneuen
Author

I also ran a profiler on a read from TickStore, and below are a few relevant lines... Is there a different type that I should be saving the datetime index in?

ncalls tottime percall cumtime percall filename:lineno(function)

    1    0.585    0.585  135.354  135.354 <string>:1(<module>)
    1    0.000    0.000  122.954  122.954 datetimes.py:259(_convert_listlike_datetimes)
    1    0.000    0.000  122.955  122.955 datetimes.py:605(to_datetime)
    1    0.205    0.205  134.769  134.769 tickstore.py:265(read)
    1    0.000    0.000  135.354  135.354 {built-in method builtins.exec}
    1  122.954  122.954  122.954  122.954 {pandas._libs.tslib.array_with_unit_to_datetime}

@jeffneuen
Author

Is there any other information I could provide that would help to better describe this issue?

It is still a problem for me. Thanks!

@jeffneuen
Author

I did more testing on this, and the problem disappears when running pandas 0.25.3. It also disappears if you change the freq of the sample index generator to '1ns' (although the datetime values returned by TickStore will be incorrect if you feed it nanosecond-level data via the write method).

It looks like pandas is handling the to_datetime call on line 388 of tickstore.py:

        index = pd.to_datetime(np.concatenate(rtn[INDEX]), utc=True, unit='ms')

differently enough in v0.25.3 vs the 1.x branch that the new version spends most of its time in pandas._libs.tslib.array_with_unit_to_datetime, as shown by the profiler, while the old version does not.
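The two paths can be compared outside Arctic. Here is a minimal sketch (the `ms` array is a stand-in I made up for the raw millisecond epochs TickStore concatenates per bucket, not Arctic's actual internals) showing that the integer-with-unit path and a prior numpy cast produce identical timestamps:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw millisecond epoch values TickStore stores
ms = np.array([0, 1, 2, 5_000], dtype=np.int64)

# Path taken by tickstore.py: pandas parses the integers itself via unit='ms',
# which is the call the profiler shows dominating the read
slow = pd.to_datetime(ms, utc=True, unit='ms')

# Workaround discussed below in this thread: let numpy do the vectorised cast
# first, so pandas only has to wrap an already-converted datetime64 array
fast = pd.to_datetime(ms.astype('datetime64[ms]'), utc=True)

# Both routes yield the same UTC timestamps
assert (slow == fast).all()
```

On the affected pandas 1.x versions the first call goes through `array_with_unit_to_datetime`; the cast route sidesteps that code path entirely.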

@crazy25000

crazy25000 commented Mar 29, 2021

I just installed and set up Arctic and noticed the slow read performance with TickStore + pandas 1.2.3. However, reads are really fast with VersionStore, and I thought it would be the opposite 😅

@jeffneuen
Author

jeffneuen commented Mar 30, 2021

Yes, this makes me wonder whether Man Group (the creator and maintainer of this package) is really still using the old pandas 0.25 branch internally, or whether nobody there is using TickStore; otherwise someone else would certainly have discovered this.

@crazy25000

It's unfortunate; it would've been nice to use. I also tested https://github.com/alpacahq/marketstore and it performs well. I'd recommend trying it.

I've been testing different libraries to decide which one to keep using, and whether to update one myself if it's outdated and unmaintained. Have you tried other libraries?

@jeffneuen
Author

@crazy25000 thanks for the tip! Happy to continue this discussion, but I don't want to clutter up the github issue with it. My email is on my profile if you'd like to chat datastores further!

@jeffneuen
Author

This is still an outstanding issue for me, if there is anything else I can provide to help clarify this issue, please let me know.

@bmoscon
Collaborator

bmoscon commented Apr 15, 2021

I'd bet that if you downgrade pandas it will work better. This library isn't extensively used or tested on very recent pandas releases, and there have been cases in the past where behavior changed in pandas (for the worse) and made trivial operations in Arctic take incredibly long (e.g. 5 ms to 30 seconds).

@jeffneuen
Author

@bmoscon you are correct: if I use the pandas 0.25 branch, the problem is solved. However, this creates pretty serious workflow issues. If a user wants to pull data with 0.25 but then work with the data using a current 1.x version of pandas, you need two different venvs and end up storing the data in some kind of intermediate layer, unless I'm missing an obvious and simpler workaround.

It's possible, but it seems to defeat a lot of the benefit of Arctic if I am dumping everything into parquet files with code running 0.25 and then loading the parquet files with 1.x pandas to do the work. The most recent release on the 0.25 branch of pandas is from Oct 2019 -- a little stale at this point.

But thank you for the reply; the point is taken that perhaps Arctic just doesn't have complete support for 1.x pandas yet.

@vargaspazdaniel

I'm having issues with the read speed with TickStore too. Reading only around 205k rows takes around 1 min, while writing the data works perfectly and without issues. Is there any way to read tick data (which usually has hundreds of thousands of rows) faster? Maybe using dask, modin, or another pandas version with higher speed.

@JunyueLiu

My solution is to replace line 338 with:
index = pd.to_datetime(np.concatenate(rtn[INDEX]).astype('datetime64[ms]'), utc=True, unit='ms')

@jeffneuen
Author

@JunyueLiu I tweaked your suggestion just a little bit to:

index = pd.to_datetime(np.concatenate(rtn[INDEX]).astype('datetime64[ms]'), utc=True)

and now the reads are back to a normal speed, about 6.5M rows/sec.

@JunyueLiu would you like to submit a PR since the fix was basically yours? If not I'll do it and credit you. I would think that this issue must be affecting a lot of people who would benefit from the fix being in a release.

@JunyueLiu

> @JunyueLiu I tweaked your suggestion just a little bit to:
>
> index = pd.to_datetime(np.concatenate(rtn[INDEX]).astype('datetime64[ms]'), utc=True)
>
> and now the reads are back to a normal speed, about 6.5M rows/sec.
>
> @JunyueLiu would you like to submit a PR since the fix was basically yours? If not I'll do it and credit you.

Feel free to submit the PR.

@CmpCtrl

CmpCtrl commented Nov 1, 2021

I ran into this issue as well, and found that it can be made quicker still by omitting the pd.to_datetime altogether:
index = (np.concatenate(rtn[INDEX])).astype("datetime64[ms]")
However, this solution also requires a second change where the timezone is converted, line 359. The following worked for me:
rtn.index = rtn.index.tz_localize(dt.now().astimezone().tzinfo)
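One semantic difference worth noting (my own sketch, not from the thread): `tz_localize` attaches a timezone to naive timestamps without shifting the underlying values, whereas `pd.to_datetime(..., utc=True)` labels them as UTC. So localizing to the machine's zone only matches the original behaviour when the local offset is zero:

```python
from datetime import datetime, timezone
import numpy as np
import pandas as pd

ms = np.array([0, 60_000], dtype=np.int64)  # hypothetical epoch-ms values
naive = pd.DatetimeIndex(ms.astype('datetime64[ms]'))
assert naive.tz is None  # no timezone attached yet

# tz_localize attaches a zone; the wall-clock values stay unchanged,
# so labelling them UTC reproduces the utc=True behaviour exactly
utc = naive.tz_localize(timezone.utc)
assert utc[1] == pd.Timestamp('1970-01-01 00:01:00', tz='UTC')

# Labelling the same naive values with the machine's offset instead shifts
# their meaning whenever that offset is non-zero
local = naive.tz_localize(datetime.now().astimezone().tzinfo)
```

If the stored epochs are UTC milliseconds, `tz_localize(timezone.utc)` keeps the read both fast and correct.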

@jeffneuen
Author

@CmpCtrl I tried your solution, and did indeed get about 50% faster reads, about 9.5M rows/sec. Thanks!

@CmpCtrl

CmpCtrl commented Nov 2, 2021

I started a branch to work on a couple of other things as well. The mktz() call seems really slow: my first call to get the max or min date from a symbol took ~0.6 seconds, and it seemed like most of that was spent finding the local timezone. I also brought in the fixes from #887 so I could get back to the latest Python and pandas versions. I haven't done much testing, and I am only using a small portion of the functionality, so I'm not sure how relevant these changes are to others.

@jeffneuen
Author

Thanks, I am checking out your branch; those functions are useful to me! I need to be on the 1.x version of pandas for other reasons, and min and max date are also functions I need. I hope that at some point this project will be able to standardize on more recent versions of Python and pandas, but my feeling is that the main corporate owner of the project probably has its own internal versions that it uses, and that's what it's being maintained for.

jeffneuen added a commit to jeffneuen/arctic that referenced this issue Nov 3, 2021
@burrowsa
Contributor

I'm also seeing this problem. Picking up the fix from @jeffneuen 's repo fixed it for me.
