Experiencing extremely slow reads on TickStore -- fully executable example included. #895

jeffneuen · 2021-03-20T01:19:13Z

Arctic Version

arctic==1.79.4
pandas=1.1.5

Arctic Store

TickStore

Platform and version

Ubuntu Linux 20.04, Python 3.8.8 (Anaconda), running JupyterLab
Modern CPU w/ NVMe

Description of problem and/or code sample that reproduces the issue

I am experiencing very slow TickStore reads. In my sample code below, the write operation clocks in at 1.2s for 5 million rows, which seems good. However, when I read the data back, the read operation clocks in at 59s.

import arctic
import pandas as pd

arctic_host  = 'localhost:27017'
test_library_name = 'dev_speed_testing_library'
test_store = arctic.Arctic(arctic_host)
test_store.delete_library(test_library_name)
test_store.initialize_library(test_library_name, 'TickStoreV3')

test_library = test_store[test_library_name]
test_library._chunk_size = 1000000
test_library.list_symbols()

[]

from numpy.random import default_rng
data_length = 5000000
sample_index = pd.date_range(start='1990-01-01', periods=data_length, freq='1ms', tz='UTC')
rng = default_rng()
sample_data = rng.standard_normal(data_length)
sample_data = sample_data * sample_data #get rid of negative #s
test_df=pd.DataFrame(sample_data, sample_index, columns=['price'])
test_df.dtypes

price float64
dtype: object

%%time #don't use this if you're not running in jupyter
test_library.write('testsymbol', test_df)

CPU times: user 1.16 s, sys: 64.1 ms, total: 1.22 s
Wall time: 1.26 s

%%time #don't use this if you're not running jupyter
tmp = test_library.read('testsymbol')

CPU times: user 59.3 s, sys: 3.14 s, total: 1min 2s
Wall time: 59.7 s

On the read operation, the process seems to be cpu bound, with a single python thread pegged at 100%.

Not sure if I'm missing something obvious here, like using the wrong data types or something, but writes that are that many multiples faster than reads seems odd.
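For scale, the timings above can be turned into rough throughput figures. This is only arithmetic on the wall times reported in this post; the variable names are illustrative.

```python
# Rough throughput from the wall times reported above (5,000,000 rows).
rows = 5_000_000
write_wall_s = 1.26   # wall time of the write cell
read_wall_s = 59.7    # wall time of the read cell

write_rate = rows / write_wall_s      # rows per second on write
read_rate = rows / read_wall_s        # rows per second on read
slowdown = read_wall_s / write_wall_s  # how many times slower the read is

print(f"write: {write_rate:,.0f} rows/s")
print(f"read:  {read_rate:,.0f} rows/s")
print(f"read is about {slowdown:.0f}x slower than write")
```

That works out to roughly 4M rows/s on write against roughly 84k rows/s on read, a gap of about 47x.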

jeffneuen · 2021-03-20T06:14:47Z

I also ran a profiler on a read from tickstore, and below are a few relevant lines... Is there a different type that I should be saving the datetime index in?

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.585    0.585  135.354  135.354 <string>:1(<module>)
     1    0.000    0.000  122.954  122.954 datetimes.py:259(_convert_listlike_datetimes)
     1    0.000    0.000  122.955  122.955 datetimes.py:605(to_datetime)
     1    0.205    0.205  134.769  134.769 tickstore.py:265(read)
     1    0.000    0.000  135.354  135.354 {built-in method builtins.exec}
     1  122.954  122.954  122.954  122.954 {pandas._libs.tslib.array_with_unit_to_datetime}
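A listing like this can be produced with the standard-library profiler. The sketch below is an assumption about how one might collect it, not necessarily how the numbers above were gathered; `slow_read` is a hypothetical stand-in for `test_library.read('testsymbol')` so that the snippet runs without MongoDB.

```python
import cProfile
import io
import pstats


def slow_read():
    # Hypothetical stand-in for test_library.read('testsymbol').
    return sum(i * i for i in range(200_000))


profiler = cProfile.Profile()
profiler.enable()
result = slow_read()
profiler.disable()

# Sort by cumulative time so the call chain leading to the hot spot shows first.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

In the real case, a row whose tottime is close to the overall cumtime (here `array_with_unit_to_datetime`) is the place where the time is actually being spent.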

jeffneuen · 2021-03-24T13:40:06Z

Is there any other information I could provide that would help to better describe this issue?

It is still a problem for me. Thanks!

jeffneuen · 2021-03-25T19:08:42Z

I did more testing on this, and the problem disappears when running pandas 0.25.3. It also disappears if you change the freq of the sample index generator to '1ns' (although the datetime values returned by TickStore will be incorrect if you feed it nanosecond-level data via the write method).

It looks like pandas is handling the to_datetime call on line 388 of tickstore.py:

        index = pd.to_datetime(np.concatenate(rtn[INDEX]), utc=True, unit='ms')

differently enough in v0.25.3 vs the 1.x branch that the new version of pandas spends a lot of time in pandas._libs.tslib.array_with_unit_to_datetime, as shown by the profiler, where the old version does not.
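To check the conversion in isolation, without Arctic or MongoDB, something like the following could be run under each pandas version. It is a minimal sketch that assumes the concatenated index is an int64 array of epoch milliseconds; if TickStore hands pandas a different dtype, the timing may not reproduce.

```python
import time

import numpy as np
import pandas as pd

# Assumption: the concatenated index is an int64 array of epoch milliseconds.
n_rows = 100_000
start_ms = 631_152_000_000  # 1990-01-01T00:00:00Z in epoch milliseconds
epoch_ms = start_ms + np.arange(n_rows, dtype="int64")

# Same call shape as the line quoted from tickstore.py.
t0 = time.perf_counter()
index = pd.to_datetime(epoch_ms, utc=True, unit="ms")
elapsed = time.perf_counter() - t0

print(f"pandas {pd.__version__}: {n_rows} rows converted in {elapsed:.4f} s")
```

Running the same script in a 0.25.3 environment and a 1.x environment would show whether this one call accounts for the difference.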

crazy25000 · 2021-03-29T23:06:25Z

I just installed and set up Arctic and noticed the slow read performance with TickStore + pandas 1.2.3. However, reads are really fast with VersionStore, and I thought it would be the opposite 😅

jeffneuen · 2021-03-30T19:30:56Z

Yes, this makes me wonder if man financial (the creator and maintainer of this package) is really still using the old pandas 0.25 branch internally, or if nobody there is using TickStore. Otherwise, surely someone else would have discovered this.

crazy25000 · 2021-04-01T03:37:49Z

It's unfortunate, would've been nice to use. I also tested https://github.com/alpacahq/marketstore and it performs well. Would recommend trying it.

I've been testing different libraries to determine which one to keep, continue using, and if it's outdated and unmaintained, update it myself. Have you tried other libraries?

jeffneuen · 2021-04-01T16:15:12Z

@crazy25000 thanks for the tip! Happy to continue this discussion, but I don't want to clutter up the github issue with it. My email is on my profile if you'd like to chat datastores further!

jeffneuen · 2021-04-15T12:35:06Z

This is still an outstanding issue for me, if there is anything else I can provide to help clarify this issue, please let me know.

bmoscon · 2021-04-15T13:33:15Z

I'd bet it will work better if you downgrade pandas. This library isn't extensively used or tested on very recent pandas releases, and there have been cases in the past where behavior changed in pandas (for the worse) and made trivial operations in arctic take incredibly long (e.g. 5 ms to 30 seconds).

jeffneuen · 2021-04-15T14:20:12Z

@bmoscon you are correct, if I use the pandas 0.25 branch, the problem is solved. However, this creates pretty serious workflow issues. If a user wants to pull data with 0.25 but then work with the data using a current 1.x version of pandas, they need two different venvs and end up storing the data in some kind of intermediate layer, unless I'm missing an obvious and simpler workaround.

It's possible, it just seems to defeat a lot of the benefit of arctic if I am dumping everything into parquet files with code running 0.25 and then loading the parquet files with a 1.x pandas to do the work. The most recent version of 0.25 pandas is from Oct 2019 -- a little stale at this point.

But, thank you for the reply, the point is taken that perhaps arctic just doesn't have complete support for 1.x pandas yet.

vargaspazdaniel · 2021-07-26T12:14:11Z

I'm having issues with the read speed of TickStore too. Reading only around 205k rows takes around 1 min, while writing the data works perfectly and without issues. Is there any way to read tick data (which usually has thousands and thousands of rows) faster? Maybe using dask, modin, or another pandas version with higher speed?

JunyueLiu · 2021-08-03T09:07:57Z

My solution is to replace line 338 with
index = pd.to_datetime(np.concatenate(rtn[INDEX]).astype('datetime64[ms]'), utc=True, unit='ms')

jeffneuen · 2021-11-01T01:47:23Z

@JunyueLiu I tweaked your suggestion just a little bit to:

index = pd.to_datetime(np.concatenate(rtn[INDEX]).astype('datetime64[ms]'), utc=True)

and now the reads are back to a normal speed, about 6.5M rows/sec.
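A quick way to confirm that the patched line returns the same index as the original one is to compare both on synthetic data. This is a sketch under the assumption that the stored index values are epoch-millisecond integers, as in the sample data above.

```python
import numpy as np
import pandas as pd

# Assumption: epoch-millisecond integers, as in the sample data (1 ms steps).
epoch_ms = 631_152_000_000 + np.arange(1_000, dtype="int64")

# Original call shape from tickstore.py.
original = pd.to_datetime(epoch_ms, utc=True, unit="ms")

# Patched call: let NumPy reinterpret the integers first, then hand pandas datetimes.
patched = pd.to_datetime(epoch_ms.astype("datetime64[ms]"), utc=True)

# Element-wise comparison, so differing internal resolutions do not matter.
same_values = bool((original == patched).all())
print(same_values, original.tz, patched.tz)
```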

@JunyueLiu would you like to submit a PR since the fix was basically yours? If not I'll do it and credit you. I would think that this issue must be affecting a lot of people who would benefit from the fix being in a release.

JunyueLiu · 2021-11-01T03:43:58Z

Feel free to submit the PR.

CmpCtrl · 2021-11-01T18:49:42Z

I found this issue as well, and found that it can be quicker still by omitting the pd.to_datetime altogether.
index = (np.concatenate(rtn[INDEX])).astype("datetime64[ms]")
However, this solution also requires a second change to where the timezone is converted, line 359. The following worked for me.
rtn.index = rtn.index.tz_localize(dt.now().astimezone().tzinfo)
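The idea of skipping pd.to_datetime and attaching the timezone afterwards can be tried on synthetic data. This is a minimal sketch, not the code from the branch: it assumes epoch-millisecond integers, and it localizes to UTC rather than to the machine's zone so that the result can be compared with the `utc=True` version. Note that tz_localize labels naive values as wall time in the given zone, so localizing to a non-UTC local zone yields different instants.

```python
import numpy as np
import pandas as pd

# Assumption: epoch-millisecond integers starting at 1990-01-01T00:00:00Z.
epoch_ms = 631_152_000_000 + np.arange(1_000, dtype="int64")

# Pure NumPy reinterpretation of the integers as naive millisecond timestamps.
naive = epoch_ms.astype("datetime64[ms]")

# Attach a timezone afterwards; UTC is assumed here instead of the local zone.
fast_index = pd.DatetimeIndex(naive).tz_localize("UTC")

# Reference result from the original to_datetime call shape.
reference = pd.to_datetime(epoch_ms, utc=True, unit="ms")
matches = bool((fast_index == reference).all())
print(matches)
```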

jeffneuen · 2021-11-02T19:12:26Z

@CmpCtrl I tried your solution, and did indeed get about 50% faster reads, about 9.5M rows/sec. Thanks!

CmpCtrl · 2021-11-02T19:28:00Z

I started a branch to work on a couple of other things as well. The mktz() call seems really slow: my first call to get the max or min date from a symbol took ~0.6 seconds, and it seemed like most of that was spent finding the local timezone. I also brought in the fixes from #887 so I could get back to the latest python and pandas versions. I haven't done much testing and I am only using a small portion of the functionality, so I'm not sure how relevant these changes are to others.

jeffneuen · 2021-11-02T19:41:03Z

Thanks, I am checking out your branch, those functions are useful to me! I need to be on the 1.x version of pandas for other reasons, and min and max date are also useful to me. I hope that at some point this project will be able to standardize on the more recent versions of python and pandas, but my feeling is that the main corporate owner of the project probably has their own internal versions that they use, and that's what it's being maintained for.

burrowsa · 2022-01-18T15:03:28Z

I'm also seeing this problem. Picking up the fix from @jeffneuen's repo fixed it for me.

jeffneuen added a commit to jeffneuen/arctic that referenced this issue Nov 3, 2021

Fixing slow reads with pandas 1.x for tickstore, using fix from @CmpCtrl, see man-group#895

3d142ee
