Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use SQL for generation #11

Open
grishy opened this issue Jul 19, 2021 · 10 comments
Open

Use SQL for generation #11

grishy opened this issue Jul 19, 2021 · 10 comments

Comments

@grishy
Copy link

grishy commented Jul 19, 2021

What about a using SQLite for test data generation?

Example:

CREATE TABLE test (
  id INTEGER PRIMARY KEY NOT NULL, 
  x REAL NOT NULL,
  y REAL NOT NULL,
  z REAL NOT NULL
);
INSERT INTO test
    WITH RECURSIVE
      cnt( id, x, y, z) AS (
      VALUES(1 , random(), random(), random()) UNION ALL 
      SELECT id+1,random(),random(), random() FROM cnt WHERE ID<1000)
    select * from cnt;

https://paulbradley.org/sqlite-test-data/
https://stackoverflow.com/questions/17931320/how-to-insert-random-data-into-a-sqlite-table-using-only-queries

@WorldMaker
Copy link

I was thinking about the same thing. It would be interesting to compare if everything was offloaded to a single transaction script like this and see if offloading all the math to SQLite makes sense. While there are range restrictions on the values, there aren't distribution restrictions (normal, poisson, what have you) and the post already says that any regular PRNG is fine and crypto-strength not necessary, so SQLite random() should be capable of meeting the constraints.

Having a quick scratch at the math, it would look something like:

Insert into user
with recursive usr(id, area, age, active) as (
    values (1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1))
    union all select id + 1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1) from usr
    where id < 100000
)
select * from usr

@avinassh
Copy link
Owner

@WorldMaker hey, could you send this code as a PR?

@avinassh
Copy link
Owner

I ran this locally, this is quite slow. Do let me know if i am missing something.

  1. First I manually created a DB, created a table.
  2. Then I ran this following script:
#!/bin/sh

sqlite3 pure_sql.db <<'END_SQL'
insert into user
with recursive usr(id, area, age, active) as (
    values (1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1))
    union all select id + 1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1) from usr
    where id < 100000000
)
select * from usr;
END_SQL

saved this in a script and ran it with time:

time ./sq.sh
./sq.sh  111.59s user 2.35s system 96% cpu 1:58.00 total

@grishy
Copy link
Author

grishy commented Jul 25, 2021

@avinassh It turns out - is it at least faster than PyPy?

@grishy
Copy link
Author

grishy commented Jul 25, 2021

PRAGMA journal_mode = OFF;
PRAGMA synchronous = 0;
PRAGMA cache_size = 1000000;
PRAGMA locking_mode = EXCLUSIVE;
PRAGMA temp_store = MEMORY;

Do they influence here?

@avinassh
Copy link
Owner

I had tried with them too, but it did not make much difference:

#!/bin/sh

sqlite3 pure_sql.db <<'END_SQL'
PRAGMA journal_mode = OFF;
PRAGMA synchronous = 0;
PRAGMA cache_size = 1000000;
PRAGMA locking_mode = EXCLUSIVE;
PRAGMA temp_store = MEMORY;
insert into user
with recursive usr(id, area, age, active) as (
    values (1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1))
    union all select id + 1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1) from usr
    where id < 100000000
)
select * from usr;
END_SQL

output:

$ time ./sq.sh

off
exclusive
./sq.sh  114.55s user 3.31s system 99% cpu 1:58.93 total

@WorldMaker
Copy link

Interesting results. We're about at the limit of my personal experience with SQLite directly and a lot of the things you might try to speed things up in other SQL environments obviously don't apply to SQLite.

Reviewing the Stack Overflow posted above, the sqlite3 shell supports an extension named generate_series which the documentation says is faster, so you could try that too (since you are already using the sqlite3 shell):

#!/bin/sh

sqlite3 pure_sql.db <<'END_SQL'
insert into user (area, age, active)
select abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1)
from generate_series(0, 100000000);
END_SQL

@asg017
Copy link

asg017 commented Nov 4, 2021

generate_series is definitely faster than a recursive query, about 2x in my test.

Using generate_series:

-- load.sql
insert into user (area, age, active)
select abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1)
from generate_series(0, 100000000);
$ time sqlite3 t1.db '.read schema.sql' '.read load.sql'

real	0m55.747s
user	0m45.945s
sys	0m5.894s

Using recursive:

-- load-rec.sql
insert into user
with recursive usr(id, area, age, active) as (
    values (1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1))
    union all select id + 1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1) from usr
    where id < 100000000
)
select * from usr;
$ time sqlite3 t2.db '.read schema.sql' '.read load-rec.sql'

real	1m42.659s
user	1m35.587s
sys	0m4.820s

Adding in the pragma statements didnt make a difference in my test, in fact is slowed it down a few seconds.

Interestingly, I tried without random() calls and hardcoded some numbers, and it was even faster:

INSERT INTO user (area, age, active) 
  SELECT 
    1000000, 
    5, 
    1 
  FROM generate_series(1, 100000000, 1);
real	0m27.924s
user	0m17.080s
sys	0m6.985s

So generate_series comes out at 55 seconds, still slower than the best rust method, but seemingly faster than the other ones. And there's about 25 seconds room for improvement if there's a faster random generator!

Unsure if you're still accepting PRs, but lmk if you want to see one using generate_series!

@avinassh
Copy link
Owner

avinassh commented Nov 5, 2021

Interestingly, I tried without random() calls and hardcoded some numbers, and it was even faster:

Someone else did a flamegraph, I think some good time is being spent in random (in rust solution)

Unsure if you're still accepting PRs, but lmk if you want to see one using generate_series!

I don't have a pure SQL solution, so I will be happy to accept a PR on this since is close to python one :)

@asg017
Copy link

asg017 commented Dec 13, 2021

Created a PR for this in #23! Based on of my machine, it appears to be faster than even the fastest rust solution, but would love to see a more official benchmark run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants