A library for connecting to a remote Bookworm instance through Python.
There are two main classes to know about:
- BWQuery takes the Bookworm server URL and wraps Bookworm's JSON query format (described in the API docs). You can run a query with BWQuery.run().
- BWResults is an object holding the Bookworm results, with methods that return the results as csv, json, or a Pandas DataFrame.
There is also a set_options class, which allows the database and endpoint to be set globally.
To start:
import bwypy
jsonq = '''{
"database": "hathipd",
"method": "return_json",
"search_limits": {
"date_year": {"$gt": 1790, "$lt": 1923 }
},
"counttype": ["TextCount"],
"groups": ["date_year"]
}'''
bw = bwypy.BWQuery(json=jsonq, endpoint='https://bookworm.htrc.illinois.edu/cgi-bin/dbbindings.py')
bw.json
{'counttype': ['TextCount'],
'database': 'hathipd',
'groups': ['date_year'],
'method': 'return_json',
'search_limits': {'date_year': {'$gt': 1790, '$lt': 1923}}}
bw.groups
['date_year']
bw.search_limits
{'date_year': {'$gt': 1790, '$lt': 1923}}
bw.database
'hathipd'
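The query string above is plain JSON, so it can also be generated from a Python dict instead of being typed by hand. A minimal sketch using only the standard library; build_query is a hypothetical helper for illustration and is not part of bwypy:

```python
import json


def build_query(database, groups, search_limits=None, counttype=("TextCount",)):
    """Hypothetical helper: serialize a Bookworm-style query dict to JSON."""
    query = {
        "database": database,
        "method": "return_json",
        "search_limits": search_limits or {},
        "counttype": list(counttype),
        "groups": list(groups),
    }
    return json.dumps(query, indent=2)


# Rebuild the same query that was written out as a string above.
jsonq = build_query(
    database="hathipd",
    groups=["date_year"],
    search_limits={"date_year": {"$gt": 1790, "$lt": 1923}},
)
```

The resulting string can be passed as the json argument of BWQuery in the same way as the hand-written one.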
Query results are returned as a BWResults object:
bw.groups = ['page_count_bin', 'is_gov_doc']
bw_results = bw.run()
bw_results.json()
{'L - Between 350 and 550': {'': [563222], 'No': [30973]},
'M - Between 150 and 350': {'': [549374], 'No': [30020]},
'S - Less than 150': {'': [466445], 'No': [25737]},
'XL - Greater than 550': {'': [529501], 'No': [28435]},
'unknown': {'': [1325704], 'No': [73659]}}
bw_results.dataframe()
page_count_bin | is_gov_doc | TextCount |
---|---|---|
XL - Greater than 550 | | 529501 |
XL - Greater than 550 | No | 28435 |
unknown | | 1325704 |
unknown | No | 73659 |
L - Between 350 and 550 | | 563222 |
L - Between 350 and 550 | No | 30973 |
M - Between 150 and 350 | | 549374 |
M - Between 150 and 350 | No | 30020 |
S - Less than 150 | | 466445 |
S - Less than 150 | No | 25737 |
print(bw_results.csv())
page_count_bin,is_gov_doc,TextCount
XL - Greater than 550,,529501
XL - Greater than 550,No,28435
unknown,,1325704
unknown,No,73659
L - Between 350 and 550,,563222
L - Between 350 and 550,No,30973
M - Between 150 and 350,,549374
M - Between 150 and 350,No,30020
S - Less than 150,,466445
S - Less than 150,No,25737
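Because the CSV form is plain text, it can be consumed without pandas. A small sketch with the standard library, using two rows copied from the output above as sample input (the variable names are illustrative):

```python
import csv
import io

# Two rows copied from the CSV output shown above.
csv_text = """page_count_bin,is_gov_doc,TextCount
XL - Greater than 550,,529501
XL - Greater than 550,No,28435
"""

# DictReader uses the header row as keys for each record.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Counts come back as strings, so convert them explicitly before summing.
total = sum(int(row["TextCount"]) for row in rows)
```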
bw_results.tolist()
[{'TextCount': 529501,
'is_gov_doc': '',
'page_count_bin': 'XL - Greater than 550'},
{'TextCount': 28435,
'is_gov_doc': 'No',
'page_count_bin': 'XL - Greater than 550'},
{'TextCount': 1325704, 'is_gov_doc': '', 'page_count_bin': 'unknown'},
{'TextCount': 73659, 'is_gov_doc': 'No', 'page_count_bin': 'unknown'},
{'TextCount': 563222,
'is_gov_doc': '',
'page_count_bin': 'L - Between 350 and 550'},
{'TextCount': 30973,
'is_gov_doc': 'No',
'page_count_bin': 'L - Between 350 and 550'},
{'TextCount': 549374,
'is_gov_doc': '',
'page_count_bin': 'M - Between 150 and 350'},
{'TextCount': 30020,
'is_gov_doc': 'No',
'page_count_bin': 'M - Between 150 and 350'},
{'TextCount': 466445,
'is_gov_doc': '',
'page_count_bin': 'S - Less than 150'},
{'TextCount': 25737,
'is_gov_doc': 'No',
'page_count_bin': 'S - Less than 150'}]
bw_results.tuples()
[('XL - Greater than 550', '', 529501),
('XL - Greater than 550', 'No', 28435),
('unknown', '', 1325704),
('unknown', 'No', 73659),
('L - Between 350 and 550', '', 563222),
('L - Between 350 and 550', 'No', 30973),
('M - Between 150 and 350', '', 549374),
('M - Between 150 and 350', 'No', 30020),
('S - Less than 150', '', 466445),
('S - Less than 150', 'No', 25737)]
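The nested JSON and the flat tuples carry the same information: each level of nesting corresponds to one entry in groups, and the innermost list holds one count per counttype. The sketch below illustrates that relationship with a hypothetical flatten function; it is an illustration of the data layout, not bwypy's own implementation.

```python
def flatten(nested, depth):
    """Flatten `depth` levels of group keys into (key, ..., count, ...) tuples."""
    if depth == 0:
        # At the leaves, `nested` is the list of counts, one per counttype.
        return [tuple(nested)]
    rows = []
    for key, child in nested.items():
        for row in flatten(child, depth - 1):
            rows.append((key,) + row)
    return rows


# Excerpt of the bw_results.json() output shown earlier.
nested = {
    "XL - Greater than 550": {"": [529501], "No": [28435]},
    "unknown": {"": [1325704], "No": [73659]},
}

# Two groups were requested: page_count_bin and is_gov_doc.
rows = flatten(nested, depth=2)
```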
Rather than passing in an already constructed JSON query, you can use BWQuery to build one from scratch.
An endpoint and database are required, at minimum.
newq = bwypy.BWQuery()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
NameError: No endpoint. Provide to BWQuery on initialization or set globally.
newq = bwypy.BWQuery(database='hathipd', endpoint='https://bookworm.htrc.illinois.edu/cgi-bin/dbbindings.py')
newq.json
{'compare_limits': [],
'counttype': ['TextCount', 'WordCount'],
'database': 'hathipd',
'groups': [],
'method': 'return_json',
'search_limits': {},
'words_collation': 'Case_Sensitive'}
newq.run().dataframe()
| | TextCount | WordCount |
|---|---|---|
| 0 | 4552862 | 7.328341e+11 |
newq.groups
[]
newq.groups = ['foo']
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
KeyError: 'The following groups are not supported in this BW: foo'
newq.groups = ['publication_country']
newq.run().dataframe()
publication_country | TextCount | WordCount |
---|---|---|
No place, unknown, or undetermined | 8144 | 1.137803e+09 |
United Kingdom Misc. Islands | 2 | 4.984400e+04 |
Australia | 799 | 2.968196e+08 |
United States | 1962339 | 3.212052e+11 |
Wales | 41 | 1.241756e+07 |
England | 10656 | 1.831402e+09 |
unknown | 1937740 | 2.954869e+11 |
Latvia | 64 | 1.987412e+07 |
Northern Ireland | 10 | 2.987728e+06 |
Scotland | 863 | 1.483882e+08 |
Soviet Socialist Republic | 152 | 1.947023e+07 |
United Kingdom | 536001 | 9.987234e+10 |
Canada | 78663 | 9.747256e+09 |
Russian S.F.S.R. | 3427 | 8.390290e+08 |
South Australia | 29 | 4.146067e+06 |
Victoria | 50 | 1.064348e+07 |
Estonia | 25 | 4.242192e+06 |
New South Wales | 5 | 5.819920e+05 |
Georgian S.S.R. | 0 | 0.000000e+00 |
Ukraine | 59 | 7.314677e+06 |
Soviet Union | 13784 | 2.186202e+09 |
Tasmania | 1 | 8.426800e+04 |
Lithuania | 8 | 9.289520e+05 |
Since you are unlikely to be switching databases or endpoints often, these settings can be set globally with set_options:
bwypy.set_options(endpoint='https://bookworm.htrc.illinois.edu/cgi-bin/dbbindings.py',
database='global')
bwypy.BWQuery(verify_fields=False).database
'global'
Or in a with block:
with bwypy.set_options(endpoint='https://bookworm.htrc.illinois.edu/cgi-bin/dbbindings.py', database='with_block'):
    bw = bwypy.BWQuery(verify_fields=False)
bw.database
'with_block'
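One way a single class can support both the plain call and the with block is to apply the options when the object is created and restore the previous values when the block exits. The following is a hypothetical sketch of that pattern, not bwypy's actual code; the _options dict and the option names are assumptions made for illustration.

```python
class set_options:
    """Hypothetical sketch: apply options on creation, restore on `with` exit."""

    _options = {"endpoint": None, "database": None}  # shared defaults

    def __init__(self, **kwargs):
        unknown = set(kwargs) - set(self._options)
        if unknown:
            raise KeyError("Unknown options: " + ", ".join(sorted(unknown)))
        # Remember the previous values so a with block can undo the change.
        self._old = {key: set_options._options[key] for key in kwargs}
        set_options._options.update(kwargs)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        set_options._options.update(self._old)


set_options(database="global")            # plain call: stays in effect
with set_options(database="with_block"):  # with block: temporary override
    inside = set_options._options["database"]
after = set_options._options["database"]
```

Here inside holds the temporary value and after holds the value that was set globally before the block.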
The priority for variables is:
- set with an init argument
- set within the query json (for database)
- set within a with block with set_options
- set globally with set_options
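The priority order can be read as "take the first source that provides a value". A minimal sketch of that lookup, where resolve and its parameters are hypothetical names chosen for illustration rather than part of bwypy:

```python
def resolve(name, init_value=None, query=None, with_options=None,
            global_options=None):
    """Return the first value found, from highest to lowest priority."""
    candidates = [
        init_value,                        # 1. init argument
        (query or {}).get(name),           # 2. query JSON (database only)
        (with_options or {}).get(name),    # 3. set_options in a with block
        (global_options or {}).get(name),  # 4. global set_options
    ]
    for value in candidates:
        if value is not None:
            return value
    raise NameError("No %s. Provide on initialization or set globally." % name)


# The query JSON outranks the global setting when no init argument is given.
db = resolve(
    "database",
    query={"database": "hathipd"},
    global_options={"database": "global"},
)
```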
Parser for getAvailableFields, used internally on initialization if verify_fields=True:
bw = bwypy.BWQuery(json=jsonq, endpoint='https://bookworm.htrc.illinois.edu/cgi-bin/dbbindings.py')
bw.fields()
| | anchor | dbname | description | name | tablename | type |
|---|---|---|---|---|---|---|
| 0 | bookid | lc_classes | | lc_classes | lc_classesLookup | character |
| 1 | bookid | lc_subclasses | | lc_subclasses | lc_subclassesLookup | character |
| 2 | bookid | fiction_nonfiction | | fiction_nonfiction | fiction_nonfictionLookup | character |
| 3 | bookid | genres | | genres | genresLookup | character |
| 4 | bookid | languages | | languages | languagesLookup | character |
| 5 | bookid | format | | format | formatLookup | character |
| 6 | bookid | is_gov_doc | | is_gov_doc | is_gov_docLookup | character |
| 7 | bookid | page_count_bin | | page_count_bin | page_count_binLookup | character |
| 8 | bookid | word_count_bin | | word_count_bin | word_count_binLookup | character |
| 9 | bookid | publication_country | | publication_country | publication_countryLookup | character |
| 10 | bookid | publication_state | | publication_state | publication_stateLookup | character |
| 11 | bookid | publication_place | | publication_place | publication_placeLookup | character |
| 12 | bookid | date_year | | date_year | fastcat | integer |
Return all possible values for the field.
bw.field_values(field='lc_classes')
['unknown',
'Language and Literature',
'General and Old World History',
'Social Sciences',
'Science',
'Philosophy, Psychology, and Religion',
'Law',
'Technology',
'General Works',
'History of the United States and British, Dutch, French, and Latin America',
'Political Science',
'Agriculture',
'History of America',
'Education',
'Bibliography, Library Science, and General Information Resources',
'Medicine',
'Fine Arts',
'Geography, Anthropology, and Recreation',
'Music',
'Auxiliary Sciences of History',
'Military Science',
'Naval Science']
bw.field_values(field='is_gov_doc')
['', 'No']
If BWQuery was initialized without turning off verify_fields, or if the fields method was run at any point, it will check queries against the known fields for that database.
Much of the time, field validation throws an automatic error:
bw = bwypy.BWQuery(json=jsonq, endpoint='https://bookworm.htrc.illinois.edu/cgi-bin/dbbindings.py')
bw.search_limits = { 'fake_field': 'whatever_value'}
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
KeyError: 'The following search_limit fields are not supported in this BW: fake_field'
There are some ways to set values that bypass validation, such as modifying a dict in place. In those cases, the next time validation runs and fails, the query is reverted to its last valid version.
# In-place modification of the dict skips the validation done on assignment
bw.search_limits['date_year_wrong'] = 1
print("Uh oh, we got a bad field set! -- ", bw.search_limits)
try:
    bw._validate()
except Exception:
    print("But it reverted after a failure! -- ", bw.search_limits)
Uh oh, we got a bad field set! -- {'date_year': {'$lt': 1923, '$gt': 1790}, 'date_year_wrong': 1}
But it reverted after a failure! -- {'date_year': {'$lt': 1923, '$gt': 1790}}
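The revert-on-failure behaviour can be pictured as keeping a copy of the last state that passed validation. The sketch below shows the idea with a hypothetical CheckedQuery class and a hard-coded list of known fields; it is an assumption about the general technique, not a copy of how bwypy does it.

```python
import copy


class CheckedQuery:
    """Hypothetical sketch: keep the last valid state, restore it on failure."""

    def __init__(self, search_limits, known_fields):
        self.known_fields = set(known_fields)
        self.search_limits = search_limits
        self._last_good = copy.deepcopy(search_limits)
        self.validate()

    def validate(self):
        bad = sorted(set(self.search_limits) - self.known_fields)
        if bad:
            # Roll back before raising, so the object stays usable afterwards.
            self.search_limits = copy.deepcopy(self._last_good)
            raise KeyError("Unsupported search_limit fields: " + ", ".join(bad))
        # State is valid: remember it as the new rollback point.
        self._last_good = copy.deepcopy(self.search_limits)


q = CheckedQuery({"date_year": {"$gt": 1790}}, known_fields=["date_year"])
q.search_limits["date_year_wrong"] = 1  # in-place change, not validated yet
try:
    q.validate()
except KeyError:
    pass  # q.search_limits has been rolled back at this point
```

A deep copy is used so that later in-place edits to nested dicts cannot alter the saved rollback point.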
Checking allowable fields means an extra call to the database. If you know the schema already, just turn off verify_fields.
%%time
bwypy.BWQuery(json=jsonq, endpoint='https://bookworm.htrc.illinois.edu/cgi-bin/dbbindings.py', verify_fields=False)
Wall time: 0 ns
<bwypy.core.BWQuery at 0x1fb45630358>