Switch to DuckDB #146

Open
wants to merge 44 commits into
base: master
44 commits
4e661de
Allowing binding to 0.0.0.0 instead of 127.0.0.1
bmschmidt Oct 14, 2020
5134324
change user to allow docker
bmschmidt Oct 14, 2020
27551d3
config file tweak
bmschmidt Oct 14, 2020
70cfbc7
switch to re from regex
bmschmidt Oct 14, 2020
1bc1347
handle remote hosts in API better
bmschmidt Oct 14, 2020
2769c2c
handling of remove mysql servers
bmschmidt Oct 14, 2020
6165481
Allow password files
bmschmidt Oct 19, 2020
89ab79a
Better field name validation
bmschmidt Feb 3, 2021
372a392
Aesthetic changes, document pre-tokenized data
bmschmidt Feb 3, 2021
e69e9ba
More informative error message
bmschmidt Feb 3, 2021
f81108e
Restore folder ingest of multiple txt files; Housekeeping
bmschmidt Feb 3, 2021
ef739d1
Merge branches
bmschmidt Feb 3, 2021
75d18b6
Make remove file tree
bmschmidt Feb 3, 2021
d1adf0d
Fix input.txt parsing
bmschmidt Feb 3, 2021
e2117b2
Raise error on invalid catalog entries
bmschmidt Feb 3, 2021
b00e6a9
changes to tokenization pipeline internals
bmschmidt Feb 5, 2021
f0bd78e
Add default parsing of ISO date fields where not specified at yearly …
bmschmidt Feb 8, 2021
c086b5c
better testing
Mar 7, 2021
0256a20
Add caching and API extensions to support it.
Mar 7, 2021
f9d6b17
register json type for json_c
bmschmidt Mar 7, 2021
088eeae
More error handling on read stage
bmschmidt Mar 7, 2021
1d52bfc
More explicit in proofing user queries.
bmschmidt Mar 7, 2021
79282dc
throw more errors
bmschmidt Mar 7, 2021
a6153ec
Reconcile merges
bmschmidt Mar 7, 2021
4924755
Fix cache trimming
bmschmidt Mar 7, 2021
0ea5871
'push' is js, not python. Ugh.
bmschmidt Mar 7, 2021
dc1ec67
explain wsgi changes
bmschmidt Apr 25, 2021
29e060f
Preliminarily working duckdb fetches. No build yet.
bmschmidt Apr 29, 2021
fc04100
Closer to DuckDB, interim patch
May 14, 2021
52b8c66
Integrate nonconsumptive, ~50% passage of the old query API tests.
May 17, 2021
a11204a
Refactorings
bmschmidt May 24, 2021
3d908e8
Most tests fixed, one weird unicode problem remaining
bmschmidt May 25, 2021
a552229
Logging refactor, shift nc input expectations
bmschmidt May 29, 2021
81fc5d9
Passes all existing tests!
bmschmidt Jun 10, 2021
e10b68e
Align to nc
bmschmidt Jun 18, 2021
bbedb0d
broken, waiting on upstream fix in duckdb
bmschmidt Jun 18, 2021
f0c7413
Link to latest nonconsumptive version
bmschmidt Jul 10, 2021
86372e6
full nonconsumptive integration
bmschmidt Jul 14, 2021
fba5d49
Add API method to list all bookworms on endpoint
bmschmidt Jul 18, 2021
53ff321
Change json standard to something more modern
bmschmidt Aug 2, 2021
433d2fc
bigrams partially
bmschmidt Aug 3, 2021
e370fde
Improved time handling
bmschmidt Aug 20, 2021
ec01df0
Use duckdb to sort the big tables on ingest instead of custom sort ro…
bmschmidt Sep 10, 2021
3d785d6
Change sorting strategy to a more scalable version of quacksort
bmschmidt Sep 27, 2021
114 changes: 114 additions & 0 deletions .vscode/.ropeproject/config.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,114 @@
# The default ``config.py``
# flake8: noqa


def set_prefs(prefs):
    """This function is called before opening the project"""

    # Specify which files and folders to ignore in the project.
    # Changes to ignored resources are not added to the history and
    # VCSs. Also they are not returned in `Project.get_files()`.
    # Note that ``?`` and ``*`` match all characters but slashes.
    # '*.pyc': matches 'test.pyc' and 'pkg/test.pyc'
    # 'mod*.pyc': matches 'test/mod1.pyc' but not 'mod/1.pyc'
    # '.svn': matches 'pkg/.svn' and all of its children
    # 'build/*.o': matches 'build/lib.o' but not 'build/sub/lib.o'
    # 'build//*.o': matches 'build/lib.o' and 'build/sub/lib.o'
    prefs['ignored_resources'] = ['*.pyc', '*~', '.ropeproject',
                                  '.hg', '.svn', '_svn', '.git', '.tox']

    # Specifies which files should be considered python files. It is
    # useful when you have scripts inside your project. Only files
    # ending with ``.py`` are considered to be python files by
    # default.
    # prefs['python_files'] = ['*.py']

    # Custom source folders: By default rope searches the project
    # for finding source folders (folders that should be searched
    # for finding modules). You can add paths to that list. Note
    # that rope guesses project source folders correctly most of the
    # time; use this if you have any problems.
    # The folders should be relative to project root and use '/' for
    # separating folders regardless of the platform rope is running on.
    # 'src/my_source_folder' for instance.
    # prefs.add('source_folders', 'src')

    # You can extend python path for looking up modules
    # prefs.add('python_path', '~/python/')

    # Should rope save object information or not.
    prefs['save_objectdb'] = True
    prefs['compress_objectdb'] = False

    # If `True`, rope analyzes each module when it is being saved.
    prefs['automatic_soa'] = True
    # The depth of calls to follow in static object analysis
    prefs['soa_followed_calls'] = 0

    # If `False` when running modules or unit tests "dynamic object
    # analysis" is turned off. This makes them much faster.
    prefs['perform_doa'] = True

    # Rope can check the validity of its object DB when running.
    prefs['validate_objectdb'] = True

    # How many undos to hold?
    prefs['max_history_items'] = 32

    # Shows whether to save history across sessions.
    prefs['save_history'] = True
    prefs['compress_history'] = False

    # Set the number spaces used for indenting. According to
    # :PEP:`8`, it is best to use 4 spaces. Since most of rope's
    # unit-tests use 4 spaces it is more reliable, too.
    prefs['indent_size'] = 4

    # Builtin and c-extension modules that are allowed to be imported
    # and inspected by rope.
    prefs['extension_modules'] = []

    # Add all standard c-extensions to extension_modules list.
    prefs['import_dynload_stdmods'] = True

    # If `True` modules with syntax errors are considered to be empty.
    # The default value is `False`; When `False` syntax errors raise
    # `rope.base.exceptions.ModuleSyntaxError` exception.
    prefs['ignore_syntax_errors'] = False

    # If `True`, rope ignores unresolvable imports. Otherwise, they
    # appear in the importing namespace.
    prefs['ignore_bad_imports'] = False

    # If `True`, rope will insert new module imports as
    # `from <package> import <module>` by default.
    prefs['prefer_module_from_imports'] = False

    # If `True`, rope will transform a comma list of imports into
    # multiple separate import statements when organizing
    # imports.
    prefs['split_imports'] = False

    # If `True`, rope will remove all top-level import statements and
    # reinsert them at the top of the module when making changes.
    prefs['pull_imports_to_top'] = True

    # If `True`, rope will sort imports alphabetically by module name instead
    # of alphabetically by import statement, with from imports after normal
    # imports.
    prefs['sort_imports_alphabetically'] = False

    # Location of implementation of
    # rope.base.oi.type_hinting.interfaces.ITypeHintingFactory In general
    # case, you don't have to change this value, unless you're an rope expert.
    # Change this value to inject you own implementations of interfaces
    # listed in module rope.base.oi.type_hinting.providers.interfaces
    # For example, you can add you own providers for Django Models, or disable
    # the search type-hinting in a class hierarchy, etc.
    prefs['type_hinting_factory'] = (
        'rope.base.oi.type_hinting.factory.default_type_hinting_factory')


def project_opened(project):
    """This function is called after opening the project"""
    # Do whatever you like here!
8 changes: 8 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,8 @@
{
    "python.testing.pytestArgs": [
        "."
    ],
    "python.testing.unittestEnabled": false,
    "python.testing.nosetestsEnabled": false,
    "python.testing.pytestEnabled": true
}
2 changes: 2 additions & 0 deletions LICENSE.md
@@ -18,3 +18,5 @@ FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Further revisions to the Python3 2020 Benjamin Schmidt
18 changes: 18 additions & 0 deletions README.md
@@ -238,6 +238,24 @@ Once this works, you can use various libraries to query the endpoint,
or create an HTML page that builds off the endpoint. See
the (currently underdeveloped) Bookworm-Vega repository for some examples.

## Pre-tokenized data

If your data has already been tokenized, you can ingest it from
feature-count files instead of `input.txt` or `input.txt.gz`.

```
bookworm --feature-counts unigrams.txt --feature-counts bigrams.txt build all
```

The format for `unigrams.txt` is a little unusual. It should contain one row
per document. Each row starts with the document identifier, followed by a tab. The rest
of the row is a CSV of word counts that uses the formfeed character (`\f`) instead of
the newline to separate records.

```
id\t{word,count csv}

```
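
As a rough illustration of the row format described above, here is a minimal Python sketch that writes and reads one such row. The helper names `encode_document` and `decode_document` are hypothetical and not part of bookworm. The sketch also assumes there is no header record, no trailing formfeed, and that standard CSV quoting is acceptable; none of that is confirmed here.

```python
import csv
import io

FORM_FEED = "\f"


def encode_document(doc_id, counts):
    """Build one row: identifier, a tab, then word,count records joined by formfeeds.

    Hypothetical helper for illustration only.
    """
    buffer = io.StringIO()
    # Use the formfeed as the record terminator instead of a newline.
    writer = csv.writer(buffer, lineterminator=FORM_FEED)
    for word, count in counts.items():
        writer.writerow([word, count])
    # Assumption: the row should not end with a dangling formfeed.
    payload = buffer.getvalue().rstrip(FORM_FEED)
    return f"{doc_id}\t{payload}"


def decode_document(row):
    """Split a row back into (identifier, {word: count}).

    Does not handle words that themselves contain a formfeed.
    """
    doc_id, payload = row.rstrip("\n").split("\t", 1)
    reader = csv.reader(payload.split(FORM_FEED))
    counts = {record[0]: int(record[1]) for record in reader if record}
    return doc_id, counts


row = encode_document("doc1", {"the": 3, "cat": 1})
print(repr(row))
print(decode_document(row))
```

Writing one such row per document, each terminated by a newline, would give a file with the shape the paragraph above describes.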

## Production servers
