2024.0.0 (#78)
* Move schemas around for easier packaging

* Various conf changes

* Fix drill/clickhouse

* Add baseball.computer note
droher authored Dec 14, 2023
1 parent 0df9d3f commit 3ff33bb
Showing 19 changed files with 263 additions and 180 deletions.
9 changes: 5 additions & 4 deletions .env
Original file line number Diff line number Diff line change
@@ -1,7 +1,8 @@
CHADWICK_VERSION=v0.9.5
BASEBALLDATABANK_VERSION=ccb3cef05e68f0085db4ada6d4a9ebab9435b452
RETROSHEET_VERSION=48334a58f7446d59746d81aa73c3e9fa9b2676e9
RETROSHEET_VERSION=8449632be02cdf743932600f3218d77e059d5c91
CHADWICK_VERSION=aff8d779500da16521542e084c35cc3e159fd536
BASEBALLDATABANK_VERSION=28169eaf9007200d7f51160713c647eac64f9aa8

EXTRACT_DIR=extract
REPO=doublewick/boxball
VERSION=2023.0.0
VERSION=2024.0.0
BUILD_ENV=prod
6 changes: 6 additions & 0 deletions README.md
@@ -18,6 +18,12 @@
<br>
</p>

**Update**: I have released a new project, [baseball.computer](https://baseball.computer), which is designed
as the successor to boxball. It is much easier to use (no Docker required, runs entirely in your browser/program)
and includes many more tables, features, and quality controls. The event schema is different, which will be the main
pain point in migration. _I aim to continue Boxball maintenance and updates as long as people are still using it,_ and I may try to rebase
boxball on top of the new project to make maintaining both easier. Please let me know if there are things you can do in Boxball that you can't do yet in baseball.computer by filing an issue on the [repo](https://github.com/droher/baseball.computer) or reaching me at [email protected].

## Introduction
**Boxball** creates prepopulated databases of the two most significant open source baseball datasets:
[Retrosheet](http://retrosheet.org) and the [Baseball Databank](https://github.com/chadwickbureau/baseballdatabank).
6 changes: 2 additions & 4 deletions docker-compose.yml
@@ -58,8 +58,7 @@ x-clickhouse:
x-drill:
&drill
build:
context: load/drill
dockerfile: ../Dockerfile
context: load
target: drill
platforms:
- "linux/amd64"
@@ -126,8 +125,7 @@ x-mysql:
x-sqlite:
&sqlite
build:
context: load/sqlite
dockerfile: ../Dockerfile
context: load
target: sqlite
platforms:
- "linux/amd64"
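The compose change above replaces the per-database build contexts (`load/drill`, `load/sqlite`) with a single `load` context whose images are selected by `target`. A minimal sketch of the pattern, with illustrative service names rather than the actual boxball config:

```yaml
# docker-compose.yml (sketch): one multi-stage Dockerfile, many images
services:
  drill:
    build:
      context: load        # single shared build context
      target: drill        # named stage in load/Dockerfile
  sqlite:
    build:
      context: load
      target: sqlite
```

Each `target` must match a `FROM ... as <name>` stage in `load/Dockerfile`; Docker builds only the stages the selected target depends on, so every service still gets a lean, database-specific image.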
9 changes: 5 additions & 4 deletions extract/Dockerfile
@@ -2,7 +2,7 @@ ARG BUILD_ENV
ARG RETROSHEET_IMAGE=get-retrosheet-${BUILD_ENV}
ARG BASEBALLDATABANK_IMAGE=get-baseballdatabank-${BUILD_ENV}

FROM python:3.11-alpine3.17 AS build-common
FROM python:3.11-alpine3.19 AS build-common
RUN apk add --no-cache \
parallel \
libtool \
@@ -22,14 +22,15 @@ ENV PYTHONPATH="/"
# `prod` gets the full datasets, while `test` provides fixtures with small sample data for each file
FROM build-common as get-retrosheet-prod
ARG RETROSHEET_VERSION
RUN wget https://github.com/droher/retrosheet/archive/${RETROSHEET_VERSION}.zip -O retrosheet.zip
RUN wget https://github.com/droher/retrosheet-mirror/archive/${RETROSHEET_VERSION}.zip -O retrosheet.zip

FROM build-common as get-retrosheet-test
COPY fixtures/raw/retrosheet.zip .

FROM build-common as get-baseballdatabank-prod
ARG BASEBALLDATABANK_VERSION
RUN wget https://github.com/chadwickbureau/baseballdatabank/archive/${BASEBALLDATABANK_VERSION}.zip -O baseballdatabank.zip
# Temporarily grab from old fork until 2023 data appears
RUN wget https://github.com/tom-719/baseballdatabank/archive/${BASEBALLDATABANK_VERSION}.zip -O baseballdatabank.zip

FROM build-common as get-baseballdatabank-test
COPY fixtures/raw/baseballdatabank.zip .
@@ -71,7 +72,7 @@ RUN python -u /parsers/baseballdatabank.py


# Use a skinny build for deployment
FROM alpine:3.9.3
FROM alpine:3.19.0
RUN apk add zstd
WORKDIR /extract
COPY --from=extract-baseballdatabank /parsed ./baseballdatabank
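The `extract/Dockerfile` hunks above rely on build-arg-driven stage selection: `BUILD_ENV` decides whether the data stage downloads the full dataset (`prod`) or copies a small checked-in fixture (`test`). A simplified sketch of the mechanism; the URL and stage contents are illustrative:

```dockerfile
# ARGs declared before the first FROM may be used in later FROM lines
ARG BUILD_ENV
ARG RETROSHEET_IMAGE=get-retrosheet-${BUILD_ENV}

FROM alpine:3.19 AS build-common

FROM build-common AS get-retrosheet-prod
# prod: fetch the full dataset at a pinned version
RUN wget https://example.com/retrosheet.zip -O retrosheet.zip

FROM build-common AS get-retrosheet-test
# test: use a small fixture instead of the real download
COPY fixtures/raw/retrosheet.zip .

# Resolves to get-retrosheet-prod or get-retrosheet-test at build time
FROM ${RETROSHEET_IMAGE} AS extract-retrosheet
```

Building with `--build-arg BUILD_ENV=test` swaps in the fixture stage without touching the rest of the file, which is how the repo keeps CI fast while prod builds pull the pinned dataset versions from `.env`.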
41 changes: 28 additions & 13 deletions extract/parsers/retrosheet.py
@@ -3,6 +3,7 @@
import sys
from functools import lru_cache
from pathlib import Path
import shutil

import fileinput
from typing import Callable, Set
@@ -15,8 +16,8 @@
RETROSHEET_PATH = Path("retrosheet")
CODE_TABLES_PATH = Path("code_tables")

RETROSHEET_SUBDIRS = "gamelog", "schedule", "misc", "rosters", "event"
EVENT_FOLDERS = "asg", "post", "regular"
RETROSHEET_SUBDIRS = "gamelogs", "schedules", "rosters"
EVENT_FOLDERS = "allstar", "postseason", "events"

PARSE_FUNCS = {
"daily": "cwdaily -q -y {year} {year}*",
@@ -112,47 +113,61 @@ def concat_files(input_path: Path, output_file: Path, glob: str = "*",
prepend_filename: bool = False,
strip_header: bool = False,
check_dupes: bool = True):
files = (f for f in input_path.glob(glob) if f.is_file())
files = [f for f in input_path.glob(glob) if f.is_file()]
if not files:
raise ValueError(f"No files found under {input_path} with glob {glob}")
with open(output_file, 'wt') as fout, fileinput.input(files) as fin:
lines = set()
for line in fin:
year = Path(fin.filename()).stem[-4:]
# Remove DOS EOF character (CRTL+Z)
new_line = line.strip(DOS_EOF)
original_line = new_line
if not new_line or new_line.isspace():
continue
if fin.isfirstline() and strip_header:
continue
if prepend_filename:
year = Path(fin.filename()).stem[-4:]
new_line = "{},{}".format(year, new_line)
new_line = f"{year},{new_line}"
if new_line in lines:
print("Duplicate row in {}: {}".format(fin.filename(), new_line), file=sys.stderr)
print(f"Duplicate row in {fin.filename()}: {original_line.strip()}")
continue
# TODO: Fix NLB roster file shape in raw data
if "roster" in output_file.name and len(new_line.split(",")) == 7:
print(f"Fixing row in file {fin.filename()} with missing data: " + original_line.strip())
new_line = new_line.strip() + ","
elif "roster" in output_file.name and len(new_line.split(",")) < 7:
print(f"Skipping row in file {fin.filename()} with missing data: " + original_line.strip())
continue
if check_dupes:
lines.add(new_line)
fout.write(new_line)
return compress(output_file, OUTPUT_PATH)
fout.write(new_line.strip() + "\n")
return compress(output_file, OUTPUT_PATH)

retrosheet_base = Path(RETROSHEET_PATH)
output_base = Path(OUTPUT_PATH)
output_base.mkdir(exist_ok=True)
subdirs = {subdir: retrosheet_base / subdir for subdir in RETROSHEET_SUBDIRS}

print("Writing simple files...")
concat_files(subdirs["gamelog"], output_base / "gamelog.csv", glob="*.TXT", check_dupes=False)
concat_files(subdirs["schedule"], output_base / "schedule.csv", glob="*.TXT")
concat_files(subdirs["misc"], output_base / "park.csv", glob="parkcode.txt", strip_header=True)
concat_files(subdirs["gamelogs"], output_base / "gamelog.csv", glob="gl*.txt", check_dupes=False)
# TODO: Figure out how to integrate 2020-orig (leave out for now)
concat_files(subdirs["schedules"], output_base / "schedule.csv", glob="*schedule.csv", strip_header=True)
concat_files(retrosheet_base, output_base / "park.csv", glob="ballparks.csv", strip_header=True)
concat_files(retrosheet_base, output_base / "bio.csv", glob="biofile.csv", strip_header=True)
concat_files(subdirs["rosters"], output_base / "roster.csv", glob="*.ROS", prepend_filename=True)

@staticmethod
def parse_event_types(use_parallel=True) -> None:
def parse_events(output_type: str, clean_func: Callable = None):
event_base = RETROSHEET_PATH / "event"
event_base = RETROSHEET_PATH
output_file = OUTPUT_PATH.joinpath(output_type).with_suffix(".csv")
command_template = PARSE_FUNCS[output_type]
f_out_inflated = open(output_file, 'w')
for folder in EVENT_FOLDERS:
print(output_type, folder)
# Copy (not move) all teamfiles to each subdir
for teamfile in event_base.glob("teams/TEAM*"):
shutil.copy(teamfile, event_base.joinpath(folder))
data_path = event_base.joinpath(folder)
years = {re.match("[0-9]{4}", f.stem)[0] for f in data_path.iterdir()
if re.match("[0-9]{4}", f.stem)}
2 changes: 1 addition & 1 deletion extract/requirements.txt
@@ -1,2 +1,2 @@
pyhumps==1.6.1
zstandard==0.15.2
zstandard==0.22.0
16 changes: 9 additions & 7 deletions load/Dockerfile
@@ -1,15 +1,17 @@
ARG VERSION
FROM doublewick/boxball:ddl-${VERSION} as ddl
FROM doublewick/boxball:csv-${VERSION} as csv
FROM doublewick/boxball:parquet-${VERSION} as parquet

FROM yandex/clickhouse-server:22.9.7.34 as clickhouse
FROM clickhouse/clickhouse-server:23.11.2.11 as clickhouse
COPY z_load.sh /docker-entrypoint-initdb.d/
COPY --chown=clickhouse:clickhouse --from=ddl /ddl/clickhouse.sql /docker-entrypoint-initdb.d/
COPY --chown=clickhouse:clickhouse --from=parquet /transform/parquet /data

FROM drill/apache-drill:1.17.0 as drill
COPY --from=parquet /transform/parquet /data

FROM mysql:8.0.31-debian as mysql
FROM mysql:8.0.35-debian as mysql
ENV MYSQL_ALLOW_EMPTY_PASSWORD=yes
COPY my.cnf /etc/mysql/conf.d/
COPY A_unzip_csvs.sh z_remove_csvs.sh /docker-entrypoint-initdb.d/
@@ -19,15 +21,15 @@ RUN apt-get update && apt-get install -y --no-install-recommends zstd zip && \
COPY --chown=mysql:mysql --from=ddl /ddl/mysql.sql /docker-entrypoint-initdb.d/
COPY --chown=mysql:mysql --from=csv /transform/csv /data

FROM postgres:15.1 as postgres
FROM postgres:16.1-bookworm as postgres
RUN apt-get update && apt-get install -y --no-install-recommends zstd zip && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
COPY A_build_conf.sql z_run_conf.sql /docker-entrypoint-initdb.d/
COPY --chown=postgres:postgres --from=ddl /ddl/postgres.sql /docker-entrypoint-initdb.d/
COPY --chown=postgres:postgres --from=csv /transform/csv /data

FROM postgres:13.2 as postgres-cstore-fdw-build
FROM postgres:13.13-bookworm as postgres-cstore-fdw-build
RUN apt-get update && apt-get install -y --no-install-recommends postgresql-server-dev-13 build-essential zstd libprotobuf-c-dev protobuf-c-compiler wget ca-certificates unzip make gcc libpq-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
@@ -45,7 +47,7 @@ RUN cat /docker-entrypoint-initdb.d/postgres_cstore_fdw.sql

FROM postgres-cstore-fdw-build as postgres-cstore-fdw

FROM alpine:3.17 as sqlite-build
FROM alpine:3.19.0 as sqlite-build
RUN apk add --no-cache \
zstd \
sqlite
@@ -60,10 +62,10 @@ RUN echo "Decompressing fies..." && \
zstd --rm boxball.db


FROM python:3.11-alpine3.17 AS sqlite
FROM python:3.11-alpine3.19 AS sqlite
RUN apk add --no-cache \
zstd \
sqlite
RUN pip install sqlite-web==0.4.1
COPY --from=build boxball.db.zst /tmp/
COPY --from=sqlite-build boxball.db.zst /tmp/
ENTRYPOINT zstd --rm -d /tmp/boxball.db.zst -fo /db/boxball.db && sqlite_web -H 0.0.0.0 -x /db/boxball.db
2 changes: 1 addition & 1 deletion load/postgres_cstore_fdw/Dockerfile
@@ -2,7 +2,7 @@ ARG VERSION
FROM doublewick/boxball:ddl-${VERSION} as ddl
FROM doublewick/boxball:csv-${VERSION} as csv

FROM postgres:13.2 as build
FROM postgres:13.13-bookworm as build
RUN apt-get update && apt-get install -y --no-install-recommends postgresql-server-dev-13 build-essential zstd libprotobuf-c-dev protobuf-c-compiler wget ca-certificates unzip make gcc libpq-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
2 changes: 1 addition & 1 deletion tests/test_transform.py
@@ -2,7 +2,7 @@
from pathlib import Path

from src import OUTPUT_PATH
from src.schemas import retrosheet_metadata, baseballdatabank_metadata, all_metadata
from src.boxball_schemas import retrosheet_metadata, baseballdatabank_metadata, all_metadata
from src.ddl_factories import all_factories
from src.parquet import write_files, PARQUET_PREFIX

2 changes: 1 addition & 1 deletion transform/csv.Dockerfile
@@ -3,5 +3,5 @@
ARG VERSION
FROM doublewick/boxball:extract-${VERSION} as extract

FROM alpine:3.9.3
FROM alpine:3.19.0
COPY --from=extract /extract /transform/csv
4 changes: 2 additions & 2 deletions transform/ddl.Dockerfile
@@ -1,4 +1,4 @@
FROM python:3.11-slim-bullseye AS build-common
FROM python:3.11-slim-bookworm AS build-common
COPY requirements.txt .
RUN pip install -r requirements.txt
ENV PYTHONPATH="/"
@@ -7,5 +7,5 @@ COPY src/ src/
FROM build-common as build-ddl
RUN python -u src/ddl_maker.py

FROM alpine:3.9.3
FROM alpine:3.19.0
COPY --from=build-ddl /ddl /ddl
2 changes: 1 addition & 1 deletion transform/requirements.txt
@@ -2,4 +2,4 @@ SQLAlchemy==1.3.23
sqlalchemy-fdw==0.3.0
clickhouse-sqlalchemy==0.1.5
pyarrow==14.0.1
zstandard==0.17.0
zstandard==0.22.0
6 changes: 6 additions & 0 deletions transform/src/boxball_schemas/__init__.py
@@ -0,0 +1,6 @@
from typing import List
from sqlalchemy import MetaData
from .retrosheet import metadata as retrosheet_metadata
from .baseballdatabank import metadata as baseballdatabank_metadata

all_metadata: List[MetaData] = [baseballdatabank_metadata, retrosheet_metadata]
File renamed without changes.
