Porting missing code filtering rules to dolma repo #86

Merged on Nov 27, 2023
109 commits
4da316e
testing warc
soldni Aug 28, 2023
7fc2c9c
ignore
soldni Aug 28, 2023
b35e6ee
testing slow
soldni Aug 28, 2023
4979ea7
langdetect
soldni Aug 28, 2023
b6f96c3
optional import
soldni Aug 28, 2023
bd903b2
refactoring
soldni Aug 29, 2023
8c9e00a
wip
soldni Aug 30, 2023
2d97f76
style
soldni Aug 30, 2023
d25287d
wip
soldni Sep 6, 2023
0350a5f
test
soldni Sep 7, 2023
c08bc6f
wip
soldni Sep 8, 2023
294ffca
configs
soldni Sep 20, 2023
32610e5
hash sample
soldni Sep 20, 2023
ab0d741
small improvements
soldni Sep 20, 2023
d0cde79
updated with output
soldni Sep 21, 2023
5562666
more details
soldni Sep 21, 2023
3909b7f
updated readme
soldni Sep 21, 2023
f1e463a
decon wip
soldni Sep 24, 2023
f9ed26d
new confits
soldni Sep 24, 2023
a3b08c3
taggging content
soldni Sep 24, 2023
07a745c
Merge pull request #49 from allenai/main
soldni Sep 24, 2023
ba2a413
changed name of file
soldni Sep 24, 2023
4b2fb1b
fixes
soldni Sep 24, 2023
534c3c2
deal with empty docs/local files
soldni Sep 24, 2023
0ccf67c
increased bloom size
soldni Sep 24, 2023
59559d8
configs for rest of splits
soldni Sep 25, 2023
39405e8
switching to option2
soldni Sep 25, 2023
8c2af40
forgot to do two more
soldni Sep 25, 2023
c1c5b54
finding puctuation
soldni Sep 26, 2023
abaf44d
tokenizer porting
soldni Sep 26, 2023
1363dff
configs
soldni Sep 27, 2023
637ee26
books config
soldni Sep 27, 2023
b387f6a
more sources
soldni Sep 27, 2023
a099d59
configs
soldni Sep 27, 2023
5fe9e2b
updated paths
soldni Sep 27, 2023
f7796d7
new c4
soldni Sep 27, 2023
a8ff9dc
cleaned up
soldni Sep 27, 2023
33cb671
sampling
soldni Sep 27, 2023
9cd6dcc
sample
soldni Sep 27, 2023
6b369b4
sampling
soldni Sep 28, 2023
2cebbe2
added tokenizer
soldni Sep 28, 2023
7168c8b
update all
soldni Sep 28, 2023
e60bb46
style
soldni Sep 28, 2023
cae806a
updated
soldni Sep 28, 2023
d64f225
configs
soldni Sep 28, 2023
383c1cb
tokenizer cli wip
soldni Sep 28, 2023
14b4724
cli
soldni Oct 2, 2023
1e17a8f
wip big refactor
soldni Oct 5, 2023
110eaee
fixed small bugs
soldni Oct 6, 2023
e2d3f75
tokenizer log
soldni Oct 6, 2023
9c76d04
fixed tokenizer paths
soldni Oct 6, 2023
b1d48a9
added tokenizer small
soldni Oct 6, 2023
d1659f1
fixed glob issue
soldni Oct 13, 2023
bb6d310
removed temporary directory
soldni Oct 13, 2023
f1a2f59
added todo
soldni Oct 13, 2023
126da54
conversion script
soldni Oct 13, 2023
876e9f9
more writing
soldni Oct 13, 2023
e29cb74
more docs
soldni Oct 13, 2023
034c452
more docs
soldni Oct 13, 2023
70f3982
logos
soldni Oct 13, 2023
173631a
pipelines
soldni Oct 13, 2023
da0ee2f
datasheet
soldni Oct 13, 2023
354760f
wip
soldni Oct 13, 2023
e723610
adding script to make wikipedia
soldni Oct 14, 2023
ebc3cdf
wip
soldni Oct 14, 2023
42049c4
more text
soldni Oct 14, 2023
031d8f2
more docs!
soldni Oct 14, 2023
55b8962
new examples.
soldni Oct 14, 2023
1bbf863
documentation
soldni Oct 14, 2023
18c08e2
fixed bug local file
soldni Oct 14, 2023
edf5806
Merge branch 'main' into soldni/warc
soldni Oct 15, 2023
a58387f
lint
soldni Oct 15, 2023
c7f0965
Merge branch 'main' into soldni/warc
soldni Oct 15, 2023
d797c55
added warc back
soldni Oct 27, 2023
d6c5e69
using tokens command
soldni Oct 29, 2023
49d8583
new warc loc
soldni Nov 10, 2023
7138783
Remove unused import statement in __main__.py
soldni Nov 10, 2023
c1966b3
Add multiprocessing support and improve logging in WarcProcessor
soldni Nov 10, 2023
7e04993
first commit
soldni Nov 22, 2023
101bd59
Merge branch 'main' into soldni/missing-starcoder
soldni Nov 22, 2023
486b23c
added metadata option in runtime
soldni Nov 22, 2023
51bfe60
removed files not ready for prime time
soldni Nov 22, 2023
7018e94
wrapped under wrong indent
soldni Nov 22, 2023
4fd6a1c
fixing imports
soldni Nov 22, 2023
bdca7be
all setup
soldni Nov 22, 2023
4c1e9a0
completed tests
soldni Nov 22, 2023
1b7eaa3
reformatting data
soldni Nov 22, 2023
ca3aba0
building for all
soldni Nov 22, 2023
501f3b0
added simple script to check offsets
soldni Nov 22, 2023
a65fe54
trying new install
soldni Nov 22, 2023
8e0f264
new rule to match repetitions
soldni Nov 24, 2023
4f89693
adding grouping tools
soldni Nov 24, 2023
825d0cb
repeating sequences algorithm
soldni Nov 24, 2023
f50ce07
renamed; reformatted
soldni Nov 24, 2023
121aaf6
new wandb to plot script
soldni Nov 25, 2023
1e933df
adding configs
soldni Nov 26, 2023
263b07d
more tools
soldni Nov 26, 2023
c62b0e6
adding experiments
soldni Nov 27, 2023
deb53fb
fixing CI?
soldni Nov 27, 2023
a9018a8
sharing cache
soldni Nov 27, 2023
82b0f10
invalid cache
soldni Nov 27, 2023
37541de
moving off deprecated rust toolkit
soldni Nov 27, 2023
64d0a4f
forgot to set rust channel
soldni Nov 27, 2023
70676ad
fixing cert?
soldni Nov 27, 2023
4acb8b9
Merge branch 'main' into soldni/missing-starcoder
soldni Nov 27, 2023
76a0f9b
added new tests
soldni Nov 27, 2023
117a7c5
hash code
soldni Nov 27, 2023
04391c4
fixed cache, style
soldni Nov 27, 2023
6fd9823
only install rust if cache miss
soldni Nov 27, 2023
103 changes: 66 additions & 37 deletions .github/workflows/CI.yml
@@ -20,6 +20,7 @@ permissions:
env:
DOLMA_TESTS_SKIP_AWS: ${{ secrets.AWS_ACCESS_KEY_ID == '' && 'true' || 'false' }}
DOLMA_TEST_S3_PREFIX: s3://dolma-tests
RUST_CHANNEL: stable


jobs:
@@ -38,17 +39,69 @@ jobs:
echo "PR base repo: ${{ github.event.pull_request.base.repo.full_name }}/tree/${{ github.event.pull_request.base.ref }}"
echo "PR head repo: ${{ github.event.pull_request.head.repo.full_name }}/tree/${{ github.event.pull_request.head.ref }}"

prepare-venv:
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v3

- name: Cache Virtual Env
uses: actions/cache@v3
# name for referring later
id: cache-venv
with:
# what we cache: the virtualenv
path: ./.venv/
# The cache key depends on pyproject.toml, Cargo.toml, Cargo.lock, and the Python and Rust sources
key: ${{ runner.os }}-venv-${{ hashFiles('**/pyproject.toml', '**/Cargo.toml', '**/Cargo.lock') }}--${{ hashFiles('python/**', 'src/**') }}

- name: Setup system libraries
if: steps.cache-venv.outputs.cache-hit != 'true'
run: |
sudo apt-get update
sudo apt-get install --yes --upgrade build-essential cmake protobuf-compiler libssl-dev glibc-source

- name: Install Rust toolchain
if: steps.cache-venv.outputs.cache-hit != 'true'
run: |
rustup update ${{ env.RUST_CHANNEL }}
rustup component add --toolchain ${{ env.RUST_CHANNEL }} rustfmt rust-src
rustup default ${{ env.RUST_CHANNEL }}

- name: Install Python
if: steps.cache-venv.outputs.cache-hit != 'true'
uses: actions/setup-python@v4
with:
python-version: '3.8'
architecture: "x64"
cache: 'pip'

- name: Create a new Python environment & install maturin
if: steps.cache-venv.outputs.cache-hit != 'true'
run: |
python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install maturin

- name: Install dolma wheels
if: steps.cache-venv.outputs.cache-hit != 'true'
run: |
source .venv/bin/activate
maturin build --release -i $(which python) --out dist
wheel_path=$(ls dist/*.whl)
pip install "${wheel_path}[all]"
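
Note on the cache key used by `prepare-venv`: it combines one hash over the dependency manifests with a second hash over the source trees, so the cached `.venv` is reused only when neither has changed. The following is a rough Python sketch of that idea, not GitHub's actual `hashFiles` implementation; `hash_files` and `cache_key` are hypothetical helpers written only for illustration.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Iterable


def hash_files(root: Path, patterns: Iterable[str]) -> str:
    """Digest over the contents of every file under root matching any pattern."""
    digest = hashlib.sha256()
    matched = sorted({p for pat in patterns for p in root.glob(pat) if p.is_file()})
    for path in matched:
        digest.update(hashlib.sha256(path.read_bytes()).digest())
    # an empty string stands in for "no files matched" (an assumption of this sketch)
    return digest.hexdigest() if matched else ""


def cache_key(root: Path, os_name: str = "Linux") -> str:
    """Mirror the shape '<os>-venv-<manifest hash>--<source hash>' used in the workflow."""
    manifests = hash_files(root, ["**/pyproject.toml", "**/Cargo.toml", "**/Cargo.lock"])
    sources = hash_files(root, ["python/**/*", "src/**/*"])
    return f"{os_name}-venv-{manifests}--{sources}"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "python").mkdir()
    (root / "pyproject.toml").write_text("[project]\nname = 'demo'\n")
    (root / "python" / "mod.py").write_text("x = 1\n")
    key_before = cache_key(root)

    # editing a source file changes only the second half of the key
    (root / "python" / "mod.py").write_text("x = 2\n")
    key_after = cache_key(root)
```

Because the second hash covers the source trees, any source change produces a new key, which means a cache miss and a full rebuild of the wheel in `prepare-venv`.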

tests:
runs-on: ubuntu-latest
needs: prepare-venv
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
if: ${{ github.event_name == 'pull_request' || github.event_name == 'push' }}
strategy:
fail-fast: true
matrix:
python: [3.8]
task:
- name: Check Python style
run: |
@@ -73,50 +126,23 @@ jobs:

steps:
- name: Checkout repository
uses: actions/checkout@v1
uses: actions/checkout@v3

- name: Setup system libraries
run: |
sudo apt-get update
sudo apt-get install --yes --upgrade build-essential cmake protobuf-compiler libssl-dev glibc-source

- name: Install Rust
uses: actions-rs/toolchain@v1
- name: Cache Virtual Env
uses: actions/cache@v3
# name for referring later
id: cache-venv
with:
toolchain: stable
components: rustfmt

- name: Install Python
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python }}
architecture: "x64"
sccache: true

- name: Create a new Python environment & install maturin
run: |
python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install maturin

- name: Install dolma wheels
run: |
source .venv/bin/activate
maturin develop --extras=dev
# what we cache: the virtualenv
path: ./.venv/
# The cache key depends on pyproject.toml, Cargo.toml, Cargo.lock, and the Python and Rust sources
key: ${{ runner.os }}-venv-${{ hashFiles('**/pyproject.toml', '**/Cargo.toml', '**/Cargo.lock') }}--${{ hashFiles('python/**', 'src/**') }}

- name: ${{ matrix.task.name }}
run: |
source .venv/bin/activate
${{ matrix.task.run }}

- name: Clean up
if: always()
run: |
source .venv/bin/activate
pip uninstall -y dolma



build-linux:
if: ${{ github.ref == 'refs/heads/main' || github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/tags/') }}
@@ -132,6 +158,7 @@ jobs:
- uses: actions/setup-python@v4
with:
python-version: '3.10'
cache: 'pip'
- name: Setup environment
run: |
sudo apt-get update
@@ -165,6 +192,7 @@ jobs:
with:
python-version: '3.10'
architecture: ${{ matrix.target }}
cache: 'pip'
- name: Build wheels
uses: PyO3/maturin-action@v1
with:
@@ -188,6 +216,7 @@ jobs:
- uses: actions/setup-python@v4
with:
python-version: '3.10'
cache: 'pip'
- name: Build wheels
uses: PyO3/maturin-action@v1
with:
4 changes: 4 additions & 0 deletions .gitignore
@@ -67,3 +67,7 @@ target/

# ignore vscode directory
.vscode

# ignore temporary directories
/tmp/
/temp/
2 changes: 1 addition & 1 deletion Makefile
@@ -39,7 +39,7 @@ test-rust:
rm -rf tests/work/*

develop:
maturin develop --extras=dev
maturin develop --extras=all

style:
rustfmt --edition 2021 src/*.rs
53 changes: 36 additions & 17 deletions pyproject.toml
@@ -8,22 +8,20 @@ requires-python = ">=3.8"
dependencies = [
"anyascii>=0.3.2",
"blingfire==0.1.8",
"boto3",
"boto3>=1.28",
"cached-path==1.3.4",
"detect-secrets==1.4.0",
# "fasttext==0.9.2", # broken with new version of setuptools; using fasttext-wheel instead
"fasttext-wheel==0.9.2",
"fsspec",
"fsspec>=2023.6.0",
"msgspec>=0.14.2",
"nltk==3.8.1",
"omegaconf>=2.3.0",
"presidio_analyzer==2.2.32",
"pycld2==0.41",
# "pycld3==0.22", # does not install correctly
"pyyaml",
"requests",
"rich",
"s3fs",
"s3fs>=2023.6.0",
"smart-open",
"tokenizers>=0.13.3,<1.0.0",
"tqdm",
@@ -108,18 +106,39 @@ dev = [
"flake8-pyi>=22.8.1",
"Flake8-pyproject>=1.1.0",
]
warc = [
"warcio>=1.7.4",
"trafilatura>=1.6.1",
"justext>=3.0.0",
"goose3>=3.1.17",

# following are all for speeding up trafilatura
"brotli",
"cchardet >= 2.1.7; python_version < '3.11'", # build issue
"faust-cchardet >= 2.1.18; python_version >= '3.11'", # fix for build
"htmldate[speed] >= 1.4.3",
"py3langid >= 0.2.2",
# extension to process code
code = [
"detect-secrets==1.4.0",
"beautifulsoup4>=4",
"pygments",
"regex"
]
# extension to detect PIIs using presidio
pii = [
"presidio_analyzer==2.2.32",
"regex"
]
# # extension to parse warc files
# warc = [
# "warcio>=1.7.4",
# "trafilatura>=1.6.1",
# "justext>=3.0.0",
# "goose3>=3.1.17",

# # following are all for speeding up trafilatura
# "brotli",
# "cchardet >= 2.1.7; python_version < '3.11'", # build issue
# "faust-cchardet >= 2.1.18; python_version >= '3.11'", # fix for build
# "htmldate[speed] >= 1.4.3",
# "py3langid >= 0.2.2",
# ]

# all extensions
all = [
"dolma[dev]",
"dolma[code]",
"dolma[pii]",
# "dolma[warc]",
]

[build-system]
75 changes: 70 additions & 5 deletions python/dolma/core/data_types.py
@@ -21,8 +21,10 @@ class InputSpec(Struct):
text: str
source: str = ""
version: Optional[str] = None
# ignoring metadata for now; taggers run on text only
# metadata: Optional[Dict[str, Any]] = None


class InputSpecWithMetadata(InputSpec):
metadata: Optional[Dict[str, Any]] = None


class OutputSpec(Struct):
@@ -48,17 +50,67 @@ def to_spec(self) -> InputSpec:
return InputSpec(source=self.source, version=self.version, id=self.id, text=self.text)

@classmethod
def from_json(cls, d: Dict) -> "Document":
def from_json(cls, d: Dict[str, Any]) -> "Document":
return Document(source=d["source"], version=d["version"], id=d["id"], text=d["text"])

def to_json(self) -> Dict:
def to_json(self) -> Dict[str, Any]:
return {"source": self.source, "version": self.version, "id": self.id, "text": self.text}

def __str__(self) -> str:
attributes_string = ",".join([f"{k}:{repr(v)}" for k, v in self.to_json()])
attributes_string = ",".join([f"{k}:{repr(v)}" for k, v in self.to_json().items()])
return f"{self.__class__.__name__}({attributes_string})"


class DocumentWithMetadata(Document):
__slots__ = ("metadata",)

def __init__(self, *args, metadata: Optional[Dict[str, Any]] = None, **kwargs) -> None:
super().__init__(*args, **kwargs)
self.metadata = metadata or {}

@classmethod
def from_spec(cls, spec: InputSpecWithMetadata) -> "DocumentWithMetadata":
return DocumentWithMetadata(
source=spec.source,
version=spec.version,
id=spec.id,
text=spec.text,
metadata=spec.metadata,
)

def to_spec(self) -> InputSpecWithMetadata:
return InputSpecWithMetadata(
source=self.source,
version=self.version,
id=self.id,
text=self.text,
metadata=self.metadata,
)

@classmethod
def from_json(cls, d: Dict) -> "DocumentWithMetadata":
return DocumentWithMetadata(
source=d["source"],
version=d["version"],
id=d["id"],
text=d["text"],
metadata=d["metadata"],
)

def to_json(self) -> Dict:
return {
"source": self.source,
"version": self.version,
"id": self.id,
"text": self.text,
"metadata": self.metadata,
}

def __str__(self) -> str:
repr_ = super().__str__()
return repr_.rstrip(")") + f",metadata={'...' if self.metadata else 'none'})"
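
A minimal sketch of how the new class round-trips through JSON. `Document` here is a simplified stand-in (its constructor signature is an assumption, since the diff does not show it); only the `DocumentWithMetadata` methods mirror the lines above.

```python
from typing import Any, Dict, Optional


class Document:
    """Simplified stand-in for dolma's Document; the constructor is assumed."""

    __slots__ = "source", "version", "id", "text"

    def __init__(self, source: str, id: str, text: str, version: Optional[str] = None) -> None:
        self.source = source
        self.version = version
        self.id = id
        self.text = text

    def to_json(self) -> Dict[str, Any]:
        return {"source": self.source, "version": self.version, "id": self.id, "text": self.text}


class DocumentWithMetadata(Document):
    __slots__ = ("metadata",)

    def __init__(self, *args, metadata: Optional[Dict[str, Any]] = None, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        # a missing or None metadata argument is normalized to an empty dict
        self.metadata = metadata or {}

    @classmethod
    def from_json(cls, d: Dict[str, Any]) -> "DocumentWithMetadata":
        return cls(
            source=d["source"],
            version=d["version"],
            id=d["id"],
            text=d["text"],
            metadata=d["metadata"],
        )

    def to_json(self) -> Dict[str, Any]:
        return {**super().to_json(), "metadata": self.metadata}


row = {
    "source": "example",
    "version": "v0",
    "id": "doc-1",
    "text": "print('hi')",
    "metadata": {"lang": "python"},
}
doc = DocumentWithMetadata.from_json(row)
round_tripped = doc.to_json()
```

Note that `from_json` indexes `d["metadata"]` directly, so a row without that key raises `KeyError`, whereas the constructor itself accepts a missing `metadata` argument.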


class Span:
__slots__ = "start", "end", "type", "score", "experiment", "tagger"

@@ -127,6 +179,19 @@ def __str__(self) -> str:
cls_name = self.__class__.__name__
return f"{cls_name}(start={self.start},end={self.end},type={repr(self.type)},score={self.score:.5f})"

def __repr__(self) -> str:
return str(self)

def __eq__(self, other: Any) -> bool:
if not isinstance(other, self.__class__):
return False
return (
self.start == other.start
and self.end == other.end
and self.type == other.type
and self.score == other.score
)
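
To illustrate what the new `__eq__` does and does not compare, here is a small sketch with a stand-in `Span` (the constructor is an assumption; the diff only shows `__slots__` and the comparison). Equality looks at `start`, `end`, `type`, and `score`, and ignores `experiment` and `tagger`.

```python
from typing import Any, Optional


class Span:
    """Simplified stand-in for dolma's Span; the constructor is assumed."""

    __slots__ = "start", "end", "type", "score", "experiment", "tagger"

    def __init__(
        self,
        start: int,
        end: int,
        type: str,
        score: float = 1.0,
        experiment: Optional[str] = None,
        tagger: Optional[str] = None,
    ) -> None:
        self.start = start
        self.end = end
        self.type = type
        self.score = score
        self.experiment = experiment
        self.tagger = tagger

    def __eq__(self, other: Any) -> bool:
        if not isinstance(other, self.__class__):
            return False
        return (
            self.start == other.start
            and self.end == other.end
            and self.type == other.type
            and self.score == other.score
        )


a = Span(0, 10, "code", 0.5, experiment="exp_a", tagger="tagger_a")
b = Span(0, 10, "code", 0.5, experiment="exp_b", tagger="tagger_b")
c = Span(0, 10, "code", 0.75)

same_despite_provenance = a == b  # experiment/tagger are not compared
differs_on_score = a == c
```

One side effect worth keeping in mind: in Python, a class that defines `__eq__` without also defining `__hash__` becomes unhashable, so spans like these cannot be put in a `set` or used as `dict` keys unless a `__hash__` is added.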


class DocResult:
__slots__ = "doc", "spans"
7 changes: 6 additions & 1 deletion python/dolma/core/loggers.py
@@ -1,8 +1,13 @@
import logging
import multiprocessing


def get_logger(name: str) -> logging.Logger:
name = f"dolma.{name}"
if (proc_name := multiprocessing.current_process().name) == "MainProcess":
proc_name = "main"
proc_name = proc_name.replace(" ", "_")

name = f"{proc_name}.dolma.{name}"
logger = logging.getLogger(name)
logger.setLevel(logging.WARN)
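
A small sketch of the naming scheme introduced above. `qualified_logger_name` is a hypothetical helper, not part of dolma, that isolates the name construction so it can be exercised without spawning worker processes; the `proc_name` override exists only for this illustration.

```python
import logging
import multiprocessing
from typing import Optional


def qualified_logger_name(name: str, proc_name: Optional[str] = None) -> str:
    """Hypothetical helper reproducing the name built inside get_logger."""
    if proc_name is None:
        proc_name = multiprocessing.current_process().name
    if proc_name == "MainProcess":
        proc_name = "main"
    # spaces would make logger names awkward to read, so they become underscores
    proc_name = proc_name.replace(" ", "_")
    return f"{proc_name}.dolma.{name}"


main_name = qualified_logger_name("taggers", proc_name="MainProcess")
worker_name = qualified_logger_name("taggers", proc_name="Worker 3")

logger = logging.getLogger(main_name)
logger.setLevel(logging.WARN)
```

Since the process name is now the first component, these loggers are no longer children of a single `dolma` logger in Python's dotted logging hierarchy; configuring `logging.getLogger("dolma")` alone would not reach them.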
