[REVIEW] Switch Models to use Crossfit #58

Merged 43 commits on May 21, 2024
Commits
9f1c6fe
Working Domain Classifier with Crossfit
VibhuJawa May 9, 2024
db6fc48
Black on file
VibhuJawa May 9, 2024
b09fc93
fix model path
VibhuJawa May 9, 2024
accea76
style fixes
VibhuJawa May 9, 2024
b6615f0
style fixes
VibhuJawa May 9, 2024
3bb6b73
style fixes
VibhuJawa May 9, 2024
22c2e57
style fixes
VibhuJawa May 9, 2024
b43c906
style fixes
VibhuJawa May 9, 2024
9f6960b
Notebook with dask cuda cluster
VibhuJawa May 9, 2024
8139dc9
Notebook with dask cuda cluster
VibhuJawa May 9, 2024
507eab5
wip benchmark
VibhuJawa May 9, 2024
16f4689
First pass at switching to quality classifier
VibhuJawa May 16, 2024
c863ce0
Quality Classifier working
VibhuJawa May 16, 2024
3f20781
nb fix
VibhuJawa May 16, 2024
f80f49e
Make both classifiers work with labelling
VibhuJawa May 16, 2024
09670ba
Revert domain_api_example.py to main
VibhuJawa May 16, 2024
3423748
domain classifier
VibhuJawa May 16, 2024
509c85b
Classifier update
VibhuJawa May 16, 2024
97264b0
Added setup.py
VibhuJawa May 16, 2024
3ed74b1
Add crossfit to cpu install
VibhuJawa May 16, 2024
d3e2fad
Address Reviews
VibhuJawa May 20, 2024
399712e
Remove keep_prob_column
VibhuJawa May 20, 2024
b1e2f9a
Remove distributed_data_classification module
VibhuJawa May 20, 2024
7d7d866
Address Sarah's review about read_json
VibhuJawa May 20, 2024
8c46541
Working Domain Classifier with Crossfit
VibhuJawa May 9, 2024
3856ea9
Black on file
VibhuJawa May 9, 2024
5cffb0b
style fixes
VibhuJawa May 9, 2024
1acf345
style fixes
VibhuJawa May 9, 2024
0e42106
First pass at switching to quality classifier
VibhuJawa May 16, 2024
d010a24
Quality Classifier working
VibhuJawa May 16, 2024
a1b867d
nb fix
VibhuJawa May 16, 2024
9b1cd79
Make both classifiers work with labelling
VibhuJawa May 16, 2024
97e89a0
Revert domain_api_example.py to main
VibhuJawa May 16, 2024
c181b94
Added setup.py
VibhuJawa May 16, 2024
24d046d
Add crossfit to cpu install
VibhuJawa May 16, 2024
cc3088a
Address Reviews
VibhuJawa May 20, 2024
63af368
Fix conflicts from rebase
VibhuJawa May 20, 2024
ecd95d4
Update tutorials/distributed_data_classification/distributed_data_cla…
VibhuJawa May 20, 2024
02fe2ef
Update tutorials/distributed_data_classification/distributed_data_cla…
VibhuJawa May 20, 2024
ca713c8
Update based on reviews
VibhuJawa May 21, 2024
1f39839
Update setup.py based on reviews
VibhuJawa May 21, 2024
c6b5d9e
Add -output-file-type
VibhuJawa May 21, 2024
e1ec135
Align Model Path instead of Model File Name
VibhuJawa May 21, 2024
4 changes: 2 additions & 2 deletions docs/user-guide/DistributedDataClassification.rst
@@ -49,13 +49,13 @@ Let's see how ``DomainClassifier`` works in a small excerpt taken from ``example
"Travel_and_Transportation",
]

-model_file_name = "pytorch_model_file.pth"
+model_path = "pytorch_model_file.pth"

files = get_all_files_paths_under("books_dataset/")
input_dataset = DocumentDataset.read_json(files, backend="cudf", add_filename=True)

domain_classifier = DomainClassifier(
-model_file_name=model_file_name,
+model_path=model_path,
labels=labels,
filter_by=["Games", "Sports"],
)
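
The hunk above renames the classifier keyword from `model_file_name` to `model_path`. For callers that still build their arguments with the old name, a minimal migration sketch could look like the following. `resolve_model_path` is a hypothetical helper written for illustration; it is not part of NeMo Curator, and nothing in this PR suggests the library ships such a shim.

```python
import warnings
from typing import Any, Dict


def resolve_model_path(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of kwargs with the old `model_file_name` key renamed.

    Hypothetical helper (an assumption, not a NeMo Curator API): it only
    rewrites the keyword name and leaves every other argument untouched.
    """
    updated = dict(kwargs)
    if "model_file_name" in updated:
        if "model_path" in updated:
            # Refuse ambiguous input instead of silently picking one value.
            raise TypeError("pass either model_path or model_file_name, not both")
        warnings.warn(
            "model_file_name was renamed to model_path",
            DeprecationWarning,
            stacklevel=2,
        )
        updated["model_path"] = updated.pop("model_file_name")
    return updated
```

The returned dictionary can then be unpacked into the classifier constructor, for example `DomainClassifier(**resolve_model_path(old_kwargs))`, assuming the remaining keys already match the new signature.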
@@ -53,7 +53,7 @@ def main(args):
"Travel_and_Transportation",
]

-model_file_name = "/path/to/pytorch_model_file.pth"
+model_path = "/path/to/pytorch_model_file.pth"

# Input can be a string or list
input_file_path = "/path/to/data"
@@ -66,7 +66,7 @@ def main(args):
)

domain_classifier = DomainClassifier(
-model_file_name=model_file_name,
+model_path=model_path,
labels=labels,
filter_by=["Games", "Sports"],
)
@@ -25,20 +25,20 @@ def main(args):
global_st = time.time()

labels = ["High", "Medium", "Low"]
-model_file_name = "/path/to/pytorch_model_file.pth"
+model_path = "/path/to/pytorch_model_file.pth"

# Input can be a string or list
input_file_path = "/path/to/data"
output_file_path = "./"

client = get_client(args, cluster_type=args.device)

-input_dataset = DocumentDataset.from_json(
+input_dataset = DocumentDataset.read_json(
input_file_path, backend="cudf", add_filename=True
)

quality_classifier = QualityClassifier(
-model_file_name=model_file_name,
+model_path=model_path,
labels=labels,
filter_by=["High", "Medium"],
)
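
This hunk makes two renames: the `model_file_name` keyword becomes `model_path`, and `DocumentDataset.from_json` becomes `DocumentDataset.read_json`. As a rough sketch, downstream scripts could be updated mechanically with a small text rewrite. `migrate_source` and `_RENAMES` are hypothetical names introduced here, and the approach is plain regex substitution, so review the result by hand before committing it.

```python
import re

# Old spelling -> new spelling, taken from the two renames visible in this diff.
# Word boundaries keep longer identifiers (e.g. `my_model_file_name`) untouched.
_RENAMES = [
    (re.compile(r"\bmodel_file_name\b"), "model_path"),
    (re.compile(r"\bDocumentDataset\.from_json\("), "DocumentDataset.read_json("),
]


def migrate_source(source: str) -> str:
    """Apply both renames to a string of Python source (hypothetical helper)."""
    for pattern, replacement in _RENAMES:
        source = pattern.sub(replacement, source)
    return source
```

Because the rewrite is textual, it also renames local variables called `model_file_name`, which matches what the example scripts in this PR do.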
13 changes: 0 additions & 13 deletions nemo_curator/distributed_data_classification/__init__.py

This file was deleted.

163 changes: 0 additions & 163 deletions nemo_curator/distributed_data_classification/arg_utils.py

This file was deleted.
