From 6fbc3adce6f15b5e528c5161b1e314d70a6949b4 Mon Sep 17 00:00:00 2001
From: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Date: Sat, 25 May 2024 00:12:59 +0800
Subject: [PATCH] Add Jupyter notebook tutorial for single node multilingual dataset (#30)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* Init commit for tutorial notebook
Signed-off-by: Nicole Luo

* Fix metadata inference with pandas and dask (#35)
* Fix metadata inference with pandas and dask
Signed-off-by: Ryan Wolf
* Fix datatypes for task decontamination
Signed-off-by: Ryan Wolf
* Use targeted import
Signed-off-by: Ryan Wolf
---------
Signed-off-by: Ryan Wolf
Signed-off-by: Nicole Luo

* Disable PyTorch Compile Multiprocessing (#34)
* Move tokenizer import
Signed-off-by: Ryan Wolf
* Reduce inductor threads
Signed-off-by: Ryan Wolf
* Change env int to string
Signed-off-by: Ryan Wolf
* Change location of env var
Signed-off-by: Ryan Wolf
* Add comment linking issue
Signed-off-by: Ryan Wolf
---------
Signed-off-by: Ryan Wolf
Signed-off-by: Nicole Luo

* Improve speed of AddId module (#36)
* Add fast id method
Signed-off-by: Ryan Wolf
* Add type conversion
Signed-off-by: Ryan Wolf
* Fix off by one errors in tests
Signed-off-by: Ryan Wolf
---------
Signed-off-by: Ryan Wolf
Signed-off-by: Nicole Luo

* Make GPU dependencies optional (#27)
* Move GPU imports and make them optional
Signed-off-by: Ayush Dattagupta
* Move gpu dependencies to a separate install
Signed-off-by: Ayush Dattagupta
* Remove unused import
Signed-off-by: Ayush Dattagupta
* Switch to placeholder import that raises on usage
Signed-off-by: Ayush Dattagupta
* Remove deprecated utils usage
Signed-off-by: Ayush Dattagupta
* Add cuML attribution
Signed-off-by: Ayush Dattagupta
* Safe import tests, improve install instruction, update gha workflow
Signed-off-by: Ayush Dattagupta
* Fix pytests due to loc bug
Signed-off-by: Ayush Dattagupta
* update install instructions
Signed-off-by: Ayush Dattagupta
* Raise on non module-not-found errors, update logging
Signed-off-by: Ayush Dattagupta
* Update logging to not change root logger
Signed-off-by: Ayush Dattagupta
---------
Signed-off-by: Ayush Dattagupta
Signed-off-by: Nicole Luo

* Fix failing GPU tests with latest pandas bump (#41)
Signed-off-by: Ayush Dattagupta
Signed-off-by: Nicole Luo

* Adds NeMo Curator K8s example (#40)
* [K8s]: Adds a helper script to create a dask cluster on k8s and includes instructions for how to run a Curator workload on k8s
Signed-off-by: Terry Kong
* black formatting
Signed-off-by: Terry Kong
* big_english -> my_dataset
Signed-off-by: Terry Kong
* 24.01 -> 24.03 default container
Signed-off-by: Terry Kong
* Add help kwarg to all flags
Signed-off-by: Terry Kong
* Clarify why venv is needed
Signed-off-by: Terry Kong
* fix precommit failures
Signed-off-by: Terry Kong
---------
Signed-off-by: Terry Kong
Signed-off-by: Nicole Luo

* Move common dedup utils and remove unused code (#42)
* Refactor common utils and remove unused code
Signed-off-by: Ayush Dattagupta
* More cleanup
Signed-off-by: Ayush Dattagupta
* More updates/shuffling
Signed-off-by: Ayush Dattagupta
* Move gpu_dedup scripts into subfolder
Signed-off-by: Ayush Dattagupta
* Remove gpu_deduplication subfolder
Signed-off-by: Ayush Dattagupta
* Add readme to fuzzy dedup scripts section
Signed-off-by: Ayush Dattagupta
* Fix typo and relative links
Signed-off-by: Ayush Dattagupta
* Remove legacy script entrypoints
Signed-off-by: Ayush Dattagupta
* Remove legacy scripts and add init file
Signed-off-by: Ayush Dattagupta
* Update GpuDeduplication.rst
Signed-off-by: Ayush Dattagupta
---------
Signed-off-by: Ayush Dattagupta
Signed-off-by: Nicole Luo

* Fix lang id example (#37)
* Fix lang id example
Signed-off-by: Ryan Wolf
* Add classifier unit tests
Signed-off-by: Ryan Wolf
* Add test for failure
Signed-off-by: Ryan Wolf
* Remove failure test
Signed-off-by: Ryan Wolf
---------
Signed-off-by: Ryan Wolf
Signed-off-by: Nicole Luo

* Add dataset blending tool (#32)
* Add initial dataset blending function
Signed-off-by: Ryan Wolf
* Add blend unit tests
Signed-off-by: Ryan Wolf
* Add self parameter
Signed-off-by: Ryan Wolf
* Fix return type of blend dataset
Signed-off-by: Ryan Wolf
* Fix blending tests
Signed-off-by: Ryan Wolf
* Change assert statement for very uneven blend
Signed-off-by: Ryan Wolf
* Fix key error
Signed-off-by: Ryan Wolf
* Add proper proportion blending test
Signed-off-by: Ryan Wolf
* Add four dataset blend and clarify docs
Signed-off-by: Ryan Wolf
* Add shuffle module
Signed-off-by: Ryan Wolf
* Add blend example and tests
Signed-off-by: Ryan Wolf
* Fix random method name
Signed-off-by: Ryan Wolf
* Wrap return type in DocumentDataset
Signed-off-by: Ryan Wolf
* Save result of column drop
Signed-off-by: Ryan Wolf
* Change equality check for shuffle tests
Signed-off-by: Ryan Wolf
* Fix expected order after shuffle
Signed-off-by: Ryan Wolf
* Add more documents to shuffle test
Signed-off-by: Ryan Wolf
* Add assert statement
Signed-off-by: Ryan Wolf
* Add within partition shuffle
Signed-off-by: Ryan Wolf
* Refactor add rand column for shuffle
Signed-off-by: Ryan Wolf
* Fix filename tests
Signed-off-by: Ryan Wolf
* Add determinism handling for shuffle
Signed-off-by: Ryan Wolf
* Change numpy random function
Signed-off-by: Ryan Wolf
* Fix tests with new random method
Signed-off-by: Ryan Wolf
* Remove length call from blending
Signed-off-by: Ryan Wolf
* Improve scaling of blending function
Signed-off-by: Ryan Wolf
* Fix blend tests
Signed-off-by: Ryan Wolf
* Add blending script
Signed-off-by: Ryan Wolf
* Add additional file paths call
Signed-off-by: Ryan Wolf
* Add documentation
Signed-off-by: Ryan Wolf
* Reformat docs
Signed-off-by: Ryan Wolf
* Remove backticks
Signed-off-by: Ryan Wolf
* Add context manager for shuffle tests
Signed-off-by: Ryan Wolf
* Add better deterministic shuffle path
Signed-off-by: Ryan Wolf
* Update documentation and reset index
Signed-off-by: Ryan Wolf
---------
Signed-off-by: Ryan Wolf
Signed-off-by: Nicole Luo

* High level fuzzy duplicates module (#46)
* Initial pass at fuzzy dedup api
Signed-off-by: Ayush Dattagupta
* Update deprecated shuffle arg
Signed-off-by: Ayush Dattagupta
* dask_cuda gpu only import
Signed-off-by: Ayush Dattagupta
* Move fuzzy_dedup imports to optional
Signed-off-by: Ayush Dattagupta
* more tests
Signed-off-by: Ayush Dattagupta
* Move FuzzyDeDupConfig to its own class
Signed-off-by: Ayush Dattagupta
* Add example script and config file, fix typo
Signed-off-by: Ayush Dattagupta
* Remove slurm examples for gpu dedup
Signed-off-by: Ayush Dattagupta
* Add config module
Signed-off-by: Ayush Dattagupta
* Rename FuzzyDeDupConfig and minhash_length to FuzzyDuplicatesConfig, num_hashes
Signed-off-by: Ayush Dattagupta
* Add comments and update example
Signed-off-by: Ayush Dattagupta
* Write to same format as input in fuzzy dedup example
Signed-off-by: Ayush Dattagupta
---------
Signed-off-by: Ayush Dattagupta
Signed-off-by: Nicole Luo

* Fix indexing in PII Modifier (#55)
* Fix pii index issue
Signed-off-by: Ryan Wolf
* Add sequential wrapper
Signed-off-by: Ryan Wolf
* Fix pii tests
Signed-off-by: Ryan Wolf
---------
Signed-off-by: Ryan Wolf
Signed-off-by: Nicole Luo

* Disable string conversion globally (#56)
Signed-off-by: Ryan Wolf
Signed-off-by: Nicole Luo

* Fix issue #43 (empty files creation) and improve reading/writing speed (#57)
This commit fixes issue #43 (empty files created when invoking the reshard_jsonl method in nemo_curator.utils.file_utils.py) by checking the size of each file after it is generated and deleting any file whose size is zero.
In addition, there is no need to parse each line into a JSON object, since the lines should already be in JSON format. Removing that extra parsing significantly speeds up this method.
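The two ideas in the description above (copy each JSONL line verbatim instead of parsing it, then delete any zero-byte output file) can be sketched as follows. This is a minimal illustration under stated assumptions, not the NeMo Curator implementation: the function name `reshard_jsonl_sketch`, its parameters, and the `shard_NNNNN.jsonl` naming are all hypothetical.

```python
import os


def reshard_jsonl_sketch(input_paths, output_dir, lines_per_shard):
    """Hypothetical helper: split JSONL inputs into shards of N lines each."""
    if lines_per_shard < 1:
        raise ValueError("lines_per_shard must be at least 1")
    os.makedirs(output_dir, exist_ok=True)
    shard_paths = []
    buffer = []

    def flush():
        # Write whatever is buffered (possibly nothing) to the next shard file.
        path = os.path.join(output_dir, f"shard_{len(shard_paths):05d}.jsonl")
        with open(path, "w", encoding="utf-8") as dst:
            dst.writelines(buffer)
        shard_paths.append(path)
        buffer.clear()

    for src_path in input_paths:
        with open(src_path, encoding="utf-8") as src:
            for line in src:
                # Each line is assumed to already be a JSON document, so it is
                # copied as text; no json.loads/json.dumps round trip.
                buffer.append(line if line.endswith("\n") else line + "\n")
                if len(buffer) == lines_per_shard:
                    flush()
    flush()  # the final shard may be empty

    # Double-check sizes after writing and remove zero-byte files.
    kept = []
    for path in shard_paths:
        if os.path.getsize(path) == 0:
            os.remove(path)
        else:
            kept.append(path)
    return kept
```

When the input line count is an exact multiple of `lines_per_shard`, the final flush in this sketch produces an empty file, which the size check then removes.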
Signed-off-by: Miguel Martínez <26169771+miguelusque@users.noreply.github.com>
Signed-off-by: Nicole Luo

* [Tutorials] Add a tutorial for PEFT data curation (#45)
This PR adds a new tutorial to demonstrate data curation for PEFT use-cases.
Signed-off-by: Mehran Maghoumi
Signed-off-by: Nicole Luo

* Only import PII constants during Curator import (#61)
* Move PII constants to a separate file that does not import presidio/spacy and other GPU dependencies
Signed-off-by: Ayush Dattagupta
* Add comment around import, move constant import to global scope
Signed-off-by: Ayush Dattagupta
---------
Signed-off-by: Ayush Dattagupta
Signed-off-by: Nicole Luo

* Deleting links
Signed-off-by: Nicoel Luo
Signed-off-by: Nicole Luo

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
Co-authored-by: Ryan Wolf
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
Co-authored-by: Ryan Wolf
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
Co-authored-by: Ryan Wolf
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
Co-authored-by: Ryan Wolf
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
Co-authored-by: Ryan Wolf
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
Co-authored-by: Ryan Wolf
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
Co-authored-by: Ryan Wolf
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
Co-authored-by: Ryan Wolf
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
Co-authored-by: Ryan Wolf
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
Co-authored-by: Ryan Wolf
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
Co-authored-by: Ryan Wolf
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
Co-authored-by: Ryan Wolf
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo

* Fixed typo. Update content to latest NeMo Curator version.
Added fuzzy deduplication wrapper example
Signed-off-by: Nicole Luo
* Fixing Style
Signed-off-by: Nicole Luo
* Updating container version
Signed-off-by: Nicole Luo
* Fixing style
Signed-off-by: Nicole Luo
* Update get_client() according to latest version; Update log path for map_bucket section
Signed-off-by: Nicole Luo
---------
Signed-off-by: Nicole Luo
Signed-off-by: Ryan Wolf
Signed-off-by: Ayush Dattagupta
Signed-off-by: Terry Kong
Signed-off-by: Miguel Martínez <26169771+miguelusque@users.noreply.github.com>
Signed-off-by: Mehran Maghoumi
Signed-off-by: Nicoel Luo
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Co-authored-by: Ryan Wolf
Co-authored-by: Ayush Dattagupta
Co-authored-by: Terry Kong
Co-authored-by: Miguel Martínez <26169771+miguelusque@users.noreply.github.com>
Co-authored-by: Mehran Maghoumi
Co-authored-by: Ryan Wolf
---
 .pre-commit-config.yaml                       |    0
 .../config/heuristic_filter_non-en.yaml       |   82 +
 .../single_node_tutorial/image/jaccard.png    |  Bin 0 -> 14952 bytes
 .../image/zeroshot_ablations.png              |  Bin 0 -> 84269 bytes
 .../single_gpu_tutorial.ipynb                 | 3786 +++++++++++++++++
 5 files changed, 3868 insertions(+)
 mode change 100644 => 100755 .pre-commit-config.yaml
 create mode 100755 tutorials/single_node_tutorial/config/heuristic_filter_non-en.yaml
 create mode 100755 tutorials/single_node_tutorial/image/jaccard.png
 create mode 100755 tutorials/single_node_tutorial/image/zeroshot_ablations.png
 create mode 100755 tutorials/single_node_tutorial/single_gpu_tutorial.ipynb

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
old mode 100644
new mode 100755
diff --git a/tutorials/single_node_tutorial/config/heuristic_filter_non-en.yaml b/tutorials/single_node_tutorial/config/heuristic_filter_non-en.yaml
new file mode 100755
index 000000000..4c1b80905
--- /dev/null
+++ b/tutorials/single_node_tutorial/config/heuristic_filter_non-en.yaml
@@ -0,0 +1,82 @@
+input_field: text
+filters:
+  # The filters below define a chain of heuristic filters to be applied to each document in a corpus.
+  # This particular cascade of filters is intended to filter generic non-English data that use spaces for separating words.
+  # The filter listed at the top will be applied first, and the following filters will be applied in
+  # the order they appear in this file. Each filter can be removed and re-ordered as desired.
+  - name: nemo_curator.filters.heuristic_filter.SymbolsToWordsFilter
+    log_score: True
+    params:
+      max_symbol_to_word_ratio: 0.1
+  - name: nemo_curator.filters.heuristic_filter.NumbersFilter
+    log_score: True
+    params:
+      max_number_to_text_ratio: 0.15
+  - name: nemo_curator.filters.heuristic_filter.UrlsFilter
+    log_score: True
+    params:
+      max_url_to_text_ratio: 0.2
+  - name: nemo_curator.filters.heuristic_filter.WhiteSpaceFilter
+    log_score: True
+    params:
+      max_white_space_ratio: 0.25
+  - name: nemo_curator.filters.heuristic_filter.ParenthesesFilter
+    log_score: True
+    params:
+      max_parentheses_ratio: 0.1
+  - name: nemo_curator.filters.heuristic_filter.BoilerPlateStringFilter
+    log_score: True
+    params:
+      remove_if_at_top_or_bottom: True
+      max_boilerplate_string_ratio: 0.4
+  - name: nemo_curator.filters.heuristic_filter.RepeatedLinesFilter
+    log_score: True
+    params:
+      max_repeated_line_fraction: 0.7
+  - name: nemo_curator.filters.heuristic_filter.RepeatedParagraphsFilter
+    log_score: True
+    params:
+      max_repeated_paragraphs_ratio: 0.7
+  - name: nemo_curator.filters.heuristic_filter.RepeatedLinesByCharFilter
+    params:
+      max_repeated_lines_char_ratio: 0.8
+  - name: nemo_curator.filters.heuristic_filter.RepeatedParagraphsByCharFilter
+    log_score: True
+    params:
+      max_repeated_paragraphs_char_ratio: 0.8
+  - name: nemo_curator.filters.heuristic_filter.WordCountFilter
+    log_score: True
+    params:
+      min_words: 50
+      max_words: 100000
+  # NOTE: This filter tends to remove many documents and will need to
+  # be tuned per language
+#  - name: nemo_curator.filters.heuristic_filter.PunctuationFilter
+#    params:
+#      max_num_sentences_without_endmark_ratio: 0.85
+#  - name: nemo_curator.filters.heuristic_filter.MeanWordLengthFilter
+#    params:
+#      max_mean_word_length: 10
+#      min_mean_word_length: 3
+#  - name: nemo_curator.filters.heuristic_filter.LongWordFilter
+#    params:
+#      max_word_length: 1000
+#  - name: nemo_curator.filters.heuristic_filter.EllipsisFilter
+#    params:
+#      max_num_lines_ending_with_ellipsis_ratio: 0.3
+  # Top N-Gram filters for N-grams 2, 3, and 4
+  - name: nemo_curator.filters.heuristic_filter.RepeatingTopNGramsFilter
+    log_score: True
+    params:
+      n: 2
+      max_repeating_ngram_ratio: 0.2
+  - name: nemo_curator.filters.heuristic_filter.RepeatingTopNGramsFilter
+    log_score: True
+    params:
+      n: 3
+      max_repeating_ngram_ratio: 0.18
+  - name: nemo_curator.filters.heuristic_filter.RepeatingTopNGramsFilter
+    log_score: True
+    params:
+      n: 4
+      max_repeating_ngram_ratio: 0.16
diff --git a/tutorials/single_node_tutorial/image/jaccard.png b/tutorials/single_node_tutorial/image/jaccard.png
new file mode 100755
index 0000000000000000000000000000000000000000..bc281639b0f96c94c66eeb09067782fda6019d79
GIT binary patch
literal 14952
zcmb8WWmsEp^exy@Xwl-@Vr_vyaCdjt5Zv9NK=Go*r6g$4;_fcRt+*90P~4%o&7r^l
z+_}$v=EKYfLQXg&=RMi)-fOMBc9^oF6b32@>a%ChFl3~~Ri8aWAOW6xA-@D3S)*x>
zfnU#ERHa0pm5sdJ0S*u?L=;4xJ*$jCyE8%pj!_(?wOpP(!}<91^L*Ky{1fn^k(E01
zqnf>yqk`E-BU7V~M!;*po4zx#wQ^K;u(JQ?@{U)>>)oe!yv+3PJb*_Z`hQ0pY&`$3
z$1JSu|8vHF=RY0m`HUs0JU)BI6DuPwqW($$U_bEEVA(WSD{}OUXNgQjxhZL@XPMnxI^1{J_tM)?yUO2?d`g!Zn2H
zdphs7Zp~jZC04Jv%<{L7Zh70YZhhlnJNnr&dM3D+dKy5F96*5_kV*c@<(liir+Y!!
z*oY!nh$8LS;gY~FEW~?B5XsXESbyJSS^oRen8W{Z68ZH8lV7sW7T?N%@BN-ZqE0}j zz{BwuA=9(X;i1Fkqc-QEe#+gafBqX+T;9feySOEoIGn-(Ua-nf8!TN~z23?geYkCZ zjXkP z&ZjjexgR&V+6O{=>RUD#<_5v#-+HUL%6cN0+LD%!2p&(!ALD#(CNoAJ?nWQ^Q~BKO z8(mGY{w6SUkJj8=Vpwq=3Ie=K|vVQ}{jrJk$( zdU&p=xGcr4qT}UU-GY5;Y4c%|RjSopG~Z!AneWx=zQf~%Lv?-Yf!#)Wfcf2_d;1Up zJv8=yTS{n<%;QhoezHZn%tE4p@3n0d$4tNA&3KIYfi?m@$LcEs1t*(e*m z%1~RJh)=VnOiEd>)>||7N%<8i7 zZD$5nnMCtY-*ytV#oE=3OU^OWFMh3C`Z%dD2i(#}s@dU_KJpBK>p$^URAeb#K32y2jVkRq>^>>=n3NjE^A`;3emCv-MrD ztKVR1yE4`X*=;m?khqvU78~5tWw4sbw$18$ZE!7rGSLV5T9qWyYpMEH@$vs16T5D3 zT0afLrAg3{d~Gxi1T+!C-K^ieCh%o7n+e_TC{?2|Hc+JUx>{dc)VB-VP72*m&i0su zPGY@r?5F%k)TzgPVy&)w`j#+r!lREx!i~=fajFpChmkG5dSGryD0CiQemb zdysg)Zy_Yd`1PP}q0CB{rlY_0o#cxiIzAn*gWAdVF9=m2GM~TW`{`bXGIMw<_lNG) zHf4>}scaUC&iA1aSJ$I{>AFHMBq|#eD~Y7uI*B{Wq`5D9FhmjO8c*cN?4*Qd%$cX! z4Y~(FkR{$D2N+^ic0K@`Bl&)>{lTVgh_Tr;UGFZNU-jmbTzb12E@o`7Z&Jn%?;?&3am0jJOxy(+(=^!{B!TEY4(L##URh`pq z|JObc!AcM$R1~wEb{pwMucKC(th0qGw|RdJ6SxLM#-we4Iu!=wg#TIj^(v-c5+dVl z-xr~~RcD*M&|$;p#I~z_46H_-L4OD+`+K6W_8*TiA7fShfNQw4?IU&ndGX662G&D~ zfAy4!)LMdIOE%@LoVTg%*QtNZSJSsS=^5%*e7)$3sOuz3BLWRol+OH9Cy$DyWkyf;P&ZnzgXSDJcSt|;TI!K*<`@f>*8lg(m{aTcx z>Rs|cD?-mW-FUtCSm$yrP20Lx);w`}oIL7HO-sJiOQdECV(D$tYmjf5uMI-{@5Yia zwH%9H>{pNG`3c?souvI-rPoc0o^brWM$2?6CDXM3w{mxpp3QSSl68BdES|%L^{?>! 
zR7ZsYqO9Ni=TmAMGc!im*bPlR1UfLC00_#zScxrwoI?ijUWzvBL0=<6U%za>KX8aY z?_6OU;eL-K?-pK}#O-8U6{*}b!hiPjH0@AXOBEy)prpk;-{Fc=z-Rt5$ysiU?!PO< zxv@lME?Go{oW=-TKyTcOLw&X*J|Ce?sQDcSlAG!{w7PQo-d#D{Y&q73PR1^Y_>a5i z=!BMkI`UdceJFibfSSqbt1Tfpy6knNSA8inmZZ!MRI8UL0U9zV?yY-zR=K5hh+Lua z=)`XjjjDf0BfvNenSX~XgT*?)Jv~kSRVz0U72J1MeL{@&7io`|j}Hc@e{!aXXXTh$ z*%zFL7^>gcH&8ZKM6wO33P8mW{UNeO|0{0s&@||1(Sp)^siep}OgXKm3Tp0ya22KV ze2b5tf*vDyRd|it)X|Z4n_Cvk9qoh+vFYr902V~*_KT91#aQ}n`I=spVdQzA(8I}o zN%e>b>}q3 zG_8~7F!QNcvHwkE;JK5VV_EU@W4!Fnt{XPf%zF-j?%H2|W+>Urm9~z0g*UX!@Bo_W zI?_wo;AaL5TYhbDsODH4F=VHzdF5|AJD0uzHG#E7j=ie^CGOV433|-7wWbE%5JbJ1kmjT5Ruc9C{ zapnllY#P_FTU4xi5H2|r=HR{lrtoK-e01x@Jh%}9#1d|u=5^RucqvU&e7^eH%p?$l zQ(CsWzU{o!_TlcjP`eOk2ai7?TDD{Ix8g;FvGK4WkLvvt2i{}Wl3xYvUyy_#BYxw> z8PqGzmg#r$xcfsRBm+SoLz9thO@wolVD!Bwx_eCeEL2v9u**UCY5i8uQ`8SO8j0D77C914u z?9C?lSrv$K|7{{j+-|t%y)wkQqsi!8 z5gXeAo4%((l>|g;a63P%8u4kNLCo6hM+N)NhUgrYIIsKRqPlPE0%OLi;y(=56@=-t zEk=8;aVKdue^d&x)Xn4D0d8yx45r<(C2%B@CWj*RJilWv4xT+p9~^16*0_~ktfinn zSdxU9Q9PK^%-FH_-&KBI?MmDkL$MZe%z|<}K3ul@>3u_rZ$;2vLrUS$!p#>pGCCwL zyjvq|mk!sYsRZeQ=P$&V%PKK=|Ep%t#YfR;i_^-yQG{9EwlR^`$&LrHZbFt}vE~Xh zP!8`ypnzF{i;cM{LnI9C^0)fT%)a&1*UVRPEK)ms_*W302Q>#A!ofo!@@ z`;3;DF8n6)>s%C;-wz4llbxJ_U&?(%^x8F`9+ki9z4KYSUpgg38`1vPHN2E5E?(CT zCB*tx;FNQ)cPlms|( zmyo*9@Nwamb-NfhFV9+isqirTQw9z8L?&!)ePW8m?<09!Hgg8kt00KZ7kNw8u#1D~ z**2QA2I?DJdrW){TGcfT^&9n`>bWa!%mggR`<2-vag@u;*2PUu8Pl z<5kRK(h`t+cNRU-kTbXh%M7@bdvCHM7-ieJV5yE6uS~;3iDT0}*TQJb?7^;9GT#H-DRJsG~}EIYa4 z-r7{TPfG9bUi}W#L4}~cOb#C9BJ~$PZ{+?e&6A)A4(@qWMqzjojNk%&$)1?2uQ& z$C+fPj>{u+jWGapP{_QY|FOj}ksa!qA~=WJ-l16`Dt!+D?5DjD3ke0N&=Fvxqp8K@R4$7TFkYS zH&SB`ALE9YUF7+!fzYrG9u^Bj3DL5N>vZ>_TE>m~?!LBCmQ4^DD{og-_*>U0m5rHA zdi@sev|^qiHmIb?)%we2FN;K7XPG8)5?W&B zSIsK|FNTl5sSz|7rc5su1oMcfRn^YwH-lyElwfCWjJ?IU(CEJ&S>Lc1VQ9{w{Mt_# z&+j_XZ$OcECnx8v>Psj*GBDvdooD4hd;ob@89AX$d9mUeT;B4PP}|cJAK@JnS%G)s zMuJ?qaq#)PSI1Zn2R%g0BMWbn&HQJ7#J~q8M1`id(zWVDJvXg30I2;LSFfTN#{3cR3Lj$9aE`;v9^Ci|fndc(647=^4zt>pnID%B4qT=kP3mweF;Z 
zZd5$Yt<&o5Mg?ze+m709H{)rPMk3(QCjH<5P2X%Uyy89Y+k^Vi81A)D0LI0O4h*9q zF)LBEy4jp}X{cV`CAmbbNSAA9>YdkK24qFpNLS|%M-(6X&6aC; z&+MikMfk-_WZ{N~z0;z)l0&0R`56mQI;_^_9ucjSDHbxfI>h%lk{|C1Hm*{Hgwd+y znHHB+$@2P@p^nOzp4AOsY2&jsvsS-(we+Dm3LjkEPDZNBBXZlox!I#VJHcU&Q6{VB z_1{X9ES7M6*TZPiLG?3vMH?L{DjbM5gnhjM6C?lIB7GUaMgQ9((MTrCM<9gNO^#WH z=tBIT$Y$K|FgjJeKOY@C%}dc7!+kHRGZfmUjPcXFf0h1FA8gt8W%DSj>?PUY6C-C~ z(EXqrsk#eGv`3OiWuIrAQ$rR!jf;UDeM|)SBV#JFktlDk;+p`Zd$jt`s&j|I=B~V^ zfgbv_+qFT?SRW&NHQ13;ck$7MiuPur6CMG&KyCAqDo`Q^cVCSEv!S|(miLd+cPnYm zq64G==k5z15$-~G|5Bo&SX3VG3&hBWgw0%jw@(;*n<-1ggKsm*L=`J5bJB3z2rdX< z_m4{b6$P)5#ArVzboG}Y-=sc$UV)~8p7t6iQF@KnK^wpy@&E|k?`*RUEg*Wz)e>S)6JHx@x|)lCTH zsw9|b1r9bp6o(oT$wMRb4=y} zF@vpr&oM>#+)TM6_Os=*!c;undhrrokDJl7@?GdV&;cuvC%`<6D9zM3C60O!YAxxhQ z9`*5=Vw-CKOsTzFITqFs#l;XAld(`jXlFVsa?AXo@&`mhuhcR2hb`k;z6d)@@XHlB zhZAzgliY$2Uo#G)xNqwas50ybsOdM3GwsRzQy856i6XH8=YejIST#^PYic-nR z`Mwv>!z}+y>;deHp`J4zI=qP~bmyqgPjs!owFG0*Z)s5Sn0NjjftK$46K$BaeLb4* zRJ4m}#g5ct?sCP?Pk&K;=NFmZ-AUbPSz6bq-^1D6#e#!UpNeJYlR%{0#`pJ$1Drld zb{>>JQqnCdI~82XXS8ifTcA8}odv1ZkS9j4kwqxVHr;J`IZuddqmTUK)UWG>D^a+N z(6o>;p_rl{-yNW*3;`n01f2KTu%l0m;LW!8RDsHl2l8nPv^CF@DY#0{o|v~FObv8! 
z4d{B40@?QB%zs09NnRkYIIxVH17Ocddwc@H>I)zf*`0z>*==9tc~9e5CEvj>l$teh zblUwb(mnF-Onpodb=~I7Xfb}ecVm4@bL>PZ+&^4$U(xfe5@ZGajZ%&pK^5x}+5(v= z*Y-)`6P6xNFB3lg*1hUa8?~!nN?5K>(GdN`+4p_fe;FmJLTQETbqf1+`jj$0@*Yww zhbR@a*!g+V8WOLBe26W+ef-*k{IpQ6CrLX*MYO^Fc0VgRg{?}`bU+$Y5YYRyo4CL^ zr?NJ0L#Jbfb!cmObs-F|5?m42qX7v`$7Psfh(YWXF(_JrGr>#zPL=tu@}szS)U{1_ z+*FirHfcnN#$FWHa#9iIM3Yzc;DU-2DZISpbnx1#k2LW~Po-TcjJdB2DXAKyvy#vU z5pD3v8Od-1kn~j4a!X0*{$t~L6n8@FVzcc?01SbfD5%)V+SQJxz=&PSW3k{D~%WPHt|S5+|wf95peh= zs=vV`j^V-!O#axTsv*Yfmv3xiN7DuPGCSikEb z1IGG~(c!_6A;^_}@bRz^px$lvk$Ktr0*p5Y-!%(VYqvu*Sm4>TmF|`2Vm5uOZIuf@ z1qwpWZLjO0`FUsC+8iBq@5k1K+BQUy3R@OaJPhhzN`5z!9+eYJ^(|zg$JH74wsgr} z^H%xO^h`P;$a+AvPm#lH1?=EEH=E-Y0|k%#J@G2QRs z+WY43jtuk~-Cf0@!M1@L1pQf98$^&e20-B?asU-FfJyqu{k)qvHT6Br8gpIl5^R3r z9q4+tfwao{1aJ4_{mxQXejWjs=q|DS&g^=46$43*_yiEr1=GM(L)SO;mAOI9%ewAL z{3qR+blP}=xn)ZW@N5!95#@+7SBFw07r{yXzZXo9^PQDequyDl^zV!iG1WWIb>`X^ zD-P}VUyDEQW$SxbNyK~%HJitG(qo$bDkuJ-qL&vH`=GLq3L9BdlDo(Ywii zFW=_-mL%kvdx}dLnabmZbLn)o;}Trgf##XI?Dv)J96=eS26hemU}5e&aLfrEu9wn!q(!piJ75fBCv$TGt3Wq?<20o@O;B|}mB$TYWk+S1c9fRs4z z?HOSk>Q&${1z9)Naza8D>2z6B+|Glne$hD})Xt7LG#}JC+t@+HQCHI(0GMi(JBO=K zfGAysiY!_h47bV>&c)w<&9|>?UmCMQ2TS*UVj^f8S&L10Q323ksH*4bpLnIsRJ%GY z-h8%E{^}j;Uv$&EjY-F;F9S(j4&U?{jCxMd1`ErH6qXeV)6PzxyX!vOD2RLyKDL|} z9*Q%YFLh8rXd|OZ|70BA4z);g-R<>McO?C_1t=r)Rfdm)nr&Z4 z&eM*(Eu5p+1C1&5yP#YyX^wi#d+m)_clSmk^-7)c)L|y>OHbPGN5S5XUuwdF*8@s3 zuHQoku!j{+aM#t+Vtj^vVWBzwc>XHKPV9M~)EsW=>a`;ix!W(J!pXcdj+uw1%LW$f z4pI==m+any{%IAI<+io9YLKJBn}xPchV6NkueWC#+Vwc=_=qBxL^$Cg8VwpDGE;Pg zWYi;puX#ZSqk|;`KzGO^<#XvpNyr-!OOzjz+AS6p1=Dbi4+f3g!RES-9Y``F6;1nT zyqp#z-|k+JUkRhy>sAO2WD5*yU{F5@`kw z4{=%qdM*>6Fh0=Z;RWGCpp;Rso@;{%Le@52%ESbEB3*}e?(Qf3dq4KNs%I0>H>ZVmC0J&g?= zfA+}!NT1=w;*o22x8+@LOn$X}J&R_Pj^D|L|6N6}p6jWhP*`wRad&%nbnp%%;jvK} zs+vF8g(=kHozM0y@zb-yR8Wv8Z#h5$1=A9taX`=SUC#$gJ4X8;@;0(HCOp}j=JwlW z7~=wZXA$$7 z3V6AZXb%7CUR1xt!kdU|+tY0^KywmREr=Qi0D@Z6n(_u*Q(uO5{9kf)r&j!yv}Te* zTD`a4s1iJNKTFuHrjQ{h&_f=ka=OR}{zSlEEW)gehZdo}Z20TzQ+3sdkaXxaX1*j4 
zWzVgpn>Ur%q&olFlbysoq|}}cG&rb};rC|FhluSKaNA&8jTMX%?w)5qQ@NNLl}*1$ zW55yy@s&$oYk^Zru}aNLWp&OuhLS3^xfs{5wR3^@*07XZl+K4za@|BpUfqkb$Zfg* ziIZwmyscMadFf^F-<}nS-0f~gValw!)R8tz zjJ=}fk?W53CHxvH@4ybZpUr4#pUoE32j0qN(c8_Rhi)tJyOp>cYF>)b>;t%?u%^;Q z)3Q{H&u27v>m*K!OlGE|jl!!(z!k`EG z@DX5$*aVy-o`=z3)ldnSY|$l_@BR+NR4g9xB)}lRJc=Mze)Sz!QQ&;4MwGy|03u;f zuxlCGt}HIMDO0LJ-_8e?Rt;$#inkGFy^#D-F`}Fcs?LDF{?-zi{d^o5UGD1q?_i*+F8dGqc zc_{{L0t_bwR9}wosRS7QscWhH<->=w6vzp7yo@3bYY|L&nw2rma&{#Fd!<2O*brny z=iswwL9Xh$EFHj1P93hwR&4&JT034vYm^biC@Qx#VQKH&`e|FXQQ^0~93~-l5C!n?+Vz)8BXVye%?)HtE+a=PT9uX6 z?h%12hdtxW8xg@sc<3(rbyTtF8nBM{#^~o%Jn?oC4hw_qw2l>rSDTJMhb#jz)qd;L zr=1u=G+H@5EW{U5QsKQoYcUS7U`<^-QY`N*b4gzH?#{eFVO{p1+B6slEWE zG@enFo)e>A1iq$f5U&U}edpi_IB)v)QoF<59tcHs=AS&y5x*P+yz^#`K(Ohs{Nk{Zu)%t6=IF9zcVeBbC3#&tJ~H*svDlFjjBr~8wNqgXRhL68Us2&UaZPK;buv}-I6qU`Wa+0X0&T+2tKsm} zb4fdN|D0Wx8fV>F2-@e;S4UTV%T_B>ZxQN(Z zWs4b6>B;FICtcTg)RtH7wT`Ya11N;02g*cc4PTjtBqPn}H=JDyx@~vu$4PefiN~O=Vl# zNA;&A=NIeVY?sAdw9l2oZ-T|2E|Z$7(b zk(LZbCF7soU7rn=mVSRxYTxp6ZKSdu-07=RS)xH?!*rOW-Jl6wfqA+gxg(%WXG=t= zS~J=xHcKmM%rBqMDm0+N$fWrr%2I!rUfKc<7Xe;XcH|Kt^Iz8{{5glcz&@ewDy^F@ zvQKZ7DGKjlZk$!45O%~h#Y4c=L&}1p5XG0#WA{mqLxK7`za0AHY-!Z}rWUM^#VVFS zuk}@zZ+MZARRZ~So+g@?T-dO&s5K1pSmJTVSb%CQphK`D?KFFt#)tMr^P9}iq%oV77A9Q~}&Mj3XOIfWR znSHSKS$5@F3w=M)nx*l$C3xA$o_?^lTv|miZ|BcsF1_cir3*kO{XuSG^zIu+C1*_) z&p-;~-SN2W4w{lUIP52jNiR(2F;qrdHa9Gk!vxS93-<`k(khtzaw%|lYb%-(1ho$2 zNxKU8TF`j>Dh!Hkq*tVm3W@KY#B|*QOLlZ64l-qzB9{OzBvm=*UFDe%g zrFeo(%W=lQ@7|O&D=3t*5J^Im{>wBsRalki=Kw=CGL0-IXx43HJIvo4RZPNFP&JZc z8i!BSAy)}YRkFLSA;Gurh$BFvqY-t#gkz@Gkql7JCb{{Jsz(L+v`R^6Uq<&Ffu9xP;QtRvj(kvYQT zfOI_vaHO-837*(bue%eu-B?%xozTTFT+k)vI8Xx=nG_4c2xTo1otr z7}5uSmJ=mKlO>o%K$QC?<{e2)calJpE&EMlT8-#_p3e(9C;;Dj-{q(A}s?E&+HRsVLU>RD8&STJi^y@x4 z(`bdQvASfst3mCLcY>q6xJF-3W5M2da~I7n)$eOnDzNkUlUcmis#n#?d~xU@Zz&mEG5b1l)5oicYp^iO8elC3qsF zFlSjNNxw}!s)i^-za^UF=5W_7w_}2Z+K-aF=v3O|GKu)Y5s`XX=n5o~RAt{Lz8N{e z>4uDs6s8#J^4=Cu6ZG_W*I~Db{pO}lAhq|lL%b13UKo2 
zNhNx*wV%>3OIjT^+89GFB)EhLTah)g|6-57&x+EF8HvBq$4$QndOjLnw5?;HbKKQx zKE9e<-)~4-hjEw;Y!xN}pj9(kx;6jSs7G?E%9XB zyNq*Lvy~;ch0;Z}72<*LkT3z{rjVktREIW?A=-|6g)IJ@k*1#R1$K-YWkyxr@|s~$ z7J8t?k#u|pRJ5j2+2TNj`UxE+KnCz$kK*#x>svQV1M)&H>`W9HuEYCXnNv-}g!blQ zM{rLe>rWocn0BOSxu|3>8BCyfy+2nyq=Yp56t8E-S&8ZHXX2%ml9fM((2yu{okx?| zXk0HiXM~O_PCp8e(RoQ0#tKY?=1Pxg0P;pfP1Bk=-C?=&-ftrbsFo#Em&Ax_z`4u$ z2mcbXJ##`!`sS3`^BEB^uRSg#A=9gZOk1UzT^}&Y3@AW`w7abbZYLt64j)0Tg8nJlbV zxIMf`zSyR#iAd29kjgQ$#|d-kSl5kT>i(o$H($6hiAo@(o70;V0lN(Z*Ac@|8HO^x`aClOp zV)zwDz&o1ZjA;#0l89`G;Xy2ZpXhE&PbcD&e+^i5to+v!WcjaM#aAV;&pLIaOfV}n2Eq5;7i%& zCF^}+>eIA47qa=d^5{P+9*T-+`37@d_|Xyt6?hUsETUN5#qV)n%I6miyf#B-oYSGU zWwIK+d#qDwC^{I%!!+p@&7JuIuZMmH0wTH)F4_UFn`hLe#*6O2@;!{&)gsyLe^;2I z?m44VOQKOhpas1;bksIPM|v_*Ac;aTqMyt`11r40lDc!zv?&WMbsF7r$_K&-hm5?~ z5aW^^@n(p3xrGX&WpmFq&@9;+QBC7f3oBCA44#}~Vg?PDISxfsnF&&30-)9>>liQCbiv6m;Y;8R2^2x7Zg%b3&-lTda~}e^|Kbf zJ2au1)k79Kd<9`gy5)7U3FERK)BXA15zme zpFoW2c0;}yg3eikLFLk-vRW*3&e-SfJ|adO@!4nJ9SijL9-q|2Q>6&KEK?NQ;(SYG zSww%j&Y7KXz5>kPTGmzYZa%MVb3@)*a6qr}W>F;p?xZ4{m}4u+mKp7{7M-Vcf{Tc8 zwpAsYiGSgvqnBmi94<15T&k?+q6(U)Xzo<&|8N$VTE@a6VVCR~rh~(qWlJ>pL`n$W z0}8ptfHg{||Jz>p;F*m=|`)v+ z`p=jZ7#b4C8i1(8HiT=Bm^EEnmR*$%O@MvSTzdrcPAl+on1<e8c;}BhV z%0STGT=E>$W)WmBw$2_k5y_R9^?F8E`yjhcdK!uIKU>fkHlO3oPJaGxo#$TwSLE$i z*FdbCPyKGcWsT;pCKB6Vq)y$~g|mw98(slgXVZR8yrOI^@04@wlj>97w|552SeO>&{iYd!oS*Joi$ZvZmc4 zb+tZdzpz?N_igXoO3{8BFDqcqYXXezs!=2vM{bXatMdPhY+{I=3PqLdw*KMr*EeB~ z9zT;<-oFv>{3tta8fCak^TA6@n(oo>bbyA2%YUBff;Lu#Ansn38!xRG70M8yylnav zG$Yl~{5t^cD@Q=HY#?7ax!fzw_`#0aMUBPx^y0WQit<)>FflFhv<4kX*8kRK1R~c3 z2Z)xpKio_q|Ku#g3r8T7QTnK^@mWy_6QP2jNN4b)3Wpo$?|TwDWqQ+or(*8|8w0so zOOQPt7Se;~_qvywKc7tWH&m6emTj9ih<+a=pDfl%6cW13Ad%jKldJWZ_AG*+x`U!f z*45d)trK7dfDcKsDikX=O_bToHT7ZMp}4}WAUSKTsB)8_P-6*t%xI=4!XN%dmvB(# zSPIyefNR~bB!O>O_LukButD(zo&I-=Z#{sS0*3br5^i34mph7+owDI7n~VNYr@~Ec zAoDG9=7HV071H2KiNH7Aasnl@In|Te)xOvUpmr zyt-bL27W25`Tw@e39Dc2TLoJZ;)AEh#{5F2v-8b@5Zw!Fo8!4f{X1p1u@ScB6?eLM^;UIfd)k^k}PRh%YbpDwICa8CDH5)z=@ 
zb0c3p)nRQ+0i(3vy_P-3sy-nR{s>bjfJx1KuqpP22k>zMHs=aaUGI|~8EiFsJ+WRz z?BZ-(#-3c<+34Lum}tUxK(tQyfBb--0RNk6k2H8609x=TN8sJlTnDO@Oso1%HJ3(< zIJhHbnr>x95ltRNa+)wmWBtbqO!>mvv~1+$PUzE&!qa@r{+Jjh!-^I8R#)qXYN0l_ zLXq!GIA2BS^e4oO@1_;DA4%xL05cSyMjtFI>m=1LBFb`F_7s3tj6V8}u3J5%AMWd? z&kBN{X+j8Y4bN3j;8cb9gyBD+koSkF{|vuX)?-}%1^csZX8xP)@3b0Unak$C+AQZD zLl`LgMYE2=di9GGt*z{XhrcSLp$2M#5#Vd@6}xuZnM6rTz=Zmrj5vW-uqn!z!N!xn zI;y9Ev0D!7p@iVU*#^00syf)Y@@#}J|A7KXjW##F1_8p9A;AlQI3G8YEzF;6K+b;3 z`%sv*mN&>{c{bZ%&Yfyr?|bG^OJM`$Ld;hbz-+D`7|Z`BR!mimL9~tlhECO9`U{Q> zNX@8-?y~FZ!VwyobY7)dghqw&1(zI>??ZF7X95@ey@G`;E*ub%gin~9uwaO{hv1Tz zo)w_4643T_(j)1p!JG+MHZy_Mc_&*{s1_zj?WUZFYonzYHPy#xwKI6&^m{~+eS@(e zl;-a{Y*XtJsVEY-XBrXMo**1E*M70VNrskK$k$tn<(+JgS`{=C#86uGKO;lZ?boeG zKIMzNtp9>FGo~invLHhQ8_P?$-`~r>f(nY4fYTz4jN2PjT@vxs(a}OR6{rYd7iEG0 zy%&C3+E%&v%ifb;Og+I>)xabP{OB)6_>1D_;Nd3&Lx4Vt69hvv%`sno2c?st5-_FZ zWU6j0SYcN$hvH|cS0r2sRDcSLorW0dKh&eTW7NY}*z4#D=@t}G#(CS++*Ls&9GU;7 z$kNzsW=dM1m0PpAE((oij!8m(zWN{<4@&+s(Q&|HwF#(V!)TP>2*G!E<5J`d^)G!k zl1!op(5R@GU{TI;Hvi|q5B7IpaE)2?w_61X)8<89GC>Jh3FaChUJh!LoWZ)jB+kE_ z7>U&_W=T6rO3{^zZSAQn-7STTiqRwmA}*)-XG+En2+uSoLyTDlMs+r3xZX{Q|M(#B zo)vua%H$0_)qkgS=g7oYJ@xm^{6OFJ(EPvbihJ*ykjsfAi(FE>0O4C?x!`mT&tK{qT3|Hn@xg47IFsoor-QT2^m|k{Bw!xU zwGkN5QJYo@4+*glLvbQOwbW918W&M2g9DAlEN?C%I%W1eh17(yVQDq?cTa^fGp4|x zLyxldV?l>&wqi&dx%_38DbQ*BeE1$;R*K}`#1?SY3Ub8%=gGtWPveRIzfZ!|x_Er9 XdUrt-7kct<=2J#OQM^plFyQ|JEdKbs literal 0 HcmV?d00001 diff --git a/tutorials/single_node_tutorial/image/zeroshot_ablations.png b/tutorials/single_node_tutorial/image/zeroshot_ablations.png new file mode 100755 index 0000000000000000000000000000000000000000..10be349440829113b3b7f5be11c66a595aa1e359 GIT binary patch literal 84269 zcmeFaXH-;Kw>3;_t8JuJOo*s02#SgdND?q3BA`SG1)?CK$T?$Z#a0nS1tl7goFr#x z1EL5hp~yi=7DbROAn?vj+g3gI-XHHb#`oiXV|=0wdR_0uH|2+6lIq# z;#|bW#J;45(k82Oq&fVrQzDheK6rW!6yUWcOXXB5} zF{ZJ$CLym$e!^W1h@RAVocZFMj zEo1(UE$DZ#xy;|19s1w)gGUcoStTubSzKNI>aUfS+GG6%Gqk~ul96`;iBay8ql3f4 z?^UAoMP(iOuS 
z^9%p<(ks@m#6LzQQZavc!Osh{j{X+hf9U>pwTi#5Ea>R!;^T{)$t#+S4D)ywFvZTj zKZ;l0|4fa1z^+n`eeyLgzaG80n*ME|tv=m6$j{Gjc(lP%`xoJ#pLX$bm)+#H%*xEH ztZkb%ZBnTX4!ckq_$)P5SgztQyHt2th@_Et`P;X0DiJ52Cj{)&j<~r-uv6aW=LfJ^ z_!O=ie<$_4Mklgt5^swbFEG#{e};zj@6)!bXRM7(6CYR#yf+$ z@h8%b_iaCY`b4XJ^M+5BnwyswKY#`AZ6@L6$5+GG5T}{4D3cNNx~NF&yr^gfSA3SW zu9;cfjWvQE4fenM@{49OJ3D(oP>^=S-O8UADs1X*e|y))(Xl4Ww#z2ZzcA~!-xe*Z zxcJN5_V#<@2eM;hW4}*Lb@cZq>c6}eZP}8qmE#c4L$6elQESO}ug^i}tQf4iViI`b$W%s4*RO?u??^`9l}mmW#FMV|Vz`CHC%3z9{iF zFflPvq|9Wk$6&i0RR^a(XYALX-(I-y*D}WknVFd?Y^{LPyhKd zY_%5)$Cn9tPP{n-lYSLh?XXA{dj5i1s+peAC*JB*X}u_s8};oqn%+9vd^s6w<8w^IdAZt z0L?y8H*XQ;;-rBCnVrxh5kMG~ix=dqi zpC>CjyBnWkd!*s#<=D(r;~uRq?CovLQkj|@3%GgnCKk4x!HC^!_>P~K_b8uIXw_(M z!>WxNElUNp)8q1Ird_jga*VWxdNW#R87&?`^k(nK=>DAk{2*H1WM9rh?!U}qePpK3 zgYr?NWV-BrZ0SqFzL+ot2jp~Hvm#|I6m(?((L5fb*@?anMGbo6km&YwJd zd{pF%`T|eSbX$ayN6tf?Z{NJ>@=sF?-5-t2)MTTm9wvMAXh&RK!`I>As)-RsHNN@k zwQDE*9C~UWZ;*5_2<-aw=`p=Bl1k!w-I|||d~xb`u{AbI(&ORXwD{QGh72lstuDh+ zxYJeZ)HCTcP^HO9f&cBW zSm)PY7c3~*eM@KRWU$ztHzS7+9TGlU$OZtSg#5ByBO#bm#56oW&qLFTePjCM*f>2h z%x22{+S2tG#-x8fC0o0~^UD@&^Z2%IJ*Ffxm6|`%9eZ-Dt*wp3w+&)e2WF=FJw^0i z{F*mfPn}oA`|}4czPM4Cbv;_IAYa0v$Em>?Fmz_8jFO_FdU>eSsue5FHBmo2x)dmE z^5oAye>Ja5KH60lq5_bj>p5l5%g4t+>@BmM5slGGi*XtLV%^{jG?qHq{PXkK2Cw1s zO=xisrKi8O?XPj}&l`(N$~yAbia;69{E1=udH2wJ_rgV6+*CHYeZ9m(4{>Zh-H>ix zoo?Q+zdz(Oa$fJCJM-h(WdH}(WLtf@EdBlKVTKDp(!&=QVazaP%V#<3aS`S8`NBfYiHUuIj>Bs7`V z7zD56r($c^nTqHp?TKRUjdV-F4r9D2oX-=2+_+)>9h)w%(QUhqEn?1zs+oVaO zo+z1=_2E_nMjUy|7gqaJ;m1>0Jy`Ml>7wj;Z2Wtr$8Yw$Q9ph9G&u_jRA*i5<)>3F zDCA*>*JoNkEen;3rPd^D+qX|$UtfRX>q~Z_kVDIsEu$4ok0kD3y!sa{_qw;`-V!lW zi&p0K=9bcGaUTmIKpVT7mxt#|xy(!zK)_Z{PfuzeyUh1T((V&yeCDte3Sy)hpV#!O z#ZCzct6a+z3wvaBgi|JvF!fI;tosvE=s<6K7Tj6|@EjOG_t7ZV5(G^P*$K)hqe zj`Fl0t5SHUD89@73Y)LW-HlsEQ`K|_g%p2aZNu6fa;P{ea5>dPT`5L0khW!Dz(7c; zoSd9^{Y-rt^uBHkJ?nyJ;~fY{9<^!zoWSJUN2gf z-JEJtsi1FNG`5M4ho?QN&wI#ba+jgDxVX4xde{aT3yn6lGFx$b=vD~hg!Jy``*9$B3Hsy?^i(uQDdry7;KG^i zx6d~3P2VM3Y~ysZx+e%`Ul4kqJ32QOg-W^ZrL;JGCb7%3IXBCG3Pn;=dS_>8P`tI* 
zvJHc@v%99I##(Ot_1E(R%kM0)dC~2rxm}6*6DI)iGyw5LOe%O>1Kyaw50h!(oE#rF z9RJ36^k@^6i;K&x#3fQGO#S%r4cT**9Wy_#j3|w`0>ieF+8Rw48<)0IB8F#U%q@fGTs#iq>R&i+SD|LCmJ>X zq9se-Tgzzv<>RBFrKNSdv#<8~VK51439e&(@#kLO=rHkQF3RZnu|n~-zv{usG~|4z z7`?u+$`mnuwV-Y={bR*CN}h`v0(+>-NVnnDDIANMFCU+#1nWINcQo(Y=bN+H5ga1vxP8;617v3+COzl~#@VLd zRgyj(sThLy&rhGAcLM8A`m-*Fo`cLT7qBH~+KP*hy9NlJ3fM)d07f2uv(b|u8&$-# z>eN_YlUjxa#j7vJwo7rOzeR7?u3dveLw8Z@y}P&V>Aia#A^V+`8!{|4+}v_S3e4(~ zC#EO+<5k{kIy2%khD!ts2(zC7Rq? z3x5GRXpL}gd*f5XNGu2)6STU59JE2&?F{fnnS&zAj8$A*amTzl>)cs3gfliomcd&; zm(PFkap3Pk*y(~T85|i&8{6lw$NKXT^V+0?e!@ttW?fT=%A68*Q6rhAFL%e4S5#CC zBii!=t1Mo>OSdk^vIE7#58O^4PX=zyDz84 z$H(h>CsQq&w7}ct08m8W1<{M|Y{|Bx`=Q3V9~>M9!l>lU8`;tzQ4Jq%sjA0^7v%IV z{kiW8^Op!o9Zh>06~#~L+2zZZW6P)*8%J$di!taoALPO2LN3U5=nVu7lK|%H-OHD4 z5zgR&VgRXbZ;;%Iy&V{_SY!MuPcsDQTPB``q-D{)ADKJO&Up{P}_v^35 z08s=HY)Maf3o6R4w_ar3wa4o5T88WP$yYeGxEoz9^&`Lt3@8XoTM6sWe~##$uM`#m zN8`p|iF}I!n>I%8lskNF$sHc@xSlgp8Bd-(5x($#t@j2=_5A#NP>r|uOG&B4oheHf zU2+()JdNt(IW-a>V$qmsoj#QReNbMmW^Mi% z3}4V<0Z^%iy~wDVW#fpb@z&#m9o3HxUfI!gUqGXb^UbSQwb6%vUYd=+k!yw6%y6aB zZGNi8Gx>n@A|$W(Hj2lOUk_*aR9XP`q^AMc#XSc&2$5itOc!62KK7~F_1=wSKAdWc ziT&UZWxjvCVy~P73apMs5U@zuS_391D-7vLSQ;Dzr2hH!%@Sb+>~)3p`^<`m$FX0c zfHoLX6+Zh1kEf@N_BBzx-QC={Dg>GL%CT>mbGXClRysWzxyXH_T4S$*%E5yNIrp6V z6J$>$KF$O|i$?EU!0mJZcvX`o#~xQ{D9R+e*!qMW&0BZK%}kFKq+KZG1hC@Yy!l{d zghGn6yiA?h+G~iL56r!B*cji(hy21%+~02fg2}`cQns|r=_2SD_x&NUU^=EUQi*|- za|1cbn^S~i#R@+Y3oWfcu=YH1Ha0fOSpE$aDmV)xfX3YeZ#LH8J;eEnhv=_G$>N8S z2A9Kz)3PNTo3Av>Yu#<*)cPH1iBT%rQbmi`7HU#rcMJ&$vD7*r7(wct6r=KYyO(Sb zw>biUZgbodoAJY)EwaM4ohQM<$dNh-)Y_`GYoicLwQvt54!wV^+OMRRb*<243)w^s>wiD66Z}e0t2gV!dVVW;v}?<2%jHT^qbcG#?+D z=e#p7Do!|>|tKjq;YjV78%4B=k&TjGgwT!GUc7-uT6f%UQW1lMMvCi_a#F@#0nH+O0 zwP|rni(Z6@DT+XXXI#S&`W)iNu)+;#+{_!{xUg(tu%7#f8kLKKBOFO8#-uVLE8%Yb z)}1>Q7O-=tYJ1YTo!gfKA5`y3lOAth3G7|J>t8&J4|>F(gERX&lXNWO6gB>^ zNhuA{+`@b1%9TT^0s?ftM3jBckk0qjFhIW&0p^UpEvZcFS)wP$1>Av(vU1$>c%KTz zH5)dVdm~GQL~3b{g8_|8asmtNg12Ft>wwURqU-_BTVaLjXcgJbn>Rc3FK6D&B7vL* 
zWzcVui@;zPtHfVfz^>IsszLCLVAD?@*%)zXMn z-+A@VKcf(D|NBQmW_VtNIA3u)K&C5PK!3;YWU%UfePPtI38gX(R?g^okfnqNs zy6nl(n|2e!CP+p|2Qf%U4t>L(7s`U8xusl=`zfoaj9^_%0dho3wV$863g$GDRKLo& zHi(BmEG;d)Mwo>>+tvF`a{+FQq_CWR31Mrw#jMUo$e!h_)e-h)`*E#w@k-sWWv_ zR{r4wZEs>x#p0t-fgqJ#jX3%6No{Sdc7nXTd}U8vN_B+7dV+fWNG_8aYqY&>EVNQX zLnB@@<=9eAVPE~VH*VjiZSmRv1elsEXLn_!kFv5dj~p`3X&?v)n7v=~nJ@iO()P~l zPgZSL{ebXYonoYfRo}L2mttbxXdnp-ge8E;Lu-8SV68&veoa#B;y8jQ-c~8hEBE^5 znwlG{1ll`0JFoAv{ER3Lb@4Rn7sQ)LyU`xaG?U6>Ke!v7Xy`JF7A_>@%FwfaK?iPy zKMruG$^IfzRX1oR^+Sg)BTv=tTk?`qJ~=_i!y+&Og?p;;$3N`*o10Ge8Vc;+uY=sK zj4FTc^q^|onbX)IgJWamW1(kd=rgUG+qP8xh`En;b7nMWPVul^>X5)hL)FfY7x5HZOWKUX&z0Re)$h1Atn@W9HAJr$>rxbqfm%bst?_-TEZ` zx%w=8^>aOXmm8)M27i-_)WRE#Ui0DOxl4w(5x947{Du5 zM-EWK&iCroMkGJU7j1s$!GkF5@J5GF{k13L0$AS zG5G@_mCWzB$XK+)s!Yjw;mw7C%xtxZia6U<8WhPF>U4+*i$}(F>QYTq4rVZalFg9^ zk^SYhrIlbl1VKG0%$>jF6ka(r==SUJ7O}|S^O9LIej}aBe+~#kQ@{{{nouxBJU{=& zw7)s;c0E%EV)HHMS}hh?Fj*_C5F+ELi*&AXcdJ5G)afk>Arfjx47cl9BsgDoV4nVl z3?;$*Z%<}bwS1GnMT-{``L4aUHy(O16>E0l_;ITv>sju8!QS`7Pm7nG4UE`l(HMpL z)CAZOq+L#awEk>H+bl;^%r+mH1RVHSa0a(c`O5w7-W@?1PKV5M3aq4QRn%i#fE?X< zO`JBHQAzsY?5h`s{J2tKytlR`huvbfQwAqJYW1IYZ4d;qK0LSU*5|gik&zlbDliT? z8d_TQBl9hO3j^8vXB2a}WhDbcyrVQlIIF2YlsKn|_>+@3ITCWgM?(mqTZRaf!gcFDKA!w(Y=q{V2oxZl7 zj7;Qx`;Yc^-)?G86R9=lg1P+n>*9;ZGY2g~eSIHcElE}M&-Mx&x!r!NUp_%C*1u`& z_RX6qMToj(Hf1=0Bx^Dx`1$!u2pwPGFQ~UmZW;SKW;l41R17+1>SHBjLBeyywblqL zFr?o4wIlA65J65ps35Ki*=(rnJG#u8vPD#6@L`_*L)%rO&U&9ecdn;u^S`$U9@}Dk zwb(}J1mqL3(N-;AzM1a1+1BM7qo*g1oY}&Qzd)r70x;@J^4J?{D(yOI=5GrHm6sx8 z-UO!Vu*bd#lGw*hhC?XvSHyR)Y2PbC? 
zfeoTEdvkz#T)LwpiGG6GU;K%;wrPE8bZ|(>{lLH&U*F#e#ssOEE*%E#0hMhW10Yze zL5r@R&ExFqDjA{AyavAk;u!nd-Gwfy4n?vTa&mHF!BlhZK6{ZkN_w}%3|$Eb2;lbk zrcc<76AB8|B-v*)yT~K-67@pZdC(wM2495kYtX`OPANhOy{vx&^CFApdKJ8^+r0$# zqSNG^0dA`$8_3C8K0FUXC7+h4mxyPhW>D3krMWBNl7G#W{ESFAb5C0TT%&-%a!R^XG$Kzn((kg8x7Yi9(>Df%*Jw zwfuT>x*%jxgpA6nz<@*+7Z;m#-51o;rDWTE1OZfyC-`p6vQ5u!;genMgLFGo9d=zKsH+N#8jvhW2iqhh>hJWjTyd~apD%a|oMwQ9ko-?Zn7rG4 z>ttBVT8Wl#b!+t89%H!(RTu*!($o6fw}5LS4VDFT#B2g0&Yk3oS{-n3CkGVY_ zeYHB?8Vr_7mfQIVc3&xnaE5tO9w z6q6J8W>HSc#M>_pVGSxQ9zRK2(jxC=n&Hve=a1}sKaAS#{aA&V!}rE;XC%pSy(FEo7$E6OKnp3 z0Rte+cYl2970$TSjsySY`aZ(^!Kp*V^F2J&hK);1V}3j@cej83ObQR=_U)DX@!OuB z9jL#-oWfnvu-RP#^~5dkZBLP}O%Tc_Kp?=7l2{TT$UyZSC}AH%)LIl(;s!Gy5QsIe zyDlX%e|mySwgr4t2#CLJpguv-WCMZqPGo*XRbt=Qs7Y%6KB=#Tb3tl;K)xf<8GJW9 zYo=dY{|044oM%1hOjvy}H^uc(;GP3L@DqVqG>>UR3Il>?AVB!rO?|8%T^1}98r+A} zqaJt0y5U1@Nr^nRj56#>!shkDgqRF-AAMLBCXppgMEd{e`MfaZ z>BJ0Mg$`gsU|BWxX6F*~jnrCKV-!bapUvdpE=z50>#y|-fA#8>8G}PW{iKx3urJRB zDVM5Um*Sqf;V_kO3ZJG*3n$zrrK5_qn-2dp z@S8^7>>H)c!sc(_*|X=gq(e_qLPHwWu|Z^A*Pqv}b(+k`%zRgYGPf&fin-FYN`1Oo z!E=|c-=~TK-yW##=+UD@ozViF0cF!hXA695h(gMc6M3SsvZzn)P$umh9W#ocBx#ln z3=gZq!in0$zT@4h`@1L-CC*3<8AXI{>m52qKlmMGy#9#lf3XQzNR`#^UV@|lEuA`N z?%eo+$A`VlET^xH;0UNAcn#WnGS`D>cba~>F7#R9`a%vc~VzR5xx+`UX)&}B{*g8bG;xU1MM$8BPL$|r5Oubc9RC;SWO4So} zSZB^rHOAGSokwv_OArX(!#_l*-1M9)Q1!c}nGXL0k zOYCBpH+R@s=z?>RIt>Pt(TtRAJf_PV z4f8coAmLWz@$~d`c6KI#7(C$zu%jd(5*5>rC~_-!PZ;CKidMh<)@Mrm!v8oRY)W(| zyLhaY{q|e&;Pl6*vTEsOn(oLR;y<~}MJN6`Gr6&N&PB;;KF>Q-wcEFEKLNcx#Uy^} zZsSZM7}^Q2?ZQr9&pwQJNEfOUww%Zs7{U%T1=#=(Gq`d^MMcww^!RW_!9`HbnI?)5 zsoUX1f4Tb}K{FfORehvJACTbYy-tYmHtJE-wbuwc1AtYZR|5rN3Ptu04xgxP#nshn z@I!%Hs(G;Iyd46fJt%gqA%rTe$P^*u<n9PDQ*6NMgJHZR0O zyu7?R%R*ua$K6{}SO=lyj!9tPUc(z; zvG1G=TYv2_%&>G+p2{aq+`zjdjt{){vTlHQIc4w~BIk>icnxbJfH#zFQ;WGr{5}MC z%iiX?!$WvG^4-FMZIVHE{rqfFX8>&WQo4o>Y}5yBXOfbVcJxPvxjDRgTK^-=6owYX zi~eJTIVNS+?6D%52oOOv62exBU2tuKb0E(m;n){^n@ju&wzdGAFhJO(Fo1ky`(07b zl@2fwQCW@bCmND%UK!yHdziq|b$gGXQ9(1bDB;xO%Sf(w=y@OJ{asqB0Lix!E>d1O 
z6yw=+_c!=7nph;?Rm-(n+j!bz`ujLn^?t`bod~LSSc`?vl)V3r zhfl{unZ<-vt{Z#s%@~M63}Q_J)1mw!zEK%5Z!bmm*s%-QqfnrWfQL`Pt{m}3$EX8F z7+R)ImQ4(vvUJL@aR!c**~kOJxgnS7TVb2D+5TnzSv_RIhY3whO+u8(iHUb_=@XVY zIL`18n^XIm^BOq2MxK9KR|Bn}FQT&Yb~6*Z5j!u7*Bh4oQ$wAl@&7ozMMQr%y?rC6 ziQCt;FDtT;1x1{dakM9aWQUPa*GV7e^rLtQ2n=!OUN4$n*fG>qo>4?JH|c5c0>zvQ z7A$BpF-utbB&#@ z|J@&+94rd1pK-bYGX4Z2W=i}x#3X*iOTVfoM;FhVKfg0GtmoF+Juh71#t>{$@gjG0 zfu%NdF>(MJ8yucC{u#*GM(s5glvRwZ%(z+yTDppqRJyrm5cD+5vRp=DnqB*)kTXjb z<;7|wCLjzMi5I|g;xh1hO*{iEjxNd8!hGAd$wB3h$^Z6wF{2r<+MI%VP zr;(o!fU0oTIhHTiFAdwRf0@{-GWv(X&{L5(!GEd%1CRg*yHE+VHRAunhwReJH!Gd# zNhv1;R!%Q7Vr}aGEx31+)Y_4+U#-4KgEFf}njYpVCxkY%`Y&6+jw zI$6rZ^J$+wj*};0)mo@91T#${NIfxTXQl(}jZl&GOul{nq?L5`5?N0;)rnmTNQksfU0b$cJNS_tPeD229|iAg zoagF-<$|Hc7qJ>WH`QrWjinnTRGXWd|FcYLTy@D7wb+xVPgAD$17*P;l+5kzE@@(7 zLY)2R%8CM?dS+}rtW!1Wmdd$K)`ZA&`TF;QUR_BFzMTjo`)kBKzB?t<;QW9KjX|~J z%yOU)OlJ$s>GF}DI1bX^!aoeE(MmqWC|F^fnknQMe%5pLJnk$AccPHxe zBn(^`0@R?r9`iFYSBmYh%7^g;;&~+LSHiv`^|0mSGDF-ESg#`hnw3v|q+`Iyip5CD zRbZF4f0$r?gcV;3wKW*~2M|&KT~0)sBpMk#;t}#GHFzKeCO2c$MR57vCFL4Ox>zvg zoO=zf3T>H-V>xgLp)~gyqz5qp^Nh3(XrsD)TsX(OsP6mm zUifC~QhanE<>MnOXAg2Ap!t|saSq@xg2w;mK znoV&MyJ0rIXA+nZoCErX$hxR4Q_zGr^sE46?*)xszdkZSheaI44Dd~%L-(0lWTphbISZBYafA8lvhqr5=KzGpv|(?51j5aV7#cR zs?tfUgFcI%Qyw`qkG|#D*FusA2()|ld0?0tMV;4!HJ^gaNH84 zB&GyRjqvqRvOkvyYMsUkaM_C_Xr^?UcNqoWwT{nQvQb)xG%zmgKu|-cg9n_isz_=b zeSH>VQp|1Ls;F-}%eEph!1<70;W;y@2cdf^o%$|7(9#<{ibVNsvYAF-ZHf++ohWZe z#-I&(|pg*hE-CBqFvzJh{ctn)ix;3Gvn_w0m?}0#~0ME=k|6 z9!Hv$`N0%JKnySmBt4IWj2`K2ASTZe;g+T*3M>c2Tne7t0LCgAzhNW*Q;-5Af9W>l z%7K81vp#a<&n5=*92JI??XIk(NFJAy`$#-gB+x_9ha5>v^xzmG(WI$}#*aWxyg0&2 z&Bg|FN&FKV2vt4tMrQiXHjeZR{1IZ40+9$Hx}`b~UGVC?`hxd=7M~kERm&C9V0&HkiqI=w{R!@o<`)aW$?NgT>}!q4D9K|qJ`U4z&`t}owWhn^r= zvmh$sP%eP{)=J@O^cJEpO?afuw5(nHXsz=7BfunD|x0?QG6zG`6e zw`7z($Cg2h(7|k=dT!UAJ)ka35K0Ml#K~Ti&2k?2Q@J?B@un_U=z=_TQ8|83D!-wXfnO}`p+Vk^C{%uJl_t$ zl<*8l|GKaXKw&1&teR%3M%r}<9|#IRew^&ITkC{`A(MD>i7TU^tU@(E{A#6A6rx|C 
zxNQV1_9i|w3&O{$V*}eK*MS8=I#))V1xVw8Puy#ad2zO*_w_pmj`x%Pm4jF2<6z0t z@`nBB2NQUtV5f!j9SY*uB4!Vy+q{9I^|&zrgYCMxnX^m`_+T(7&L)D|#K=7~l;Yyn zen(aq)wo)MPTl6o$6tSgQ#pK&r>EGDYc`f;juiFs{ldm0cXwl_Tv7TvY`;iZQ%m?G zVI74^aV}zPtp;W^pcpQ;>xaQy9cVZ?ZU`1Set3rf0q*e-7yb8d1E#e}k4Z~%!0MDV zdxkny2N;eVy#EM)ixCz8@m)F}8i7gL@p3nrx&YF1JGeOjHR9c$o}RRX%TNoO@GhJu z(5TQ*lM>g3tOaO8MM+K!*Cv4zxEqq%Ne>fo@Q=%1=)iZ)KF{t?xpA7GMY4 zH_0ThyT3k-Cr%)UHI;sNqU zB0@$LuqnL!9Cq@|PdH_O)v1ZotaK?DLyG&kRb+f4@?+GTE+qO{(<0fM41@Q- z>M)ednZ(iGd$s+V{+UnUJKop>S##CATdU5J$(lAPT}t6`c5yN7Ys_*15>L~AIe)`` zr&CZr)LG&VOv~F+^?Tx%Idy&d%0hcJ4h(XMMaDvyo*3)5Otcw4hpwS{P0U2V#-Zu3 zIQZrhoSrpsPW}8kWgx!Fj(H<&7oYr9$TV}aon09;w+phM>*(qK0M9x1OE1~tO^vk7 zOl#noM|=2W2@n5<+w~;rQXO(zz)Dp}zqwqYQ+;+~6_IkAo12L#1kd^mG69*c06B`R6eh03 z6R8WmKkCZLerfy8mx8I3CHCNJOV<4Xaxyp7TpYG~w7kTi`KId&kL|?RJJVg4Y(F0O zcc{>#s7+S$F9vUc7YrSWK^vn(5&*?W+b2d342R8pmnDS-fW?K7Zz2jRh%RbJjvTq& zZMdp^A8t-C_v>ZQB77PeHM!|#ge`<-{sb2@T2xG}j(LX6N1F2PAa%jeg_%4aM`dLh zI394S-?5V-mAW$M!;W@?Ax6@W8A@AH$lCX|wLJw&?=tQM|36%36h`d7q{Rh9u__$* z)d>`BDc4a|)U-I);yI&}Fcy*#4_R75XeM%>9`ADKWo~yU^*ec&Lgfa*ZHJm)AxxUG zQ!BNY<8yQ0aFR*j!%o#0Rk2V{fEhb*vT8ap3Fl?*NHT953ZNU&$9Yo9n;gveY8 za(jhV&hiumjiqYC@ zg*g10JvQ* zKbMI4q)7xk0NMRG<-_RGh{7j~n(6rwAT{#kV_W(Vwv@?O9y4XNdqeuZm7AMOh}n_8 zriSbQ$W4ers8vXB7!i=bry8u+By1|UOI|s+C90uE-D7Y9un^2e4EChg1?Eea-n5#Y zCe!q-N}+MYx8_G`Cb;xEjhm1>gSVM^ppYg_^{^~yj-5MyUIeC3cl7Zhc&OpAiRS_h z-qi^$gq@%G*5=cz#q=>Am@Xb$!w&0ZF=!r`kA1MORX6FU@E}+yfjKWB)DIM;l#rmn zz`$N3I9~6Eg@qxDa1#v?f z!0o45IpYw#SQcK;t@^jm<=h;Q0x@SmX#Yhi}POC%(KE zR$8UoG=tzZ&{$%pm`~D})mOi=jJz?C+t;rb+bCj}+o4L7s~{$JY$o9nXE1olMeAM~ z$*x6PCZ2LT7_QP#sRoxGN5zMoGt>6^7Bwf7&zw1vP}6t;hp=*rMB=o{EHq?s7Xr@HyI(Mv+0x;iqwT=0# zT#N)M>23_+0gnW`1i@~I{Pha9+@9a9skvx7!8sMI?1CYWI4?<~fn1H!5-QWsoi0(jIS&5_B zV6xVhc`Ng}@G5u6kv1$expS;vAB*KyC+5Y7@{JgrMrtbw@KV@=u3WxMQlQtc1TdEX7dZcP(&?iTf@g|`pNfi^380;(Mr>w%QtDmMiLF=y)QnGsw5_# zJO{}Fp}l=am)itA5+@qnB`7HXQb|rFTL+%VxBF}%`LE-x;Fygfk8vUQlqu!2S-O| zF(d|dkxnJjZVSHzF`rXQiLnywOuLC5^CgWMK8P?`ZMSkBFQhwkduAkPry6fYCJ*W) 
zsu9v)4YF@?TpOfoBrPo%VjcSTz`=-~nGXQCKuFJUkyzZ3Xz60KIAdHKuN=H08E!JR z03<{~Lv+?0BS9Wj0P%F9?yDh9_>~QWT!VoiE(-+ec1m`tdw%QAa9G+AIteKm&fxRM z{JRHT*>t`zuz7)VV(bu~J*d5gw<{U$Rr zuH>(8Yk^S@rjXA_l?0h^D}6%p30TXACTJ3I`E+$n)ie1u0)$fci;Of1buC^9+940A zoYXDhjF~H8680y67s>-zbh^fR$_BBnV=$W$Bmp&5pk=fXv@Wrc$ib?UM})nMdV(8B zBNul3yafxWKt>YqvEvLzpaqI#T+-o5e68rerBcAm8Lk~46YM7bK@c(EKIHvS`>>9h zMcUoQqo% zlkYh&3WqazR&7+v_D4*-Cec7l$}wWB14c$hz6bYs2ZP~S*+^RIF&c&7E-YS8a|t*W z5RSJ!&^T!sKQu7#cDN3IWY+fySS7Uxdr^%@W`p-_%%RrOYC+bLff}qzt3mnZOGD5y z@QrFj%lW+{4 z`Q9Z%I!Somf+Zt#x@{@s07hO_Mhch;bm~-Rqn^pQv8rv`*Pjc1K>t&ctSQEH5cLgJ zlsb51h_?|agJsQ)QZN*OblVZrJ5t6Sx5!h;v1okTN#CrDg`0cW|rlo}mQOEUU}1jYJ(o=v~L94P6SfM%s9P%5K9@trtvq5{z< z&AeeBzTwPJ3$#aYO7#|#8!;GRi|p+xP=m3MvEM?<2b#Ih6C&lRPKHMi@u2_&yjGUY zv%ah@?wpaIi_NyFAG;0NoeBb#bo&D4S~eA6b^tL6;ZvfqN~A9oQ%J+H^~9cdH%uVm zrn!yV2&{$f_eR8hV)*3LE%GcP1R{XaC+-4brsqH(prD|jY=(s>>9zoO<{w7r3F3%` zB;V$nS6NYE2mXJyX-FJ9Q62m(sNr^BZsr}>jfIU|Ta>Ya8JtCq4Ycl|Xz17_kAxw*(wNh)Lq# zA+1n^mH&EMA*vsV!o#{@NHyPCEQleunWx41&`L)n=fO(QOpLpFcF; z5&bVL&nb>y^BzvcVuqvWTkN7pEC%$aK42KUB?}ZgeCn;)OY@XAk!UPjShrP?8cXM0L zGtdr0swH)Gb)0;>ydBZ3pSfz@3b9B5*vO9gaL1Ew0Fd=$djbOS(mf+XiJVEy5lG#n z*(ep9TF#%@be{qLa5Nq5uLixGJPY+zb~5jjfBPa~t1>cDa+Y*KI7ihZ2YyQiMrIds z1u?5fr;Qfr8gu^s%=01zY^Fqjsf`4@DCXH=T2|< z%~>3*P+5f?lt_Gl1nlFmMWT2VIJ-|l1?yiUFPImO&g^b9nw^4zwzv8l4hwk&An3Q( zOrpIB6w8SpI1T<6g_iS>*GXv(gD4<8!p%9A4J<2TIw)l&?eRUc8Of6v9S{~t!4IC> zEJSzTlEVOUHKdhMOSi5gZOP<$@%*H_ z9E>>_@~mieVhBPvFv%`jK?8SaSpz^CF)(r=*n?4W>3RM}*tCiteu#E|1{$u0>#2#+ zcpr$bvXm1VZ6DFNN>9bl0GIq+6 z2Ps16`c}Z$9VG8+C?{s!0u1(g77<|zMj{X>pM;u=7ul#5=u{@Nd6JTL zV`AH3P;u(5z34aqad!;-`o0k+nZS0G3?+eclH3VV|H5akzNh~ZkFGBgpvlk&(ok>0 z4tgE0O8^axU^DlP5t4&+fs?LgP^R%71o^}5sQKv|w#^8LC&LNpStgDUb7Wvn7PFav zjsFH|i6#X+Hm7>w4hp+TWlO0WA($TSeGMB6<@veMwiTNlFYYv@yS~<_62hcL14^|$;BB~N}vWyyK6x} z@(YIy5_4wHWV#mWL26HTw`om1^DJ74T4Cr5+MkHsniR+AkvXs8jnnrrW)GFNL(-9d zd}2bY4SZrl9?LmGOZ5sdLG+GAn4IuKz%WwrhRhG=0<2?z*OCs76ksB##5?6G0J;B* zYJ;u`w2+}+8AF4J9}q=YRfGiEZk{V{CLPs?aj8Z@e&GSSE^^TQQ5um<_n)%%d-Oh_ 
ztc|ktA7%VH@O6v*ztWGEm0IIHN8cs}d994<_Y42v14#svedF@s|Wx6k?pLC~* zi+4<`eZ*}ORoR}Bs4JB-9PklNo?cxhF~`=o8gF9m+_`G`ZqDenlf#^VD&hu%=HHQB z27(Hs=H=vSS~?nGYjw?tLF-$6-d5)BY20b!IV`q-aM<*T^D$V|SZw$X(<$iaQ7C`4 z`qla%LnuUsv#kH^HeZf0s6hdE9=KbH%7vPc*yGW5lHZYysLw$65HaHrM;k2baO%OC zRkGZinG4wl3;TSk{KLW4P{F~G;aogq#5w^~n7?^`mb*nuEOO5uaIT>(Ydfr71?X5N z^%-yiw(4x%w0f6L6o8w=69v);MVQabKm5<=2n^fl z&gu%q92CEQMn`OE7^jD!dxr>dgnNLc@oD{Cd_hLVakd9X8pIH`XgmYMT!5XFj_jw^ zzyH2T4vtxYR>j%RKYH*(P~8k=K3nUBmXmh1dg3kjF$Qs#E_us7WaLZq4)NjF#i;VD z3D=GSMk~@kr6fbhqZ!8KDG0sGmIhfiW+YWH*Zbm=7l;pK#2r9vJcNITJK`j?j}%n9 z;TV}PBXT+Nl zc`M(lgUK0WES()OL9;6D5u4Bko|rYug9#)TM3fq9h12xVHC zR*!+7|HvP>_Z*4fThZMjx~)*z^VcAhF&E}bP@kl@xSSsu4S~X~-Gu3}X3K1Vg_Rsd zaA*xdhuUxbl57&FdF}nWE(tA<9_IbCIsU;X%RvE%1Lb2*!4cHPT~&f7yMOPV!PQ*g z01S)>Bmf7IkufvfKhxfZ(I>*>pMO9z=r%|+IEA;3Rbb?=LAW78jaL;jkzJIWtqzu0 zX{Vb-Z}j+f5c}etg9ora2j)wm3?gxWxL6{T!V=)?CjR9L0Mnl9`$XV&_o`wUH;{9w zNl6SmuMRXke;z3PCjq$f3~SxCfX)Hkqd=;6Bmu;+RMYD1t{MO8h}a&*TaX06aNM(- zqPz#TkAlA-CB!v$3?Xp8CO!+k#Xz{*^BPj+K!7znwJn7bbC`5h+m=g3qN-eNM`-sS zU#B+pOdszh4%jxZ*^M?9klND2h+h^$+iXaMtnvkZR5C2DY2scu&JP19+MkECd7<;z zEZr84I|`ZRE_$^cX2S{;qY+pLWt5N(EOvReMM61YR<|*#ig+I0*(gFdfw73_3nf}8 zP?CxeUNV%ftOH9j^AWL#Oo&lK{BLl^?Nvd#0GD-PUfH^tNzg~6?bU&$eTaPef-;yK zXkwUqS|U#Gx621p3^^q^p5P=Z7i|BL%M#_$1b~q>JaV`@5EJigbV2O{29PxHf|pRb zvwQ@aIOS>%cVV@*i;2}HupTu13uMk_bSGY!p?Du5l<_eAm`;57%8oW-AD zuna0vBE#uWC@H}ia6N>`cs4HzBFP4ujEY}L83^apPP)sN{h)JmMp^u*S0BzMM$NKa zCoRyEC_)lsay+WT?NlAOJ90$>{zjsT8-U|Es5ck5Rl5!F*&XhA8=LIu5_Rywh&kom zy`Ko5n18V;(+*+Mq5p-7JgfIR;37t?upfHHR=r%@(v`S@fMfg}Yj zMWzMmWc@*(m<+1CGPoW}c_C+*xNXsG3Nv^g_+&`hDG^v6Y(Z=J6Q*2S`=S%v&Vh%rOFAln3G_BV-KzhZv+!0u#u;iGXV3VZ=7{!IPG5 zV#C8wng7Gydj~{$Z~NZF#GS;tvyHJUMo6M4b_EeJv5qJf6qRC$1!*=AR0MPczXlqLmeOi%-#P@Cc(H8b71&bY4@WY168kTo}Rwe#>~#RJdhgcl)dQO zsncN!9Cl9BhO$XgTMz+oev*ekBG{GSj%b3Y=V8!5xk_kgmj2+QZ{eI&GN)sPAroYV9>OAF}n z*w9KFY2@SMsQ2hXBJrPBQvy74E!rp~;{v8IzO;g_0_CRb73$ zCcaq?DN|hXVFiuk6VaKW%V1N;l?Dm5m>0QF&gyL1`0nHzQj-@)zxX}rK;N{mIpyuH 
z>qs?_fCuoHL5e?SK1FZ5_eDajs?5-=BU{iI)Gbj99XK^vhOa@v2NsiM7>>9n6brCb zJ6@^<#m#S1LL@iWUzY=4a(cO}XS2NX@%lAT${9;;o~U(s%=zmRjc95ofU|1J-6B#) zIJI;1{%yD|$mJ-vKx7%*sBa=l-u(4EH0m?q9U%U4=?v*i>1#yO>s9zL%o&7O9Bs=^}bbC8rEs>}~D;$*hLT%15*6yjTzrC+mU}@so|52BqB{Ao8>B$e8`N2!e2|@Xc>=tkD(Z|Qy zc{en*-kj1k2(|6fOmN?9Nj#|L2&-h5s2jm{y6I^R+~VvUbA>(>DK!#$#LLk^kazhB zo;AZbEJQ&AjK8*Y^qic5u0_}XuVf5i)yXlU+TLPFu9QWl2iL7+R<{r}D|$2#0}%_K zR8KuI>Y{(or+;BUYr%+4I$L8J(mp`ef|SXCZsu{osgA+YH{!xOq9s=4f_rwaD-nf2 zR^2+W_t_U;e6cyYY@Zm+xh{*<=NFs^RAU+o_otqE%j&~+x5apZ_KednyJ5v8kRP$- zgQR@Ywd?+NwNKUxGeQmb0OKXVnMGp7!YgI_Z-*TbL8N8)RRGB;`4>5DK}N5&jW<{u z*Z;3&Ir-u_jJlw9nNJ_kZ~;qAY^YN z_vT?0$#`QfhK<9?aDt%zJ6+xX(e@D!8v z8_;A4KScqN2GF>{aTtC+LxNaEU8=313}3)+8iVlZ$Y=8}pXh-ULm(-<=gC3SsdzJ* zNEI|^JeM+Qq-9p+o1(-G6I$*+J3GJm`sB8+UDH0%rY(fezq#x)ggO^b@3qMCo5Ea! z)BPW!f5`CB&v_W_QuBTV;={DwaN#{xd>Z>vj~;O2+omhUfW+bUFzz9an^1P0c>yQW zEfO403s|2u@!4J^W5ViCgIOqZSueAzbLP%0`FPb?-?ph^+)US z&*!Pt71}0U9zqR*6;UK;D_UP=c`U@}II+NqDI@R&V~_(1Sb*?DC)_e_lq3Z`zua-S z2-E~vYJ(feEE`u>bMM~DeU94~yt=Zjm))PgiJVP6`EHar!SrJ=1h6MV4hQ-GtGBck zCOmt2BFv+;yw3NSc7W!$13xn#d@gt4H0?#ht`#i&aG-f;P+7^Y-`w|2dpFN^zIngd zgO}$qSn|hzANxXQZ2L2|vpNm@@|$$cqGY4yhz(yH@OgabM>oR(`zI#Z`9*qWZ7MyL zUt(^I8~?Bt5ozLvipi*jSMQSGQs zY5O-mee&ddvAA$@j}dibUIXl}zr#Zf6LQsGv*($FYFF>TJ;auY$sDYpyOliH0Ia~$ z)bipkxb1sfh#ub5yc-6VsXCMrq12@Ys6=H!qkZpo+OWA7i@vdOR z-1zmk&xa-fIAHf*d#;XmAfreWlCR)INVA1)PjvNpwRn;7W*rNC1&GH(8jGPsF}^y< z;_f9_52xf$1t|L42m5sHJDE1z7jq;VT;Stsc4{vAPU*q>1SgE1;j*M;(~hmno)gWq zj$T3OPqv@Th>KivN4zY^Viys)LG5-Lq^%M+`lZn2`~lf>D-GE=R^qBSqpG4ciYk;P z?QetKua&w;h=1X;i3pw#MW2&Ddd@R1qov0O8O7mM6Sv}WkCw5iFHWC6Et3N}514d& zdyR}D;-l0}*wChmk9;$tM z^{-{2abGL2^qx;Y>W{Yf&khjGZ#xEgD$CVP6L^Y|Zkf$@K?{WQZ+LKYxhRVEPptn2 zsS}eg&a#Ll60_3lAAmcTM6G9G_YzHM_m4gbfS03Kx}KFK2E06JizNsIO$UJsX01{~ zir*w!{qY?7^lfVerPM-}i4?aYrlVF^4CZ8SbsAVQCiCb3kv5TBGsUw`{IF;yE>-ut z?j^EooNB=!0<@mSO3H8I`(y`))%uP4jmr$e94mADz?RET252uZL>2#;HAHYUNRIeK36WS-+^l=wA3;v9%S zLXST|kbvmtbCEi;-F(yx5*ld_DpQIO3)%WX%=XZ^+JtJ7*!E*VN%ZXC4oBz;*4TAU 
z+dm+yzGvzPf!!rrCf{83_VoOW*Z0^s-e$_fufP7d+X~Mf*`rgD>iZ)>= z_yM=fHQpwHiQ?@wtsAa-bvv6g+2vr5krO{*mt&3)xwZr*iEuwV!@wQM-H#wT6Qxj%?i}TFGBy-S(WjxU6U7=$8WdLS z?3~#>CL>Z18A5TATY_{1G^8_G@zGv)*Jt~G+u@_3j}A{zeHqQ+o6YY&p2=h+cH{zl zce$dIt1B~_;dO!GG8Yl^1*2`Ar4$Jh-^Z5U1k)H(t2Ajkj#pZ8YxoQ@5O#=p)LC&i zLhN3tY*0gkmfby^vW1VGa1!q;Lj<+`dN{s;W@v^T!F`&HN9mxpLs_q3}Jwh5*4VJ*}1xDTKL1{x6o7?b1y{WV-6Gn44Nc~S(sL+Hg>z$dAB{!s1u>?Jg$!DB)Jm6sGHZF zk}(+~jR7YYr9SVETm0%n+YNM2)aU{&$(1U;TnVC(nuy~~e~lVVlU`?EcF$+KOQe>l zEG!&Fk!!|kC@ngnGl@%=cWpt{DzD~m?;KZLhj>HXA0HHENFnUw@bO$o7xq%kb&*UL+U~9`7 z;QOIPjww*l4;rs_)$DSl*~pW*TltnYX#@$3wQwx`mdZ&iD%`G$@hF(ungV1X#cvT-tHRJk&INAfpF6gjLVo)^lfjDzZ@MQl-jJ%(*( z+SZjN8(BD>>F4Ks#|NbHCy5Q9*c6>CQQs2fYCdD5xIvM)n6-koS8Uy<`iKWCEFYgb z5Ak5uM2$1MNBY`nY*_b~_uj3J(B||m1d}ExuNak^`HFrMI*8X+EC?Y2GH%N}xEIQb6_b8ES^-UgeSkL@>`f zDk`&!77*;n3NH6r!?0XNvbDbf`ftYKX;?P>tw=3ltMMVr04+|`BRVgP)sC6BVUHTXlB{~_fDU=i7yru;x*EkRc1Q0yFf?(z8sif+-@#U&&doSrS_ zSPG^5XTI$oQV(3s%F60+JTWoRcY)f)eg9KVW^K98a$wiHV>_mHo)zEg5o!xFf@{hh zT1->#v zgn|+af_7=Aqo!Xy$gNli_TM%ea7)!frrrO1{J~N?*X`Q@UCikP#p+rM{R)-*PZ%lj`eIOZ40<qyrq552T7Ot^zu64V6` zxY@hn7(kxbI~4FlGH=Zx^r;zpI}dbK>RN_&xHe*xmxv5NlR+$vz9WX{h#`y^m#z-2r%ZKA)I-LP-dj*3zulBxiy<1!mw=-NN`B z##7Y~RTHYDe{WC3_n0}MOD=ptS&M;Rw&0X!ov>!K@9?1*(Q)iKgw?kW1yVq}Rluc` zlL+TP-TxeuRkbWRO7Ec?Hq<%26{RI+M?_?KtU_P2|2!JG2TXa(v}N#KVq@I%kT(4$ z&r?@{9y1%<>+w)irPFV}yiazF%*Ja&G+9P7h0zwEo8ty`oGcm1F8Jn~o21i9YK#Or zNvw4zM5&UdD|taRW7YTn8rQph`}bcTO>E$a&^VEm-Mi%cs;sqTIdT3p;NpRxfs=Ao zv@u|Yozj3@r%U6+WN(=ODlw0qD`jew1fC~OSH1r9VEe8eVq1;}?NICpFa?R4;p|@Ayz4`K2dTL2`(~lo}PTn`gD9<@FYeYPAaI(zV7E{YjJV z*!CE_K+D3hyRK!SZ5jh3?ZPaeYvdy++u{K{THkcZ=mrSF+MlJQEEET3iLLriy3v>t z2lZxVK<^?K$MT=1erunxC76zLX^l`(c41NFFzCla)z-6S@_ECJjnAJ)6?4FHQxaoh zmO$ZDbfVe}hRG}eJcWC{<|XIMT06Ckl|Gy&!80DBU>zckGj<7dG{Xr$9evG(Q~2b* zG<2au|53{&mM=0YPZ$t9UCECzLPJ z@7sru&f^cbUW=dFRH$zFoynt1XdOifoVN^l7!n!GSSN!nu(M5JGNOqJs*Z+sD6#qp z6mg(%pPx2TXZJl}&GO7QvFA@Xm9jHOsSXd#EUL!axHd(zynPu79@GjAkkVDC1A2MM 
zNJZfo5Z&2Qfk;xfyc^UzHyeA)flJ^0$I1kQqb@G(N#Zo>$G?$N=`ePVt($VM4NrS^ARE7vtAU-xQ zu$)A{arP!|Nv&_1cqtvLfx$hYUNB zi6CR}n^Dv0n68?gvk*D+b36ia$C`wx4TY%hANX!&5EzZjVIVMHE71FfQt;q5eHZdJ z0lhE?-ehTZ@7C?K%-@ZRlW`~Ud_^V#n2KvC%Uw+ZpZ>o7M|u@KXn7aXGdU)xmHp|@ zd^{ARREz2EhQEa$-2G=ewQ_Ffl+E|MRPaqHuB+)2Bso!fPE-O!=mtH-7sDB76810E z&pXiFq_YoeYK#m`Mkg%ua@KWeGr**dWm5R9dl$TVj$d!-6{b=VX~g|%v+Eh1^K*$; z;Vd#XIIg`W-ec|TVaFN zhintjgAXqt+c^4k5rmp}n}A;&f(8c6| zt=jK*D5)FsI0z)-RMc_wILbG}EuHa|p`7X)rdGSu>-0$Y^wOqsPYDs#Ina+!fVt(% z-5D;DG!K$tAvGsC8BxegUCEuvKlU=hWhodD;51hW*H;h&IA$cHn3V0+ zuV1JjR!CuK<^`lce`h6SD6Y=lQ1Cq5P}}kK z0LumuJrfKVgECVoxCG#ECFpBRegXnwr-ro;t?_)tPh7??q4G2bNy*&hUT*PP8?-HvZ@sK#TWNKT9d3mW8M`<+Z&0aM#Y=wh!aB+Fjsp+Cb-Df!6bDEB6%syIr|GB$b z%Xp?N;9-j6#VcLMw*gd?mNF8fXxK1Q5y5i77%Nebg(9{Ym9vWZW7g3qX?~zUweI=x zcU48e_uptU2bQa)lj##Sy*QeN-+_B{C3W%g7H9Y0jg@K&3%JsFQCL!DDR}s&S z%p(&%BIxo-uK`Buqmcm|3ysV6GQ!vw^YB_H8q^gj8U#0-<=>&KGWWhx$1oAw7l<=- zbzR;#%){0vWyJPL+z$C2RVDqCSUO!xWc|VYKKS6^$^pds1}6gF7-Sw9*K^pyvj;lo zwfNVEj^Vx(c zR1iQ4+6u970%W}!nsu+dd{s1Ky6cLJ8oF33Q1vYHE>JbQPhQA3Vg>+9;d7>PQZe9E zuX1pUcd1(+?eV(uQ`wDRz4>;+pTQ+?!3|OjS#o|w{3_p?Cx{#r@!KXg(CSG{(ank6 zsp3Caf3jeygyA4C!Cm)-7~DFgrq}mntwj|}KY*&fw$KG7x!K*v_wN13Si9AUj}k7r zx5SG!MMOE2I|WR3W6ra^IQJe0sNO@pp45qwV5v2jY8qoshC>|{nFJk><)`$QVM(-F z1riadZ=!KD+wY!(ZmmUt)Wi=4NHeLU?u$;B0P#OYaqJHwad@cwg-j*wm* zItYa**Z0kv^Y|qt^-D)pQwPfQCysQ#+B-IXZB>nJQ{J+kUHtp(q8r;;Ne6?2?c$Mw zO_HM(zC@g`idjACPXq+X94Hyv?Y7wA9JPZjn1y_=P~WzRTzuw5{xd(IP(LSSD6qM7 z8Qcz{VPg88Ru_aXrkcZfheon3h=s#G^RwZ;l7@*64k^YZ;6C6S zWjGIlp;Rp~Z-Rc@YJ1ZXXj~{^B7!!W@8c|4#0Us2VdV2O??2{F6giTd|6D5Rc)HQs zO4ECUL2@|1#BN5%D9=DBH@4Ar90w(KPE=i(6Y>gJHr7VwLPbw6pKpZ+_#*8UR z2p+A2UK3hD)|~%I1Me3ScfQTkWfF=}YuC08IHhcP;K=FgkCN3iwlhl%6DTT0>F)yI-QnH%5I0imoVNj)2 ze*5XE15VGzzXVNUD(r*4OI3D`l+q9M$}STDz^jXO?pP53#DdDjR7vn^AbjZ>*Gv>o zZls(k*Z1!@iLj4^#uxSV*)qwIUd&jDdE)3QSX4k|bwE++x{)*1ko#0 z{}Gwmu^lI8${~@LgnKyP%I)~zi_jO&-cF0T#KglAVh!xo*Fq&}l_y!T68S5G*-VYN zq$01-t!tOM@<;iuoLFW}oYxaDI~!KN2bot)J0sn&;eZj{2<5bsJl-FP5Fjg8T6?6UoGdXRZHX^! 
zFFwWH)8@q5^*!FY@1MKsXNoV!xVHX#vinV&y$4IQH6|xTd^yVFzJ1_9yI1|&?Ywa! z+p|t{#+px!94{pFRIQy9qqlddSq@VM>bviGhHF+-^wJN&(NFvCL-?b8o8#30gvsL` z%)E=OeLh3LW`N?E(p`i5hrSS_5{K^~t5z(`LizsQ@zQ|Dclye8i0T(>4&P*ZmsqP{ zxkA|phiIGp4H`U5OJBoyckGc*7beu5LYQ@#S8eTm!iGLoLX=5|lOQBkhiXB84yaDIlG;1GuPi^_BfEh^T>%-gtghMG0rC4v9 zSv$t&W8P(3CX-98(=0dD$3AdWU%M@=iisUT2u7ELOY)uw_yAZx?B8EEXG~r139)qI zJhp>cq>^OzS|e44_;95G=Xkwyp^R-In2lsC;UxISfhzAnz$+%@1{a&b3#s9A>aV+<|?1F&)PI-FJ2tgawBM0 zeDYg9{4?8^7uM`sgar}y{CqJT!Hs-NHdIuxoap(Mu@d`}2Sv$G7vp$##_34c=Z{wI z&Jgv(8Om3MT>mn7-Og9-chp)|DoVw{Rp7Al25|CH8sf?b6avXD25oPd7q3 zwG{boxqyus=a(0!Okhm%l_6>$rY^#8qd2JYmH`J->ek+zD0x(LT3QMzoL22OTnik} z4YV%0&Z(}R^FBmE^Xl}3=LwC0`0A|Yl$?CWnof^6Aj$s$3#O!!B-<&^2z{9a-ORed zhkD*H-GRZ-TDWxqHLGH*GHEcsm$__)Mo)j~=HgQgYHr5F%94C_f;*hg>g_-N&el|0 z4~k@#(n z3=fIqZ8TSV9K(U}`@)roMx@R4EX~ln|_a9--gL^`2KiLu3fGOcw>1$7sbmM>@HPE4o+aCJ3zRuqxUhp)pN@ zx~rC0eQy*@AgVk;yrd@@iou#B`319!NF(yAQB3^-@r!T7DPcDmD59^hKKz-YsyI5B zbcfZuGlSmRW_xp}g2r01Dr@y-*t+4trOXq7Q%;o{A9CrJ!7NkBO0bAseUxd|Bra8^Caf|v;>yj`Reyw?0{vR;dwNRVB?DfEkk|l7nh(=0 zh$l^W5J;sA7y}vAM(8C*x#AX|92U2CZOGD{q6Sz#b3lMI3G(j zsQa##zT;Xo+3?*Bd(?A;?;hxjS}dH!f59yio~L-yr>d8Ie_HFKlsEv@QJqp_P@~@D zec7K>6=WLqqgV}c^WP}PlpwNZ)`0Xx40QWIE(;VEk!#*846rGlpeoljn!6!4x`Vqd z)L>7JXbA9JGHy{18RQz*Ni$L(M)~be>HXK|NHIE?#|+DjU*gqh~?Qy>2N>#sIk z4$SK8a4_-GB^htSut;;c@EJg}V(cwsvFujsQDhAVxyNZ6RlFJQ&VLUjJ zNN*wTCzP+cox`CetrSI^9rg?Apo7iL&704t6EZfsna9R_^2u}xzC1eB2W0xvNlEqV z(&?mj5cq-YVCUM~K|ohS^!1l=ol7oI=p>Gl98y1pQ#3aSpxhH!zd2BZ+?-9MW*nbP7wc=(2ZwBDXS3K0Mk!n_NR+$c0&K4Dpy zs>C_9l+g;!kq*(;M*xYzC$tno4p3AdwTo8$^vNb1$VmhOu}+4-QBsymtvsDreztfA^G zxudhd|C`^_=wYr7hi-|G+sTdY=?+}8y3g^~#F*Q;0MlZg;BGG#!E1-zy( zaOS~}m8)xc={Q9dnK!q)&`wne72~EtRfAH|(!+FD0kNrRy zkDWWvR1ii4)J)Q7`0e}Q!h(pwGT;h|<5`k{Ol{v_K6!GALkk7P>gej+pz?X4-9DMk z7wfWbPM}QsR}xOfPml#PNM9bUOyr`nl%_(qY!ZN*Ot}C2)-D+cbA8 zOiEp<6b4ByCVE3}r_nc2q=YZ?XxS1x7shW9+g%`3(9!(r02}iJ&+5G3K#qj6ajl$G zQOq=!w2SZ2|KFYGOEbQ-_#XGjpSoB%nINx4al}?Mh*Fcu+_!F%SN+;!xwx=obavbJbjW#(3(x2pWq&^u_l zKUW2pYc#t(DC|w~Lcb4NU`fzWX 
z6ZzFPB~)^>Y0^(WXQ1OFessPJOX66%KV8kCa3EN3frShtAyZ?J*r>@ExUg()pK)cc zeeh-qY$(f|#poDfrg~24I-6JTc?L?>AS>{1{AuGoOmdiH| z^>cr%K)~?3uvczS(`wZ&-bsCE;IqG+)82pCca?_eC0lua0;@x9+s{wZdQ5KRDT z1!Gb`Zc&=Tc#7|wfDbPD%wF16Y#OQh7L~zP$V|VX8Bf{a)yb_b;itAe?A4YjCBuC% zh-Wu$SDovc;hSdNoroS-xN@pkxL%3xrfk#Hxjm@g zWPlAM$oZPFo?n0ch)vHoO7W~dxt#Uaon>x>(ENL`e#ZRf#i%zKh%zEJz zAvE9xXDYeMC!r?!?7Aa~T61J}1Tg!?890LM_9?S>7Zw@D7}UJP31j_V4KQI|J&KTyJpujY2g|{ zv;9{3wQIkASwG+SZm&&vMZ*-bq~)wP-927p-?pv1C^}T2vdoRkRJ-S;0nIFY%Y|a< z(Cg>-qqv}mN5MBc=Tf;P(50Ed#~)u#`dFCG1>nADTm|NXeXG1Vk+LnDYeQSXf5@ef zTyK3`^^*@+i=3H-U|Uu>X%#qqTS3A2qitz%+5JzuA<#WJbp$kVa_y(T|@2 z!<*Nz!)|X1Tkg@4XIo|y>*Op!qtbx=k&qQ+`YABwG=a}HZ{F;=3II)P&d5R*K}*a7 z%Jx5*_H{CNM@+%i;s3Ytg_>x>{LLjE7tTEjJuUUva$&rwzuRI5$RLD6$fsF%Sgt~< zCpUIK?|PZb&b)?Iw6-u1O-KF*`F=apWRap(D_r)M@}bgeWLAiHwee-f zocjL5!EF+Da2ff~pfw`!RAloeBnwoyS(c&w&bITcMz~5Ds76!ef*< zUxXPS0?u}XiF&NNNGTsgG6;rHWj36!IU&O|vA)R23K7o2j>$9@_9eV!C0+j`;stZ) zIeZqBl|mQ(rJvBrDdz;w-kkw9FQuDEQ`jAkL<|k{gU0S0_ui5g5sy&)7JC$3~W<^{KEDFGs&zR6fKGc z&;cCfM{A;07lar`9jWIzKkYe<&;9btFD0*MlNAL>Y4*)Y@O!Xe$N}U0`FYtHHyxqV zGd~;ci-e#`qCZegKc~y6lKPWFX(1FDG?|qkIQ+kD?-q%dAcow*c7c5x+=jDAP_yuw z=ka5q3ZjPYDLd3wVRpx=-lJHDqT{bj;Gnh^=pU>P1+X@rba z8o&H!e6G>AN2;2BO_xIVJy2!#MfIm%bU2E_So97;r2$~xZcKs~muY&e!Z|b!o2MV| zB{BC@cX(Y}`&x_+p7iQ|x3&q=^gas#b|x8@Gb}gM8C7U)Q8GZKS;@hspK@r&JaNZnw zY2skgj35h?LiJt#@kt&S^MnXGujW%Nrgi4t{?wXlwk1Z%w0As3)dGIJ=nM3s6MTo% z88z9V=Kl1&6#XNT{Ew0Nw(g+iJ>mXocFwVOWV7YpDqnhwz*VJkc^=-)YM z#n0_@Ha4QhM1a(H_%fk{Lt}_NB@C=iKY#wQSlt~heW<%vBxhZm#65qpM$AaQHk^w2>V|ZOR z@~v?K8HG5rr$&vv!o#qIxccnx7HetN92BAW%Twhz%Y5%u z%M@%`zx3m?L)#YMOY7i8D`vJ4)Ezu(*M@(5VMo`uCa=wSa|rIp>eANvkb|4ng>coi zUGjKCL_1cRV_k%hF2^vi^P&xISzUVVj&YXjw$0`v=InrW1qs+@_U4B#52tq#(VL|7 zc(w?PCDY&#F?+=18+Y&1$1a*P^+eFf*hZQ0-dLgbszktMbExh{?TRe{G&KGR6?zVu zJE~osH%`BMiKCWCZ-tWPRf09_vqZoqRvnlTnXJSSB06N+9UX2A;?x%7Y=aAxG$5g= zc$is@t#j~;g*u}?7&Z1`{>E4Qu;aaZ^zVP_UUiUx-=0qQHvFTlMXwFb>FL_~SVQAY z?VWyKK6HNHLuXLI@2U&`&Sjr_ak6-)u5#!qm&%Ih;W)%zJnv#$y2UN<#p4w&TZbvc 
zMgpaxtpNEK`AMXqH7)?rDuMuF%ef?xezB!l&0x*JAZwD6L&Y3GX=9FVqNv!)5}8V{ zFGjMUk#usEr$#4?RsniD9jTsT;o8MHO4|(|yz<5Kr4P%?9n&;luCII0UHl1N2_hC) zNI;Q&)+_2pivsx)4c zYcD`+FmOP=gVan?-I4l}Q?otO#wGsb%Vl*B+M`*N8AV!Hl;+ZlUT^f_1bGeQpjCSN zpy7Y78VF4(0`0Q(qmVUSMQj1L z4f?_zg#i{%sa63t=lwV!(bue_|F0samf@PB3bW(Hka3Toq`&ReZf%6+ILEVJdu{_^ zg7M`yay)PG49pB_)|V&5MVC?3^zGsUq958QhDJ&bAluR{vl+GHS+2V9?&|^zQ-K}U z64$sy0D>3dNw*eb6rH>v#Iq0zv}->qxf#mcK6E4?IAy^ht?t8@S)x;o-16qqInXb& zCNA0Iy}R}6=UW>Gw%jFr`Trc);OiW#x)O+rz&i}HM^~i`X=M@`huBS!!^by$oZro) zIEb$5)TvX{mh}nmkj4-nLw6y58f?a1SMhx=v9C%)*j zn7-GdM#LarYU>A6y|G#RkY?y8B+mFds#Q^CbN0_J7>~+{IAVbiogDW(1t6}|bD>9J z4&wWYY=mLhw7%}tB&J(&e~EyqlJwaeegNCYe5sqLefXF*yg8-6Tzgb5UEJ2HK3C?w z!zr&!()25|kqi0dp#WVpT}lk3QkBWK2eicA#S&QvZMO)JQ0A3#YvD369$!lm z!6EP9mTbyl(C*~Anp%?>knFkGH zucn^aY3Q?kkK?knJv{*uKZY`=;BDc8jk8+d1;m;SQ{fL|P)#*w#v*@!=o7f`oCwp& zx`PmuCZ-ZVJ;pO52&U=>YKCb%TvIwDnajrIxP%i~LIy-ZNiGE)Ls_HRw_WYWq|M_c z)9!m?ArXB5pO^--0Mp268s_`XOE*UR&w7k{qsIL{zF*!qPuioQ85t0D&met%(bcRs z7Yux~1qno|HJ|1~4lyK)W)e)wg@q<$_ibHOfv8a*)i)tkA09ylNg!MLnAmMBA>=s; z9>Kf55i}}0j#D_>5NP(#PE1R>WpN(tYInS84ZD`>KRVGwwV2Ys_vx?K%G(p~ThgnC zeSj(aU9-l_;gy%i$Th=E>Ug;8v(PJ#WNpzU)I`h3ch61F^pv@+b3%=kL{R>OV*0jY z|I`tJMX4m!8h7h5gJ#%@us{Y-L0l{!Oyw0|6VMcQ2wAHLNwF^o_CS$&KdSeJ(5nY^ zLR)_>SfRRTUcT!HuqlHZc)`X!uEZxK1hYkm0?DP2uA4B4A0q01Epbajh{Psd5na7H z>(*>4W_+S*kUfv6j-#2EDzfFrY=5UgVn`qeD>*xg8TcZ&I~ZiK`KRh~ z8NJNGGi;NwhU8~HBo0)Do=bJseVnx!(pxRgBZWxbYqqE^3XQ&Sb5rruJUvj_|L?Bh*j$kue)j?Xt+x2kL%E)2GZL;qk+zT zNe!kk+=Vy3@$!<6TBbWfYzwGITORB6UUHAr5m8KY_;j&aZo{^(C>1&JLL0dYWxn8^ z3L$ht8mDwMJnS1$D`_c`x+X9r#p#}hBj%WTt}L|drTQ@UtzrfQwG^(}FRU%?hETaA z8uU9+0g)hXC~XCF@^-@JYKG=SfXPYK!XAin9$i-b>7wc3swbIZw3}MHBZvg7yA8M= zKTbMrDl0_KzLolHB*7~cW>$P$WE{w#C^4fA;`juN-VUOZ5XK#F0B0^8 zgM#BD8GG6i&49IuI`bA0Zab zusvlcs`!uu{3lUS|IFX_`M>le#1$4M*7+?;tF{T6GLlE|0-3p!_k#B&Sfa4TqNm`J zFr7N}myF*mEu(0#pl^#Cdv7340RR2v;-ZQCH8FyKC?;1d!K5r&S@Fn86(U#xc&jXv zza2Nh_C^Z~@+gBvtGADt*K~1S1ksIkskIE=PhK})hj#Hu?N|jkgUq?~r^A_WCW#*= 
zT0MMO_zy7!UajvdLI;j|j`mMQm&+^5pC{E%Z_kC_LeZp9b+Yx zAwGvMO?Ru-&Vg9{nwd3XjAp5!`{<#_hBf;SIRh8kiuRZE2O{#Z1&OsTOyj@@05j=f zVl5%DW=%9{jUTGfo!czE-pP#R>^#iew)I0D215wtCBZCvx{05io=AjEvaN?wkfLs zwyxQzcH3^|c`ZSX^Q@vPDBtr{oZyL1Rd>Q_N z+TJ(ux#5{w8Pvl+yzuV_q>7x-CeFAzQ96F=1DU@pVoP_9x&k^xztOu@7v$vrlIa5; z&(Zun=&P2(Y{^M_VPSqS(;og~01VKy-s!JrQxEN9Qw6ZPWgHfySzIVqfc?}m{`}gd zeyLcmYQ!`Ft23Wia>polwI?>m^Z2cpWhX6@b}Z%mi%xo7r33QR5Yz7$&y_Qk2$;^n ztDZw$DV0OywW~2e$+on9-|gKC7eJ8bEx@od9OQbj##25*LSUgpLwV}@k+$Da|CPzA zy%N$;f28{%-^Ll4tuy~@v(HGIea*!EVQz|kDAcJAxFx_rHuu2>-Ss`q-t+A}X*?Ay zzsnS45E3KE=d;eOBi_lBpS&|fMm&cM+!)h?+kAEZU6JJxhy3(gU`$y`d0hEorWKJ3 zNCn2;nFVQM=E(mG10-M@(YjG=C?^-SSy2eVS#|q2ZNK!)FbA5-Rf&$KyfBV*_(!Nk081a87Pb*(tB04+G=w@T)pcV6C{L< zlaw{dcA#cGa_rb=m94doWOh1UD`r1L=FhZ}K)w?rS9uC}MloIu{wd>nO!L*gO7qag z?J6hw!{RZqncnaf$jb?VRBoV6f+qkc5=3Hp&Qah7$< z+s^zpJH%Y@Hu~ezqmb*0^k&Q?@CP5cfIf45DAr9$mQG09a zH(fheEIh5GDw0RaVJYNUZ0U7NG!l{^Eg~o>hrqTiJD%Ji+!qO@jzaA-tS^Ypa55)d za!pbNRfM1%lI4v>^antniZAn2XT*U>pUwU;%n*rR6%oo*c`*Xff691f{I4a6QKtR$ zHcl1{gPxk1nwv_Rh|pkAL60F*%r^Rulb@uiqqm@sMug{}=$c;-7Hb^X);Fl;;$uV? zMfY|D1Gd?cpGmQjEZG*~^naym@7=~a(F27^nZ-xdVO9!*4Q`jKRP1*LT(xo0<3=?X zFNAz3FkO-_80rEXo)d5_B5On^iE4q(7BoqMB1*JXTW7KtH`yf^HN$5&MvKHh)zLE9 z-Aaz}BrKDH!Am0Lg)h=%Zj-o{=FR(O+DO%Vc$vlzprEtflLo_J%I?c!EjrQuVXP0H zVJ5Fr+Eb7LaVp3(miQX?D)f2uN}A(msSXj5z7L9y{1`7cDTR{3*p|(GEl(oqq`&4* zWwFr4oC*ubJs$@RATX6oh=GB#RF|M`HV$i_3Bv)iCy<7CCH?)ibkgT^J~1a}AzDwF zAlf%bJVY|T(xbI$Ia??5+(3tPOJc`_SrUc~70nNv8V_bVs4a$q`&;ksK!L(4m`gg1 zn9ux6_6K5pKS)jz)t-!0Oh;c~(=c`Vbn_RZ0!yuio#~n+MoO*{8P0PESRtoM^%izE ze>|gI1>M82WEt9$TP@GmbyF{fVC5%NW<6 zN^jpLq<1kVMtWA(@EO(WAMKaM6q(j7axqyFk}8*7^FlKrKE8^bWCp(RoS2t8AH9Bv zZ%2ZsIZCAu=7E~QfjNe%f2!g39epq%hQoI*K*~3dTEzc(!f{>-Dy`h`55 zF*w=2yIMDx?#gc>ZF0$swPZnquwY7a7Y`uk^qLuJ71GX$*Srk~H<<^ao}299AqSGE z_g9(!^_w*&_kZ|h!Bp{CF-hzJp)mM29?_^S#K2&cquOH{^~dbnXUah!QM=Xst#ige z#D=4HGs70JkU~kk{!*!iscB9$woT=mfp|aSgp?87Fu;o#exQiBTp?s4reBF<6t2(> zEJOOkVQNK~#^$5LX4sCWpOZaJOa7dDF-X9tJLBh=!afuL1_~M%y`gLwdYnbP%6`|? 
zKJ?~Y4la7eN>Bljn6lmm1K@@yHlDcVJxr%^?W3UzPL`V-1Pz3^UcG)@r_^=h4ZH9} zq60(xD4Xv@dlV-vZF#>_ryI_RMC{$Dk7raxdSkrm0X|-1ch^V8(MxWHwDRQ6DK0LT>6EE1+7ZekJ9*X<7=AEvw*%LU?syHZSq9% zNg@6`nwN$I_{tSq1i8JLjV?iyvQR};L0qek)t5=6z4^cO_CIJ^wT;U?yYxFznZy+; z<2f2`S_Mi8ErHB*gJQ&ajuP(?ceN?WZ~iBL zc4SS+90rQsM`g@~WUjJ8J_51#mL}dOhO2Y%e`FZYsOja{O$2Z;O8{TOF=~?k-K9NT zHfw{__T;ubqMYtQ~_P^hg!{WbUbEfWm8n)re-OM!K#4BbCLp2Cr^N`d_R3LxLxN{}@z!u_U?LtFdp&*he>nMboOcG6 zmRiQxFJ+6|8C)$FE3jhjDOCKV14sxQ)h4PersXu;|4kVKP9P{NA9^&0Og>776i!4OjVNBYsTz<*o{o9 z?GKJD)>24f1!3U6F@l02NU|}k$lp}_$2lZzK1==Ktj?+jR=c-f=Br=Wn*j)46xl>w ziR?TuR6LHsR!A$wf6zE~qM@O1kJW9XJ%)VUSUa?SX5J=Q+gk7c)YhXLj+GW&m8YRmfQ)Z ztq_wRDchYH`Yiz|T9ur;x!_$`|5vygrk{B7+fAu<6Z>_fT! z&8`;e|473PRRfy4`AqO^ih4`a?&acPO}X2B>NGXdzngOMWtRx!z>z!NeQPrIksB@t z{)_8+FQ)5@=xr?SJw>Z@U}d3wU{soo7)#dfub4iLUnp;+T z2I!FESzXm`&HUMU{yKnlvnKrT^UzQG#NPoFY#YAKa;M!)?!V*7=d$CJPOqQapxw2i zgNS$y`?&;sh7wct!y-H}X6u4+Lr2G)8tqsXX$2im^5j7DeUH4{glA{bSIeZD3u;&M z>Mr<`Z?mDC%CB?=jSi(dh%_xg74`qiFKiO3I>mODlSz|tTen@hg6-$6wVS!4o3&pZ zTG{iDJKililhZTlD>^T;*$)lUP5uk+@}gU@@2QOa8X8>&ZdAQ#4ZW}bW{>*k|NDpk@8SHllK$Vy`On=^k-SSScrDScHOejW zHF=)v>}oLBN{K6TzUWCsL`hv>O2N=TICm8RvN3AudMT6IZ%)uyodtC^mLj>YMb`D} z*JI+O4&2!`2qDnbkRwqv9CY^*Q7Fl=G4GcC^=>pG_NV9%Fecdrw@rNkW9+9(?&9xj_AbuRd`rtxj3^rF7C`vH@={iwv#9BqKQG*8> zH8*Ca3Hr`|P$;Fmgix}!y-j)HQ#tmOTjSE;Qi(Ex<*}P~=V!9h?RRG&NxOw&LWdf> zx&@JM=e~M3U(be@`iWva0IabvV$xq zT7H>0!)$XPbFqMFbsxH99+hgunmhkXB`C@F+bnGV`nVbg+}_E2b&~|Mp8zgqDFaZq z@+1w`_)-8YsevvE2Hq_?l>^9Q8JE2s*0R;A$%vqJi5t+u_Mug!zY6O2t8s%6$#-Oa zq~y0)Jbd$C<&){&6I9=5^#F>U5AgLaUbbzTav|a+n8acmJIShK3*<*yQND~fnlWVV z+^pIAp4Qh#6cZxBD><(Yt$67iC~2?hxXiU?Ea^B3rJ(O95)2#!X*RDB%a=j+goH0e zMhh%68g-yMAri-5H*=Y0K51YhFse=4G;LeCGfObCpzQu2N7)2oBqgq6dbL;}L5n+L4cx3}?KpRO{anw5r3YFy*9XO6|@q7@{ZF2vfv zWxTOWPsr27V@=vjZ(ZOMLyNIJJ%1D@IZbTQiFYU8k;+f80LWbuK~jt$N;G!fzRbn0 zIsU2PJIyV>T)85yr7rulH)fWL!U%`?EF}_M+4Kfm5(OFIcOKFr@c112BPP?Yal>?s zCTnePgw9xJ?QHF|b?erL9p63>e&*D=l-9DhH8%JD`!CTPR?E+plvk)3W2FEL(nbbV 
zpD{FP0{1d;@rdY0OizgOT9+YZ#lvg*s$S^ku2NAl6$aE~mJ$oCz0sZ5W+Km}uMe@d z4yj-&RdklG!O@_R1Z78y^~ABNMGbKtwSQ3bv-4nkLmsH&|pyvs7W*69pt5wIVSh?RK;j5A% zYE>t(9kp#7o76`M3RfRz6d=A4hYE!Q^&?uHPncM7p%mrg-&MzdE3uvPpO(^Aid!#4 z=IY*Yyg@@B5(6SS`kG2Pa#tgDw&9dsuZp2j)aM#K9lDR!;ykH4rcx_TLMvLTZ4Qhg zQ|yeD66+PZEb+CMnP>M4}?vYgNsiZ9DXP27||(OGYbFjZjbq1M(l>rI5!iMx2a zIW!h|0w<7esyo4;GD@J~Hdx^0wST=g&8`DCdJ0rbOa&7XK6vq8s_~6!?gRlX0+0-? zfYYdrIUNm$Dx<9au5Zgzna@)6JKT4EE8R`x9L0>8m>O^3?fc+=CsosV=cxZ5LxVP| zSL37h|6Yv`DeZTS1Rni}^^%(^!eO$+k+UnsUvcy1O{W%?<``yu2M&0q>f<%5-}7m` zC+c_Lw3+d}+PR?NNy`soejmc~9+KB3DDUOVmz`+Jko8eDMXqV%xx@g?aDXa0TmVBX#<3=9to^HrPHZ%wfcqbvUzn$LaWkp2! z0*^VJQZKIZYm*rQm>+Ze^}zv_p1EH*3@AV8&F9qc_v53M*Y|3v$Z<(Hf~;OyHPJ~}l6Hr{znQ}`47$TdET ztRxPp2E5UB0Ey&r606olBYhPbg&(<>ss-i~Ito|UIehhPaSsJyp+cq!K?b<(Q*!3= zCth#HKjyD&PH@urc;i0~bmee_%Q6*ru^eAKV;i`o-#WZ*-enare$B%TOWqtpb3a9F zP%vJSmoH(cs40=he9aT-j?A7cGzp-moj1k0!7TODV3upDx@uvzKFuk14#%&Q9|B22 zyjKF_G5})X!i9%rZm2N4LfQccHLww729y4i%C`z1)WxKB(s+|emBGe!!l)ogx3ruE zoJ64>S$#-aVUmDj!tCZb7rOm3eYyA-L|ZEFjS4CRI@$^}RRXp6d%_P%@+8!v%*h+$ z_Sy;f^?`HIwl1xS_x0XP?3NziE{*eXoM3^;f#5xIyj^Me2gKI5dl6`1(q(k`&!~o7tIfy2$XJPoL{b5bg)6Kb^a3vM@F(CW z5sY{+e!W%v(gIp`Keect@c+wCYk6d%@L(hqUN57afrhs=hBwlao)aRbt^RmVV1Ozxvbc?3#DGb*uSz+!6~5W?OwIfgkw~2EzD%ocbs) zxW}6}O&d1Vc>DV9qYN|K81&6IkJc#W%vtov^G6Qp!Wf4>v**kin`yu;c?qKQ%Nqk!ta`R1Dxh~}G+-Fo`^KVQDwj*ayNYKaee z_WXwPaG;@~mBS?Jt`9f3;B*fkl0(g6?w@mb93pzTa(#*)RR6|F$<*<2RE0ECTaFz+ z{v)ahoEmAM>bdJ9Db~LmRa8`TkFRQb6w}AQX+6K+MltnCK=(ZclGgsxp4nm@N&;Vo zuYT_bbnPP@_zW7~t)Iko#Yr(iqH9Nt9Jy=uCsUXyd-)P`FFs$gq$eVayF78OxYXR- z4;vdB&n{1%#0{|z6I*D^8UaOdFmi)tfow_x3mJ z%|S@;&%c1(ccE4N7*Mpl?(+dkt$SR@-OI(tUp?@rZ@+yX8q%^#Y~6!~5AUk6V$GV} zB5)7Szb1oOw)Ow1Rk?R>r;sHV+L150>-+CJ4jndZ<*HS0qt|3G;9YpsAr*;^6>w8HE=2`+OVJk6We__*jRoLE0_M4er5 z<~zM)Cav@(7$k*txVyP|$5%tiL~Tzf_;f&uyN@4t>e#U(Gb2uOKOE!!%OBR-Y0I#8 zKHQz;T{hrX_QXDp`k$a?&PPRk{nb}*^P0~?05B$H-Hig|@#S&nU&eotbS8ss_Z0p1 z<${79R4db{oA1`vdW~Flu0R6{+-+I|RYK;`}8^GlzV70+1IW5iVdWQ1)i 
zxUj#RIAll%A#L=QFI(0htbQl`YHDd|zj5QnosWq4_sNreSPkC+A6I>7R?q48l1K@8 zyVz8i$wQXlKVBxcHW@Lvd|LI6`*rQ3Y9hWk>3Eb`sO(jK{k;bd+S}RL>FMeH!0LVp z>GVrnWXN|}iPvh&7Q(7_p&I)Cs(bUWp4a!?`-5dlmZ=OGBC#k*nMsBU4M-X&iiifO z2$c*Ag_0o+rcer%M9Po`qNtEchESr>6cs9|{kmIg?PKr#$8+rGzvprMj&-cx`jXG* z{l4#OIIr_OufM*Oj2bm+xt(2Sc*J#Ff5ZfCHKqHq8oIi$w908Rj;`Nc66*piRRdlx zNW64aQMaCoOAR+%TCm%Ng>uiHJ!9B%n*?$h{b5XsZ`szzLp@}L0mfW?{dR_~CoRNM zslZ3+VDS=G5vy6ob*r>aY`$c&tVL-9Op{T|5@l%lCa|w(IcH(fqD9ydbmdIlBdwgt zch~N<@ncxd?>suW6{}ZwIA+j+fdXt*TayE6X|p)b#_ji2lO z=9MhC)LrgbE+EgODN}rEpB);aC6h9=@O76w2WUsSH*1#4@2{&w0!`g`|mS1H6UDczG5?48-AL?sW=AzmGAxY(*D7 zZkbdTpA%VksEho>TA77#|H=vt~_1E&Fv19%AyX?)TBEsZaOrgJEVlBsg;2RjFF%MSK4K{X5+g z)t9+bJ4qflQ(xuQnKLuQx*tXEM?&rKqetgbEd^0jj=z*c&{TvJ9Y_py*M+ol7_Cus zzSE)*0>AR8yWNW$td}fV;vW$a@n8#Lwhq}RE@(VxxyB4$Iet`4s{TBimvCa)JJQwf zejHLb{m95jz1qb%RL@pSen@R&d<}3u@P?KKm@a zeI^wLBsCWl@P&ts9XnR7rR8^Vi`~*wRCpOHmjc>_L-?~bX(l;DKfM3&p@ASencm;n z;Mgh;er9#T(MWt*vK~JU5|jC3M~~XR7Af1v z+}u1;LwcjrU#aVkH|u@*YuU2~lC(d>*bJL^`WNnGP+?Nm$CGoyV|*rUzmb^eV?&N# z605;^JMSDbKwjQT$!~P3gKl2)lx?h$GatBSmHIK^c9)P-s2KU_zRlL^V2tfXX#Rq$ zHGZAEu75b}+!1Tf?%dfnrhL)tZOI8Qr)Sx27_H)d(q#O+{L@kD85tRRqB@+E3YD|f zD5Y#v!|&|O%)?@(&A3=3|2_@zH#>}s-7=LQha5;SPVAp=rC~8zx9*bZ&z*IRPewmh z?2)Uio#J(gZNXWt{rdHrtE<~aA_5ggG0rY7e*xh-s*bN>`4z>4Q1LkF20~L!mp8X2O}BC7!fCdKTXA2Tt5lzX$If1d#_&YBwxRO_rfwOGKtd@vHlHy`d)n35{7A)uUwhA=g)y} zOG|fCX9r3}o;rGTFAr6+gKc?qggfZzMESurz*A0vMo?NZ$?fZ42>U+Rkp}+$)%G@2 zv0VI)G(77KJMT5lzuojaI>PDZm@)(Qdj573e%pTFtfL9(t^~82s8mZs? 
zUy@c}3pu;GYV)voz%K0kLOL0SXi}**H_$VB4yT{RYfqNDpXTQrJ9f<46DKTg{9Yds z#UbW&k6FKV?b;_8HBOp3_3W855(36d0|RLyGM86ehXK>K4G0Pfl4{sLU~*4-uR9!A z#L>Q_51FsnabxX*h^{oA0aNM^cRL~b8o2gjp>)lVv#P?-Lp ziC*;=X@$Sm)@m_>Ah!KY=|6PBgx&-#sfspRJ(1_s^X}8#sQR9u_NNBq`SqvPy-SoC ztlN{EyoA6#(#s-$JvW#riH&7PjPzDst{+l3;xA2WV@I`J1uY>>>) z)bskvvKyfc?gmx=gZ1Fz=@IzZqa#a>5+%)ptgKlxXSU&r?BxoJ_1j9d$62L}_B;JX zOpFQ)NJD5S;F}8S>OJXRwJ48KzpOb_)9bK3Zj>|NS=>uFn>f0)JC~N%T&jKZ%H_*5 z2t02;e2~Igf&10&ulXlu@K$N8b9wqWP@Ez6>DAJY%CLh(#k`t8S^w*oD1G=uX=DqR zmgcPnEPgGSwCu2G)gOjT?IU?xQql%)@6i({-h=Z`;!pFR?(X6G-J?JWEb{GV?%^?o z#Tf@EmG|aQz~$n*2O+b+zanbDHll&h)009Zc5m8A4Xrn?Fq4s`NVu`-vt0AQonJ^&( z*%KFc7xbjL{<~@N1~Hnb2MW%L(P%7E^^>1&sjgujyJbxwPvZw2P9+7ggs(VPl zclUB0LGJkl^10T<_wM~cZL&K!xDO3Tk3N0&0L3bBlHf(o0c>%Gyg1JG$&8gR1?YTn z>CK)qa@{jtz3NX?r>35L?E0mjN)@(5M& zqI2r`$`=$IZtdHvsC&KlV{wvImSMqsLEVwBFTiqrv8te9(;JAB5wuA@O+Yp#=f2m} z^sItsYiZv|#JX2UyosE*<<}2KwPeDS{qrULYhog%X=sGwa}um~%r|BVd-Sc|h~myWNzs&n<8m+%Qqjx7ddO5Ld@-^m66TN&OPe zWr4!Y;F{%LKR*-Oh#wH(E*SixaL)u^3n$q|#k<$8j)>WdV{nAnB`SyOV_uR^?US;n zzaS^>b008pU<44a((i3Ff7MbdB?1l@mrB0&>MbPWM@~)Oe(6v-kc1J;-(eD|82zk5 z|7h?Z1+j8dz04#J`Ph(}3F_)f-=_`nC`ekTaoYYwBvv7q+rfq-RPK_)rPvs@KI!?YAEowR*pp=sz?}yX@#;v<6C#a-KY&Rr8UOb$I3~^~}3>wIJoJ zij0Tgc!ZOq(?}qQTcdatYId)8!QaO*CqZi?e&T>+U2Sp#{ zonGy@Rb48m!bZH-e+;d%YNOW7nSZEw_s;WXki`IUymx1UZnaP zzOD8YzJg2c$sL&2lx)jaa!u8j?bdsu92 zlNB9KlR83UfTBl1@_gCH6be$RMYd9VTE3QOW#;BCZ0nd+#8~wY}zWm)@kcP zu}n_$`n7AO{WrxWDW4?fUOLo5+^fr)b7K9uhzMO`?7I~^&wV{ly6^Eu1udRV)`JI6 zqSvO!-|X#gb?{x;a5hmrT$G|<@n-yuoR^_44~gS+ArGl(Kn+I5aD+pU<_Qo61Z| zMs45jXm771L{m1~grQ!iv6nvlz196>?OxY!aXIoes)zgI4I@1YLzp!%haeul|c7(3KxL${^(P^5O#U5z#AV*jU+&v?ew!=#! 
zesXY-nb!vVc8<=;zxoioB9hQ#pHnfcI+es4IAN`l?sKU%BvIj)IV6X8eBq(cFhwL5 zE%&PZSpEHg(+usF&USWolRWFkfEPq{FLb9wu0&WdGf0i@UCk4eJ;H9BSHbA!i&#Rr zRVV+{Ufg}7oql=rLB`*EOIVBmMd7L83C&59`XW>5C?V<$#;(yOCB06nu%a)fRmIW4ei9l2Dp#96f zhhK7sW=)i7P1W05dY#_m?G4m__F{BsFa^L`bTRv5VeZ-VHPu>7&No`LsEtfW^i%{R z6Sl&t?=BzuKYTom=XdMexyRk~l1gb&5B6m}QC1aM);l&< zRWie05g72C@o`d8NnKUG#3!ws1zFlQH)eopj9Pl*iDcMc3XY%+BliOv7_87hM7;a; zC9~K;La~?7K%ia0P{6uL&`AQzafIn{7v8r?e{*-%6FTFged}Rx1Vt~^{obr9M(n*R zY$S_8B5wxGX&z(^LEDYVR`y2s9h*4TVuMrfyVPmgNPmpaegFPFof70;J@biiH+YM)Q=Xb;++2JR65cp^9=MUFPr z2=a>Eh~)-PpKfIkTpbK1SgTiiD7J4<_4b>KU0NOi!1vf{ECeog=33v2Po6$~kOHUO zh!4HAPi%9sQSS#hxgUYROn*hugTU}n34ENF*OQ`cXKLznI?ivPnRlK&Gp&=&VEg5>Uk0pP zyLLa}FO0|l+`Gu_ZZTC=Xu6@!q3PWM#>?%SPB`Ss2Q!;es)v>^gYd{X**UvPeQP9YR9)q3&EyR#drvF)vVWXqY<(r-0P^+${vC6ld`;yFpFp86Zj;La_iN^K^#rb`vzt0P98r!Bz7hg_kG?O;ZA1~KLC_Iil za`I$TmdT`cw?*Xr{eWy66`$6?3zfaglyWB+%FJxi&2DDDa^rGp6t{0(59K=M(gMle zimmCY#*rRXX-}WVzqY)felM9}!Vt^=24^JPywC2Y_8@v`8|9LVTF<2nOWVA98PT27 zLbw2CapBH@z%bxX2V@qp1F(_44O)*^Gyj%TKs-8T-~N}7FvdX4Wrod^q2f^??*E^&b7gk;m=%QudUG9fF z6{Smuh;Y~BKKq2!6aR4`DWK=sgaYSa~O1S5~cgmB(qe z3D&9Ezkff|qfw`nZKDQHyf^t^rO990PcG2c7fn#niiEO*-&AY2&3f+BWUX3zG{P5( zLKt6n@4q^mt!5LE7U;rqyXcn#zKRCnOYO(@A)c1vI>l z@)`2wb6#GajAnlMpls$Vn>KCo4Eot&g&1fWp#hX_v>W}SGIyxKY0ZHYL>(lf7GIWd z9!*|Z6#$T>w=+fX{zu1-?{UVSMaVa#Ofsf4UV1wtJ$>J{8$QMB%+8T}6vACstx{x~ z{m0uDGX)ECa~~=ExlY`r7@(Y#vMbZ-AN>8%dab17QS$Q_^*d%@sj&&(o*3>utSEKMwu?RV~`#*gSGixeO~^Rk7{ z3{7j8iMaqqNXEFO+Ov$tdXlDQ(efb{`|3(6D)zv5i=MZEsiRBnOLUihfr|^{4P&?R%CW4O_p+&%bI=miLV2TUP1Bs|1HVkoK@K?)muf~=%|15$pqih*XlAEURbbR8crQ>$2;&1q53qVXF z3$hm{-S@a^d*J&l?K+<^>gs7wixlRY;(d^gB~ZY*X~dn%mLa0SZ)YmGvAOz`sSP-> zjMKW|!-kb`#>7V8PJb_rOJDy@$K9on_dggYRbq&^7jgWe=&~pyYJ4@%ipK+YrS@}Y zx0ly8bBc^^XDa>j(FhM#QF(MGTe(X^%5-7-Z|50pnXegKd<~dD#iGPjG4&Jk$z8jX zk93u|9u`%0hN zhnT2|yO~>YYtiDx^+rEB=6<~Q=uuZpv#pmFZ)x=R8&>4;F-H8vwWCLj=-ySI(EOE- z+y3FAS1UK`ke7M`ceOuL+3yo0;5HJ59T_B}HZsX>XGb@dZr61$o|^iBPDN&$RpINE z5$DhM54V|=Jf$^-29h~e(mz`_L07v+?W`AZq9<-=8;om 
zRW)$uo)!JYD*ooFAg7y3#qZ_^LZ+}cSmZz9=XO4ZOm;B!BQ84blM8K;CRA=3Id<$- zgP%RUQ?9qmsyKcJe=OH|hbi<6H23@Y$;RB|pIrw{^RC=Do~gH*CrY%Cq4m6lm55)h zUl$k4Bgl-%>(}^#W_v$CkVMDNxYSB}VL69S?$^4B;c4^-m7O6XpnNmmW=f1}GlyvBcL>{-X2nu;-7NDcx&t&B%RL<#Fru zZqkxRD{|Gxj=h3wTnyC0g#-Ha8>w_SCMG5qqZ_Yy{5BO7SK|^U<%KWyERw#acwdw9qt$dr))hu#aIZVhq zv~T>Nhpns|xN!FK&x2crf8o_ym3JA_5G*g0Fi$3)QWxeiiRr-;R;l-6>NG54rL}ck zV@i2Ng>p>&F}0N)3=Dicq;wZ7xKBk{r*wXZ($9TGics---=)z>W6UA#Uk#3<2A@)m z?MO!0PjnLjSyq%`taLTq(xjZT!GIWe&!vo%U)KC&#x@J!ke(pKHd(P6v0g=g)I487 z%X^Qy@#(qg+nOff9-DRDP`ky2QEDqC%neXe)-q${@a;)We24E*7XsI;T)7K+^4@PW z1X1%S*kesTP6cfKcDWI@m)uu6|VjVHH+7l1HvJy zI!St~Q^$au@miT{BsH(h1Ch3>eX6MVnN`#;ci6R3!`RSjdZEF-6miMMX~mQdm+ ztXMC0M=x@i+i+wJ+}Tbc`KYg5RxEr7puYQZzyb^P({c4^r|*3w#KtFPs+{|f(L|9Z zs66TQOnFL5OaG7vd2U)fAVOr<6{d_0)s~(L9Z~kv-NkpZ4Ra!DZ+Y8?Ib4MyC+kJ0 zN9OyuOmo@wyW%vzs$n3JL2^y742t{BlhyKi<{@~$FVkp-`;T4o_^F3`q2SA>wM!GG zl_gzQ&0`X~hhf2p*RN$ZY~F0v>59OQS28&_-&3Q{!J?HKvm)(}m~5>&t1*btpF#zk zk=w4AOk?}|i~``R-F5>??!FV&9(CLC8sCp^x=#46cuL7v*$e*pNMTi;Kh!b5Vz zi8QbJR9YI|eb74WdduXd;X2#-^7ZQ-4P~}{!he9VZ z!*%_~aEJq6O`Sen9@nUn>`{Yqw?77x6Qekni*fPs#uq$;Dk>`e`Vyh2n->{>iWMI2X-Qno+0aM>&WW9q7hY@=gMi@2cuMyeH0aW%mvp3m zGX>re_Vraw&TlR}3c48vZH7~?c2X<$CUro>Y%bo6CGGK#8n*QRt)yqZu5JbH-deCf0H`a*4 zr1IOfZL59yl3)WkJC6YZmCXx}h**K`)nZWG`A)J^r%VZBS`3O=E%S3)WK`Rm^&1-E zEiEl=zNFq<40jEJm12h0tXW^FaigTW1{^uE3mRuHd_HDiYNJ07THk$&%JjRZc%;@q z#8P%5qCNXa@7-6;T064ioAlDVk>Je%7R3WyP&1Nu zDAOmsd;v~bPf}25OW&>Z-T4cxMrCp8bf#I|pf&iFJ@qmeeivTN+p4OOSsI*waHft> zww2`TtP)*2bE&k1%Z;~Z$m9*fi`**)s-+;^5OhgGp~5{AT`aT~$F1s&km)jX4DXsD zN_Sqo=!;KIch%dUic{sqN|pZR^?(z??!pjiY(%$GdgMhXr)}vyV_~1RVhd?n9Uno- zSRI=8eI<+7V=yBw4PsMJLD7>B?{S#w-AzthUjl0M+NzVskNW^`jMU!n#0xUitkOM1 zOFs4-8*M3Fe=&+ilhvNuDI#&(h#ZH|i40{uhj@$aJwR^PjJw<;dRW5_nzXIy#;obD zw2e14G&Jn>@kxUy|BX$9C9tt2K9~AIT{|=3sk5!^g&Xoo&LwRjMmjTC2&{5~_kxXw z6a-PdyQ&@GqSfj$@xQQ>V&`dd83>`hqw1l~=R=L$jr6Lqtpf47DEp`t8H5 zB_MImsDoRkHvGCmh>^S4t$;eux7vQ)y3xXBZk$4XYD=?d#eK3)k>*x11ne-fUG515 
zY$i%3C2XjeU*FgW<-({cxskE+81{!g#E0J#bG>Cf8kh8AAn*15DM?muV+8z=e<_*{ zNzG%`)n_eVex|Ry7+@QwG0+u^4f+Mk4;!x!`4Y_jM`fj$vRmujj#x}=*ElU_>E-2r zyA_={bVve3vlhL+&8~xtAx2qTlO|D{{c8L2V;K1iF!yp_`N``Q*)eRX^oBcgulmlD zoI#P+$86`*T6|!Ch7>fh{jeoDFE=g;q;Pk~4bINDA!qIm!8EuKp*1|JB>9xkDklvT9A12g_xcdJq%Fn%#a|BA1OX0{o9+d-iMwx$W3j38Rh7w6&O4stV&j=h_-*A4L}YNWRF#d!_uuQ)VLFhyU^xQWY<9fQDOlYA5H78+8cu zPp^%7ItLPx*s%*)$}p+n)|Q{^J{2YR7hE15vS9o4?bojr9VSpj_O7ehIy&=6<|=|) z7h?1Qa$TDl3vIrm<%L4s9y$(PU@vfSxVfFhL=Log#OoVTStlw}FLFh21#VBq+)aZ4 zVeW2DPB*R(U*>NL7pkDIz(psgZ?|-O@tL`SH5@95&$g@lIUQ;PZ6_auja`U8l@_SM0h4)$9EC@ngr{B^nAQo+97_6N*lqI_)<9#Nu5sYN0g+(S5~O;_E=l$nFBth7e3@E{mjGFpEfd4Uh2x%Jj5Lt!Y`vpUCb#-+ynGvim#@D-{K!R=xswIo8?~)!l3Rg2KyCdW2*GyUx zw8qh+{}hjzhrbX2gr$KLIKAHKyymo&@CxlMX7x8;QXPy;ZWlAb<$#j!NZ?~x@g3D9 z^;_ymg*#Qo_;1YM6vm9{4tLb8{_7BYuK!?0Cm3+x155|BL81h24-Qt6OILZgOD@2U zY%U$b9EMlC6NVRQZn?GfP7(e#)RR|vvVVSieemaEC&|9O7j_EE+_r<;91?`+@16)X z4l^b&etZH(6)?G;f!U8WPEM-AvMh1S&r9_WkrBO*yUrGsLkHV|la;vNb%+=%eebtrbFO+CQt8UEf8$;R-MoQN=)J z80}A>O=l@7qq*`_W#@!lS8`WKV59N4j9ymFXk4$`AdP=M>Q8R!e6gC`^c61u*Yw0C zJ;HSK;wE$z)k+}~$Bm1Bx_NHFX@)X_lCgdn#bpEosJVcRHiHN*c@R`L7S2x0dh{qz zPzNJ<;^r}ZB!cE7c6>$za5{f zyH;_Tk7~zI%|6GM(pK z^o50i$7FD9;fHNp{*x9{kISP}F%imHkgC zH2bV}5zBPBpQ#gmS^=Fa%**_B-JkFj!Ex6G@|F9!kCzz`6lAJ&u;#|~sR#&HIXFbC zr_xh?NPY~Z16z%@)4f$1DC7*WyZJ9&J5ATLy*@ z&b5tRR>Aer6Kuhs$%&R7^P=~t2pyTQU=yJ!$xBEU7;7x63$X^s7CD*9ba8^cEGB3B zNax_w%;$MFWeG)ApZ1b527lkNH5S_^`*3CF?DA!FN5A~P&Flt#jQX5dW z{Y;D$=tDVV_j%LWZnvxvIZxV9F1@Yk+nfQ@JY*-c(Q&~#3ld4Stf;!n=d^pO5uhiiq zyDs`u#>>$M4I6R#GGRz?MFE&i`h8eZMBWo4MEb+!P@CXi>KGaIjJtXj66r1wO@Uul zq5))jRPyQ5MRnC<2Ht4s#M2e&)fo&S(IKPcfw8Q%`eEO1QmHo@~Svc~?ZlFr>`~$ed?EAyESc+ZHTZOJf?= zEt-?$J7QQAYJugHYL$fv%EYD~a&kW4t)cwgQdOiX5y^$hjORKLGj<<=e*dUoHs#0P zxq^1>MS?>gYf9M| zLqeyYU1ro>pQl|&%_k;EwEuiIk?h1+iAT!YI+FEkDBda+~zdykG6v~ zkzNR-Apd9qC<-bS?#t$%zpt7xLn6K`zGW&KHEb6UO5k!RN}9~m_MhhV!4M?xtwN&CA&Qwmhl5|?fsNsmN_qM_X96x^5dE4>HuwMA{%!QUj9=?h5 zzP**Hb5r8{57zN7_Atc6JKYiX{vGA+nDwO(q(&E<_(%sBmOX>9Ho;P@NiFatI9Gah 
zKDjRJL=TUx-!9Df=jVsd_Vi4u>{h`01C^CsDwMckEtf=~pe2y_Sf)^K-l1GubSJ`N z!*4`G){_(!cLzRN)twHw*WKe`Yx6$-J*R{CbNrHpVu2gB_JaZ{^iX<9*`-SS=dhO1 zL?0?){Lp13Y@#@bka3^wxK)#=W?rOhow_4z^pl``cPR97 z+A$c#{AxZTLat{+S#a3)Z~l-SzLBJsBN&kQGhK4&}ckD|!0p7l!`(WlAS!ptntBZzB%hO4ZSSy4VhmIeYlGx^j zclJQn-BW*Vnb+1yy~I!Nk-I$)v*I!237WZ3Vyx|p%R{)wOiTBP=7+409dn;5f$)eA zE4$cQp)q!){9|ox<5NJ6J{XmRpFcko4O3mT1LupbZm;I%ea0>uH>Lv`B8J?>UA{*<-TEfOTEamE zx|3b*al!@k$tvtWD61bldZa_2TT)fE4+k#*rJ#~_SUc^loQ?x&2MVLxTsffH(>|mB zOdd`R^^K|8TUgaJf`4*v5BvMKrEYIkdz#13z?Vf04jIyAQ&*SF9eo)P{%^nPtc`Qq z4{rTP|NSHEc`vl+|J(1HtNP#P{_i&_B>&Ho^WX2ROJmmlzx~SV^8ei*G}j5ks6&~n z_&*?qn96ra&)5{Be9gw$*;%L0d4u>ReYS1>=c^S-50_mg5#KDVC6T&l2#2d`9LE68 zDa^*+RJVH$9_$7c@c^&v#*G_ErT#rzPc#1*Y2Bsusf3v=oZ6wR-25xvZs8d0;)z{AsHZ}@VpeR427#ZgKZ8lS{ zZJ&Rwd)7X6>OVi#se|{^*3<3bCkPh+E<;4>6tO6W1meqWHd&te-*Kr--7Cwjc5NN`Pc$GN=>PXQp9lPdBy)p@YW_lb{q*q$EO|WHdkx!gB1RoR zOkKs%7ae<-pr^ebsjgBGL>}iNG<1Hk-E%neksa+Be?dd)i37imnU*Z*3bU9Qnh| zWJfPqa-9B$8Ap54_%6D0rkQ^bp@G^*K0FZ^=*7R3h*cJt@=X;Vm!Bz0mgs2ZqHVt% z!R9+q(aU95qRQ8#c|4W3RSCUBxu36ZuhW>!+rJ(p={|6vAB#|1)t%1HJHYltQ9`17 zM?An}-#bHyR{sjJ7~ri*g-xM8hZ1dJw5bXzw6k=U0%sDYDJC!Dh4bE(rgwd&!JFMN zc|-Xbh=&)J-c+aY77(LB$P`)OOMdkHgyKLF$G&l7`<`O{+C_XnJ~cCRL5UK;k3hhX zz|8*`1QV7Zc$VXiN?d@CgmIWUS_Y>y`VqgLjht>0|NF~`UtgAib%;R#_`q^rp@oHo z+mz@aWk9XBAy0J@a%~={w+ps-nm^&!1-z)7w=APZVwcWJ9(3ADBKAF+SiqP5avk;k*kf_yH%_89D5vUJ=+&% z{jU&4Z4;GD%2BJ~M|APn(#r?I^>O{sj94i`KwM=t@tJ4qRdK^uQAA-QdQwPGJcT6H<9A0wvSPs^vG zafn2`?wC^O-|_}Dx>YQ+%E>8q{mb(5aw8@dg&GMBmpEoV^hwC5qL(=VbBo>W z(cc@n9K(>;Bs3t@`d3RnGC$Gw*PD?IZ!8dYtwOmJ=gFY%Fq^UB&5OZIbis0;)Q1qX zt_K`H9%XXsIGpmOA6Islj8s)!@KCuW`FWk$`V)Kv`{bp{%F1@cC-ayE==PqXU`}WN z@}4ii6B@vKYbZ-hPB`g&>ff4@@(yprmK7ssi3AG0QeX~uzKw*Vqz!N+IkUZ``2q-hWW10H z^<4mv%8*QQ8`lZ`!wj&}<%l|HmPB(2AEJXqRO&n#MbFTVQ3|zS2U>caO`9fA0!JSb z>t*StI!Y!nWfWfqF1#SwQap*QgRKL_)10MC2QU!bOU)97bWy4SWu=5Kn!#Vh0zePR z4oJ&0f3?qf_)wefhx0#7bHn=evk>445MTTt>ZB1PM%<=8&%N$swYDg!Uv*OLfs7^~U-k7+(VVczYa|i@;gEv!RcEd+ 
z^|A(jExVaHgPXqvnA$`C$LiB*yHvI?Ob;u#MMFu`znjKEbhkhPk{uL%K&XOmM%aqW zN)jqj!6~y&g6wQt5G4z6CD3IW)e@K{z`Pv(nY=wfz7?zLYIdF{boZu@X)EE=g>n%E zogZ%}Z+N~)8^XiG&wvX+J69rznUp6`?~1Y3fNuYHx@?bNAcS8$wT_laZfzt8p1qK^ zuv4-N!Z?gj27}Hr6`@U69-qqh6Z0sV{42#FCr^%9etUsURq3Avuf4c;RvV?qQcg;O9W*RE0By5=E^FAHnUjx z{y#`X4y2U-fwKL%J7XHfyng-r>ofNnSx{IHoM8mUqdkRG5?DO5A?Q?k%(7iOcj_g4 z&3W+P5RxvXJaY>ReU#tNv?7+_NHbN=(*kJGa^ zZrE@Hz-)<Gimg9S^=00&{zQR1PE!Pwn2{4Fn%N!aK~+IA>vew!PBvAfU8UN9 z7RM##=6WDtTO!ASnjGd1DDvb}t{Ud2-3|=jLUGkd3!9#}* zsRyoM#;hRrazq-OU*L818nva)hyscsr5NX2PTi@r2_J~mvyPfC3)qY-Fp>kJWF`Fs zyeL__841P0rmVOjA|5~POg>DW~+d|U_u#~BMN-xGf#m;T@t%jQ|7c-}NV&a|Y* zpRbvs;3{Wy>kEJJVd@dZGb21VQU+Om40H0@0A(X9p04%mTnh@C2s`G=iV!IduLB>j zfBushw?6}x3zB5Wl!bL49;6z&)RxanuFXB4wxo|BP&{a%oHROKPV|uawHic4sdno2 z=A<;=g;7c;mDd+{7fj0pt-Jp^b!4-Mx434!w?;UEdQb=Ft=#>b3%+wXV0Reomip_VVTKk|!iP zvD?JP9K2A7W~@CapD6!t7B)S4`0(Yc=BO*w=2hy0WGEV}eFm0R)vZ4$EAJ~`6{jQWlSpxIt_{k01_6pS8OfIOAdM5*Lpp6850(oWkA-h^g?BMb3T&NM3D?`{EpW^71{mw z#}Oo5@;%`P#Z2Twh?zSx{~5@cT~Dcw9V<422qO`}6(fJ;T1}AT zZdNvopJsIq4AV!9_{q?clfQYYLQ4+jMOUG!;n=6by16H)zk8-_wR>#`&wy^ z381eKGC0TM0Ec4OCps()9h~v?#Uv{AVfM3gr$1BD4Vbz~TGG8=za{pK;6g*aaUon5 z6GTb4L(owr;`b*6`gxPI2Pfn^u3d|_gs|n%@>c|I6T@FAySin?{|VhTj`-eU22Qsy zG?A(JL*i(xXI2!aND!(R_)=?0pZ`aw zaFR6Wt==eW#VqkIRBUG3tU@au&&v zQagPOkYe|{Vp>5`A6LF4X^=z=P*wDoZt{NdyyjznSYriU`qnfA*lgWHi7#dclCR+S z)^M&MeoZ3mcYt-7Tw7p@9|(B#6>gc=)(Z#X-qY^(&YEhF=mOF-lLhXi;`*;8JuKHM-lOD@Bz4^IG`(0cKS ztndC=aVbJ)0dHDR7bG)q2O@h(Bm%>woTEFU9+B+Y^BRu(OrFgZVBV1OMVpdFyi~wz zr%@$VBsl5 zVmqX$+mvZ0)%$0%g(<@UA02Wde_(QFvsU%Hbu*lLA>{I({O zaQ)&(1l2x)j~G21C9V;PouN_ig)z;Uv35a&VU~B{`9dmWv%c5k;_`TMuE&)gjQ6S> z?Iff$5&1SDSq?Xc1v~D2Rz!tk^H|G*hhwv>BJN@=WhnDrZzfhjAAG0qAxK2&_zeh4 zB4!BhK<#amqLa8&zFDiOgrxgVkS6&~6%Yt7i`_%Xo@2u$@JwM$3~8%*QE z!jl5c%D$iz(@9ODGoIvLqbN3zsQ=DhTQDlNRP!;O0a1CAhYr<}eg!umX{mPr`U;aN zTCqnovsG?>^8ckUemo7#z1XqzR4${Y5FVM4SbeBL-b=_AQ!u6oK&n!V(K6_noRLgO zbY#zCNvSM~b;QsCHxo91Z*eiw>$%Qqp7qXwRpSS$>TPpk?7FdeS#;L*{~1Wcj8epM 
z%MGm{Hw|V);{#H!=DMP!&JqAP;VJpFJecw9=oPxYzf{`qX!%{eUCu(T_}1o}<=H&a z#J(x;e~#7ocY7W^&tZzz&pt~Hwr%Oqv14@1+78mUrfph|&+<&Hgup^xtS@92cr_{qrji(F>6uwDs^Ze2K(jdl;3M`XmJ+`=<8IJ zbn{NO&0NCXlcTdy&}+CG?eA|L0ziylO2w}VoFJXR?I)`w)#i3i&H`zcJO|ud8asu6 zs@4=06MO#rU%f{!-OIdcspk*jL!)PQ>b}*C+HXRNg~dyc_cf_zCcVGb} zk>7ui-bh`4&P{lCg+|~PB;gM#TdaOK^xCZ1AlqvoSq&#!I1`6jMb8ZBMyvyHI_{cb`d7JiC-2uQl{@Mp#`?P z=t?o3s%?`3(8MlT$IEd6dpMu^E3+R#R};#g84Hyg{111k!^gOTpB;$MU*7caOYJ?t z$kV@~gg6`$;&@1K)t_$`^%N|S_Cd2JH>hjE<9u3IqtVR7z#7VdTng z{{C?yV_fXrf9jO5JRy1ESDl<5JY7n1-|fu}b37+-A0h%5L+A_}{~DhrahL)d=)Prv z_nt*b1B2jyAZz#GhM>WVd5}#aq$y#=U>-Qrf?99_poT=o@fL-LgY2Ic{TiN~;g6bH zxKAUKtDCHgKKuE{D0*^gvcptDY71xt7OH(O&Fr~gTI`T*z8Z=9-I`W9Mp|iD-F-bYlS8dkEY7fSc8Qb<2CqSuQ3`v#h4m!*g8^af$=4(DYIk^`~Bq=+_-n*H$Dxe;W{%prQ z3Z4{x9}Y2>+UoW}m2-5=ik*E=tJ_GyY7x$xUao>V4(maueH0UH&0n9!?g|LiAmOc` zZa6x5!|F(%-(3DM0Y^BO?40YcSJU0M7GIbdaD4INT4diyJ{<$uKuy#7UZ;qXfY1LJ zNT{)X!J9eWg{G}|brG_c!_R>YG4gxzmI9jRmE24$y*fz#V^t@1^<8$uxear!&cZzX z+_tx@N!vx9dC&AyN%;c}lW<5@8`QV&VkJLT)7Q@1tOAiTx;!3&N2j=0G0;YX+2v^m^u2SD5!3RWYLWzI}TJYK7ZaJvyF}z3KE%wZHwt zRHvb>ry%$P1@J=KHfFqTdcN>;=iYH?_2kd(*}Z!?Thr{7_rdkn7lc?kfJvLb44}y0 zNzP6KilfyR1PU;{P7-0;dfC1}NlZx`m%u!AWR&+8pb+SWuq}l8diKfvZ{qSIxo9h4 za4o7m;Rnap=nI!vxURwk#K=%s)4~@VC%Fx1$G+j)xcZ#`oss_*C_0U>sbU_+eGzlM z-OvHSt^NR=)u8TWJ4kognwmi8bR zZk}O<4O=L;a=49*eN+JrfopeyB!Q6#rhxd}AU9XaxAmj}lmaG#@Lk83^?&a}95Ht{h+JTFloa8bq zwCpMgCIxIF(4Czw0>!I;=c_7_@Cxc77d^S*vU1$I*MdGn<2D25L_z%#jFXF6I#uvf zOfN3}n*$Ay2IUSRbu}` zWCH(gh2XShqlivDB_Yj7@4~qnc2wKus%yBV{#5IQ8Wj(Z4HZ6K&<@|#k_w_8)^2b4 zb-Hjpqn453kZcl5zkIp0>D2#1QdHK2oqcIrf3tm038$Tu`Y7NDgP1;PCnjq?G{S<(1%G)%g3#rr!y}T5kF6-ya}YshN)-&o7%oWYfjI zBv}j;1_dlV$jy$#eA52<8W6lz3)lopD(zP4p=NLD8(m1@9x=YxiU*n$PczZ`(tjgE zpb=Pp*e9pZa{EJij#t{Fjys~cYBoyAM7%<>k~pCSUzBZLu^{E8eK6L(Lt++RndPte z`TDQ_sC19;!F0I5_#g7_|1^q%H+cxJwX=K7A5S<@NRT!Uh`gVi$1k zjad$GY6a6JX!J5+h^sh!co-ZFh6BCnn*_U#z!*}~#6f>@&Cx=P`Iv)c(CP^-xcXN3 zLA0@xF#sBXP^BY0(*s-=F-oQ0y2E643_btRqXES+SxBLtzzw9T97I~5N$LcsFf=rj 
zfCvc$8+ga9B4qQ^bqB>bkLQPoDxJ#xEA_yO!a{-j3r{1#Ny92Wx@&t2QyQQbSJ!mI z6m>WCR0(1(;Akz%of1sAMG8Y#fEr8eyESz$EG&FSP7oYV+?oP^9AAIY5}=6hsgIe?cxA^1{lD_ntvF zcNxtJ)Tf~s5dr?pGB=mQC2S#Ym(|C;?oEUx(WWKQQTEjg6+ubz5`ugD3wMyW)43m7@q* zEV@>_6l~iE{Zqn8|JM-kJHms9Xn8LOI(>kv^pt7f9Rel8A5J0&oZw~^`lE)l{#Ah! zkNAw1wEipqNIoT~s*DH4LjqXb^UuIpI4xo0HG~x);R$hZX;gnJR^9v2t@Wcj^)&sL zP}HNx!edEp-`Xx*xDXP6(G-@@4;11eCV7_Y?zDc73$L&I$Aa3sx2`8TD8X+ML_>~< z(8J(l^6ydtWTowe2@)yp4hAw|zWS|yUD*s|mBzVV|0j{A+tf$@z?c8L2>$v1ciK$r j1LGh6cZ{H#7RmYesncK2G`b>IO6ls%*S<7s`Og0V{53x@ literal 0 HcmV?d00001 diff --git a/tutorials/single_node_tutorial/single_gpu_tutorial.ipynb b/tutorials/single_node_tutorial/single_gpu_tutorial.ipynb new file mode 100755 index 000000000..0653279b8 --- /dev/null +++ b/tutorials/single_node_tutorial/single_gpu_tutorial.ipynb @@ -0,0 +1,3786 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "9bd01afc", + "metadata": {}, + "source": [ + "# Nemo Curator Pipeline Example\n", + "\n", + "## NeMo Curator Introduction\n", + "The NeMo Curator is a Python library that consists of a collection of scalable data-mining modules for curating natural language processing (NLP) data for training large language models (LLMs). The modules within the NeMo Data Curator enable NLP researchers to mine high-quality text at scale from massive uncurated web corpora. \n", + "\n", + "NeMo Curator includes the following modules to perform data curation:\n", + "- Data download and Extraction\n", + "- Language identification and separation\n", + "- Text reformatting and cleaning\n", + "- Quality filtering\n", + "- Document-level deduplication\n", + "- Multilingual downstream-task decontamination\n", + "- Distributed Data Classification\n", + "- Personal identifiable information (PII) redaction\n", + "\n", + "NeMo Curator team has perform ablation experiments using Common Crawl dataset to train a 357M GPT-style model to assess the effect of different curation stage on model performance. 
\n", + "\n", + "![alt text](./image/zeroshot_ablations.png)\n" + ] + }, + { + "cell_type": "markdown", + "id": "7b1808ea", + "metadata": {}, + "source": [ + "## About this notebook\n", + "\n", + "\n", + "This notebook uses the **Thai Wikipedia dataset** as an example to demonstrate a typical data curation pipeline using NeMo Curator. After working through this notebook, you will know how to use NeMo Curator to download Wikipedia data, perform language separation using fastText, perform GPU-based exact and fuzzy deduplication, and apply CPU-based heuristic filtering. \n", + "\n", + "Step description:\n", + "1. Download and extract data\n", + "2. Language detection and separation\n", + "3. GPU based deduplication\n", + " 1. Exact deduplication\n", + " 2. Fuzzy deduplication\n", + "4. Heuristic filtering\n", + "\n", + "What is not included:\n", + "1. Customized downloading\n", + "2. Classifier filtering\n", + "3. Downstream-task decontamination\n", + "4. Distributed data classification with PyTorch models\n", + "5. Personally identifiable information (PII) redaction \n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "78537bd7", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "### System Requirements\n", + "This notebook was run with the following hardware and software configuration:\n", + "\n", + "**GPU**: NVIDIA A10 24G. \n", + "\n", + "**CUDA & NVIDIA Drivers**: CUDA 12.2 with Driver 535.154.05\n", + "\n", + "**OS**: Ubuntu 22.04\n", + "\n", + "### Getting NeMo Framework Training Container\n", + "- Get access to the container via https://developer.nvidia.com/nemo-framework\n", + "- Set your Docker credentials \n", + " ```bash\n", + " docker login nvcr.io\n", + "\n", + " Username: $oauthtoken\n", + " Password: \n", + "- Get the NeMo Framework Training Container\n", + " ```bash\n", + " docker pull nvcr.io/nvidia/nemo:dev.framework\n" + ] + }, + { + "cell_type": "markdown", + "id": "062b5423", + "metadata": {}, + "source": [ + "## 0. 
Env Setup" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "8add9bbd", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com\n", + "Requirement already satisfied: jsonlines in /usr/local/lib/python3.10/dist-packages (4.0.0)\n", + "Requirement already satisfied: attrs>=19.2.0 in /usr/local/lib/python3.10/dist-packages (from jsonlines) (23.2.0)\n", + "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", + "\u001b[0m" + ] + } + ], + "source": [ + "!pip install jsonlines" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "9940c70d", + "metadata": {}, + "outputs": [], + "source": [ + "import argparse\n", + "\n", + "from nemo_curator.utils.distributed_utils import get_client,get_num_workers\n", + "from nemo_curator.utils.script_utils import add_distributed_args\n", + "from nemo_curator.utils.file_utils import get_all_files_paths_under, separate_by_metadata\n", + "from nemo_curator.utils.distributed_utils import read_data,write_to_disk\n", + "from nemo_curator.datasets import DocumentDataset\n", + "\n", + "import os\n", + "import sys\n", + "import pandas as pd\n", + "import time\n", + "import cudf\n", + "import dask_cudf\n", + "import dask\n", + "import numpy as np\n", + "from dask.distributed import Client, LocalCluster\n", + "import jsonlines\n", + "\n", + "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "fd8a381d", + "metadata": {}, + "outputs": [], + "source": [ + "def pre_imports():\n", + " import cudf \n", + "\n", + "def attach_args(parser=argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)):\n", + " return 
add_distributed_args(parser)\n", + "\n", + "def check_jsonl_file(file_dir):\n", + " for file in os.listdir(file_dir):\n", + " if 'jsonl' not in file:\n", + " continue\n", + " with open(os.path.join(file_dir,file), 'r', encoding='utf-8') as f:\n", + " first_line = f.readline()\n", + " print(first_line)\n", + " break\n", + "\n", + "def extract_lines_with_id(file_path,target_list):\n", + " with jsonlines.open(file_path) as reader:\n", + " for obj in reader:\n", + " if obj.get('id') in target_list:\n", + " yield obj" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "589ff257", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/work_dir/tutorials/single_node_tutorial\n" + ] + } + ], + "source": [ + "cur_dir = os.getcwd()\n", + "print(cur_dir)\n", + "data_dir = f\"{cur_dir}/workspace/\"" + ] + }, + { + "cell_type": "markdown", + "id": "662d505f", + "metadata": {}, + "source": [ + "## 1. Download\n", + "In this example, Thai Wikipedia data will be downloaded.\n", + "\n", + "Here is what happens when the function `download_wikipedia()` is called:\n", + "1. Run `get_wikipedia_urls()` to obtain a list of URLs for the .bz2 dump files of Thai Wikipedia. This function uses the base link and the language from user input to construct the links to the downloadable Wikipedia .bz2 dump files. The formulated link will be `https://dumps.wikimedia.org/wiki`. All the links will be stored in a .txt file. Arguments for this function include:\n", + " - `dump_date`: A date in the string format of 'YYYYMMDD'. It determines which Wikipedia snapshot will be downloaded. If not specified, the `latest` snapshot will be downloaded.\n", + " - `language`: Language code of the desired language in lower case. Default value is `en`.\n", + "\n", + "2. \n", + " Run `download_and_extract()` to download and extract contents based on the url list obtained from `get_wikipedia_urls`. 
You need to define a `downloader`, `extractor` and `iterator` for the dataset. \n", + " In this case, `WikipediaDownloader`, `WikipediaIterator` and `WikipediaExtractor` are used.\n", + " - `WikipediaDownloader`: Downloads the Wikipedia dump files to a local folder.\n", + " - `WikipediaIterator`: Extracts the .bz2 files and the useful content from the base HTML content.\n", + " - `WikipediaExtractor`: Performs further task-specific HTML content cleaning, such as removing media files, references and tables, and finally yields pure text data, which will be stored in .jsonl format. \n", + " Please refer to `./NeMo-Curator/nemo_curator/download/wikipedia.py` for the detailed implementation.\n", + " \n", + " Arguments for this function include:\n", + " - `output_path`: Output path for the downloaded and extracted dataset\n", + " - `output_type`: Type of output file. Default is .jsonl. Other types such as parquet can be chosen. In this example, .jsonl will be used\n", + " - `language`: See above\n", + " - `dump_date`: See above\n", + " - `raw_download_dir`: Output path for the intermediate downloaded .bz2 files. If not specified, they will be downloaded to `output_path`\n", + " - `keep_raw_download`: Whether to keep the downloaded .bz2 files after extraction. Default is not to keep them.\n", + " - `force_download`: Whether to restart the download if the target .bz2 files are detected under `raw_download_dir` \n", + " - `url_limit`: Number of .bz2 files to be downloaded.\n", + "\n", + "The resulting .jsonl for Thai Wikipedia will contain the following keys:\n", + "1. text\n", + "2. title\n", + "3. id\n", + "4. url\n", + "5. language\n", + "6. source_id\n", + "7. filename" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "adb59379", + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_curator.download import download_wikipedia" + ] + }, + { + "cell_type": "markdown", + "id": "9b56f12a", + "metadata": {}, + "source": [ + " Start a CPU based Dask cluster. 
Please modify `n_workers` and `memory_limit` according to your hardware specification. To process TH wikipedia data, it's advised to have `memory_limit` greater than 12GB" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "e822b5ac", + "metadata": {}, + "outputs": [], + "source": [ + "cluster = LocalCluster(n_workers=10, processes=True, memory_limit='16GB')\n", + "client = Client(cluster)" + ] + }, + { + "cell_type": "markdown", + "id": "e90cc8b1", + "metadata": {}, + "source": [ + "Define parameters" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "9a03b463", + "metadata": {}, + "outputs": [], + "source": [ + "#Output\n", + "download_base_directory= os.path.join(data_dir,\"wiki_downloads\")\n", + "download_output_directory = os.path.join(download_base_directory,\"data\")\n", + "\n", + "#Relevant parameters\n", + "dump_date = \"20240201\"\n", + "language = 'th'\n", + "url_limit = 1" + ] + }, + { + "cell_type": "markdown", + "id": "f41734a1", + "metadata": {}, + "source": [ + "Download TH wikipedia data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a45965a7", + "metadata": {}, + "outputs": [], + "source": [ + "res = download_wikipedia(download_output_directory,\n", + " language=language, \n", + " dump_date=dump_date,\n", + " url_limit=url_limit).df.compute()" + ] + }, + { + "cell_type": "markdown", + "id": "22b7d5b3", + "metadata": {}, + "source": [ + "Verify result" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "45a69041", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "downloads thwiki-20240201-pages-articles-multistream.xml.bz2.jsonl\n", + "162164 /nluo_data/NeMo-Curator/tutorials/single_node_tutorial/workspace/wiki_downloads/data/thwiki-20240201-pages-articles-multistream.xml.bz2.jsonl\n" + ] + } + ], + "source": [ + "! ls {download_output_directory}\n", + "! 
wc -l {download_output_directory}/thwiki-20240201-pages-articles-multistream.xml.bz2.jsonl" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "53bdccfd", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\"text\":\"–\\n\\nป้ายบอกทาง \\n ศาลาประชาคม – กระดานข่าว โครงการ ทรัพยากรและกิจกรรมซึ่งครอบคลุมวิกิพีเดียอย่างกว้างขวาง\\n แผนกช่วยเหลือ – ถามข้อสงสัยเกี่ยวกับการใช้งานวิกิพีเดีย\\n ปุจฉา-วิสัชนา – ถามข้อสงสัยทั่วไปที่คุณอยากรู้\\n ข่าวไซต์ – ประกาศ อัพเดต บทความและข้อมูลข่าวเกี่ยวกับวิกิพีเดียและมูลนิธิวิกิมีเดีย\\n สภากาแฟ – สำหรับอภิปรายเกี่ยวกับวิกิพีเดีย รวมถึงรายงานปัญหาเทคนิคและเสนอนโยบาย\\n Local Embassy – For Wikipedia-related discussion in languages other than Thai.\\n สร้างบทความใหม่ – บทช่วยสอนสำหรับเตรียมพร้อมสร้างบทความแรกของคุณ\\n\\nภาษาอื่น \\n\\n \",\"title\":\"หน้าหลัก\",\"id\":\"1\",\"url\":\"https:\\/\\/th.wikipedia.org\\/wiki\\/%E0%B8%AB%E0%B8%99%E0%B9%89%E0%B8%B2%E0%B8%AB%E0%B8%A5%E0%B8%B1%E0%B8%81\",\"language\":\"th\",\"source_id\":\"thwiki-20240201-thwiki-20240201-pages-articles-multistream.xml.bz2\",\"filename\":\"thwiki-20240201-pages-articles-multistream.xml.bz2.jsonl\"}\n", + "\n" + ] + } + ], + "source": [ + "check_jsonl_file(download_output_directory)" + ] + }, + { + "cell_type": "markdown", + "id": "c5f58643", + "metadata": {}, + "source": [ + "**[Optional]** Close the Dask cluster. You might encounter an error such as `Caught signal 11`. This is OK; just rerun the cell." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "0669a830", + "metadata": {}, + "outputs": [], + "source": [ + "# client.cluster.close()\n", + "# client.shutdown()" + ] + }, + { + "cell_type": "markdown", + "id": "43334988", + "metadata": {}, + "source": [ + "## 2. Language separation and Unicode fixing" + ] + }, + { + "cell_type": "markdown", + "id": "86ccdc1f", + "metadata": {}, + "source": [ + "In this section, we use a fastText language classification model to separate the TH Wikipedia dataset by each document's major language, and we also fix the Unicode in the documents. The detailed steps are:\n", + "\n", + "1. Download the fastText model for text language detection\n", + "2. Construct a filter which uses the downloaded fastText model to assign a language label to each document. \n", + "3. Separate the documents by language label. This creates a sub-folder for each language under the output path, and the documents of the same language are written to a .jsonl file in the corresponding sub-folder.\n", + "4. Load the .jsonl file in the folder of the desired language. In this example, the `TH` folder will be loaded.\n", + "5. Apply `UnicodeReformatter` to the data and output the result in .jsonl format. \n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "1e9198e8", + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_curator import ScoreFilter,Modify\n", + "from nemo_curator.filters import FastTextLangId\n", + "from nemo_curator.modifiers import UnicodeReformatter" + ] + }, + { + "cell_type": "markdown", + "id": "76e46d2a", + "metadata": {}, + "source": [ + "**[Optional]** Start a CPU based Dask cluster." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "da3aed8a", + "metadata": {}, + "outputs": [], + "source": [ + "# cluster = LocalCluster(n_workers=10, processes=True, memory_limit='16GB')\n", + "# client = Client(cluster)" + ] + }, + { + "cell_type": "markdown", + "id": "4a72479c", + "metadata": {}, + "source": [ + "Define parameters" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "13b9d2b1", + "metadata": {}, + "outputs": [], + "source": [ + "# Input path\n", + "multilingual_data_path = f\"{download_output_directory}/thwiki-20240201-pages-articles-multistream.xml.bz2.jsonl\"\n", + "\n", + "# Output path\n", + "language_base_output_path = os.path.join(data_dir,\"language_sep\")\n", + "language_data_output_path = os.path.join(language_base_output_path,\"data\")\n", + "language_separated_output_path = os.path.join(language_data_output_path,\"language\")\n", + "lang_sep_cleaned_data_output_path = os.path.join(language_data_output_path,\"cleaned\")\n", + "\n", + "# Fasttext model path\n", + "model_path = language_base_output_path\n", + "\n", + "# Define desired language\n", + "target_language = \"TH\"\n", + "\n", + "# Define key in output .jsonl files to store the language information\n", + "language_field = \"language\"" + ] + }, + { + "cell_type": "markdown", + "id": "8df0322a", + "metadata": {}, + "source": [ + "Download fasttext model" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "2666727d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--2024-05-17 03:17:09-- https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin\n", + "Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 99.84.238.181, 99.84.238.154, 99.84.238.162, ...\n", + "Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|99.84.238.181|:443... connected.\n", + "HTTP request sent, awaiting response... 
200 OK\n", + "Length: 131266198 (125M) [application/octet-stream]\n", + "Saving to: ‘/nluo_data/NeMo-Curator/tutorials/single_node_tutorial/workspace/language_sep/lid.176.bin.1’\n", + "\n", + "lid.176.bin.1 100%[===================>] 125.18M 184MB/s in 0.7s \n", + "\n", + "2024-05-17 03:17:10 (184 MB/s) - ‘/nluo_data/NeMo-Curator/tutorials/single_node_tutorial/workspace/language_sep/lid.176.bin.1’ saved [131266198/131266198]\n", + "\n" + ] + } + ], + "source": [ + "!wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -P {model_path}" + ] + }, + { + "cell_type": "markdown", + "id": "58452516", + "metadata": {}, + "source": [ + "Apply the fastText model to separate documents by language" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "d8b8c491", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Reading 1 files\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Time taken for splitting language:140.04064464569092\n" + ] + } + ], + "source": [ + "t0 = time.time()\n", + "\n", + "# Load dataset \n", + "multilingual_dataset = DocumentDataset.read_json(multilingual_data_path,add_filename=True)\n", + "\n", + "# Define the language separation pipeline\n", + "lang_filter = FastTextLangId(os.path.join(model_path,'lid.176.bin'))\n", + "language_id_pipeline = ScoreFilter(lang_filter, score_field=language_field, score_type='object')\n", + "filtered_dataset = language_id_pipeline(multilingual_dataset)\n", + "\n", + "# The language separation pipeline produces a result like [0.96873, 'EN']; we keep only the 'EN' label (index 1) and drop the classifier score\n", + "filtered_dataset.df[language_field] = 
filtered_dataset.df[language_field].apply(lambda score: score[1],meta = (language_field, 'object'))\n", + "\n", + "# Split the dataset into the corresponding language sub-folders\n", + "language_stats = separate_by_metadata(filtered_dataset.df, language_separated_output_path, metadata_field=language_field).compute()\n", + "\n", + "print(f\"Time taken for splitting language:{time.time()-t0}\")" + ] + }, + { + "cell_type": "markdown", + "id": "d443a5d1", + "metadata": {}, + "source": [ + "Apply `UnicodeReformatter` to fix any Unicode issues in the dataset of the desired language" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "272a5f67", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Reading 1 files\n", + "Writing to disk complete for 1 partitions\n", + "Time taken for fixing unicode:437.4811737537384\n" + ] + } + ], + "source": [ + "t0 = time.time()\n", + "\n", + "# Read the language specific data and fix the unicode in it\n", + "lang_data_path = os.path.join(language_separated_output_path, target_language)\n", + "lang_data = DocumentDataset.read_json(lang_data_path,add_filename=True)\n", + "\n", + "cleaner = Modify(UnicodeReformatter())\n", + "cleaned_data = cleaner(lang_data)\n", + "\n", + "# Write the cleaned_data\n", + "cleaned_data.to_json(lang_sep_cleaned_data_output_path, write_to_filename=True)\n", + "\n", + "print(f\"Time taken for fixing unicode:{time.time()-t0}\")" + ] + }, + { + "cell_type": "markdown", + "id": "9bd57a53", + "metadata": {}, + "source": [ + "Verify the result. We can see that some documents have been removed from the TH Wikipedia dataset, since the number of lines in this output file is smaller than in the original file (no. 
of lines = 162164)" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "e3329c83", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "thwiki-20240201-pages-articles-multistream.xml.bz2.jsonl\n", + "161748 /nluo_data/NeMo-Curator/tutorials/single_node_tutorial/workspace/language_sep/data/cleaned/thwiki-20240201-pages-articles-multistream.xml.bz2.jsonl\n" + ] + } + ], + "source": [ + "! ls {lang_sep_cleaned_data_output_path}\n", + "! wc -l {lang_sep_cleaned_data_output_path}/thwiki-20240201-pages-articles-multistream.xml.bz2.jsonl" + ] + }, + { + "cell_type": "markdown", + "id": "0b6cbc26", + "metadata": {}, + "source": [ + "Further verify by loading documents that have been identified as another language, such as 'EN'. We can see from the output that the removed document is indeed in English and contains very little or even no Thai." + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "id": "050d944c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\"filename\":\"thwiki-20240201-pages-articles-multistream.xml.bz2.jsonl\",\"id\":\"1\",\"language\":\"TH\",\"source_id\":\"thwiki-20240201-thwiki-20240201-pages-articles-multistream.xml.bz2\",\"text\":\"–\\n\\nป้ายบอกทาง \\n ศาลาประชาคม – กระดานข่าว โครงการ ทรัพยากรและกิจกรรมซึ่งครอบคลุมวิกิพีเดียอย่างกว้างขวาง\\n แผนกช่วยเหลือ – ถามข้อสงสัยเกี่ยวกับการใช้งานวิกิพีเดีย\\n ปุจฉา-วิสัชนา – ถามข้อสงสัยทั่วไปที่คุณอยากรู้\\n ข่าวไซต์ – ประกาศ อัพเดต บทความและข้อมูลข่าวเกี่ยวกับวิกิพีเดียและมูลนิธิวิกิมีเดีย\\n สภากาแฟ – สำหรับอภิปรายเกี่ยวกับวิกิพีเดีย รวมถึงรายงานปัญหาเทคนิคและเสนอนโยบาย\\n Local Embassy – For Wikipedia-related discussion in languages other than Thai.\\n สร้างบทความใหม่ – บทช่วยสอนสำหรับเตรียมพร้อมสร้างบทความแรกของคุณ\\n\\nภาษาอื่น \\n\\n 
\",\"title\":\"หน้าหลัก\",\"url\":\"https:\\/\\/th.wikipedia.org\\/wiki\\/%E0%B8%AB%E0%B8%99%E0%B9%89%E0%B8%B2%E0%B8%AB%E0%B8%A5%E0%B8%B1%E0%B8%81\"}\n", + "\n" + ] + } + ], + "source": [ + "check_jsonl_file(os.path.join(language_separated_output_path,'EN'))" + ] + }, + { + "cell_type": "markdown", + "id": "7d17f010", + "metadata": {}, + "source": [ + "**[Optional]** Close the Dask cluster." + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "id": "7e64cc35", + "metadata": {}, + "outputs": [], + "source": [ + "# client.cluster.close()\n", + "# client.shutdown()" + ] + }, + { + "cell_type": "markdown", + "id": "1d46cece", + "metadata": {}, + "source": [ + "## 3.Add ID\n", + "The TH Wikipedia data already has an `id` field, but it contains only a number. It is better to unify the `id` field and transform it into the format `<prefix>-<index>` (for example, `TH_wiki-0000000000`). That way, when handling multiple datasets, we can tell which document from which dataset has been removed. This `id` will be useful when running deduplication and heuristic filtering. The function we will be using is `AddId()`. Arguments for this function include:\n", + "- `id_field`: the field to add to the input .jsonl file. If the key already exists in the .jsonl, its value will be replaced.\n", + "- `id_prefix`: prefix used in the ID. Default is 'doc_id'\n", + "- `start_index`: starting index of the ID. Default is None. When set to None, an unordered ID scheme is used for faster calculation. In this notebook, it is set to 0 for easier reference." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "5f788b91", + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_curator import AddId" + ] + }, + { + "cell_type": "markdown", + "id": "cd17be33", + "metadata": {}, + "source": [ + "**[Optional]** If there is no running Dask cluster, start a CPU-based Dask cluster." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "5ba1d54a", + "metadata": {}, + "outputs": [], + "source": [ + "# cluster = LocalCluster(n_workers=10, processes=True, memory_limit='16GB')\n", + "# client = Client(cluster)" + ] + }, + { + "cell_type": "markdown", + "id": "12f59d5e", + "metadata": {}, + "source": [ + "Define relevant parameters" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "843eba7f", + "metadata": {}, + "outputs": [], + "source": [ + "#Input\n", + "add_id_input_data_dir = lang_sep_cleaned_data_output_path\n", + "\n", + "#Output\n", + "added_id_output_path = os.path.join(data_dir,\"add_id/cleaned\")\n", + "\n", + "#Format of output ID will be <prefix>-<index>; define the prefix here\n", + "add_ID_id_prefix=\"TH_wiki\"" + ] + }, + { + "cell_type": "markdown", + "id": "e7a8307c", + "metadata": {}, + "source": [ + "Add IDs to the dataset" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "b7a91bf1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Reading 1 files\n", + "Writing to disk complete for 1 partitions\n", + "Time taken for add ID:47.33783745765686\n" + ] + } + ], + "source": [ + "t0 = time.time()\n", + "# Read input files\n", + "dataset = DocumentDataset.read_json(add_id_input_data_dir,add_filename=True)\n", + "\n", + "# Run AddId() on the input dataset\n", + "add_id = AddId(id_field='id',id_prefix=add_ID_id_prefix,start_index=0)\n", + "id_dataset = add_id(dataset)\n", + "\n", + "#Output files\n", + "id_dataset.to_json(added_id_output_path, write_to_filename=True)\n", + "\n", + "print(f\"Time taken for add ID:{time.time()-t0}\")" + ] + }, + { + "cell_type": "markdown", + "id": "e92b5dab", + "metadata": {}, + "source": [ + "Verify the result. 
From the output, we can see that the `id` value has been changed to `TH_wiki-0000000000` " + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "e585cedd", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\"filename\":\"thwiki-20240201-pages-articles-multistream.xml.bz2.jsonl\",\"id\":\"TH_wiki-0000000000\",\"language\":\"TH\",\"source_id\":\"thwiki-20240201-thwiki-20240201-pages-articles-multistream.xml.bz2\",\"text\":\"–\\n\\nป้ายบอกทาง \\n ศาลาประชาคม – กระดานข่าว โครงการ ทรัพยากรและกิจกรรมซึ่งครอบคลุมวิกิพีเดียอย่างกว้างขวาง\\n แผนกช่วยเหลือ – ถามข้อสงสัยเกี่ยวกับการใช้งานวิกิพีเดีย\\n ปุจฉา-วิสัชนา – ถามข้อสงสัยทั่วไปที่คุณอยากรู้\\n ข่าวไซต์ – ประกาศ อัพเดต บทความและข้อมูลข่าวเกี่ยวกับวิกิพีเดียและมูลนิธิวิกิมีเดีย\\n สภากาแฟ – สำหรับอภิปรายเกี่ยวกับวิกิพีเดีย รวมถึงรายงานปัญหาเทคนิคและเสนอนโยบาย\\n Local Embassy – For Wikipedia-related discussion in languages other than Thai.\\n สร้างบทความใหม่ – บทช่วยสอนสำหรับเตรียมพร้อมสร้างบทความแรกของคุณ\\n\\nภาษาอื่น \\n\\n \",\"title\":\"หน้าหลัก\",\"url\":\"https:\\/\\/th.wikipedia.org\\/wiki\\/%E0%B8%AB%E0%B8%99%E0%B9%89%E0%B8%B2%E0%B8%AB%E0%B8%A5%E0%B8%B1%E0%B8%81\"}\n", + "\n" + ] + } + ], + "source": [ + "check_jsonl_file(added_id_output_path)" + ] + }, + { + "cell_type": "markdown", + "id": "0cbddf6e", + "metadata": {}, + "source": [ + "Close the Dask cluster. This cell must be run because the next task starts a new GPU Dask cluster" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "4daa1f2a", + "metadata": {}, + "outputs": [], + "source": [ + "client.cluster.close()\n", + "client.shutdown()" + ] + }, + { + "cell_type": "markdown", + "id": "1baf027e", + "metadata": {}, + "source": [ + "## 4.Exact Deduplication\n", + "\n", + "In exact deduplication, the document text is hashed into a string using a hashing algorithm such as 'md5'. Documents with identical hash values have identical text. 
We will output the `ID` of duplicated documents for removal later. The function used is `ExactDuplicates()`. Arguments for this function include:\n", + "- `id_field`: Key in the input file that identifies the document ID\n", + "- `text_field`: Key in the input file that contains the document text.\n", + "- `hash_method`: Hashing algorithm used. Default is `md5`\n", + "- `cache_dir`: If specified, the duplicated document IDs will be output to the `cache_dir`. Otherwise, the IDs will not be saved\n", + "\n", + "We will also use a GPU Dask cluster to accelerate computation for deduplication (both exact and fuzzy)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "3f7ba34c", + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_curator.modules import ExactDuplicates" + ] + }, + { + "cell_type": "markdown", + "id": "e268cfca", + "metadata": {}, + "source": [ + "Start a GPU-based Dask cluster. Since setting up a GPU-based Dask cluster involves several arguments, we will use the `get_client()` wrapper function for a quick setup. " + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "4b73e5f9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Number of dask worker:1\n" + ] + }, + { + "data": { + "text/plain": [ + "{'tcp://127.0.0.1:36179': None}" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "client = get_client(cluster_type = 'gpu', set_torch_to_use_rmm=False)\n", + "print(f\"Number of dask worker:{get_num_workers(client)}\")\n", + "client.run(pre_imports)" + ] + }, + { + "cell_type": "markdown", + "id": "0fc99440", + "metadata": {}, + "source": [ + "If you encounter the following error\n", + "`get_client() missing 1 required positional argument: 'args'`:\n", + "\n", + "This is probably because the installed `nemo_curator` library is out of date. 
Please run the following line in the terminal, following the instructions in our [GitHub](https://github.com/nicoleeeluo/NeMo-Curator/tree/main) repo, and restart the notebook. The intermediate results of the previous sections have been saved to disk, so you can resume from this section after updating." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "a590c78a", + "metadata": {}, + "outputs": [], + "source": [ + "#pip install --extra-index-url https://pypi.nvidia.com \".[cuda12x]\"" + ] + }, + { + "cell_type": "markdown", + "id": "0151abe0", + "metadata": {}, + "source": [ + "Define parameters" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "54b627a4", + "metadata": {}, + "outputs": [], + "source": [ + "#Input\n", + "exact_dedup_input_dataset_dir = added_id_output_path\n", + "\n", + "#Output\n", + "exact_dedup_base_output_path = os.path.join(data_dir,\"exact_dedup\")\n", + "exact_dedup_log_dir = os.path.join(exact_dedup_base_output_path,'log')\n", + "exact_dedup_output_dir = os.path.join(exact_dedup_base_output_path,'data')\n", + "\n", + "#Parameters for ExactDuplicates()\n", + "exact_dedup_dataset_id_field = \"id\"\n", + "exact_dedup_dataset_text_field = \"text\" \n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "6ede2e41", + "metadata": {}, + "outputs": [], + "source": [ + "!mkdir -p {exact_dedup_log_dir}\n", + "!mkdir -p {exact_dedup_output_dir}" + ] + }, + { + "cell_type": "markdown", + "id": "1882204a", + "metadata": {}, + "source": [ + "Apply exact deduplication" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "dfaaa765", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Reading 1 files\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/usr/local/lib/python3.10/dist-packages/nemo_curator/modules/exact_dedup.py:158: UserWarning: Output path 
f/work_dir/tutorials/single_node_tutorial/workspace/exact_dedup/data/_exact_duplicates.parquet already exists and will be overwritten\n", + " warnings.warn(\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Number of exact duplicated file:53\n", + "Time taken for exact duplicate:1.9788782596588135\n" + ] + } + ], + "source": [ + "t0 = time.time()\n", + "# Read input dataset\n", + "input_dataset = DocumentDataset.read_json(exact_dedup_input_dataset_dir, backend='cudf')\n", + "\n", + "#Run exact deduplication on the input\n", + "exact_dup = ExactDuplicates(\n", + " logger=exact_dedup_log_dir,\n", + " id_field=exact_dedup_dataset_id_field,\n", + " text_field=exact_dedup_dataset_text_field,\n", + " hash_method=\"md5\",\n", + " cache_dir=exact_dedup_output_dir #Duplicated document ID list is output to the cache_dir\n", + ")\n", + "duplicates = exact_dup(dataset=input_dataset)\n", + "\n", + "print(f\"Number of exact duplicated file:{len(duplicates)}\")\n", + "\n", + "print(f\"Time taken for exact duplicate:{time.time()-t0}\")" + ] + }, + { + "cell_type": "markdown", + "id": "e68f0399", + "metadata": {}, + "source": [ + "Verify the output duplicate IDs. We can group by `_hashes` to list the duplicated documents that share the same hash, and use `extract_lines_with_id()` to verify that those documents are indeed exact duplicates. Note that the `id` values might change between runs, so replace `target_list` when necessary" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "28d8bb0b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Number of exact duplicated document:53\n" + ] + }, + { + "data": { + "text/html": [ + "
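<div>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>id</th>\n", + " <th>_hashes</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n",
+ " <tr>\n", + " <th>0</th>\n", + " <td>TH_wiki-0000122055</td>\n", + " <td>3e6e96a80410d5a191d098f464e66f86</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>1</th>\n", + " <td>TH_wiki-0000105191</td>\n", + " <td>e77a248506ef16737288fae5759db33a</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>2</th>\n", + " <td>TH_wiki-0000105192</td>\n", + " <td>2e386f5c3af70f43874618988d4842b2</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>3</th>\n", + " <td>TH_wiki-0000105193</td>\n", + " <td>2e386f5c3af70f43874618988d4842b2</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>4</th>\n", + " <td>TH_wiki-0000105194</td>\n", + " <td>2e386f5c3af70f43874618988d4842b2</td>\n", + " </tr>\n",
+ " </tbody>\n", + "</table>\n", + "</div>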
" + ], + "text/plain": [ + " id _hashes\n", + "0 TH_wiki-0000122055 3e6e96a80410d5a191d098f464e66f86\n", + "1 TH_wiki-0000105191 e77a248506ef16737288fae5759db33a\n", + "2 TH_wiki-0000105192 2e386f5c3af70f43874618988d4842b2\n", + "3 TH_wiki-0000105193 2e386f5c3af70f43874618988d4842b2\n", + "4 TH_wiki-0000105194 2e386f5c3af70f43874618988d4842b2" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "exact_dedup_res = pd.read_parquet(os.path.join(exact_dedup_output_dir,\"_exact_duplicates.parquet\"))\n", + "print(f\"Number of exact duplicated document:{len(exact_dedup_res)}\")\n", + "exact_dedup_res.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "fca41870", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>_hashes</th>\n", + " <th>id</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n",
+ " <tr>\n", + " <th>0</th>\n", + " <td>0b908a91cdf0544c1ef3015cff4ee07e</td>\n", + " <td>TH_wiki-0000157216 TH_wiki-0000066307</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>1</th>\n", + " <td>15f35c239b6579b4642f7656e64576ac</td>\n", + " <td>TH_wiki-0000074714 TH_wiki-0000074715 TH_wiki-...</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>2</th>\n", + " <td>1708cb56ec582f78716f0864dca9382d</td>\n", + " <td>TH_wiki-0000021211 TH_wiki-0000021213 TH_wiki-...</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>3</th>\n", + " <td>2e386f5c3af70f43874618988d4842b2</td>\n", + " <td>TH_wiki-0000105192 TH_wiki-0000105193 TH_wiki-...</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>4</th>\n", + " <td>3e6e96a80410d5a191d098f464e66f86</td>\n", + " <td>TH_wiki-0000122055 TH_wiki-0000116550</td>\n", + " </tr>\n",
+ " </tbody>\n", + "</table>\n", + "</div>
" + ], + "text/plain": [ + " _hashes \\\n", + "0 0b908a91cdf0544c1ef3015cff4ee07e \n", + "1 15f35c239b6579b4642f7656e64576ac \n", + "2 1708cb56ec582f78716f0864dca9382d \n", + "3 2e386f5c3af70f43874618988d4842b2 \n", + "4 3e6e96a80410d5a191d098f464e66f86 \n", + "\n", + " id \n", + "0 TH_wiki-0000157216 TH_wiki-0000066307 \n", + "1 TH_wiki-0000074714 TH_wiki-0000074715 TH_wiki-... \n", + "2 TH_wiki-0000021211 TH_wiki-0000021213 TH_wiki-... \n", + "3 TH_wiki-0000105192 TH_wiki-0000105193 TH_wiki-... \n", + "4 TH_wiki-0000122055 TH_wiki-0000116550 " + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "exact_dedup_res.groupby('_hashes')['id'].agg(lambda x: ' '.join(x)).reset_index().head()" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "8c9624ac", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'filename': 'thwiki-20240201-pages-articles-multistream.xml.bz2.jsonl', 'id': 'TH_wiki-0000066307', 'language': 'TH', 'source_id': 'thwiki-20240201-thwiki-20240201-pages-articles-multistream.xml.bz2', 'text': '\\n\\nแหล่งข้อมูลอื่น \\n\\nสงขลา\\n \\nรายชื่อเกี่ยวกับจังหวัดสงขลา', 'title': 'รายชื่อโบราณสถานในจังหวัดสงขลา', 'url': 'https://th.wikipedia.org/wiki/%E0%B8%A3%E0%B8%B2%E0%B8%A2%E0%B8%8A%E0%B8%B7%E0%B9%88%E0%B8%AD%E0%B9%82%E0%B8%9A%E0%B8%A3%E0%B8%B2%E0%B8%93%E0%B8%AA%E0%B8%96%E0%B8%B2%E0%B8%99%E0%B9%83%E0%B8%99%E0%B8%88%E0%B8%B1%E0%B8%87%E0%B8%AB%E0%B8%A7%E0%B8%B1%E0%B8%94%E0%B8%AA%E0%B8%87%E0%B8%82%E0%B8%A5%E0%B8%B2'}\n", + "{'filename': 'thwiki-20240201-pages-articles-multistream.xml.bz2.jsonl', 'id': 'TH_wiki-0000157216', 'language': 'TH', 'source_id': 'thwiki-20240201-thwiki-20240201-pages-articles-multistream.xml.bz2', 'text': '\\n\\nแหล่งข้อมูลอื่น \\n\\nสงขลา\\n \\nรายชื่อเกี่ยวกับจังหวัดสงขลา', 'title': 'รายชื่อโบราณสถานในจังหวัดสงขลา (อำเภอเมืองสงขลาและสิงหนคร)', 'url': 
'https://th.wikipedia.org/wiki/%E0%B8%A3%E0%B8%B2%E0%B8%A2%E0%B8%8A%E0%B8%B7%E0%B9%88%E0%B8%AD%E0%B9%82%E0%B8%9A%E0%B8%A3%E0%B8%B2%E0%B8%93%E0%B8%AA%E0%B8%96%E0%B8%B2%E0%B8%99%E0%B9%83%E0%B8%99%E0%B8%88%E0%B8%B1%E0%B8%87%E0%B8%AB%E0%B8%A7%E0%B8%B1%E0%B8%94%E0%B8%AA%E0%B8%87%E0%B8%82%E0%B8%A5%E0%B8%B2%20%28%E0%B8%AD%E0%B8%B3%E0%B9%80%E0%B8%A0%E0%B8%AD%E0%B9%80%E0%B8%A1%E0%B8%B7%E0%B8%AD%E0%B8%87%E0%B8%AA%E0%B8%87%E0%B8%82%E0%B8%A5%E0%B8%B2%E0%B9%81%E0%B8%A5%E0%B8%B0%E0%B8%AA%E0%B8%B4%E0%B8%87%E0%B8%AB%E0%B8%99%E0%B8%84%E0%B8%A3%29'}\n" + ] + } + ], + "source": [ + "target_list = ['TH_wiki-0000157216', 'TH_wiki-0000066307']\n", + "for line in extract_lines_with_id(os.path.join(exact_dedup_input_dataset_dir,'thwiki-20240201-pages-articles-multistream.xml.bz2.jsonl'),target_list):\n", + " print(line)" + ] + }, + { + "cell_type": "markdown", + "id": "4013203c", + "metadata": {}, + "source": [ + "**[Optional]** You might choose to close Dask cluster here" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "5ef2f05e", + "metadata": {}, + "outputs": [], + "source": [ + "# client.cluster.close()\n", + "# client.shutdown()" + ] + }, + { + "cell_type": "markdown", + "id": "7a2feadc", + "metadata": {}, + "source": [ + "## 5. Fuzzy Deduplication\n", + "Fuzzy deduplication involves 5 intermediate steps to generate duplicates. Refer to https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/gpudeduplication.html for details\n", + "\n", + "Fuzzy deduplication in this example is a GPU implementation of MinhashLSH algorithm. This algorithm measures similarity based on statistics but not semantic meanings of text. There are a few concepts to be introduced before heading into fuzzy deduplication.\n", + "1. Jaccard similarity: Jaccard similarity is often used as a metric to calculate the similarity between two sets. 
It's calculated by dividing the number of common elements in the two sets (Intersection) by the total number of unique elements in the two sets (Union), i.e. $J(A, B) = |A \\cap B| / |A \\cup B|$. In the case of text documents, we transform a document into a set of n-grams. If two documents share a large number of n-grams, the documents are most likely similar. \n", + "\n", + " ![alt text](./image/jaccard.png )\n", + "\n", + "2. Complexity of the problem: To find all the similar document pairs in a dataset, we need to compute pair-wise Jaccard similarity across the dataset, which makes the complexity $O(N^2)$\n", + "\n", + "The MinhashLSH algorithm is a technique for quickly estimating the similarity between sets, such as the similarity between documents represented as sets of shingles (n-grams). It finds Jaccard-similar pairs in the corpus in a much more computationally efficient way. At a high level, the algorithm has the following steps:\n", + "1. Compute the minhash for each document\n", + "2. Run Locality Sensitive Hashing (LSH) on the minhashes, which assigns buckets to each document. Each document will be assigned to multiple buckets. Documents within the same bucket are deemed to be similar.\n", + "3. Compute pair-wise Jaccard similarity within each bucket to remove false positives within the buckets\n", + "4. Based on the Jaccard similarity, transform the similarity matrix into a graph and run the connected components algorithm. Each connected component in the graph is a final group of similar documents, and the IDs within each group will be output for duplicate removal.\n", + "For a more detailed explanation, please refer to https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/cpudeduplication.html.\n", + "\n", + "The GPU implementation of MinhashLSH has 5 steps:\n", + "1. Minhash computation\n", + "2. Bucket computation\n", + "3. Jaccard shuffle for load balancing in a distributed system\n", + "4. Jaccard similarity computation\n", + "5. 
Connected component \n", + "\n", + "In this section, we first provide an example for each sub-step so that users can better understand what is going on under the hood. In the last sub-section, we provide an example for the fuzzy deduplication wrapper." + ] + }, + { + "cell_type": "markdown", + "id": "ffca14ad", + "metadata": {}, + "source": [ + "**If there is no running Dask cluster, start a GPU Dask cluster here**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e00ba2fd", + "metadata": {}, + "outputs": [], + "source": [ + "# client = get_client(cluster_type = 'gpu', set_torch_to_use_rmm=False)\n", + "# print(f\"Number of dask worker:{get_num_workers(client)}\")\n", + "# client.run(pre_imports)" + ] + }, + { + "cell_type": "markdown", + "id": "5df73743", + "metadata": {}, + "source": [ + "### 5.1 Minhash\n", + "\n", + "Run `MinHash()` for this section. The output of minhash is a parquet file that contains the document ID and the hashed value, which is an array of 260 32-bit integers. To obtain such hashed values we need to go through the following steps:\n", + "1. Generate the set of n-grams of a document. For example, for doc = `Nemo Curator is a data curation tool`, the 3-gram set of this document will be `['Nemo Curator is','Curator is a','is a data','a data curation','data curation tool']` (this example uses word-level n-grams for readability; `MinHash()` itself works on character n-grams, see `char_ngrams` below)\n", + "2. Hash each n-gram into a numerical value\n", + "3. Generate a random hash function $H_1()$ which hashes each numeric n-gram into a 32-bit integer, and take the minimum integer as the minhash value for $H_1()$\n", + "4. Repeat steps 2 and 3 with hash functions $H_x()$ until the desired minhash length is reached. The minhash values of all iterations are appended together to form the final minhash array. \n", + "\n", + "Arguments include:\n", + "- `seed`: Random seed used for initializing the hash functions used to compute the MinHashes. 
It's advised to keep this value the same across experiments for reproducibility\n", + "- `num_hashes`: Length of each minhash array. Default is 260. A longer minhash gives a better estimate of the actual Jaccard similarity, but requires more computational power\n", + "- `char_ngrams`: n-gram length\n", + "- `use_64bit_hash`: Whether to use a 64-bit or a 32-bit hash function\n", + "- `id_field`: Key in the input file that identifies the document ID\n", + "- `text_field`: Key in the input file that contains the document text.\n", + "- `cache_dir`: If specified, the intermediate result will be output to the `cache_dir`. \n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "1fc5bff3", + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_curator import MinHash" + ] + }, + { + "cell_type": "markdown", + "id": "7bf9cc8d", + "metadata": {}, + "source": [ + "Define parameters" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "d600d1b8", + "metadata": {}, + "outputs": [], + "source": [ + "#Input\n", + "minhash_data_path = added_id_output_path\n", + "#Output\n", + "minshah_base_output_path = os.path.join(data_dir,\"fuzzy/minhash\")\n", + "minshah_log_dir = os.path.join(minshah_base_output_path,'log')\n", + "minshah_output_dir = os.path.join(minshah_base_output_path,'data')\n", + "#Specify dataset name\n", + "dataset_name = 'TH_wikipedia'\n", + "\n", + "#Relevant parameters\n", + "minhash_id_field = 'id'\n", + "minhash_text_field = 'text'\n", + "seed = 10\n", + "minhash_length = 260\n", + "char_ngram = 5\n", + "use_64bit_hash = False\n", + "files_per_partition = 2\n", + "\n", + "!mkdir -p {minshah_log_dir}\n", + "!mkdir -p {minshah_output_dir}" + ] + }, + { + "cell_type": "markdown", + "id": "1c31ddf4", + "metadata": {}, + "source": [ + "Run MinHash" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "88540950", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Computing 
minhashes for /work_dir/tutorials/single_node_tutorial/workspace/add_id/cleaned\n", + "Reading 1 files\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/usr/local/lib/python3.10/dist-packages/nemo_curator/modules/fuzzy_dedup.py:175: UserWarning: Output path /work_dir/tutorials/single_node_tutorial/workspace/fuzzy/minhash/data/_minhashes.parquet already exists and will be overwritten\n", + " warnings.warn(\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Time taken for MinHash:6.340771198272705\n" + ] + } + ], + "source": [ + "t0 = time.time()\n", + "print(f\"Computing minhashes for {minhash_data_path}\")\n", + "\n", + "# Load data. Only the [minhash_id_field, text_field] columns are needed\n", + "files = get_all_files_paths_under(root=minhash_data_path, recurse_subdirectories=False)\n", + "files = [f for f in files if f.endswith(\".jsonl\")]\n", + "df = read_data(\n", + " files,\n", + " file_type=\"jsonl\",\n", + " backend=\"cudf\",\n", + " files_per_partition=files_per_partition,\n", + " add_filename=False,\n", + ")[[minhash_id_field, minhash_text_field]]\n", + "\n", + "# Run MinHash() on input data\n", + "minhasher = MinHash(\n", + " seed=seed,\n", + " num_hashes=minhash_length,\n", + " char_ngrams=char_ngram,\n", + " use_64bit_hash=use_64bit_hash,\n", + " logger=minshah_log_dir,\n", + " id_field=minhash_id_field,\n", + " text_field=minhash_text_field,\n", + " cache_dir=minshah_output_dir\n", + ")\n", + "res = minhasher(DocumentDataset(df)).df\n", + "\n", + "print(f\"Time taken for MinHash:{time.time()-t0}\")" + ] + }, + { + "cell_type": "markdown", + "id": "158bf3ab", + "metadata": {}, + "source": [ + "Verify result" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "10b5eb55", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>id</th>\n", + " <th>_minhash_signature</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n",
+ " <tr>\n", + " <th>0</th>\n", + " <td>TH_wiki-0000000000</td>\n", + " <td>[11565725, 19782487, 9831980, 5480992, 2306475...</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>1</th>\n", + " <td>TH_wiki-0000000001</td>\n", + " <td>[407876, 107572, 824528, 346831, 216554, 10963...</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>2</th>\n", + " <td>TH_wiki-0000000002</td>\n", + " <td>[727721, 694551, 233868, 346831, 216554, 77001...</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>3</th>\n", + " <td>TH_wiki-0000000003</td>\n", + " <td>[1149282, 931656, 2515604, 1428622, 4964646, 4...</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>4</th>\n", + " <td>TH_wiki-0000000004</td>\n", + " <td>[1559901, 11771639, 487706, 826569, 1203860, 5...</td>\n", + " </tr>\n",
+ " </tbody>\n", + "</table>\n", + "</div>
" + ], + "text/plain": [ + " id _minhash_signature\n", + "0 TH_wiki-0000000000 [11565725, 19782487, 9831980, 5480992, 2306475...\n", + "1 TH_wiki-0000000001 [407876, 107572, 824528, 346831, 216554, 10963...\n", + "2 TH_wiki-0000000002 [727721, 694551, 233868, 346831, 216554, 77001...\n", + "3 TH_wiki-0000000003 [1149282, 931656, 2515604, 1428622, 4964646, 4...\n", + "4 TH_wiki-0000000004 [1559901, 11771639, 487706, 826569, 1203860, 5..." + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "minhash_res = pd.read_parquet(os.path.join(minshah_output_dir, \"_minhashes.parquet\"))\n", + "minhash_res.head()" + ] + }, + { + "cell_type": "markdown", + "id": "0bce0f80", + "metadata": {}, + "source": [ + "### 5.2 LSH\n", + "`LSH()` implements the LSH algorithm, which includes the following steps:\n", + "1. Divide the minhash array into `X` different portions. \n", + "2. For each portion, hash the minhash values into buckets. Each document will be assigned to `X` buckets.\n", + "3. Documents within the same bucket are deemed similar. Since every document is assigned to `X` buckets, and two documents are deemed similar as long as they share one or more buckets, the result of LSH will contain more false positives than false negatives. The false positives are filtered out in the following modules, namely the Jaccard computation.\n", + "\n", + "Arguments include:\n", + "- `num_hashes`: Length of the minhash signature. 
Must be consistent with `MinHash()`\n", + "- `num_buckets`: Number of portions (bands) the minhash signature is divided into, i.e. `X` above\n", + "- `buckets_per_shuffle`: Number of buckets to shuffle concurrently\n", + "- `id_fields`: Column(s) in the input that identify the document ID\n", + "- `minhash_field`: Key in the input file that contains the document MinHash signature \n", + "- `cache_dir`: If specified, the intermediate result will be output to the `cache_dir`.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "645b8a53", + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_curator import LSH\n", + "from nemo_curator.utils.fuzzy_dedup_utils.id_mapping import \\\n", + " convert_str_id_to_int" + ] + }, + { + "cell_type": "markdown", + "id": "110db216", + "metadata": {}, + "source": [ + "Define parameters" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "738ab265", + "metadata": {}, + "outputs": [], + "source": [ + "#Input\n", + "lsh_input_data_path = minshah_output_dir\n", + "\n", + "#Output\n", + "lsh_base_output_path = os.path.join(data_dir,\"fuzzy/lsh\")\n", + "lsh_log_dir = os.path.join(lsh_base_output_path,'log')\n", + "lsh_output_dir = os.path.join(lsh_base_output_path,'data')\n", + "\n", + "#Relevant parameters\n", + "lsh_id_field = 'id'\n", + "minhash_field = '_minhash_signature'\n", + "minhash_length=260\n", + "num_bands=20\n", + "buckets_per_shuffle=1\n", + "\n", + "!mkdir -p {lsh_log_dir}\n", + "!mkdir -p {lsh_output_dir}" + ] + }, + { + "cell_type": "markdown", + "id": "a5250a2a", + "metadata": {}, + "source": [ + "Run LSH" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "1ef61e2b", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/usr/local/lib/python3.10/dist-packages/nemo_curator/modules/fuzzy_dedup.py:361: UserWarning: Output path /work_dir/tutorials/single_node_tutorial/workspace/fuzzy/lsh/data/_buckets.parquet already exists and will be overwritten\n", + " warnings.warn(\n" + ] + }, + 
{ + "name": "stdout", + "output_type": "stream", + "text": [ + "Time taken for LSH:19.37230634689331\n" + ] + } + ], + "source": [ + "t0 = time.time()\n", + "\n", + "#Load MinHash output\n", + "df = dask_cudf.read_parquet(lsh_input_data_path, blocksize=\"2GB\", aggregate_files=True, backend = \"cudf\")\n", + "df = df.map_partitions(\n", + " convert_str_id_to_int,\n", + " id_column=lsh_id_field,\n", + " meta=cudf.DataFrame(\n", + " {minhash_field: [[1, 2, 3]], \"doc_id\": [1], \"dataset_id\": np.uint32(1)}\n", + " ),\n", + ")\n", + "\n", + "#Run LSH()\n", + "lsh = LSH(\n", + " cache_dir=lsh_output_dir,\n", + " num_hashes=minhash_length,\n", + " num_buckets=num_bands,\n", + " buckets_per_shuffle=buckets_per_shuffle,\n", + " id_fields=[\"dataset_id\", \"doc_id\"],\n", + " minhash_field=minhash_field,\n", + " logger=lsh_log_dir,\n", + ")\n", + "res = lsh(DocumentDataset(df))\n", + "\n", + "t1 = time.time()\n", + "print(f\"Time taken for LSH:{time.time()-t0}\")" + ] + }, + { + "cell_type": "markdown", + "id": "ad2e3b60", + "metadata": {}, + "source": [ + "Verify result" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "9d0449c6", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " dataset_id doc_id _bucket_id\n", + "0 1692361878 123547 210\n", + "1 1692361878 93844 120\n", + "2 1692361878 66564 86\n", + "3 1692361878 93845 120\n", + "4 1692361878 66565 86" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "lsh_res = pd.read_parquet(os.path.join(lsh_output_dir, \"_buckets.parquet\"))\n", + "lsh_res.head()" + ] + }, + { + "cell_type": "markdown", + "id": "f952f074", + "metadata": {}, + "source": [ + "### 5.3 Jaccard Shuffle\n", + "In this section, we will be using `_MapBucket()` and `_Shuffle()`.\n", + "\n", + "For `_MapBucket()`, it is designed to take input text data in jsonl format and bucket information which is output of LSH, map the documents to their respective buckets, and write the resulting DataFrame containing the anchor documents and their associated bucket information to a parquet file. Arguments include:\n", + "- `id_field`: Key in input .jsonl file for identifying document ID\n", + "- `text_field`: Key in input .jsonl file which contains document text.\n", + "- `bucket_field`: Key in input _buckets.parquet which contains `bucket_id`.\n", + "- `num_anchors`: Number of anchors (document in the same buckets) to be output\n", + "\n", + "\n", + "For `_Shuffle()`, it perform a shuffling operation on the documents based on their bucket assignments, output in .parquet format. This shuffling operation is a crucial step in the deduplication process, as it helps distribute similar documents across different partitions or workers, enabling efficient parallel processing and deduplication in subsequent steps. Arguments include:\n", + "- `id_fields`: Columns in `_buckets.parquet` that maps to original `id` in .jsonl data file. 
In this example, it is `[\"dataset_id\", \"doc_id\"]`\n", + "- `text_field`: Key in input .jsonl file which contains document text.\n", + "- `int_to_str_id`: Key in input .jsonl file for identifying document ID\n" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "707ea54d", + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_curator.utils.fuzzy_dedup_utils.io_utils import (\n", + " get_bucket_ddf_from_parquet_path,\n", + " get_text_ddf_from_json_path_with_blocksize,\n", + ")\n", + "from nemo_curator.modules.fuzzy_dedup import _MapBuckets,_Shuffle" + ] + }, + { + "cell_type": "markdown", + "id": "8f2e321d", + "metadata": {}, + "source": [ + "Define parameters" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "70e2dff9", + "metadata": {}, + "outputs": [], + "source": [ + "#Input\n", + "input_data_paths = [minhash_data_path]\n", + "input_bucket_path = lsh_output_dir\n", + "\n", + "#Output\n", + "jaccard_shuffle_base_output_path = os.path.join(data_dir,\"fuzzy/jaccard_shuffle\")\n", + "output_anchor_docs_with_bk_path = os.path.join(jaccard_shuffle_base_output_path, \"anchor_docs_with_bk.parquet\")\n", + "input_anchor_docs_with_bk_dir = output_anchor_docs_with_bk_path\n", + "jaccard_shuffle_log_path = os.path.join(jaccard_shuffle_base_output_path,\"log\")\n", + "output_shuffled_docs_path = os.path.join(jaccard_shuffle_base_output_path, \"shuffled_docs.parquet\")\n", + "\n", + "#Relevant parameters for _MapBucket()\n", + "text_ddf_blocksize = 256\n", + "bucket_mapping_ddf_blocksize = 256\n", + "num_files = None\n", + "shuffle_type ='tasks'\n", + "input_bucket_field = '_bucket_id'\n", + "input_id_field = 'id'\n", + "input_text_field = 'text'\n", + "\n", + "#Relevant parameters for _Shuffle()\n", + "shuffle_id_fields=[\"dataset_id\", \"doc_id\"]\n", + "int_to_str_id='id'\n", + "\n", + "!mkdir -p {jaccard_shuffle_base_output_path}\n", + "!mkdir -p {jaccard_shuffle_log_path}" + ] + }, + { + "cell_type": "markdown", + "id": 
"d0f19efa", + "metadata": {}, + "source": [ + "Run Jaccard map bucket" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "b2850b0a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Number of files being read for jaccard calculation = 1\n", + "Number of ddf_bk partitions = 1\n", + "Time taken for Bucket Mapping:1.239295244216919 s\n" + ] + } + ], + "source": [ + "t0 = time.time()\n", + "num_workers = get_num_workers(client)\n", + "\n", + "# Read .jsonl input data\n", + "ddf_text = get_text_ddf_from_json_path_with_blocksize(\n", + " input_data_paths=input_data_paths,\n", + " num_files=num_files,\n", + " blocksize=text_ddf_blocksize,\n", + " id_column=input_id_field,\n", + " text_column=input_text_field,\n", + ")\n", + "# Read \"_buckets.parquet\"\n", + "ddf_bk = get_bucket_ddf_from_parquet_path(input_bucket_path=input_bucket_path, num_workers=num_workers)\n", + "\n", + "#Run _MapBuckets()\n", + "map_buckets = _MapBuckets(id_fields=shuffle_id_fields, bucket_field=input_bucket_field, logger=jaccard_shuffle_log_path)\n", + "ddf_anchor_docs_with_bk = map_buckets.map_buckets_with_anchors(documents_df=ddf_text, buckets_df=ddf_bk, shuffle_type=shuffle_type)\n", + "\n", + "#Write to disk\n", + "ddf_anchor_docs_with_bk.to_parquet(output_anchor_docs_with_bk_path, write_index=False)\n", + "\n", + "print(f\"Time taken for Bucket Mapping:{time.time()-t0} s\")" + ] + }, + { + "cell_type": "markdown", + "id": "a1533a15", + "metadata": {}, + "source": [ + "Verify result" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "d74012c3", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " dataset_id doc_id anchor_1_dataset_id anchor_1_doc_id \\\n", + "0 1692361878 127258 1692361878 127781 \n", + "1 1692361878 85383 1692361878 85364 \n", + "2 1692361878 45030 1692361878 85200 \n", + "3 1692361878 127259 1692361878 127781 \n", + "4 1692361878 127968 1692361878 127961 \n", + "\n", + " anchor_0_dataset_id anchor_0_doc_id _output_partition_id \n", + "0 1692361878 126955 0 \n", + "1 1692361878 85374 0 \n", + "2 1692361878 45030 0 \n", + "3 1692361878 126955 0 \n", + "4 1692361878 127996 0 " + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "map_bucket_res = pd.read_parquet(output_anchor_docs_with_bk_path)\n", + "map_bucket_res.head()" + ] + }, + { + "cell_type": "markdown", + "id": "1487b1ad", + "metadata": {}, + "source": [ + "**[Optional]** Remove previous Jaccard Shuffle results. Run only when there are files under the Jaccard Shuffle output path" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "b414f703", + "metadata": {}, + "outputs": [], + "source": [ + "#!rm -r {output_shuffled_docs_path}" + ] + }, + { + "cell_type": "markdown", + "id": "f33a6782", + "metadata": {}, + "source": [ + "Run Jaccard Shuffle" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "86d1b3e5", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 0%| | 0/1 [00:00\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "" + ], + "text/plain": [ + " text _text_bytes \\\n", + "0 การแข่งขันกีฬากรีฑาในโอลิมปิกฤดูร้อน 2020 – เด... 1457 \n", + "1 การแข่งขันกีฬากรีฑาในโอลิมปิกฤดูร้อน 2020 – เด... 1457 \n", + "2 สุริยุปราคาบางส่วนจะเกิดขึ้นในวันที่ 13 กรกฎาค... 1262 \n", + "3 สุริยุปราคาบางส่วนจะเกิดขึ้นในวันที่ 13 กรกฎาค... 1262 \n", + "4 สุริยุปราคาบางส่วนจะเกิดขึ้นในวันที่ 13 กรกฎาค... 1262 \n", + "\n", + " id anchor_0_id anchor_1_id \n", + "0 1692361878-135417 1692361878-135463 1692361878-135417 \n", + "1 1692361878-135417 1692361878-135392 1692361878-135447 \n", + "2 1692361878-83363 1692361878-94231 1692361878-83363 \n", + "3 1692361878-83363 1692361878-94905 1692361878-83363 \n", + "4 1692361878-83363 1692361878-94906 1692361878-94905 " + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "jaccard_shuffle_res = pd.read_parquet(os.path.join(output_shuffled_docs_path,\"_output_partition_id=0/batch_1_1.parquet\"))\n", + "jaccard_shuffle_res.head()" + ] + }, + { + "cell_type": "markdown", + "id": "b8644e51", + "metadata": {}, + "source": [ + "### 5.4 Jaccard Compute\n", + "We will use `JaccardSimilarity()` to compute the Jaccard similarity between document pairs. The result is a parquet dataset consisting of document ID pairs along with their Jaccard similarity scores. To compute the Jaccard similarity between two documents, we first convert each document into a set of n-grams and then compute the Jaccard similarity of the two sets.\n", + "\n", + "Arguments include:\n", + "- `id_field`: Column in input .parquet file identifying document ID\n", + "- `text_field`: Column in input .parquet file identifying document text\n", + "- `anchor_id_fields`: Column in input .parquet file identifying anchors. 
These columns are generated according to the number of anchors used in `_MapBuckets`, whose default value is 2\n", + "- `ngram_width`: Width of the n-grams used" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "b1a532a2", + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_curator.modules.fuzzy_dedup import JaccardSimilarity" + ] + }, + { + "cell_type": "markdown", + "id": "c9e65975", + "metadata": {}, + "source": [ + "Define parameters" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "291d3aaa", + "metadata": {}, + "outputs": [], + "source": [ + "#Input\n", + "shuffled_docs_path = output_shuffled_docs_path\n", + "\n", + "#Output\n", + "jaccard_compute_base_output_path = os.path.join(data_dir,\"fuzzy/jaccard_compute\")\n", + "jaccard_compute_output_results_path = os.path.join(jaccard_compute_base_output_path, \"jaccard_similarity_results.parquet\")\n", + "\n", + "#Relevant parameters\n", + "input_id_field = 'id'\n", + "input_text_field = 'text'\n", + "ngram_size = 5\n", + "num_anchors = 2\n", + "\n", + "!mkdir -p {jaccard_compute_base_output_path}" + ] + }, + { + "cell_type": "markdown", + "id": "9341b58c", + "metadata": {}, + "source": [ + "Run Jaccard Compute" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "id": "9b1b9bdd", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Running jaccard compute script\n", + "Time taken for Jaccard Computing: 0.735356330871582\n" + ] + } + ], + "source": [ + "# enable_spilling()\n", + "# client.run(enable_spilling)\n", + "\n", + "print(\"Running jaccard compute script\", flush=True)\n", + "t0 = time.time()\n", + "\n", + "jaccard = JaccardSimilarity(\n", + " id_field=input_id_field,\n", + " text_field=input_text_field,\n", + " anchor_id_fields=[f\"anchor_{i}_{input_id_field}\" for i in range(num_anchors)],\n", + " ngram_width=ngram_size,\n", + ")\n", + "\n", + "#Load and run Jaccard compute\n", + "result_df = 
jaccard.jaccard_compute(shuffled_docs_path)\n", + "\n", + "result_df.to_parquet(jaccard_compute_output_results_path, write_index=False, write_metadata_file=False)\n", + "\n", + "print(f\"Time taken for Jaccard Computing: {time.time()-t0}\")" + ] + }, + { + "cell_type": "markdown", + "id": "bb740d30", + "metadata": {}, + "source": [ + "Verify the output. You may see repeated `id_x` and `id_y` pairs. This is expected, as a pair of similar documents is likely to share several buckets." + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "id": "a41d1f09", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " id_x id_y jaccard\n", + "0 1692361878-136568 1692361878-136566 0.754448\n", + "1 1692361878-136568 1692361878-136566 0.754448\n", + "2 1692361878-136568 1692361878-136566 0.754448\n", + "3 1692361878-136568 1692361878-136566 0.754448\n", + "4 1692361878-92875 1692361878-87743 0.828794" + ] + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "jaccard_compute_res = pd.read_parquet(jaccard_compute_output_results_path)\n", + "jaccard_compute_res.head()" + ] + }, + { + "cell_type": "markdown", + "id": "a505402e", + "metadata": {}, + "source": [ + "### 5.5 Connected Components\n", + "This section uses `ConnectedComponents()`.This section takes a dataset consisting of document pairs and their corresponding jaccard similarity to construct a non-directed graph. A edge will be form between documents whose Jaccard similarity is higher than the threshold (0.8 in this example). It will then identify the connected components in this graph. 
Documents within the same connected component are deemed duplicates\n", + "\n", + "Arguments include:\n", + "- `cache_dir`: Output path for intermediate results\n", + "- `jaccard_pairs_path`: Input path for `jaccard_similarity_results.parquet`\n", + "- `id_column`: Prefix of ID column in `jaccard_similarity_results.parquet`\n", + "- `jaccard_threshold`: Threshold to determine if an edge exists between two documents" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "id": "3bff521b", + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_curator.modules.fuzzy_dedup import ConnectedComponents" + ] + }, + { + "cell_type": "markdown", + "id": "d8afed6a", + "metadata": {}, + "source": [ + "Define parameters" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "id": "b40735dd", + "metadata": {}, + "outputs": [], + "source": [ + "#Input\n", + "jaccard_pairs_path = jaccard_compute_output_results_path\n", + "\n", + "#Output\n", + "connected_component_base_output_path = os.path.join(data_dir,\"fuzzy/cc\")\n", + "connected_component_output_path = os.path.join(connected_component_base_output_path, \"connected_components.parquet\")\n", + "connected_component_cache_dir = os.path.join(connected_component_base_output_path, \"cache\")\n", + "\n", + "#Relevant parameters\n", + "input_id_field = 'id'\n", + "jaccard_threshold = 0.8\n", + "\n", + "!mkdir -p {connected_component_base_output_path}" + ] + }, + { + "cell_type": "markdown", + "id": "33d8957f", + "metadata": {}, + "source": [ + "Run Connected Component" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "id": "fe62dd51", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "batch_id = 0/1, time = 0.29015278816223145\n", + "# of groups 5465\n", + "# of docs removed 3079\n", + "assert num_nodes:8544==labels_df:8544 passed\n", + "Time taken for Connected Component: 4.489336729049683 s\n" + ] + } + ], + "source": [ + "t0 = time.time()\n", + " 
\n", + "components_stage = ConnectedComponents(\n", + " cache_dir=connected_component_cache_dir,\n", + " jaccard_pairs_path=jaccard_pairs_path,\n", + " id_column=input_id_field,\n", + " convert_str_ids=True,\n", + " jaccard_threshold=jaccard_threshold,\n", + ")\n", + "\n", + "#Load and run connected component\n", + "components_stage.cc_workflow(output_path=connected_component_output_path)\n", + "print(f\"Time taken for Connected Component: {time.time()-t0} s\")" + ] + }, + { + "cell_type": "markdown", + "id": "669495ee", + "metadata": {}, + "source": [ + "Verify the result of `Connected Components`" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "efbd6973", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " dataset_id doc_id group\n", + "0 1692361878 122282 903\n", + "1 1692361878 139772 1952\n", + "2 1692361878 93927 112\n", + "3 1692361878 121450 2046\n", + "4 1692361878 85288 3030" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cc_compute_res = pd.read_parquet(connected_component_output_path)\n", + "cc_compute_res.head()" + ] + }, + { + "cell_type": "markdown", + "id": "0c3e2bdc", + "metadata": {}, + "source": [ + "Let's check if the output fuzzy duplicated documents within the same group are similar. Please note that the `group` id in your output might be different from the notebook output." + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "id": "d8fa1e8e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " group doc_id\n", + "0 75 160982, 161038, 161124, 161109, 161121, 160991...\n", + "1 112 122007, 122124, 122020, 122282, 122010, 122134...\n", + "2 151 134584, 135030, 134908, 134891, 135029, 135020...\n", + "3 321 94082, 94114, 94126, 94057, 94121, 94132, 9411...\n", + "4 339 116230, 116237, 116223, 116236, 116176, 116204...\n", + "... ... ...\n", + "5460 8539 120646\n", + "5461 8540 158174\n", + "5462 8541 132405\n", + "5463 8542 49199\n", + "5464 8543 160924\n", + "\n", + "[5465 rows x 2 columns]" + ] + }, + "execution_count": 54, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cc_compute_res['doc_id'] = cc_compute_res['doc_id'].astype(str)\n", + "cc_compute_res.groupby('group')['doc_id'].agg(lambda x: ', '.join(x)).reset_index()" + ] + }, + { + "cell_type": "markdown", + "id": "f34b8140", + "metadata": {}, + "source": [ + "Change the `group` number if necessary. By running the code below, we can obtain a list of near duplicated documents." + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "id": "fd01f5fe", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " dataset_id doc_id group\n", + "420 1692361878 122007 112\n", + "425 1692361878 122124 112\n", + "689 1692361878 122020 112\n", + "764 1692361878 122282 112\n", + "952 1692361878 122010 112" + ] + }, + "execution_count": 55, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cc_compute_res[cc_compute_res['group']==112].head()" + ] + }, + { + "cell_type": "markdown", + "id": "99a8d732", + "metadata": {}, + "source": [ + "Print the text of near duplicated document. Please replace the `id` if necessary, `id` should be in the format of `_`" + ] + }, + { + "cell_type": "code", + "execution_count": 73, + "id": "68883f58", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array(['ประเทศสวิตเซอร์แลนด์ ได้เข้าร่วมแข่งขันกีฬาโอลิมปิกเยาวชนฤดูหนาว ครั้งที่ 3 ค.ศ. 2020 (พ.ศ. 2563) ณ เมืองโลซาน ประเทศสวิตเซอร์แลนด์ ระหว่างวันที่ 9 - 22 มกราคม พ.ศ. 2563 คณะกรรมการโอลิมปิกแห่งชาติสวิตเซอร์แลนด์ได้ส่งทีมนักกีฬาเข้าแข่งขันทั้งหมด 56 คน แบ่งเป็นเป็นชาย 32 คนและหญิง 56 คน เข้าร่วมการแข่งขันใน 15 ชนิดกีฬา\\n\\nจำนวนผู้เข้าแข่งขัน\\n\\nผลการแข่งขัน\\n\\nสเกตลีลา\\n\\nสเกตความเร็ว\\n\\nสเกตความเร็วระยะสั้น\\n\\nฮอกกี้น้ำแข็ง\\n\\nเคอร์ลิง\\n\\nสกีลงเขา\\n\\nสกีข้ามทุ่ง\\n\\nสกีกระโดดไกล\\n\\nสกีนอร์ดิกผสม\\n\\nสกีลีลา\\n\\nสกีปีนเขา\\n\\nสโนว์บอร์ด\\n\\nทวิกีฬาฤดูหนาว\\n\\nบอบสเล\\n\\nสเกเลตัน\\n\\nอ้างอิง\\n\\nแหล่งข้อมูลอื่น \\n เว็บไซต์อย่างเป็นทางการ \\n\\nประเทศสวิตเซอร์แลนด์ในโอลิมปิกเยาวชน\\nประเทศที่เข้าร่วมแข่งขันโอลิมปิกเยาวชนฤดูหนาว 2020',\n", + " 'ประเทศบัลแกเรีย ได้เข้าร่วมแข่งขันกีฬาโอลิมปิกเยาวชนฤดูหนาว ครั้งที่ 3 ค.ศ. 2020 (พ.ศ. 2563) ณ เมืองโลซาน ประเทศสวิตเซอร์แลนด์ ระหว่างวันที่ 9 - 22 มกราคม พ.ศ. 
2563 คณะกรรมการโอลิมปิกแห่งชาติบัลแกเรียได้ส่งทีมนักกีฬาเข้าแข่งขันทั้งหมด 18 คน แบ่งเป็นเป็นชาย 11 คนและหญิง 7 คน เข้าร่วมการแข่งขันใน 8 ชนิดกีฬา\\n\\nจำนวนผู้เข้าแข่งขัน\\n\\nผลการแข่งขัน\\n\\nสเกตลีลา\\n\\nสเกตความเร็ว\\n\\nสเกตความเร็วระยะสั้น\\n\\nฮอกกี้น้ำแข็ง\\n\\nเคอร์ลิง\\n\\nสกีลงเขา\\n\\nสกีข้ามทุ่ง\\n\\nสกีกระโดดไกล\\n\\nสกีนอร์ดิกผสม\\n\\nสกีลีลา\\n\\nสกีปีนเขา\\n\\nสโนว์บอร์ด\\n\\nทวิกีฬาฤดูหนาว\\n\\nลูช\\n\\nบอบสเล\\n\\nสเกเลตัน\\n\\nอ้างอิง\\n\\nแหล่งข้อมูลอื่น \\n เว็บไซต์อย่างเป็นทางการ \\n\\nประเทศบัลแกเรียในโอลิมปิกเยาวชน\\nประเทศที่เข้าร่วมแข่งขันโอลิมปิกเยาวชนฤดูหนาว 2020'],\n", + " dtype=object)" + ] + }, + "execution_count": 73, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "jaccard_shuffle_res[jaccard_shuffle_res['id'].isin(['1692361878-121545','1692361878-121487'])]['text'].unique()" + ] + }, + { + "cell_type": "markdown", + "id": "3b6578b4", + "metadata": {}, + "source": [ + "Below is the English translation of the output above. We can see that the two documents are indeed very similar to each other.\n", + "- `Text 1`:\n", + "```\n", + "Switzerland participated in the 3rd Youth Olympic Winter Games in 2020 (B.E. 2563) in Lausanne, Switzerland from January 9 - 22, 2563. 
The Swiss Olympic Committee sent a total of 56 athletes, consisting of 32 men and 56 women, to compete in 15 sports.\n", + "Number of Competitors:\n", + "Competition Results:\n", + "Figure Skating\n", + "Speed Skating\n", + "Short Track Speed Skating\n", + "Ice Hockey\n", + "Curling\n", + "Alpine Skiing\n", + "Cross-Country Skiing\n", + "Ski Jumping\n", + "Nordic Combined\n", + "Freestyle Skiing\n", + "Ski Mountaineering\n", + "Snowboard\n", + "Biathlon\n", + "Bobsleigh\n", + "Skeleton\n", + "References:\n", + "Other Resources:\n", + "Official Website\n", + "Switzerland at the Youth Olympics\n", + "Countries at the 2020 Youth Winter Olympics\n", + "```\n", + "- `Text 2`:\n", + "```\n", + "Bulgaria participated in the 3rd Youth Olympic Winter Games in 2020 (B.E. 2563) in Lausanne, Switzerland from January 9 - 22, 2563. The Bulgarian Olympic Committee sent a total of 18 athletes, consisting of 11 men and 7 women, to compete in 8 sports.\n", + "Number of Competitors:\n", + "Competition Results:\n", + "Figure Skating\n", + "Speed Skating\n", + "Short Track Speed Skating\n", + "Ice Hockey\n", + "Curling\n", + "Alpine Skiing\n", + "Cross-Country Skiing\n", + "Ski Jumping\n", + "Nordic Combined\n", + "Freestyle Skiing\n", + "Ski Mountaineering\n", + "Snowboard\n", + "Biathlon\n", + "Luge\n", + "Bobsleigh\n", + "Skeleton\n", + "References:\n", + "Other Resources:\n", + "Official Website\n", + "Bulgaria at the Youth Olympics\n", + "Countries at the 2020 Youth Winter Olympics\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "id": "f36436f3", + "metadata": {}, + "source": [ + "### 5.6 Fuzzy deduplication wrapper" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "id": "eb52ec06", + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_curator import FuzzyDuplicates, FuzzyDuplicatesConfig" + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "id": "625c1828", + "metadata": {}, + "outputs": [], + "source": [ + "#Input\n", + 
"fuzzy_dedup_data_path = added_id_output_path\n", + "#Output\n", + "fuzzy_dedup_base_output_path = os.path.join(data_dir,\"fuzzy_wrapper\")\n", + "fuzzy_dedup_log_dir = os.path.join(fuzzy_dedup_base_output_path,'log')\n", + "fuzzy_dedup_cache_dir = os.path.join(fuzzy_dedup_base_output_path,'cache')\n", + "fuzzy_dedup_output_dir = os.path.join(fuzzy_dedup_base_output_path,'data')\n", + "#Specify dataset name\n", + "dataset_name = 'TH_wikipedia'\n", + "\n", + "#Relevant parameters\n", + "id_field = 'id'\n", + "text_field = 'text'\n", + "filetype = \"parquet\"\n", + "\n", + "!mkdir -p {fuzzy_dedup_base_output_path}\n", + "!mkdir -p {fuzzy_dedup_log_dir}\n", + "!mkdir -p {fuzzy_dedup_cache_dir}\n", + "!mkdir -p {fuzzy_dedup_output_dir}" + ] + }, + { + "cell_type": "markdown", + "id": "cb76d8e5", + "metadata": {}, + "source": [ + "**[Optional]** If the cache folder is not empty, please CLEAR the folder before proceeding" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "id": "e7fb4c4c", + "metadata": {}, + "outputs": [], + "source": [ + "#!rm -r {fuzzy_dedup_cache_dir}" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "id": "2368443f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Reading 1 files\n", + "Stage1: Starting Minhash + LSH computation\n", + "Stage1: Minhash + LSH complete!\n", + "Stage2 (False Postive Check): Starting Map_Buckets\n", + "Stage2 (False Postive Check): Map_Buckets Complete!\n", + "Stage3 (False Postive Check): Shuffle docs\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 0%| | 0/1 [00:00\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "" + ], + "text/plain": [ + " id group\n", + "0 TH_wiki-0000134798 736\n", + "1 TH_wiki-0000116226 1526\n", + "2 TH_wiki-0000126796 2934\n", + "3 TH_wiki-0000138218 156\n", + "4 TH_wiki-0000085437 2722" + ] + }, + "execution_count": 61, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "fuzzy_dedup_res = pd.read_parquet(fuzzy_dedup_output_dir)\n", + "fuzzy_dedup_res.head()" + ] + }, + { + "cell_type": "markdown", + "id": "d2726cf9", + "metadata": {}, + "source": [ + "## 6. Remove duplicates\n", + "\n", + "Now we have duplicated document IDs output by both exact deduplication and fuzzy deduplication. We will run this section to remove those documents. This is done by loading the output .parquet files and the Unicode-fixed input dataset in .jsonl as DataFrames, then using DataFrame operations to remove the duplicated documents." + ] + }, + { + "cell_type": "markdown", + "id": "e4dd78db", + "metadata": {}, + "source": [ + "Define parameters" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "id": "0027c8d2", + "metadata": {}, + "outputs": [], + "source": [ + "#Input\n", + "dataset_dir = added_id_output_path\n", + "\n", + "#Output\n", + "dudped_output_dir = os.path.join(data_dir,\"remove_duplicate/result.parquet\")\n", + "\n", + "#Relevant parameters\n", + "input_id_field = 'id'\n", + "id_prefix = add_ID_id_prefix\n", + "\n", + "!mkdir -p {dudped_output_dir}" + ] + }, + { + "cell_type": "markdown", + "id": "a373860d", + "metadata": {}, + "source": [ + "We will first process the result of exact deduplication. Since the result of exact deduplication contains the original IDs used in the input dataset, it is more straightforward to deal with." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 82, + "id": "f59e92c3", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Reading 1 files\n", + "Reading 1 files\n" + ] + } + ], + "source": [ + "#Load .jsonl dataset\n", + "input_dataset = DocumentDataset.read_json(dataset_dir, backend='cudf')\n", + "\n", + "#Load exact deduplicate result and extract list of duplicated document ID\n", + "exact_duplicates = DocumentDataset.read_parquet(os.path.join(exact_dedup_output_dir,\"_exact_duplicates.parquet\"), backend='cudf')\n", + "exact_docs_to_remove = exact_duplicates.df.map_partitions(\n", + " lambda x: x[x._hashes.duplicated(keep=\"first\")]\n", + ")\n", + "\n", + "#Remove the duplicated document from input dataset\n", + "result = input_dataset.df[\n", + " ~input_dataset.df[input_id_field].isin(exact_docs_to_remove[input_id_field].compute())\n", + "]" + ] + }, + { + "cell_type": "markdown", + "id": "f55d6737", + "metadata": {}, + "source": [ + "For the result of fuzzy deduplication, we first need to reconstruct the document ID by combining `dataset_id` and `doc_id`, then use the reconstructed `ID` for removal" + ] + }, + { + "cell_type": "markdown", + "id": "3b9c122d", + "metadata": {}, + "source": [ + "**[Optional]** Uncomment the cell below to use the result of the step-by-step fuzzy deduplication" + ] + }, + { + "cell_type": "code", + "execution_count": 83, + "id": "c6a1bb0a", + "metadata": {}, + "outputs": [], + "source": [ + "# #List of id_prefix used in Add ID\n", + "# base_ids = [id_prefix]\n", + "\n", + "# #Obtain a mapping between `dataset_id` and `id_prefix`\n", + "# df = cudf.DataFrame()\n", + "# df['base_id'] = [base_id for base_id in base_ids]\n", + "# df['dataset_id'] = df['base_id'].hash_values()\n", + "# df_pd = df.to_pandas()\n", + "# mapping = {\n", + "# hashed_id: base_id\n", + "# for base_id, hashed_id in zip(df_pd['base_id'], df_pd['dataset_id'])\n", + "# }\n", + "\n", + "# #Load result of 
fuzzy deduplication 
\n", + "# fuzzy_duplicates = pd.read_parquet(connected_component_output_path)\n", + "# #Reconstruct the original document ID\n", + "# fuzzy_duplicates['id']=fuzzy_duplicates.apply(lambda x: f\"{mapping[x['dataset_id']]}-{x['doc_id']:010d}\", axis=1)\n", + "\n", + "# #Generate list of near duplicate document IDs (all but the first document in each group)\n", + "# fuzzy_docs_to_remove = fuzzy_duplicates[fuzzy_duplicates.duplicated(subset=['group'], keep='first')]" + ] + }, + { + "cell_type": "code", + "execution_count": 84, + "id": "746d3673", + "metadata": {}, + "outputs": [], + "source": [ + "#Load result from fuzzy dedup wrapper\n", + "fuzzy_duplicates = pd.read_parquet(fuzzy_dedup_output_dir)\n", + "\n", + "#Generate list of near duplicate document IDs (all but the first document in each group)\n", + "fuzzy_docs_to_remove = fuzzy_duplicates[fuzzy_duplicates.duplicated(subset=['group'], keep='first')]" + ] + }, + { + "cell_type": "code", + "execution_count": 85, + "id": "62b34838", + "metadata": {}, + "outputs": [], + "source": [ + "#Remove near duplicates\n", + "result = result[~result[input_id_field].isin(fuzzy_docs_to_remove[input_id_field])]\n", + "\n", + "#Save final result to local disk\n", + "result.to_parquet(dudped_output_dir, write_to_filename=True)" + ] + }, + { + "cell_type": "markdown", + "id": "edfa52ce", + "metadata": {}, + "source": [ + "Verify the result of duplicate removal. The number of documents in the resulting dataset is smaller than in the original dataset (length = 161748)." + ] + }, + { + "cell_type": "code", + "execution_count": 86, + "id": "78eee9b3", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Length of duplicate removed dataset:156265\n" + ] + } + ], + "source": [ + "res = pd.read_parquet(dudped_output_dir)\n", + "print(f\"Length of duplicate removed dataset:{len(res)}\")" + ] + }, + { + "cell_type": "markdown", + "id": "15e07a32", + "metadata": {}, + "source": [ + "Close the GPU Dask cluster. You might encounter an error such as `Caught signal 11`. If so, simply rerun the cell." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 88, + "id": "8e807bd7", + "metadata": {}, + "outputs": [], + "source": [ + "client.cluster.close()\n", + "client.shutdown()" + ] + }, + { + "cell_type": "markdown", + "id": "a416a293", + "metadata": {}, + "source": [ + "## 7. Heuristic Filtering\n", + "\n", + "In this section, we will apply multiple heuristic filters to the dataset and record, for each filter, the heuristic scores of the documents and the documents it removes. Each heuristic filter calculates a quality score based on user-defined heuristics/algorithms and classifies each document as high quality or low quality by comparing that score against a user-defined threshold.\n", + "\n", + "Sample lists of heuristic filters can be found in `./config/`:\n", + "- `heuristic_filter_en.yaml`: Sample heuristic filter list for English datasets\n", + "- `heuristic_filter_non-en.yaml`: Sample heuristic filter list for non-English datasets\n", + "- `heuristic_filter_code.yaml`: Sample heuristic filter list for code datasets\n", + "Please adjust the sample list, e.g. remove/add filters or change filter thresholds, based on your own use case. In this example, `heuristic_filter_non-en.yaml` will be used.\n", + "\n", + "For the detailed implementation and description of each heuristic filter, please refer to `./NeMo-Curator/nemo-curator/filters/heuristics_filter.py`. To implement a customized heuristic filter, follow the sample implementations, write the customized filter, and update the .yaml files accordingly.\n", + "\n", + "To analyze the impact of each filter on the dataset, set `log-score` to true for the filters in the corresponding config .yaml file. This writes the quality scores to a separate .txt file for each filter. 
With the quality scores and filter thresholds, users can compute the quality score distribution and run other analyses to assess the effectiveness of each filter.\n", + "\n", + "In this example, to get a comprehensive output for each filter, we iterate through every filter in a for loop and save the intermediate results. This involves extensive I/O operations and is less efficient. Alternatively, after loading the input dataset and filter pipeline, users can simply call `filter_pipeline(dataset)` to obtain the final filtered result." + ] + }, + { + "cell_type": "code", + "execution_count": 89, + "id": "b988ad1e", + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_curator.utils.config_utils import build_filter_pipeline\n", + "from nemo_curator import Score, Filter, ScoreFilter\n", + "from nemo_curator.utils.file_utils import get_batched_files,expand_outdir_and_mkdir" + ] + }, + { + "cell_type": "markdown", + "id": "097a1b48", + "metadata": {}, + "source": [ + "**[Optional]** The following cell suppresses a warning from Dask." + ] + }, + { + "cell_type": "code", + "execution_count": 90, + "id": "44552288", + "metadata": {}, + "outputs": [], + "source": [ + "import warnings\n", + "\n", + "# Disable the metadata warning\n", + "warnings.filterwarnings(\"ignore\",module=\"dask.dataframe.core\")" + ] + }, + { + "cell_type": "markdown", + "id": "9a59699d", + "metadata": {}, + "source": [ + "Create a CPU Dask Cluster." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 91, + "id": "b8f80ab3", + "metadata": {}, + "outputs": [], + "source": [ + "cluster = LocalCluster(n_workers=10, processes=True, memory_limit='16GB')\n", + "client = Client(cluster)" + ] + }, + { + "cell_type": "markdown", + "id": "a7702918", + "metadata": {}, + "source": [ + "Define some helper functions" + ] + }, + { + "cell_type": "code", + "execution_count": 92, + "id": "6f2e7523", + "metadata": {}, + "outputs": [], + "source": [ + "def get_dataframe_complement(original_df, filtered_df):\n", + "    def partition_complement(part_original_df, partition_info=None):\n", + "        if not partition_info:\n", + "            return part_original_df\n", + "        part_filtered_df = filtered_df.get_partition(partition_info[\"number\"])\n", + "        complement_mask = ~part_original_df.index.isin(part_filtered_df.index.persist())\n", + "        complement_df = part_original_df[complement_mask]\n", + "        return complement_df\n", + "\n", + "    return original_df.map_partitions(partition_complement)\n", + "\n", + "def write_scores(df, output_dir):\n", + "    for column in df.columns:\n", + "        output_path = os.path.join(output_dir, f\"{column}.txt\")\n", + "        df[column].to_csv(output_path, single_file=True, encoding=\"utf-8\", header=False, index=False, mode=\"a\")\n", + "\n", + "def get_score_fields(pipeline):\n", + "    score_fields = []\n", + "    for nc_module in pipeline.modules:\n", + "        if isinstance(nc_module, Score) or isinstance(nc_module, ScoreFilter):\n", + "            if nc_module.score_field:\n", + "                score_fields.append(nc_module.score_field)\n", + "    return score_fields" + ] + }, + { + "cell_type": "markdown", + "id": "227fa8b0", + "metadata": {}, + "source": [ + "Define parameters" + ] + }, + { + "cell_type": "code", + "execution_count": 93, + "id": "a894f90f", + "metadata": {}, + "outputs": [], + "source": [ + "#Input\n", + "HF_input_data_dir = dudped_output_dir\n", + "input_file_type = 'parquet'\n", + "batch_size = 1\n", + "\n", + "#Output\n", + "HF_base_output_path 
= os.path.join(data_dir,'heuristic_filtering')\n", + "kept_document_dir = os.path.join(HF_base_output_path,'data','hq.parquet')\n", + "removed_document_dir = os.path.join(HF_base_output_path,'data','lq.parquet')\n", + "output_document_score_dir = os.path.join(HF_base_output_path,'data','score')\n", + "output_file_type = 'parquet'\n", + "\n", + "#Relevant parameters\n", + "filter_config_file = './config/heuristic_filter_non-en.yaml'\n", + "input_id_field = 'id'\n", + "\n", + "#Set to False if do not want to save intermediate results\n", + "is_cache = True\n", + "\n", + "!mkdir -p {kept_document_dir}\n", + "!mkdir -p {removed_document_dir}\n", + "!mkdir -p {output_document_score_dir}" + ] + }, + { + "cell_type": "markdown", + "id": "ccea406e", + "metadata": {}, + "source": [ + "Run heuristic filtering" + ] + }, + { + "cell_type": "code", + "execution_count": 94, + "id": "03b3da27", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Reading 1 files\n", + "Saving data for symbol_to_word\n", + "Writing to disk complete for 1 partitions\n", + "Saving data for numbers_ratio\n", + "Writing to disk complete for 1 partitions\n", + "Saving data for urls_ratio\n", + "Writing to disk complete for 1 partitions\n", + "Saving data for white_space\n", + "Writing to disk complete for 1 partitions\n", + "Saving data for parentheses_ratio\n", + "Writing to disk complete for 1 partitions\n", + "Saving data for boilerplate_string_ratio\n", + "Writing to disk complete for 1 partitions\n", + "Saving data for repeated_lines\n", + "Writing to disk complete for 1 partitions\n", + "Saving data for repeated_paragraphs\n", + "Writing to disk complete for 1 partitions\n", + "Saving data for repeated_lines_char\n", + "Writing to disk complete for 1 partitions\n", + "Saving data for repeated_paragraphs_char\n", + "Writing to disk complete for 1 partitions\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + 
"/usr/local/lib/python3.10/dist-packages/nemo_curator/utils/distributed_utils.py:379: UserWarning: Empty partition found\n", + " warnings.warn(f\"Empty partition found\")\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Saving data for word_count\n", + "Writing to disk complete for 1 partitions\n", + "Saving data for repeating_top_2grams\n", + "Writing to disk complete for 1 partitions\n", + "Saving data for repeating_top_3grams\n", + "Writing to disk complete for 1 partitions\n", + "Saving data for repeating_top_4grams\n", + "Writing to disk complete for 1 partitions\n", + "Writing to disk complete for 1 partitions\n", + "Time taken for Heuristic filtering: 1120.5212895870209 s\n" + ] + } + ], + "source": [ + "t0 = time.time()\n", + "\n", + "#Load filters from config\n", + "filter_pipeline = build_filter_pipeline(filter_config_file)\n", + "score_fields = get_score_fields(filter_pipeline)\n", + "\n", + "# Load dataset\n", + "dataset = DocumentDataset.read_parquet(HF_input_data_dir, backend='pandas', add_filename=True)\n", + "\n", + "\n", + "# Iterate through filters. 
For each filter, the low quality documents are removed from the dataset and written to the corresponding folder for analysis\n", + "# Output of the previous filter is the input of the next filter\n", + "if is_cache:\n", + "    curr_dataset = prev_dataset = dataset\n", + "    for filter_module in filter_pipeline.modules:\n", + "        #Apply filter\n", + "        curr_dataset = filter_module(curr_dataset).persist()\n", + "\n", + "        #Output filtered document\n", + "        print(f\"Saving data for {filter_module.filter_obj._name}\")\n", + "        removed_df = get_dataframe_complement(prev_dataset.df, curr_dataset.df)\n", + "        removed_filter_dir = os.path.join(removed_document_dir, filter_module.filter_obj._name)\n", + "        expand_outdir_and_mkdir(removed_filter_dir)\n", + "        write_to_disk(removed_df, removed_filter_dir, write_to_filename=True, output_type=output_file_type)\n", + "        prev_dataset = curr_dataset\n", + "    filtered_dataset = curr_dataset\n", + "else:\n", + "    filtered_dataset = filter_pipeline(dataset)\n", + "\n", + "# Write scores of retained documents to a separate directory\n", + "output_df = filtered_dataset.df[[input_id_field, *score_fields]]\n", + "write_scores(output_df, output_document_score_dir)\n", + "\n", + "# Remove scores from dataset df\n", + "filtered_dataset = DocumentDataset(filtered_dataset.df.drop(columns=score_fields))\n", + "\n", + "# Output filtered dataset\n", + "filtered_dataset.to_parquet(kept_document_dir, write_to_filename=True)\n", + "\n", + "print(f\"Time taken for Heuristic filtering: {time.time()-t0} s\")" + ] + }, + { + "cell_type": "markdown", + "id": "a53b04e9", + "metadata": {}, + "source": [ + "Verify the result." + ] + }, + { + "cell_type": "code", + "execution_count": 95, + "id": "07475373", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Dataset size after heuristic filtering:192786\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenameidlanguagesource_idtexttitleurl
1part.0.parquetTH_wiki-0000000001THthwiki-20240201-thwiki-20240201-pages-articles...ดาราศาสตร์ คือวิชาวิทยาศาสตร์ที่ศึกษาวัตถุในท้...ดาราศาสตร์https://th.wikipedia.org/wiki/%E0%B8%94%E0%B8%...
2part.0.parquetTH_wiki-0000000002THthwiki-20240201-thwiki-20240201-pages-articles...ภูมิศาสตร์ (, แปลว่า \"การพรรณนาเกี่ยวกับโลก\")...ภูมิศาสตร์https://th.wikipedia.org/wiki/%E0%B8%A0%E0%B8%...
3part.0.parquetTH_wiki-0000000003THthwiki-20240201-thwiki-20240201-pages-articles...พันทิป.คอม หรือพันทิป ก่อตั้งขึ้นเมื่อวันที่ 7...พันทิป.คอมhttps://th.wikipedia.org/wiki/%E0%B8%9E%E0%B8%...
4part.0.parquetTH_wiki-0000000004THthwiki-20240201-thwiki-20240201-pages-articles...พันธุ์ทิพย์พลาซ่า () เป็นศูนย์การค้าเกี่ยวกับเ...พันธุ์ทิพย์พลาซ่าhttps://th.wikipedia.org/wiki/%E0%B8%9E%E0%B8%...
5part.0.parquetTH_wiki-0000000005THthwiki-20240201-thwiki-20240201-pages-articles...วิทยาการคอมพิวเตอร์ศึกษาเกี่ยวกับโครงสร้างพื้น...วิทยาการคอมพิวเตอร์https://th.wikipedia.org/wiki/%E0%B8%A7%E0%B8%...
\n", + "
" + ], + "text/plain": [ + " filename id language \\\n", + "1 part.0.parquet TH_wiki-0000000001 TH \n", + "2 part.0.parquet TH_wiki-0000000002 TH \n", + "3 part.0.parquet TH_wiki-0000000003 TH \n", + "4 part.0.parquet TH_wiki-0000000004 TH \n", + "5 part.0.parquet TH_wiki-0000000005 TH \n", + "\n", + " source_id \\\n", + "1 thwiki-20240201-thwiki-20240201-pages-articles... \n", + "2 thwiki-20240201-thwiki-20240201-pages-articles... \n", + "3 thwiki-20240201-thwiki-20240201-pages-articles... \n", + "4 thwiki-20240201-thwiki-20240201-pages-articles... \n", + "5 thwiki-20240201-thwiki-20240201-pages-articles... \n", + "\n", + " text title \\\n", + "1 ดาราศาสตร์ คือวิชาวิทยาศาสตร์ที่ศึกษาวัตถุในท้... ดาราศาสตร์ \n", + "2 ภูมิศาสตร์ (, แปลว่า \"การพรรณนาเกี่ยวกับโลก\")... ภูมิศาสตร์ \n", + "3 พันทิป.คอม หรือพันทิป ก่อตั้งขึ้นเมื่อวันที่ 7... พันทิป.คอม \n", + "4 พันธุ์ทิพย์พลาซ่า () เป็นศูนย์การค้าเกี่ยวกับเ... พันธุ์ทิพย์พลาซ่า \n", + "5 วิทยาการคอมพิวเตอร์ศึกษาเกี่ยวกับโครงสร้างพื้น... วิทยาการคอมพิวเตอร์ \n", + "\n", + " url \n", + "1 https://th.wikipedia.org/wiki/%E0%B8%94%E0%B8%... \n", + "2 https://th.wikipedia.org/wiki/%E0%B8%A0%E0%B8%... \n", + "3 https://th.wikipedia.org/wiki/%E0%B8%9E%E0%B8%... \n", + "4 https://th.wikipedia.org/wiki/%E0%B8%9E%E0%B8%... \n", + "5 https://th.wikipedia.org/wiki/%E0%B8%A7%E0%B8%... 
" + ] + }, + "execution_count": 95, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "res = pd.read_parquet(kept_document_dir)\n", + "print(f\"Dataset size after heuristic filtering:{len(res)}\")\n", + "res.head()" + ] + }, + { + "cell_type": "markdown", + "id": "24e8b173", + "metadata": {}, + "source": [ + "Close the CPU Dask Cluster" + ] + }, + { + "cell_type": "code", + "execution_count": 96, + "id": "12508f5e", + "metadata": {}, + "outputs": [], + "source": [ + "client.cluster.close()\n", + "client.shutdown()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "83e4aed1", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}