From 2a4e92f6b28b394da91175d10005499703bb211e Mon Sep 17 00:00:00 2001
From: Noam Gat
Date: Fri, 10 Nov 2023 15:30:21 +0200
Subject: [PATCH] Added LMFormatEnforcer integration (#65)

* Added LMFormatEnforcer integration
* Editing based on PR review
* Adding frontmatter tag
* Added logo
* More README work
* Adding introduction paragraph
---
 integrations/lmformatenforcer.md | 137 +++++++++++++++++++++++++++++++
 logos/lmformatenforcer.png       | Bin 0 -> 9674 bytes
 2 files changed, 137 insertions(+)
 create mode 100644 integrations/lmformatenforcer.md
 create mode 100644 logos/lmformatenforcer.png

diff --git a/integrations/lmformatenforcer.md b/integrations/lmformatenforcer.md
new file mode 100644
index 00000000..0db1756f
--- /dev/null
+++ b/integrations/lmformatenforcer.md
@@ -0,0 +1,137 @@
+---
+layout: integration
+name: LM Format Enforcer
+description: Use the LM Format Enforcer to enforce JSON Schema / Regex output of your Local Models.
+authors:
+  - name: noamgat
+    socials:
+      github: noamgat
+      twitter: noamgat
+pypi: https://pypi.org/project/lm-format-enforcer/
+repo: https://github.com/noamgat/lm-format-enforcer
+type: Model Provider
+report_issue: https://github.com/noamgat/lm-format-enforcer/issues
+logo: /logos/lmformatenforcer.png
+version: Haystack 2.0
+---
+
+# LM Format Enforcer Haystack Integration Layer
+
+![LM Format enforcer](../logos/lmformatenforcer.png)
+
+Use the [LM Format Enforcer](https://github.com/noamgat/lm-format-enforcer) to enforce JSON Schema / Regex output of your local models in your Haystack pipelines.
+
+Language models are able to generate text, but when a precise output format is required, they do not always perform as instructed. Various prompt engineering techniques have been introduced to improve the robustness of the generated text, but they are not always sufficient. [LM Format Enforcer](https://github.com/noamgat/lm-format-enforcer) solves this issue by filtering the tokens that the language model is allowed to generate at every timestep, thus ensuring that the output format is respected while minimizing the limitations on the language model.
+
+### What is the LM Format enforcer?
+![Solution at a glance](https://raw.githubusercontent.com/noamgat/lm-format-enforcer/main/docs/Intro.webp)
+
+
+
+## Installation
+Install the format enforcer via pip: `pip install lm-format-enforcer`
+
+## Usage
+This integration supports both Haystack 1.x and Haystack 2.0:
+- `LMFormatEnforcerPromptNode`: A Haystack 1.x `PromptNode` that activates the format enforcer.
+- `LMFormatEnforcerLocalGenerator`: A Haystack 2.0 Generator component that activates the format enforcer.
+
+Important note: LM Format Enforcer requires a LOCAL generator. Currently only local HuggingFace transformers are supported; vLLM support is coming soon.
+
+### Creating a CharacterLevelParser
+The `CharacterLevelParser` is the class that connects the output parsing to the format enforcing. Two main parsers are available: `JsonSchemaParser` for JSON Schemas, and `RegexParser` for regular expressions.
+
+We will start off by defining the format we want to decode, independently of Haystack.
+
+```python
+
+from pydantic import BaseModel
+from lmformatenforcer import JsonSchemaParser
+
+class AnswerFormat(BaseModel):
+    first_name: str
+    last_name: str
+    year_of_birth: int
+    num_seasons_in_nba: int
+
+parser = JsonSchemaParser(AnswerFormat.schema())
+```
+### Haystack 1.x Integration
+
+ Open In Colab
+
+
+To activate the enforcer with Haystack V1, a `LMFormatEnforcerPromptNode` has to be used.
+
+Here is a simple example:
+```python
+from haystack.nodes import PromptModel
+from lmformatenforcer.integrations.haystackv1 import LMFormatEnforcerPromptNode
+
+question = 'Please give me information about {query}. You MUST answer using the following json schema: '
+schema_json_str = AnswerFormat.schema_json().replace("{", "{{").replace("}", "}}")
+question_with_schema = f'{question}{schema_json_str}'
+prompt = get_prompt(question_with_schema)  # get_prompt: prompt-template helper, not shown in this snippet
+
+
+model = PromptModel(model_name_or_path="meta-llama/Llama-2-7b-chat-hf")
+prompt_node = LMFormatEnforcerPromptNode(model, prompt, character_level_parser=parser)
+
+result = prompt_node(query='Michael Jordan')
+print(result[0])
+
+```
+The model will run inference with the format enforcer, and the output will look like this:
+
+```
+{
+"first_name": "Michael",
+"last_name": "Jordan",
+"year_of_birth": 1963,
+"num_seasons_in_nba": 15
+}
+```
+For a full example, see the [example notebook](https://github.com/noamgat/lm-format-enforcer/blob/main/samples/colab_haystackv1_integration.ipynb).
+
+### Haystack 2.0 Integration
+
+ Open In Colab
+
+
+To activate the enforcer with Haystack V2, a `LMFormatEnforcerLocalGenerator` has to be used.
+
+Here is a simple example:
+```python
+from haystack.preview.components.generators.hugging_face.hugging_face_local import HuggingFaceLocalGenerator
+from lmformatenforcer.integrations.haystackv2 import LMFormatEnforcerLocalGenerator
+from haystack.preview import Pipeline
+
+question = 'Please give me information about Michael Jordan. You MUST answer using the following json schema: '
+schema_json_str = AnswerFormat.schema_json()
+prompt = f'{question}{schema_json_str}'
+
+
+model = HuggingFaceLocalGenerator(model_name_or_path="meta-llama/Llama-2-7b-chat-hf")
+format_enforcer = LMFormatEnforcerLocalGenerator(model, parser)
+pipeline = Pipeline()
+pipeline.add_component(instance=format_enforcer, name='model')
+
+
+result = pipeline.run({
+    "model": {"prompt": prompt}
+})
+print(result['model']['replies'][0])
+
+```
+The model will run inference with the format enforcer, and the output will look like this:
+
+```
+{
+"first_name": "Michael",
+"last_name": "Jordan",
+"year_of_birth": 1963,
+"num_seasons_in_nba": 15
+}
+```
+For a full example, see the [example notebook](https://github.com/noamgat/lm-format-enforcer/blob/main/samples/colab_haystackv2_integration.ipynb).
+
diff --git a/logos/lmformatenforcer.png b/logos/lmformatenforcer.png
new file mode 100644
index 0000000000000000000000000000000000000000..38558e4a560c3d194eea31b5d49bff2bdfb02922
GIT binary patch
literal 9674
zcmc(FRZtvH947A0LU2ujTX0)kg1ZEVUDYZgAFDU<9eu?dZVBa_5HV@wx)hOLU!6X8<_a$yE%Jk
z+WXkNxACz-_MxBzv)H*fd+4}3yZLytNSFn)1h7c(bF%~@_aN^7i3Ej&{*PEdP~?A)
z_+R7y!xkVH4^M6sl$Ymk73B;9ERS+=LjZ>7LtOsy#}On<7$g`o6vh?+-5CpQ4+>pH
z14R#0ViHn=)0=o~2n3G-e{~mmf6jfD4D1+|A_yK1Cy5)`=al&6jptCrfmr6HhJ%E(
zNXU|V)*PndVFN0j;hn?rFZm)50`oVr-BuZxDF5hT-07?eQf-AXj+|a!*8vmtOUMG
zIbRDa=dl~ogLi}O60UzfUe(U*we4K$TEj+0iwR#?lq5^B2}xuvYUilpb_E{@_Nn|L
zmYMo@e^OqVLkAiZl<)rm`?38C>hU-8E+Lr6(0KZ|JiqMGy{CFoh_Atb`qlMbaWc*M
zXfN2H9@GK+0915%uTxGf$ll;;hMW!=Y9b{v`!G7{yf(_Guc?~~BoEo1$UXRe4#3bY
zfLMv$jPf-d{9+TYB6WdQnJOq*g~F0C`n(^CWJRAoYsROCsJnN5PjHwtRji$`G-;`}
zxXylpUWau2%?O$cay_=qx*sIFdw=%fX?$da^YJY+75mfOM^1|tEk;ai`0Z`e$3-E~
zOPtjWea!X`S6DOwSLi-Rb#E_gY0IYwSsBVxi}Oo#?ec1t(zz{)WsamT{)XwEykm9C
z)e>EKqxm>%@HC#SqFJ6Fx>_W|gORn5j?L~kkyGG>@LmgwA88SiQ-&&8=)bEZj$|dw
zXP>I9{16my?jlh;dDP_bTeQjff{2s*GCzfaBJJG0Tlz$B-*;BE9rU2F*wdr)Y>Y#t
zwg9cMP*>tI-X{lRUVrYf1_K)=3OUbwk_=po3woZKzkDJJgO*91eFBmPICR~~(1e{_ zG=ZWC$S|_9K9eopZ^j!eM2=x}3_su4Rr0b=sX*4Hn>nW*1EVgdN}m5fk|96(`73H_ z3U5#N%FTv=nt_3Rl4KzzI&HzCh|cpkARYmgm_(gRCXfu{<9wy;PEQYH)rusiI8&!+ z6!@5go%!-g_*6EKBJ?Z@c+NLGpcv|Q)E9A=z-0161hd`U3RPO*Q4<8{)Cl`a|L%J0 zeA$HAiuEEH(iWe?2NFpmK|(4Y*|eN8LMV`TuJ)TNga zGwX}KY;9M!BBSt2At9(1upZylL^RvaEmOscFWjN8VB*_>#m9p(QRmx+ z^1X!Vj&Ey1`P1*r%+#&I8uPZk%Ui>$D+;;<9nY=8Qm^piZ_L}x@p!@E7fZu^}H^SkSqq219mnLWWPFrxitYY1jG#|;`39X1b` zR6UwYUQA%nEaZ8&E7*2@C(_ zB9@juGNSRzJ;1~X=@b1X`lk&vxL^E<`>RamQ|N5$-LPm3+Ev17$;%JHJGoGw3Gusx zM<&<(>umSG1SpcYP_*%&Q#rL=xH{u4F*gA@8EIiM%^w3y8qkdH%yIWUJEx$2{uyzF ziqWEF&EWfZ{2qVzyrtBmymE$b*AFVuMAXpFp9_*{^el~F*O9Uj)M}XVl(%lOu;z3; z;L}&m>ti@umW+6<_s+~3Z6irOnU%mQ?eCor| zRE&M=qW4!HB~{!gaZqc<6M_az#=;tIkHh@A92hJy+$sc4Y8~0Mn)DPs!cS zHjO}i#<9MzW0@B3?uy!!OjS+vU*nnnt8ah3u-x_hEJ(91a+SLYuR;9QLK7y3{eW^n z17=O{*aMujYK65J-$cGkrYcOX`u+asY@&1031R%#dqQqW#5riPM=>&5KdAknt>3-- zk>k;Sg3Umec{B@?>uY0@;g&SLj(n$}`>GV@?Xe_CnIt5B2(PKAA@1Mt_zsU`Fe9iw zgl7fl_F6M!TbPr>^T^@q#J_Cyg0H^*x73_83yXQa^KcirWl{Z>!_?fj**R(2=H+Gd z&JSNm0@T88`nKQ)Tz-cIotXGZG2r6;$9vcaarW4%SMm_=gYcYY335qGlckE4;++=1 zNUtIOcy&A6X2|^MaSuxHJJE%#W15sY!{Z#Y4_D*y-HAzIFsQIISw@uPzpjTjeCdbU zXo&v%`0a~{;;LBkE8VRae0Nq>w{{`x5%Qcr#k9G>Ne87Zk$+JE{>O(41{S+P>$M?E zB9;f!cqX0lBvtEzn2X=lCEGU}da>Gyi&^%%gFoqvroX*Qm@?-u?MPSael%hfet#c{ z8`hdb=_DRdw@Qw2a+--o|KDtL6nf-4ft2Fru1^?AW0xLdWY`n`#mL4{NPIYg2D)$EepyTHY_?*mi z!({iI3U~_~Tl=TdU1XP)IVF-^Da~V$+RX}ggJ?V8Z`dhn+S5%jP{GG~PmgoSCGTK~ zDz+sU29?3#K--0&BLS@Ut(fco4n>oWvaio<%ioLM|6gb_#9R^*@1*09>R@@Df*{)CRi zpV+Kw0MtA$lE*G@bsp4?_c*ds1)&%aBd_EH!VOJrCEp7CaRocLbl9ntt& zp7L_ivOj(x6sl5myduXAyT;+QM9e@8}0>&5IV z^TZBnW{yv<^1E%;vO>deDrquj6fW`JX%^aoEwKv=gMW}i!oXbl%jL-kW9o}j+ z`wl*yb;JzOAT`piH1kyCtCy#A&(LlVsC&TYs4@&L(cC$Eb9`@ELOt2Nb3UG9Cej6i zx)uS?iEnR}W_CWGHQ8*5fXaZ^tG+xdzB_Pafgjc=xnr`&sjU*OsIDq;L8!f`Hiu2# zFqq`K4kf@|L{u*~MvMGAsYSk6Tx(*xjN3waYuvnJdiT-AJPWk&=g1sj8AL5aqCQFS!(*N0 zb32zCG#O?M2%U4A!$0&poC<+#2@ec}uAtp&73v^H*sOd=lb=K|8N~W+nMp~JifcZZ 
z$613hr1djW6aFkMT{zhk+#PT{@sM;}b;BZP&o58SB7N~fVU$v{+zyoU&R&H^U-kXV zWW9U*mzgbLyUf$hZry5jURZb`b6d$~p-R}QwEW?V%YNA^tS93FiF+QPYJq++-12;e zy1A+*24S_6z)w7QLc#aI{e^pn4ELRcq@Zuu7PJ>@w6k7$w7A=0Cc3ft~n*nA`t0Tt5Y~Y+-JJijsr$>$NemhGiPpc4e@YR;6 zFb3y-GMC9zP6mBCrjluNGi#a;I=>vkH->l2@*w4Wp8{hUTb>VJj&u;6!rtC_UWxUm zIYPs@UvXS(4nJ{{)5@8IFqg1WU#~v%slD}-IXcLO)9zOL0}-!|6I`du=a9sMqU&yQZ8Sxi;x-o>(SmG*SCwB%x98Y#KWT=WZu=$F7x zq5on>iqA%SyY01hS7Uj_o&lhj7=oQ29_%yD!c*9+SN#?XJOOWZ03qT6}D6yl6qT=m2r$n$`V@^~+c#f^;3tep}K!S^uxXhT@U zSCUT@`DA30j_GB-N-)CHZS9WCdS_|d{hU?4FwkguTRzH2yoEY8zCQj*4!gLR>{uq@ zqQD5-YOE3cW+9PC6G#ne-hO*`-stGGXvYd$8z`T~+_@^y*A_2xJ6MDVwZBzU3+ejE zoVH)dHCZ^s(3NRa9gv>Ro1-jB)D!vc`}tbj4upwLeEzd4-&fYy%d4~3sjF*MjqS_1 zFaOLbknJJeIYu&4(&0d?Nj*0}IOeMl#K3imq%3@9U0rCZ?vI+>^jJ=BtuG3NbeU31=+e*Gm5dO^98gAR}JFU{EFp(o} zP`%yV_Q#>gKXCJZCjh*i$Z$YO5sU|6GEoFn{NDoZab(RfQ~zFC+SzQQsu)@7?=On| zfr~Q$HF2@wO{hXixqCDKLsCxi+r(t~Vgxpa!KjU@$=9b^5>;-FsdLvQ_WRXfnEc~_ zd;_)UO3AZ#019|K_}XIeq95OQ-_@{#pld zyyELcg|rv0ZaukVlT!L&YB+#divas5TW4OA;%M6Yo$+Fe>GH-j`qVrvT~kv)`d3vX z&%2=FKxEiW8$X58izIOR9j=Q;%M-D_Yc8pLyys}QQ+ecIPBCebJ?YLKX>lvPWLBuS ziXpv|iN==64+Fvh6uJ+v@&@0;7Wr8KrJk$(t9h)C zSGD#QzXPVfbJglo=RtpNyzFFoZ8B#l{U=54B^zD_pCbbI9T9^DPm=hjqGXA<7h@EN zz2JX67ewy%Jr6~bbeRTRT+;y=l-s<=C0|U4e%HsG6EG-(B|IyNAkUwWI=I9u8-tl- z#yX|BG7dYwe{<`~lzINN4L@Jho#8oAuZu0{x55Y^?@~h)r~mQxi5&)G?d#s)p4>hz zfjFb8g34ucJY3wuVZWoB?sD+i4 zM80`?bO-LJwsB7WIqBJXoIypu;RD8hgj2rIFXilsfdb+lk>u453V!NAwVlTfyNhjE zJooY=rw?1wpPE+>d&*Ra9}TS{RCZbmx1qaxa+`K#UHb!{#GW;G;qjvA zP`k~w=bGe0%zD=unSBBq8lRj1E)MQ)W!-UXbZL89k&0}B^6iVs9noxgzfn>5zoLyr z=A=C?s0)Vn2T2hl=e+s|le}EbTgI?4`k>7P$GMM;iD?|m51&uEi$^RvOejf33AHs! 
z0u0M<4sZhyIL9&?zZCwo?VO7bZ^-$&<2aUD1lkwBvAEeQ=Zy}V3MROx=$lq z36d>CYT6hB`Cf!0MY3tEF_H*C4fgw4s<}wvH2`Q+AM+h(9i{qYIy$Nfe!2nm ze7tJf%ldc|e5ErYaJ>D6<6=URLrvHN*q=o{e3}tnNMTQRpVF(QcCLI@3+HduM7gJ|%C-5wY~o z`w2z=V-5P(+74jv_{ybDWVO0`gPzE}(KOAj|118R5dYH?Yw|9u_mrn!=$R7eJ+YH{ zN10~v1F_7E<9-BD&tu73Kn+c`x&e06&VnPy&iA?2kdBF5^|lX7&&M5i=wT2oCW~LZ z`|vcn!yJjU^S_gGDu3Ts+w`3m%sg)9A9{NA+HhYj;Ou>_rKTLVfs+dfEtfSQS8 zNWRv3pP6*Dnk!MMMd)SQq%y3SvCznY3<#5R<@_c@UJdO^jrAT>o2_}p-+LKO>fH>y zpd*h)MG(A)MSRYjC)b*C9Fr?=-T15?(1T`t@o198X?KMYx<#}|xt`h;B zgs9QOIA5-`gTh8J-a znJ1-YnoCMe4S)fY8F0?tnwWXGPGv{2s}B)b%gOdmt^fXkc6qL|+p6uddzC|h2L(hD zPR3rA^V`=ee|H@%E*VZ$q5;$xS8;QjNSsv`d!+q{c2uY(#0MnSTmNQBiekp5=n}1o z&4?98kBWbhCQvEmubp4BeL`Z^+v*Nw<))s&#=tpdu8LoIz9n|lE3i@7Xx!Tin~A>T zJ}%r|a&HS|op~M)w^b>xH^zNU^KuSW$yE|)BO~UiFh8c+$MLN8jGt+) zi;CC=^fV3eCq)lazZy!vjWnrov;-gsWq~>1b2)n1BoiVya0FnPrAz5_+G?fZ;o9lW z;EHWTo7b0O5kQfmX<}JAGF_qT3)%6>=phW`yuUJTHaXKsQD@g+=QT;GQ_DqAougzb zIatT}{RO&j=ooX>_Haz&YF6RkWXBSsYF%eX7o_9SkrJe_`0OQMebt`^ zIsMwjGyNM4Q^%SIo9I0v`pqYjzVWeK!RP~`5$bC+IW?oJh=WEF*c)Xk^htDlJo#8r zpJu~6Fh`|jyg}|^==s};w;ECh+-}fTW34BnL>fX;U#wHfy3JfUR`3DA>_Mu32xWRY zs)R_#YnKA_1ray8jy5ONYL%oCF?)>2!ce#XH^U4n7ptN2uBv3uDdee{T&<59`Bk{* zkhaOxPXR97_(w<0X^pG))QX(~if-=hSg<_>>!V#Ptg?bKck%>SU^z)99dm8JU;9>j zF1_AjZBjzaFhSUh&NOz#NEjmO!S|3=)s9WP$0Hf=;gOyzE!8=$eRO-J z(G_?KV}x!4V>h1L9z81x1ze07=h!E`wLlK-WT7L;%-m}rkm{?AP=)Qw={W5Q%x~(B zpd4GIdl4seim+=7IkbkH#h5yA;bF!=G@|c|Wtsac8}V^4McCh>|CC9*c#y$H;Kmu% zyjQ1Q7iZ#kl4^CtAp%eXtaomh8$I8>;IhJbL|qa8S`)Kd!}sIp$x?g&1%85q2(t{O!IV;?iDb?Dvc@9eTQ|_07Zmx5Qq|zV(UL;aQ+5H#lFc z>jgPrTAuUVH%K>qa7)^kl0t1}kw|QAS(~sF>^IQV&Fgi?14tW}RmuuZ?UMTz&`?Cg zdHLxQOtDfIv=9MEj9T4vly%abBe2>u8ltEAH~AsQB&&nr>#Pp@t}!g8rhRo%G~r%` zyiGRsTY36?JvrC8au%MQMySRxdDGuD;fB_K<7f&5sEwp(ct^NWm=aKxhz4-%Q-T)k z387$A0HcI#%KhEJGOSGrP62qqMMa*FLc%Xk;fIpk7ZWH?w2_yMibx1}ExQD!*unj^ zlRH`CH0L--r7(o@2z z2%3Hr<1&G%X^9~-Y1H&$u5~S%S^S|vC`o(2XH&PiOu1x1N&#S`-ZhpbBdK8}T}a-f_zQ+6M{(l=i#scxI0xy!;h2Zep#% 
z&UnUjE77#kGTei5{Vijn#+l+07@vT|+Z_OBklq_k4Ydy^$$bSyQ%d;apGG!r#PH2D zm&)uP=o0 zq#bR}C&_mKm-wrR&Qy|h0Zz?PO?jRz7r8*oOT_W3>mxJwkDiD1VZe^@-gD7!b}pYf z4rA=vii$<<^A^%e%(Xh+01pTG*-2zv42UR(X|eKtK9hu=V+Q8 zz+6$Lu$o{%r7_kU4q5QeZSmW0 z8v}Upho`^+eV{4X>;ksj?k0_UW?n;+)(IBkW}*h9>>vZnmb~KT*%=AIQyT#00SOc@;sK8qofYeeK&boM4I~ZaipS@lJ-7#+}lXb<&J5h!^7!i=&@l2iCT^ z_s~y`5$P6aqQvndQ;pT3-YFtPzwHREmvm}Iv?;M!TKv~y8s_j|w2ax`E6x*4@DVCc zo||p*kxaqpQPf%VNQSqU3nN!bAL7cbyHLYZ!!%@jAA%BEkAwGKwd@(JqHtl0r=mnK zF2SYcQ;WVRxC!)=RSdZz3%#JN)VLr4pl&`zY_c(9%H~_}AO^99wQQGl+5D`4UBDve z6ERmO+;mBymJY-%(wRTQA2VpjxS_282JVVTxa)*Kmse0t>6?H32V#bg-TSPEM+uTVnvpCZtDeQ)e#(! z6qh6_1~KUy!$wk~bGUikQf_`*QRMD1&71T4kOtt(AZeuh=~Y{RB? zB!iF!3Xcxp+vevK{IarxuAb3wzV_6M?xJLe$FWUL8L(7u_dHzQEY6!4bii+qmn?uX zR%UBo4t;%$j(a}*LJ14U;i!-(M{a(=QY!G!mB7alLr!vYpb(0=V7KFB`?8#4fm9&W0>mJ|FC|IiX)u(O z_c;7(OpvHs~g%Kl#Yn%BvTo!_dn1K$(Yq%(J{H zCha|_dhdx@Udka#79mbm?awAi<)L^l*PfbM;*RBTDZ5ahAwYBVi(L>HPb95j8qtbP zXNO>^AvVOutvn2*Ou|OzrCTx@;liqEi!JiToXh`%s|te9*}fCZ;Bruec_{_E3j0*- zalcgQPjibkJdgG27xpByY?Omj*Pv6s?YaBVY_vi+T*S&7K?I@8_fji%wp>NVKT_sG zb`!juqcIWUKWcIq#QGFyit`NiqFHM+n{06qIt8lwtd3;9a2YnzltR@H>3AM%{6RyD*U4+}eW{Q?*sD35V@ zyUQ0yloz6l4tbgdpj=ieX*~SzC%)YIN7fdP6d64tkeU|Oyz#5CfO{Pib*5Ytv5TAj zKa)Ks0Uczonza-Yt*)_gB?hB`w7od27i30CfaH Ae*gdg literal 0 HcmV?d00001