diff --git a/CHANGELOG.md b/CHANGELOG.md new file mode 100644 index 0000000..8a064cd --- /dev/null +++ b/CHANGELOG.md @@ -0,0 +1,24 @@ +# Changelog + +All notable changes to this project will be documented in this file. + +The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), +and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). + +## [Unreleased] + +Nothing there! + +## [0.1.0] - 2022-07-11 + +### Added + +- First official release +- Image for Spark 3.3.0 + Hadoop 3.3.3 + +### Modified + +- All images now based on ubi8/openjdk-8 +- Update of Google Spark Operator version (v1beta2-1.3.3-3.1.1) in the instructions +- Images renaming +- Removal of unneeded resources diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..f288702 --- /dev/null +++ b/LICENSE @@ -0,0 +1,674 @@ + GNU GENERAL PUBLIC LICENSE + Version 3, 29 June 2007 + + Copyright (C) 2007 Free Software Foundation, Inc. + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The GNU General Public License is a free, copyleft license for +software and other kinds of works. + + The licenses for most software and other practical works are designed +to take away your freedom to share and change the works. By contrast, +the GNU General Public License is intended to guarantee your freedom to +share and change all versions of a program--to make sure it remains free +software for all its users. We, the Free Software Foundation, use the +GNU General Public License for most of our software; it applies also to +any other work released this way by its authors. You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +them if you wish), that you receive source code or can get it if you +want it, that you can change the software or use pieces of it in new +free programs, and that you know you can do these things. + + To protect your rights, we need to prevent others from denying you +these rights or asking you to surrender the rights. Therefore, you have +certain responsibilities if you distribute copies of the software, or if +you modify it: responsibilities to respect the freedom of others. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must pass on to the recipients the same +freedoms that you received. You must make sure that they, too, receive +or can get the source code. And you must show them these terms so they +know their rights. + + Developers that use the GNU GPL protect your rights with two steps: +(1) assert copyright on the software, and (2) offer you this License +giving you legal permission to copy, distribute and/or modify it. + + For the developers' and authors' protection, the GPL clearly explains +that there is no warranty for this free software. For both users' and +authors' sake, the GPL requires that modified versions be marked as +changed, so that their problems will not be attributed erroneously to +authors of previous versions. + + Some devices are designed to deny users access to install or run +modified versions of the software inside them, although the manufacturer +can do so. This is fundamentally incompatible with the aim of +protecting users' freedom to change the software. 
The systematic +pattern of such abuse occurs in the area of products for individuals to +use, which is precisely where it is most unacceptable. Therefore, we +have designed this version of the GPL to prohibit the practice for those +products. If such problems arise substantially in other domains, we +stand ready to extend this provision to those domains in future versions +of the GPL, as needed to protect the freedom of users. + + Finally, every program is threatened constantly by software patents. +States should not allow patents to restrict development and use of +software on general-purpose computers, but in those that do, we wish to +avoid the special danger that patents applied to a free program could +make it effectively proprietary. To prevent this, the GPL assures that +patents cannot be used to render the program non-free. + + The precise terms and conditions for copying, distribution and +modification follow. + + TERMS AND CONDITIONS + + 0. Definitions. + + "This License" refers to version 3 of the GNU General Public License. + + "Copyright" also means copyright-like laws that apply to other kinds of +works, such as semiconductor masks. + + "The Program" refers to any copyrightable work licensed under this +License. Each licensee is addressed as "you". "Licensees" and +"recipients" may be individuals or organizations. + + To "modify" a work means to copy from or adapt all or part of the work +in a fashion requiring copyright permission, other than the making of an +exact copy. The resulting work is called a "modified version" of the +earlier work or a work "based on" the earlier work. + + A "covered work" means either the unmodified Program or a work based +on the Program. + + To "propagate" a work means to do anything with it that, without +permission, would make you directly or secondarily liable for +infringement under applicable copyright law, except executing it on a +computer or modifying a private copy. Propagation includes copying, +distribution (with or without modification), making available to the +public, and in some countries other activities as well. + + To "convey" a work means any kind of propagation that enables other +parties to make or receive copies. Mere interaction with a user through +a computer network, with no transfer of a copy, is not conveying. + + An interactive user interface displays "Appropriate Legal Notices" +to the extent that it includes a convenient and prominently visible +feature that (1) displays an appropriate copyright notice, and (2) +tells the user that there is no warranty for the work (except to the +extent that warranties are provided), that licensees may convey the +work under this License, and how to view a copy of this License. If +the interface presents a list of user commands or options, such as a +menu, a prominent item in the list meets this criterion. + + 1. Source Code. + + The "source code" for a work means the preferred form of the work +for making modifications to it. "Object code" means any non-source +form of a work. + + A "Standard Interface" means an interface that either is an official +standard defined by a recognized standards body, or, in the case of +interfaces specified for a particular programming language, one that +is widely used among developers working in that language. 
+ + The "System Libraries" of an executable work include anything, other +than the work as a whole, that (a) is included in the normal form of +packaging a Major Component, but which is not part of that Major +Component, and (b) serves only to enable use of the work with that +Major Component, or to implement a Standard Interface for which an +implementation is available to the public in source code form. A +"Major Component", in this context, means a major essential component +(kernel, window system, and so on) of the specific operating system +(if any) on which the executable work runs, or a compiler used to +produce the work, or an object code interpreter used to run it. + + The "Corresponding Source" for a work in object code form means all +the source code needed to generate, install, and (for an executable +work) run the object code and to modify the work, including scripts to +control those activities. However, it does not include the work's +System Libraries, or general-purpose tools or generally available free +programs which are used unmodified in performing those activities but +which are not part of the work. For example, Corresponding Source +includes interface definition files associated with source files for +the work, and the source code for shared libraries and dynamically +linked subprograms that the work is specifically designed to require, +such as by intimate data communication or control flow between those +subprograms and other parts of the work. + + The Corresponding Source need not include anything that users +can regenerate automatically from other parts of the Corresponding +Source. + + The Corresponding Source for a work in source code form is that +same work. + + 2. Basic Permissions. + + All rights granted under this License are granted for the term of +copyright on the Program, and are irrevocable provided the stated +conditions are met. This License explicitly affirms your unlimited +permission to run the unmodified Program. The output from running a +covered work is covered by this License only if the output, given its +content, constitutes a covered work. This License acknowledges your +rights of fair use or other equivalent, as provided by copyright law. + + You may make, run and propagate covered works that you do not +convey, without conditions so long as your license otherwise remains +in force. You may convey covered works to others for the sole purpose +of having them make modifications exclusively for you, or provide you +with facilities for running those works, provided that you comply with +the terms of this License in conveying all material for which you do +not control copyright. Those thus making or running the covered works +for you must do so exclusively on your behalf, under your direction +and control, on terms that prohibit them from making any copies of +your copyrighted material outside their relationship with you. + + Conveying under any other circumstances is permitted solely under +the conditions stated below. Sublicensing is not allowed; section 10 +makes it unnecessary. + + 3. Protecting Users' Legal Rights From Anti-Circumvention Law. + + No covered work shall be deemed part of an effective technological +measure under any applicable law fulfilling obligations under article +11 of the WIPO copyright treaty adopted on 20 December 1996, or +similar laws prohibiting or restricting circumvention of such +measures. 
+ + When you convey a covered work, you waive any legal power to forbid +circumvention of technological measures to the extent such circumvention +is effected by exercising rights under this License with respect to +the covered work, and you disclaim any intention to limit operation or +modification of the work as a means of enforcing, against the work's +users, your or third parties' legal rights to forbid circumvention of +technological measures. + + 4. Conveying Verbatim Copies. + + You may convey verbatim copies of the Program's source code as you +receive it, in any medium, provided that you conspicuously and +appropriately publish on each copy an appropriate copyright notice; +keep intact all notices stating that this License and any +non-permissive terms added in accord with section 7 apply to the code; +keep intact all notices of the absence of any warranty; and give all +recipients a copy of this License along with the Program. + + You may charge any price or no price for each copy that you convey, +and you may offer support or warranty protection for a fee. + + 5. Conveying Modified Source Versions. + + You may convey a work based on the Program, or the modifications to +produce it from the Program, in the form of source code under the +terms of section 4, provided that you also meet all of these conditions: + + a) The work must carry prominent notices stating that you modified + it, and giving a relevant date. + + b) The work must carry prominent notices stating that it is + released under this License and any conditions added under section + 7. This requirement modifies the requirement in section 4 to + "keep intact all notices". + + c) You must license the entire work, as a whole, under this + License to anyone who comes into possession of a copy. This + License will therefore apply, along with any applicable section 7 + additional terms, to the whole of the work, and all its parts, + regardless of how they are packaged. This License gives no + permission to license the work in any other way, but it does not + invalidate such permission if you have separately received it. + + d) If the work has interactive user interfaces, each must display + Appropriate Legal Notices; however, if the Program has interactive + interfaces that do not display Appropriate Legal Notices, your + work need not make them do so. + + A compilation of a covered work with other separate and independent +works, which are not by their nature extensions of the covered work, +and which are not combined with it such as to form a larger program, +in or on a volume of a storage or distribution medium, is called an +"aggregate" if the compilation and its resulting copyright are not +used to limit the access or legal rights of the compilation's users +beyond what the individual works permit. Inclusion of a covered work +in an aggregate does not cause this License to apply to the other +parts of the aggregate. + + 6. Conveying Non-Source Forms. + + You may convey a covered work in object code form under the terms +of sections 4 and 5, provided that you also convey the +machine-readable Corresponding Source under the terms of this License, +in one of these ways: + + a) Convey the object code in, or embodied in, a physical product + (including a physical distribution medium), accompanied by the + Corresponding Source fixed on a durable physical medium + customarily used for software interchange. 
+ + b) Convey the object code in, or embodied in, a physical product + (including a physical distribution medium), accompanied by a + written offer, valid for at least three years and valid for as + long as you offer spare parts or customer support for that product + model, to give anyone who possesses the object code either (1) a + copy of the Corresponding Source for all the software in the + product that is covered by this License, on a durable physical + medium customarily used for software interchange, for a price no + more than your reasonable cost of physically performing this + conveying of source, or (2) access to copy the + Corresponding Source from a network server at no charge. + + c) Convey individual copies of the object code with a copy of the + written offer to provide the Corresponding Source. This + alternative is allowed only occasionally and noncommercially, and + only if you received the object code with such an offer, in accord + with subsection 6b. + + d) Convey the object code by offering access from a designated + place (gratis or for a charge), and offer equivalent access to the + Corresponding Source in the same way through the same place at no + further charge. You need not require recipients to copy the + Corresponding Source along with the object code. If the place to + copy the object code is a network server, the Corresponding Source + may be on a different server (operated by you or a third party) + that supports equivalent copying facilities, provided you maintain + clear directions next to the object code saying where to find the + Corresponding Source. Regardless of what server hosts the + Corresponding Source, you remain obligated to ensure that it is + available for as long as needed to satisfy these requirements. + + e) Convey the object code using peer-to-peer transmission, provided + you inform other peers where the object code and Corresponding + Source of the work are being offered to the general public at no + charge under subsection 6d. + + A separable portion of the object code, whose source code is excluded +from the Corresponding Source as a System Library, need not be +included in conveying the object code work. + + A "User Product" is either (1) a "consumer product", which means any +tangible personal property which is normally used for personal, family, +or household purposes, or (2) anything designed or sold for incorporation +into a dwelling. In determining whether a product is a consumer product, +doubtful cases shall be resolved in favor of coverage. For a particular +product received by a particular user, "normally used" refers to a +typical or common use of that class of product, regardless of the status +of the particular user or of the way in which the particular user +actually uses, or expects or is expected to use, the product. A product +is a consumer product regardless of whether the product has substantial +commercial, industrial or non-consumer uses, unless such uses represent +the only significant mode of use of the product. + + "Installation Information" for a User Product means any methods, +procedures, authorization keys, or other information required to install +and execute modified versions of a covered work in that User Product from +a modified version of its Corresponding Source. The information must +suffice to ensure that the continued functioning of the modified object +code is in no case prevented or interfered with solely because +modification has been made. 
+ + If you convey an object code work under this section in, or with, or +specifically for use in, a User Product, and the conveying occurs as +part of a transaction in which the right of possession and use of the +User Product is transferred to the recipient in perpetuity or for a +fixed term (regardless of how the transaction is characterized), the +Corresponding Source conveyed under this section must be accompanied +by the Installation Information. But this requirement does not apply +if neither you nor any third party retains the ability to install +modified object code on the User Product (for example, the work has +been installed in ROM). + + The requirement to provide Installation Information does not include a +requirement to continue to provide support service, warranty, or updates +for a work that has been modified or installed by the recipient, or for +the User Product in which it has been modified or installed. Access to a +network may be denied when the modification itself materially and +adversely affects the operation of the network or violates the rules and +protocols for communication across the network. + + Corresponding Source conveyed, and Installation Information provided, +in accord with this section must be in a format that is publicly +documented (and with an implementation available to the public in +source code form), and must require no special password or key for +unpacking, reading or copying. + + 7. Additional Terms. + + "Additional permissions" are terms that supplement the terms of this +License by making exceptions from one or more of its conditions. +Additional permissions that are applicable to the entire Program shall +be treated as though they were included in this License, to the extent +that they are valid under applicable law. If additional permissions +apply only to part of the Program, that part may be used separately +under those permissions, but the entire Program remains governed by +this License without regard to the additional permissions. + + When you convey a copy of a covered work, you may at your option +remove any additional permissions from that copy, or from any part of +it. (Additional permissions may be written to require their own +removal in certain cases when you modify the work.) You may place +additional permissions on material, added by you to a covered work, +for which you have or can give appropriate copyright permission. 
+ + Notwithstanding any other provision of this License, for material you +add to a covered work, you may (if authorized by the copyright holders of +that material) supplement the terms of this License with terms: + + a) Disclaiming warranty or limiting liability differently from the + terms of sections 15 and 16 of this License; or + + b) Requiring preservation of specified reasonable legal notices or + author attributions in that material or in the Appropriate Legal + Notices displayed by works containing it; or + + c) Prohibiting misrepresentation of the origin of that material, or + requiring that modified versions of such material be marked in + reasonable ways as different from the original version; or + + d) Limiting the use for publicity purposes of names of licensors or + authors of the material; or + + e) Declining to grant rights under trademark law for use of some + trade names, trademarks, or service marks; or + + f) Requiring indemnification of licensors and authors of that + material by anyone who conveys the material (or modified versions of + it) with contractual assumptions of liability to the recipient, for + any liability that these contractual assumptions directly impose on + those licensors and authors. + + All other non-permissive additional terms are considered "further +restrictions" within the meaning of section 10. If the Program as you +received it, or any part of it, contains a notice stating that it is +governed by this License along with a term that is a further +restriction, you may remove that term. If a license document contains +a further restriction but permits relicensing or conveying under this +License, you may add to a covered work material governed by the terms +of that license document, provided that the further restriction does +not survive such relicensing or conveying. + + If you add terms to a covered work in accord with this section, you +must place, in the relevant source files, a statement of the +additional terms that apply to those files, or a notice indicating +where to find the applicable terms. + + Additional terms, permissive or non-permissive, may be stated in the +form of a separately written license, or stated as exceptions; +the above requirements apply either way. + + 8. Termination. + + You may not propagate or modify a covered work except as expressly +provided under this License. Any attempt otherwise to propagate or +modify it is void, and will automatically terminate your rights under +this License (including any patent licenses granted under the third +paragraph of section 11). + + However, if you cease all violation of this License, then your +license from a particular copyright holder is reinstated (a) +provisionally, unless and until the copyright holder explicitly and +finally terminates your license, and (b) permanently, if the copyright +holder fails to notify you of the violation by some reasonable means +prior to 60 days after the cessation. + + Moreover, your license from a particular copyright holder is +reinstated permanently if the copyright holder notifies you of the +violation by some reasonable means, this is the first time you have +received notice of violation of this License (for any work) from that +copyright holder, and you cure the violation prior to 30 days after +your receipt of the notice. + + Termination of your rights under this section does not terminate the +licenses of parties who have received copies or rights from you under +this License. 
If your rights have been terminated and not permanently +reinstated, you do not qualify to receive new licenses for the same +material under section 10. + + 9. Acceptance Not Required for Having Copies. + + You are not required to accept this License in order to receive or +run a copy of the Program. Ancillary propagation of a covered work +occurring solely as a consequence of using peer-to-peer transmission +to receive a copy likewise does not require acceptance. However, +nothing other than this License grants you permission to propagate or +modify any covered work. These actions infringe copyright if you do +not accept this License. Therefore, by modifying or propagating a +covered work, you indicate your acceptance of this License to do so. + + 10. Automatic Licensing of Downstream Recipients. + + Each time you convey a covered work, the recipient automatically +receives a license from the original licensors, to run, modify and +propagate that work, subject to this License. You are not responsible +for enforcing compliance by third parties with this License. + + An "entity transaction" is a transaction transferring control of an +organization, or substantially all assets of one, or subdividing an +organization, or merging organizations. If propagation of a covered +work results from an entity transaction, each party to that +transaction who receives a copy of the work also receives whatever +licenses to the work the party's predecessor in interest had or could +give under the previous paragraph, plus a right to possession of the +Corresponding Source of the work from the predecessor in interest, if +the predecessor has it or can get it with reasonable efforts. + + You may not impose any further restrictions on the exercise of the +rights granted or affirmed under this License. For example, you may +not impose a license fee, royalty, or other charge for exercise of +rights granted under this License, and you may not initiate litigation +(including a cross-claim or counterclaim in a lawsuit) alleging that +any patent claim is infringed by making, using, selling, offering for +sale, or importing the Program or any portion of it. + + 11. Patents. + + A "contributor" is a copyright holder who authorizes use under this +License of the Program or a work on which the Program is based. The +work thus licensed is called the contributor's "contributor version". + + A contributor's "essential patent claims" are all patent claims +owned or controlled by the contributor, whether already acquired or +hereafter acquired, that would be infringed by some manner, permitted +by this License, of making, using, or selling its contributor version, +but do not include claims that would be infringed only as a +consequence of further modification of the contributor version. For +purposes of this definition, "control" includes the right to grant +patent sublicenses in a manner consistent with the requirements of +this License. + + Each contributor grants you a non-exclusive, worldwide, royalty-free +patent license under the contributor's essential patent claims, to +make, use, sell, offer for sale, import and otherwise run, modify and +propagate the contents of its contributor version. + + In the following three paragraphs, a "patent license" is any express +agreement or commitment, however denominated, not to enforce a patent +(such as an express permission to practice a patent or covenant not to +sue for patent infringement). 
To "grant" such a patent license to a +party means to make such an agreement or commitment not to enforce a +patent against the party. + + If you convey a covered work, knowingly relying on a patent license, +and the Corresponding Source of the work is not available for anyone +to copy, free of charge and under the terms of this License, through a +publicly available network server or other readily accessible means, +then you must either (1) cause the Corresponding Source to be so +available, or (2) arrange to deprive yourself of the benefit of the +patent license for this particular work, or (3) arrange, in a manner +consistent with the requirements of this License, to extend the patent +license to downstream recipients. "Knowingly relying" means you have +actual knowledge that, but for the patent license, your conveying the +covered work in a country, or your recipient's use of the covered work +in a country, would infringe one or more identifiable patents in that +country that you have reason to believe are valid. + + If, pursuant to or in connection with a single transaction or +arrangement, you convey, or propagate by procuring conveyance of, a +covered work, and grant a patent license to some of the parties +receiving the covered work authorizing them to use, propagate, modify +or convey a specific copy of the covered work, then the patent license +you grant is automatically extended to all recipients of the covered +work and works based on it. + + A patent license is "discriminatory" if it does not include within +the scope of its coverage, prohibits the exercise of, or is +conditioned on the non-exercise of one or more of the rights that are +specifically granted under this License. You may not convey a covered +work if you are a party to an arrangement with a third party that is +in the business of distributing software, under which you make payment +to the third party based on the extent of your activity of conveying +the work, and under which the third party grants, to any of the +parties who would receive the covered work from you, a discriminatory +patent license (a) in connection with copies of the covered work +conveyed by you (or copies made from those copies), or (b) primarily +for and in connection with specific products or compilations that +contain the covered work, unless you entered into that arrangement, +or that patent license was granted, prior to 28 March 2007. + + Nothing in this License shall be construed as excluding or limiting +any implied license or other defenses to infringement that may +otherwise be available to you under applicable patent law. + + 12. No Surrender of Others' Freedom. + + If conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot convey a +covered work so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you may +not convey it at all. For example, if you agree to terms that obligate you +to collect a royalty for further conveying from those to whom you convey +the Program, the only way you could satisfy both those terms and this +License would be to refrain entirely from conveying the Program. + + 13. Use with the GNU Affero General Public License. 
+ + Notwithstanding any other provision of this License, you have +permission to link or combine any covered work with a work licensed +under version 3 of the GNU Affero General Public License into a single +combined work, and to convey the resulting work. The terms of this +License will continue to apply to the part which is the covered work, +but the special requirements of the GNU Affero General Public License, +section 13, concerning interaction through a network will apply to the +combination as such. + + 14. Revised Versions of this License. + + The Free Software Foundation may publish revised and/or new versions of +the GNU General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + + Each version is given a distinguishing version number. If the +Program specifies that a certain numbered version of the GNU General +Public License "or any later version" applies to it, you have the +option of following the terms and conditions either of that numbered +version or of any later version published by the Free Software +Foundation. If the Program does not specify a version number of the +GNU General Public License, you may choose any version ever published +by the Free Software Foundation. + + If the Program specifies that a proxy can decide which future +versions of the GNU General Public License can be used, that proxy's +public statement of acceptance of a version permanently authorizes you +to choose that version for the Program. + + Later license versions may give you additional or different +permissions. However, no additional obligations are imposed on any +author or copyright holder as a result of your choosing to follow a +later version. + + 15. Disclaimer of Warranty. + + THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY +APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT +HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY +OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, +THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM +IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF +ALL NECESSARY SERVICING, REPAIR OR CORRECTION. + + 16. Limitation of Liability. + + IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS +THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY +GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE +USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF +DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD +PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), +EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF +SUCH DAMAGES. + + 17. Interpretation of Sections 15 and 16. + + If the disclaimer of warranty and limitation of liability provided +above cannot be given local legal effect according to their terms, +reviewing courts shall apply local law that most closely approximates +an absolute waiver of all civil liability in connection with the +Program, unless a warranty or assumption of liability accompanies a +copy of the Program in return for a fee. 
+ + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +state the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. + + + Copyright (C) + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see . + +Also add information on how to contact you by electronic and paper mail. + + If the program does terminal interaction, make it output a short +notice like this when it starts in an interactive mode: + + Copyright (C) + This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'. + This is free software, and you are welcome to redistribute it + under certain conditions; type `show c' for details. + +The hypothetical commands `show w' and `show c' should show the appropriate +parts of the General Public License. Of course, your program's commands +might be different; for a GUI interface, you would use an "about box". + + You should also get your employer (if you work as a programmer) or school, +if any, to sign a "copyright disclaimer" for the program, if necessary. +For more information on this, and how to apply and follow the GNU GPL, see +. + + The GNU General Public License does not permit incorporating your program +into proprietary programs. If your program is a subroutine library, you +may consider it more useful to permit linking proprietary applications with +the library. If this is what you want to do, use the GNU Lesser General +Public License instead of this License. But first, please read +. diff --git a/OWNERS b/OWNERS new file mode 100644 index 0000000..a83ec44 --- /dev/null +++ b/OWNERS @@ -0,0 +1,8 @@ +# Each list is sorted alphabetically, additions should maintain that order +approvers: +- guimou +- wseaton + +reviewers: +- guimou +- wseaton \ No newline at end of file diff --git a/README.adoc b/README.adoc index dd43004..2ac8439 100644 --- a/README.adoc +++ b/README.adoc @@ -1,44 +1,73 @@ = Spark on OpenShift +:toc: -This repo provides tools and instructions for deploying and using Google Spark Operator on OpenShift, with custom images including S3 connectors, JMX exporter for monitoring,... +== General and Open Data Hub Information -== Custom Spark images +*Repository Description*: -We will first create the custom Spark images (link:https://github.com/bbenzikry/spark-eks[Credits]). +Instructions and tools for deploying and using the https://github.com/GoogleCloudPlatform/spark-on-k8s-operator[*Spark on Kubernetes Operator*] and the *Spark History Server* on OpenShift alongside Open Data Hub. 
The provided custom *Spark* images include S3 connectors and committers, and JMX exporter for monitoring. -From the spark-images folder: *Compatibility*: -NOTE: You can modify the Dockerfiles in the repo to change the Spark, Hadoop or other libraries versions. +- Tested with OpenShift 4.7, 4.8, 4.9, 4.10 +- Not tied to a specific Open Data Hub version, so it will work with any of them. -.Build the base Spark 3 image +== Custom Spark Images + +We include in this project custom Spark images with the Hadoop S3a connector to connect to S3-based object storage. Those images are based on https://catalog.redhat.com/software/containers/ubi8/openjdk-8/5dd6a48dbed8bd164a09589a[ubi8/openjdk-8], and are updated accordingly. + +Pre-built images can be found here, https://quay.io/repository/guimou/spark-odh and https://quay.io/repository/guimou/pyspark-odh, but you can also choose to build your own. + +Available images: + +* Spark images: + ** Spark 2.4.4 + Hadoop 2.8.5 + ** Spark 2.4.6 + Hadoop 3.3.0 + ** Spark 3.0.1 + Hadoop 3.3.0 + ** Spark 3.3.0 + Hadoop 3.3.3 +* PySpark images: + ** Spark 2.4.4 + Hadoop 2.8.5 + Python 3.6 + ** Spark 2.4.6 + Hadoop 3.3.0 + Python 3.6 + ** Spark 3.0.1 + Hadoop 3.3.0 + Python 3.8 + ** Spark 3.3.0 + Hadoop 3.3.3 + Python 3.9 + +=== Manually building custom Spark images + +In the `spark-images` folder you will find the sources for the pre-built images. You can build them again, or use them as templates for your own images, should you want to change library versions or add other libraries. Pay attention to the slight differences in the Dockerfiles depending on the version of Spark, Hadoop or Python you want to install. + +For example, from the `spark-images` folder: + +.Build the latest base Spark 3.3.0 + Hadoop 3.3.3 image [source,bash] ---- -docker build --file spark3.Dockerfile --tag spark-odh:s3.0.1-h3.3.0_v0.0.1 . +podman build --file spark-3.3.0_hadoop-3.3.3.Dockerfile --tag spark-odh:s3.3.0-h3.3.3_v0.0.1 . ---- -.(Optional) Push the image to your repo +.(Optional) Tag and Push the image to your repo [source,bash] ---- -docker tag spark-odh:s3.0.1-h3.3.0_v0.0.1 your_repo/spark-odh:s3.0.1-h3.3.0_v0.0.1 -docker push your_repo/spark-odh:s3.0.1-h3.3.0_v0.0.1 +podman tag spark-odh:s3.3.0-h3.3.3_v0.0.1 your_repo/spark-odh:s3.3.0-h3.3.3_v0.0.1 +podman push your_repo/spark-odh:s3.3.0-h3.3.3_v0.0.1 ---- +You can also extend a Spark image with Python to create a PySpark-compatible image: + .Build the PySpark image [source,bash] ---- -docker build --file pyspark.Dockerfile --tag pyspark-odh:s3.0.1-h3.3.0_v0.0.1 --build-arg base_img=spark-odh:s3.0.1-h3.3.0_v0.0.1 . +podman build --file pyspark.Dockerfile --tag pyspark-odh:s3.3.0-h3.3.3_v0.0.1 --build-arg base_img=spark-odh:s3.3.0-h3.3.3_v0.0.1 . ---- -.(Optional) Push the image to your repo +.(Optional) Tag and Push the image to your repo [source,bash] ---- -docker tag pyspark-odh:s3.0.1-h3.3.0_v0.0.1 quay.io/guimou/pyspark-odh:s3.0.1-h3.3.0_v0.0.1 -docker push quay.io/guimou/pyspark-odh:s3.0.1-h3.3.0_v0.0.1 +podman tag pyspark-odh:s3.3.0-h3.3.3_v0.0.1 your_repo/pyspark-odh:s3.3.0-h3.3.3_v0.0.1 +podman push your_repo/pyspark-odh:s3.3.0-h3.3.3_v0.0.1 ---- -== Spark operator installation +== Spark Operator Installation -=== Namespace for the operator +=== Namespace The operator will be installed in its own namespace but will be able to monitor all namespaces for jobs to be launched.
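If the namespace does not exist yet, a minimal sketch of creating it is shown below; the `spark-operator` name is the one used throughout this document, and note that the Helm install command further down can also create it for you through `--create-namespace`.

[source,bash]
----
# Create the project/namespace that will host the Spark operator
oc new-project spark-operator
----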
@@ -52,7 +81,7 @@ NOTE: From now on all the `oc` commands are supposed to be run in the context of === Operator -We will use the standard version of the Google Spark Operator. +We can use the standard Spark on Kubernetes Operator at its latest version. .Add the helm repo [source,bash] ---- @@ -66,13 +95,13 @@ helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s- helm install spark-operator spark-operator/spark-operator --namespace spark-operator --create-namespace --set image.tag=v1beta2-1.3.3-3.1.1 --set webhook.enable=true --set resourceQuotaEnforcement.enable=true ---- -=== Monitoring +=== Spark Monitoring (Optional) -We will monitor the Spark operator itself, as well as the applications it creates. +With Prometheus and Grafana, we can monitor the Spark operator itself, as well as the applications it creates. Note that this is only monitoring and reporting on the workload metrics. To get information about the Spark Jobs themselves you need to deploy the Spark History Server (see below). -NOTE: Prerequisites: Prometheus and Grafana must be installed in your environment. The easiest way is to use the operators. An instance of Grafana must be created so that the ServiceAccount is provisioned. +NOTE: *Prerequisites*: Prometheus and Grafana must be installed in your environment. The easiest way is to use their respective operators. Deploy the operators in the `spark-operator` namespace, and create simple instances of Prometheus and Grafana. -From spark-operator folder: +From the `spark-operator` folder: .Create the two Services that will expose the metrics [source,bash] ---- @@ -93,7 +122,7 @@ oc apply -f spark-service-monitor.yaml oc apply -f prometheus-datasource.yaml ---- -NOTE: We will need another datasource to retrieve base CPU and RAM metrics. To do that we'll connect to the "main" OpenShift Prometheus through the following procedure. +NOTE: We also need another datasource to retrieve base CPU and RAM metrics from Prometheus. To do that we will connect to the "main" OpenShift Prometheus with the following procedure. .Grant the Grafana Service Account the cluster-monitoring-view cluster role: [source,bash] ---- @@ -107,7 +136,7 @@ oc adm policy add-cluster-role-to-user cluster-monitoring-view -z grafana-servic export BEARER_TOKEN=$(oc serviceaccounts get-token grafana-serviceaccount) ---- -Deploy `main-prometheus-datasource.yaml` file with the BEARER_TOKEN value. +Deploy the `main-prometheus-datasource.yaml` file with the `BEARER_TOKEN` value. .Create the "main" Prometheus Datasource [source,bash] ---- @@ -122,9 +151,13 @@ oc apply -f spark-operator-dashboard.yaml oc apply -f spark-application-dashboard.yaml ---- -=== Service Account and Role for another namespace +=== Use Spark operator from another namespace (Optional) -The operator creates a special Service Account and Role to create pods and services in the namespace where it is deployed. If you want to create SparkApplication or ScheduledSparkApplication objects in another namespace, you first have to create an account, role and rolebinding into it. This is this ServiceAccount that you need to use for your all the Spark applications in this namespace. +The operator creates a special Service Account and a Role to create pods and services in the namespace where it is deployed. + +If you want to create SparkApplication or ScheduledSparkApplication objects in another namespace, you first have to create an account, a role and a rolebinding in it.
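For illustration only, the sketch below shows roughly equivalent `oc` commands; the ServiceAccount name, Role name and the verb/resource lists are assumptions made for this example, and the repository's `spark-rbac.yaml` applied below remains the reference definition.

[source,bash]
----
# Illustrative sketch: a ServiceAccount for the Spark applications, plus a Role and
# RoleBinding letting it manage the pods, services and configmaps created by the jobs
oc create serviceaccount spark -n YOUR_NAMESPACE
oc create role spark-role --verb=get,list,watch,create,delete --resource=pods,services,configmaps -n YOUR_NAMESPACE
oc create rolebinding spark-role-binding --role=spark-role --serviceaccount=YOUR_NAMESPACE:spark -n YOUR_NAMESPACE
----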
+ +This *ServiceAccount* is the one you need to use for all your Spark applications in this specific namespace. From the `spark-operator` folder, while in the target namespace (`oc project YOUR_NAMESPACE`): @@ -136,11 +169,21 @@ oc apply -f spark-rbac.yaml == Spark History Server -All the following commands are executed from the `spark-history-server` folder. +The operator only creates ephemeral workloads. So unless you look at the logs in real time, you will lose all related information after the workload is finished. + +To avoid losing this precious information, you can (and you should!) send all the logs to a specific location, and set up the Spark History Server to be able to view and interpret them at any time. -=== Object storage +The logs location has to be shared storage that all pods can access simultaneously, such as Object Storage (S3), Hadoop (HDFS), or NFS. -We will use object storage to store the logs data from the Spark jobs. We will first need to create a bucket. +For this setup we will be using Object Storage from OpenShift Data Foundation. + +NOTE: All the following commands are executed from the `spark-history-server` folder. + +=== Object storage bucket + +First, create a dedicated bucket to store the logs from the Spark jobs. + +Again, here we are using an Object Bucket Claim from OpenShift Data Foundation, which will create a bucket using the Multi-Cloud Gateway. Please adapt this depending on your chosen storage solution. .Create the OBC [source,bash] ---- @@ -148,21 +191,23 @@ oc apply -f spark-hs-obc.yaml ---- -IMPORTANT: The Spark/Hadoop instances cannot log direclty into a bucket. A "folder" must exist where the logs will be sent. We will trick Spark/Hadoop into creating this folder by uploading a hidden file to the location we want this folder. +IMPORTANT: The Spark/Hadoop instances cannot log directly into an empty bucket. A "folder" must exist where the logs will be sent. We will help Spark/Hadoop create this folder by uploading an empty hidden file to the location where we want this folder. -Retrieve the Access and Secret Key from the Secret named `obc-spark-history-server`, the name of the bucket from the ConfigMap named `obc-spark-history-server` as well as the Route to the S3 storage (you may have to create it to access the RGW, default S3 Route in ODF points to MCG). +Retrieve the Access and Secret Key from the Secret named `obc-spark-history-server`, the name of the bucket from the ConfigMap named `obc-spark-history-server`, as well as the Route to the S3 storage. -.Upload any small file, to the bucket (here using the AWS CLI) +.Upload a small file to the bucket (here using the https://aws.amazon.com/cli/[AWS CLI]) [source,bash] ---- export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY -aws --endpoint-url YOUR_ROUTE_TO_S3 s3 cp YOUR_FILE s3://YOUR_BUCKET_NAME/logs-dir/.s3keep +export S3_ROUTE=YOUR_ROUTE_TO_S3 +export BUCKET_NAME=YOUR_BUCKET_NAME +aws --endpoint-url $S3_ROUTE s3 cp .s3keep s3://$BUCKET_NAME/logs-dir/.s3keep ---- -Renaming this file `.s3keep` will mark it as hidden from from the History Server and Spark logging mechanism perspective, but the "folder" will appear as being present, making everyone happy! +Naming this file `.s3keep` will mark it as hidden from the perspective of the History Server and the Spark logging mechanism, but the "folder" will appear as being present, making everyone happy!
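For the "Retrieve the Access and Secret Key" step above, the values can be read back with standard `oc` commands; this is only a sketch, assuming the default data keys exposed by an Object Bucket Claim's Secret and ConfigMap.

[source,bash]
----
# Read the S3 credentials and bucket name generated for the OBC
oc get secret obc-spark-history-server -o jsonpath='{.data.AWS_ACCESS_KEY_ID}' | base64 -d
oc get secret obc-spark-history-server -o jsonpath='{.data.AWS_SECRET_ACCESS_KEY}' | base64 -d
oc get configmap obc-spark-history-server -o jsonpath='{.data.BUCKET_NAME}'
----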
-In the spark-history-server folder you will find an empty `.s3keep` file that you can already use. +You will find an empty `.s3keep` file that you can already use in the `spark-history-server` folder. === History Server deployment @@ -175,18 +220,17 @@ We can now create the service account, Role, RoleBonding, Service, Route and Dep oc apply -f spark-hs-deployment.yaml ---- -The UI is not accessible through the Route that was created, named `spark-history-server` - +The UI of the Spark History Server is now accessible through the Route that was created, named `spark-history-server`. -== Usage +== Spark Operator Usage -We can do a quick test/demo with the standard word count example from Shakespeare's sonnets. +A quick test/demo can be done with the standard word count example from Shakespeare's sonnets. -=== Object storage +=== Input data -We'll create a bucket using and ObjectBucketClaim, and populate it with the data. +Create a bucket using an Object Bucket Claim and populate it with the data. -NOTE: This OBC creates a bucket in the RGW from an OpenShift Data Foundation deployment. Adapt the instructions depending on your S3 provider. +NOTE: This OBC creates a bucket with the MCG from an OpenShift Data Foundation deployment. Adapt the instructions depending on your S3 provider. From the `test` folder: @@ -196,19 +240,21 @@ From the `test` folder: oc apply -f obc.yaml ---- -Retrieve the Access and Secret Key from the Secret named `spark-demo`, the name of the bucket from the ConfigMap named `spark-demo` as well as the Route to the S3 storage (you may have to create it to access the RGW, default S3 Route in ODF points to MCG). +Retrieve the Access and Secret Key from the Secret named `spark-demo`, the name of the bucket from the ConfigMap named `spark-demo` as well as the Route to the S3 storage. -.Upload the data, the file `shakespeare.txt`, to the bucket (here using the AWS CLI) +.Upload the data (the file `shakespeare.txt`) to the bucket [source,bash] ---- export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY -aws --endpoint-url YOUR_ROUTE_TO_S3 s3 cp shakespeare.txt s3://YOUR_BUCKET_NAME/shakespeare.txt +export S3_ROUTE=YOUR_ROUTE_TO_S3 +export BUCKET_NAME=YOUR_BUCKET_NAME +aws --endpoint-url $S3_ROUTE s3 cp shakespeare.txt s3://$BUCKET_NAME/shakespeare.txt ---- TIP: If your endpoint is using a self-signed certificate, you can add `--no-verify-ssl` to the command. -Our application file is `wordcount.py` that you can find in the folder. To make it accessible to the Spark Application, it is packaged as data inside a Config Map. This CM will be mounted as a Volume inside our Spark Application YAML definition. +Our application file is `wordcount.py`, which you can find in the folder. To make it accessible to the Spark Application, we will package it as data inside a Config Map. This CM will be mounted as a Volume inside our Spark Application YAML definition. .Create the application Config Map [source,bash] ---- @@ -220,33 +266,33 @@ oc apply -f wordcount_configmap.yaml We are now ready to launch our Spark Job using the SparkApplication CRD from the operator. Our YAML definition will: -* Use the application file (wordcount.py) from the ConfigMap mounted as a volume -* Inject the Endpoint, Bucket, Access and Secret Keys inside the containers definition so that the driver and the workers can retrieve the data to process it +* Use the application file (wordcount.py) from the ConfigMap mounted as a volume in the Spark Operator, the driver and the executors.
+* Inject the Endpoint, Bucket, Access and Secret Keys inside the containers definition so that the driver and the workers can retrieve the data to process it. -.Launch the Spark Job +.Launch the Spark Job (use the YAML file corresponding to the version you want to test) [source,bash] ---- -oc apply -f spark_app_shakespeare.yaml +oc apply -f spark_app_shakespeare_version-to-test.yaml ---- If you look at the OpenShift UI you will see the driver, then the workers spawning. They will execute the program, then terminate. -image::test/app_deployment.png[App deployment] +image::doc/img/app_deployment.png[App deployment] You can now retrieve the results: .List folder content [source,bash] ---- -aws --endpoint-url YOUR_ROUTE_TO_S3 s3 ls s3://YOUR_BUCKET_NAME/ +aws --endpoint-url $S3_ROUTE s3 ls s3://$BUCKET_NAME/ ---- You will see that the results have been saved in a location called `sorted_count_timestamp`. -.Retrieve the results +.Retrieve the results (replace `timestamp` with the right value) [source,bash] ---- -aws --endpoint-url YOUR_ROUTE_TO_S3 s3 cp s3://YOUR_BUCKET_NAME/sorted_counts_timestamp ./ --recursive +aws --endpoint-url $S3_ROUTE s3 cp s3://$BUCKET_NAME/sorted_counts_timestamp ./ --recursive ---- There should be different files: .... ---- -So the sorted list of all the words with their occurences in the full text. +This is the sorted list of all the words with their occurrences in the full text. -While a job is running you can also have a look at the Grafana dashboards for something like this: +While a job is running, you can also have a look at the Grafana dashboards we created for monitoring. They will look like this: -image::test/spark_operator_dashboard.png[Dashboard] +image::doc/img/spark_operator_dashboard.png[Dashboard] === History Server Test -We'll now log the output from the job using our history server. +We will now run the same job, but log the output using our history server. Have a look at the YAML file to see how this is configured. + +To send the logs to the history server bucket, you have to modify the `sparkconf` section starting at line 9. Replace the values for YOUR_BUCKET, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY with the corresponding values for the history server bucket. -.Launch the Spark Job [source,bash] ---- -oc apply -f spark_app_shakespeare_history_server.yaml ----- \ No newline at end of file +.Launch the Spark Job (use the YAML file corresponding to the version you want to test) +oc apply -f spark_app_shakespeare_version-to-test_history_server.yaml +---- + +If you go to the history server URL, you now have access to all the logs and nice dashboards like this one for the different workloads you have run. + +image::doc/img/history_server.png[History Server] + +== References + +There are endless configurations, settings and tweaks you can use with Spark. On top of the standard documentation, here are some documents you will find useful to get the most out of Spark on OpenShift. + +- https://towardsdatascience.com/apache-spark-with-kubernetes-and-fast-s3-access-27e64eb14e0f[Spark on Kubernetes with details on the S3 committers]. +- https://01.org/blogs/hualongf/2021/introduction-s3a-ceph-big-data-workloads[Spark optimization for S3 storage]. +- https://cloud.redhat.com/blog/getting-started-running-spark-workloads-on-openshift[Detailed walkthrough and code for running TPC-DS benchmark with Spark on OpenShift]. Lots of useful configuration information to interact with the storage.
\ No newline at end of file diff --git a/test/app_deployment.png b/doc/img/app_deployment.png similarity index 100% rename from test/app_deployment.png rename to doc/img/app_deployment.png diff --git a/doc/img/history_server.png b/doc/img/history_server.png new file mode 100644 index 0000000..d03c1e3 Binary files /dev/null and b/doc/img/history_server.png differ diff --git a/test/spark_operator_dashboard.png b/doc/img/spark_operator_dashboard.png similarity index 100% rename from test/spark_operator_dashboard.png rename to doc/img/spark_operator_dashboard.png diff --git a/spark-history-server/spark-hs-deployment.yaml b/spark-history-server/spark-hs-deployment.yaml index ed48417..401e4b6 100644 --- a/spark-history-server/spark-hs-deployment.yaml +++ b/spark-history-server/spark-hs-deployment.yaml @@ -1,4 +1,4 @@ -### Adpated from Helm Chart: https://artifacthub.io/packages/helm/spot/spark-history-server +### Adapted from Helm Chart: https://artifacthub.io/packages/helm/spot/spark-history-server # Source: spark-history-server/templates/serviceaccount.yaml apiVersion: v1 kind: ServiceAccount diff --git a/spark-history-server/spark-hs-obc.yaml b/spark-history-server/spark-hs-obc.yaml index 0a7650b..84d5429 100644 --- a/spark-history-server/spark-hs-obc.yaml +++ b/spark-history-server/spark-hs-obc.yaml @@ -4,4 +4,4 @@ metadata: name: obc-spark-history-server spec: generateBucketName: obc-spark-history-server - storageClassName: ocs-storagecluster-ceph-rgw \ No newline at end of file + storageClassName: openshift-storage.noobaa.io \ No newline at end of file diff --git a/spark-images/pyspark-2.4.4_hadoop-2.8.5.Dockerfile b/spark-images/pyspark-2.4.4_hadoop-2.8.5.Dockerfile new file mode 100644 index 0000000..18bf3ec --- /dev/null +++ b/spark-images/pyspark-2.4.4_hadoop-2.8.5.Dockerfile @@ -0,0 +1,37 @@ +# Note: Spark 2.4.4 supports Python up to 3.7 only +# As 3.7 is not available in the ubi8 images, we will install Python 3.6 + +ARG base_img + +FROM $base_img + +EXPOSE 8080 + +ENV PYTHON_VERSION=3.6 \ + PATH=$HOME/.local/bin/:$PATH \ + PYTHONUNBUFFERED=1 \ + PYTHONIOENCODING=UTF-8 \ + LC_ALL=en_US.UTF-8 \ + LANG=en_US.UTF-8 \ + CNB_STACK_ID=com.redhat.stacks.ubi8-python-36 \ + CNB_USER_ID=1001 \ + CNB_GROUP_ID=0 \ + PIP_NO_CACHE_DIR=off + +USER 0 + +RUN INSTALL_PKGS="python36 python36-devel python3-virtualenv python3-setuptools python3-pip \ + nss_wrapper httpd httpd-devel mod_ssl mod_auth_gssapi \ + mod_ldap mod_session atlas-devel gcc-gfortran libffi-devel \ + libtool-ltdl enchant" && \ + microdnf -y module enable python36:3.6 httpd:2.4 && \ + microdnf -y --setopt=tsflags=nodocs install $INSTALL_PKGS && \ + microdnf -y clean all --enablerepo='*' && \ + ln -s /usr/bin/python3 /usr/bin/python + +ENV PYTHONPATH ${SPARK_HOME}/python/lib/pyspark.zip:${SPARK_HOME}/python/lib/py4j-*.zip + +WORKDIR /opt/spark/work-dir +ENTRYPOINT [ "/opt/entrypoint.sh" ] + +USER 185 diff --git a/spark-images/pyspark-2.4.6_hadoop-3.3.0.Dockerfile b/spark-images/pyspark-2.4.6_hadoop-3.3.0.Dockerfile new file mode 100644 index 0000000..95d99fa --- /dev/null +++ b/spark-images/pyspark-2.4.6_hadoop-3.3.0.Dockerfile @@ -0,0 +1,37 @@ +# Note: Spark 2.4.6 supports Python up to 3.7 only +# As 3.7 is not available in the ubi8 images, we will install Python 3.6 + +ARG base_img + +FROM $base_img + +EXPOSE 8080 + +ENV PYTHON_VERSION=3.6 \ + PATH=$HOME/.local/bin/:$PATH \ + PYTHONUNBUFFERED=1 \ + PYTHONIOENCODING=UTF-8 \ + LC_ALL=en_US.UTF-8 \ + LANG=en_US.UTF-8 \ + CNB_STACK_ID=com.redhat.stacks.ubi8-python-36 \ + CNB_USER_ID=1001 
\ + CNB_GROUP_ID=0 \ + PIP_NO_CACHE_DIR=off + +USER 0 + +RUN INSTALL_PKGS="python36 python36-devel python3-virtualenv python3-setuptools python3-pip \ + nss_wrapper httpd httpd-devel mod_ssl mod_auth_gssapi \ + mod_ldap mod_session atlas-devel gcc-gfortran libffi-devel \ + libtool-ltdl enchant" && \ + microdnf -y module enable python36:3.6 httpd:2.4 && \ + microdnf -y --setopt=tsflags=nodocs install $INSTALL_PKGS && \ + microdnf -y clean all --enablerepo='*' && \ + ln -s /usr/bin/python3 /usr/bin/python + +ENV PYTHONPATH ${SPARK_HOME}/python/lib/pyspark.zip:${SPARK_HOME}/python/lib/py4j-*.zip + +WORKDIR /opt/spark/work-dir +ENTRYPOINT [ "/opt/entrypoint.sh" ] + +USER 185 diff --git a/spark-images/pyspark-3.0.1_hadoop-3.3.0.Dockerfile b/spark-images/pyspark-3.0.1_hadoop-3.3.0.Dockerfile new file mode 100644 index 0000000..97ac23e --- /dev/null +++ b/spark-images/pyspark-3.0.1_hadoop-3.3.0.Dockerfile @@ -0,0 +1,35 @@ +# Note: Spark 3.0.1 supports Python up to 3.8, this will be the installed version + +ARG base_img + +FROM $base_img + +EXPOSE 8080 + +ENV PYTHON_VERSION=3.8 \ + PATH=$HOME/.local/bin/:$PATH \ + PYTHONUNBUFFERED=1 \ + PYTHONIOENCODING=UTF-8 \ + LC_ALL=en_US.UTF-8 \ + LANG=en_US.UTF-8 \ + CNB_STACK_ID=com.redhat.stacks.ubi8-python-38 \ + CNB_USER_ID=1001 \ + CNB_GROUP_ID=0 \ + PIP_NO_CACHE_DIR=off + +USER 0 + +RUN INSTALL_PKGS="python38 python38-devel python38-setuptools python38-pip nss_wrapper \ + httpd httpd-devel mod_ssl mod_auth_gssapi mod_ldap \ + mod_session atlas-devel gcc-gfortran libffi-devel libtool-ltdl enchant" && \ + microdnf -y module enable python38:3.8 httpd:2.4 && \ + microdnf -y --setopt=tsflags=nodocs install $INSTALL_PKGS && \ + microdnf -y clean all --enablerepo='*' && \ + ln -s /usr/bin/python3 /usr/bin/python + +ENV PYTHONPATH ${SPARK_HOME}/python/lib/pyspark.zip:${SPARK_HOME}/python/lib/py4j-*.zip + +WORKDIR /opt/spark/work-dir +ENTRYPOINT [ "/opt/entrypoint.sh" ] + +USER 185 diff --git a/spark-images/pyspark-3.3.0_hadoop-3.3.3.Dockerfile b/spark-images/pyspark-3.3.0_hadoop-3.3.3.Dockerfile new file mode 100644 index 0000000..e96d428 --- /dev/null +++ b/spark-images/pyspark-3.3.0_hadoop-3.3.3.Dockerfile @@ -0,0 +1,35 @@ +# Note: Spark 3.3.0 supports Python up to 3.10, but we will install 3.9 + +ARG base_img + +FROM $base_img + +EXPOSE 8080 + +ENV PYTHON_VERSION=3.9 \ + PATH=$HOME/.local/bin/:$PATH \ + PYTHONUNBUFFERED=1 \ + PYTHONIOENCODING=UTF-8 \ + LC_ALL=en_US.UTF-8 \ + LANG=en_US.UTF-8 \ + CNB_STACK_ID=com.redhat.stacks.ubi8-python-39 \ + CNB_USER_ID=1001 \ + CNB_GROUP_ID=0 \ + PIP_NO_CACHE_DIR=off + +USER 0 + +RUN INSTALL_PKGS="python39 python39-devel python39-setuptools python39-pip nss_wrapper \ + httpd httpd-devel mod_ssl mod_auth_gssapi mod_ldap \ + mod_session atlas-devel gcc-gfortran libffi-devel libtool-ltdl enchant" && \ + microdnf -y module enable python39:3.9 httpd:2.4 && \ + microdnf -y --setopt=tsflags=nodocs install $INSTALL_PKGS && \ + microdnf -y clean all --enablerepo='*' && \ + ln -s /usr/bin/python3 /usr/bin/python + +ENV PYTHONPATH ${SPARK_HOME}/python/lib/pyspark.zip:${SPARK_HOME}/python/lib/py4j-*.zip + +WORKDIR /opt/spark/work-dir +ENTRYPOINT [ "/opt/entrypoint.sh" ] + +USER 185 diff --git a/spark-images/pyspark.Dockerfile b/spark-images/pyspark.Dockerfile deleted file mode 100644 index 129974f..0000000 --- a/spark-images/pyspark.Dockerfile +++ /dev/null @@ -1,32 +0,0 @@ -ARG base_img - -ARG BUILD_DATE -ARG VCS_REF - -FROM $base_img -LABEL org.label-schema.build-date=$BUILD_DATE \ - org.label-schema.vcs-ref=$VCS_REF \ - 
org.label-schema.schema-version="1.0.0"\ - org.label-schema.version="0.0.1" - -WORKDIR / - -# Reset to root to run installation tasks -USER 0 - -RUN apt-get update && \ - apt install -y python python-pip && \ - apt install -y python3 python3-pip && \ - # We remove ensurepip since it adds no functionality since pip is - # installed on the image and it just takes up 1.6MB on the image - rm -r /usr/lib/python*/ensurepip && \ - pip install --upgrade pip setuptools && \ - pip3 install --upgrade pip setuptools && \ - # You may install with python3 packages by using pip3.6 - # Removed the .cache to save space - rm -r /root/.cache && rm -rf /var/cache/apt/* - -ENV PYTHONPATH ${SPARK_HOME}/python/lib/pyspark.zip:${SPARK_HOME}/python/lib/py4j-*.zip - -WORKDIR /opt/spark/work-dir -ENTRYPOINT [ "/opt/entrypoint.sh" ] diff --git a/spark-images/spark2.4.4_hdp2.8.5.Dockerfile b/spark-images/spark-2.4.4_hadoop-2.8.5.Dockerfile similarity index 71% rename from spark-images/spark2.4.4_hdp2.8.5.Dockerfile rename to spark-images/spark-2.4.4_hadoop-2.8.5.Dockerfile index 0a529bb..f4b929f 100644 --- a/spark-images/spark2.4.4_hdp2.8.5.Dockerfile +++ b/spark-images/spark-2.4.4_hadoop-2.8.5.Dockerfile @@ -1,13 +1,21 @@ -FROM openjdk:8-jdk-alpine AS builder +FROM registry.access.redhat.com/ubi8/openjdk-8:1.13 AS builder # set desired spark, hadoop and kubernetes client versions ARG spark_version=2.4.4 ARG hadoop_version=2.8.5 ARG kubernetes_client_version=4.6.4 ARG jmx_prometheus_javaagent_version=0.15.0 -ARG aws_java_sdk_version=1.11.682 +ARG aws_java_sdk_version=1.12.255 ARG spark_uid=185 +USER 0 + +WORKDIR / + +# Install gzip to extract archives +RUN microdnf install -y gzip && \ + microdnf clean all + # Download Spark ADD https://archive.apache.org/dist/spark/spark-${spark_version}/spark-${spark_version}-bin-without-hadoop.tgz . # Unzip Spark @@ -35,12 +43,19 @@ ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-model/${kubernetes_clie RUN chmod 0644 jars/kubernetes-*.jar -# Install aws-java-sdk +# Delete old aws-java-sdk and replace with newer version WORKDIR /hadoop/share/hadoop/tools/lib +RUN rm -f ./aws-java-sdk-*.jar ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${aws_java_sdk_version}/aws-java-sdk-bundle-${aws_java_sdk_version}.jar . 
RUN chmod 0644 aws-java-sdk-bundle*.jar -FROM openjdk:8-jdk-alpine as final +FROM registry.access.redhat.com/ubi8/openjdk-8:1.13 as final + +# Fix for https://issues.redhat.com/browse/OPENJDK-335 +ENV NSS_WRAPPER_PASSWD= +ENV NSS_WRAPPER_GROUP= + +USER 0 WORKDIR /opt/spark @@ -54,18 +69,33 @@ COPY --from=builder /hadoop /opt/hadoop # Copy Prometheus jars from builder stage COPY --from=builder /prometheus /prometheus +# Add an init process, check the checksum to make sure it's a match +RUN set -e ; \ + TINI_BIN=""; \ + TINI_SHA256=""; \ + TINI_VERSION="v0.19.0"; \ + case "$(arch)" in \ + x86_64) \ + TINI_BIN="tini-amd64"; \ + TINI_SHA256="93dcc18adc78c65a028a84799ecf8ad40c936fdfc5f2a57b1acda5a8117fa82c"; \ + ;; \ + aarch64) \ + TINI_BIN="tini-arm64"; \ + TINI_SHA256="07952557df20bfd2a95f9bef198b445e006171969499a1d361bd9e6f8e5e0e81"; \ + ;; \ + *) \ + echo >&2 ; echo >&2 "Unsupported architecture \$(arch)" ; echo >&2 ; exit 1 ; \ + ;; \ + esac ; \ + curl --retry 8 -S -L -O "https://github.com/krallin/tini/releases/download/${TINI_VERSION}/${TINI_BIN}" ; \ + echo "${TINI_SHA256} ${TINI_BIN}" | sha256sum -c - ; \ + mv "${TINI_BIN}" /usr/sbin/tini ; \ + chmod +x /usr/sbin/tini + RUN set -ex && \ - apk upgrade --no-cache && \ - ln -s /lib /lib64 && \ - apk add --no-cache bash tini libc6-compat linux-pam nss && \ mkdir -p /opt/spark && \ mkdir -p /opt/spark/work-dir && \ - touch /opt/spark/RELEASE && \ - rm /bin/sh && \ - ln -sv /bin/bash /bin/sh && \ - echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \ - chgrp root /etc/passwd && chmod ug+rw /etc/passwd && \ - rm -rf /var/cache/apt/* + touch /opt/spark/RELEASE # Configure environment variables for spark ENV SPARK_HOME /opt/spark diff --git a/spark-images/spark2.Dockerfile b/spark-images/spark-2.4.6_hadoop-3.3.0.Dockerfile similarity index 66% rename from spark-images/spark2.Dockerfile rename to spark-images/spark-2.4.6_hadoop-3.3.0.Dockerfile index d23aa33..7d5379f 100644 --- a/spark-images/spark2.Dockerfile +++ b/spark-images/spark-2.4.6_hadoop-3.3.0.Dockerfile @@ -1,13 +1,21 @@ -FROM openjdk:8-jdk-slim AS builder +FROM registry.access.redhat.com/ubi8/openjdk-8:1.13 AS builder # set desired spark, hadoop and kubernetes client versions ARG spark_version=2.4.6 ARG hadoop_version=3.3.0 ARG kubernetes_client_version=4.6.4 ARG jmx_prometheus_javaagent_version=0.15.0 -ARG aws_java_sdk_version=1.11.797 +ARG aws_java_sdk_version=1.12.255 ARG spark_uid=185 +USER 0 + +WORKDIR / + +# Install gzip to extract archives +RUN microdnf install -y gzip && \ + microdnf clean all + # Download Spark ADD https://archive.apache.org/dist/spark/spark-${spark_version}/spark-${spark_version}-bin-without-hadoop.tgz . # Unzip Spark @@ -15,7 +23,7 @@ RUN tar -xvzf spark-${spark_version}-bin-without-hadoop.tgz RUN mv spark-${spark_version}-bin-without-hadoop spark # Download Hadoop -ADD https://downloads.apache.org/hadoop/common/hadoop-${hadoop_version}/hadoop-${hadoop_version}.tar.gz . +ADD https://archive.apache.org/dist/hadoop/common/hadoop-${hadoop_version}/hadoop-${hadoop_version}.tar.gz . 
# Unzip Hadoop RUN tar -xvzf hadoop-${hadoop_version}.tar.gz RUN mv hadoop-${hadoop_version} hadoop @@ -35,13 +43,20 @@ ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-model/${kubernetes_clie RUN chmod 0644 jars/kubernetes-*.jar -# Delete old aws-java-sdk and replace with newer version that supports IRSA/iam-oidc +# Delete old aws-java-sdk and replace with newer version WORKDIR /hadoop/share/hadoop/tools/lib -RUN rm ./aws-java-sdk-bundle-*.jar +RUN rm -f ./aws-java-sdk-*.jar ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${aws_java_sdk_version}/aws-java-sdk-bundle-${aws_java_sdk_version}.jar . -RUN chmod 0644 aws-java-sdk-bundle*.jar +RUN chmod 0644 aws-java-sdk*.jar -FROM openjdk:8-jdk-slim as final +### Build final image +FROM registry.access.redhat.com/ubi8/openjdk-8:1.13 as final + +# Fix for https://issues.redhat.com/browse/OPENJDK-335 +ENV NSS_WRAPPER_PASSWD= +ENV NSS_WRAPPER_GROUP= + +USER 0 WORKDIR /opt/spark @@ -55,19 +70,33 @@ COPY --from=builder /hadoop /opt/hadoop # Copy Prometheus jars from builder stage COPY --from=builder /prometheus /prometheus +# Add an init process, check the checksum to make sure it's a match +RUN set -e ; \ + TINI_BIN=""; \ + TINI_SHA256=""; \ + TINI_VERSION="v0.19.0"; \ + case "$(arch)" in \ + x86_64) \ + TINI_BIN="tini-amd64"; \ + TINI_SHA256="93dcc18adc78c65a028a84799ecf8ad40c936fdfc5f2a57b1acda5a8117fa82c"; \ + ;; \ + aarch64) \ + TINI_BIN="tini-arm64"; \ + TINI_SHA256="07952557df20bfd2a95f9bef198b445e006171969499a1d361bd9e6f8e5e0e81"; \ + ;; \ + *) \ + echo >&2 ; echo >&2 "Unsupported architecture \$(arch)" ; echo >&2 ; exit 1 ; \ + ;; \ + esac ; \ + curl --retry 8 -S -L -O "https://github.com/krallin/tini/releases/download/${TINI_VERSION}/${TINI_BIN}" ; \ + echo "${TINI_SHA256} ${TINI_BIN}" | sha256sum -c - ; \ + mv "${TINI_BIN}" /usr/bin/tini ; \ + chmod +x /usr/bin/tini + RUN set -ex && \ - sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' /etc/apt/sources.list && \ - apt-get update && \ - ln -s /lib /lib64 && \ - apt install -y bash tini libc6 libpam-modules \ - # krb5-user \ - libnss3 procps && \ - touch /opt/spark/RELEASE && \ - rm /bin/sh && \ - ln -sv /bin/bash /bin/sh && \ - echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \ - chgrp root /etc/passwd && chmod ug+rw /etc/passwd && \ - rm -rf /var/cache/apt/* + mkdir -p /opt/spark && \ + mkdir -p /opt/spark/work-dir && \ + touch /opt/spark/RELEASE # Configure environment variables for spark ENV SPARK_HOME /opt/spark diff --git a/spark-images/spark2-hdp2.7.4.Dockerfile b/spark-images/spark-3.0.1_hadoop-3.3.0.Dockerfile similarity index 57% rename from spark-images/spark2-hdp2.7.4.Dockerfile rename to spark-images/spark-3.0.1_hadoop-3.3.0.Dockerfile index c74f482..a217cca 100644 --- a/spark-images/spark2-hdp2.7.4.Dockerfile +++ b/spark-images/spark-3.0.1_hadoop-3.3.0.Dockerfile @@ -1,12 +1,20 @@ -FROM openjdk:8-jdk-slim AS builder +FROM registry.access.redhat.com/ubi8/openjdk-8:1.13 AS builder -# set desired spark, hadoop and kubernetes client versions -ARG spark_version=2.4.6 -ARG hadoop_version=2.7.4 -ARG kubernetes_client_version=4.6.4 +# Build options +ARG spark_version=3.0.1 +ARG hadoop_version=3.3.0 ARG jmx_prometheus_javaagent_version=0.15.0 -ARG aws_java_sdk_version=1.7.4 ARG spark_uid=185 +# Spark's Guava version to match with Hadoop's +ARG guava_version=27.0-jre + +USER 0 + +WORKDIR / + +# Install gzip to extract archives +RUN microdnf install -y gzip && \ + microdnf clean all # Download Spark ADD 
https://archive.apache.org/dist/spark/spark-${spark_version}/spark-${spark_version}-bin-without-hadoop.tgz . @@ -26,21 +34,19 @@ RUN rm -rf hadoop/share/doc ADD https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/${jmx_prometheus_javaagent_version}/jmx_prometheus_javaagent-${jmx_prometheus_javaagent_version}.jar /prometheus/ RUN chmod 0644 prometheus/jmx_prometheus_javaagent*.jar -# Delete old spark kubernetes client jars and replace them with newer version -WORKDIR /spark -RUN rm ./jars/kubernetes-*.jar -ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-model-common/${kubernetes_client_version}/kubernetes-model-common-${kubernetes_client_version}.jar jars/ -ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-client/${kubernetes_client_version}/kubernetes-client-${kubernetes_client_version}.jar jars/ -ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-model/${kubernetes_client_version}/kubernetes-model-${kubernetes_client_version}.jar jars/ +# Add updated Guava +WORKDIR /spark/jars +RUN rm -f guava-*.jar +ADD https://repo1.maven.org/maven2/com/google/guava/guava/${guava_version}/guava-${guava_version}.jar . -RUN chmod 0644 jars/kubernetes-*.jar +### Build final image +FROM registry.access.redhat.com/ubi8/openjdk-8:1.13 as final -# Delete old aws-java-sdk and replace with newer version that supports IRSA/iam-oidc -WORKDIR /hadoop/share/hadoop/tools/lib -ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar . -RUN chmod 0644 aws-java-sdk*.jar +# Fix for https://issues.redhat.com/browse/OPENJDK-335 +ENV NSS_WRAPPER_PASSWD= +ENV NSS_WRAPPER_GROUP= -FROM openjdk:8-jdk-slim as final +USER 0 WORKDIR /opt/spark @@ -54,19 +60,33 @@ COPY --from=builder /hadoop /opt/hadoop # Copy Prometheus jars from builder stage COPY --from=builder /prometheus /prometheus +# Add an init process, check the checksum to make sure it's a match +RUN set -e ; \ + TINI_BIN=""; \ + TINI_SHA256=""; \ + TINI_VERSION="v0.19.0"; \ + case "$(arch)" in \ + x86_64) \ + TINI_BIN="tini-amd64"; \ + TINI_SHA256="93dcc18adc78c65a028a84799ecf8ad40c936fdfc5f2a57b1acda5a8117fa82c"; \ + ;; \ + aarch64) \ + TINI_BIN="tini-arm64"; \ + TINI_SHA256="07952557df20bfd2a95f9bef198b445e006171969499a1d361bd9e6f8e5e0e81"; \ + ;; \ + *) \ + echo >&2 ; echo >&2 "Unsupported architecture \$(arch)" ; echo >&2 ; exit 1 ; \ + ;; \ + esac ; \ + curl --retry 8 -S -L -O "https://github.com/krallin/tini/releases/download/${TINI_VERSION}/${TINI_BIN}" ; \ + echo "${TINI_SHA256} ${TINI_BIN}" | sha256sum -c - ; \ + mv "${TINI_BIN}" /usr/bin/tini ; \ + chmod +x /usr/bin/tini + RUN set -ex && \ - sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' /etc/apt/sources.list && \ - apt-get update && \ - ln -s /lib /lib64 && \ - apt install -y bash tini libc6 libpam-modules \ - # krb5-user \ - libnss3 procps && \ - touch /opt/spark/RELEASE && \ - rm /bin/sh && \ - ln -sv /bin/bash /bin/sh && \ - echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \ - chgrp root /etc/passwd && chmod ug+rw /etc/passwd && \ - rm -rf /var/cache/apt/* + mkdir -p /opt/spark && \ + mkdir -p /opt/spark/work-dir && \ + touch /opt/spark/RELEASE # Configure environment variables for spark ENV SPARK_HOME /opt/spark diff --git a/spark-images/spark-3.3.0_hadoop-3.3.3.Dockerfile b/spark-images/spark-3.3.0_hadoop-3.3.3.Dockerfile new file mode 100644 index 0000000..c1a30cd --- /dev/null +++ b/spark-images/spark-3.3.0_hadoop-3.3.3.Dockerfile @@ -0,0 +1,115 @@ +FROM registry.access.redhat.com/ubi8/openjdk-8:1.13 AS builder 
+ +# Build options +ARG spark_version=3.3.0 +ARG hadoop_version=3.3.3 +ARG jmx_prometheus_javaagent_version=0.17.0 +ARG spark_uid=185 +# Spark's Guava version to match with Hadoop's +ARG guava_version=27.0-jre + +USER 0 + +WORKDIR / + +# Install gzip to extract archives +RUN microdnf install -y gzip && \ + microdnf clean all + +# Download Spark +ADD https://archive.apache.org/dist/spark/spark-${spark_version}/spark-${spark_version}-bin-without-hadoop.tgz . +# Unzip Spark +RUN tar -xvzf spark-${spark_version}-bin-without-hadoop.tgz +RUN mv spark-${spark_version}-bin-without-hadoop spark + +# Download Hadoop +ADD https://archive.apache.org/dist/hadoop/common/hadoop-${hadoop_version}/hadoop-${hadoop_version}.tar.gz . +# Unzip Hadoop +RUN tar -xvzf hadoop-${hadoop_version}.tar.gz +RUN mv hadoop-${hadoop_version} hadoop +# Delete unnecessary hadoop documentation +RUN rm -rf hadoop/share/doc + +# Download JMX Prometheus javaagent jar +ADD https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/${jmx_prometheus_javaagent_version}/jmx_prometheus_javaagent-${jmx_prometheus_javaagent_version}.jar /prometheus/ +RUN chmod 0644 prometheus/jmx_prometheus_javaagent*.jar + +# Add updated Guava +WORKDIR /spark/jars +RUN rm -f guava-*.jar +ADD https://repo1.maven.org/maven2/com/google/guava/guava/${guava_version}/guava-${guava_version}.jar . + +# Add Spark Hadoop Cloud to interact with cloud infrastructures +ADD https://repo1.maven.org/maven2/org/apache/spark/spark-hadoop-cloud_2.12/${spark_version}/spark-hadoop-cloud_2.12-${spark_version}.jar . + +### Build final image +FROM registry.access.redhat.com/ubi8/openjdk-8:1.13 as final + +# Fix for https://issues.redhat.com/browse/OPENJDK-335 +ENV NSS_WRAPPER_PASSWD= +ENV NSS_WRAPPER_GROUP= + +USER 0 + +WORKDIR /opt/spark + +# Copy Spark from builder stage +COPY --from=builder /spark /opt/spark +COPY --from=builder /spark/kubernetes/dockerfiles/spark/entrypoint.sh /opt + +# Copy Hadoop from builder stage +COPY --from=builder /hadoop /opt/hadoop + +# Copy Prometheus jars from builder stage +COPY --from=builder /prometheus /prometheus + +# Add an init process, check the checksum to make sure it's a match +RUN set -e ; \ + TINI_BIN=""; \ + TINI_SHA256=""; \ + TINI_VERSION="v0.19.0"; \ + case "$(arch)" in \ + x86_64) \ + TINI_BIN="tini-amd64"; \ + TINI_SHA256="93dcc18adc78c65a028a84799ecf8ad40c936fdfc5f2a57b1acda5a8117fa82c"; \ + ;; \ + aarch64) \ + TINI_BIN="tini-arm64"; \ + TINI_SHA256="07952557df20bfd2a95f9bef198b445e006171969499a1d361bd9e6f8e5e0e81"; \ + ;; \ + *) \ + echo >&2 ; echo >&2 "Unsupported architecture \$(arch)" ; echo >&2 ; exit 1 ; \ + ;; \ + esac ; \ + curl --retry 8 -S -L -O "https://github.com/krallin/tini/releases/download/${TINI_VERSION}/${TINI_BIN}" ; \ + echo "${TINI_SHA256} ${TINI_BIN}" | sha256sum -c - ; \ + mv "${TINI_BIN}" /usr/bin/tini ; \ + chmod +x /usr/bin/tini + +RUN set -ex && \ + mkdir -p /opt/spark && \ + mkdir -p /opt/spark/work-dir && \ + touch /opt/spark/RELEASE + +# Configure environment variables for spark +ENV SPARK_HOME /opt/spark + +ENV HADOOP_HOME /opt/hadoop + +ENV 
SPARK_DIST_CLASSPATH="$HADOOP_HOME/etc/hadoop:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/yarn:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar:$HADOOP_HOME/share/hadoop/tools/lib/*" + +ENV SPARK_EXTRA_CLASSPATH="$SPARK_DIST_CLASSPATH" + +ENV LD_LIBRARY_PATH /lib64 + +# Set spark workdir +WORKDIR /opt/spark/work-dir +RUN chmod g+w /opt/spark/work-dir + +RUN mkdir -p /etc/metrics/conf +COPY conf/metrics.properties /etc/metrics/conf +COPY conf/prometheus.yaml /etc/metrics/conf + +ENTRYPOINT [ "/opt/entrypoint.sh" ] + +USER ${spark_uid} \ No newline at end of file diff --git a/spark-images/spark3.Dockerfile b/spark-images/spark3.Dockerfile deleted file mode 100644 index 3fab71d..0000000 --- a/spark-images/spark3.Dockerfile +++ /dev/null @@ -1,101 +0,0 @@ -ARG spark_uid=185 -FROM openjdk:8-jdk-slim AS builder - -# Build options -ARG spark_version=3.0.1 -ARG scala_version=2.12 -ARG hive_version=2.3.7 -ARG hadoop_version=3.3.0 -ARG hadoop_major_version=3 -ARG aws_java_sdk_version=1.11.797 -ARG jmx_prometheus_javaagent_version=0.15.0 - -WORKDIR / - -# Download JMX Prometheus javaagent jar -ADD https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/${jmx_prometheus_javaagent_version}/jmx_prometheus_javaagent-${jmx_prometheus_javaagent_version}.jar /prometheus/ -RUN chmod 0644 prometheus/jmx_prometheus_javaagent*.jar - -WORKDIR / -# Get pre-compiled spark build -ADD https://github.com/bbenzikry/spark-glue/releases/download/${spark_version}/spark-${spark_version}-bin-hadoop-provided-glue.tgz . -RUN tar -xvzf spark-${spark_version}-bin-hadoop-provided-glue.tgz -RUN mv spark-${spark_version}-bin-hadoop-provided-glue spark - - -# Hadoop -ADD https://downloads.apache.org/hadoop/common/hadoop-${hadoop_version}/hadoop-${hadoop_version}.tar.gz . -RUN tar -xvzf hadoop-${hadoop_version}.tar.gz -RUN mv hadoop-${hadoop_version} hadoop - -# Delete unnecessary hadoop documentation -RUN rm -rf hadoop/share/doc - -WORKDIR /spark/jars - -# Add updated guava -RUN rm -f jars/guava-14.0.1.jar -ADD https://repo1.maven.org/maven2/com/google/guava/guava/23.0/guava-23.0.jar . - -# Hadoop-cloud for S3A commiters -ADD https://github.com/bbenzikry/spark-glue/releases/download/${spark_version}/spark-hadoop-cloud_2.12-${spark_version}.jar . - -# Add GCS and BQ just in case -ADD https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop${hadoop_major_version}.jar . -ADD https://storage.googleapis.com/spark-lib/bigquery/spark-bigquery-latest_2.12.jar . -RUN chmod 0644 guava-23.0.jar spark-bigquery-latest_2.12.jar gcs-connector-latest-hadoop${hadoop_major_version}.jar spark-hadoop-cloud_2.12-${spark_version}.jar - -# Updated AWS for IRSA -WORKDIR /hadoop/share/hadoop/tools/lib -RUN rm ./aws-java-sdk-bundle-*.jar -ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${aws_java_sdk_version}/aws-java-sdk-bundle-${aws_java_sdk_version}.jar . 
-RUN chmod 0644 aws-java-sdk-bundle*.jar - -FROM openjdk:8-jdk-slim as final -LABEL org.opencontainers.image.created=$BUILD_DATE \ - org.opencontainers.image.version=$spark_version \ - org.opencontainers.image.title="Spark ${spark_version} for OpenShift" \ - org.opencontainers.image.description="Spark ${spark_version} built for working with OpenShift" - -# Copy spark + glue + hadoop from builder stage -COPY --from=builder /spark /opt/spark -COPY --from=builder /spark/kubernetes/dockerfiles/spark/entrypoint.sh /opt - -# Hadoop -COPY --from=builder /hadoop /opt/hadoop -# Copy Prometheus jars from builder stage -COPY --from=builder /prometheus /prometheus - -RUN set -ex && \ - sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' /etc/apt/sources.list && \ - apt-get update && \ - ln -s /lib /lib64 && \ - apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps && \ - mkdir -p /opt/spark && \ - mkdir -p /opt/spark/examples && \ - mkdir -p /opt/spark/work-dir && \ - touch /opt/spark/RELEASE && \ - rm /bin/sh && \ - ln -sv /bin/bash /bin/sh && \ - echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \ - chgrp root /etc/passwd && chmod ug+rw /etc/passwd && \ - rm -rf /var/cache/apt/* - -ENV SPARK_HOME /opt/spark -ENV HADOOP_HOME /opt/hadoop -ENV SPARK_DIST_CLASSPATH="$HADOOP_HOME/etc/hadoop:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/yarn:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar:$HADOOP_HOME/share/hadoop/tools/lib/*" -ENV SPARK_EXTRA_CLASSPATH="$SPARK_DIST_CLASSPATH" -ENV LD_LIBRARY_PATH /lib64 - -WORKDIR /opt/spark/work-dir -RUN chmod g+w /opt/spark/work-dir -# RUN chmod a+x /opt/decom.sh - -RUN mkdir -p /etc/metrics/conf -COPY conf/metrics.properties /etc/metrics/conf -COPY conf/prometheus.yaml /etc/metrics/conf - -ENTRYPOINT [ "/opt/entrypoint.sh" ] - -# Specify the User that the actual main process will run as -USER ${spark_uid} diff --git a/spark-operator/source/Dockerfile b/spark-operator/source/Dockerfile deleted file mode 100644 index 4247607..0000000 --- a/spark-operator/source/Dockerfile +++ /dev/null @@ -1,4 +0,0 @@ -FROM gcr.io/spark-operator/spark-operator:v1beta2-1.2.3-3.1.1 - -# add s3a connector -COPY *.jar $SPARK_HOME/jars/ diff --git a/spark-operator/source/activation-1.1.jar b/spark-operator/source/activation-1.1.jar deleted file mode 100644 index 53f82a1..0000000 Binary files a/spark-operator/source/activation-1.1.jar and /dev/null differ diff --git a/spark-operator/source/apacheds-i18n-2.0.0-M15.jar b/spark-operator/source/apacheds-i18n-2.0.0-M15.jar deleted file mode 100644 index 11dccb3..0000000 Binary files a/spark-operator/source/apacheds-i18n-2.0.0-M15.jar and /dev/null differ diff --git a/spark-operator/source/apacheds-kerberos-codec-2.0.0-M15.jar b/spark-operator/source/apacheds-kerberos-codec-2.0.0-M15.jar deleted file mode 100644 index 82e564e..0000000 Binary files a/spark-operator/source/apacheds-kerberos-codec-2.0.0-M15.jar and /dev/null differ diff --git a/spark-operator/source/api-asn1-api-1.0.0-M20.jar b/spark-operator/source/api-asn1-api-1.0.0-M20.jar deleted file mode 100644 index 68dee3a..0000000 Binary files a/spark-operator/source/api-asn1-api-1.0.0-M20.jar and /dev/null differ diff --git a/spark-operator/source/api-util-1.0.0-M20.jar 
b/spark-operator/source/api-util-1.0.0-M20.jar deleted file mode 100644 index ff9780e..0000000 Binary files a/spark-operator/source/api-util-1.0.0-M20.jar and /dev/null differ diff --git a/spark-operator/source/asm-3.1.jar b/spark-operator/source/asm-3.1.jar deleted file mode 100644 index 8217cae..0000000 Binary files a/spark-operator/source/asm-3.1.jar and /dev/null differ diff --git a/spark-operator/source/avro-1.7.4.jar b/spark-operator/source/avro-1.7.4.jar deleted file mode 100644 index 69dd87d..0000000 Binary files a/spark-operator/source/avro-1.7.4.jar and /dev/null differ diff --git a/spark-operator/source/aws-java-sdk-1.7.4.jar b/spark-operator/source/aws-java-sdk-1.7.4.jar deleted file mode 100644 index 02233a8..0000000 Binary files a/spark-operator/source/aws-java-sdk-1.7.4.jar and /dev/null differ diff --git a/spark-operator/source/commons-beanutils-1.7.0.jar b/spark-operator/source/commons-beanutils-1.7.0.jar deleted file mode 100644 index b1b89c9..0000000 Binary files a/spark-operator/source/commons-beanutils-1.7.0.jar and /dev/null differ diff --git a/spark-operator/source/commons-beanutils-core-1.8.0.jar b/spark-operator/source/commons-beanutils-core-1.8.0.jar deleted file mode 100644 index 87c15f4..0000000 Binary files a/spark-operator/source/commons-beanutils-core-1.8.0.jar and /dev/null differ diff --git a/spark-operator/source/commons-cli-1.2.jar b/spark-operator/source/commons-cli-1.2.jar deleted file mode 100644 index ce4b9ff..0000000 Binary files a/spark-operator/source/commons-cli-1.2.jar and /dev/null differ diff --git a/spark-operator/source/commons-codec-1.4.jar b/spark-operator/source/commons-codec-1.4.jar deleted file mode 100644 index 458d432..0000000 Binary files a/spark-operator/source/commons-codec-1.4.jar and /dev/null differ diff --git a/spark-operator/source/commons-collections-3.2.2.jar b/spark-operator/source/commons-collections-3.2.2.jar deleted file mode 100644 index fa5df82..0000000 Binary files a/spark-operator/source/commons-collections-3.2.2.jar and /dev/null differ diff --git a/spark-operator/source/commons-compress-1.4.1.jar b/spark-operator/source/commons-compress-1.4.1.jar deleted file mode 100644 index b58761e..0000000 Binary files a/spark-operator/source/commons-compress-1.4.1.jar and /dev/null differ diff --git a/spark-operator/source/commons-configuration-1.6.jar b/spark-operator/source/commons-configuration-1.6.jar deleted file mode 100644 index 2d4689a..0000000 Binary files a/spark-operator/source/commons-configuration-1.6.jar and /dev/null differ diff --git a/spark-operator/source/commons-digester-1.8.jar b/spark-operator/source/commons-digester-1.8.jar deleted file mode 100644 index 1110f0a..0000000 Binary files a/spark-operator/source/commons-digester-1.8.jar and /dev/null differ diff --git a/spark-operator/source/commons-httpclient-3.1.jar b/spark-operator/source/commons-httpclient-3.1.jar deleted file mode 100644 index 7c59774..0000000 Binary files a/spark-operator/source/commons-httpclient-3.1.jar and /dev/null differ diff --git a/spark-operator/source/commons-io-2.4.jar b/spark-operator/source/commons-io-2.4.jar deleted file mode 100644 index 90035a4..0000000 Binary files a/spark-operator/source/commons-io-2.4.jar and /dev/null differ diff --git a/spark-operator/source/commons-lang-2.6.jar b/spark-operator/source/commons-lang-2.6.jar deleted file mode 100644 index 98467d3..0000000 Binary files a/spark-operator/source/commons-lang-2.6.jar and /dev/null differ diff --git a/spark-operator/source/commons-logging-1.1.3.jar 
b/spark-operator/source/commons-logging-1.1.3.jar deleted file mode 100644 index ab51254..0000000 Binary files a/spark-operator/source/commons-logging-1.1.3.jar and /dev/null differ diff --git a/spark-operator/source/commons-math3-3.1.1.jar b/spark-operator/source/commons-math3-3.1.1.jar deleted file mode 100644 index 43b5215..0000000 Binary files a/spark-operator/source/commons-math3-3.1.1.jar and /dev/null differ diff --git a/spark-operator/source/commons-net-3.1.jar b/spark-operator/source/commons-net-3.1.jar deleted file mode 100644 index b75f1a5..0000000 Binary files a/spark-operator/source/commons-net-3.1.jar and /dev/null differ diff --git a/spark-operator/source/curator-client-2.7.1.jar b/spark-operator/source/curator-client-2.7.1.jar deleted file mode 100644 index 2d48056..0000000 Binary files a/spark-operator/source/curator-client-2.7.1.jar and /dev/null differ diff --git a/spark-operator/source/curator-framework-2.7.1.jar b/spark-operator/source/curator-framework-2.7.1.jar deleted file mode 100644 index 2e9eca6..0000000 Binary files a/spark-operator/source/curator-framework-2.7.1.jar and /dev/null differ diff --git a/spark-operator/source/curator-recipes-2.7.1.jar b/spark-operator/source/curator-recipes-2.7.1.jar deleted file mode 100644 index 9fa0856..0000000 Binary files a/spark-operator/source/curator-recipes-2.7.1.jar and /dev/null differ diff --git a/spark-operator/source/gson-2.2.4.jar b/spark-operator/source/gson-2.2.4.jar deleted file mode 100644 index 75fe27c..0000000 Binary files a/spark-operator/source/gson-2.2.4.jar and /dev/null differ diff --git a/spark-operator/source/guava-11.0.2.jar b/spark-operator/source/guava-11.0.2.jar deleted file mode 100644 index c8c8d5d..0000000 Binary files a/spark-operator/source/guava-11.0.2.jar and /dev/null differ diff --git a/spark-operator/source/hadoop-annotations-2.7.4.jar b/spark-operator/source/hadoop-annotations-2.7.4.jar deleted file mode 100644 index fe4896e..0000000 Binary files a/spark-operator/source/hadoop-annotations-2.7.4.jar and /dev/null differ diff --git a/spark-operator/source/hadoop-auth-2.7.4.jar b/spark-operator/source/hadoop-auth-2.7.4.jar deleted file mode 100644 index 0c6f32a..0000000 Binary files a/spark-operator/source/hadoop-auth-2.7.4.jar and /dev/null differ diff --git a/spark-operator/source/hadoop-aws-2.7.4.jar b/spark-operator/source/hadoop-aws-2.7.4.jar deleted file mode 100644 index 92c0db4..0000000 Binary files a/spark-operator/source/hadoop-aws-2.7.4.jar and /dev/null differ diff --git a/spark-operator/source/hadoop-common-2.7.4.jar b/spark-operator/source/hadoop-common-2.7.4.jar deleted file mode 100644 index a750064..0000000 Binary files a/spark-operator/source/hadoop-common-2.7.4.jar and /dev/null differ diff --git a/spark-operator/source/htrace-core-3.1.0-incubating.jar b/spark-operator/source/htrace-core-3.1.0-incubating.jar deleted file mode 100644 index 6f03925..0000000 Binary files a/spark-operator/source/htrace-core-3.1.0-incubating.jar and /dev/null differ diff --git a/spark-operator/source/httpclient-4.2.jar b/spark-operator/source/httpclient-4.2.jar deleted file mode 100644 index b6d4c1e..0000000 Binary files a/spark-operator/source/httpclient-4.2.jar and /dev/null differ diff --git a/spark-operator/source/httpcore-4.1.2.jar b/spark-operator/source/httpcore-4.1.2.jar deleted file mode 100644 index 66ae90b..0000000 Binary files a/spark-operator/source/httpcore-4.1.2.jar and /dev/null differ diff --git a/spark-operator/source/jackson-annotations-2.2.3.jar 
b/spark-operator/source/jackson-annotations-2.2.3.jar deleted file mode 100644 index b62c87d..0000000 Binary files a/spark-operator/source/jackson-annotations-2.2.3.jar and /dev/null differ diff --git a/spark-operator/source/jackson-core-2.2.3.jar b/spark-operator/source/jackson-core-2.2.3.jar deleted file mode 100644 index 24318a4..0000000 Binary files a/spark-operator/source/jackson-core-2.2.3.jar and /dev/null differ diff --git a/spark-operator/source/jackson-core-asl-1.9.13.jar b/spark-operator/source/jackson-core-asl-1.9.13.jar deleted file mode 100644 index bb4fe1d..0000000 Binary files a/spark-operator/source/jackson-core-asl-1.9.13.jar and /dev/null differ diff --git a/spark-operator/source/jackson-databind-2.2.3.jar b/spark-operator/source/jackson-databind-2.2.3.jar deleted file mode 100644 index 8545084..0000000 Binary files a/spark-operator/source/jackson-databind-2.2.3.jar and /dev/null differ diff --git a/spark-operator/source/jackson-jaxrs-1.8.3.jar b/spark-operator/source/jackson-jaxrs-1.8.3.jar deleted file mode 100644 index 1065a63..0000000 Binary files a/spark-operator/source/jackson-jaxrs-1.8.3.jar and /dev/null differ diff --git a/spark-operator/source/jackson-mapper-asl-1.9.13.jar b/spark-operator/source/jackson-mapper-asl-1.9.13.jar deleted file mode 100644 index 0f2073f..0000000 Binary files a/spark-operator/source/jackson-mapper-asl-1.9.13.jar and /dev/null differ diff --git a/spark-operator/source/jackson-xc-1.8.3.jar b/spark-operator/source/jackson-xc-1.8.3.jar deleted file mode 100644 index 9e168ec..0000000 Binary files a/spark-operator/source/jackson-xc-1.8.3.jar and /dev/null differ diff --git a/spark-operator/source/java-xmlbuilder-0.4.jar b/spark-operator/source/java-xmlbuilder-0.4.jar deleted file mode 100644 index e5acf92..0000000 Binary files a/spark-operator/source/java-xmlbuilder-0.4.jar and /dev/null differ diff --git a/spark-operator/source/jaxb-api-2.2.2.jar b/spark-operator/source/jaxb-api-2.2.2.jar deleted file mode 100644 index 31e5fa0..0000000 Binary files a/spark-operator/source/jaxb-api-2.2.2.jar and /dev/null differ diff --git a/spark-operator/source/jaxb-impl-2.2.3-1.jar b/spark-operator/source/jaxb-impl-2.2.3-1.jar deleted file mode 100644 index eeaf660..0000000 Binary files a/spark-operator/source/jaxb-impl-2.2.3-1.jar and /dev/null differ diff --git a/spark-operator/source/jersey-core-1.9.jar b/spark-operator/source/jersey-core-1.9.jar deleted file mode 100644 index 548dd88..0000000 Binary files a/spark-operator/source/jersey-core-1.9.jar and /dev/null differ diff --git a/spark-operator/source/jersey-json-1.9.jar b/spark-operator/source/jersey-json-1.9.jar deleted file mode 100644 index b1a4ce5..0000000 Binary files a/spark-operator/source/jersey-json-1.9.jar and /dev/null differ diff --git a/spark-operator/source/jersey-server-1.9.jar b/spark-operator/source/jersey-server-1.9.jar deleted file mode 100644 index ae0117c..0000000 Binary files a/spark-operator/source/jersey-server-1.9.jar and /dev/null differ diff --git a/spark-operator/source/jets3t-0.9.0.jar b/spark-operator/source/jets3t-0.9.0.jar deleted file mode 100644 index 6870900..0000000 Binary files a/spark-operator/source/jets3t-0.9.0.jar and /dev/null differ diff --git a/spark-operator/source/jettison-1.1.jar b/spark-operator/source/jettison-1.1.jar deleted file mode 100644 index e4e9c8c..0000000 Binary files a/spark-operator/source/jettison-1.1.jar and /dev/null differ diff --git a/spark-operator/source/jetty-6.1.26.jar b/spark-operator/source/jetty-6.1.26.jar deleted file mode 
100644 index 2cbe07a..0000000 Binary files a/spark-operator/source/jetty-6.1.26.jar and /dev/null differ diff --git a/spark-operator/source/jetty-sslengine-6.1.26.jar b/spark-operator/source/jetty-sslengine-6.1.26.jar deleted file mode 100644 index 051efb5..0000000 Binary files a/spark-operator/source/jetty-sslengine-6.1.26.jar and /dev/null differ diff --git a/spark-operator/source/jetty-util-6.1.26.jar b/spark-operator/source/jetty-util-6.1.26.jar deleted file mode 100644 index cd23752..0000000 Binary files a/spark-operator/source/jetty-util-6.1.26.jar and /dev/null differ diff --git a/spark-operator/source/joda-time-2.10.10.jar b/spark-operator/source/joda-time-2.10.10.jar deleted file mode 100644 index f78cb2d..0000000 Binary files a/spark-operator/source/joda-time-2.10.10.jar and /dev/null differ diff --git a/spark-operator/source/jsch-0.1.54.jar b/spark-operator/source/jsch-0.1.54.jar deleted file mode 100644 index 426332e..0000000 Binary files a/spark-operator/source/jsch-0.1.54.jar and /dev/null differ diff --git a/spark-operator/source/jsp-api-2.1.jar b/spark-operator/source/jsp-api-2.1.jar deleted file mode 100644 index c0195af..0000000 Binary files a/spark-operator/source/jsp-api-2.1.jar and /dev/null differ diff --git a/spark-operator/source/jsr305-3.0.0.jar b/spark-operator/source/jsr305-3.0.0.jar deleted file mode 100644 index cc39b73..0000000 Binary files a/spark-operator/source/jsr305-3.0.0.jar and /dev/null differ diff --git a/spark-operator/source/log4j-1.2.17.jar b/spark-operator/source/log4j-1.2.17.jar deleted file mode 100644 index 1d425cf..0000000 Binary files a/spark-operator/source/log4j-1.2.17.jar and /dev/null differ diff --git a/spark-operator/source/netty-3.7.0.Final.jar b/spark-operator/source/netty-3.7.0.Final.jar deleted file mode 100644 index eef1aba..0000000 Binary files a/spark-operator/source/netty-3.7.0.Final.jar and /dev/null differ diff --git a/spark-operator/source/paranamer-2.3.jar b/spark-operator/source/paranamer-2.3.jar deleted file mode 100644 index ad12ae9..0000000 Binary files a/spark-operator/source/paranamer-2.3.jar and /dev/null differ diff --git a/spark-operator/source/protobuf-java-2.5.0.jar b/spark-operator/source/protobuf-java-2.5.0.jar deleted file mode 100644 index 4c4e686..0000000 Binary files a/spark-operator/source/protobuf-java-2.5.0.jar and /dev/null differ diff --git a/spark-operator/source/servlet-api-2.5.jar b/spark-operator/source/servlet-api-2.5.jar deleted file mode 100644 index fb52493..0000000 Binary files a/spark-operator/source/servlet-api-2.5.jar and /dev/null differ diff --git a/spark-operator/source/slf4j-api-1.7.10.jar b/spark-operator/source/slf4j-api-1.7.10.jar deleted file mode 100644 index 744e9ec..0000000 Binary files a/spark-operator/source/slf4j-api-1.7.10.jar and /dev/null differ diff --git a/spark-operator/source/slf4j-log4j12-1.7.10.jar b/spark-operator/source/slf4j-log4j12-1.7.10.jar deleted file mode 100644 index 957b2b1..0000000 Binary files a/spark-operator/source/slf4j-log4j12-1.7.10.jar and /dev/null differ diff --git a/spark-operator/source/snappy-java-1.0.4.1.jar b/spark-operator/source/snappy-java-1.0.4.1.jar deleted file mode 100644 index 8198919..0000000 Binary files a/spark-operator/source/snappy-java-1.0.4.1.jar and /dev/null differ diff --git a/spark-operator/source/stax-api-1.0-2.jar b/spark-operator/source/stax-api-1.0-2.jar deleted file mode 100644 index 015169d..0000000 Binary files a/spark-operator/source/stax-api-1.0-2.jar and /dev/null differ diff --git 
a/spark-operator/source/xmlenc-0.52.jar b/spark-operator/source/xmlenc-0.52.jar deleted file mode 100644 index ec568b4..0000000 Binary files a/spark-operator/source/xmlenc-0.52.jar and /dev/null differ diff --git a/spark-operator/source/xz-1.0.jar b/spark-operator/source/xz-1.0.jar deleted file mode 100644 index a848f16..0000000 Binary files a/spark-operator/source/xz-1.0.jar and /dev/null differ diff --git a/spark-operator/source/zookeeper-3.4.6.jar b/spark-operator/source/zookeeper-3.4.6.jar deleted file mode 100644 index 7c340be..0000000 Binary files a/spark-operator/source/zookeeper-3.4.6.jar and /dev/null differ diff --git a/test/source_spark-amazon-s3-examples/.gitignore b/test/source_spark-amazon-s3-examples/.gitignore deleted file mode 100644 index bcd2da3..0000000 --- a/test/source_spark-amazon-s3-examples/.gitignore +++ /dev/null @@ -1,14 +0,0 @@ -.idea -.metadata -.cache-main -.classpath -.project -.settings -*.class -*.orig -*.log -target/ -.DS_Store -*.iml -scalastyle-output.xml - diff --git a/test/source_spark-amazon-s3-examples/.vscode/settings.json b/test/source_spark-amazon-s3-examples/.vscode/settings.json deleted file mode 100644 index e0f15db..0000000 --- a/test/source_spark-amazon-s3-examples/.vscode/settings.json +++ /dev/null @@ -1,3 +0,0 @@ -{ - "java.configuration.updateBuildConfiguration": "automatic" -} \ No newline at end of file diff --git a/test/source_spark-amazon-s3-examples/pom.xml b/test/source_spark-amazon-s3-examples/pom.xml deleted file mode 100644 index 1a7b695..0000000 --- a/test/source_spark-amazon-s3-examples/pom.xml +++ /dev/null @@ -1,132 +0,0 @@ - - - - com.sparkbyexamples - 4.0.0 - spark-amazon-s3-examples - - 1.0-SNAPSHOT - 2008 - jar - - 2.11.12 - 2.4.6 - - - - - scala-tools.org - Scala-Tools Maven2 Repository - http://scala-tools.org/repo-releases - - - - - - scala-tools.org - Scala-Tools Maven2 Repository - http://scala-tools.org/repo-releases - - - - - - - org.scala-lang - scala-library - ${scala.version} - - - - org.specs - specs - 1.2.5 - test - - - - - org.apache.spark - spark-core_2.11 - 2.4.6 - - - - org.apache.spark - spark-sql_2.11 - 2.4.6 - - - - org.apache.hadoop - hadoop-common - 2.7.4 - - - - org.apache.hadoop - hadoop-client - 2.7.4 - - - org.apache.hadoop - hadoop-aws - 2.7.4 - - - - - - src/main/scala - src/main/resources - - - org.scala-tools - maven-scala-plugin - - - - compile - testCompile - - - - - ${scala.version} - - -target:jvm-1.5 - - - - - org.apache.maven.plugins - maven-eclipse-plugin - - true - - ch.epfl.lamp.sdt.core.scalabuilder - - - ch.epfl.lamp.sdt.core.scalanature - - - org.eclipse.jdt.launching.JRE_CONTAINER - ch.epfl.lamp.sdt.launching.SCALA_CONTAINER - - - - - - - - - org.scala-tools - maven-scala-plugin - - ${scala.version} - - - - - diff --git a/test/source_spark-amazon-s3-examples/src/main/resources/small_zipcode.csv b/test/source_spark-amazon-s3-examples/src/main/resources/small_zipcode.csv deleted file mode 100644 index 9caf356..0000000 --- a/test/source_spark-amazon-s3-examples/src/main/resources/small_zipcode.csv +++ /dev/null @@ -1,6 +0,0 @@ -id,zipcode,type,city,state,population -1,704,STANDARD,,PR,30100 -2,704,,PASEO COSTA DEL SUR,PR, -3,709,,BDA SAN LUIS,PR,3700 -4,76166,UNIQUE,CINGULAR WIRELESS,TX,84000 -5,76177,STANDARD,,TX, \ No newline at end of file diff --git a/test/source_spark-amazon-s3-examples/src/main/resources/zipcodes.parquet b/test/source_spark-amazon-s3-examples/src/main/resources/zipcodes.parquet deleted file mode 100644 index 54b9943..0000000 Binary files 
a/test/source_spark-amazon-s3-examples/src/main/resources/zipcodes.parquet and /dev/null differ diff --git a/test/source_spark-amazon-s3-examples/src/main/scala/com/sparkbyexamples/spark/AvroAWSExample.scala b/test/source_spark-amazon-s3-examples/src/main/scala/com/sparkbyexamples/spark/AvroAWSExample.scala deleted file mode 100644 index dbbd1b0..0000000 --- a/test/source_spark-amazon-s3-examples/src/main/scala/com/sparkbyexamples/spark/AvroAWSExample.scala +++ /dev/null @@ -1,5 +0,0 @@ -package com.sparkbyexamples.spark - -object AvroAWSExample extends App{ - -} diff --git a/test/source_spark-amazon-s3-examples/src/main/scala/com/sparkbyexamples/spark/CsvAWSExample.scala b/test/source_spark-amazon-s3-examples/src/main/scala/com/sparkbyexamples/spark/CsvAWSExample.scala deleted file mode 100644 index 13fddbe..0000000 --- a/test/source_spark-amazon-s3-examples/src/main/scala/com/sparkbyexamples/spark/CsvAWSExample.scala +++ /dev/null @@ -1,5 +0,0 @@ -package com.sparkbyexamples.spark - -object CsvAWSExample extends App { - -} diff --git a/test/source_spark-amazon-s3-examples/src/main/scala/com/sparkbyexamples/spark/JsonAWSExample.scala b/test/source_spark-amazon-s3-examples/src/main/scala/com/sparkbyexamples/spark/JsonAWSExample.scala deleted file mode 100644 index ddaa961..0000000 --- a/test/source_spark-amazon-s3-examples/src/main/scala/com/sparkbyexamples/spark/JsonAWSExample.scala +++ /dev/null @@ -1,5 +0,0 @@ -package com.sparkbyexamples.spark - -object JsonAWSExample extends App{ - -} diff --git a/test/source_spark-amazon-s3-examples/src/main/scala/com/sparkbyexamples/spark/ParquetAWSExample.scala b/test/source_spark-amazon-s3-examples/src/main/scala/com/sparkbyexamples/spark/ParquetAWSExample.scala deleted file mode 100644 index b65478a..0000000 --- a/test/source_spark-amazon-s3-examples/src/main/scala/com/sparkbyexamples/spark/ParquetAWSExample.scala +++ /dev/null @@ -1,54 +0,0 @@ -package com.sparkbyexamples.spark - -import org.apache.spark.sql.SparkSession - -object ParquetAWSExample extends App{ - - val spark: SparkSession = SparkSession.builder() - .appName("SparkByExamples.com") - .getOrCreate() - spark.sparkContext - .hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID")) - spark.sparkContext - .hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY")) - spark.sparkContext - .hadoopConfiguration.set("fs.s3a.endpoint", sys.env("S3_ENDPOINT_URL")) - spark.sparkContext - .hadoopConfiguration.set("fs.s3a.path.style.access", "true") - - val bucket = sys.env("BUCKET_NAME") - - val data = Seq(("JamesĀ ","Rose","Smith","36636","M",3000), - ("Michael","Rose","","40288","M",4000), - ("Robert","Mary","Williams","42114","M",4000), - ("Maria","Anne","Jones","39192","F",4000), - ("Jen","Mary","Brown","1234","F",-1) - ) - - val columns = Seq("firstname","middlename","lastname","dob","gender","salary") - import spark.sqlContext.implicits._ - val df = data.toDF(columns:_*) - - df.show() - df.printSchema() - - df.write - .parquet("s3a://" + bucket + "/parquet/people.parquet") - - - val parqDF = spark.read.parquet("s3a://" + bucket + "/parquet/people.parquet") - parqDF.createOrReplaceTempView("ParquetTable") - - spark.sql("select * from ParquetTable where salary >= 4000").explain() - val parkSQL = spark.sql("select * from ParquetTable where salary >= 4000 ") - - parkSQL.show() - parkSQL.printSchema() - - df.write - .partitionBy("gender","salary") - .parquet("s3a://" + bucket + "/parquet/people2.parquet") - - spark.stop() - -} diff --git 
a/test/spark-amazon-s3-examples-1.0-SNAPSHOT.jar b/test/spark-amazon-s3-examples-1.0-SNAPSHOT.jar deleted file mode 100644 index 4bc5169..0000000 Binary files a/test/spark-amazon-s3-examples-1.0-SNAPSHOT.jar and /dev/null differ diff --git a/test/spark-pi.yaml b/test/spark-pi.yaml deleted file mode 100644 index c2b9ed9..0000000 --- a/test/spark-pi.yaml +++ /dev/null @@ -1,37 +0,0 @@ -apiVersion: "sparkoperator.k8s.io/v1beta2" -kind: SparkApplication -metadata: - name: spark-pi -spec: - type: Scala - mode: cluster - image: "quay.io/guimou/spark-benchmark:s3.0.1-h3.3.0_v0.0.1" - imagePullPolicy: IfNotPresent - mainClass: org.apache.spark.examples.SparkPi - mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.1.jar" - arguments: - - "1000" - sparkConf: - "spark.kubernetes.local.dirs.tmpfs": "true" - # History Server - "spark.eventLog.enabled": "true" - "spark.eventLog.dir": "s3a://YOUR_BUCKET/logs-dir/" - # S3 Configuration for History server - "spark.hadoop.fs.s3a.bucket.YOUR_BUCKET.endpoint": "rook-ceph-rgw-ocs-storagecluster-cephobjectstore.openshift-storage.svc" - "spark.hadoop.fs.s3a.bucket.YOUR_BUCKET.access.key": "AWS_ACCESS_KEY_ID" - "spark.hadoop.fs.s3a.bucket.YOUR_BUCKET.secret.key": "AWS_SECRET_ACCESS_KEY" - "spark.hadoop.fs.s3a.bucket.YOUR_BUCKET.path.style.access": "true" - "spark.hadoop.fs.s3a.bucket.YOUR_BUCKET.connection.ssl.enabled": "false" - sparkVersion: "3.0.1" - restartPolicy: - type: Never - driver: - cores: 1 - coreLimit: "1" - memory: "512m" - serviceAccount: 'spark-operator-spark' - executor: - cores: 1 - coreLimit: "1" - instances: 2 - memory: "1000m" \ No newline at end of file diff --git a/test/spark_app_shakespeare_2.4.6.yaml b/test/spark_app_shakespeare_2.4.6.yaml index 550f337..9d99000 100644 --- a/test/spark_app_shakespeare_2.4.6.yaml +++ b/test/spark_app_shakespeare_2.4.6.yaml @@ -4,13 +4,14 @@ metadata: name: shk-app spec: type: Python - sparkVersion: 3.0.1 + sparkVersion: 2.4.6 pythonVersion: '3' sparkConf: "spark.metrics.conf.*.source.jvm.class": "org.apache.spark.metrics.source.JvmSource" "spark.metrics.appStatusSource.enabled": "true" mainApplicationFile: 'local:///home/wordcount.py' - image: "quay.io/guimou/pyspark-odh:s2.4.6-h3.3.0_v0.0.3" + image: "quay.io/guimou/pyspark-odh:s2.4.6-h3.3.0_latest" + imagePullPolicy: Always volumes: - name: wordcount configMap: @@ -23,7 +24,7 @@ spec: onSubmissionFailureRetryInterval: 20 timeToLiveSeconds: 15 driver: - serviceAccount: 'spark' + serviceAccount: 'spark-operator-spark' labels: type: spark-application env: diff --git a/test/spark_app_shakespeare.yaml b/test/spark_app_shakespeare_3.0.1.yaml similarity index 95% rename from test/spark_app_shakespeare.yaml rename to test/spark_app_shakespeare_3.0.1.yaml index 128c74b..8d1cc8c 100644 --- a/test/spark_app_shakespeare.yaml +++ b/test/spark_app_shakespeare_3.0.1.yaml @@ -10,7 +10,8 @@ spec: "spark.metrics.conf.*.source.jvm.class": "org.apache.spark.metrics.source.JvmSource" "spark.metrics.appStatusSource.enabled": "true" mainApplicationFile: 'local:///home/wordcount.py' - image: "quay.io/guimou/pyspark-odh:s3.0.1-h3.3.0_v0.0.3" + image: "quay.io/guimou/pyspark-odh:s3.0.1-h3.3.0_latest" + imagePullPolicy: Always volumes: - name: wordcount configMap: diff --git a/test/spark_app_shakespeare_history_server.yaml b/test/spark_app_shakespeare_3.3.0.history_server.yaml similarity index 92% rename from test/spark_app_shakespeare_history_server.yaml rename to test/spark_app_shakespeare_3.3.0.history_server.yaml index bc32e51..edc1ec5 100644 --- 
a/test/spark_app_shakespeare_history_server.yaml +++ b/test/spark_app_shakespeare_3.3.0.history_server.yaml @@ -4,7 +4,7 @@ metadata: name: shk-app spec: type: Python - sparkVersion: 3.0.1 + sparkVersion: 3.3.0 pythonVersion: '3' sparkConf: "spark.metrics.conf.*.source.jvm.class": "org.apache.spark.metrics.source.JvmSource" @@ -17,7 +17,8 @@ spec: "spark.hadoop.fs.s3a.bucket.YOUR_BUCKET.path.style.access": "true" "spark.hadoop.fs.s3a.bucket.YOUR_BUCKET.connection.ssl.enabled": "false" mainApplicationFile: 'local:///home/wordcount.py' - image: "quay.io/guimou/pyspark-odh:s3.0.1-h3.3.0_v0.0.3" + image: "quay.io/guimou/pyspark-odh:s3.3.0-h3.3.3_latest" + imagePullPolicy: Always volumes: - name: wordcount configMap: @@ -69,6 +70,6 @@ spec: exposeDriverMetrics: true exposeExecutorMetrics: true prometheus: - jmxExporterJar: "/prometheus/jmx_prometheus_javaagent-0.15.0.jar" + jmxExporterJar: "/prometheus/jmx_prometheus_javaagent-0.17.0.jar" portName: 'tcp-prometheus' \ No newline at end of file diff --git a/test/spark_app_shakespeare_3.3.0.yaml b/test/spark_app_shakespeare_3.3.0.yaml new file mode 100644 index 0000000..11068be --- /dev/null +++ b/test/spark_app_shakespeare_3.3.0.yaml @@ -0,0 +1,68 @@ +apiVersion: sparkoperator.k8s.io/v1beta2 +kind: SparkApplication +metadata: + name: shk-app +spec: + type: Python + sparkVersion: 3.3.0 + pythonVersion: '3' + sparkConf: + "spark.metrics.conf.*.source.jvm.class": "org.apache.spark.metrics.source.JvmSource" + "spark.metrics.appStatusSource.enabled": "true" + mainApplicationFile: 'local:///home/wordcount.py' + image: "quay.io/guimou/pyspark-odh:s3.3.0-h3.3.3_latest" + imagePullPolicy: Always + volumes: + - name: wordcount + configMap: + name: wordcount + restartPolicy: + type: OnFailure + onFailureRetries: 3 + onFailureRetryInterval: 10 + onSubmissionFailureRetries: 5 + onSubmissionFailureRetryInterval: 20 + timeToLiveSeconds: 15 + driver: + serviceAccount: 'spark-operator-spark' + labels: + type: spark-application + env: + - name: S3_ENDPOINT_URL + valueFrom: + configMapKeyRef: + name: spark-demo + key: BUCKET_HOST + - name: BUCKET_NAME + valueFrom: + configMapKeyRef: + name: spark-demo + key: BUCKET_NAME + - name: AWS_ACCESS_KEY_ID + valueFrom: + secretKeyRef: + name: spark-demo + key: AWS_ACCESS_KEY_ID + - name: AWS_SECRET_ACCESS_KEY + valueFrom: + secretKeyRef: + name: spark-demo + key: AWS_SECRET_ACCESS_KEY + cores: 1 + coreLimit: "1" + volumeMounts: + - name: wordcount + mountPath: '/home/' + executor: + labels: + type: spark-application + instances: 2 + cores: 1 + coreLimit: "1" + monitoring: + exposeDriverMetrics: true + exposeExecutorMetrics: true + prometheus: + jmxExporterJar: "/prometheus/jmx_prometheus_javaagent-0.17.0.jar" + portName: 'tcp-prometheus' + \ No newline at end of file