Update ocr summ notebook #35

Merged
merged 1 commit into from
Oct 15, 2024
@@ -85,12 +85,11 @@
"1) It reads the table of the [Contract Understanding Atticus Dataset (CUAD)](https://www.atticusprojectai.org/cuad) dataset located in the [gs://dataproc-metastore-public-binaries/cuad_v1/full_contract_pdf/](https://console.cloud.google.com/storage/browser/dataproc-metastore-public-binaries/cuad_v1) \n",
" We will create a metadata table pointing to the paths of the image files in the bucket. \n",
"2) It runs OCR using the Vision API: it starts a series of async operations and then checks their completion status.\n",
"3) It calls [Vertex AI Gemini API](https://cloud.google.com/vertex-ai/docs/generative-ai/start/quickstarts/api-quickstart#try_text_prompts) to summarize each text page.\n",
"3) It calls [Vertex AI Gemini API](https://cloud.google.com/vertex-ai/docs/generative-ai/start/quickstarts/api-quickstart#try_text_prompts) to summarize the text.\n",
"4) It saves the output to BigQuery\n",
"\n",
"#### Related content\n",
"\n",
"- [Summarization with Large Documents using LangChain](https://github.com/GoogleCloudPlatform/generative-ai/blob/dev/language/examples/oss-samples/langchain/summarization_with_large_documents_langchain.ipynb)\n",
"- [Design summarization prompts](https://cloud.google.com/vertex-ai/docs/generative-ai/text/summarization-prompts)"
]
},
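Step 2 above describes a start-then-poll pattern: launch every asynchronous operation first, then check completion status in a loop. The following is a minimal sketch of that pattern using only the standard library. `fake_ocr` and `run_all` are hypothetical stand-ins introduced for illustration; they are not the Vision API calls the notebook makes.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_ocr(pdf_path):
    """Hypothetical stand-in for one asynchronous OCR request."""
    time.sleep(0.01)
    return f"text of {pdf_path}"


def run_all(paths, poll_seconds=0.01, timeout_seconds=5.0):
    """Start every operation up front, then poll until all have completed."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Phase 1: submit everything without waiting on any single result.
        pending = {path: pool.submit(fake_ocr, path) for path in paths}
        results = {}
        deadline = time.monotonic() + timeout_seconds
        # Phase 2: poll completion status, collecting finished operations.
        while pending:
            if time.monotonic() > deadline:
                raise TimeoutError(f"still pending: {sorted(pending)}")
            for path in [p for p, fut in pending.items() if fut.done()]:
                results[path] = pending.pop(path).result()
            if pending:
                time.sleep(poll_seconds)
    return results
```

Submitting all work before polling keeps the operations running concurrently; the timeout guards against an operation that never completes.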
@@ -262,7 +261,7 @@
"output_path_prefix = \"cuad_v1/output_ocr\" # path prefix after bucket name where the folder structure will be created\n",
"# BigQuery\n",
"output_dataset_bq = \"output_dataset\" # create the BigQuery dataset beforehand\n",
"output_table_bq = \"ocr_page_summaries\"\n",
"output_table_bq = \"ocr_summaries\"\n",
"bq_temp_bucket_name = \"workspaces-bq-temp-bucket-dev\""
]
},
@@ -692,35 +691,29 @@
"metadata": {},
"outputs": [],
"source": [
"import vertexai\n",
"from vertexai.generative_models import GenerativeModel, Part , HarmCategory, HarmBlockThreshold\n",
"\n",
"vertexai.init(project=project_id, location=\"us-central1\")\n",
"def gemini_predict(prompt, temperature=0.5, model_name=\"gemini-1.5-pro\"):\n",
" \n",
" from vertexai.generative_models import GenerativeModel, Part, Content, HarmCategory, HarmBlockThreshold\n",
"\n",
"def gemini_predict(prompt):\n",
" \n",
" gemini_pro_model = GenerativeModel(\"gemini-1.0-pro\")\n",
" config = {\"max_output_tokens\": 2048, \"temperature\": 0.4, \"top_p\": 1, \"top_k\": 32}\n",
" safety_config = {\n",
" HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,\n",
" HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,\n",
" }\n",
" model = GenerativeModel(model_name=model_name)\n",
" \n",
" prediction = gemini_pro_model.generate_content([\n",
" prompt\n",
" ],\n",
" generation_config=config,\n",
" safety_settings=safety_config,\n",
" stream=True\n",
" prompt_content = Content(\n",
" role=\"user\",\n",
" parts=[Part.from_text(prompt)]\n",
" )\n",
"\n",
" response = model.generate_content(\n",
" prompt_content,\n",
" generation_config={\n",
" \"temperature\": temperature,\n",
" \"response_mime_type\": \"text/x.enum\"\n",
" },\n",
" safety_settings={\n",
" HarmCategory.HARM_CATEGORY_UNSPECIFIED: HarmBlockThreshold.BLOCK_ONLY_HIGH\n",
" }\n",
" )\n",
" \n",
" text_responses = []\n",
" try:\n",
" for response in prediction:\n",
" text_responses.append(response.text)\n",
" except:\n",
" pass\n",
" return \"\".join(text_responses)"
" return response.text"
]
},
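The earlier version of `gemini_predict` swallowed errors with a bare `except`, while the new version returns `response.text` directly, so a failed or blocked call may now raise inside the Spark UDF. One possible way to keep a single bad row from failing the job is a small retry wrapper, sketched below. `call_with_retries` and its parameters are hypothetical and are not part of this PR or of the Vertex AI SDK.

```python
import time


def call_with_retries(predict, prompt, attempts=3, backoff_seconds=0.0, default=""):
    """Call predict(prompt); retry on any Exception, return default if all attempts fail."""
    for attempt in range(attempts):
        try:
            return predict(prompt)
        except Exception:
            # Exponential backoff between attempts; no sleep after the last one.
            if attempt < attempts - 1:
                time.sleep(backoff_seconds * (2 ** attempt))
    return default
```

Under these assumptions, the UDF body could call `call_with_retries(gemini_predict, prompt)` so that a row which never succeeds gets an empty summary instead of aborting the stage.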
{
@@ -730,16 +723,21 @@
"metadata": {},
"outputs": [],
"source": [
"def summarize_page(page):\n",
"def summarize_text(page):\n",
" \n",
" prompt = f\"\"\"Provide a summary with about two sentences for the following article page:\n",
" {page}\n",
" Summary:\"\"\"\n",
" prompt = f\"\"\"You are an expert in reading contracts, articles, agreements, or text in general.\n",
"You are able to create concise summaries of the text provided to you.\n",
"Try your best to summarize the text even if the information is not so well understandable.\n",
"Here is an article I will ask you to summarize:\n",
"{page}\n",
"Provide a summary with about 3 sentences with the most important information from the text.\n",
"Summary:\n",
"\"\"\"\n",
" \n",
" summary = gemini_predict(prompt)\n",
" return summary\n",
" \n",
"generate_descriptions_udf = udf(summarize_page)"
"generate_descriptions_udf = udf(summarize_text)"
]
},
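Because `summarize_text` now receives the full OCR text of a document rather than one page, it can help to build the prompt in a pure function that is testable without calling the model. The sketch below mirrors the prompt wording from this cell. `build_summary_prompt`, `sentences`, and `max_chars` are hypothetical names, and the character cap is an assumption for illustration, not something the notebook does.

```python
def build_summary_prompt(text, sentences=3, max_chars=None):
    """Build the summarization prompt; optionally cap the input length in characters."""
    if max_chars is not None and len(text) > max_chars:
        # Assumption: a simple character cap; the notebook passes the full text.
        text = text[:max_chars]
    return (
        "You are an expert in reading contracts, articles, agreements, or text in general.\n"
        "You are able to create concise summaries of the text provided to you.\n"
        "Try your best to summarize the text even if the information is not so well understandable.\n"
        "Here is an article I will ask you to summarize:\n"
        f"{text}\n"
        f"Provide a summary with about {sentences} sentences with the most important information from the text.\n"
        "Summary:\n"
    )
```

Keeping prompt construction separate from the model call makes it straightforward to inspect the exact prompt sent for a given row.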
{
@@ -749,51 +747,49 @@
"metadata": {},
"outputs": [],
"source": [
"summarize_page = udf(summarize_page)"
"summarize_text = udf(summarize_text)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2f21e9ae-1d0a-4633-867d-23c52773b573",
"id": "3184c463-6deb-42e8-9696-007d2c52f7f3",
"metadata": {},
"outputs": [],
"source": [
"ocr_pages_df = ocr_df.select(\"pdf_path\", explode(ocr_df[\"ocr_pages\"]).alias(\"page\"))"
"summaries_df = ocr_df.withColumn(\"summary\", summarize_text(ocr_df[\"ocr_text\"]))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3184c463-6deb-42e8-9696-007d2c52f7f3",
"id": "68b62926-1f4f-40ed-bb60-21e81cba905b",
"metadata": {},
"outputs": [],
"source": [
"summaries_df = ocr_pages_df.withColumn(\"summary\", summarize_page(ocr_pages_df[\"page\"]))"
"summaries_df.show(5,50)"
]
},
{
"cell_type": "markdown",
"id": "a03765e2-f28f-4c04-a2c3-761b6dbc615a",
"metadata": {},
"source": [
"| pdf_path| page| summary|\n",
"|--------------------------------------------------|--------------------------------------------------|--------------------------------------------------|\n",
"|gs://dataproc-metastore-public-binaries/cuad_v1...|NON-COMPETITION AGREEMENT AND RIGHT OF FIRST OF...|In an agreement dated May 3, 2006, Glamis Gold ...|\n",
"|gs://dataproc-metastore-public-binaries/cuad_v1...|-2-\\nPART 1\\nINTERPRETATION\\nDefinitions\\n1.1\\n...|This agreement defines key terms used throughou...|\n",
"|gs://dataproc-metastore-public-binaries/cuad_v1...|-3-\\n(b)\\na reference to a Part means a Part of...|This agreement defines terms and conditions, in...|\n",
"|gs://dataproc-metastore-public-binaries/cuad_v1...|-4-\\n(b)\\nadvise, lend money to, guarantee the ...|This agreement between Glamis and Western Coppe...|\n",
"|gs://dataproc-metastore-public-binaries/cuad_v1...|-5-\\nfor by monetary award alone. Accordingly, ...|This agreement outlines the remedies available ...|"
"| pdf_path| ocr_text| ocr_pages|number_pages| summary|\n",
"|--------------------------------------------------|--------------------------------------------------|--------------------------------------------------|------------|--------------------------------------------------|\n",
"|gs://dataproc-metastore-public-binaries/cuad_v1...|EXECUTION COPY\\nConfidential\\nExhibit 10.18\\nCE...|[EXECUTION COPY\\nConfidential\\nExhibit 10.18\\nC...| 85|This Development and Option Agreement outlines ...|\n",
"|gs://dataproc-metastore-public-binaries/cuad_v1...|Source: UPJOHN INC, 10-12G, 1/21/2020\\nFORM OF\\...|[Source: UPJOHN INC, 10-12G, 1/21/2020\\nFORM OF...| 82|This Manufacturing and Supply Agreement outline...|\n",
"|gs://dataproc-metastore-public-binaries/cuad_v1...|Exhibit 10.1\\nCERTAIN CONFIDENTIAL PORTIONS OF ...|[Exhibit 10.1\\nCERTAIN CONFIDENTIAL PORTIONS OF...| 71|This Network Build and Maintenance Agreement ou...|\n",
"|gs://dataproc-metastore-public-binaries/cuad_v1...|Exhibit 10.2\\nCERTAIN INFORMATION (INDICATED BY...|[Exhibit 10.2\\nCERTAIN INFORMATION (INDICATED B...| 68|This is a Distributorship Agreement between Zog...|\n",
"|gs://dataproc-metastore-public-binaries/cuad_v1...|Exhibit 10.12\\n[***] Certain information in thi...|[Exhibit 10.12\\n[***] Certain information in th...| 85|This Collaboration Agreement outlines the terms...|"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "68b62926-1f4f-40ed-bb60-21e81cba905b",
"cell_type": "markdown",
"id": "8e28b838-b61b-4409-a908-31cb5ebfbc6d",
"metadata": {},
"outputs": [],
"source": [
"summaries_df.show(5,50)"
"Example: "
]
},
{
@@ -803,7 +799,7 @@
"source": [
"|page|\n",
"|----------|\n",
"|[THIS AGREEMENT is dated May 3, 2006. NON-COMPETITION AGREEMENT AND RIGHT OF FIRST OFFER BETWEEN: AND: WHEREAS: GLAMIS GOLD LTD., a company incorporated under the laws of the Province of British Columbia, having an office at 310-5190 Neil Road, Reno, Nevada 89502 (\"Glamis\") WESTERN COPPER CORPORATION, a company incorporated under the laws of the Province of British Columbia, having an office at 2050-1111 West Georgia Street, Vancouver, B.C. V6E 4M3 (\"Western Copper\") (A) Glamis, Western Copper and Western Silver Corporation (\"Western Silver\") are parties to an arrangement agreement dated as of February 23, 2006 (the \"Arrangement Agreement\"), pursuant to which, among other things, Western Copper will acquire certain assets of Western Silver and Glamis will become the sole shareholder of Western Silver and the indirect owner, through Western Silver, of certain corporations and mineral properties in Mexico (the \"Arrangement\"); and 1162967.3...|"
"|EXECUTION COPY\\nConfidential\\nExhibit 10.18\\nCERTAIN CONFIDENTIAL INFORMATION CONTAINED IN THIS DOCUMENT, MARKED BY \\*\\*\\*, HAS BEEN OMITTED BECAUSE IT IS BOTH NOT MATERIAL AND WOULD BE COMPETITIVELY\\nHARMFUL IF PUBLICLY DISCLOSED.\\nDEVELOPMENT AND OPTION AGREEMENT\\nbetween\\nHARPOON THERAPEUTICS, INC.\\nand\\nABBVIE BIOTECHNOLOGY LTD\\nDated as of November 20, 2019\\nSource: HARPOON THERAPEUTICS, INC., 10-K, 3/12/2020TABLE OF CONTENTS\\nARTICLE 1\\nDEFINITIONS\\n1\\nARTICLE 2\\n18\\nCOLLABORATION\\nMANAGEMENT\\n2.1\\nJoint Governance Committee.\\n2.2\\n2.3\\nDiscontinuation of the JGC.\\n2.4\\nGeneral Provisions Applicable to the JGC.\\nInteractions Between the JGC and Internal Teams.\\n18\\n19\\n20\\n2.5\\nCMC Working Group.\\n2.6\\nWorking Groups.\\n2.7\\nExpenses.\\n21\\n21\\n21\\nARTICLE 3\\n21\\nDEVELOPMENT\\nAND\\nREGULATORY\\n3.1\\n3.2\\n3.3\\n\\*\\*\\*.\\n3.4\\n3.5\\nInitial Development Plan and Activities.\\nAbbVie Option.\\n24\\nPost-Exercise Development Activities.\\nSupply of Technology for Development Purposes.\\n21\\n25\\n3.6\\n3.7\\n3.8\\nARTICLE 4\\nExpenses and Invoicing.\\nSubcontracting.\\nRegulatory Matters.\\n30\\n26\\n27\\n28\\n28\\nCOMMERCIALIZATION\\n4.1\\n4.2\\n4.3\\n4.4\\n4.5\\nProducts.\\nARTICLE 5\\n33\\nGRANT OF\\nRIGHTS\\nIn General.\\nCommercialization Diligence.\\nBooking of Sales; Distribution.\\n30\\n30\\n31\\n31\\nProduct Trademarks.\\nCommercial Supply of Licensed Compounds or Licensed\\n31\\n20\\n27\\n27\\n5.1\\nGrants to AbbVie.\\n5.2\\nGrants to Harpoon.\\n5.3\\nSublicenses.\\n5.4\\nDistributorships.\\n5.5\\nCo-Promotion Rights.\\n5.6\\nRetention of Rights.\\n5.7\\n5.8\\n5.9\\nConfirmatory Patent License.\\nExclusivity with Respect to the Territory.\\nIn-License Agreements.\\n33\\n34\\n34\\n34\\n34\\n34\\n35\\n35\\n35\\nARTICLE 6\\n36\\nPAYMENTS AND\\nRECORDS\\n6.1\\nUpfront Payment.\\n36\\n6.2\\n6.3\\nDevelopment and Regulatory Milestones.\\nFirst Commercial Sales Milestones.\\n36\\n37\\n6.4\\nSales-Based 
Milestones.\\n37\\n6.5\\nRoyalties.\\n38\\n6.6\\nRoyalty Payments and Reports.\\n39\\n6.7\\nMode of Payment; Offsets.\\n40\\n6.8\\nWithholding Taxes.\\n40\\nSource: HARPOON THERAPEUTICS, INC., 10-K, 3/12/202040\\n41\\n6.9\\nIndirect Taxes.\\n6.10\\nInterest on Late Payments.\\n6.11\\nAudit.\\n41\\n6.12\\nAudit Dispute.\\n6.13\\nConfidentiality.\\n41\\n41\\n6.14\\n\\*\\*\\*\\n41\\n6.15\\nNo Other Compensation.\\nARTICLE 7\\n42\\nINTELLECTUAL\\nPROPERTY\\n42\\n7.1\\nOwnership of Intellectual Property.\\n7.2\\n7.3\\n7.4\\n7.5\\n7.6\\n7.7\\n7.8\\n7.9\\nARTICLE 8\\nMaintenance and Prosecution of Patents.\\n42\\n43\\nEnforcement of Patents.\\n45\\nInfringement Claims by Third Parties.\\n48\\nInvalidity or unenforceability Defenses or Actions.\\n48\\nProduct Trademarks.\\n49\\nInternational Nonproprietary Name.\\n50\\nInventor's Remuneration.\\n50\\nCommon Interest.\\n50\\n50\\nPHARMACOVIGILANCE\\nAND SAFETY\\n8.1\\n8.2\\nPharmacovigilance.\\nGlobal Safety Database.\\n50\\n50\\n50\\nARTICLE 9\\n51\\nCONFIDENTIALITY\\nAND NON-\\nDISCLOSURE\\n9.1\\n9.2\\n9.3\\nProduct Information.\\nConfidentiality Obligations.\\nPermitted Disclosures.\\n51\\n51\\n52\\n2\\n9.4\\nUse of Name.\\n53\\n553\\n9.5\\nPublic Announcements.\\n9.6\\nPublications.\\n53\\n54\\n9.7\\n9.8\\nReturn of Confidential Information.\\nSurvival.\\n54\\n54\\nARTICLE 10\\n55\\nREPRESENTATIONS\\nAND WARRANTIES\\n10.1\\n10.2\\n10.3\\n10.4\\n10.5\\nMutual Representations and Warranties.\\n55\\nAdditional Representations and Warranties of Harpoon.\\nCovenants of Harpoon.\\n58\\nCovenants of AbbVie.\\n58\\nDISCLAIMER OF WARRANTIES.\\n59\\nARTICLE 11\\n60\\nINDEMNITY\\n11.1\\n11.2\\nIndemnification of Harpoon.\\nIndemnification of AbbVie.\\n11.3\\n11.4\\n11.5\\n11.6\\n60\\n66\\n60\\nNotice of Claim.\\n60\\nControl of Defense.\\n61\\nSpecial, Indirect, and Other Losses.\\n61\\nInsurance.\\n61\\nARTICLE 12\\n62\\nTERM AND\\nTERMINATION\\n12.1\\n- ii -\\nTerm.\\n62\\nSource: HARPOON THERAPEUTICS, INC., 10-K, 
3/12/2020\\n55\\n5512.2\\n12.3\\nTermination for Material Breach.\\nAdditional Termination Rights by AbbVie.\\n12.4\\nTermination for Insolvency.\\n12.5\\nRights in Bankruptcy.\\n12.6\\nTermination in Entirety.\\n12.7\\nReversion of Harpoon Products.\\n12.8\\n12.9\\n12.10\\nTermination of Terminated Territory.\\nRemedies.\\nAccrued Rights; Surviving Obligations.\\n67\\n62\\n63\\n63\\n63\\n66\\n67\\n67\\n12\\n63\\n63\\nARTICLE 13\\n68\\nMISCELLANEOUS\\n13.1\\nForce Majeure.\\n68\\n13.2\\nChange in Control of Harpoon.\\n68\\n13.3\\nExport Control.\\n69\\n13.4\\nAssignment.\\n69\\n13.5\\nSeverability.\\n70\\n13.6\\nGoverning Law, Jurisdiction and Service.\\n70\\n13.7\\nDispute Resolution.\\n70\\n13.8\\nNotices.\\n71\\n13.9\\nEntire Agreement; Amendments.\\n72\\n13.10\\nEnglish Language.\\n72\\n13.11\\nEquitable Relief.\\n72\\n13.12\\nWaiver and Non-Exclusion of Remedies.\\n72\\n13.13\\nNo Benefit to Third Parties.\\n72\\n13.14\\nFurther Assurance.\\n73\\n13.15\\nRelationship of the Parties.\\n13.16\\nPerformance by Affiliates.\\n73\\nWW\\n73\\n13.17\\nCounterparts; Facsimile Execution.\\n73\\n13.18\\nReferences.\\n73\\n13.19\\nSchedules.\\n73\\n13.20\\nSCHEDULES\\nSchedule 1.84\\nSchedule 1.99\\nSchedule 3.7\\nSchedule 10.2\\nSchedule 10.2.1\\nSchedule 13.7.3\\nConstruction.\\nInitial Development Plan\\nLicensed Compound\\nPre-Approved Third Party Providers\\nDisclosure Schedules\\nExisting Patents\\nArbitration\\n73\\n- 111 -\\nSource: HARPOON THERAPEUTICS, INC., 10-K, 3/12/2020DEVELOPMENT AND OPTION AGREEMENT\\nThis Development and Option Agreement (the \"Agreement\") is made and entered into effective as of\\nNovember 20, 2019 (the \"Effective Date\") by and between Harpoon Therapeutics, Inc., a Delaware corporation (\"Harpoon”), and\\nAbbVie Biotechnology Ltd, a Bermuda corporation (“AbbVie”). 
Harpoon and AbbVie are sometimes referred to herein individually\\nas a \"Party\" and collectively as the \"Parties.\"\\nRECITALS\\nWHEREAS, Harpoon Controls (as defined herein) certain intellectual property rights with respect to the\\nLicensed Compound (as defined herein) and Licensed Products (as defined herein) in the Territory (as defined herein); and\\nWHEREAS, Harpoon wishes to grant an option to a license to AbbVie, and AbbVie wishes to take, such option\\nto a license under such intellectual property rights to develop and commercialize Licensed Products in the Territory, in each case in\\naccordance with the terms and conditions set forth below.....................|"
]
},
{
@@ -813,7 +809,7 @@
"source": [
"|summary|\n",
"|----------|\n",
"|[This is a non-competition agreement and right of first offer between Glamis Gold Ltd. and Western Copper Corporation. Glamis Gold Ltd. will not compete with Western Copper Corporation in certain areas of Mexico and will grant Western Copper Corporation a right of first offer with respect to the proposed disposition by Glamis Gold Ltd. of mineral properties or legal interests therein located in Mexico that Glamis Gold Ltd. acquired under the Arrangement., (b) the headings in this Agreement are for convenience of reference only and shall not affect its interpretation...|"
"|This Development and Option Agreement outlines the collaboration between Harpoon Therapeutics, Inc. and AbbVie Biotechnology Ltd for the development and commercialization of a compound known as HPN217. The agreement grants AbbVie an exclusive option to license the compound after reviewing the results of a Phase I/IB trial conducted by Harpoon. Upon exercising the option, AbbVie will take over development and commercialization responsibilities, with Harpoon receiving milestone payments and royalties on net sales. |"
]
},
{
@@ -842,8 +838,8 @@
"metadata": {},
"outputs": [],
"source": [
"agreggated_df.write \\\n",
" .format(\"com.google.cloud.spark.bigquery\") \\\n",
"summaries_df.write \\\n",
" .format(\"bigquery\") \\\n",
" .option(\"table\", project_id + \":\" + output_dataset_bq + \".\" + output_table_bq) \\\n",
" .option(\"temporaryGcsBucket\", bq_temp_bucket_name) \\\n",
" .option(\"enableListInference\", True) \\\n",
@@ -854,9 +850,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "runtime-galileo on Serverless Spark (Remote)",
"display_name": "delta-runtime on Serverless Spark (Remote)",
"language": "python",
"name": "9c39b79e5d2e7072beb4bd59-runtime-galileo"
"name": "9c39b79e5d2e7072beb4bd59-delta-runtime"
},
"language_info": {
"codemirror_mode": {
@@ -868,7 +864,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0"
"version": "3.12.3"
}
},
"nbformat": 4,