Update quick start demo
epec254 committed Jun 13, 2024
1 parent d078c07 commit 29c07f1
Showing 2 changed files with 2 additions and 30 deletions.
28 changes: 0 additions & 28 deletions genai_cookbook/10-min-demo/Mosaic-AI-Agents-10-Minute-Demo.ipynb
@@ -662,49 +662,21 @@
" {\n",
" \"request_id\": \"5482\",\n",
" \"request\": \"What happens if I try to access an index that is out of bounds in an array using the [ ] operator in Databricks SQL when spark.sql.ansi.enabled is set to false?\",\n",
" \"response\": \"If you try to access an index that is out of bounds in an array using the [ ] operator in Databricks SQL when `spark.sql.ansi.enabled` is set to false, Databricks will return `NULL` instead of raising an `INVALID_ARRAY_INDEX` error. However, if `spark.sql.ansi.enabled` is set to true, an `INVALID_ARRAY_INDEX` error will be raised.\",\n",
" \"expected_retrieved_context\": [\n",
" {\n",
" \"content\": \"[ ] (bracket sign) operator \\nApplies to: Databricks SQL Databricks Runtime \\nReturns an array element or map value given an index or key. \\nSyntax\\nSyntax\\nexpr [ keyExpr ] \\nNote that the use of brackets here is as literal text and not indicating optional syntax.\\n\\nArguments\\nArguments\\nexpr: An ARRAY or MAP expression. \\nkeyExpr: If expr is an ARRAY an integral numeric. Otherwise an expression matching the type of the key of the MAP.\\n\\nReturns\\nReturns\\nThe result type is the element type of the ARRAY or the value type of the MAP. \\nThe first element of an ARRAY is at index 0. \\nIf the keyExpr is not a valid key for the MAP expr Databricks returns null. \\nIf the keyExpr is out of bound for the ARRAY expr Databricks raises a INVALID_ARRAY_INDEXerror. \\nNote \\nIn Databricks Runtime, if spark.sql.ansi.enabled is false, the operator returns NULL instead of an out of bounds error.\\n\\nExamples\\nExamples\\n> SELECT a[2] FROM VALUES(array(10, 20, 30)) AS T(a); 30 > SELECT m[1] FROM VALUES(map(1, 'Hello', 2, 'World')) AS T(m); Hello\\n\\nRelated functions\\nRelated functions\\narray function \\narray_contains function \\narray_position function \\nmap function \\nmap_keys function \\nmap_values function\",\n",
" \"doc_uri\": \"https://docs.databricks.com/sql/language-manual/functions/bracketsign.html\",\n",
" }\n",
" ],\n",
" \"expected_response\": \"If you try to access an index that is out of bounds in an array using the [ ] operator in Databricks SQL when spark.sql.ansi.enabled is set to false, Databricks will return NULL instead of raising an error.\",\n",
" },\n",
" {\n",
" \"request_id\": \"2112\",\n",
" \"request\": \"Why is a long-running stage in my Spark job only showing one task, and how can I resolve this issue?\",\n",
" \"response\": \"A long-running stage in a Spark job with only one task could be caused by expensive UDFs on small data, window functions without a `PARTITION BY` statement, reading from an unsplittable file type, setting the `multiLine` option when reading a JSON or CSV file, schema inference of a large file, or using `repartition(1)` or `coalesce(1)`. To diagnose this issue, you can check the Spark UI to identify the longest stage of the job and look at the Input, Output, Shuffle Read, and Shuffle Write columns. If there is only one task, it could be a sign of a problem. To resolve this issue, you can try optimizing your Spark job by avoiding the aforementioned causes. For example, you can avoid using `repartition(1)` or `coalesce(1)` unless necessary, use a splittable file type, and optimize your UDFs and window functions.\",\n",
" \"expected_retrieved_context\": [\n",
" {\n",
" \"content\": \"One Spark task \\nIf you see a long-running stage with just one task, that’s likely a sign of a problem. While this one task is running only one CPU is utilized and the rest of the cluster may be idle. This happens most frequently in the following situations: \\nExpensive UDF on small data \\nWindow function without PARTITION BY statement \\nReading from an unsplittable file type. This means the file cannot be read in multiple parts, so you end up with one big task. Gzip is an example of an unsplittable file type. \\nSetting the multiLine option when reading a JSON or CSV file \\nSchema inference of a large file \\nUse of repartition(1) or coalesce(1)\",\n",
" \"doc_uri\": \"https://docs.databricks.com/optimizations/spark-ui-guide/one-spark-task.html\",\n",
" }\n",
" ],\n",
" \"expected_response\": \"A long-running stage with one task in a Spark job could be due to several reasons such as:\\n\\n1. Expensive User Defined Functions (UDFs) on small data\\n2. Window function without a PARTITION BY statement\\n3. Reading from an unsplittable file type like gzip\\n4. Setting the multiLine option when reading a JSON or CSV file\\n5. Schema inference of a large file\\n6. Use of repartition(1) or coalesce(1)\\n\\nTo resolve this issue, you can:\\n\\n1. Optimize your UDFs or replace them with built-in functions if possible.\\n2. Ensure that you have a proper PARTITION BY statement in your window functions.\\n3. Avoid using unsplittable file types like gzip. Instead, use splittable file types like snappy or lz4.\\n4. Avoid setting the multiLine option when reading JSON or CSV files.\\n5. Perform schema inference on a small sample of your data and then apply it to the entire dataset.\\n6. Avoid using repartition(1) or coalesce(1) unless necessary.\\n\\nBy implementing these changes, you should be able to resolve the issue of a long-running stage with only one task in your Spark job.\",\n",
" },\n",
" {\n",
" \"request_id\": \"5054\",\n",
" \"request\": \"How can I represent 4-byte single-precision floating point numbers in Databricks SQL and what are their limits?\",\n",
" \"response\": \"4-byte single-precision floating point numbers in Databricks SQL are represented using the `FLOAT` data type. The range of numbers that can be represented is from -3.402E+38 to +3.402E+38, including negative infinity, positive infinity, and NaN (not a number). The literals for `FLOAT` type can be expressed in base-10 or base-2 format, with the `F` postfix and `E` exponent being case insensitive. However, due to the base-2 representation, the literal may not be exact. For accurate representation of fractional or large base-10 numbers, it is recommended to use the `DECIMAL` data type.\",\n",
" \"expected_retrieved_context\": [\n",
" {\n",
" \"content\": \"FLOAT type \\nApplies to: Databricks SQL Databricks Runtime \\nRepresents 4-byte single-precision floating point numbers. \\nSyntax\\nSyntax\\n{ FLOAT | REAL }\\n\\nLimits\\nLimits\\nThe range of numbers is: \\n-∞ (negative infinity) \\n-3.402E+38 to -1.175E-37 \\n0 \\n+1.175E-37 to +3.402E+38 \\n+∞ (positive infinity) \\nNaN (not a number)\\n\\nLiterals\\nLiterals\\ndecimal_digits [ exponent ] F | [ + | - ] digit [ ... ] [ exponent ] F decimal_digits: [ + | - ] { digit [ ... ] . [ digit [ ... ] ] | . digit [ ... ] } exponent: E [ + | - ] digit [ ... ] \\ndigit: Any numeral from 0 to 9. \\nThe F postfix and E exponent are case insensitive.\\n\\nNotes\\nNotes\\nFLOAT is a base-2 numeric type. When given a literal which is base-10 the representation may not be exact. Use DECIMAL type to accurately represent fractional or large base-10 numbers.\\n\\nExamples\\nExamples\\n> SELECT +1F; 1.0 > SELECT 5E10F; 5E10 > SELECT 5.3E10F; 5.3E10 > SELECT -.1F; -0.1 > SELECT 2.F; 2.0 > SELECT -5555555555555555.1F -5.5555558E15 > SELECT CAST(6.1 AS FLOAT) 6.1\\n\\nRelated\\nRelated\\nTINYINT type \\nSMALLINT type \\nINT type \\nBIGINT type \\nDECIMAL type \\nDOUBLE type \\ncast function \\nSpecial floating point values\",\n",
" \"doc_uri\": \"https://docs.databricks.com/sql/language-manual/data-types/float-type.html\",\n",
" }\n",
" ],\n",
" \"expected_response\": \"4-byte single-precision floating point numbers can be represented in Databricks SQL using the `FLOAT` or `REAL` syntax. The range of numbers that can be represented is from -3.402E+38 to +3.402E+38, including negative infinity, positive infinity, and NaN (not a number). Here are some examples of how to represent these numbers:\\n\\n* `+1F` represents 1.0\\n* `5E10F` represents 5E10\\n* `5.3E10F` represents 5.3E10\\n* `-.1F` represents -0.1\\n* `2.F` represents 2.0\\n* `-5555555555555555.1F` represents -5.5555558E15\\n* `CAST(6.1 AS FLOAT)` represents 6.1\\n\\nNote that `FLOAT` is a base-2 numeric type, so the representation of base-10 literals may not be exact. If you need to accurately represent fractional or large base-10 numbers, consider using the `DECIMAL` type instead.\",\n",
" },\n",
" {\n",
" \"request_id\": \"2003\",\n",
" \"request\": \"How can I identify the reason for failing executors in my Databricks workspace, and what steps can I take to resolve memory issues?\",\n",
" \"response\": \"To identify the reason for failing executors in your Databricks workspace, you should first check the compute's Event log in the Spark UI to see if there's any explanation for why the executors failed. If you don't find any information in the event log, navigate to the Executors tab in the Spark UI to get the logs from the failed executors.\\n\\nThe most common reasons for executors being removed are autoscaling, spot instance losses, and executors running out of memory. If you see any failing jobs, click on them to get to their pages, scroll down to see the failed stage and a failure reason, and check the failed tasks to identify the issue.\\n\\nIf you suspect a memory issue, you can verify it by doubling the memory per core to see if it impacts your problem. If it takes longer to fail with the extra memory or doesn't fail at all, that's a good sign that you're on the right track. If you can fix your issue by increasing the memory, great! If it doesn't fix the issue, or you can't bear the extra cost, you should dig deeper into memory issues.\",\n",
" \"expected_retrieved_context\": [\n",
" {\n",
" \"content\": \"Failing jobs or executors removed \\nSo you’re seeing failed jobs or removed executors: \\nThe most common reasons for executors being removed are: \\nAutoscaling: In this case it’s expected and not an error. See Enable autoscaling. \\nSpot instance losses: The cloud provider is reclaiming your VMs. You can learn more about Spot instances here. \\nExecutors running out of memory \\nFailing jobs\\nFailing jobs\\nIf you see any failing jobs click on them to get to their pages. Then scroll down to see the failed stage and a failure reason: \\nYou may get a generic error. Click on the link in the description to see if you can get more info: \\nIf you scroll down in this page, you will be able to see why each task failed. In this case it’s becoming clear there’s a memory issue:\\n\\nFailing executors\\nFailing executors\\nTo find out why your executors are failing, you’ll first want to check the compute’s Event log to see if there’s any explanation for why the executors failed. For example, it’s possible you’re using spot instances and the cloud provider is taking them back. \\nSee if there are any events explaining the loss of executors. For example you may see messages indicating that the cluster is resizing or spot instances are being lost. \\nIf you are using spot instances, see Losing spot instances. \\nIf your compute was resized with autoscaling, it’s expected and not an error. See Learn more about cluster resizing. \\nIf you don’t see any information in the event log, navigate back to the Spark UI then click the Executors tab: \\nHere you can get the logs from the failed executors:\\n\\nNext step\\nNext step\\nIf you’ve gotten this far, the likeliest explanation is a memory issue. The next step is to dig into memory issues. See Spark memory issues.\",\n",
" \"doc_uri\": \"https://docs.databricks.com/optimizations/spark-ui-guide/failing-spark-jobs.html\",\n",
" }\n",
" ],\n",
" \"expected_response\": \"1. Identify failing executors: In your Databricks workspace, navigate to the compute's Event log to check for any explanations regarding executor failures. Look for messages indicating spot instance losses or cluster resizing due to autoscaling. If using spot instances, refer to 'Losing spot instances' documentation. For autoscaling, refer to 'Learn more about cluster resizing' documentation.\\n\\n2. Check executor logs: If no information is found in the event log, go to the Spark UI and click the Executors tab. Here, you can access logs from failed executors to investigate further.\\n\\n3. Identify memory issues: If the above steps do not provide a clear reason for failing executors, it is likely a memory issue. To dig into memory issues, refer to the 'Spark memory issues' documentation.\\n\\n4. Resolve memory issues: To resolve memory issues, consider the following steps:\\n\\n a. Increase executor memory: Allocate more memory to executors by adjusting the 'spark.executor.memory' property in your Spark configuration.\\n\\n b. Increase driver memory: Allocate more memory to the driver by adjusting the 'spark.driver.memory' property in your Spark configuration.\\n\\n c. Use off-heap memory: Enable off-heap memory by setting the 'spark.memory.offHeap.enabled' property to 'true' and allocating off-heap memory using the 'spark.memory.offHeap.size' property.\\n\\n d. Optimize data processing: Review your data processing workflows and optimize them for memory efficiency. This may include reducing data shuffling, using broadcast variables, or caching data strategically.\\n\\n e. Monitor memory usage: Monitor memory usage in your Databricks workspace to identify potential memory leaks or inefficient memory utilization. Use tools like the Spark UI, Ganglia, or Grafana to monitor memory usage.\",\n",
" },\n",
"]\n",
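Request 5482 above is about how the `[ ]` operator handles an out-of-bounds array index under `spark.sql.ansi.enabled`. A minimal sketch of both settings, assuming a Databricks/PySpark notebook with a live SparkSession named `spark`:

```python
# Sketch only: out-of-bounds access with the [ ] operator under both ANSI settings.
# Assumes a live SparkSession named `spark`, as in a Databricks notebook.

spark.conf.set("spark.sql.ansi.enabled", "false")
# Index 5 is out of bounds for a 3-element array; with ANSI mode off this yields NULL.
spark.sql("SELECT array(10, 20, 30)[5] AS value").show()

spark.conf.set("spark.sql.ansi.enabled", "true")
# With ANSI mode on, the same query raises an INVALID_ARRAY_INDEX error at execution time.
try:
    spark.sql("SELECT array(10, 20, 30)[5] AS value").show()
except Exception as err:
    print(f"{type(err).__name__}: {err}")
```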
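Request 2112 explains why a stage can collapse into one long-running task (a window function without `PARTITION BY`, `repartition(1)`/`coalesce(1)`, unsplittable files, and so on). A hedged PySpark sketch of the window-function case; the sample DataFrame and its `user_id`/`amount` columns are illustrative assumptions:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative data; in practice this would be one of your own tables.
df = spark.createDataFrame(
    [(1, 10.0), (1, 25.0), (2, 7.5), (2, 12.0)], ["user_id", "amount"]
)

# Anti-pattern (shown only for contrast): no PARTITION BY, so every row funnels into a single task.
w_single_task = Window.orderBy("amount")

# Fix: partition the window so the work spreads across many tasks.
w_parallel = Window.partitionBy("user_id").orderBy("amount")
ranked = df.withColumn("rank", F.rank().over(w_parallel))

# Likewise, avoid repartition(1)/coalesce(1) on large outputs unless one file is truly required.
ranked.show()
```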

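Request 5054 covers the FLOAT type and its literal forms. A small sketch of the literals quoted from the docs page, assuming the same `spark` session and that the F-suffixed float literals behave as in Databricks SQL:

```python
# Sketch only: FLOAT literals, plus DECIMAL for exact base-10 values.
spark.sql("""
    SELECT
      +1F                       AS one,       -- 1.0
      5E10F                     AS sci,       -- 5E10
      -.1F                      AS neg_frac,  -- -0.1
      CAST(6.1 AS FLOAT)        AS approx,    -- base-2, may not be exact
      CAST(6.1 AS DECIMAL(3,1)) AS exact      -- exact fractional representation
""").show()
```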

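Request 2003 ends with the usual memory levers (executor and driver memory, off-heap). These are standard Spark properties, but on Databricks they are normally set in the cluster's Spark config rather than in code, so the builder below is only an illustrative sketch with made-up sizes:

```python
from pyspark.sql import SparkSession

# Sketch only: the memory-related properties named in the expected response.
# Sizes are illustrative; tune them against the workload and instance type.
spark = (
    SparkSession.builder
    .appName("memory-tuning-sketch")
    .config("spark.executor.memory", "8g")           # memory per executor
    .config("spark.driver.memory", "4g")             # memory for the driver
    .config("spark.memory.offHeap.enabled", "true")  # allow off-heap storage
    .config("spark.memory.offHeap.size", "2g")       # off-heap allocation
    .getOrCreate()
)
```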