20490: Unifies .nas and .nan under null keyword, MAJOR (#155)

.nas and .nan have been removed and replaced by (null).

Co-authored-by: Cade Mack <[email protected]>
howsoai · Jun 25, 2024 · 7d51dea
1 parent 72bd910
Showing 30 changed files with 522 additions and 623 deletions.
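
As a rough illustration of what the unification means for callers, here is a minimal sketch in which one null test covers an explicit null, a NaN number, and the not-a-string id. The names `ValueType`, `ImmediateValue`, `kNotAStringId`, `IsNull`, and `AreEqual` are hypothetical stand-ins invented for this example. They are not Amalgam's actual `EvaluableNodeImmediateValue` API, and the real implementation may differ.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>

// Hypothetical, simplified stand-ins for illustration only; these are NOT
// Amalgam's actual EvaluableNodeImmediateValue types.
enum class ValueType { Null, Number, StringId };

// Assumed sentinel id meaning "not a string".
constexpr std::size_t kNotAStringId = 0;

struct ImmediateValue
{
	ValueType type;
	double number;
	std::size_t stringId;
};

// Single null test: an explicit null, a NaN number, and the not-a-string id
// are all treated as the same null value.
inline bool IsNull(const ImmediateValue &v)
{
	switch(v.type)
	{
	case ValueType::Null:     return true;
	case ValueType::Number:   return std::isnan(v.number);
	case ValueType::StringId: return v.stringId == kNotAStringId;
	}
	return false;
}

// Equality in which two nulls compare equal, mirroring the updated
// description of "=" in docs/language.js.
inline bool AreEqual(const ImmediateValue &a, const ImmediateValue &b)
{
	bool a_null = IsNull(a);
	bool b_null = IsNull(b);
	if(a_null || b_null)
		return a_null && b_null;
	if(a.type != b.type)
		return false;
	if(a.type == ValueType::Number)
		return a.number == b.number;
	return a.stringId == b.stringId;
}
```

Under these assumptions, a value built from `std::numeric_limits<double>::quiet_NaN()` and a value holding `kNotAStringId` both satisfy `IsNull` and compare equal to each other, which is the behavior the commit message describes for the former `.nan` and `.nas` literals.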
diff --git a/docs/index.html b/docs/index.html
@@ -352,7 +352,7 @@ <h2>Language syntax</h2>
 	(call (retrieve_from_contained_entity "MyLibrary" "MyFunction") (assoc parameter_a 1 parameter_b 2))
 	</span>
 	<p>
-	Numbers are represented via numbers, as well as ".", "-", and "e" for base-ten exponents.  Further, infinity and negative infinity are represented as ".infinity" and "-.infinity" respectively, and not-a-number is represented as ".nan".  The special null value with string type, not-a-string, is represented as ".nas".
+	Numbers are represented via numbers, as well as ".", "-", and "e" for base-ten exponents.  Further, infinity and negative infinity are represented as ".infinity" and "-.infinity" respectively.  Not-a-number and non-string results are represented via the opcode (null).
 	<p>
 	All regular expressions are EMCA-standard regular expressions.  See <a href="https://en.cppreference.com/w/cpp/regex/ecmascript">https://en.cppreference.com/w/cpp/regex/ecmascript</a> or <a href="https://262.ecma-international.org/5.1/#sec-15.10">https://262.ecma-international.org/5.1/#sec-15.10</a> for further details on syntax.
 

diff --git a/docs/language.js b/docs/language.js
@@ -408,7 +408,7 @@ var data = [
 	{
 		"parameter" : "generalized_distance list|assoc|number weights list|assoc distance_types list|assoc attributes list|assoc|number deviations number p_value list|assoc|* vector1 [list|assoc|* vector2] [list value_names] [bool surprisal_space]",
 		"output" : "number",
-		"description" : "Computes the generalized norm between vector1 and vector2 (or an equivalent zero vector if unspecified) with parameter specified by the p_value (2 being Euclidian distance), using the numerical distance or edit distance as appropriate.  The parameter value_names, if specified as a list of the names of the values, will transform via unzipping any assoc into a list for the respective parameter in the order of the value_names, or if a number will use the number repeatedly for every element.  weights is a list of dimension weights to use for the query, each value mapping to its respective element in the vectors.  If weights is null, then it will assume that the weights are 1 and additionally will ignore null values for the vectors instead of treating them as unknown differences.  The parameter distance_types is either a list strings or an assoc of strings indicating the type of distance for each feature.  Allowed values are \"nominal_numeric\", \"nominal_string\", \"nominal_code\", \"continuous_numeric\", \"continuous_numeric_cyclic\", \"continuous_string\", and \"continuous_code\".  Nominals evaluate whether the two values are the same and continuous evaluates the difference between the two values.  The numeric, string, or code modifier specifies how the difference is measured, and cyclic means it is a difference that wraps around.  \nFor attributes, the particular distance_types specifies what particular attributes are expected.  For a nominal distance_type, a number indicates the nominal count, whereas null will infer from the values given.  Cyclic requires a single value, which is the upper bound of the difference for the cycle range (e.g., if the value is 360, then the supremum difference between two values will be 360, leading 1 and 359 to have a difference of 2).\n  Deviations are used during distance calculation to specify uncertainty per-element, the minimum difference between two values prior to exponentiation.  
Specifying null as a deviation is equivalent to setting each deviation to 0.  Each deviation for each feature can be a single value or a list.  If it is a single value, that value is used as the deviation and differences and deviations for null values will automatically computed from the data based on the maximum difference.  If a deviation is provided as a list, then the first value is the deviation, the second value is the difference to use when one of the values being compared is null, and the third value is the difference to use when both of the values are null.  If the third value is omitted, it will use the second value for both.  If both of the null values are omitted, then it will compute the maximum difference and use that for both.  For nominal types, the value for each feature can be a numeric deviation, an assoc, or a list.  If the value is an assoc it specifies deviation information, where each key of the assoc is the nominal value, and each value of the assoc can be a numeric deviation value, a list, or an assoc, with the list specifying either an assoc followed optionally by the default deviation.  This inner assoc, regardless of whether it is in a list, maps the value to each actual value's deviation.   If any vector value is null or evaluates to nan, or any of the differences between vector1 and vector2 evaluate to null or nan, then it will compute a corresponding maximum distance value based on the properties of the feature.  If surprisal space is true, which defaults to false, it will perform all computations in surprisal space.",
+		"description" : "Computes the generalized norm between vector1 and vector2 (or an equivalent zero vector if unspecified) with parameter specified by the p_value (2 being Euclidian distance), using the numerical distance or edit distance as appropriate.  The parameter value_names, if specified as a list of the names of the values, will transform via unzipping any assoc into a list for the respective parameter in the order of the value_names, or if a number will use the number repeatedly for every element.  weights is a list of dimension weights to use for the query, each value mapping to its respective element in the vectors.  If weights is null, then it will assume that the weights are 1 and additionally will ignore null values for the vectors instead of treating them as unknown differences.  The parameter distance_types is either a list strings or an assoc of strings indicating the type of distance for each feature.  Allowed values are \"nominal_numeric\", \"nominal_string\", \"nominal_code\", \"continuous_numeric\", \"continuous_numeric_cyclic\", \"continuous_string\", and \"continuous_code\".  Nominals evaluate whether the two values are the same and continuous evaluates the difference between the two values.  The numeric, string, or code modifier specifies how the difference is measured, and cyclic means it is a difference that wraps around.  \nFor attributes, the particular distance_types specifies what particular attributes are expected.  For a nominal distance_type, a number indicates the nominal count, whereas null will infer from the values given.  Cyclic requires a single value, which is the upper bound of the difference for the cycle range (e.g., if the value is 360, then the supremum difference between two values will be 360, leading 1 and 359 to have a difference of 2).\n  Deviations are used during distance calculation to specify uncertainty per-element, the minimum difference between two values prior to exponentiation.  
Specifying null as a deviation is equivalent to setting each deviation to 0.  Each deviation for each feature can be a single value or a list.  If it is a single value, that value is used as the deviation and differences and deviations for null values will automatically computed from the data based on the maximum difference.  If a deviation is provided as a list, then the first value is the deviation, the second value is the difference to use when one of the values being compared is null, and the third value is the difference to use when both of the values are null.  If the third value is omitted, it will use the second value for both.  If both of the null values are omitted, then it will compute the maximum difference and use that for both.  For nominal types, the value for each feature can be a numeric deviation, an assoc, or a list.  If the value is an assoc it specifies deviation information, where each key of the assoc is the nominal value, and each value of the assoc can be a numeric deviation value, a list, or an assoc, with the list specifying either an assoc followed optionally by the default deviation.  This inner assoc, regardless of whether it is in a list, maps the value to each actual value's deviation.   If any vector value is null or any of the differences between vector1 and vector2 evaluate to null, then it will compute a corresponding maximum distance value based on the properties of the feature.  If surprisal space is true, which defaults to false, it will perform all computations in surprisal space.",
 		"example" : "(print (generalized_distance 0.01 (null) (null) (list null (list 0 360)) (list 0.5 0.0) (list 0 2 3) (list 1 2 3)))\n(print (generalized_distance 0.01 (list 0.25 0.25 0.5) (null) (null) (null) (list 1 2 3) (list 0 2 3) ))\n(generalized_distance 1 (list 0.3333 0.3333 0.3333) (list 5 0) (null) (null) (list 1 2 3) (list 10 2 10) )"
 	},
 
@@ -498,7 +498,7 @@ var data = [
 		"new value" : "new",
 		"concurrency" : true,
 		"new target scope": true,
-		"description" : "For each element in the collection, pushes a new target scope onto the stack, so that current_value accesses the element in the list and current_index accesses the list or assoc index, with target representing the original list or assoc, and evaluates the function.  If function evaluates to true, then the element is put in a new list or assoc (matching the input type) that is returned.  If function is omitted, then it will remove any elements in the collection that are null, .nan, or .nas string.",
+		"description" : "For each element in the collection, pushes a new target scope onto the stack, so that current_value accesses the element in the list and current_index accesses the list or assoc index, with target representing the original list or assoc, and evaluates the function.  If function evaluates to true, then the element is put in a new list or assoc (matching the input type) that is returned.  If function is omitted, then it will remove any elements in the collection that are null.",
 		"example" : "(print (filter (lambda (> (current_value) 2)) (list 1 2 3 4)))"
 	},
 
@@ -732,7 +732,7 @@ var data = [
 		"output" : "bool",
 		"new value" : "new",
 		"concurrency" : true,
-		"description" : "Evaluates to true if all values are equal (will recurse into data structures), false otherwise. Values of nan (not a number) are considered equal because they represent the same node, unlike many other floating point representation systems.",
+		"description" : "Evaluates to true if all values are equal (will recurse into data structures), false otherwise. Values of null are considered equal.",
 		"example" : "(print (= 4 4 5))\n(print (= 4 4 4))"
 	},
 
@@ -809,7 +809,7 @@ var data = [
 	{
 		"parameter" : "weighted_rand [list of lists|assoc weighted_values] [number number_to_generate] [bool unique]",
 		"output" : "*",
-		"description" : "Each entity has its own random stream, and if called from a sandbox, then it uses a new stream without interrupting the stream of the calling entity. If the parameter is a list, it will uniformly randomly choose and evaluate to one element of the list. If an assoc, then it will randomly evaluate to one of the keys using the values as the weights for the probabilities.  Nans and negative numbers are treated as zero.  Infinities are normalized as to only select from infinities in the list.  If all values are 0, then they are normalized to having the same weight. If a list of lists, it will use the first list as a list of values and the second list as a list of weights and otherwise work like it would for an assoc.  If  number_to_generate is specified, it will generate a list of multiple values (even if  number_to_generate is 1).  If unique is true (it defaults to false), then it will only return unique values, the same as selecting from the list or assoc without replacement.",
+		"description" : "Each entity has its own random stream, and if called from a sandbox, then it uses a new stream without interrupting the stream of the calling entity. If the parameter is a list, it will uniformly randomly choose and evaluate to one element of the list. If an assoc, then it will randomly evaluate to one of the keys using the values as the weights for the probabilities.  Nulls and negative numbers are treated as zero.  Infinities are normalized as to only select from infinities in the list.  If all values are 0, then they are normalized to having the same weight. If a list of lists, it will use the first list as a list of values and the second list as a list of weights and otherwise work like it would for an assoc.  If  number_to_generate is specified, it will generate a list of multiple values (even if  number_to_generate is 1).  If unique is true (it defaults to false), then it will only return unique values, the same as selecting from the list or assoc without replacement.",
 		"example" : "(print (rand (list (list 1 2 4 5 7) (list 0.2 0.2 0.1 0.1 0.4))))\n(print (rand (assoc \"a\" 1 \"b\" 3))\n(print (rand (assoc \"a\" .25 \"b\" .75)) \"\\n\")\n(print (rand (assoc \"a\" .25 \"b\" .75) 4) \"\\n\")\n(print (rand (range 0 10) 10 (true)) \"\\n\")"
 	},
 
@@ -1549,7 +1549,7 @@ var data = [
 		"parameter" : "query_mode string label_name [string weight_label_name] [bool numeric]",
 		"output" : "query",
 		"new value" : "new",
-		"description" : "When used as a query argument, finds the statistical mode of label_name for numerical data.  If weight_label_name is specified, it will find the weighted mode.  If numeric is true, its default, then it will treat all values as numeric, otherwise it will treat them all as strings.  If numeric and no numeric mode exists, it will return .nan, but if string and no string mode exists, it will return null.",
+		"description" : "When used as a query argument, finds the statistical mode of label_name for numerical data.  If weight_label_name is specified, it will find the weighted mode.  If numeric is true, its default, then it will treat all values as numeric, otherwise it will treat them all as strings.  If numeric and no numeric mode exists, it will return (null), but if string and no string mode exists, it will return null.",
 		"example" : "(compute_on_contained_entities \"TestEntity\" (list\n (query_mode \"TargetLabel\")\n))"
 	},
 

diff --git a/src/Amalgam/GeneralizedDistance.h b/src/Amalgam/GeneralizedDistance.h
@@ -373,8 +373,8 @@ class GeneralizedDistanceEvaluator
 	__forceinline double ComputeDistanceTermNominal(EvaluableNodeImmediateValue a, EvaluableNodeImmediateValue b,
 		EvaluableNodeImmediateValueType a_type, EvaluableNodeImmediateValueType b_type, size_t index, bool high_accuracy)
 	{
-		bool a_is_null = EvaluableNodeImmediateValue::IsNullEquivalent(a_type, a);
-		bool b_is_null = EvaluableNodeImmediateValue::IsNullEquivalent(b_type, b);
+		bool a_is_null = EvaluableNodeImmediateValue::IsNull(a_type, a);
+		bool b_is_null = EvaluableNodeImmediateValue::IsNull(b_type, b);
 		if(a_is_null && b_is_null)
 			return ComputeDistanceTermUnknownToUnknown(index, high_accuracy);
 
@@ -754,8 +754,8 @@ class GeneralizedDistanceEvaluator
 	__forceinline double LookupNullDistanceTerm(EvaluableNodeImmediateValue a, EvaluableNodeImmediateValue b,
 		EvaluableNodeImmediateValueType a_type, EvaluableNodeImmediateValueType b_type, size_t index, bool high_accuracy)
 	{
-		bool a_unknown = EvaluableNodeImmediateValue::IsNullEquivalent(a_type, a);
-		bool b_unknown = EvaluableNodeImmediateValue::IsNullEquivalent(b_type, b);
+		bool a_unknown = EvaluableNodeImmediateValue::IsNull(a_type, a);
+		bool b_unknown = EvaluableNodeImmediateValue::IsNull(b_type, b);
 		if(a_unknown && b_unknown)
 			return ComputeDistanceTermUnknownToUnknown(index, high_accuracy);
 		if(a_unknown || b_unknown)
@@ -1245,9 +1245,9 @@ class RepeatedGeneralizedDistanceEvaluator
 					return distEvaluator->ComputeDistanceTermNominalUniversallySymmetricExactMatch(index, high_accuracy);
 			}
 
-			if(EvaluableNodeImmediateValue::IsNullEquivalent(other_type, other_value))
+			if(EvaluableNodeImmediateValue::IsNull(other_type, other_value))
 			{
-				if(feature_data.targetValue.IsNullEquivalent())
+				if(feature_data.targetValue.IsNull())
 					return distEvaluator->ComputeDistanceTermUnknownToUnknown(index, high_accuracy);
 				else
 					return distEvaluator->ComputeDistanceTermKnownToUnknown(index, high_accuracy);
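
The hunks above pick between an unknown-to-unknown term and a known-to-unknown term once either value is null. The following is a minimal sketch of that selection, combined with the deviation list layout described in docs/language.js (deviation, then the one-null difference, then the both-null difference). `FeatureNullTerms`, `MakeNullTerms`, and `DistanceTerm` are hypothetical names for this example, null is modeled as NaN purely for brevity, and weights, deviations, and exponentiation are left out, so this is not the actual `GeneralizedDistanceEvaluator` logic.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical per-feature settings, for illustration only.
struct FeatureNullTerms
{
	double knownToUnknown;    // difference used when exactly one value is null
	double unknownToUnknown;  // difference used when both values are null
};

// Builds the null terms from a deviation list laid out as
// [deviation, one-null difference, both-null difference].
// If the third entry is omitted, the second is reused for both cases.
// If both are omitted, max_difference (assumed precomputed) is used.
inline FeatureNullTerms MakeNullTerms(const std::vector<double> &deviation_list, double max_difference)
{
	FeatureNullTerms terms{max_difference, max_difference};
	if(deviation_list.size() >= 2)
	{
		terms.knownToUnknown = deviation_list[1];
		terms.unknownToUnknown = deviation_list[1];
	}
	if(deviation_list.size() >= 3)
		terms.unknownToUnknown = deviation_list[2];
	return terms;
}

// Selects the distance term for one feature. Null is modeled as NaN here
// (an assumption of this sketch); known values use the absolute difference.
inline double DistanceTerm(double a, double b, const FeatureNullTerms &terms)
{
	bool a_unknown = std::isnan(a);
	bool b_unknown = std::isnan(b);
	if(a_unknown && b_unknown)
		return terms.unknownToUnknown;
	if(a_unknown || b_unknown)
		return terms.knownToUnknown;
	return std::fabs(a - b);
}
```

With a single null predicate, the branch order matters only in that the both-null case is checked before the one-null case, as in the hunks above.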

diff --git a/src/Amalgam/Opcodes.cpp b/src/Amalgam/Opcodes.cpp
@@ -10,7 +10,7 @@ void StringInternPool::InitializeStaticStrings()
 	stringToID.reserve(numStaticStrings);
 	idToStringAndRefCount.resize(numStaticStrings);
 
-	EmplaceStaticString(ENBISI_NOT_A_STRING, ".nas");
+	EmplaceStaticString(ENBISI_NOT_A_STRING, "(null)");
 	EmplaceStaticString(ENBISI_EMPTY_STRING, "");
 
 
@@ -293,7 +293,6 @@ void StringInternPool::InitializeStaticStrings()
 	//end opcodes
 
 	//built-in common values
-	EmplaceStaticString(ENBISI_nan, ".nan");
 	EmplaceStaticString(ENBISI_infinity, ".infinity");
 	EmplaceStaticString(ENBISI_neg_infinity, "-.infinity");
 	EmplaceStaticString(ENBISI_zero, "0");

diff --git a/src/Amalgam/Opcodes.h b/src/Amalgam/Opcodes.h
@@ -509,9 +509,7 @@ enum EvaluableNodeBuiltInStringId
 	//leave space for ENT_ opcodes, start at the end
 
 	//built-in common values
-	ENBISI_nas = NUM_VALID_ENT_OPCODES + NUM_ENBISI_SPECIAL_STRING_IDS,
-	ENBISI_nan,
-	ENBISI_infinity,
+	ENBISI_infinity = NUM_VALID_ENT_OPCODES + NUM_ENBISI_SPECIAL_STRING_IDS,
 	ENBISI_neg_infinity,
 	ENBISI_zero,
 	ENBISI_one,

diff --git a/src/Amalgam/Parser.cpp b/src/Amalgam/Parser.cpp
@@ -510,18 +510,10 @@ EvaluableNode *Parser::GetNextToken(EvaluableNode *parent_node, EvaluableNode *n
 
 		//check for special values
 		double value = 0.0;
-		if(s == ".nas")
-		{
-			new_token->SetType(ENT_STRING, evaluableNodeManager, false);
-			new_token->SetStringID(StringInternPool::NOT_A_STRING_ID);
-			return new_token;
-		}
 		if(s == ".infinity")
 			value = std::numeric_limits<double>::infinity();
 		else if(s == "-.infinity")
 			value = -std::numeric_limits<double>::infinity();
-		else if(s == ".nan")
-			value = std::numeric_limits<double>::quiet_NaN();
 		else
 		{
 			auto [converted_value, success] = Platform_StringToNumber(s);
@@ -598,6 +590,29 @@ EvaluableNode *Parser::ParseNextBlock()
 			}
 			else if(cur_node->IsAssociativeArray())
 			{
+				//if it's not an immediate value, then need to retrieve closing parenthesis
+				if(!IsEvaluableNodeTypeImmediate(n->GetType()))
+				{
+					SkipWhitespaceAndAccumulateAttributes(n);
+					if(pos <= code->size())
+					{
+						auto cur_char = (*code)[pos];
+						if(cur_char == ')')
+						{
+							pos++;
+							numOpenParenthesis--;
+						}
+						else
+						{
+							std::cerr << "Warning: " << "Missing ) at line " << lineNumber + 1 << " of " << originalSource << std::endl;
+						}
+					}
+					else //no more code
+					{
+						std::cerr << "Warning: " << "Mismatched ) at line " << lineNumber + 1 << " of " << originalSource << std::endl;
+					}
+				}
+
 				//n is the id, so need to get the next token
 				StringInternPool::StringID index_sid = EvaluableNode::ToStringIDTakingReferenceAndClearing(n);
 
@@ -863,7 +878,7 @@ void Parser::Unparse(UnparseData &upd, EvaluableNode *tree, EvaluableNode *paren
 			auto sid = tree->GetStringIDReference();
 			if(sid == string_intern_pool.NOT_A_STRING_ID)
 			{
-				upd.result.append(".nas");
+				upd.result.append("(null)");
 			}
 			else //legitimate string
 			{