21583: Refactored inferFeatureAttributes to require source format argument for performance and additional attributes definition, MAJOR #20

Merged
merged 25 commits into from
Sep 26, 2024
25 commits
03c4ce2
Renaming aa directory
lancegliser Sep 24, 2024
7aff5a8
File renames
lancegliser Sep 24, 2024
2377ec0
@types/jest@^29.5.13
lancegliser Sep 24, 2024
dbafa11
Added FeatureAttributesIndex
lancegliser Sep 24, 2024
5297193
Added tests around inferFeatureAttributes for Array implementations
lancegliser Sep 24, 2024
cfb69a4
Pushed much of the inference code into InferFeatureAttributesBase
lancegliser Sep 24, 2024
7a18d6f
Moved InferFeatureAttributesBase to it's own file
lancegliser Sep 24, 2024
23f1df8
We're so not ready for recommendedTypeChecked
lancegliser Sep 24, 2024
047a5fa
Fixing some linting errors.
lancegliser Sep 24, 2024
c614e22
Refactored to use basic stats to generate more attributes
lancegliser Sep 25, 2024
1774365
Removed redudant double negation
lancegliser Sep 25, 2024
ca97cc6
Some additional bounds expectations
lancegliser Sep 25, 2024
eb35d10
Updated inferTime to return inferString for now.
lancegliser Sep 25, 2024
035f0fb
Removed the data_type: string from inferUnknown
lancegliser Sep 25, 2024
552811e
Added README details for Inferring feature attributes
lancegliser Sep 25, 2024
e1b1a1f
Added MIGRATION notes
lancegliser Sep 25, 2024
30b2c57
Added Worker suggestion
lancegliser Sep 25, 2024
c910ed8
Refactored infer method for performance and less side effects
lancegliser Sep 25, 2024
98a562b
Refactored inferFloat for performance
lancegliser Sep 25, 2024
f3decbe
Prevented condition in precision that lead to infinate loops
lancegliser Sep 25, 2024
9c3ac4b
Added a comment about upcoming support for inferTime
lancegliser Sep 26, 2024
46d8bd0
Updated tests to check original_type.data_type
lancegliser Sep 26, 2024
4f682e4
Added a case to ensure even supplied types get original_type.data_type
lancegliser Sep 26, 2024
a8f1c5b
Removed stray console.info
lancegliser Sep 26, 2024
b09498c
Updated infer's interior promises to mutate attributes directly
lancegliser Sep 26, 2024
50 changes: 50 additions & 0 deletions MIGRATION.md
@@ -0,0 +1,50 @@
# Migration notes

## 5.x

The `inferFeatureAttributes` function now requires a `sourceFormat` argument, and is strongly typed through a union.
Update your calls to include the appropriate argument:

Previous:

```ts
import { inferFeatureAttributes, type ArrayData } from "@howso/engine";

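// Example timestamps used by the sample rows (not part of the API).
const now = new Date();
const yesterday = new Date(now.getTime() - 24 * 60 * 60 * 1000);

const columns = ["id", "number", "date", "boolean"];
const data: ArrayData = {
  columns,
  data: [
    ["0", 1.2, yesterday.toISOString(), false],
    ["1", 2.4, now.toISOString(), true],
    ["3", 2.4, null, true],
    ["4", 5, now.toISOString(), true],
  ],
};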
const featureAttributes = await inferFeatureAttributes(data);
```

Required:

```ts
import { inferFeatureAttributes, type ArrayData } from "@howso/engine";

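// Example timestamps used by the sample rows (not part of the API).
const now = new Date();
const yesterday = new Date(now.getTime() - 24 * 60 * 60 * 1000);

const columns = ["id", "number", "date", "boolean"];
const data: ArrayData = {
  columns,
  data: [
    ["0", 1.2, yesterday.toISOString(), false],
    ["1", 2.4, now.toISOString(), true],
    ["3", 2.4, null, true],
    ["4", 5, now.toISOString(), true],
  ],
};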
const featureAttributes = await inferFeatureAttributes(data, "array");
```
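
One way a union-typed format argument can constrain callers is sketched below. This is not the package's actual signature: `SketchTable`, `SketchRows`, and `featureNamesSketch` are hypothetical stand-ins, and only the `"array"` and `"parsed_array"` literals are taken from this PR's `FeatureSourceFormat`. The stand-in shapes are also simplified.

```ts
// Hypothetical stand-ins for illustration; the real types live in @howso/engine.
type SketchTable = { columns: string[]; data: unknown[][] };
type SketchRows = Record<string, unknown>[];

// Each overload pairs one data shape with one format literal, so a mismatched
// pair such as (SketchRows, "array") is rejected at compile time.
function featureNamesSketch(data: SketchTable, sourceFormat: "array"): string[];
function featureNamesSketch(data: SketchRows, sourceFormat: "parsed_array"): string[];
function featureNamesSketch(
  data: SketchTable | SketchRows,
  sourceFormat: "array" | "parsed_array",
): string[] {
  if (sourceFormat === "array") {
    // The format literal says which shape to expect; no runtime probing is needed.
    return (data as SketchTable).columns;
  }
  const rows = data as SketchRows;
  return rows.length > 0 ? Object.keys(rows[0]) : [];
}

const namesFromTable = featureNamesSketch(
  { columns: ["id", "number"], data: [["0", 1.2]] },
  "array",
);
const namesFromRows = featureNamesSketch([{ id: "0", number: 1.2 }], "parsed_array");
```

Because the caller states the format, the implementation can skip shape detection, which appears to be the performance motivation named in the PR title.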

## 4.x

Initial public version.

## 1.x - 3.x

Internal versions.
38 changes: 37 additions & 1 deletion README.md
@@ -10,7 +10,43 @@ An interface surrounding `@howso/amalgam-lang` WASM to create a simplified clien
npm install @howso/engine
```

### Create a client using a Worker
### Inferring feature attributes

During trainee creation, you'll need to iterate on your data to describe its
[feature attributes](https://docs.howso.com/user_guide/basic_capabilities/feature_attributes.html).

This package supplies methods to assist with inference from data, either generically or directly through dedicated classes.
The primary entry point is through `inferFeatureAttributes`:

```ts
import { inferFeatureAttributes, type ArrayData } from "@howso/engine";

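// Example timestamps used by the sample rows (not part of the API).
const now = new Date();
const yesterday = new Date(now.getTime() - 24 * 60 * 60 * 1000);

const columns = ["id", "number", "date", "boolean"];
const data: ArrayData = {
  columns,
  data: [
    ["0", 1.2, yesterday.toISOString(), false],
    ["1", 2.4, now.toISOString(), true],
    ["3", 2.4, null, true],
    ["4", 5, now.toISOString(), true],
  ],
};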
const featureAttributes = await inferFeatureAttributes(data, "array");
```

If your data always arrives in the same source format, you may bypass this function and create and call a source handler directly.
For example, the data above could be used directly with the `InferFeatureAttributesFromArray` class:

```ts
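// Assumption: the class is exported from the package root alongside inferFeatureAttributes.
import { InferFeatureAttributesFromArray } from "@howso/engine";

const service = new InferFeatureAttributesFromArray(data);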
const features = await service.infer();
```

This process can be CPU intensive; consider running it in a web `Worker` when it executes in a user's browser.

### Using a client

#### Through a web Worker

```ts
import { AmalgamWasmService, initRuntime } from "@howso/amalgam-lang";
6 changes: 6 additions & 0 deletions eslint.config.mjs
@@ -9,6 +9,12 @@ export default tseslint.config(
...tseslint.configs.recommended,
eslintPluginPrettierRecommended,
{
languageOptions: {
parserOptions: {
projectService: true,
tsconfigRootDir: import.meta.dirname,
},
},
rules: {
"@typescript-eslint/no-empty-object-type": "warn",
"@typescript-eslint/no-explicit-any": "off",
21 changes: 21 additions & 0 deletions package-lock.json

Some generated files are not rendered by default.

1 change: 1 addition & 0 deletions package.json
@@ -39,6 +39,7 @@
"@rollup/plugin-typescript": "^11.1.6",
"@types/emscripten": "^1.39.10",
"@types/eslint__js": "^8.42.3",
"@types/jest": "^29.5.13",
"@types/node": "^18.15.2",
"@types/uuid": "^9.0.1",
"@typescript-eslint/parser": "^8.5.0",
163 changes: 5 additions & 158 deletions src/features/base.ts
@@ -1,4 +1,6 @@
import type { FeatureAttributes, FeatureOriginalType } from "../types";
import type { FeatureAttributesIndex } from "../types";

export type FeatureSourceFormat = "unknown" | "array" | "parsed_array";

export interface InferFeatureBoundsOptions {
tightBounds?: boolean | string[];
@@ -11,11 +13,12 @@ export interface InferFeatureTimeSeriesOptions {
}

export interface InferFeatureAttributesOptions {
defaults?: Record<string, FeatureAttributes>;
defaults?: FeatureAttributesIndex;
inferBounds?: boolean | InferFeatureBoundsOptions;
timeSeries?: InferFeatureTimeSeriesOptions;
ordinalFeatureValues?: Record<string, string[]>;
dependentFeatures?: Record<string, string[]>;
includeSample?: boolean;
}

export interface ArrayData<T = any, C extends string = string> {
@@ -36,159 +39,3 @@ export function isArrayData(data: any): data is ArrayData {
export function isParsedArrayData(data: any): data is ParsedArrayData {
return Array.isArray(data?.columns) && Array.isArray(data) && !Array.isArray(data[0]);
}
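
For callers who do not know their data's shape ahead of time, checks like the two guards above could be combined to choose a `FeatureSourceFormat` once, before calling the inference function. The sketch below is standalone and hedged: `detectSourceFormat` and both `looksLike...` helpers are hypothetical, and the array-data check is an assumption because the body of `isArrayData` is not visible in this hunk.

```ts
type SketchSourceFormat = "unknown" | "array" | "parsed_array";

// Assumed shape check: the body of isArrayData is not shown in this diff.
function looksLikeArrayData(value: any): boolean {
  return Array.isArray(value?.columns) && Array.isArray(value?.data);
}

// Mirrors the isParsedArrayData check shown above.
function looksLikeParsedArrayData(value: any): boolean {
  return Array.isArray(value?.columns) && Array.isArray(value) && !Array.isArray(value[0]);
}

// Probe the value once and report which format literal applies.
function detectSourceFormat(value: unknown): SketchSourceFormat {
  if (looksLikeArrayData(value)) return "array";
  if (looksLikeParsedArrayData(value)) return "parsed_array";
  return "unknown";
}

const tabular = { columns: ["id"], data: [["0"], ["1"]] };
// Per the guard above, parsed data is an array of row objects that also carries `columns`.
const parsedRows = Object.assign([{ id: "0" }, { id: "1" }], { columns: ["id"] });

const tabularFormat = detectSourceFormat(tabular); // "array"
const parsedFormat = detectSourceFormat(parsedRows); // "parsed_array"
const otherFormat = detectSourceFormat("not a table"); // "unknown"
```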

export abstract class InferFeatureAttributesBase {
/* Entrypoint */
public async infer(options: InferFeatureAttributesOptions = {}): Promise<Record<string, FeatureAttributes>> {
const attributes: Record<string, FeatureAttributes> = options.defaults || {};
const { ordinalFeatureValues = {}, dependentFeatures = {} } = options;
const columns = await this.getFeatureNames();

// Determine base feature attributes
for (let i = 0; i < columns.length; i++) {
const feature = columns[i];
if (feature in attributes && attributes[feature]?.type) {
// Attributes exist for feature, skip
continue;
}

const featureType = await this.getFeatureType(feature);

// Explicitly declared ordinals
if (feature in ordinalFeatureValues) {
attributes[feature] = {
type: "ordinal",
bounds: { allowed: ordinalFeatureValues[feature] },
};
} else if (featureType != null) {
switch (featureType.data_type) {
case "numeric":
attributes[feature] = await this.inferFloat(feature);
break;
case "integer":
attributes[feature] = await this.inferInteger(feature);
break;
case "string":
attributes[feature] = await this.inferString(feature);
break;
case "boolean":
attributes[feature] = await this.inferBoolean(feature);
break;
case "datetime":
attributes[feature] = await this.inferDatetime(feature);
break;
case "date":
attributes[feature] = await this.inferDate(feature);
break;
case "time":
attributes[feature] = await this.inferTime(feature);
break;
case "timedelta":
attributes[feature] = await this.inferTimedelta(feature);
break;
default:
attributes[feature] = await this.inferString(feature);
break;
}
} else {
attributes[feature] = await this.inferUnknown(feature);
}

// Add original type
if (featureType != null) {
attributes[feature].original_type = featureType;
}
}

// Determine feature properties
for (let i = 0; i < columns.length; i++) {
const feature = columns[i];

// Set unique flag
if (attributes[feature].unique == null && (await this.inferUnique(feature))) {
attributes[feature].unique = true;
}

// Add dependent features
if (attributes[feature].dependent_features == null && feature in dependentFeatures) {
attributes[feature].dependent_features = dependentFeatures[feature];
}

// Infer bounds
const { inferBounds = true } = options;
if (inferBounds && attributes[feature].bounds == null) {
const bounds = await this.inferBounds(
attributes[feature],
feature,
typeof inferBounds === "boolean" ? {} : inferBounds,
);
if (bounds != null) {
attributes[feature].bounds = bounds;
}
}

// Infer time series attributes
if (options.timeSeries && attributes[feature].time_series == null) {
// TODO - infer time series
}
}

return attributes;
}

/* Feature types */
protected abstract getFeatureType(featureName: string): Promise<FeatureOriginalType | undefined>;
protected abstract inferBoolean(featureName: string): Promise<FeatureAttributes>;
protected abstract inferTimedelta(featureName: string): Promise<FeatureAttributes>;
protected abstract inferDatetime(featureName: string): Promise<FeatureAttributes>;
protected abstract inferDate(featureName: string): Promise<FeatureAttributes>;
protected abstract inferTime(featureName: string): Promise<FeatureAttributes>;
protected abstract inferString(featureName: string): Promise<FeatureAttributes>;
protected abstract inferInteger(featureName: string): Promise<FeatureAttributes>;
protected abstract inferFloat(featureName: string): Promise<FeatureAttributes>;
protected async inferUnknown(
/* eslint-disable-next-line @typescript-eslint/no-unused-vars*/
featureName: string,
): Promise<FeatureAttributes> {
return { type: "nominal" };
}

/* Feature properties */
protected abstract inferUnique(featureName: string): Promise<boolean>;
public abstract inferBounds(
attributes: Readonly<FeatureAttributes>,
featureName: string,
options: InferFeatureBoundsOptions,
): Promise<FeatureAttributes["bounds"] | undefined>;
public abstract inferTimeSeries(
attributes: Readonly<FeatureAttributes>,
featureName: string,
options: InferFeatureTimeSeriesOptions,
): Promise<Partial<FeatureAttributes>>;

/* Descriptive operations */
public abstract getFeatureNames(): Promise<string[]>;
public abstract getNumCases(): Promise<number>;
public abstract getNumFeatures(): Promise<number>;
}
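
The precedence applied by the first loop of `infer` in the removed class above can be restated as a small synchronous sketch. Everything here is a simplified stand-in: `SketchAttrs`, `resolveFeature`, and the handler map are hypothetical, and the handler results are placeholders rather than what the real `inferFloat` or `inferString` return.

```ts
type SketchAttrs = { type?: string; bounds?: { allowed?: string[] } };
type SketchHandler = () => SketchAttrs;

function resolveFeature(
  feature: string,
  dataType: string | undefined,
  defaults: Record<string, SketchAttrs>,
  ordinals: Record<string, string[]>,
  handlers: Record<string, SketchHandler>,
): SketchAttrs {
  // 1. Caller-supplied attributes that already have a type are kept as-is.
  if (defaults[feature]?.type) return defaults[feature];
  // 2. Explicitly declared ordinals come next.
  if (feature in ordinals) return { type: "ordinal", bounds: { allowed: ordinals[feature] } };
  // 3. No detected type: fall back to nominal, as inferUnknown does.
  if (dataType == null) return { type: "nominal" };
  // 4. Dispatch on the detected type; unrecognized types use the string handler.
  const handler = handlers[dataType] ?? handlers["string"];
  return handler();
}

// Placeholder handlers, for illustration only.
const sketchHandlers: Record<string, SketchHandler> = {
  numeric: () => ({ type: "continuous" }),
  string: () => ({ type: "nominal" }),
};

const sizeAttrs = resolveFeature("size", "string", {}, { size: ["S", "M", "L"] }, sketchHandlers);
const priceAttrs = resolveFeature("price", "numeric", {}, {}, sketchHandlers);
const pinnedAttrs = resolveFeature("price", "numeric", { price: { type: "nominal" } }, {}, sketchHandlers);
const oddAttrs = resolveFeature("code", "uuid", {}, {}, sketchHandlers);
```

Here `sizeAttrs` is ordinal because the explicit declaration outranks the detected string type, and `pinnedAttrs` keeps the caller's nominal type even though the column was detected as numeric.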

export abstract class FeatureSerializerBase {
public abstract serialize(data: AbstractDataType, features: Record<string, FeatureAttributes>): Promise<any[][]>;
public abstract deserialize(
data: any[][],
columns: string[],
features?: Record<string, FeatureAttributes>,
): Promise<AbstractDataType>;

protected deserializeCell(value: any, attributes?: FeatureAttributes): any {
switch (attributes?.original_type?.data_type) {
case "date":
case "datetime":
if (typeof value === "string") {
return new Date(value);
}
break;
}
return value;
}
}
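
The behavior of `deserializeCell` above can be exercised in isolation with a standalone copy. `SketchCellAttributes` and `reviveCell` are hypothetical names; the logic follows the method shown, which revives string values into `Date` objects only when the feature's original type is `date` or `datetime`.

```ts
type SketchCellAttributes = { original_type?: { data_type: string } };

function reviveCell(value: unknown, attributes?: SketchCellAttributes): unknown {
  switch (attributes?.original_type?.data_type) {
    case "date":
    case "datetime":
      // Only strings are converted; numbers, nulls, and Dates pass through.
      if (typeof value === "string") {
        return new Date(value);
      }
      break;
  }
  return value;
}

const isoText = "2024-09-26T00:00:00.000Z";
const revived = reviveCell(isoText, { original_type: { data_type: "datetime" } });
const untouched = reviveCell(isoText, { original_type: { data_type: "string" } });
const missingValue = reviveCell(null, { original_type: { data_type: "date" } });
```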
1 change: 1 addition & 0 deletions src/features/index.ts
@@ -1,3 +1,4 @@
export * from "./base";
export * from "./infer";
export * from "./serializer";
export * from "./sources";
40 changes: 40 additions & 0 deletions src/features/infer.test.ts
@@ -0,0 +1,40 @@
import { FeatureAttributes, FeatureAttributesIndex } from "../types";

it.todo("TODO implement generic infer tests");

export const expectFeatureAttributesIndex = (index: FeatureAttributesIndex | undefined) => {
if (!index) {
throw new Error("index is undefined");
}
Object.entries(index).forEach(([key, attributes]) => {
expect(typeof key).toBe("string");
expectFeatureAttributes(attributes);
});
};

export const expectFeatureAttributes = (attributes: FeatureAttributes | undefined) => {
if (!attributes) {
throw new Error("attributes is undefined");
}

expect(attributes.type).toBeTruthy();
expect(attributes.data_type).toBeTruthy();
expectFeatureAttributeBounds(attributes);
// TODO expand on this testing
};

const expectFeatureAttributeBounds = (attributes: FeatureAttributes) => {
if (!attributes.bounds) {
return;
}

expect(typeof attributes.bounds.allow_null).toBe("boolean");
if (attributes.bounds.min != null && attributes.bounds.max != null) {
if (typeof attributes.bounds.min === "number") {
expect(attributes.bounds.min).toBeLessThan(attributes.bounds.max);
}
if (attributes.data_type === "formatted_date_time") {
expect(new Date(attributes.bounds.min).getTime()).toBeLessThan(new Date(attributes.bounds.max).getTime());
}
}
};
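
The ordering expectations in `expectFeatureAttributeBounds` can also be written as a plain predicate that runs without Jest. This is a sketch under assumptions: `SketchBounds` and `boundsAreOrdered` are hypothetical, and the `isDateTime` flag stands in for the `data_type === "formatted_date_time"` check above.

```ts
type SketchBounds = { min?: number | string; max?: number | string };

function boundsAreOrdered(bounds: SketchBounds, isDateTime: boolean): boolean {
  const { min, max } = bounds;
  // Nothing to compare when either side is absent. A bound of 0 still counts as present.
  if (min == null || max == null) return true;
  if (isDateTime) {
    // Compare date bounds by their epoch milliseconds.
    return new Date(min).getTime() < new Date(max).getTime();
  }
  if (typeof min === "number" && typeof max === "number") {
    return min < max;
  }
  return true; // Other bound types are not checked in this sketch.
}

const numericOk = boundsAreOrdered({ min: 0, max: 5 }, false);
const numericBad = boundsAreOrdered({ min: 5, max: 0 }, false);
const datesOk = boundsAreOrdered(
  { min: "2024-09-25T00:00:00.000Z", max: "2024-09-26T00:00:00.000Z" },
  true,
);
```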