
# Data Discovery AI

## Environment variables

In the root directory of the project, create a `.env` file.

Open the `.env` file and add the following line to include your API key:

```
API_KEY=your_actual_api_key_here
```
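For reference, here is a minimal sketch of reading this key in Python. It assumes the `python-dotenv` package is installed; the project may load its configuration differently:

```python
# Minimal sketch: load .env and read the API key.
# Assumes the python-dotenv package; the project may load configuration differently.
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current working directory
api_key = os.environ["API_KEY"]  # raises KeyError if the variable is missing
```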

## Run the API server with Docker

Run `./startServer.sh` to start the app. The script builds a Docker image and runs it for you.

The server will be available at http://localhost:8000.
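As a quick smoke test, the sketch below (standard library only) checks that the containerised server is reachable. The root path is illustrative and may return 404 if no route is mounted there:

```python
# Smoke test: confirm the server at http://localhost:8000 is reachable.
import urllib.error
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:8000", timeout=5) as resp:
        print("Server responded with HTTP", resp.status)
except urllib.error.HTTPError as exc:
    # An HTTP error response (e.g. 404) still proves the server is up.
    print("Server responded with HTTP", exc.code)
```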

## Run the API server for development

### Requirements

- Conda (recommended for creating a virtual environment)

1. Install Conda (if not already installed):

   Follow the instructions at [Conda Installation](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html).

2. Create the Conda virtual environment:

   ```bash
   conda env create -f environment.yml
   ```

## Dependency management

Poetry is used for dependency management. The `pyproject.toml` file is the most important file here: it defines the project and orchestrates its dependencies.

To add or remove dependencies, update `pyproject.toml` with:

```bash
poetry add <pypi-dependency-name>     # e.g. poetry add numpy
poetry remove <pypi-dependency-name>  # e.g. poetry remove numpy
```

After manually modifying `pyproject.toml`, refresh the `poetry.lock` file with the `poetry lock` command. To update all dependencies, use the `poetry update` command.
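For orientation, a Poetry-managed `pyproject.toml` carries dependency constraints like the fragment below; the names and versions are placeholders, not this project's actual manifest:

```toml
# Illustrative fragment only -- not this project's actual manifest.
[tool.poetry.dependencies]
python = "^3.10"   # placeholder Python constraint
numpy = "^1.26"    # example entry as written by `poetry add numpy`
```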

## Installation and Usage

1. Activate the Conda virtual environment:

   ```bash
   conda activate data-discovery-ai
   ```

2. Install environment dependencies:

   ```bash
   # after cloning the repo with the git clone command
   cd data-discovery-ai
   poetry install
   ```

3. Run the FastAPI server:

   ```bash
   poetry run uvicorn data_discovery_ai.server:app --reload --log-config=log_config.yaml
   ```

4. Run the tests:

   ```bash
   poetry run pytest
   ```
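If you want to add API-level tests, a minimal sketch using FastAPI's `TestClient` follows. The import path for `app` is taken from the uvicorn command above, while the route and asserted status codes are illustrative assumptions:

```python
# Minimal API test sketch using FastAPI's TestClient.
# The app import path comes from the uvicorn command above;
# the route and expected status codes are illustrative assumptions.
from fastapi.testclient import TestClient

from data_discovery_ai.server import app

client = TestClient(app)

def test_server_responds():
    response = client.get("/")  # hypothetical route
    assert response.status_code in (200, 404)
```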

## Code formatting

The command below is for manual checks; the checks are also executed automatically when you run `git commit`.

The configurations for the pre-commit hooks are defined in `.pre-commit-config.yaml`.

```bash
pre-commit run --all-files
```
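For orientation, a `.pre-commit-config.yaml` typically looks like the sketch below; the hooks and revisions shown are common placeholders, not necessarily the ones this repository uses:

```yaml
# Illustrative hooks only -- see the repository's .pre-commit-config.yaml
# for the actual configuration.
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
  - repo: https://github.com/psf/black
    rev: 24.3.0
    hooks:
      - id: black
```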

## Edge/systest/prod

Model names are strictly controlled.

Available options: `development`, `experimental`, `staging`, `production`, `benchmark`.

| Option | Purpose | Typical Use |
| --- | --- | --- |
| `development` | Dedicated to active model development, testing, and iteration. | Building and refining new model versions, features, or datasets. |
| `experimental` | Supports exploratory work for new techniques or fine-tuning. | Experimenting with new architectures, features, or hyperparameter tuning. |
| `staging` | Prepares the model for production with real-use evaluations. | Conducting final testing in a production-like environment to verify stability and performance. |
| `production` | Deployment environment for live model usage in real-world scenarios. | Running and monitoring models in active use by the API. |
| `benchmark` | Baseline model used to assess improvements or changes. | Comparing performance metrics against new models. |
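A strict-name check like the one described above could look like the following sketch (a hypothetical helper, not the repository's actual code):

```python
# Hypothetical sketch of the strict model-name validation described above.
AVAILABLE_OPTIONS = {"development", "experimental", "staging", "production", "benchmark"}

def validate_model_name(name: str) -> str:
    """Return the normalised model name, or raise if it is not an allowed option."""
    normalised = name.strip().lower()
    if normalised not in AVAILABLE_OPTIONS:
        raise ValueError(
            f"Unknown model name {name!r}; expected one of {sorted(AVAILABLE_OPTIONS)}"
        )
    return normalised
```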

### Development

The syntax is: /....

## File Structure

```
data_discovery_ai/
├── common/         # Common utilities, including shared configurations and constants, used across modules
├── model/          # Core ML logic, including model training, evaluation, and inference implementations
├── pipeline/       # Data pipelines for using ML models
├── resources/      # Stored assets such as pretrained models, sample datasets, and other resources required for model inference
├── services/       # Service modules providing service functions for API use
├── utils/          # Utility functions and helper scripts for various tasks
├── extras/         # Supplementary files
├── notebooks/      # Jupyter notebooks documenting the design, experiments, and practical usage of AI features
├── tests/          # Unit tests for critical functions
```

## Required Configuration Files

1. **Elasticsearch configuration file.** File name `esManager.ini`, saved under the folder `data_discovery_ai/common`. Specific fields & values required:

   1. `end_point`: the Elasticsearch endpoint of a deployment
   2. `api_key`: the API key used to access Elasticsearch

2. **Keyword classification parameter configuration file.** File name `keyword_classification_parameters.ini`, saved under the folder `data_discovery_ai/common`. Two sections are required: `preprocessor`, which sets the parameters used by the data preprocessing module, and `keywordModel`, which sets the parameters used for training and evaluating the keyword model. The fields are defined below, and illustrative copies of both files appear after the tables.

`preprocessor` section:

| Parameter | Definition | Default Value |
| --- | --- | --- |
| `vocabs` | Titles of the vocabularies used to identify samples from raw data; multiple values can be separated by `', '`. | AODN Instrument Vocabulary, AODN Discovery Parameter Vocabulary, AODN Platform Vocabulary |
| `rare_label_threshold` | The threshold for identifying a rare label, defined as the number of occurrences of the label across all sample records; should be an integer. | 10 |
| `test_size` | A floating-point number in the range [0, 1], indicating the proportion of the test set relative to all samples. | 0.2 |
| `n_splits` | Number of re-shuffling & splitting iterations for cross-validation, used as the `n_splits` parameter when initialising a `MultilabelStratifiedShuffleSplit` object. | 5 |
| `train_test_random_state` | The seed for splitting the train and test sets, used as the `random_state` parameter when initialising a `MultilabelStratifiedShuffleSplit` object. | 42 |
`keywordModel` section:

| Parameter | Definition | Default Value |
| --- | --- | --- |
| `dropout` | The probability of a neuron being dropped; a strategy used to avoid overfitting. | 0.3 |
| `learning_rate` | A hyperparameter that determines how much the model's parameters are adjusted with respect to the gradient of the loss function. | 0.001 |
| `fl_gamma` | The $\gamma$ parameter of the focal loss function, which adjusts the focus of the loss on hard-to-classify samples. Should be an integer. | 2 |
| `fl_alpha` | The $\alpha$ parameter of the focal loss function, which balances the importance of positive and negative samples. Should be a floating-point number between 0 and 1. | 0.7 |
| `epoch` | The number of times the training set is passed through the model during training. Should be an integer. | 100 |
| `batch` | The batch size, which defines the number of samples in each batch. | 32 |
| `validation_split` | The percentage of the training set to be used as the validation set. | 0.2 |
| `confidence` | The probability threshold for identifying a label as positive (value 1). | 0.5 |
| `top_N` | The number of labels to select using argmax(probability) if no labels reach the confidence threshold. | 2 |
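For reference, the two configuration files could look like the sketches below. The section header in `esManager.ini` and both of its values are assumptions/placeholders, while the `keyword_classification_parameters.ini` sketch simply mirrors the sections and default values documented above:

```ini
; esManager.ini -- illustrative layout; the section name is an assumption
; and both values are placeholders.
[elasticsearch]
end_point = https://your-deployment.es.example.com
api_key = your_elasticsearch_api_key_here
```

```ini
; keyword_classification_parameters.ini -- sections and defaults as documented above.
[preprocessor]
vocabs = AODN Instrument Vocabulary, AODN Discovery Parameter Vocabulary, AODN Platform Vocabulary
rare_label_threshold = 10
test_size = 0.2
n_splits = 5
train_test_random_state = 42

[keywordModel]
dropout = 0.3
learning_rate = 0.001
fl_gamma = 2
fl_alpha = 0.7
epoch = 100
batch = 32
validation_split = 0.2
confidence = 0.5
top_N = 2
```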