Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Database action changes. #1062

Closed

Conversation

nupur-khare
Copy link
Contributor

  1. Separated Metadata from cognition data.
  2. Validated metadata related to cognition data.
  3. Added unit and integration test cases.
  4. Fixed unit and integration test cases.

1. Separated Metadata from cognition data.
2. Validated metadata related to cognition data.
3. Added unit and integration test cases.
4. Fixed unit and integration test cases.
Copy link
Contributor

@udit-pandey udit-pandey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create a CognitionDataProcessor in shared and move all related code to that pkg

Utility.retrieve_data(data, metadata_item.dict())
content_type = values.get("content_type")
if isinstance(data, dict) and content_type != CognitionDataType.json.value:
raise ValueError("data of type dict is required if content type is json")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

??

content_type = values.get("content_type")
if isinstance(data, dict) and content_type != CognitionDataType.json.value:
raise ValueError("data of type dict is required if content type is json")
if (not isinstance(data, dict) and Utility.check_empty_string(data)) or (isinstance(data, dict) and data == {}):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not data or (isinstance(data, str) and Utility.check_empty_string(data))

from kairon.shared.utils import Utility

if isinstance(self.data, dict) and self.content_type != CognitionDataType.json.value:
raise ValidationError("data of type dict is required if content type is json")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

??


if isinstance(self.data, dict) and self.content_type != CognitionDataType.json.value:
raise ValidationError("data of type dict is required if content type is json")
if (not isinstance(self.data, dict) and Utility.check_empty_string(self.data)) or (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not data or (isinstance(data, str) and Utility.check_empty_string(data))

:raises: AppException
"""

collections = self.list_collection(bot)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to list_cognition_collections

@@ -895,6 +917,7 @@ class BotSettings(Auditlog):
data_importer_limit_per_day = IntField(default=5)
multilingual_limit_per_day = IntField(default=2)
data_generation_limit_per_day = IntField(default=3)
collection_limit = IntField(default=5)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cognition_collections_limit = 3
cognition_columns_per_collection_limit = 5
we must have both

def save_cognition_schema(self, metadata: Dict, user: Text, bot: Text):
column_name = [meta['column_name'] for meta in metadata.get('metadata')]
for column in column_name:
Utility.is_exist(CognitionSchema, bot=bot, metadata__column_name=column,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

metadata__column_name__in=column

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add validation for duplicate collection also

else:
return data[column_name]

@staticmethod
def get_embeddings_and_payload_data(data: Any, metadata: Dict):
def find_matching_metadata(data: Any, metadata: List, collection: Text = None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be a simple mongo query where you find schema using column name

matched_metadata = Utility.find_matching_metadata(data, metadata, collection)
if not matched_metadata:
raise AppException("Metadata related to payload not found!")
results = [metadata_entry for metadata_dict in matched_metadata for metadata_entry in metadata_dict['metadata']]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

matched_metadata will not be dict?

@@ -52,10 +55,11 @@ async def train(self, *args, **kwargs) -> Dict:
await self.__create_collection__(collection)
for content in tqdm(collections['content'], desc="Training FAQ"):
if content['content_type'] == CognitionDataType.json.value:
if not content['metadata'] or []:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

json will always have metadata


collections = self.list_cognition_collections(bot)
doc_count = CognitionData.objects(
bot=bot, collection__ne=None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why collection__ne=None?


def save_content(self, content: Text, user: Text, bot: Text, collection: Text = None):
if collection:
if self.is_collection_limit_exceeded(bot, collection):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this check here?
we must check if collection is created yet!!

if self.is_collection_limit_exceeded(bot, collection):
raise AppException('Collection limit exceeded!')
bot_settings = MongoProcessor.get_bot_settings(bot=bot, user=user)
if not bot_settings["llm_settings"]['enable_faq']:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this not the first validation in this method?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

validation column data

if len(content.split()) < 10:
raise AppException("Content should contain atleast 10 words.")

Utility.is_exist(CognitionData, bot=bot, id__ne=content_id, data=content,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check if collection exists!
validation column data

exp_message="Column already exists!")
metadata_obj = CognitionSchema(bot=bot, user=user)
metadata_obj.metadata = [ColumnMetadata(**meta) for meta in metadata.get('metadata')]
metadata_obj.collection_name = metadata.get('collection_name', None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None??

Utility.is_exist(CognitionSchema, bot=bot, collection_name=metadata.get('collection_name'),
exp_message="Collection already exists!")
column_name = [meta['column_name'] for meta in metadata.get('metadata')]
for column in column_name:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

???????????


def update_cognition_schema(self, metadata_id: str, metadata: Dict, user: Text, bot: Text):
metadata_items = metadata.get('metadata')
Utility.is_exist(CognitionSchema, bot=bot, id__ne=metadata_id, metadata=metadata_items,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

yield final_data

def __validate_metadata_and_payload(self, payload, bot: Text):
data = payload.get('data')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

??

async def save_bot_text(
text: TextData,
@router.get("/text/faq", response_model=Response)
async def get_text(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this not already covered by list_cognition_data api?


@router.get("/text/faq/collection", response_model=Response)
async def list_collection(
current_user: User = Security(Authentication.get_current_user_and_bot, scopes=DESIGNER_ACCESS),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whats the use when we already have list_cognition_schema ?

raise ValidationError("query type is required")
if not self.value or self.value is None:
raise ValidationError("query value is required")
# class DbOperation(EmbeddedDocument):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

??

@@ -189,7 +189,7 @@ def validate(self, clean=True):
class DatabaseAction(Auditlog):
name = StringField(required=True)
collection = StringField(required=True)
query = EmbeddedDocumentField(DbOperation, required=True)
query = StringField(required=True, choices=[payload.value for payload in DbActionOperationType])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to query type

"""

collections = list(CognitionSchema.objects(bot=bot).distinct(field='collection_name'))
if collection not in collections and len(collections) >= BotSettings.objects(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use get_bot_settings in mongo_processor

Q(bot=bot)).get()
return matching_metadata
except DoesNotExist as e:
raise AppException("Metadata related to payload not found!")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Columns do not exist in the schema

yield final_data

@staticmethod
def retrieve_data(data: Any, schema: Dict):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

validate column values

except DoesNotExist:
raise AppException("Payload does not exists!")

def list_cognition_data(self, bot: Text):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

paginate

else:
search_payload, vector_embeddings = Utility.get_embeddings_and_payload_data(content['data'], content['metadata'])
metadata = processor.find_matching_metadata(self.bot, content['data'], content.get('collection'))
search_payload, vector_embeddings = processor.get_embeddings_and_payload_data(content['data'], metadata)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is supposed to be a utility
either move to Utility or create static method within Processor

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also rename to retrieve_search_payload_and_embedding_payload()

else:
search_payload, vector_embeddings = Utility.get_embeddings_and_payload_data(content['data'], content['metadata'])
metadata = processor.find_matching_metadata(self.bot, content['data'], content.get('collection'))
search_payload, vector_embeddings = processor.get_embeddings_and_payload_data(content['data'], metadata)
else:
search_payload, vector_embeddings = {'content': content["data"]}, content["data"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename vector_embeddings to embedding_payload

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants