Augmenting TEA with open-source LLMs #414
Replies: 5 comments
-
Those are great first results! I took a deep look at Mistral's response, and I was really impressed by two of its suggestions:
However, it also produced some imprecisions, which is expected given we're using the model out-of-the-box:
I can see two ways to improve this: 1) we can fine-tune the model, and/or 2) we can implement Retrieval-Augmented Generation (RAG). In both cases, we would need to develop a dataset that suits our needs. For PoC purposes, I would also suggest focusing on a more modest instruction: "describe" and "improve" seem very ambitious. Maybe, "what evidence can I use for property {X}"?
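To make the RAG idea concrete, here is a minimal sketch of the retrieval step: score snippets from a (hypothetical) assurance-case dataset by token overlap with the query, then prepend the best matches to the prompt before it reaches the LLM. The snippet texts and the `retrieve`/`build_prompt` helpers are illustrative assumptions, not part of any TEA codebase; a real PoC would use embedding-based retrieval.

```python
import re

def tokens(text):
    """Lowercase, punctuation-free token set for crude overlap scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, snippets, k=2):
    """Return the k snippets sharing the most tokens with the query."""
    q = tokens(query)
    return sorted(snippets, key=lambda s: len(q & tokens(s)), reverse=True)[:k]

def build_prompt(query, snippets):
    """Ground the instruction in retrieved context before sending it to the model."""
    context = "\n".join(retrieve(query, snippets))
    return f"Context:\n{context}\n\nQuestion: {query}"

# Invented example snippets, standing in for a curated assurance-case dataset.
snippets = [
    "Property claim P3: the model's outputs are explainable to end users.",
    "Strategy S3 decomposes the fairness goal into measurable property claims.",
    "Evidence E1: results of a bias audit on the training data.",
]
print(build_prompt("What evidence can I use for property claims under strategy S3?", snippets))
```

Even a toy retriever like this shows the point: the model answers from supplied case material rather than from its parametric memory, which is what should limit the imprecisions.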
-
These are great suggestions, @cptanalatriste. One of the other reasons I want to have the TEA schema developed is so that we can be more precise with keywords such as these. With LangChain and AutoGen, you can also break the instructions down further by customising specific prompt templates for different agents and allowing them to access skills that would provide further structure.
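As a sketch of what per-agent prompt templates could look like (using plain string templates rather than LangChain's own classes, and with agent names and placeholder fields that are purely hypothetical):

```python
# Hypothetical per-agent templates, splitting a broad instruction
# ("describe", "improve") into narrower, agent-specific ones.
TEMPLATES = {
    "summariser": "Summarise the assurance case whose top-level goal is: {goal}.",
    "gap_finder": "List property claims missing from strategy {strategy}.",
    "evidence_suggester": "Suggest forms of evidence for the property claim: {claim}.",
}

def render(agent, **fields):
    """Fill the chosen agent's template; raises KeyError for unknown agents or fields."""
    return TEMPLATES[agent].format(**fields)

print(render("gap_finder", strategy="S3"))
# → List property claims missing from strategy S3.
```

Once the TEA schema is settled, the placeholder names (`goal`, `strategy`, `claim`) could be drawn from the schema itself, so each agent's prompt stays consistent with the case structure.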
-
So, here are some updates for the following prompts:

- "Please summarise this assurance case. Focus on the top-level goal and the key strategies. Identify any relevant property claims to help explain the respective strategies."
- "Are there any missing property claims that could be added to Strategy S3?"
- "Please suggest relevant forms of evidence for all of the property claims associated with S3, including the new ones you have proposed."
-
@chrisdburr these look great! Especially the "missing property claims". However, the "summary" question ignored a strategy, and I'm not convinced about its suggestion for P10 (asking users if their privacy is respected feels strange). I do believe using RAG can limit the number of hallucinations. If you agree, I can spend a couple of weeks building a PoC and see if it's an avenue worth pursuing.
-
For fine-tuning, this may be a good option: https://unsloth.ai/blog/llama3. The notebook is here: https://colab.research.google.com/drive/1mPw6P52cERr93w3CMBiJjocdTnyPiKTX#scrollTo=6bZsfBuZDeCL
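Whichever fine-tuning route we take, we would first need the dataset in an instruction-tuning format. A minimal sketch, assuming the Alpaca-style `instruction`/`input`/`output` JSONL layout commonly used in such notebooks (the example record and filename are invented for illustration):

```python
import json

# Invented example: one instruction-tuning record drawn from an assurance case.
records = [
    {
        "instruction": "What evidence can I use for property claim P5?",
        "input": "P5: The system's decisions can be contested by affected users.",
        "output": "Logs of appeal outcomes and a documented contestability procedure.",
    },
]

# Write one JSON object per line, the usual format for fine-tuning datasets.
with open("tea_finetune.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

print(sum(1 for _ in open("tea_finetune.jsonl")))  # → 1
```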
-
Opening this discussion about a possible research avenue to explore. Please review the following note to see the results of an early experiment using open-source LLMs to evaluate an assurance case:
https://hackmd.io/@tea-platform/HkjWQnQeR
Research Questions