
[ENH] add pipeline configuration/structure #3

Merged
merged 44 commits into neurostuff:main on Nov 25, 2024

Conversation

@jdkent (Member) commented Oct 22, 2024

Creates backbone of pipeline infrastructure.

@adelavega (Member) left a comment

Mostly looks good! I will have a more thorough review once I start to implement my own Pipeline, but for now I just have a question about how we hash things.

I also updated my PR to your branch addressing some of my suggestions here.

Resolved (outdated) review threads:
ns_pipelines/dataset.py (2)
ns_pipelines/pipeline.py (1)
@adelavega (Member):

Can you also give this PR a better description and update the README.md?

@jdkent jdkent changed the title add testing [ENH] add pipeline configuration/structure Oct 31, 2024
@adelavega (Member):

Looks good! Just need the tests to pass.

@adelavega (Member) left a comment

Good job! I'm going to approve, but I have a few comments, mostly minor / clean-up related. Go ahead and merge after clean-up.

Also, can you delete the umls_disease pipeline since it's not implemented? I can do it in another PR, which will help me check the code more thoroughly.

Finally, delete the pipelines folder (i.e. not the ns_pipelines folder).


     return predictions, clean_preds


 def _load_client(model_name):
     if 'gpt' in model_name:
-        client = OpenAI(api_key=os.getenv('MYOPENAI_API_KEY'))
+        client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
Member commented:

The reason I had it this way was because if the environment variable is set to OPENAI_API_KEY it will automatically be ingested by OpenAI and thus this is not necessary.

So I wanted to have the option to not pass that key to OpenAI.

Specifically, this is for when you want to use the OpenAI client for another API (such as OpenRouter).

So what I would do is add which API key to use as a configuration parameter, and in the production environment name it something other than OPENAI_API_KEY.

@@ -100,7 +55,8 @@ def ParticipantDemographics(IndependentPipeline):
     def __init__(
         self,
         extraction_model,
-        prompt_set, inputs=("text",),
+        prompt_set,
+        inputs=("text",),
         input_sources=("pubget", "ace"),
Member commented:

What I would do is add the key as part of the __init__.

Later on, we could define a base class OpenAIPipeline that sets up the client for the subclass automatically and would know which parameters to expect.

Member commented:

i.e. the _load_client function could be part of this new parent class and would always be called. For now it's fine though; we can cross that bridge later.

For now just make the key a config parameter and rename the value of the key something else.
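The OpenAIPipeline base class floated in the comments above might look something like the following sketch; `api_key_env` and `client_url` are illustrative parameter names, not identifiers from this repository:

```python
import os


class OpenAIPipeline:
    """Hypothetical base class that owns client setup for its subclasses.

    Subclasses inherit the client configuration instead of each defining
    their own _load_client; parameter names here are assumptions.
    """

    def __init__(self, extraction_model,
                 api_key_env="OPENAI_API_KEY", client_url=None):
        self.extraction_model = extraction_model
        self.api_key_env = api_key_env  # which env var holds the key
        self.client_url = client_url    # e.g. an OpenRouter endpoint
        self._client = None

    def _load_client(self):
        # Lazy import so the base class can be defined without openai installed.
        from openai import OpenAI
        kwargs = {"api_key": os.getenv(self.api_key_env)}
        if self.client_url is not None:
            kwargs["base_url"] = self.client_url
        self._client = OpenAI(**kwargs)
        return self._client
```

A subclass such as a demographics extractor would then only declare its own parameters and call `self._load_client()` when needed.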

    _version = "1.0.0"

    def __init__(self, inputs=("text",), input_sources=("pubget", "ace"), square_root=False):
        """add any pipeline configuration here (as opposed to runtime arguments like n_cpus or n_cores)"""
Member commented:

Suggested change (delete this line):
-        """add any pipeline configuration here (as opposed to runtime arguments like n_cpus or n_cores)"""

Member commented:

Probably only need this comment in the base class.

Resolved (outdated) review threads:
ns_pipelines/word_count/run.py (4)
ns_pipelines/participant_demographics/run.py (2)
    if 'gpt' in model_name:
        client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

    else:
Member commented:

In principle we can run this using other API keys, and hence other model names, so perhaps let's not worry about validation here.

@jdkent jdkent merged commit 830dd60 into neurostuff:main Nov 25, 2024
1 check passed