use NLP model to generate name and description for data sources #43
Comments
Also worth exploring whether there is any semi-consistent way to source a description directly from any of the tags.
I mentioned this in #16, but taking information from the home page of any URL -- which my PR #36 aims to do -- is likely to provide additional context, since the home page would be either for the entire police department or for the local government the police department is based out of.
I additionally posed the question to ChatGPT about possible options we could take with this, and its answer seemed useful and relevant: https://chat.openai.com/share/c08fcb30-7012-443d-8a2e-0a8d448e05d7
@maxachis thanks, I clarified the
@josh-chamberlain Do we have ideas of what model to use? I don't have much prior experience in NLP, so I'd definitely defer to someone such as @EvilDrPurple if they have a better idea of what model to use, but I have begun looking at some existing models that may have promise, such as https://huggingface.co/Falconsai/text_summarization
@maxachis I have not used any models for summarization yet, but if you haven't already found this page, it may be of some help. It lists models that can be used for summarization near the top:
So doing some preliminary research on this (and bearing in mind that my NLP experience is quite limited), here are my initial thoughts:
@josh-chamberlain I tested out the following entry on several use cases. The entry below is 467 from PDAP/urls-and-headers.
I tested this on a naive implementation of the T5 model:

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="t5-small")
example_text = """ ... """
summary = summarizer(example_text, max_length=30, do_sample=False)
print("Summary:", summary[0]['summary_text'])
```

And got the result:

Assuming the punctuation and tag identifiers might be a problem, I removed them and got
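For anyone wanting to reproduce the cleanup step mentioned above, here is a minimal, hypothetical sketch of stripping tags and stray punctuation from scraped content before summarization. The function name and the exact cleaning rules are assumptions, not code from any PDAP repo:

```python
import html
import re

def clean_scraped_text(raw: str) -> str:
    # Decode HTML entities (&amp; -> &), then drop tags entirely
    text = html.unescape(raw)
    text = re.sub(r"<[^>]+>", " ", text)
    # Remove stray punctuation / tag identifiers, keeping word characters,
    # whitespace, periods, and commas
    text = re.sub(r"[^\w\s.,]", " ", text)
    # Collapse runs of whitespace left behind by the substitutions
    return re.sub(r"\s+", " ", text).strip()

print(clean_scraped_text("<h1>Police &amp; Records</h1> | home_page"))
```

Whether aggressive cleaning like this actually helps t5-small would need to be tested against the raw input.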
I then gave the original format as a prompt to ChatGPT. For GPT 3.5:
For GPT 4.0:
Obviously, the GPT summaries are the best and the least complicated to set up, but also the most expensive. Back-of-the-envelope math suggests at most $0.05 for a GPT-4 summary (i.e., $50 for 1000 summaries) and at most $0.005 for a GPT-3.5 summary (i.e., $5 for 1000 summaries). Being back-of-the-envelope, it's quite possible the actual costs would be cheaper, but that would take more time to investigate. There are likely other solutions, but finding them and testing their feasibility would take time.
After investigating the OpenAI option more deeply, it seems I may have been off by a factor of 10 for GPT 3.5. I ran the above example with a prompt through the following code:

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY")
)
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You will receive a set of html content for a web page and provide a json "
                                      "object with two keys: 'summary' (single sentence summary of web page) "
                                      "and 'name' (descriptive name of web page)."},
        {"role": "user", "content": example_text},  # example_text as defined above
    ],
    temperature=0,
)
print(response.choices[0].message.content)
```

Response was below:
Total input tokens: 730. Cost of input tokens ($0.0005/1K tokens): $0.000365. Adding the output token cost of $0.000105 gives $0.00047 per call, or about $0.47 if we made 1000 similar calls. We can probably further reduce the token count by requiring shorter outputs and/or trimming the fat from the HTML content provided.
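The arithmetic above can be wrapped in a small helper for comparing pricing scenarios. The input rate matches the $0.0005/1K figure quoted above; the output rate of $0.0015/1K is an assumption that reproduces the $0.000105 output cost at roughly 70 output tokens:

```python
def estimate_call_cost(input_tokens: int, output_tokens: int,
                       input_rate: float = 0.0005,
                       output_rate: float = 0.0015) -> float:
    # Rates are dollars per 1K tokens; returns the dollar cost of one call.
    return input_tokens / 1000 * input_rate + output_tokens / 1000 * output_rate

per_call = estimate_call_cost(730, 70)
print(f"per call: ${per_call:.6f}, per 1000 calls: ${per_call * 1000:.2f}")
```

Swapping in other models' rates would make the GPT-3.5 vs. GPT-4 comparison above concrete.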
@maxachis thanks for doing the initial testing and groundwork. Since we're already going to be sending things through a Hugging Face pipeline, could we pick a model there instead? There are a bunch of text classification models there. We could pretrain our own, or use an existing one.

random thought: rather than removing punctuation and headers, can we just explain "the page was scraped for the following meta and header content"? Seems more straightforward, in a way.

re: your points above, we can also have the model omit names and summaries where it thinks the
I’m skittish about doing so, for a few reasons:
Let me know your thoughts, @josh-chamberlain
@maxachis you can feel free to use ChatGPT with an API call since that's faster. Eventually, we may need to use our own LLM, so:
Context
Requirements
As part of the data source identification pipeline, create these text fields for each data source automatically:
submitted_name
description
Suggested path
HTML tag collector
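As a sketch of how the requirement above might be wired up: the model was prompted earlier in the thread to return a JSON object with `summary` and `name` keys, which could be mapped onto the `submitted_name` and `description` fields. The function name, the example payload, and the field mapping are assumptions for illustration:

```python
import json

def parse_model_response(raw: str) -> dict:
    # Keys 'summary' and 'name' follow the system prompt earlier in the
    # thread; mapping them onto the data source fields is an assumption.
    payload = json.loads(raw)
    return {
        "submitted_name": payload["name"],
        "description": payload["summary"],
    }

example = ('{"summary": "Official records portal for a city police department.", '
           '"name": "City Police Records Portal"}')
print(parse_model_response(example))
```

A real implementation would also need to handle malformed JSON from the model, since responses are not guaranteed to parse.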