quality review extraction from 2 links
ThomasOh92 committed Dec 21, 2024
1 parent 55f76ef commit afb42d8
Showing 3 changed files with 56 additions and 18 deletions.
19 changes: 9 additions & 10 deletions README.MD
@@ -1,7 +1,5 @@
# Next Steps
- Review text is looking a bit thin. When extracting, need to qc for longer quotes. It seemed to ignore character requirement
- SO MUCH HALLUCINATION. NEED TO DO LINK BY LINK. GENERATE 3 QUOTES. CHECK ACCURACY OF ALL 3 QUOTES

- Work on extracting quality reviews first - hand extracted is fine
- Consider reworking individual pages - left hand column (Generated Insight). Consider adding Course Metadata
- Maybe change "Review Source Data Notes" to the actual list of sources
- Maybe further add - Sources have multiple fields (links, tag - trusted, high priority, experimental, name, description, type, added date, status)
@@ -10,11 +8,12 @@

# Process Definition - Starting a new page - Every phase has a human in the loop
## Phase A - Collect and Manually Verify Links
1. Create a json file to hold the collected, titled "{course_name}_firstore.json"
2. Collect manually verified links and place them into the json file under "AllLinks", as an array <!-- human -->
1. Create a json file to hold the collected links, titled "{course_name}_firestore.json"
2. Collect manually verified links and place them into the json file under the "AllLinks" key, as an array <!-- human -->
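
A minimal sketch of what the Phase A file could look like when built in code. Only the "AllLinks" key and the "{course_name}_firestore.json" naming pattern come from the steps above; the helper names `linksFileName` and `buildLinksFile`, the trimming and de-duplication, and the example URLs are assumptions for illustration.

```typescript
// Hypothetical helpers: only the "AllLinks" key and the file-name pattern
// are taken from the README steps.
interface LinksFile {
  AllLinks: string[];
}

function linksFileName(courseName: string): string {
  return `${courseName}_firestore.json`;
}

function buildLinksFile(links: string[]): LinksFile {
  // Trim whitespace, drop blanks, and keep each verified link only once.
  const cleaned = links.map((l) => l.trim()).filter((l) => l.length > 0);
  return { AllLinks: Array.from(new Set(cleaned)) };
}

// Placeholder URLs, not real sources.
const exampleLinks = buildLinksFile([
  "https://example.com/review-1",
  "https://example.com/review-1",
  " https://example.com/review-2 ",
]);

console.log(linksFileName("my_course"));
console.log(JSON.stringify(exampleLinks, null, 2));
```

Link verification itself stays manual; the sketch only keeps the file shape consistent.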


## Phase B - Generate the quotes one link at a time and save the quotes
5. Open ChatGPT and provide the following prompt:
## Phase B - WIP - Manual quote extraction (copilot enhancements to move faster)? Webscraped with LLMs? Using specific APIs? (Reddit)
### Possible Prompt for extracting specific quotes from raw data
{insert link}
This link contains valuable reviews about the course {course name}. I would like you to go to the page and extract quotes that provide a review about the course. Find as many relevant quotes as possible.
- Each quote should be
@@ -27,17 +26,17 @@ This link contains valuable reviews about the course {course name}. I would like
-- Remove special characters (except punctuation) unless they are critical to the meaning of the quote.

Present the quotes back to me in a formatted json file. All quotes are in array, within a key called "CollectedReviews"
Each quote should have the following fields - 'quote', 'source_url' 'bolded text', 'flagged status', 'keywords', 'sentiment', 'date'.
Each quote should have the following fields - 'quote', 'source', 'source_url' 'bolded text', 'flagged status', 'keywords', 'sentiment', 'date'.
- For 'bolded text', look at the relevant quote and identify snippets that should be bolded. This should be the value
- For 'source', look at the url and identify the general place the source is from (e.g. Subreddit, Course Website, etc. )
- For 'source_url', refer to the url that the quote is taken from
- For 'flagged status', just set it as null
- For 'keywords', identify 3 key words from the quote that would be relevant to someone considering taking the course
- For 'sentiment', identify if the review is positive, neutral, or negative
- For 'date', check the associated url for the date of the quote
- Make the new json file available to me for download
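
The field rules in the prompt above can be checked mechanically before a human reviews the quotes. The following is a rough sketch, not part of the repository: `reviewProblems` is a hypothetical helper, it assumes the returned JSON uses the field names exactly as the prompt spells them (with spaces, e.g. 'bolded text'), and it does not confirm that a quote really appears on the source page.

```typescript
// Hypothetical QC helper. Field names follow the prompt's spelling.
const SENTIMENTS = ["positive", "neutral", "negative"];
const STRING_FIELDS = ["quote", "source", "source_url", "bolded text", "date"];

function reviewProblems(value: unknown): string[] {
  if (typeof value !== "object" || value === null) return ["entry is not an object"];
  const r = value as Record<string, unknown>;
  const problems: string[] = [];

  for (const field of STRING_FIELDS) {
    if (typeof r[field] !== "string") problems.push(`'${field}' must be a string`);
  }
  // The prompt asks for 'flagged status' to be set to null.
  if (r["flagged status"] !== null) problems.push("'flagged status' must be null");

  const keywords = r["keywords"];
  const keywordsOk =
    Array.isArray(keywords) &&
    keywords.length === 3 &&
    keywords.every((k) => typeof k === "string");
  if (!keywordsOk) problems.push("'keywords' must be an array of 3 strings");

  const sentiment = r["sentiment"];
  if (typeof sentiment !== "string" || SENTIMENTS.indexOf(sentiment.toLowerCase()) === -1) {
    problems.push("'sentiment' must be positive, neutral, or negative");
  }

  // The bolded snippet is only useful if it actually occurs in the quote.
  const quote = r["quote"];
  const bolded = r["bolded text"];
  if (
    typeof quote === "string" &&
    typeof bolded === "string" &&
    quote.toLowerCase().indexOf(bolded.toLowerCase()) === -1
  ) {
    problems.push("'bolded text' does not appear in 'quote'");
  }
  return problems;
}

// Made-up sample entry for illustration only.
const sampleReview = {
  quote: "Great pacing and the projects were genuinely useful.",
  source: "Subreddit",
  source_url: "https://example.com/thread",
  "bolded text": "the projects were genuinely useful",
  "flagged status": null,
  keywords: ["pacing", "projects", "useful"],
  sentiment: "positive",
  date: "2024-01-01",
};

console.log(reviewProblems(sampleReview));
```

An empty result only means the entry is well-formed; checking the quote against the linked page remains a human step.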

Accelerations possible - did 4 reddit links at one go, but didnt get as many quotes. Tried with a batch of 5 and it worked as well.
But remember, I started with one link first.


## Phase C - Generate the final output to upload into firestore
7. Reupload cleaned document "{course name}_firestore.json" to ChatGPT and ask the following
@@ -4,26 +4,56 @@ import { StyledCard, StyledCardContent, StyledTypography } from '../custom-style

interface Quote {
quote: string;
source: string;
source_url: string;
key: number;
bolded_text: string;
flagged_status: boolean;
keywords: string[];
sentiment: string;
date: string;
}



const CollectedReviewsAll: React.FC<Quote> = ({ quote, source_url, key }) => {
const urlParts = source_url.replace(/^(?:https?:\/\/)?(?:www\.)?/i, "").split('/');
const title = urlParts.length > 2 ? `${urlParts[0]}/${urlParts[1]}/${urlParts[2]}` : urlParts.join('/');
const CollectedReviewsAll: React.FC<Quote> = ({ quote, source, source_url, bolded_text, flagged_status, keywords, sentiment, date, key }) => {

let formattedQuote: (string | JSX.Element)[] = [quote];
const normalize = (str: string) => str.replace(/[.,/#!$%^&*;:{}=\-_`~()]/g, '').toLocaleLowerCase();

if (normalize(quote).includes(normalize(bolded_text))) {

const escapeRegExp = (string: string) => string.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
const escapedBoldedText = escapeRegExp(bolded_text.replace(/\.$/, ''));
const parts = quote.split(new RegExp(`(${escapedBoldedText})`, 'gi'));
formattedQuote = parts.map((part, index) =>
normalize(part) === normalize(bolded_text) ? (
<strong key={index}>{part}</strong>
) : (
part
)
);
}

return (
<>
<StyledCard key={key}>
<StyledCardContent>
<Box>
<Typography variant="subtitle2" color="textSecondary" sx={{ mb: 1 }}>
{formattedQuote}
</Typography>
<Typography variant="body2" color="textSecondary" sx={{ mb: 1 }}>
<strong>{title}:</strong> {quote}
<strong>Sentiment:</strong> {sentiment}
</Typography>
{date && (
<Typography variant="body2" color="textSecondary" sx={{ mb: 1 }}>
<strong>Date:</strong> {date}
</Typography>
)}
<Typography variant="body2" color="textSecondary">
<a href={source_url} target="_blank" rel="noopener noreferrer">
{source_url.replace(/^(?:https?:\/\/)?(?:www\.)?/i, "").split('/')[0]}
{source}
</a>
</Typography>
</Box>
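
The bolding logic in this hunk mixes two comparisons: the guard and the per-part check use the punctuation-stripped `normalize` form, while the split uses the raw `bolded_text`. When the snippet contains punctuation that differs from the quote, the guard can pass while the split finds nothing. A possible alternative, sketched as a pure function so it can be tested without React: `splitOnBold` and `Segment` are hypothetical names, and the behaviour (case-insensitive literal match, trailing period ignored, missing snippet tolerated) is an assumption about what is wanted.

```typescript
interface Segment {
  text: string;
  bold: boolean;
}

// Escape characters that have special meaning inside a RegExp.
const escapeRegExp = (s: string): string => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

function splitOnBold(quote: string, boldedText: string | undefined): Segment[] {
  // Tolerate a missing snippet and ignore one trailing period.
  const needle = (boldedText || "").trim().replace(/\.$/, "");
  if (needle.length === 0) return [{ text: quote, bold: false }];

  // Splitting on a single capturing group puts every match at an odd index.
  const parts = quote.split(new RegExp(`(${escapeRegExp(needle)})`, "gi"));
  return parts
    .map((text, index) => ({ text, bold: index % 2 === 1 }))
    .filter((segment) => segment.text.length > 0);
}

const segments = splitOnBold(
  "The course was great. Highly recommend it!",
  "highly recommend it."
);
console.log(segments);
```

In the component, each segment with `bold: true` would be wrapped in `<strong>`. Separately, React reserves `key` and does not pass it to the component as a prop, so the `key` destructured in `CollectedReviewsAll` will be `undefined`; the `key={index}` at the call site is enough on its own.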
@@ -39,13 +39,13 @@ const ReviewPage: React.FC<ReviewPageProps> = ({data }) => {
<ThemeProvider theme={showCustomTheme ? mainTheme : defaultTheme}>
<CssBaseline enableColorScheme />
<Container
maxWidth="lg"
maxWidth="xl"
component="main"
sx={{ display: 'flex', flexDirection: 'column', my: 6, gap: 4 }}
>
<Box sx={{ display: 'flex', flexDirection: 'column', gap: 2, alignItems: 'center' }}>
<Typography variant="h2" gutterBottom maxWidth='700px' align="center">{data?.Title}</Typography>
<Box sx={{ display: 'flex', flexDirection: { xs: 'column', md: 'row' }, gap: 4, alignItems: { xs: 'center', md: 'flex-start' }, width: '100%' }}>
<Box sx={{ display: 'flex', flexDirection: { xs: 'column', md: 'row' }, gap: 4, alignItems: { xs: 'center', md: 'flex-start' }, width: '100%', justifyContent: 'center' }}>
{/* Left hand Column */}
<Box sx={{ display: 'flex', flexDirection: 'column', alignItems:'center', gap: 2 }}>
<Link
@@ -72,7 +72,16 @@ const ReviewPage: React.FC<ReviewPageProps> = ({data }) => {
<Grid xs={12} md={6}>
<Box sx={{ display: 'flex', flexDirection: 'column', gap: 2, height: '100%', alignItems: 'center' }}>
{data?.CollectedReviews?.map((review: any, index: number) => (
<CollectedReviewsAll quote={review.quote} source_url={review.source_url} key={index} />
<CollectedReviewsAll
quote={review.quote}
source={review.source}
source_url={review.source_url}
bolded_text={review.bolded_text}
flagged_status={review.flagged_status}
keywords={review.keywords}
sentiment={review.sentiment}
date={review.date}
key={index} />
)) || 'No data available'}
</Box>
</Grid>
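
The map above passes `review: any` straight through, and the README prompt spells two fields with spaces ('bolded text', 'flagged status') while the `Quote` interface uses underscores and types `flagged_status` as a boolean even though the prompt sets it to null. One way to bridge that gap, sketched under the assumption that stored records may use either spelling: `toQuoteProps`, `pick`, and `asString` are hypothetical helpers, not code from this commit.

```typescript
// Mirrors the Quote interface from the component, minus the reserved `key`.
interface QuoteProps {
  quote: string;
  source: string;
  source_url: string;
  bolded_text: string;
  flagged_status: boolean;
  keywords: string[];
  sentiment: string;
  date: string;
}

// Return the first present, non-null value among several candidate field names.
function pick(raw: Record<string, unknown>, ...names: string[]): unknown {
  for (const name of names) {
    if (raw[name] !== undefined && raw[name] !== null) return raw[name];
  }
  return undefined;
}

function asString(value: unknown): string {
  return typeof value === "string" ? value : "";
}

function toQuoteProps(raw: Record<string, unknown>): QuoteProps {
  const keywords = raw["keywords"];
  return {
    quote: asString(raw["quote"]),
    source: asString(raw["source"]),
    source_url: asString(raw["source_url"]),
    bolded_text: asString(pick(raw, "bolded_text", "bolded text")),
    // null or missing is treated as "not flagged".
    flagged_status: pick(raw, "flagged_status", "flagged status") === true,
    keywords: Array.isArray(keywords)
      ? keywords.filter((k): k is string => typeof k === "string")
      : [],
    sentiment: asString(raw["sentiment"]),
    date: asString(raw["date"]),
  };
}

// Made-up record using the prompt's spellings.
const props = toQuoteProps({
  quote: "Solid fundamentals.",
  "bolded text": "Solid",
  "flagged status": null,
  keywords: ["fundamentals", 42],
});
console.log(props);
```

With such a helper, the map could spread the result into the component, e.g. `<CollectedReviewsAll {...toQuoteProps(review)} key={index} />`.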