Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AI Chat] Introduces support for host-specific distiller scripts #25722

Merged
merged 33 commits into from
Dec 11, 2024

Conversation

jonathansampson
Copy link
Contributor

@jonathansampson jonathansampson commented Sep 27, 2024

This changelist introduces a framework for distilling content from specific hosts using distiller scripts. The functionality is being introduced behind a FEATURE_DISABLED_BY_DEFAULT flag.

The key elements of this change include supporting the x.com and github.com domains, but it is designed to accommodate more sites in the future. Changes are made to only a few files, though many more files (mostly TypeScript) are being introduced.

The newly introduced distillation framework is crucial for site-specific content extraction, allowing the Leo to provide distilled information from major sites. This change lays the groundwork for future expansions by adding more distiller scripts for additional hosts.

For GitHub, the distillation scripts focus on extracting data from branches, pull requests, and repository pages. Specifically, it captures information from tables like branch details (e.g., updated times, pull request statuses) and translates them into a distilled format for easy consumption. This provides a cleaner summary of repository activity, enhancing the ability to extract actionable insights from GitHub pages.

For X (formerly Twitter), the distillation scripts handle content from posts, notifications, user profiles, and media entities. The framework extracts key elements such as tweet text, user interactions (likes, retweets), and attached media (photos, videos), as well as structured content like notifications and quoted posts. This allows for a comprehensive summary of the user's feed or interactions on X, presented in a simplified and structured format.

The distillation scripts for both platforms are considered incomplete, but get us considerably further than where we are today.

Security Review: https://github.com/brave/reviews/issues/1754

Resolves brave/brave-browser#40794

Submitter Checklist:

  • I confirm that no security/privacy review is needed and no other type of reviews are needed, or that I have requested them (Review requested at https://github.com/brave/reviews/issues/1754)
  • There is a ticket for my issue
  • Used Github auto-closing keywords in the PR description above
  • Wrote a good PR/commit description
  • Squashed any review feedback or "fixup" commits before merge, so that history is a record of what happened in the repo, not your PR
  • Added appropriate labels (QA/Yes or QA/No; release-notes/include or release-notes/exclude; OS/...) to the associated issue
  • Checked the PR locally:
    • npm run test -- brave_browser_tests, npm run test -- brave_unit_tests wiki
    • npm run presubmit wiki, npm run gn_check, npm run tslint
  • Ran git rebase master (if needed)

Reviewer Checklist:

  • A security review is not needed, or a link to one is included in the PR description
  • New files have MPL-2.0 license header
  • Adequate test coverage exists to prevent regressions
  • Major classes, functions and non-trivial code blocks are well-commented
  • Changes in component dependencies are properly reflected in gn
  • Code follows the style guide
  • Test plan is specified in PR before merging

After-merge Checklist:

Test Plan:

Navigate to https://github.com/brave/brave-core/branches/all and ask Leo for a list of branches with an open (as opposed to a draft) pull-request. This is information Leo was previously not able to provide.

@jonathansampson jonathansampson added enhancement javascript Pull requests that update Javascript code labels Sep 27, 2024
@jonathansampson jonathansampson self-assigned this Sep 27, 2024
@jonathansampson jonathansampson requested a review from a team as a code owner September 27, 2024 20:43
@github-actions github-actions bot added CI/storybook-url Deploy storybook and provide a unique URL for each build puLL-Merge labels Sep 27, 2024
@diracdeltas
Copy link
Member

if we intend to ship this, please file a sec review ticket

@jonathansampson
Copy link
Contributor Author

if we intend to ship this, please file a sec review ticket

Working on that now :)

@jonathansampson jonathansampson force-pushed the sampson-script-distillers branch from 5e36be3 to 9495662 Compare September 27, 2024 21:19
@brave-builds
Copy link
Collaborator

A Storybook has been deployed to preview UI for the latest push

@jonathansampson jonathansampson changed the title [AIChat] Introduces support for host-specific distiller scripts [AI Chat] Introduces support for host-specific distiller scripts Oct 9, 2024
@jonathansampson jonathansampson force-pushed the sampson-script-distillers branch from 5bf95be to c76399f Compare October 22, 2024 13:57
@jonathansampson
Copy link
Contributor Author

I'm going to work on breaking this logic up and restructuring it into an updatable component.

@jonathansampson jonathansampson force-pushed the sampson-script-distillers branch from c76399f to 00dcfea Compare November 28, 2024 22:12
@brave-builds
Copy link
Collaborator

A Storybook has been deployed to preview UI for the latest push

No more events. By outputting an ES6 module we can wrap the logic in our own IIFE, and return the distilled representation of the page (if any).
There's ultimately no need to do any special wrapping here if we can do it instead at build/transpilation time.
This approach fits the convention more commonly seen in the codebase.
The main change here is the detection of a user's profile page, which is something we can distill (unlike other pages on X which may have a similar path-style/pattern).

Additionally, this change introduces a small amount of refactoring, as well as a bug fix (need to invoke `isSupportedPage`).

Lastly, this change introduces a small preface to the in-situ metadata, to assist the LLM in understanding where the actual page content begins within the distillation result.
@jonathansampson jonathansampson force-pushed the sampson-script-distillers branch from d873edf to 46abf01 Compare December 10, 2024 16:19
Copy link
Contributor

[puLL-Merge] - brave/brave-core@25722

Description

This PR introduces a new feature for host-specific distillation scripts in the Brave AI Chat functionality. It adds support for custom site-specific content extraction, particularly for GitHub and X (formerly Twitter) websites. The changes include new TypeScript files for content distillation, modifications to existing C++ files to incorporate the new feature, and updates to build configurations.

Changes

Changes

  1. browser/about_flags.cc:

    • Added a new feature flag for host-specific distillation scripts.
  2. browser/ui/BUILD.gn:

    • Added a new dependency for custom site distiller scripts.
  3. components/ai_chat/core/common/features.cc and features.h:

    • Introduced a new feature flag kCustomSiteDistillerScripts.
  4. components/ai_chat/renderer/page_content_extractor.cc:

    • Modified to support the new distillation method.
  5. components/ai_chat/renderer/page_text_distilling.cc and page_text_distilling.h:

    • Significant changes to incorporate host-specific distillation scripts.
  6. New directory components/ai_chat/resources/custom_site_distiller_scripts/:

    • Added multiple TypeScript files for X.com and GitHub.com content distillation.
  7. components/resources/BUILD.gn:

    • Updated to include the new custom site distiller scripts.
  8. resources/resource_ids.spec:

    • Added resource IDs for the new custom site distiller scripts.
sequenceDiagram
    participant User
    participant BraveUI
    participant ContentExtractor
    participant DistillationScript
    participant Website

    User->>BraveUI: Requests content distillation
    BraveUI->>ContentExtractor: Initiates content extraction
    ContentExtractor->>Website: Loads webpage
    ContentExtractor->>ContentExtractor: Checks for custom script
    alt Custom script available
        ContentExtractor->>DistillationScript: Executes custom script
        DistillationScript->>Website: Extracts specific content
        Website-->>DistillationScript: Returns raw content
        DistillationScript-->>ContentExtractor: Returns distilled content
    else No custom script
        ContentExtractor->>Website: Extracts content using default method
        Website-->>ContentExtractor: Returns raw content
    end
    ContentExtractor-->>BraveUI: Returns distilled content
    BraveUI-->>User: Displays distilled content
Loading

Possible Issues

  1. The new distillation scripts are quite complex and may require thorough testing to ensure they work correctly across different scenarios on the target websites.
  2. The changes to the content extraction process might impact performance, especially for websites without custom scripts.

Security Hotspots

  1. components/ai_chat/renderer/page_text_distilling.cc:

    • The function DistillPageTextViaSiteScript executes JavaScript in the context of the webpage. Ensure that the scripts are properly sandboxed and cannot access sensitive information.
  2. components/ai_chat/resources/custom_site_distiller_scripts/scripts/x.com/data.ts:

    • The XStateStore class accesses React internal properties. This approach might break if the website's structure changes and could potentially expose sensitive data.

std::string_view script_content,
int32_t world_id,
base::OnceCallback<void(const std::optional<std::string>&)> callback) {
// TODO (jonathansampson): Wrap scripts at build/transpile-time instead
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CHECK(ai_chat::features::IsCustomSiteDistillerScriptsEnabled())

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: TODO(jonathansampson) and also is there an issue you can link to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issue to link to at this time; how we approach that might rely on whether or not the code is moved to a component.

Copy link
Collaborator

@cdesouza-chromium cdesouza-chromium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jonathansampson jonathansampson merged commit 97fccd9 into master Dec 11, 2024
18 checks passed
@jonathansampson jonathansampson deleted the sampson-script-distillers branch December 11, 2024 14:06
@github-actions github-actions bot added this to the 1.75.x - Nightly milestone Dec 11, 2024
@brave-builds
Copy link
Collaborator

Released in v1.75.92

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI/run-upstream-tests Run upstream unit and browser tests on Linux and Windows (otherwise only on Linux) CI/storybook-url Deploy storybook and provide a unique URL for each build enhancement javascript Pull requests that update Javascript code needs-security-review puLL-Merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for Custom Site Distillers
9 participants