feat: Add EpubFileLoader for EPUB file processing #192

danik-tro · 2025-01-10T01:23:31Z

Add EpubFileLoader for EPUB file processing.

Solves #160

Changes

Implemented a new loader type, EpubFileLoader, in rig-core/src/loaders/epub.rs under the epub feature.
Added an optional dependency on epub-rs for handling EPUB files.
Extracts chapters from the EPUB file. Currently, the text is retrieved in XML format, as EPUB files are archives of XML files.
Potential enhancement: Add methods to EpubFileLoader for stripping XML tags to produce plain text.
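A minimal sketch of what the tag-stripping enhancement could look like, using only the standard library. `strip_xml_tags` is a hypothetical helper written for illustration; it is not part of this PR or of epub-rs. It assumes well-formed markup, does not decode entities such as `&amp;`, does not special-case comments or CDATA, and treats every tag as a word boundary (so inline markup in the middle of a word would split that word).

```rust
/// Hypothetical helper (not part of this PR): drops everything between
/// `<` and `>` and collapses runs of whitespace into single spaces.
pub fn strip_xml_tags(xml: &str) -> String {
    let mut text = String::with_capacity(xml.len());
    let mut in_tag = false;

    for ch in xml.chars() {
        match ch {
            // Start of a tag: skip characters until the matching '>'.
            '<' => in_tag = true,
            // End of a tag: emit a space so adjacent blocks do not fuse.
            '>' if in_tag => {
                in_tag = false;
                text.push(' ');
            }
            // Ordinary text outside of a tag is kept as-is.
            _ if !in_tag => text.push(ch),
            // Characters inside a tag (names, attributes) are dropped.
            _ => {}
        }
    }

    // Collapse whitespace left behind by removed tags and source indentation.
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

A real implementation would more likely rely on a proper XML parser; this only illustrates the shape of the transformation from chapter XML to plain text.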

0xMochan · 2025-01-10T21:52:55Z

Great work on this, will be reviewing this soon! I love how you added a by_chapter method which matches by_page. Let me know how creating a custom Loader was, I think there are ways to make it cleaner with loader traits but the way the types are setup might seem a bit confusing at first!

danik-tro · 2025-01-11T22:00:33Z

> Great work on this, will be reviewing this soon! I love how you added a by_chapter method which matches by_page. Let me know how creating a custom Loader was,

Thank you for the feedback! I spent some time understanding how the loader works, and that took the most effort. Compared to researching the existing ones, implementing the new loader took significantly less time. I really like the concept of the type state pattern.

> I think there are ways to make it cleaner with loader traits but the way the types are set up might seem a bit confusing at first!

Yes, it does look confusing, but it’s effective at the same time. I’m experimenting with a few improvement options, and if something works out, I’ll share an example.
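For readers unfamiliar with the type state pattern mentioned above, here is a stripped-down illustration. All names (`ChapterLoader`, `Raw`, `ByChapter`, `numbered`) are hypothetical and chosen for this sketch; they are not rig's actual loader types. The idea is that the loader's current mode is encoded in a type parameter, so methods that only make sense in one mode are only available in that mode.

```rust
use std::marker::PhantomData;

/// Marker types for the loader's state (hypothetical names).
pub struct Raw;
pub struct ByChapter;

/// A toy loader whose state is tracked in the type parameter.
pub struct ChapterLoader<State> {
    chapters: Vec<String>,
    _state: PhantomData<State>,
}

impl ChapterLoader<Raw> {
    /// Construct a loader in its initial state.
    pub fn new(chapters: Vec<String>) -> Self {
        ChapterLoader { chapters, _state: PhantomData }
    }

    /// Consume the raw loader and move to the per-chapter state.
    pub fn by_chapter(self) -> ChapterLoader<ByChapter> {
        ChapterLoader { chapters: self.chapters, _state: PhantomData }
    }
}

impl ChapterLoader<ByChapter> {
    /// Only available after `by_chapter()`: yields (index, chapter) pairs.
    pub fn numbered(self) -> impl Iterator<Item = (usize, String)> {
        self.chapters.into_iter().enumerate()
    }
}
```

Calling `numbered()` on a `ChapterLoader<Raw>` fails to compile, which is the point of the pattern: invalid call orders are rejected by the type checker rather than at runtime.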

0xMochan · 2025-01-17T21:52:08Z

Hey, wanted to check up on this. Were you going to introduce anything extra here or is this ready for review? It does look like you need to rebase to main to ensure that this can be merged w/o conflicts. When resolved, you can mark me for review!

danik-tro · 2025-01-18T12:01:38Z

Hi! I've resolved the conflicts using the GitHub tool. Let me know if a rebase is mandatory; if so, I'll squash the commits locally and push them to the branch. The PR is ready to be reviewed. No additional changes are going to be added for now; maybe in the next PRs.

0xMochan · 2025-02-03T16:23:25Z

@danik-tro Could this PR be rebased onto / merged with main? Thanks!

danik-tro · 2025-02-04T09:58:27Z

@0xMochan Hi!
Yep, I've fixed conflicts.

0xMochan · 2025-02-05T19:26:52Z

Hey @danik-tro! A big PR was merged this morning. It likely doesn't conflict, but it would be good to rebase to main. I'll finish a review ASAP to get this merged soon.

0xMochan

Hey, I used this with a real epub, and it's pretty good! One suggestion I'd make is to include a way to strip HTML tags so that there are fewer symbols and less extraneous content (e.g. add a strip_html_symbols option). If you don't think that's a good idea, we can go ahead and merge, but it's something I noticed in testing.

danik-tro · 2025-02-10T00:20:01Z

Hi! Good point. I will add it today or tomorrow.

danik-tro · 2025-02-15T20:51:09Z

@0xMochan Hi! I've refactored the Epub loader. Now, text processing after extracting a chapter is handled by separate processors. This allows users to define their processors and customize the processing of Epub chapters as needed. Maybe it's a bit complicated, but I’d appreciate your feedback on this implementation!
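A rough sketch of the processor idea described above: chapter extraction stays in the loader, while post-processing is delegated to a pluggable type that users can implement themselves. The trait and type names here (`ChapterProcessor`, `RawText`, `CollapseWhitespace`, `process_chapters`) are assumptions made for illustration and may not match the trait that actually landed in the PR.

```rust
use std::convert::Infallible;

/// Hypothetical trait: turns the raw text of one chapter into processed text.
pub trait ChapterProcessor {
    type Error;
    fn process(&self, chapter: &str) -> Result<String, Self::Error>;
}

/// Returns the chapter unchanged (the behavior before any processing existed).
pub struct RawText;

impl ChapterProcessor for RawText {
    type Error = Infallible;
    fn process(&self, chapter: &str) -> Result<String, Self::Error> {
        Ok(chapter.to_owned())
    }
}

/// Example of a user-defined processor: collapses whitespace runs.
pub struct CollapseWhitespace;

impl ChapterProcessor for CollapseWhitespace {
    type Error = Infallible;
    fn process(&self, chapter: &str) -> Result<String, Self::Error> {
        Ok(chapter.split_whitespace().collect::<Vec<_>>().join(" "))
    }
}

/// Applies a processor to every chapter, stopping at the first error.
pub fn process_chapters<P: ChapterProcessor>(
    chapters: &[&str],
    processor: &P,
) -> Result<Vec<String>, P::Error> {
    chapters.iter().map(|c| processor.process(c)).collect()
}
```

Keeping the error type associated with the processor lets trivial processors be infallible while still allowing, for example, an XML-stripping processor to report parse failures.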

0xMochan · 2025-02-17T02:48:09Z

Oh, this is interesting. A general text-processing aspect is intriguing, especially since it could help with chunking strategies, which we also lack. Currently it looks baked into the epub loader, but there might be an opportunity to generalize it in a future PR (XML isn't specific to EPUB, after all).

There's an argument about whether the whole loader thing is over-engineering, as "another" way to deal with loading general text. Another question is whether it should be worked through via the pipelines module (currently loaders are not well integrated).

I started my review but I'll get back to this tomorrow or Tuesday to finish it up!

0xMochan

This is an exciting approach to this problem, def. an abstraction that can grow into the loaders. I wanna bring this up with the team before we merge (and do wanna test locally with some more complex epubs) as it adds a new dep and concept within rig.

now, a bit of a ramble ;)

I'll say, there's a sort of natural ebb and flow when it comes to abstractions. Oftentimes, you can generalize to an extreme that ends up being something similar to a natural-language construct. In Rig, the opinionated-ness of the abstractions, in the context of LLMs and agents, helps inform the design of the framework. If we make certain assumptions, we can simplify dev-ex from "doing things from scratch" and hopefully save time, and I do believe the TextProcessing trait does that very well!

rig-core/src/loaders/epub.rs

danik-tro · 2025-02-25T06:11:00Z

I appreciate your thoughtful feedback!

> and do wanna test locally with some more complex epubs

Have you had a chance to test it?

danik-tro added 2 commits January 10, 2025 03:05

feat: add loaders for epub files

cf8e76d

style: Remove redundant closure

2e731ef

Merge branch 'main' into feat/epub-loader

e0ff823

0xMochan self-requested a review January 20, 2025 16:10

Merge branch 'main' into feat/epub-loader

708b658

0xMochan approved these changes Feb 5, 2025

refactor: Add an ability to strip xml with XmlProcessor

792f1b9

danik-tro requested a review from 0xMochan February 15, 2025 20:51

chore: Fix lifetime clippy warning

dd4d69e

0xMochan approved these changes Feb 18, 2025

joshua-mo-143 added the non-breaking label Feb 24, 2025

0xMochan added 2 commits February 24, 2025 10:13

fix(loaders): move epub.rs -> epub/mod.rs

715d801

fix(loaders): merge

3103eb3

0xMochan approved these changes Feb 24, 2025

Merge branch 'main' into feat/epub-loader

5baf845
