Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add EpubFileLoader for EPUB file processing #192

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

danik-tro
Copy link

Add EpubFileLoader for EPUB file processing.

Solves #160

Changes

  • Implemented a new loader type, EpubFileLoader, in rig-core/src/loaders/epub.rs under the epub feature.
  • Added an optional dependency on epub-rs for handling EPUB files.
  • Extracts chapters from the EPUB file. Currently, the text is retrieved in XML format, as EPUB files are archives of XML files.
  • Potential enhancement: Add methods to EpubFileLoader for stripping XML tags to produce plain text.

@0xMochan
Copy link
Contributor

Great work on this, will be reviewing this soon! I love how you added a by_chapter method which matches by_page. Let me know how creating a custom Loader was, I think there are ways to make it cleaner with loader traits but the way the types are setup might seem a bit confusing at first!

@danik-tro
Copy link
Author

Great work on this, will be reviewing this soon! I love how you added a by_chapter method which matches by_page. Let me know how creating a custom Loader was,

Thank you for the feedback! I spent some time understanding how the loader works, and that took the most effort. Compared to researching the existing ones, implementing the new loader took significantly less time. I really like the concept of the type state pattern.

I think there are ways to make it cleaner with loader traits but the way the types are setup might seem a bit confusing at first!

Yes, it does look confusing, but it’s effective at the same time. I’m experimenting with a few improvement options, and if something works out, I’ll share an example.

@0xMochan
Copy link
Contributor

0xMochan commented Jan 17, 2025

Great work on this, will be reviewing this soon! I love how you added a by_chapter method which matches by_page. Let me know how creating a custom Loader was,

Thank you for the feedback! I spent some time understanding how the loader works, and that took the most effort. Compared to researching the existing ones, implementing the new loader took significantly less time. I really like the concept of the type state pattern.

I think there are ways to make it cleaner with loader traits but the way the types are setup might seem a bit confusing at first!

Yes, it does look confusing, but it’s effective at the same time. I’m experimenting with a few improvement options, and if something works out, I’ll share an example.

Hey, wanted to check up on this. Were you going to introduce anything extra here or is this ready for review? It does look like you need to rebase to main to ensure that this can be merged w/o conflicts. When resolved, you can mark me for review!

@danik-tro
Copy link
Author

Great work on this, will be reviewing this soon! I love how you added a by_chapter method which matches by_page. Let me know how creating a custom Loader was,

Thank you for the feedback! I spent some time understanding how the loader works, and that took the most effort. Compared to researching the existing ones, implementing the new loader took significantly less time. I really like the concept of the type state pattern.

I think there are ways to make it cleaner with loader traits but the way the types are setup might seem a bit confusing at first!

Yes, it does look confusing, but it’s effective at the same time. I’m experimenting with a few improvement options, and if something works out, I’ll share an example.

Hey, wanted to check up on this. Were you going to introduce anything extra here or is this ready for review? It does look like you need to rebase to main to ensure that this can be merged w/o conflicts. When resolved, you can mark me for review!

Hi! I've resolved conflicts using the GitHub tool. Let me know if rebase is mandatory; I'll squash commits locally and push them to the branch. The PR is ready to be reviewed. No additional changes aren't going to be added for now, maybe in the next PRs.

@0xMochan 0xMochan self-requested a review January 20, 2025 16:10
@0xMochan
Copy link
Contributor

0xMochan commented Feb 3, 2025

@danik-tro Could this PR be rebased / merged w/ main! thanks!

@danik-tro
Copy link
Author

@0xMochan Hi!
Yep, I've fixed conflicts.

@0xMochan
Copy link
Contributor

0xMochan commented Feb 5, 2025

Hey @danik-tro ! A big PR was merged this morning. It likely doesn't conflict but it be good to rebase to main. I'll finish a review ASAP to get this merged soon

Copy link
Contributor

@0xMochan 0xMochan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, I used this with a real epub, It's pretty good! I think one suggestion I'd make is to include a way to strip html codes so that there's less symbols and content (like add a strip_html_symbols, etc). If you don't think that's a good idea, then we can go ahead and merge but it's something I noticed in testing.

@danik-tro
Copy link
Author

Hi! Good point. I will add it today or tomorrow.

Hey, I used this with a real epub, It's pretty good! I think one suggestion I'd make is to include a way to strip html codes so that there's less symbols and content (like add a strip_html_symbols, etc). If you don't think that's a good idea, then we can go ahead and merge but it's something I noticed in testing.

@danik-tro
Copy link
Author

@0xMochan Hi! I've refactored the Epub loader. Now, text processing after extracting a chapter is handled by separate processors. This allows users to define their processors and customize the processing of Epub chapters as needed. Maybe it's a bit complicated, but I’d appreciate your feedback on this implementation!

@danik-tro danik-tro requested a review from 0xMochan February 15, 2025 20:51
@0xMochan
Copy link
Contributor

0xMochan commented Feb 17, 2025

@0xMochan Hi! I've refactored the Epub loader. Now, text processing after extracting a chapter is handled by separate processors. This allows users to define their processors and customize the processing of Epub chapters as needed. Maybe it's a bit complicated, but I’d appreciate your feedback on this implementation!

Oh this is interesting. a general text-processing aspect is intriguing, esp since it could help with chunking strategies which we also lack. currently, it looks baked into the epub one but there might be an opportunity to generalize later on (XML isn't specific to epub after-all) in a future PR.

there's an argument to whether the whole loader thing can be over-engineering as "another" way to deal with loading in general text. Also another aspect is whether it should be worked thru via the pipelines module (currently loaders are not well integrated).

I started my review but I'll get back to this tomorrow or Tuesday to finish it up!

Copy link
Contributor

@0xMochan 0xMochan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an exciting approach to this problem, def. an abstraction that can grow into the loaders. I wanna bring this up with the team before we merge (and do wanna test locally with some more complex epubs) as it adds a new dep and concept within rig.

now, a bit of a ramble ;)

I'll say, there's a sorta natural web and flow when it comes to abstractions. Often times, u can generalize to an extreme that ends up something similar to a natural language construct. In Rig, the opinionated-ness of the abstractions with the context of llms and agents help inform the design of the framework. If we make certain assumptions, we can simplify dev-ex from "doing things from scratch" to hopefully save time and I do believe the TextProcessing trait does that very well!

@danik-tro
Copy link
Author

This is an exciting approach to this problem, def. an abstraction that can grow into the loaders. I wanna bring this up with the team before we merge (and do wanna test locally with some more complex epubs) as it adds a new dep and concept within rig.

now, a bit of a ramble ;)

I'll say, there's a sorta natural web and flow when it comes to abstractions. Often times, u can generalize to an extreme that ends up something similar to a natural language construct. In Rig, the opinionated-ness of the abstractions with the context of llms and agents help inform the design of the framework. If we make certain assumptions, we can simplify dev-ex from "doing things from scratch" to hopefully save time and I do believe the TextProcessing trait does that very well!

I appreciate your thoughtful feedback!

and do wanna test locally with some more complex epubs

Have you had a chance to test it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants