-
Notifications
You must be signed in to change notification settings - Fork 304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add EpubFileLoader for EPUB file processing #192
base: main
Are you sure you want to change the base?
Conversation
Great work on this, will be reviewing this soon! I love how you added a |
Thank you for the feedback! I spent some time understanding how the loader works, and that took the most effort. Compared to researching the existing ones, implementing the new loader took significantly less time. I really like the concept of the type state pattern.
Yes, it does look confusing, but it’s effective at the same time. I’m experimenting with a few improvement options, and if something works out, I’ll share an example. |
Hey, wanted to check up on this. Were you going to introduce anything extra here or is this ready for review? It does look like you need to rebase to main to ensure that this can be merged w/o conflicts. When resolved, you can mark me for review! |
Hi! I've resolved conflicts using the GitHub tool. Let me know if rebase is mandatory; I'll squash commits locally and push them to the branch. The PR is ready to be reviewed. No additional changes aren't going to be added for now, maybe in the next PRs. |
@danik-tro Could this PR be rebased / merged w/ main! thanks! |
@0xMochan Hi! |
Hey @danik-tro ! A big PR was merged this morning. It likely doesn't conflict but it be good to rebase to main. I'll finish a review ASAP to get this merged soon |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, I used this with a real epub, It's pretty good! I think one suggestion I'd make is to include a way to strip html codes so that there's less symbols and content (like add a strip_html_symbols
, etc). If you don't think that's a good idea, then we can go ahead and merge but it's something I noticed in testing.
Hi! Good point. I will add it today or tomorrow.
|
@0xMochan Hi! I've refactored the Epub loader. Now, text processing after extracting a chapter is handled by separate processors. This allows users to define their processors and customize the processing of Epub chapters as needed. Maybe it's a bit complicated, but I’d appreciate your feedback on this implementation! |
Oh this is interesting. a general text-processing aspect is intriguing, esp since it could help with chunking strategies which we also lack. currently, it looks baked into the epub one but there might be an opportunity to generalize later on (XML isn't specific to epub after-all) in a future PR. there's an argument to whether the whole loader thing can be over-engineering as "another" way to deal with loading in general text. Also another aspect is whether it should be worked thru via the pipelines module (currently loaders are not well integrated). I started my review but I'll get back to this tomorrow or Tuesday to finish it up! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an exciting approach to this problem, def. an abstraction that can grow into the loaders. I wanna bring this up with the team before we merge (and do wanna test locally with some more complex epubs) as it adds a new dep and concept within rig.
now, a bit of a ramble ;)
I'll say, there's a sorta natural web and flow when it comes to abstractions. Often times, u can generalize to an extreme that ends up something similar to a natural language construct. In Rig, the opinionated-ness of the abstractions with the context of llms and agents help inform the design of the framework. If we make certain assumptions, we can simplify dev-ex from "doing things from scratch" to hopefully save time and I do believe the TextProcessing
trait does that very well!
I appreciate your thoughtful feedback!
Have you had a chance to test it? |
Add EpubFileLoader for EPUB file processing.
Solves #160
Changes
EpubFileLoader
, inrig-core/src/loaders/epub.rs
under theepub
feature.epub-rs
for handling EPUB files.EpubFileLoader
for stripping XML tags to produce plain text.