feat: Add XLSXToDocument converter #8522

Merged
merged 19 commits into from
Jan 9, 2025

Conversation

sjrl
Contributor

@sjrl sjrl commented Nov 8, 2024

Related Issues

Proposed Changes:

Draft of the Excel to Document converter

How did you test it?

Added tests

Notes for the reviewer

Checklist

@github-actions github-actions bot added the topic:tests and type:documentation labels Nov 8, 2024
@sjrl sjrl requested a review from bglearning November 8, 2024 09:48
@sjrl
Contributor Author

sjrl commented Nov 8, 2024

Hey @bglearning, I'd like to discuss with you how you got references to work (e.g. pointing to the correct row in the original Excel table), to make sure we accommodate that properly in this component.

@coveralls
Collaborator

coveralls commented Nov 8, 2024

Pull Request Test Coverage Report for Build 12685532464

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.06%) to 91.088%

Totals Coverage Status
Change from base Build 12673146704: 0.06%
Covered Lines: 8647
Relevant Lines: 9493

💛 - Coveralls

@bglearning
Contributor

btw @sjrl, it looks like the header argument of to_csv also accepts a list of strings:

header: bool or list of str, default True

So the out_header (or equivalent) can have the same type for both output formats (List[str]), and it would simply default to the DataFrame's columns (if excel_column_names is optional).
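A minimal sketch of that behaviour, assuming a toy DataFrame in place of a parsed worksheet. The variable name `out_header` is only illustrative here and is not taken from the converter's code:

```python
import pandas as pd

# Toy data standing in for one parsed Excel sheet (illustrative only).
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Hypothetical header override: a list of strings acts as column aliases.
out_header = ["first", "second"]

# header=True (the default) writes the DataFrame's own column names.
csv_default = df.to_csv(index=False)

# header=<list of str> writes the aliases instead of the column names.
csv_renamed = df.to_csv(index=False, header=out_header)

print(csv_default.splitlines()[0])
print(csv_renamed.splitlines()[0])
```

The list must have one entry per column, otherwise pandas raises an error.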

On the flip side, to_markdown eventually calls the tabulate library, which uses the headers kwarg:

- `headers` can be an explicit list of column headers
- if `headers="firstrow"`, then the first row of data is used
- if `headers="keys"`, then dictionary keys or column indices are used

@sjrl sjrl marked this pull request as ready for review December 11, 2024 14:21
@sjrl sjrl requested review from a team as code owners December 11, 2024 14:21
@sjrl sjrl requested review from dfokina and anakin87 and removed request for a team December 11, 2024 14:21
@sjrl
Contributor Author

sjrl commented Dec 11, 2024

Just realized I'd like to do a bit more testing, especially of what happens when a complicated table is converted, e.g. one with merged cells.

So a TODO:

  • Add a test with a realistic table (i.e. one like a client's)

Member

@anakin87 anakin87 left a comment

I have some initial comments.
Overall, the PR looks good.

haystack/components/converters/xlsx.py (resolved)
haystack/components/converters/xlsx.py (outdated, resolved)
haystack/components/converters/xlsx.py (outdated, resolved)
haystack/components/converters/xlsx.py (outdated, resolved)
@sjrl sjrl requested a review from anakin87 December 12, 2024 14:08
Member

@anakin87 anakin87 left a comment

Thanks.

Feel free to add a test with a realistic table (#8522 (comment)), then it's ready to go.

@anakin87 anakin87 self-requested a review January 9, 2025 07:59
Member

@anakin87 anakin87 left a comment

We want to ship this feature in the next release,
so we can skip the test with a realistic table for the moment.

It would be great if we can add it later.

@sjrl sjrl merged commit 28ad78c into main Jan 9, 2025
20 checks passed
@sjrl sjrl deleted the xlsx-converter branch January 9, 2025 08:03
4 participants