-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] add table info the documents for annotation #50
base: main
Are you sure you want to change the base?
Conversation
@jeromedockes can you review this PR? |
Thanks @jdkent that's a very useful addition! I wonder if we could get better formatting of the tables. it will never be great because labelbuddy is for annotating plain text but maybe we can do slightly better. ATM this is what I get:
diff --git a/src/pubget/_labelbuddy.py b/src/pubget/_labelbuddy.py
index 9561794..2850c80 100644
--- a/src/pubget/_labelbuddy.py
+++ b/src/pubget/_labelbuddy.py
@@ -100,9 +100,8 @@ def _format_tables(doc_tables: pd.DataFrame, root_dir: Path) -> str:
table_info["table_label"]
table_str = f"## ID: {table_id}\n### Label: {table_label}\n"
with open(table_path, encoding="utf-8") as table_fh:
- reader = csv.reader(table_fh)
- for row in reader:
- table_str += "\t".join(row) + "\n" # Tab-separated values
+ df = pd.read_csv(table_fh)
+ table_str += df.to_string() + "\n"
table_caption = "None" if pd.isna(table_info["table_caption"]) else \
table_info["table_caption"]
table_foot = "None" if pd.isna(table_info["table_foot"]) else \ of course for wide tables it will still look terrible because lines wrap
In a later PR, I think it would be worth exploring different ways of displaying tables that makes them more readable when seen as plain text, maybe similar to asciidoc's syntax which allows placing each cell on a separate line (it also allows placing each row in a single line) |
we could also investigate how to map annotation positions back to table cells (in another issue / pr) |
thanks! my only concern with doing this change you suggest in a later PR is that it could make it difficult to map any annotations that are made on current versions to a "new" version of the labelbuddy file. so if it seems like an important thing to do, it might be worth doing it now. |
thanks! my only concern with doing this change you suggest in a later PR is that it could make it difficult to map any annotations that are made on current versions to a "new" version of the labelbuddy file.
so if it seems like an important thing to do, it might be worth doing it now.
yeah that's a good point -- I guess it depends on what you're planning
to annotate in the short term and to do with the annotations!
|
Will think this over and talk with @jdkent! |
I think re: putting cells on each line, that might be good for fine grained annotation, but it is a less faithfull representation of the original tables, which might make them more difficult to interpret. Often the specific way that table is formatted is informative. There is a library for pretty formatting pandas (and other) tables: https://github.com/astanin/python-tabulate |
So I'm clear on this, we would have an index of the table cells specifying column and row, and when a user highlights text in the table, we would be able to point to the specific column and row the highlighted text is from? like modifying
|
My issue with this is that sometimes in scientific tables the specific context of how cells are arranged makes them easier to interpret. And my concern is that by reformatting them some of the intent of that table is lost. Hard to say without seeing some example though. I think for now just replacing the tabs is sufficient for this PR. |
I made a large example table, this is how it renders (on github anyway):
|
I think it looks good now!
yes that's what I was thinking about, I don't think it should be very hard to do (but it might be a bit tricky to do fast); but if you don't need it for what is being annotated now we can leave it for later I guess. In any case I think the current pr is already a great improvement, thanks! |
looks like 3.7 is not supported by ubuntu-latest, we can switch to ubuntu 22-04, or drop 3.7. |
yes we should drop 3.7 . I also want to switch from setup.py to a pyproject.toml only and a different build backend (say hatchling for example) but haven't gotten around to it yet |
from a quick look it seems like some of the errors in the 3.8 job are genuine though? (not issues in the code but tests that need to be updated) |
closes #49
Add table(s) to the text
Here is an example of how it would look: