[ENH] add table info the documents for annotation #50

jdkent · 2024-06-05T03:26:01Z

closes #49
Add table(s) to the text

Here is an example of how it would look:

Kanayama, Tomoyuki and Nakase, Junsuke and Mochizuki, Takafumi and Asai, Kazuki and Yoshimizu, Rikuto and Kimura, Mitsuhiro and Kinuya, Seigo and Tsuchiya, Hiroyuki
Sci Rep, 2022

# Title

Evaluation of skeletal muscle activity during foot training exercises using positron emission tomography

# Keywords

Medical research
Outcomes research


# Abstract
 The abstract of the article  

# Body
 The text of the article with coordinates,
brains, auditory cortex, memory, memory    

# Table(s)
## ID: None
### Label: None
X	Y	Z
10	20	30
-10	-20	-30
### Caption
None
### Footer
None

jdkent · 2024-06-05T17:14:34Z

adelavega · 2024-11-06T00:56:18Z

@jeromedockes can you review this PR?

jeromedockes · 2024-11-06T11:39:10Z

Thanks @jdkent that's a very useful addition! I wonder if we could get better formatting of the tables. it will never be great because labelbuddy is for annotating plain text but maybe we can do slightly better.

ATM this is what I get:

I would rather not use tabs because they are control characters but the qt text interface will not do any special formatting
maybe we can have the columns be aligned when the user changes the labelbuddy font to use a fixed-width one. one way to do it is to let pandas do the formatting; instead of creating the table's string yourself you could use pd.DataFrame.to_string or something similar. it will also simplify the code a little bit

diff --git a/src/pubget/_labelbuddy.py b/src/pubget/_labelbuddy.py
index 9561794..2850c80 100644
--- a/src/pubget/_labelbuddy.py
+++ b/src/pubget/_labelbuddy.py
@@ -100,9 +100,8 @@ def _format_tables(doc_tables: pd.DataFrame, root_dir: Path) -> str:
             table_info["table_label"]
         table_str = f"## ID: {table_id}\n### Label: {table_label}\n"
         with open(table_path, encoding="utf-8") as table_fh:
-            reader = csv.reader(table_fh)
-            for row in reader:
-                table_str += "\t".join(row) + "\n"  # Tab-separated values
+            df = pd.read_csv(table_fh)
+            table_str += df.to_string() + "\n"
         table_caption = "None" if pd.isna(table_info["table_caption"]) else \
             table_info["table_caption"]
         table_foot = "None" if pd.isna(table_info["table_foot"]) else \

of course for wide tables it will still look terrible because lines wrap

maybe we could also have an empty line between the # tables header and the first table, and possibly between the different sections of each table (id, label, table, footer)

In a later PR, I think it would be worth exploring different ways of displaying tables that makes them more readable when seen as plain text, maybe similar to asciidoc's syntax which allows placing each cell on a separate line (it also allows placing each row in a single line)

jeromedockes · 2024-11-06T11:41:08Z

In a later PR, I think it would be worth exploring different ways of displaying tables that makes them more readable when seen as plain text, maybe similar to asciidoc's syntax which allows placing each cell on a separate line (it also allows placing each row in a single line)

we could also investigate how to map annotation positions back to table cells (in another issue / pr)

adelavega · 2024-11-06T22:43:27Z

thanks! my only concern with doing this change you suggest in a later PR is that it could make it difficult to map any annotations that are made on current versions to a "new" version of the labelbuddy file.

so if it seems like an important thing to do, it might be worth doing it now.

jeromedockes · 2024-11-07T08:57:49Z

thanks! my only concern with doing this change you suggest in a later PR is that it could make it difficult to map any annotations that are made on current versions to a "new" version of the labelbuddy file. so if it seems like an important thing to do, it might be worth doing it now.

yeah that's a good point -- I guess it depends on what you're planning to annotate in the short term and to do with the annotations!

adelavega · 2024-11-07T16:03:22Z

Will think this over and talk with @jdkent!

adelavega · 2024-11-08T21:48:28Z

I think re: putting cells on each line, that might be good for fine grained annotation, but it is a less faithfull representation of the original tables, which might make them more difficult to interpret. Often the specific way that table is formatted is informative.

There is a library for pretty formatting pandas (and other) tables: https://github.com/astanin/python-tabulate
Not sure if it's worth adding a dependency though.

…dy_table

jdkent · 2024-11-11T15:26:18Z

we could also investigate how to map annotation positions back to table cells (in another issue / pr)

So I'm clear on this, we would have an index of the table cells specifying column and row, and when a user highlights text in the table, we would be able to point to the specific column and row the highlighted text is from?

like modifying _get_inserted_field_positions or having a similar function for identifying the table cells?

def _get_inserted_field_positions(
    template: str, fields: Mapping[str, Any]
) -> Dict[str, Tuple[int, int]]:
    """Return the indices in a formatted str where values have been inserted.

    example:
    >>> _get_inserted_field_positions("{a}345{b}7", {"a": "012", "b": "6"})
    {'a': (0, 3), 'b': (6, 7)}
    """
    template_parts = re.split(r"\{([^}]*)\}", template)
    prefixes, field_names = template_parts[::2], template_parts[1::2]
    positions = {}
    start, end = 0, 0
    for pref, name in zip(prefixes, field_names):
        start += len(pref)
        end = start + len(str(fields[name]))
        positions[name] = start, end
        start = end
    return positions

adelavega · 2024-11-11T16:17:51Z

My issue with this is that sometimes in scientific tables the specific context of how cells are arranged makes them easier to interpret. And my concern is that by reformatting them some of the intent of that table is lost.

Hard to say without seeing some example though.

I think for now just replacing the tabs is sufficient for this PR.

jdkent · 2024-11-11T16:55:24Z

I made a large example table, this is how it renders (on github anyway):

## ID: None

### Label: None

    X   Y   Z             Region  Activation_Level  Voxel_Count  Signal_Noise_Ratio  Timepoint  Intensity  Subject_ID  Confidence_Score  Cluster_Size  P_Value  Q_Value  Correlation  Noise_Level Hemisphere  Mean_Signal  Std_Dev  Median_Signal Session_ID Scan_Type Experiment_ID
0  10  20  30        Hippocampus               1.5          150                 3.2          5        2.3         101              0.95            12     0.05     0.07         0.85         0.20       Left          1.7      0.5            1.6       SS01      fMRI        EXP001
1 -10 -20 -30  Prefrontal Cortex               2.1          200                 2.8         10        1.9         102              0.88            14     0.03     0.09         0.75         0.30      Right          2.1      0.4            1.9       SS02       MEG        EXP002
2  15  25  35    Auditory Cortex               3.0          120                 3.5         15        3.1         103              0.97            16     0.01     0.08         0.65         0.10       Left          2.5      0.3            2.4       SS03       EEG        EXP003
3 -15 -25 -35      Visual Cortex               1.8          180                 2.9         20        2.2         104              0.89            13     0.02     0.06         0.72         0.20      Right          1.9      0.6            1.8       SS04      fMRI        EXP004
4  20  30  40           Thalamus               2.5          160                 3.1         25        2.8         105              0.92            18     0.04     0.05         0.78         0.25       Left          2.2      0.4            2.1       SS05       MEG        EXP005
5 -20 -30 -40    Parietal Cortex               2.0          170                 3.0         30        2.7         106              0.90            17     0.03     0.07         0.80         0.22      Right          2.0      0.5            1.9       SS06       EEG        EXP006
6  25  35  45    Medial Temporal               3.3          130                 3.3         35        3.0         107              0.94            15     0.02     0.06         0.83         0.30       Left          2.3      0.7            2.2       SS07      fMRI        EXP007
7 -25 -35 -45   Occipital Cortex               1.7          190                 2.7         40        1.8         108              0.91            19     0.03     0.08         0.70         0.10      Right          1.5      0.3            1.4       SS08       MEG        EXP008
8  30  40  50         Cerebellum               2.4          140                 3.6         45        2.9         109              0.93            11     0.05     0.07         0.84         0.30       Left          2.0      0.4            1.8       SS09       EEG        EXP009
9 -30 -40 -50      Basal Ganglia               1.9          210                 2.5         50        1.6         110              0.87            20     0.06     0.05         0.82         0.20      Right          1.7      0.6            1.5       SS10      fMRI        EXP010

### Caption

None

### Footer

None

jeromedockes · 2024-11-12T09:33:02Z

I think it looks good now!

So I'm clear on this, we would have an index of the table cells specifying column and row, and when a user highlights text in the table, we would be able to point to the specific column and row the highlighted text is from?

yes that's what I was thinking about, I don't think it should be very hard to do (but it might be a bit tricky to do fast); but if you don't need it for what is being annotated now we can leave it for later I guess.

In any case I think the current pr is already a great improvement, thanks!

jdkent · 2024-11-12T15:46:03Z

looks like 3.7 is not supported by ubuntu-latest, we can switch to ubuntu 22-04, or drop 3.7.

actions/setup-python#962

jeromedockes · 2024-11-12T15:49:40Z

yes we should drop 3.7 . I also want to switch from setup.py to a pyproject.toml only and a different build backend (say hatchling for example) but haven't gotten around to it yet

jeromedockes · 2024-11-12T15:51:18Z

from a quick look it seems like some of the errors in the 3.8 job are genuine though? (not issues in the code but tests that need to be updated)

jdkent added 4 commits June 4, 2024 22:16

add table info the documents for annotation

f4dc664

do more formatting

5a5d58e

add missing line

15e21f1

fix test

c348362

jdkent added 2 commits November 11, 2024 08:56

Merge branch 'main' of github.com:neuroquery/pubget into enh/labelbud…

6b5f994

…dy_table

add newlines and change stringification

83b9c24

jdkent added 2 commits November 11, 2024 10:28

mess with newlines

fd11d63

make an unruly table with many columns

7d62366

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ENH] add table info the documents for annotation #50

[ENH] add table info the documents for annotation #50

jdkent commented Jun 5, 2024

jdkent commented Jun 5, 2024

adelavega commented Nov 6, 2024

jeromedockes commented Nov 6, 2024

jeromedockes commented Nov 6, 2024

adelavega commented Nov 6, 2024

jeromedockes commented Nov 7, 2024 via email

adelavega commented Nov 7, 2024

adelavega commented Nov 8, 2024

jdkent commented Nov 11, 2024 •

edited

Loading

adelavega commented Nov 11, 2024

jdkent commented Nov 11, 2024

jeromedockes commented Nov 12, 2024

jdkent commented Nov 12, 2024

jeromedockes commented Nov 12, 2024

jeromedockes commented Nov 12, 2024

[ENH] add table info the documents for annotation #50

Are you sure you want to change the base?

[ENH] add table info the documents for annotation #50

Conversation

jdkent commented Jun 5, 2024

jdkent commented Jun 5, 2024

adelavega commented Nov 6, 2024

jeromedockes commented Nov 6, 2024

jeromedockes commented Nov 6, 2024

adelavega commented Nov 6, 2024

jeromedockes commented Nov 7, 2024 via email

adelavega commented Nov 7, 2024

adelavega commented Nov 8, 2024

jdkent commented Nov 11, 2024 • edited Loading

adelavega commented Nov 11, 2024

jdkent commented Nov 11, 2024

jeromedockes commented Nov 12, 2024

jdkent commented Nov 12, 2024

jeromedockes commented Nov 12, 2024

jeromedockes commented Nov 12, 2024

jdkent commented Nov 11, 2024 •

edited

Loading