Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] add table info the documents for annotation #50

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

jdkent
Copy link

@jdkent jdkent commented Jun 5, 2024

closes #49
Add table(s) to the text

Here is an example of how it would look:

Kanayama, Tomoyuki and Nakase, Junsuke and Mochizuki, Takafumi and Asai, Kazuki and Yoshimizu, Rikuto and Kimura, Mitsuhiro and Kinuya, Seigo and Tsuchiya, Hiroyuki
Sci Rep, 2022

# Title

Evaluation of skeletal muscle activity during foot training exercises using positron emission tomography

# Keywords

Medical research
Outcomes research


# Abstract
 The abstract of the article  

# Body
 The text of the article with coordinates,
brains, auditory cortex, memory, memory    

# Table(s)
## ID: None
### Label: None
X	Y	Z
10	20	30
-10	-20	-30
### Caption
None
### Footer
None

@jdkent
Copy link
Author

jdkent commented Jun 5, 2024

Screenshot from 2024-06-05 12-13-51

@adelavega
Copy link
Collaborator

@jeromedockes can you review this PR?

@jeromedockes
Copy link
Member

Thanks @jdkent that's a very useful addition! I wonder if we could get better formatting of the tables. it will never be great because labelbuddy is for annotating plain text but maybe we can do slightly better.

ATM this is what I get:

screenshot_2024-11-06T12:19:57+01:00

  • I would rather not use tabs because they are control characters but the qt text interface will not do any special formatting
  • maybe we can have the columns be aligned when the user changes the labelbuddy font to use a fixed-width one. one way to do it is to let pandas do the formatting; instead of creating the table's string yourself you could use pd.DataFrame.to_string or something similar. it will also simplify the code a little bit
diff --git a/src/pubget/_labelbuddy.py b/src/pubget/_labelbuddy.py
index 9561794..2850c80 100644
--- a/src/pubget/_labelbuddy.py
+++ b/src/pubget/_labelbuddy.py
@@ -100,9 +100,8 @@ def _format_tables(doc_tables: pd.DataFrame, root_dir: Path) -> str:
             table_info["table_label"]
         table_str = f"## ID: {table_id}\n### Label: {table_label}\n"
         with open(table_path, encoding="utf-8") as table_fh:
-            reader = csv.reader(table_fh)
-            for row in reader:
-                table_str += "\t".join(row) + "\n"  # Tab-separated values
+            df = pd.read_csv(table_fh)
+            table_str += df.to_string() + "\n"
         table_caption = "None" if pd.isna(table_info["table_caption"]) else \
             table_info["table_caption"]
         table_foot = "None" if pd.isna(table_info["table_foot"]) else \

screenshot_2024-11-06T12:31:04+01:00

of course for wide tables it will still look terrible because lines wrap

  • maybe we could also have an empty line between the # tables header and the first table, and possibly between the different sections of each table (id, label, table, footer)

In a later PR, I think it would be worth exploring different ways of displaying tables that makes them more readable when seen as plain text, maybe similar to asciidoc's syntax which allows placing each cell on a separate line (it also allows placing each row in a single line)

@jeromedockes
Copy link
Member

In a later PR, I think it would be worth exploring different ways of displaying tables that makes them more readable when seen as plain text, maybe similar to asciidoc's syntax which allows placing each cell on a separate line (it also allows placing each row in a single line)

we could also investigate how to map annotation positions back to table cells (in another issue / pr)

@adelavega
Copy link
Collaborator

thanks! my only concern with doing this change you suggest in a later PR is that it could make it difficult to map any annotations that are made on current versions to a "new" version of the labelbuddy file.

so if it seems like an important thing to do, it might be worth doing it now.

@jeromedockes
Copy link
Member

jeromedockes commented Nov 7, 2024 via email

@adelavega
Copy link
Collaborator

Will think this over and talk with @jdkent!

@adelavega
Copy link
Collaborator

I think re: putting cells on each line, that might be good for fine grained annotation, but it is a less faithfull representation of the original tables, which might make them more difficult to interpret. Often the specific way that table is formatted is informative.

There is a library for pretty formatting pandas (and other) tables: https://github.com/astanin/python-tabulate
Not sure if it's worth adding a dependency though.

@jdkent
Copy link
Author

jdkent commented Nov 11, 2024

we could also investigate how to map annotation positions back to table cells (in another issue / pr)

So I'm clear on this, we would have an index of the table cells specifying column and row, and when a user highlights text in the table, we would be able to point to the specific column and row the highlighted text is from?

like modifying _get_inserted_field_positions or having a similar function for identifying the table cells?

def _get_inserted_field_positions(
    template: str, fields: Mapping[str, Any]
) -> Dict[str, Tuple[int, int]]:
    """Return the indices in a formatted str where values have been inserted.

    example:
    >>> _get_inserted_field_positions("{a}345{b}7", {"a": "012", "b": "6"})
    {'a': (0, 3), 'b': (6, 7)}
    """
    template_parts = re.split(r"\{([^}]*)\}", template)
    prefixes, field_names = template_parts[::2], template_parts[1::2]
    positions = {}
    start, end = 0, 0
    for pref, name in zip(prefixes, field_names):
        start += len(pref)
        end = start + len(str(fields[name]))
        positions[name] = start, end
        start = end
    return positions

@adelavega
Copy link
Collaborator

My issue with this is that sometimes in scientific tables the specific context of how cells are arranged makes them easier to interpret. And my concern is that by reformatting them some of the intent of that table is lost.

Hard to say without seeing some example though.

I think for now just replacing the tabs is sufficient for this PR.

@jdkent
Copy link
Author

jdkent commented Nov 11, 2024

I made a large example table, this is how it renders (on github anyway):

## ID: None

### Label: None

    X   Y   Z             Region  Activation_Level  Voxel_Count  Signal_Noise_Ratio  Timepoint  Intensity  Subject_ID  Confidence_Score  Cluster_Size  P_Value  Q_Value  Correlation  Noise_Level Hemisphere  Mean_Signal  Std_Dev  Median_Signal Session_ID Scan_Type Experiment_ID
0  10  20  30        Hippocampus               1.5          150                 3.2          5        2.3         101              0.95            12     0.05     0.07         0.85         0.20       Left          1.7      0.5            1.6       SS01      fMRI        EXP001
1 -10 -20 -30  Prefrontal Cortex               2.1          200                 2.8         10        1.9         102              0.88            14     0.03     0.09         0.75         0.30      Right          2.1      0.4            1.9       SS02       MEG        EXP002
2  15  25  35    Auditory Cortex               3.0          120                 3.5         15        3.1         103              0.97            16     0.01     0.08         0.65         0.10       Left          2.5      0.3            2.4       SS03       EEG        EXP003
3 -15 -25 -35      Visual Cortex               1.8          180                 2.9         20        2.2         104              0.89            13     0.02     0.06         0.72         0.20      Right          1.9      0.6            1.8       SS04      fMRI        EXP004
4  20  30  40           Thalamus               2.5          160                 3.1         25        2.8         105              0.92            18     0.04     0.05         0.78         0.25       Left          2.2      0.4            2.1       SS05       MEG        EXP005
5 -20 -30 -40    Parietal Cortex               2.0          170                 3.0         30        2.7         106              0.90            17     0.03     0.07         0.80         0.22      Right          2.0      0.5            1.9       SS06       EEG        EXP006
6  25  35  45    Medial Temporal               3.3          130                 3.3         35        3.0         107              0.94            15     0.02     0.06         0.83         0.30       Left          2.3      0.7            2.2       SS07      fMRI        EXP007
7 -25 -35 -45   Occipital Cortex               1.7          190                 2.7         40        1.8         108              0.91            19     0.03     0.08         0.70         0.10      Right          1.5      0.3            1.4       SS08       MEG        EXP008
8  30  40  50         Cerebellum               2.4          140                 3.6         45        2.9         109              0.93            11     0.05     0.07         0.84         0.30       Left          2.0      0.4            1.8       SS09       EEG        EXP009
9 -30 -40 -50      Basal Ganglia               1.9          210                 2.5         50        1.6         110              0.87            20     0.06     0.05         0.82         0.20      Right          1.7      0.6            1.5       SS10      fMRI        EXP010

### Caption

None

### Footer

None

@jeromedockes
Copy link
Member

I think it looks good now!

screenshot_2024-11-12T10:29:33+01:00

So I'm clear on this, we would have an index of the table cells specifying column and row, and when a user highlights text in the table, we would be able to point to the specific column and row the highlighted text is from?

yes that's what I was thinking about, I don't think it should be very hard to do (but it might be a bit tricky to do fast); but if you don't need it for what is being annotated now we can leave it for later I guess.

In any case I think the current pr is already a great improvement, thanks!

@jdkent
Copy link
Author

jdkent commented Nov 12, 2024

looks like 3.7 is not supported by ubuntu-latest, we can switch to ubuntu 22-04, or drop 3.7.

actions/setup-python#962

@jeromedockes
Copy link
Member

yes we should drop 3.7 . I also want to switch from setup.py to a pyproject.toml only and a different build backend (say hatchling for example) but haven't gotten around to it yet

@jeromedockes
Copy link
Member

from a quick look it seems like some of the errors in the 3.8 job are genuine though? (not issues in the code but tests that need to be updated)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow exporting figure and table captions to labelbuddy
3 participants