JCDL 2024 triage approaches

Author: Yuanxi Fu (https://github.com/yuanxiesa); Ishita Sarraf (https://github.com/ishita-17)

Code used for Heng Zheng*, Yuanxi Fu*, Ishita Sarraf, M. Janina Sarol, and Jodi Schneider. (2024). Addressing Unreliability Propagation in Scientific Digital Libraries. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2024.https://doi.org/10.1145/3677389.3702526. (*both authors contributed equally)

Dependencies

pandas 2.2.2
numpy 1.26.4
openpyxl 3.1.5
scikit-learn 1.5.1
keras 3.6.0
tensorflow 2.17.0

Glossary

sentence-level citation context: the sentence containing the citation marker. For example, consider this paragraph from [https://pubs.acs.org/doi/10.1021/acs.joc.9b03129]

From the paragraph: "Nuclear magnetic resonance (NMR) spectroscopy is one of the pivotal analytical tools used to determine key chemical properties of organic compounds, for example, relative/ absolute configurations,1,2 and to provide further structural information, for example, representative conformational patterns of the investigated molecules.3 In this context, the spectroscopic properties of organic compounds can be proficiently predicted by accurate quantum chemical meth- ods.1,4−7 Indeed, the integration of the information from experimental and computational data can then be of fundamental importance to solve different structural issues of organic compounds. In the last decade, different studies were performed with the combination of the information from NMR spectroscopy (experimental part) and quantum mechanical (QM) calculations (predicted part) (QM/NMR integrated approach) for the successful elucidation of the configurational patterns of organic compounds.1,4 Also, this approach is helpful for the stereostructural assignment of natural compounds, thus representing a reliable alternative, faster and cheaper, to total synthesis.8 Also, the notable advances in computer science nowadays allows the perform- ance of accurate conformational sampling and QM calculations even on desktop computers, thus facilitating the structural elucidation process."

The above paragraph has TWO sentence-level citation context for citation marker 1:

"Nuclear magnetic resonance (NMR) spectroscopy is one of the pivotal analytical tools used to determine key chemical properties of organic compounds, for example, relative/ absolute configurations,1,2 and to provide further structural information, for example, representative conformational patterns of the investigated molecules.3 In this context, the spectroscopic properties of organic compounds can be proficiently predicted by accurate quantum chemical meth- ods.1,4−7"

and

"In the last decade, different studies were performed with the combination of the information from NMR spectroscopy (experimental part) and quantum mechanical (QM) calculations (predicted part) (QM/NMR integrated approach) for the successful elucidation of the configurational patterns of organic compounds.1,4 "

File Structure and Explanations

LSTM.ipynb

An LSTM model to predict the whether a sentence-level citation context indicates risk of the citing publication propagate unreliability (Y or N) Code adapted from: https://github.com/Conferences2023/TPDL/blob/main/Citation%20Analysis/LSTM.py

'code for Usman, M., & Balke, W.-T. (2023). On retraction cascade? Citation intention analysis as a quality control mechanism in digital libraries. In O. Alonso, H. Cousijn, G. Silvello, M. Marrero, C. Teixeira Lopes, & S. Marchesin (Eds.), Linking Theory and Practice of Digital Libraries (pp. 117–131). Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-43849-3_11'

Input: ML_input_data

all_data.csv: all citation context sentences
train_set.csv: the train set
test_set.csv: the test set

Output: ML_output_data

LSTM_prediction_203.csv: predictions for all sentence-level citation contexts: Y or N

decision_tree.ipynb

Code for the three approaches: the base approach, the keyword-based approach (Approach-KW), and the machine learning-based approach (Approach-ML)

Input:

input_data/keyword_dictionary.csv: all keywords used in the keywords approach
input_data/LSTM_prediction_203.csv: the prediction of the citing publication propagating risks of the publication from the LSTM model, only for the sentence-level citation context from 203 publications going into stage 3
input_data/metadata.csv: metadata needed for the decision trees
input_data/citation_context_sentence.csv: the sentence-level citation context (for decision tree as well, the keyword approach)

Output:

decision_tree_output_data/decision_df_keywords.csv: decisions made by the keyword approach (Approch-KW)
decision_tree_output_data/decision_df_ml.csv: decisions made by the machine learning-based approach (Approach-ML)

evaluation.ipynb

Input:

decision_tree_output/decision_df_keywords.csv: decisions made by the keyword approach (Approch-KW)
decision_tree_output/decision_df_ml.csv: decisions made by the machine learning-based approach (Approach-ML)
input_data/pub_annotation.csv: silver standard annotation

Output:

eval_keyword_silver: comparison between the keyword-based approach and the silver standard
eval_ml_silver: comparison between the machine learning-based approach and the silver standard
performance_metrics.tsv: perfomrnace metrics of the three approaches, correponding to Table 3 of the publication

helpers.py

all helper functions needed for decision_tree.ipynb and evaluation.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

JCDL 2024 triage approaches

Dependencies

Glossary

File Structure and Explanations

LSTM.ipynb

decision_tree.ipynb

evaluation.ipynb

helpers.py

About

Releases 2

Packages

Contributors 2

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 44 Commits
.idea		.idea
ML_input_data		ML_input_data
ML_output_data		ML_output_data
__pycache__		__pycache__
decision_tree_output		decision_tree_output
eval_output		eval_output
input_data		input_data
.DS_Store		.DS_Store
LSTM.ipynb		LSTM.ipynb
README.md		README.md
decision_tree.ipynb		decision_tree.ipynb
evaluation.ipynb		evaluation.ipynb
helpers.py		helpers.py

infoqualitylab/JCDL-2024-triage-approach

Folders and files

Latest commit

History

Repository files navigation

JCDL 2024 triage approaches

Dependencies

Glossary

File Structure and Explanations

LSTM.ipynb

decision_tree.ipynb

evaluation.ipynb

helpers.py

About

Resources

Stars

Watchers

Forks

Releases 2

Packages 0

Contributors 2

Languages

Packages