<!doctype html>
<html lang="en">
<head>
<!-- Required meta tags -->
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<!-- Bootstrap CSS -->
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css" integrity="sha384-Gn5384xqQ1aoWXA+058RXPxPg6fy4IWvTNh0E263XmFcJlSAwiGgFAW/dAiS6JXm" crossorigin="anonymous">
<link rel="stylesheet" href="style.css" type="text/css">
<script src="https://code.jquery.com/jquery-3.2.1.slim.min.js" integrity="sha384-KJ3o2DKtIkvYIK3UENzmM7KCkRr/rE9/Qpg6aAZGJwFDMVNA/GpGFF93hXpG5KkN" crossorigin="anonymous"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.12.9/umd/popper.min.js" integrity="sha384-ApNbgh9B+Y1QKtv3Rn7W3mgPxhU9K/ScQsAP7hUibX39j7fakFPskvXusvfa0b4Q" crossorigin="anonymous"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/js/bootstrap.min.js" integrity="sha384-JZR6Spejh4U02d8jOt6vLEHfe/JQGiRRSQQxSfFWpi1MquVdAyjUar5+76PVCmYl" crossorigin="anonymous"></script>
<title>Thesis Repository</title>
</head>
<body>
<nav class="navbar navbar-expand-lg navbar-light navbar-custom ">
<button class="navbar-toggler" type="button" data-toggle="collapse" data-target="#navbarSupportedContent" aria-controls="navbarSupportedContent" aria-expanded="false" aria-label="Toggle navigation">
<span class="navbar-toggler-icon"></span>
</button>
<div class="collapse navbar-collapse" id="navbarSupportedContent">
<ul class="navbar-nav mr-auto">
<li class="nav-item active">
<a class="nav-link" href="index.html">Home <span class="sr-only">(current)</span></a>
</li>
<li class="nav-item">
<a class="nav-link" href="diary.html">Diary</a>
</li>
<li class="nav-item dropdown">
<a class="nav-link dropdown-toggle" href="phases.html" id="navbarDropdown" role="button" data-toggle="dropdown" aria-haspopup="true" aria-expanded="false">
Phases
</a>
<div class="dropdown-menu" aria-labelledby="navbarDropdown">
<a class="dropdown-item" href="phase1.html">Phase 1</a>
<a class="dropdown-item" href="phase2.html">Phase 2</a>
<a class="dropdown-item" href="phase3.html">Phase 3</a>
</div>
</li>
<li class="nav-item">
<a class="nav-link" href="meetings.html">Meetings</a>
</li>
</ul>
</div>
</nav>
<h1>Meetings</h1>
<div id="accordion">
<div class="card">
<div class="card-header" id="headingOne">
<h5 class="mb-0">
<button class="btn btn-link" data-toggle="collapse" data-target="#collapseOne" aria-expanded="true" aria-controls="collapseOne">
Week #1 (01-14-20)
</button>
</h5>
</div>
<div id="collapseOne" class="collapse show" aria-labelledby="headingOne" data-parent="#accordion">
<div class="card-body">
<h2>Overview</h2>
<p>The first part of the work is similar to what was developed for CROCI: we need a class that extends another class, implements a single method, and expects to receive a CSV file. This file defines a set of citations, and its form depends on the data provider (so we expect the National Institutes of Health to use a specific data format, probably CSV or JSON).</p>
<h2>get_next_citation_data</h2>
<div class="container codebg">
<pre><code>
def get_next_citation_data(self):
    row = self._get_next_in_file()

    while row is not None:
        # Normalise both identifiers; rows with an invalid DOI are skipped
        citing = self.doi.normalise(row.get("citing_id"))
        cited = self.doi.normalise(row.get("cited_id"))

        if citing is not None and cited is not None:
            created = row.get("citing_publication_date")
            if not created:
                created = None

            cited_pub_date = row.get("cited_publication_date")
            if not cited_pub_date:
                timespan = None
            else:
                # Citation is used here only to compute the timespan
                c = Citation(None, None, created, None, cited_pub_date, None, None, None, None, "", None, None, None, None, None)
                timespan = c.duration

            self.update_status_file()
            # The 6-value tuple expected by the process
            return citing, cited, created, timespan, None, None

        self.update_status_file()
        row = self._get_next_in_file()

    # No rows left: the status file is no longer needed
    remove(self.status_file)
</code></pre></div>
<p>This function is the final part of <a href="https://github.com/opencitations/index/blob/master/croci/crowdsourcedcitationsource.py">index/croci/crowdsourcedcitationsource.py</a>.
It was developed to handle the particular CSV format that we expect for CROCI, and it extracts the information to return. The process expects a tuple of six values, derived from the citation source:</p>
<ol>
<li>citing</li>
<li>cited</li>
<li>created</li>
<li>timespan</li>
<li>None</li>
<li>None</li>
</ol>
<p>The last two values are meant to flag self-citations: whether citing and cited were published in the same journal (journal_sc) and whether they share at least one author (author_sc). In both cases the value is None because at the moment we are not interested in this information.
</p>
<h2>The Project</h2>
<p>We have to implement a class that extends the CSV-based citation source class and implements a get_next_citation_data method tailored to the NIH dataset (and the related file formats).</p>
<p>In the aforementioned function we have <em>c</em>: it is an instance of the Citation class, used there only to compute the timespan (c.duration). The class that manages the interaction with the data source is the citation source class itself. Each such class relates to a specific format (? "typology"), depending on the index to be created.</p>
<p>We are going to work in an existing environment where introducing a new data source requires creating a new class, which extracts from the new source the same set of information that the process is designed to handle. The system is highly "parameterized": it always expects to receive a 6-value tuple, and executes a specific set of actions depending on the given values.
In particular, the plug-in to be developed should be able to extract from the NIH file format the same information required by the system, so as to place it in the 6-element tuple. That is what COCI does too, but with the JSON format.
</p>
<p>The two primary identifiers identify the citing and the cited entity, and they are not always DOIs. In this case, we may have PubMed IDs or PubMed Central IDs.
One of the problems to handle is that some of these citations may already be present in <a href="https://opencitations.net/">OpenCitations</a>. For example, COCI contains only DOI-to-DOI citations; some of the articles in the NIH dataset we are going to import probably have a DOI, but it is exposed in a different way.
However, if an article has a DOI, it should be specified among the article data; alternatively, it can be recovered through an external mapping dataset.</p>
<p>Another significant problem is that the 6-value tuple neither expects nor stores information about the DOI, since it is structured to handle only the six aspects listed above.</p>
<p>We will need to add the mapping information because, once the dataset is imported into OpenCitations, the API used for the unification process needs to handle the identifiers. This issue has not been addressed so far, since the data already imported did not require it.</p>
<h2>Phases Of The Project</h2>
<ol>
<li>Develop software to read the data source. We need the aforementioned class in order to manage the new data source. In this phase the index is not supposed to disambiguate: it only has to work within its own scope.</li>
<li>Develop a sub-index of indexes for the alignment phase, so as to recognise and map resources that refer to the same citation despite having different URLs. This plug-in is meant to handle all the possible data sources so that, if we need to import a new dataset with a different identifier in the future, we can reuse the same tool.</li>
<li>API</li>
</ol>
<h2>Test Driven Development</h2>
<p>The ninth file in the test folder, <a href="https://github.com/opencitations/index/blob/master/test/09_croci.py">09_croci.py</a>, provides a good example of how this kind of test works. The procedure starts by creating a new class that defines two methods: a set-up method and a test method.</p>
<div class="container codebg">
<pre><code>
import unittest
from index.coci.glob import process
from os import sep, makedirs
from os.path import exists
from shutil import rmtree
from index.storer.csvmanager import CSVManager
from index.croci.crowdsourcedcitationsource import CrowdsourcedCitationSource
from csv import DictReader


class CROCITest(unittest.TestCase):

    def setUp(self):
        self.input_file = "index%stest_data%scroci_dump%ssource.csv" % (sep, sep, sep)
        self.citations = "index%stest_data%scroci_dump%scitations.csv" % (sep, sep, sep)

    def test_citation_source(self):
        ccs = CrowdsourcedCitationSource(self.input_file)
        new = []
        cit = ccs.get_next_citation_data()
        while cit is not None:
            citing, cited, creation, timespan, journal_sc, author_sc = cit
            new.append({
                "citing": citing,
                "cited": cited,
                "creation": "" if creation is None else creation,
                "timespan": "" if timespan is None else timespan,
                "journal_sc": "no" if journal_sc is None else journal_sc,
                "author_sc": "no" if author_sc is None else author_sc
            })
            cit = ccs.get_next_citation_data()

        with open(self.citations) as f:
            old = list(DictReader(f))

        self.assertEqual(new, old)
</code></pre></div>
<h2>To do list</h2>
<ul>
<li>create a repository where the weekly report will be kept up to date.</li>
<li>write an e-mail to DARCH and ask for an internship (subject: first phase of the thesis project, concerning the interaction with the data source).</li>
<li>study the <a href="https://docs.python.org/3/library/unittest.html">unittest library</a> for tests and make some practice with it.</li>
<li>have a look at CROCI test, which is simple and deals with a class which is similar to the one to be developed. Start with <a href="https://comp-think.github.io/">CT exercises</a> and reproduce the functions using unittest. Before writing the functions that the class implements, make sure to have the test written. Learn how to launch the library, how to use it proficiently.</li>
</ul>
</div>
</div>
</div>
<div class="card">
<div class="card-header" id="headingTwo">
<h5 class="mb-0">
<button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseTwo" aria-expanded="false" aria-controls="collapseTwo">
Week #2 (01-21-21)
</button>
</h5>
</div>
<div id="collapseTwo" class="collapse" aria-labelledby="headingTwo" data-parent="#accordion">
<div class="card-body">
<h2>To do list</h2>
<ul>
<li>Extract ten examples of citation data from the NIH library and try to understand the format of the data we are going to work with.</li>
<li>Start preparing the data mapping: analyse your information in order to understand the mapping.</li>
<li>Try to imagine how the main function has to be structured.</li>
<li>Develop the test case, which is going to fail, since the function to be tested doesn't exist yet.</li>
<li>Study basic commands of the command prompt (python -m unittest).</li>
<li>Clone locally OC index so to run tests.</li>
<li>Revise self citation aspect in COCI article (<a href="Heibi2019_Article_SoftwareReviewCOCITheOpenCitat.pdf">Software review: COCI, the OpenCitations Index of Crossref
open DOI‑to‑DOI citations</a>)</li>
</ul>
<h2>Comments to unittest working frame</h2>
<p><strong>assertRaises():</strong> Checks whether a specific exception is raised when a given piece of code is executed. However, in our case, even if exceptions may occur, we don't need to test them, since our checks are plain assertions on data.</p>
<p><strong>if __name__ == '__main__':</strong> When a module is imported (for instance with <strong>import *</strong>), all of its top-level code is executed, including lines we don't really need, such as prints and tests (when we import, we are generally interested in functions and methods, not in tests and prints). For this reason, Python lets us mark some lines as belonging only to the file being run: code placed under if __name__ == '__main__': is executed only when that specific file is the one called by the Python interpreter, and is skipped when the file is imported. In this way unittest is executed only when the test file itself is run.
</p>
<p><strong>assertEqual(first, second, msg=None):</strong>
Test that first and second are equal. If the values do not compare equal, the test will fail. This is the most widely used method of the unittesting framework. </p>
<p>In OC we use unittest because it ships with Python by default, and it is enough for our needs.
In the development of NOCI it will be necessary to maintain the same structure as the other tests, developing (at least) one test function for each method of the class to be implemented.</p>
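<p>A minimal sketch of such a data-driven test, assuming a hypothetical helper <strong>citation_to_row</strong> (not part of OC index) that applies the same defaults used in the CROCI test; the class name and the sample values are also made up for illustration:</p>

```python
import unittest


def citation_to_row(cit):
    """Hypothetical helper: turn a 6-value citation tuple into a CSV-like row."""
    citing, cited, creation, timespan, journal_sc, author_sc = cit
    return {
        "citing": citing,
        "cited": cited,
        "creation": "" if creation is None else creation,
        "timespan": "" if timespan is None else timespan,
        "journal_sc": "no" if journal_sc is None else journal_sc,
        "author_sc": "no" if author_sc is None else author_sc,
    }


class CitationRowTest(unittest.TestCase):
    def setUp(self):
        # Runs before every test method: prepare the input and the expected output.
        self.citation = ("id1", "id2", "2019", None, None, None)
        self.expected = {
            "citing": "id1",
            "cited": "id2",
            "creation": "2019",
            "timespan": "",
            "journal_sc": "no",
            "author_sc": "no",
        }

    def test_citation_to_row(self):
        # assertEqual fails the test if the two values do not compare equal.
        self.assertEqual(citation_to_row(self.citation), self.expected)
```

<p>Saved in a file such as test_row.py (a hypothetical name), the test can be launched with python -m unittest test_row. Writing the test class first and the helper afterwards follows the test-driven order described in these notes.</p>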
<h2> Analysis of get_next_citation_data test</h2>
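<p><strong>sep, sep, sep:</strong> In the setUp(self) function, "%s" is the placeholder that the "%" operator fills with sep, the path separator of the operating system imported from os. The separator problem could also be solved with the path utilities of the standard library (os.path.join or pathlib).</p>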
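<p>As a small illustration (a sketch, not code taken from OC index), the same path can be built in three equivalent ways with the standard library; the variable names are arbitrary:</p>

```python
from os import sep
from os.path import join
from pathlib import Path

# Manual approach used in the test: "%s" is the placeholder and sep is the
# path separator of the operating system.
manual = "index%stest_data%scroci_dump%ssource.csv" % (sep, sep, sep)

# Equivalent alternatives that insert the separator automatically.
joined = join("index", "test_data", "croci_dump", "source.csv")
with_pathlib = str(Path("index") / "test_data" / "croci_dump" / "source.csv")
```

<p>All three strings are the same on a given operating system, so the choice is mostly a matter of readability.</p>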
<p><strong>setUp(self):</strong> In this function we store the paths of the files that contain our source data and the result we expect to obtain. The input file is passed to the constructor: <strong>self.input_file</strong> is given as input to the <strong>CrowdsourcedCitationSource</strong> class in order to read the initial data.</p>
<p>At this point, we can create the citations, adapting them to what we have in our source file.</p>
<p><strong>cit = ccs.get_next_citation_data()</strong> returns our 6-element tuple.</p>
<p>If creation, timespan, or both are missing, that is not a big problem, and the field stays empty (e.g. "creation": "" if creation is None else creation). For journal_sc and author_sc, instead, a missing value is replaced with "no", in order to stay compliant with the OC format.</p>
<p>One of the final steps is reading the file that contains the expected output in the correct format (with open(self.citations) as f:
<strong>old</strong> = list(DictReader(f))).</p>
<p>At this point, we compare old and new with the assertEqual method. We need to compare these two lists of dictionaries in order to check that the computed citations match the expected ones.</p>
<h2>Command Prompt</h2>
<p>We need to know how to use the command prompt because the tests have to be run from the command line. This makes it possible to execute tests one by one instead of building a single big test that calls all the individual parts. For example, if we clone a git repository locally, from outside the index folder we can call python -m unittest followed by the package path of the test (the location where the test is placed). In this way we run that specific test only.</p>
<h2>Open Citation Environment: NOCI</h2>
<p>The aim of this thesis project is to develop an extension of OC index. For this reason, I need to clone OC index locally, so as to be able to run the tests. We will add a new directory, a package that we can call "NOCI", which will be developed following the CROCI structure. We will need a Python file in order to interact with the input file containing NIH data (extend locally with a folder containing a Python file named "nih citation source", or something similar).</p>
<p>Initially, this won't work: the first thing we need to do once the structure is set up is develop the test case, so as to check that the input returns a specific output. The form of the output has to be defined (emulating CROCI and COCI). This process forces us to think about the format of the input and output of the function to be developed.</p>
<p>As we said, the output has to follow the OC six-value tuple format (see the CROCI dump citations in the test data). The only big difference is that the value that I'm going to insert in citing and cited won't be a DOI, but the NIH id, which should be a PubMed ID. The DOI may be (and probably is) present; however, we can't rely on this. In this dataset we will have the "selfcitation" info, but we don't know yet whether it is useful for our purposes.</p>
<p>Remember that "selfcitation" is set to "yes" if citing and cited either belong to the same journal or have at least one author in common.</p>
<h2>Support files</h2>
<p>Much of the information, if lacking, can be integrated in a later phase of the project, in which we can create support files in order to update the process.</p>
<p>A support file is a CSV file with a very simple structure. In COCI there is a glob.py file that creates those support files. Its aims:</p>
<ul>
<li>check whether the specified DOIs are valid</li>
<li>map identifiers to related data (id to publication date; journal ISSN to other data; article id to related data)</li>
</ul>
<p>With the support files we try to understand whether we can improve the quality of the information in the six-value tuple.</p>
<p>("id1", "id2", None, None, None, None) <br>
id_date.csv: <br>
"id1","2019" <br>
"id2","2017" <br>
-> <br>
("id1", "id2", "2019", <strong>"P2Y"</strong>, None, None)</p>
<p>Once we have obtained these data from somewhere else, we know that the first id is associated with the date "2019", and the second one with the date "2017". In this way we can extend the citation data with the timespan information, which we didn't have before.</p>
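<p>The enrichment step can be sketched as follows. This is only an illustration under stated assumptions, not the OC index implementation: the function names are hypothetical, the support file is read from a string, only year-only dates are handled, and the leading minus sign for a cited work newer than the citing one is an assumption of the sketch.</p>

```python
import csv
import io

# Content of a hypothetical id_date.csv support file (id, publication date).
ID_DATE_CSV = '"id1","2019"\n"id2","2017"\n'


def load_id_date(csv_text):
    """Read the support file into a dictionary mapping each id to its date."""
    return {row[0]: row[1] for row in csv.reader(io.StringIO(csv_text)) if row}


def year_timespan(citing_date, cited_date):
    """Return a duration in whole years; only handles year-only dates."""
    years = int(citing_date) - int(cited_date)
    sign = "-" if 0 > years else ""
    return "%sP%dY" % (sign, abs(years))


def enrich(cit, id_date):
    """Fill in creation and timespan when the support file knows both dates."""
    citing, cited, creation, timespan, journal_sc, author_sc = cit
    if creation is None:
        creation = id_date.get(citing)
    cited_date = id_date.get(cited)
    if timespan is None and creation is not None and cited_date is not None:
        timespan = year_timespan(creation, cited_date)
    return citing, cited, creation, timespan, journal_sc, author_sc


id_date = load_id_date(ID_DATE_CSV)
enriched = enrich(("id1", "id2", None, None, None, None), id_date)
```

<p>With the two rows above, the tuple gains the creation date "2019" and the timespan "P2Y", as in the example; an id that is missing from the support file simply leaves the corresponding fields as None.</p>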
<p>However, when a support file is lacking, the general process tries to reconstruct the information using an API (the mechanism works like a switch: if we don't have the support file, we try the API). All the tools we have now were developed for DOIs: now we have to do the same for PubMed IDs. We may need another glob for NOCI, in order to generate additional CSV files, or maybe we can make requests to already existing online APIs, so as to obtain local files.</p>
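<p>The switch between support file and API might look like the sketch below. The names <strong>lookup_date</strong> and <strong>fake_api</strong> are hypothetical, and the API is replaced by a local stand-in so that the example runs offline:</p>

```python
def lookup_date(identifier, support, api_lookup):
    """Return the date for an id: support file first, API fallback second."""
    date = support.get(identifier)
    if date is None:
        # Not in the precomputed file: ask the (slower) remote service.
        date = api_lookup(identifier)
    return date


# Stand-in for a real API client; it records which ids were requested.
api_calls = []


def fake_api(identifier):
    api_calls.append(identifier)
    return {"id3": "2020"}.get(identifier)


support = {"id1": "2019", "id2": "2017"}
from_file = lookup_date("id1", support, fake_api)
from_api = lookup_date("id3", support, fake_api)
unknown = lookup_date("id4", support, fake_api)
```

<p>Only the ids that are missing from the support data reach the API, which is the reason precomputed files help when many records have to be processed.</p>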
<h2>Working process overview</h2>
<ul>
<li>A citation source works with at least two pieces of information: citing and cited. Even when this is the only material we have, we keep it.</li>
<li>When some piece of information is missing, the overall process uses precomputed files that allow us to enrich the starting material.</li>
</ul>
<p>As a first step, we have to understand what we have and what is missing. Then, we have to understand how to develop the globs (global files). At this point we have to reason about the PubMed ID structure, which is not handled by the general OC process for now. All of this relates to the first phase. The following one is the alignment.</p>
<h2>Links and References</h2>
<ul>
<li>index.test.01_csvmanager</li>
<li><a href="https://github.com/opencitations/index">https://github.com/opencitations/index</a></li>
<li><a href="https://github.com/opencitations/index/blob/master/test_data/croci_dump/citations.csv">https://github.com/opencitations/index/blob/master/test_data/croci_dump/citations.csv</a></li>
<li><a href="https://github.com/opencitations/index/blob/master/coci/glob.py">https://github.com/opencitations/index/blob/master/coci/glob.py</a></li>
</ul>
</div>
</div>
</div>
<div class="card">
<div class="card-header" id="headingThree">
<h5 class="mb-0">
<button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseThree" aria-expanded="false" aria-controls="collapseThree">
Week #3 (02-5-21)
</button>
</h5>
</div>
<div id="collapseThree" class="collapse" aria-labelledby="headingThree" data-parent="#accordion">
<div class="card-body">
<h2>Mapping</h2>
<h3>Reconstructing Information</h3>
<p> We can work out the self-citation information later. It would be reasonable to follow the same process adopted for COCI: we have a preprocessing phase where each document is associated with some information extracted from a CSV file containing metadata (e.g. journal ISSN, ORCID). The algorithm is supposed to receive some files, which are then used to extract the self-citation information. The point is that the mapping represents a further phase of the process.
</p>
<h3>NIH Mapping file</h3>
<p>NIH should provide a CSV mapping file. We should then understand whether it makes more sense to use the API or existing files (we don't know the number of files to be managed, but there are probably many of them: the APIs could be overloaded, and if the same information is already stored in CSV files the process should be faster).</p>
<h2>NIH dataset management</h2>
<p>The format we found for CROCI citation data was "citing" - "cited", while for NOCI it is "citing" - "referenced". The second column name can't be changed, but must be handled as it comes. Pay attention: <strong>the source file must stay as it is</strong>: the differing naming convention is to be handled when building the 6-element tuple.</p>
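<p>A rough sketch of how the renaming could be handled while reading, assuming a CSV input with the two columns named above. The class name <strong>NIHCitationSketch</strong> and the sample ids are made up, and the real class would extend the OC index citation source rather than read from a string:</p>

```python
import csv
import io

# Two made-up rows in the "citing" / "referenced" layout.
SAMPLE = "citing,referenced\n1001,2002\n1001,3003\n"


class NIHCitationSketch:
    """Hypothetical stand-in for the NOCI citation source class."""

    def __init__(self, csv_text):
        self._rows = csv.DictReader(io.StringIO(csv_text))

    def get_next_citation_data(self):
        """Return the next 6-value tuple, or None when the input is exhausted."""
        for row in self._rows:
            citing = row.get("citing")
            # The source column is named "referenced"; it is renamed here,
            # in the tuple, so the source file itself stays untouched.
            cited = row.get("referenced")
            if citing and cited:
                return citing, cited, None, None, None, None
        return None


source = NIHCitationSketch(SAMPLE)
first = source.get_next_citation_data()
second = source.get_next_citation_data()
third = source.get_next_citation_data()
```

<p>Each call consumes one row, so the third call returns None once the two sample rows have been read. Dates and self-citation flags are left as None here, since in these notes they are expected to come from support files or an API at a later stage.</p>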
<h2> citing, cited, creation, timespan, journal_sc, author_sc = cit </h2>
<p>This assignment works because Python can assign a tuple of values to a tuple of variables: cit is already a tuple of values, while citing, cited, creation, timespan, journal_sc, author_sc are the names of the variables.</p>
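<p>A short illustration of tuple unpacking, with made-up values:</p>

```python
# A 6-value tuple shaped like the one returned by get_next_citation_data()
# (the values are made up for illustration).
cit = ("id1", "id2", "2019", "P2Y", None, None)

# Tuple unpacking: each name on the left receives the value in the same position.
citing, cited, creation, timespan, journal_sc, author_sc = cit

# The number of names must match the number of values, otherwise Python
# raises a ValueError.
try:
    citing_only, cited_only = cit
    unpack_error = None
except ValueError as error:
    unpack_error = str(error)
```

<p>This is why the test can rely on the position of each value: the tuple must always contain exactly six elements, in the same order.</p>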
<h2>c in the class CrowdsourcedCitationSource (crowdsourcedcitationsource.py)</h2>
<p>c is an instance of the Citation class. The step with c is useful only because it takes the four main values. For example, Citation computes the timespan by default, since the constructor of the class makes calculations on its own. We keep only the part we need.</p>
<h2>NOCI main function</h2>
<p>It is to be developed on the basis of the CROCI main function. It is the function that contains the NOCI class.</p>
<p>The first aim of this function is managing the new type of id. This should happen in cnc.py. We are supposed to manage the type of id so as to interact with the correct API. The most important aspect is that the conversion from the citation (source) to the 6-column storage format works. Once we accomplish this, we have to understand how to improve the quality and the amount of data to be stored in the tuple. Use the input files of CROCI to understand how the overall process works. <strong>get_next_citation_data</strong> must maintain the same name in the new class to be implemented. </p>
<h2>To do</h2>
<ul>
<li>Download Metadata File</li>
<li>Develop the main function</li>
<li>Correct the test function</li>
<li>Locally clone and launch OC index</li>
<li>Read Mail about mapping</li>
</ul>
</div>
</div>
</div>
<div class="card">
<div class="card-header" id="headingFour">
<h5 class="mb-0">
<button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseFour" aria-expanded="false" aria-controls="collapseFour">
Week #4 (02-11-21)
</button>
</h5>
</div>
<div id="collapseFour" class="collapse" aria-labelledby="headingFour" data-parent="#accordion">
<div class="card-body">
<h2>Normaliser</h2>
<p><strong>Normaliser:</strong> Takes an ID, assuming it is of a certain type. It then checks it and makes it uniform according to a single scheme (which is slightly customisable, but not much). With DOIs, it takes a string as input, normalises it by turning everything into lowercase, and checks that there are no null characters, which would break the id. So, in general, a normaliser takes an id as input and gives it back in the normalised format.</p>
<h2>Validator</h2>
<p><strong>Validator:</strong> When the normaliser doesn't return None, the following steps are handled by the validator, which is strictly dependent on the object it works on. Some ID formats follow a progressive logic, which makes it possible to validate them against a particular scheme. However, other ids, like DOIs, can be validated only through an API. </p>
<h2>PubMed Case</h2>
<p>Pattern for PubMed IDs: ^\d+$ --> ^[1-9]\d*$ (not ^[0-9]{1,7}$)</p>
<p> <strong>orcid_string = sub("[^X0-9]", "", id_string.upper())</strong>
--> It may happen that the URL of the ORCID is passed instead of the bare ORCID, for example. The same thing may happen with PubMed IDs, so it should be handled. In our case, since PubMed IDs are made of digits only, the idea is to remove everything that is not a digit. Moreover, we have to drop the possible sequence of 0s that we may find before the id. At this point the id is normalised (we won't need upper/lowercase normalisation, since we are handling digits only). For the validation we will need the PubMed API, in order to know whether the id exists or not.</p>
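<p>A possible sketch of this normalisation, written by analogy with the ORCID line above; the function name <strong>normalise_pmid</strong> is hypothetical and this is not the code of pmidmanager.py:</p>

```python
from re import match, sub


def normalise_pmid(id_string):
    """Return the PMID as a digit string without leading zeros, or None."""
    # Drop everything that is not a digit (URL parts, prefixes, spaces).
    digits = sub(r"[^0-9]", "", str(id_string))
    # Drop any zeros that precede the identifier.
    digits = digits.lstrip("0")
    # What is left must be a non-empty digit sequence starting with 1-9.
    return digits if match(r"^[1-9]\d*$", digits) else None
```

<p>For example, normalise_pmid("https://pubmed.ncbi.nlm.nih.gov/47/") and normalise_pmid("0000047") both give "47", while a string without digits gives None. One limitation of removing every non-digit character is that any other number in the input (for instance in a URL query string) would be merged into the id, so the input is assumed to contain the PMID as its only number.</p>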
<h2>to do</h2>
<ol>
<li>Fix the second function developed last time (id case is built on the base of doi case) </li>
<li>Add a test case also for this latter function</li>
<li>Look at the overall functioning: where do we need additional code to manage pubmeds?</li>
<li>Run tests</li>
<li>Before checking the overall functioning, check the two tests</li>
</ol>
</div>
</div>
</div>
<div class="card">
<div class="card-header" id="headingFive">
<h5 class="mb-0">
<button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseFive" aria-expanded="false" aria-controls="collapseFive">
Week #5 (02-18-21)
</button>
</h5>
</div>
<div id="collapseFive" class="collapse" aria-labelledby="headingFive" data-parent="#accordion">
<div class="card-body">
<h2>To Do</h2>
<ul>
<li>Run again all tests</li>
<li>Study the execution process</li>
<li>Study how to run the process dynamically (dynamic requests for CROCI, request material). It is useful in order to look at what happens step by step. This works for CROCI, but understanding how it works is useful in order to work out how to integrate NOCI into the process. For now cnc.py works with DOIs only, and it does not even check whether the id is a DOI or not. We need to make it work with PMIDs too, so the nature of the id will have to be specified.</li>
<li>Once we understand how the process works, we have to understand how to integrate PMIDs.</li>
<li>Check for information provided specifying "?format=pubmed" in the api request (e.g.: https://pubmed.ncbi.nlm.nih.gov/47/?format=pubmed). For now we are interested only in issn and publication date.</li>
<li>Check corrected functions</li>
</ul>
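<p>For the ISSN and publication date mentioned in the list above, a parsing sketch could look like the following. It assumes that the record is available as plain text in a tagged layout (one "TAG - value" pair per line, with IS for the ISSN and DP for the publication date, and indented continuation lines), already extracted from whatever the request returns. The sample record is made up and is not taken from a real PubMed entry; the function names are hypothetical.</p>

```python
import re

# Made-up record in the assumed tagged layout; values are NOT real PubMed data.
SAMPLE_RECORD = """PMID- 1234
DP  - 1975 Jun
TI  - An example title that continues
      on a second line.
IS  - 1234-5678 (Print)
IS  - 1234-5678 (Linking)
"""


def parse_tagged_record(text):
    """Collect the values of each tag into a dictionary of lists."""
    fields = {}
    tag = None
    for line in text.splitlines():
        found = re.match(r"^([A-Z]{2,4})\s*- (.*)$", line)
        if found:
            tag = found.group(1)
            fields.setdefault(tag, []).append(found.group(2).strip())
        elif tag is not None and line.startswith("      "):
            # Continuation line: append it to the last value of the current tag.
            fields[tag][-1] += " " + line.strip()
    return fields


def issn_and_year(text):
    """Return (list of distinct ISSNs, publication year) found in a record."""
    fields = parse_tagged_record(text)
    issns = []
    for value in fields.get("IS", []):
        issn = value.split(" ")[0]
        if issn not in issns:
            issns.append(issn)
    dates = fields.get("DP", [])
    year = dates[0][:4] if dates else None
    return issns, year


issns, year = issn_and_year(SAMPLE_RECORD)
```

<p>Keeping the parsing separate from the request makes it possible to test it on local sample text, in the same spirit as the support files.</p>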
</div>
</div>
</div>
<div class="card">
<div class="card-header" id="headingSix">
<h5 class="mb-0">
<button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseSix" aria-expanded="false" aria-controls="collapseSix">
Week #6
</button>
</h5>
</div>
<div id="collapseSix" class="collapse" aria-labelledby="headingSix" data-parent="#accordion">
<div class="card-body">
<h2>Dynamic requests for croci</h2>
<ul>
<li><strong>-s</strong> : the source I'm supposed to take data from, i.e. where the data source is. In our case it is an example, and we don't specifically need to have the correct source: the command is just intended to start the tool. The source is then specified in the provenance information. The system also says where the raw data (the data that allowed the creation of the citation) were taken from and who provided them.</li>
<li><strong>-a</strong> : identifies the responsible agent for these data.</li>
<li><strong>key for api-orcid</strong>: is personal and not strictly required. The command can be run even without it.</li>
</ul>
<div class="codebg">
<code>
<pre>
C:\Users\arimoretti\Documents\GitHub>python -m index.cnc -ib "http://dx.doi.org/" -b "https://w3id.org/oc/index/croci/" -p "C:\Users\arimoretti\Documents\GitHub\index\croci\crowdsourcedcitationsource.py" -c "CrowdsourcedCitationSource" -i "C:\Users\arimoretti\Documents\GitHub\index\test_data\croci_dump\source.csv" -l "C:\Users\arimoretti\Documents\GitHub\index\test_data\tmp_store\lookup_full.csv" -d "C:\Users\arimoretti\Documents" -px "050" -a "https://orcid.org/0000-0003-0530-4305" -s "https://doi.org/10.5281/zenodo.3832935" -sv "OpenCitations Index: CROCI" -v
WARNING: the citation between DOI '10.1002/asi.20755' and DOI '10.1109/wi.2006.164' has been already processed
WARNING: the citation between DOI '10.1002/asi.20755' and DOI '10.5210/fm.v11i11.1413' has been already processed
WARNING: the citation between DOI '10.1002/asi.20755' and DOI '10.5210/fm.v8i12.1108' has been already processed
WARNING: the citation between DOI '10.1002/asi.20755' and DOI '10.5210/fm.v11i9.1400' has been already processed
WARNING: the citation between DOI '10.1002/asi.20755' and DOI '10.1038/438900a' has been already processed
WARNING: the citation between DOI '10.1002/asi.20755' and DOI '10.2307/2529310' has been already processed
WARNING: the citation between DOI '10.1002/asi.20755' and DOI '10.2307/4486062' has been already processed
WARNING: the citation between DOI '10.1002/asi.20755' and DOI '10.5210/fm.v12i4.1763' has been already processed
WARNING: the citation between DOI '10.1002/asi.20755' and DOI '10.1145/503376.503456' has been already processed
WARNING: the citation between DOI '10.1002/asi.20755' and DOI '10.1142/9789812701527_0009' has been already processed
WARNING: the citation between DOI '10.1002/asi.20755' and DOI '10.1145/1501434.1501445' has been already processed
WARNING: the citation between DOI '10.1002/asi.20755' and DOI '10.1007/11839569_35' has been already processed
WARNING: the citation between DOI '10.1002/asi.20755' and DOI '10.2307/1562247' has been already processed
# Summary
Number of new citations added to the OpenCitations Index: 0
Number of citations already present in the OpenCitations Index: 13
Number of citations with invalid DOIs: 0
C:\Users\arimoretti\Documents\GitHub>
</pre>
</code>
</div>
<h2>makedir</h2>
<p>Run some trials on a local file (a Python file generated and executed locally). Something is going wrong and we have to understand what.</p>
</div>
</div>
</div>
<div class="card">
<div class="card-header" id="headingSeven">
<h5 class="mb-0">
<button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseSeven" aria-expanded="false" aria-controls="collapseSeven">
Week #7 (03-4-21)
</button>
</h5>
</div>
<div id="collapseSeven" class="collapse" aria-labelledby="headingSeven" data-parent="#accordion">
<div class="card-body">
<h2>To do:</h2>
<ul>
<li>Open issues on GitHub (UnicodeDecodeError, failing test 05_citationstorer)</li>
<li>Check where oci has to be extended in order to handle PMIDs</li>
<li>Correct the indentation error in nationalinstituteofhealthsource.py</li>
<li>Print the serialization of the two graphs (g1 and g2 in 05_citationstorer) in N-Triples format</li>
<li>Correct pmidmanager.py (the API returns an HTML page instead of a string)</li>
</ul>
</div>
</div>
</div>
<div class="card">
<div class="card-header" id="headingEight">
<h5 class="mb-0">
<button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseEight" aria-expanded="false" aria-controls="collapseEight">
Week #8 (03-11-21)
</button>
</h5>
</div>
<div id="collapseEight" class="collapse" aria-labelledby="headingEight" data-parent="#accordion">
<div class="card-body">
<h2>To do:</h2>
<ul>
<li>Correct the CSV file manually, so that it can be used when the API is not available</li>
<li>Check the isomorphism between the two graphs and try to understand why they differ</li>
<li>Extend oci so that it can handle PMIDs</li>
</ul>
</div>
</div>
</div>
<div class="card">
<div class="card-header" id="headingNine">
<h5 class="mb-0">
<button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseNine" aria-expanded="false" aria-controls="collapseNine">
Week #9 (03-18-21)
</button>
</h5>
</div>
<div id="collapseNine" class="collapse" aria-labelledby="headingNine" data-parent="#accordion">
<div class="card-body">
<p><a href="https://github.com/opencitations/index/blob/master/citation/oci.py">https://github.com/opencitations/index/blob/master/citation/oci.py</a></p>
<p><a href="https://github.com/opencitations/oci/blob/master/lookup.csv">https://github.com/opencitations/oci/blob/master/lookup.csv</a></p>
<h2>To do:</h2>
<ul>
<li>read <a href="https://doi.org/10.6084/m9.figshare.7127816">https://doi.org/10.6084/m9.figshare.7127816</a></li>
<li>fix the files and copy the notes into them</li>
<li>update the OpenCitations issue</li>
<li>check and/or fix: oci + cnc + 05_citationstorer + 04_oci</li>
<li>start analysing cnc to understand how it works and what is used in this part of the software, in order to figure out whether some parts of the code conflict with the PMID structure to be integrated</li>
</ul>
</div>
</div>
</div>
<div class="card">
<div class="card-header" id="headingTen">
<h5 class="mb-0">
<button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseTen" aria-expanded="false" aria-controls="collapseTen">
Week #10 (04-01-21)
</button>
</h5>
</div>
<div id="collapseTen" class="collapse" aria-labelledby="headingTen" data-parent="#accordion">
<div class="card-body">
<h2>To do:</h2>
<ul>
<li>In oci.py, <strong>def get_oci(self, doi_1, doi_2, prefix, id_type):</strong> add the id type, since the identifiers are no longer DOIs only</li>
<li>Run test 04_oci and add the missing id_type argument to the existing tests</li>
<li>Add a test for get_oci that also covers the pmid type (open question: which is the PMID prefix?)</li>
<li>Add the type information to the six-element (now seven-element) tuple returned by get_next_citation_data(self). Remember that this information is known in advance, since NationalInstituteHealthSource handles PMIDs only, while CrossrefCitationSource and CrowdSourcedCitationSource handle DOIs only</li>
</ul>
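<p>A minimal sketch of how an OCI builder could branch on the new id_type argument. The function below is a hypothetical standalone helper, not the get_oci method in oci.py: the DOI conversion is delegated to a caller-supplied function (a stand-in for the lookup-table encoding, which is not reproduced here), and the PMID branch assumes that the numeric identifier is appended to the prefix as it is.</p>

```python
def build_oci(citing, cited, prefix, id_type, encode_doi=None):
    """Return an OCI string for a citing/cited pair of identifiers.

    Hypothetical helper: 'encode_doi' stands in for the lookup-based
    conversion of a DOI into digits, which is not reproduced here.
    """
    if id_type == "doi":
        if encode_doi is None:
            raise ValueError("a DOI encoder is required for id_type 'doi'")
        citing_part, cited_part = encode_doi(citing), encode_doi(cited)
    elif id_type == "pmid":
        # Assumption: PMIDs are already numeric, so no lookup table is needed.
        if not (citing.isdigit() and cited.isdigit()):
            raise ValueError("PMIDs must contain digits only")
        citing_part, cited_part = citing, cited
    else:
        raise ValueError("unsupported id_type: " + id_type)
    # Assumption: the same prefix is prepended to both halves of the OCI.
    return "oci:{0}{1}-{0}{2}".format(prefix, citing_part, cited_part)
```

<p>For example, build_oci("123456", "654321", "0160", "pmid") gives an identifier made of the two prefixed PMIDs separated by a dash; the prefix value used here is only illustrative.</p>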
</div>
</div>
</div>
<div class="card">
<div class="card-header" id="headingEleven">
<h5 class="mb-0">
<button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseEleven" aria-expanded="false" aria-controls="collapseEleven">
Week #11 (04-15-21)
</button>
</h5>
</div>
<div id="collapseEleven" class="collapse" aria-labelledby="headingEleven" data-parent="#accordion">
<div class="card-body">
<h2>To do:</h2>
<ul>
<li>Specify the id type in the tests of CrossrefCitationSource, CrowdsourcedCitationSource and NationalInstituteOfHealthSource</li>
<li>Study class and instance variables and inheritance</li>
<li>Add a test for get_oci that also covers the pmid type (open question: which is the PMID prefix?)</li>
<li>Add the type information to the six-element (now seven-element) tuple returned by get_next_citation_data(self). Remember that this information is known in advance, since NationalInstituteHealthSource handles PMIDs only, while CrossrefCitationSource and CrowdSourcedCitationSource handle DOIs only</li>
</ul>
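<p>The difference between class and instance variables, and the way inheritance can carry the id type into the returned tuple, can be sketched as follows. The classes are simplified, hypothetical stand-ins for the real citation sources; only the position of id_type as the seventh element follows the notes.</p>

```python
class CitationSource:
    # Class variable: shared by every instance unless a subclass overrides it.
    id_type = None

    def __init__(self, rows):
        # Instance variable: each object keeps its own iterator over the rows.
        self._rows = iter(rows)

    def get_next_citation_data(self):
        """Return the next citation as a seven-element tuple, or None."""
        row = next(self._rows, None)
        if row is None:
            return None
        citing, cited = row
        # created, timespan, journal_sc and author_sc are placeholders here.
        return citing, cited, None, None, None, None, self.id_type


class DOISource(CitationSource):
    id_type = "doi"


class PMIDSource(CitationSource):
    id_type = "pmid"
```

<p>Each subclass only overrides the class variable, so the method inherited from the base class returns the right type without being rewritten.</p>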
</div>
</div>
</div>
<div class="card">
<div class="card-header" id="headingTwelve">
<h5 class="mb-0">
<button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseTwelve" aria-expanded="false" aria-controls="collapseTwelve">
Week #12 (04-22-21)
</button>
</h5>
</div>
<div id="collapseTwelve" class="collapse" aria-labelledby="headingTwelve" data-parent="#accordion">
<div class="card-body">
<h2>To do:</h2>
<ul>
<li><b>citationsource.py</b>: (1) <strong>class CSVFileCitationSource(DirCitationSource):</strong> when it is instantiated in the test, the id type of the CSV in question must also be passed as input. A CSV file must be created for PMIDs as well. (2) <strong>return citing, cited, created, timespan, journal_sc, author_sc, self.id_type</strong>: self.id_type must be passed correctly; the function returns it exactly as it was passed by the user.</li>
<li><b>06_citationsource.py</b>: check the file and transfer the notes.</li>
<li><b>cnc.py</b>: check the file and transfer the notes.</li>
<li><b>07_cnc.py</b>: (1) check the file and transfer the notes. (2) Create the required support files.</li>
<li><b>oci.py</b>: check the file and transfer the notes.</li>
<li><b>resourcefinder.py</b>: read the file in order to understand how to implement the version for NIH</li>
<li><b>nationalinstituteofhealthresourcefinder.py</b>: create the file, using crossrefresourcefinder.py as a model</li>
<li><b>03_resourcefinder.py</b>: update the file by adding the code that tests nationalinstituteofhealthresourcefinder.py</li>
<li>Update the code in accordance with the commits in OpenCitations</li>
<li>Send the internship request</li>
</ul>
</div>
</div>
</div>
<div class="card">
<div class="card-header" id="headingThirteen">
<h5 class="mb-0">
<button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseThirteen" aria-expanded="false" aria-controls="collapseThirteen">
Week #13 (05-13-21)
</button>
</h5>
</div>
<div id="collapseThirteen" class="collapse" aria-labelledby="headingThirteen" data-parent="#accordion">
<div class="card-body">
<h2>To do:</h2>
<ul>
<li></li>
<li></li>
<li></li>
<li></li>
<li></li>
<li></li>
</ul>
</div>
</div>
</div>
<div class="card">
<div class="card-header" id="headingFourteen">
<h5 class="mb-0">
<button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseFourteen" aria-expanded="false" aria-controls="collapseFourteen">
Week #14 (05-20-21)
</button>
</h5>
</div>
<div id="collapseFourteen" class="collapse" aria-labelledby="headingFourteen" data-parent="#accordion">
<div class="card-body">
<h2>nihresourcefinder.py</h2>
<p>The material from which a resource finder has to retrieve its information has this shape:</p>
<div class="codebg">
<code>
<pre>
PMID- 123456
OWN - NLM
STAT- MEDLINE
DCOM- 19750625
LR - 20160920
IS - 0030-0632 (Print)
IS - 0030-0632 (Linking)
VI - 78
IP - 4
DP - 1975 Apr
TI - [The laboratory in programs for enteric infection control].
PG - 318-22
FAU - Grados, O B
AU - Grados OB
LA - spa
PT - Journal Article
TT - El laboratorio en los programas de control de las infecciones entéricas.
PL - United States
TA - Bol Oficina Sanit Panam
JT - Boletin de la Oficina Sanitaria Panamericana. Pan American Sanitary Bureau
JID - 0414762
SB - IM
MH - Bacterial Infections/*prevention and control
MH - Bacteriological Techniques
MH - *Communicable Disease Control/methods
MH - Enteritis/*prevention and control
MH - Health Planning
MH - *Laboratories
MH - Peru
EDAT- 1975/04/01 00:00
MHDA- 1975/04/01 00:01
CRDT- 1975/04/01 00:00
PHST- 1975/04/01 00:00 [pubmed]
PHST- 1975/04/01 00:01 [medline]
PHST- 1975/04/01 00:00 [entrez]
PST - ppublish
SO - Bol Oficina Sanit Panam. 1975 Apr;78(4):318-22.
</pre>
</code>
</div>
<h3>Issues to handle</h3>
<ul>
<li>Find the documentation and understand the acronyms</li>
<li>Understand which field refers to the date of publication</li>
<li>Handle the double ISSN (print and linking)</li>
<li>The ORCID is missing, so skip this step and return an empty set in any case</li>
</ul>
<h3>Meeting: Various notes</h3>
<h4>NIHResourceFinder._get_orcid(self,pmid)</h4>
<p>This function always returns an empty set: the NIH does not provide any ORCID information for an article. For this reason the ORCID finder will be used, to try to retrieve the information elsewhere.</p>
<h4>NIHResourceFinder._get_issn</h4>
<p>If the test does not pass, double-check the indentation of the lines containing the ISSN information: the number of spaces may need to be changed.</p>
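<p>A possible way to make the ISSN extraction independent of the exact number of spaces is to parse each line with a regular expression. The sketch below is not the NIHResourceFinder code: the function names are hypothetical, and reading DP as the publication date is an assumption to be checked against the format documentation.</p>

```python
import re

# Tag, optional padding, a dash, then the value (e.g. "IS  - 0030-0632 (Print)").
FIELD = re.compile(r"^([A-Z]+)\s*-\s*(.*)$")
ISSN = re.compile(r"^(\d{4}-\d{3}[\dX])\s*\((\w+)\)")


def parse_record(text):
    """Collect the values of each tag in a MEDLINE-style record."""
    fields = {}
    for line in text.splitlines():
        match = FIELD.match(line.strip())
        if match:
            fields.setdefault(match.group(1), []).append(match.group(2).strip())
    return fields


def get_issns(fields):
    """Return the ISSNs keyed by their kind, e.g. 'Print' and 'Linking'."""
    issns = {}
    for value in fields.get("IS", []):
        match = ISSN.match(value)
        if match:
            issns[match.group(2)] = match.group(1)
    return issns


def get_pub_year(fields):
    """Return the year at the start of the DP field, or None if it is missing."""
    match = re.match(r"\d{4}", fields.get("DP", [""])[0])
    return match.group(0) if match else None
```

<p>Keeping both ISSNs in a dictionary leaves the choice between the print and the linking one to the caller. Continuation lines of long fields are not handled in this sketch.</p>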
</div>
</div>
<div class="card">
<div class="card-header" id="headingFifteen">
<h5 class="mb-0">
<button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseFifteen" aria-expanded="false" aria-controls="collapseFifteen">
Week #15 (05-25-21)
</button>
</h5>
</div>
<div id="collapseFifteen" class="collapse" aria-labelledby="headingFifteen" data-parent="#accordion">
<div class="card-body">
</div>
</div>
</div>
<div class="card">
<div class="card-header" id="headingSixteen">
<h5 class="mb-0">
<button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseSixteen" aria-expanded="false" aria-controls="collapseSixteen">
Week #16 (06-03-21)
</button>
</h5>
</div>
<div id="collapseSixteen" class="collapse" aria-labelledby="headingSixteen" data-parent="#accordion">
<div class="card-body">
<p>Libraries are loaded as packages inside a Python environment. This approach is used to refer to the various components. The basic idea is that a package has a reference identifier; the last name in the path is a class, a function, and so on. When index is published as an installable package (e.g. on PyPI), the imports have to be written this way.</p>
<p>Something has changed in the imports: either an error was introduced, or the tests should pass.</p>
<p>When you launch the tests, move out of the index directory, otherwise the package is not found. Note: when a single function is launched, index is not recognised as a package.</p>
<p>When index is defined as the source directory, it is not seen as a package but as a directory that contains the packages.</p>
<p>The error stack is reconstructed from the most abstract call at the top down to the most concrete one at the bottom.</p>
<p>I can fork index into my own repository and then make changes in my own space.</p>
</div>
</div>
</div>
<div class="card">
<div class="card-header" id="headingSeventeen">
<h5 class="mb-0">
<button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseSeventeen" aria-expanded="false" aria-controls="collapseSeventeen">
Week #17 (06-17-21)
</button>
</h5>
</div>
<div id="collapseSeventeen" class="collapse" aria-labelledby="headingSeventeen" data-parent="#accordion">
<div class="card-body">
<h2><strong>03_resourcefinder</strong></h2>
<p>Comment out all the tests and launch just one of them.</p>
<p>The path to follow in order to solve the current problem in the finder test should be: get orcid --> get_item --> csv manager --> doi normalization --> none</p>
<p>The only way to fix the code is to always launch the test from outside the package and add a print at each step, in order to find the point where the problem occurs. So far it seems to be related to the id type extension.</p>
<p>Use print(abspath(".")) to identify the directory from which the process is run.</p>
<p>The problem should not relate to any of the following points:</p>
<ul>
<li>self.orcid_path: it exists and it is found</li>
<li>The path of the test launch: print(abspath("."), "is the current location") prints what was expected</li>
<li>By defining a variable a_csv_manager = CSVManager(self.orcid_path), we obtain the expected data both with a_csv_manager.data and with a_csv_manager.get_value("doi:10.1108/jd-12-2013-0166")</li>
</ul>
<p>Instead, we get a problem with of_2.get_orcid("10.1108/jd-12-2013-0166"). Since of_2 is defined as of_2 = ORCIDResourceFinder(orcid=CSVManager(self.orcid_path), doi=CSVManager(self.doi_path), use_api_service=False), we should look for the error there.</p>
<p>Regex groups: group 0 is the whole matched string. The number of pairs of parentheses added to subdivide the pattern determines the number of groups defined in the matched string.</p>
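<p>A short illustration of how groups are numbered in Python's re module. The pattern and the input string are only examples chosen for this note.</p>

```python
import re

# Two pairs of parentheses define two groups; group 0 is always the whole match.
pattern = re.compile(r"(\d{4})-(\d{3}[\dX])")
match = pattern.search("IS - 0030-0632 (Print)")

whole = match.group(0)    # the entire matched string
first = match.group(1)    # text captured by the first pair of parentheses
second = match.group(2)   # text captured by the second pair of parentheses
print(whole, first, second)
```

<p>Adding a third pair of parentheses to the pattern would make match.group(3) available as well.</p>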
</div>
</div>
</div>
<div class="card">
<div class="card-header" id="headingEighteen">
<h5 class="mb-0">
<button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseEighteen" aria-expanded="false" aria-controls="collapseEighteen">
Week #18 (07-07-21)
</button>
</h5>
</div>
<div id="collapseEighteen" class="collapse" aria-labelledby="headingEighteen" data-parent="#accordion">
<div class="card-body">
<h2><strong>03_resourcefinder</strong></h2>
<p>Comment out all the tests and launch just one of them, in order to simplify the analysis: it is quite evident that all the tests that involve loading a file fail for the same reason.</p>
<p>During the meeting we analysed the files involved together and compared them with the official OC version in order to spot the file loading problem. No evident mistakes were found, and we agreed that the problem is probably related to some secondary issue, such as a typo or a wrong indentation.</p>
<p>Fixing this error is still the primary issue to be handled in order to proceed with the workflow.</p>
</div>
</div>
</div>
<div class="card">
<div class="card-header" id="headingNineteen">
<h5 class="mb-0">
<button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseNineteen" aria-expanded="false" aria-controls="collapseNineteen">
Week #19 (07-27-21)
</button>
</h5>
</div>
<div id="collapseNineteen" class="collapse" aria-labelledby="headingNineteen" data-parent="#accordion">
<div class="card-body">
<h2><strong>Fork Update</strong></h2>
<p>A new file was added to OC, which does not concern the work I have done so far. Furthermore, the structure of the index directory was modified: index now contains another folder, also named index, which holds all the subfolders. Only cnc was moved outside the index subfolder. The problem was spotted through a test and concerns parallel data ingestion. cnc.py also changed internally: a class (the handler) was developed in order to perform a specific check. Moreover, the bugs that emerged during the thesis work were fixed. Most of the tests should be the same as before, but the one for cnc was moved outside. Other small changes were made in citationstorer.</p>
<h2>To do</h2>
<ol>
<li>Update the fork with OC changes</li>
<li>Revise at least five meetings (6-10) and update the fork accordingly</li>
<li>Fix issn issue in resourcefinder tests</li>
<li>Fix NIH issue in resourcefinder tests</li>
</ol>
</div>
</div>
</div>
<div class="card">
<div class="card-header" id="headingTwenty">
<h5 class="mb-0">
<button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseTwenty" aria-expanded="false" aria-controls="collapseTwenty">
Week #20 (08-05-21)
</button>
</h5>
</div>
<div id="collapseTwenty" class="collapse" aria-labelledby="headingTwenty" data-parent="#accordion">
<div class="card-body">
<h2>To do:</h2>
<ol>
<li>Extend cnc and its test</li>
<li>Extend citation storer and source, so that the whole process also works for PMIDs</li>
</ol>
<h2>Notes</h2>
<ul>
<li>lookup_full should be used for DOIs only. For PubMed IDs it is not needed: we just put the numeric id after the prefix 0XXX0. So I probably won't need lookup_full at all to create my test data.</li>
<li><strong>Lambda function:</strong> it has no name: it allows an anonymous function to be defined on the fly, which can then be assigned to a variable. It is a complex topic, explained at <a href="https://realpython.com/python-lambda/">https://realpython.com/python-lambda/</a>. In short, for implementation reasons we need different functions when the process runs in parallel, but the behaviour of the two versions (parallel and standard) should be the same. So the functions needed by the process are specified, by means of lambda functions, at the moment cnc is run.</li>
<li><strong>cnc testing:</strong> the test is to be run to make sure that cnc works as expected. In addition, cnc is to be executed for the new index (NOCI), even with a limited amount of data, to check that it works there too.</li>
<li>The source for PMIDs is <a href="https://doi.org/10.35092/yhjc.c.4586573.v16">https://doi.org/10.35092/yhjc.c.4586573.v16</a>, and it works properly.</li>
</ul>
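<p>The idea of choosing the functions at call time, while keeping the same behaviour in the standard and in the parallel version, can be sketched as follows. This is an illustration only and is unrelated to the actual cnc code: all the names are hypothetical, and a thread pool stands in for whatever parallel mechanism is really used.</p>

```python
from concurrent.futures import ThreadPoolExecutor


def process_all(items, handle, mapper):
    """Run 'handle' on every item using the mapping strategy in 'mapper'."""
    return list(mapper(handle, items))


# A lambda is an anonymous function; here each one is bound to a name.
normalise = lambda pmid: pmid.strip()
sequential = lambda func, items: map(func, items)


def threaded(func, items):
    # Same behaviour as 'sequential', but the calls run in a thread pool.
    with ThreadPoolExecutor(max_workers=2) as pool:
        return list(pool.map(func, items))


# The strategy is chosen when the process is launched; the result is the same.
same = (process_all([" 1 ", "2 "], normalise, sequential)
        == process_all([" 1 ", "2 "], normalise, threaded))
```

<p>Only the function passed as mapper changes between the two calls, which is what makes the two versions comparable in a test.</p>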
</div>
</div>
</div>
<div class="card">
<div class="card-header" id="headingTwenty-one">
<h5 class="mb-0">
<button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseTwenty-one" aria-expanded="false" aria-controls="collapseTwenty-one">
Week #21 (08-12-21)
</button>
</h5>
</div>
<div id="collapseTwenty-one" class="collapse" aria-labelledby="headingTwenty-one" data-parent="#accordion">
<div class="card-body">
<h2>To do:</h2>
<ol>
<li>Create a glob file for noci</li>
</ol>
<h2>Notes</h2>
<ul>
<li>The glob file creates the support files to be used in the OC process</li>
<li><strong>Files to download:</strong> NIH files in iCite Database, available on <a href="https://nih.figshare.com/articles/dataset/iCite_Database_Snapshot_2021-07/15148737?file=29101575">figshare</a></li>
</ul>
</div>
</div>
</div>
<div class="card">
<div class="card-header" id="headingTwenty-two">
<h5 class="mb-0">
<button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseTwenty-two" aria-expanded="false" aria-controls="collapseTwenty-two">
Week #22 (09-09-21)
</button>
</h5>
</div>
<div id="collapseTwenty-two" class="collapse" aria-labelledby="headingTwenty-two" data-parent="#accordion">
<div class="card-body">
<h2>To do:</h2>
<ol>
<li>Understand how to open tar.gz files on Windows and how to fix download errors.</li>
<li>The support cache file mapping journal short names to ISSNs is a good idea: keep it and improve it.</li>
<li>Process all the data as new, without relying on COCI for the citation data that have already been processed in other indexes.</li>
<li>Find the pandas parameter that avoids loading into memory all the information in the CSV files to be processed. In particular, understand how to handle very large files in reasonably small chunks, which can then be processed easily: <a href="https://stackoverflow.com/questions/25962114/how-do-i-read-a-large-csv-file-with-pandas">https://stackoverflow.com/questions/25962114/how-do-i-read-a-large-csv-file-with-pandas</a>.</li>
<li>Add an empty date for the PMIDs that do not have one.</li>
<li>Make a second iteration over the rows of the input files in order to process the cited PMIDs: store as valid the ones that can be validated through the pmidmanager and discard the ones that turn out to be invalid. Call is_valid, and set_valid in case of a positive result.</li>
</ol>
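<p>One option for the large CSV files is the chunksize parameter of pandas.read_csv, which makes the call return an iterator of smaller DataFrames. The sketch below uses an in-memory string instead of a real file, and the column names are invented for the example.</p>

```python
import io

import pandas as pd

# Stand-in for a very large CSV file; the column names are illustrative.
csv_data = io.StringIO("pmid,references\n1,2 3\n2,\n3,1\n4,2\n")

rows_seen = 0
# With chunksize set, read_csv yields DataFrames of at most that many rows
# instead of loading the whole file into memory at once.
for chunk in pd.read_csv(csv_data, chunksize=2, dtype=str):
    chunk = chunk.fillna("")  # empty cells become empty strings, not NaN
    rows_seen += len(chunk)
```

<p>The chunk size is a trade-off between memory use and the number of iterations, so a suitable value has to be found for the actual files.</p>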
<h2>Notes</h2>
<ul>
<li>The presence of a mapping between PMIDs and DOIs is good, even if we don't directly need it now: it will be very useful in the following steps. All the information in the iCite metadata will be useful in the future, but some important information is missing. The worst aspect is that the journals are identified by their short names.</li>
<li>A table mapping ORCID - DOIs was already implemented: do not create a new version of it.</li>
<li>Next meeting: Wednesday 15th, 11h30.</li>
</ul>
</div>
</div>
</div>
<div class="card">
<div class="card-header" id="headingTwenty-three">
<h5 class="mb-0">
<button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseTwenty-three" aria-expanded="false" aria-controls="collapseTwenty-three">
Week #23 (09-14-21)
</button>
</h5>
</div>
<div id="collapseTwenty-three" class="collapse" aria-labelledby="headingTwenty-three" data-parent="#accordion">
<div class="card-body">
<h2>To do:</h2>
<h3>Multiple cnc tests</h3>
<ol>
<li>Call cnc on a citation source dataset (which must exist), but without any of the support files: pass the names of files that do not exist yet, so that cnc creates them.</li>
<li>Repeat the call adding the "na" parameter, which means that the API is not called, and using the files just created. The result should be the same.</li>
<li>Call cnc again, this time using the API and the files already generated.</li>
<li>Call cnc one last time, using "na" and empty files: this should be the only one of the four attempts to give a different result.</li>
</ol>
<h3>Possible empty spaces in referenced pmids (in noci glob) </h3>
<p>First of all, any spaces at the beginning and at the end of the string must be removed; then we have to make sure that no repeated spaces are used as separators between one PMID and the next. This second step is to be implemented with a regex.</p>
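<p>The two steps can be sketched with str.strip and re.sub. The function name is hypothetical, and the input is assumed to be a single string of space-separated PMIDs.</p>

```python
import re


def split_pmids(raw):
    """Return the PMIDs in a space-separated string, ignoring extra spaces."""
    cleaned = raw.strip()                    # drop leading and trailing spaces
    cleaned = re.sub(r"\s+", " ", cleaned)   # collapse repeated separators
    return cleaned.split(" ") if cleaned else []
```

<p>A field that contains only spaces gives an empty list rather than a list with one empty string.</p>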
<h3>Repeated citations issue</h3>
<p>Different indexes can share repeated citations. The problem is to be fixed with <strong>URLs of meta ids</strong>.</p>
<ol>
<li>Starting from the material resulting from cnc, find a way to create the metaids to associate with the pmids (it could be a csv)</li>
<li>Create a mapping table between pmids and metaids (it could be another csv)</li>
<li>Exploit a mapping table between pmids and dois, complete the previous mapping (pmid - metaid - doi, maybe another csv).</li>
<li>Use blazegraph to create a triplestore with only one property (here we need an RDF file), which could be dcterms:relation. The data in the resulting N-Triples file should have this format: <strong>metaid - dcterms:relation - [format generated by cnc]</strong>. Look at the ttl formats.</li>
<li>The generated file should be uploaded to the triplestore.</li>
</ol>
<p>The result is to be queried with SPARQL. Study triplestores and rdflib.</p>
<p>Remember that the initial data to be processed must be the cnc output.</p>
<h3>Next meeting</h3>
<p>30th of September, Zamboni 32</p>
</div>
</div>
</div>
<div class="card">
<div class="card-header" id="headingTwenty-four">
<h5 class="mb-0">
<button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseTwenty-four" aria-expanded="false" aria-controls="collapseTwenty-four">
Week #24 (09-30-21)
</button>
</h5>
</div>
<div id="collapseTwenty-four" class="collapse" aria-labelledby="headingTwenty-four" data-parent="#accordion">
<div class="card-body">
<h2>To do (how to fix last week's work):</h2>
<ol>
<li>pandas: convert NaN to empty strings (df.fillna('', inplace=True))</li>
<li>The ttl serialization does not produce the lines in the same order every time. Since the result is a sequence of text lines, the lines must be sorted lexicographically (in both files) and the empty lines removed. At this point each line should contain a triple or a quad, and the two files should differ only in the creation date. Only the files stored in the "data" folder should be compared, to check whether the software works properly.</li>
</ol>
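<p>The comparison described above can be sketched with plain string operations, assuming that each statement sits on a single line (as in N-Triples). The function names are hypothetical.</p>

```python
def normalised_lines(text):
    """Return the non-empty lines of a serialisation in lexicographic order."""
    return sorted(line.strip() for line in text.splitlines() if line.strip())


def differing_lines(text_a, text_b):
    """Return the lines that appear in only one of the two serialisations."""
    lines_a = set(normalised_lines(text_a))
    lines_b = set(normalised_lines(text_b))
    return sorted(lines_a ^ lines_b)
```

<p>If the software works properly, the lines returned by differing_lines should be only the ones carrying the creation date.</p>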
<h2>To do (metaid mapping):</h2>
<ul>
<li>In this first phase it is also possible to associate the identifiers only, without the full URLs. For the further steps, the metaid prefix is "060". The formats will be: <strong>https://pubmed.ncbi.nlm.nih.gov/ + pmid</strong>, <strong>https://w3id.org/oc/meta/br/060 + metaid</strong></li>
<li>020 and 0160 are the prefixes that tell oci where an id comes from: for example NIH, or others.</li>
<li>Metaids will be associated with the citing and cited ids processed in the citations (cnc output). A single iteration over each CSV should be enough: keep track of the information already processed.</li>
<li><strong>input:</strong> 1) the cnc output containing the citations, 2) the mapping files, 3) a CSV in which to store the metaid mappings. If it already exists, it should be updated. The id type information should be included somehow. In any case the software must be able to handle both DOIs and PMIDs. Consider that cnc has the id_type information.</li>
<li>The combined doi-pmid-metaid mapping should exploit the information provided in the NIH metadata.</li>
<li>The RDF must be generated at the end, not updated line by line.</li>
</ul>
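<p>A minimal sketch of the single-pass metaid assignment, using the two URL formats given above. Numbering the metaids sequentially from a counter is an assumption, and the function names are hypothetical.</p>

```python
PMID_BASE = "https://pubmed.ncbi.nlm.nih.gov/"
META_BASE = "https://w3id.org/oc/meta/br/060"


def assign_metaids(citations, mapping=None, last_metaid=0):
    """Give a metaid to every identifier found in (citing, cited) pairs.

    'mapping' holds identifiers that already have a metaid, so that each
    identifier is processed only once; 'last_metaid' is the last number used.
    """
    mapping = dict(mapping or {})
    for pair in citations:
        for pmid in pair:
            if pmid not in mapping:
                last_metaid += 1
                mapping[pmid] = last_metaid
    return mapping, last_metaid


def as_urls(mapping):
    """Turn the numeric mapping into (PMID URL, metaid URL) rows."""
    return [(PMID_BASE + pmid, META_BASE + str(number))
            for pmid, number in mapping.items()]
```

<p>Passing an existing mapping as input covers the case in which the CSV with the metaid mappings already exists and only has to be updated.</p>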
</div>
</div>
</div>
<div class="card">
<div class="card-header" id="headingTwenty-five">
<h5 class="mb-0">
<button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseTwenty-five" aria-expanded="false" aria-controls="collapseTwenty-five">
Week #25 (10-07-21)
</button>
</h5>
</div>
<div id="collapseTwenty-five" class="collapse" aria-labelledby="headingTwenty-five" data-parent="#accordion">
<div class="card-body">
<h2>To do:</h2>
<ol>
<li>Find a way to keep track of the fact that an identifier has been changed.</li>
<li>A metaid cannot be reassigned: it must be marked as deleted.</li>
<li>Provenance: every entity is accompanied by provenance information, which also keeps track of the changes made to that datum over time. Deletion is the removal of the information about the entity; the entity itself continues to exist. oc_ocdm (a Python library, installable with pip) already handles all of this.</li>
<li>The misalignment between what we have and what is in Meta is somewhat of a problem. Look at Meta and understand how both Meta and oc_ocdm work. The point is that the id mappings found through this system must trigger something inside Meta. The important thing is to have a clear idea of the problem: the consequence of an addition must be a change in Meta. One option is to create a mapping file in which the mapping information is added, so that oc_ocdm can be used to manage everything: an output CSV with the metaid mapping. Use the merge method of oc_ocdm. Another software that could be useful is ocgraph enricher: by means of SPARQL queries it looks for entities that share the same id and merges them. If a mapping turns out to be wrong, it is possible to go back to the previous snapshot through Meta, which keeps track of the provenance information. The last assigned metaid is not necessarily in my file: make sure that a shared file containing the last metaid is passed as input. The name of this file must be a mandatory input. Look at oc_ocdm more out of curiosity than anything else; for Meta, look at the code run by Peroni. Meta is also in the OC GitHub: focus on its logic.</li>
</ol>
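<p>A minimal sketch of how the shared file holding the last assigned metaid could be read, updated and made a mandatory input. The option name and the one-number file layout are assumptions made for this example.</p>

```python
import argparse
import os


def read_last_metaid(path):
    """Return the last metaid stored in 'path', or 0 if the file is missing."""
    if not os.path.exists(path):
        return 0
    with open(path, encoding="utf8") as handle:
        content = handle.read().strip()
    return int(content) if content else 0


def write_last_metaid(path, value):
    """Store the last assigned metaid so that it is never reused."""
    with open(path, "w", encoding="utf8") as handle:
        handle.write(str(value))


def build_parser():
    parser = argparse.ArgumentParser(description="metaid mapping (sketch)")
    # Hypothetical option name; 'required=True' makes the file mandatory.
    parser.add_argument("-m", "--metaid-file", required=True,
                        help="shared file holding the last assigned metaid")
    return parser
```

<p>Since the counter only grows, a metaid marked as deleted is never handed out again.</p>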
</div>
</div>
</div>
<div class="card">
<div class="card-header" id="headingTwenty-six">
<h5 class="mb-0">
<button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseTwenty-six" aria-expanded="false" aria-controls="collapseTwenty-six">
Week #26 (10-14-21)
</button>
</h5>
</div>
<div id="collapseTwenty-six" class="collapse" aria-labelledby="headingTwenty-six" data-parent="#accordion">
<div class="card-body">
<h2>To do:</h2>
<ul>
<li>Complete and correct the code of <strong>mapping.py</strong>, so that two new output files are created: one containing the triples to be removed and the other containing the triples to be added. Both files will be in N-Triples format. In this way, in the triplestore, we will first remove the invalidated triples and then load the new ones.</li>
<li><strong>update.py</strong>: for now it only handles data addition (insertion), since deleting a datum has never been required so far. However, data can be deleted through a query, and the triples to be deleted should be very few. These triples will be stored in a file passed as a new argument (something like -d): instead of loading them, the new query should delete the specified triples. In this way we will obtain an unambiguous metaid mapping for all the publications included in OC. The issue is related to the indexes' API, which can query all of them at once: with this mapping we will be able to recognise the same citation in two different indexes.</li>
<li><strong>Noci recap:</strong> cnc.py is ready, so the index could already be created. mapping.py will be used to create the id-metaid mapping in the triplestore, in order to solve the disambiguation problem in the unifying API. At this point we just need to develop the API to query NOCI and the extensions to query the unifying API (which covers several indexes). NOCI's API will be developed following the same structure used for COCI. When all of this is done, only the very last step of the thesis project will remain.</li>
<li>Read the article about <strong>Ramose</strong> in order to understand how COCI's APIs work. Ramose is the OC software used for all the APIs. It is a tool that works as a proxy between SPARQL and a normal HTTP request: it is a REST API. There is a file with the textual description of how this aspect is handled. For the next meeting: read the article and the GitHub repo, which also contains some examples of how it works. We need a configuration file for NOCI (noci_v1.hf) that allows the creation of the APIs. To do that, it is necessary to have a triplestore somewhere, with the data in it. Create a local blazegraph: a small amount of data is enough. When properly configured, Ramose will allow us to test that everything works correctly. Have a look at the article and at the configurations on GitHub.</li>
<li>Have a look at all the APIs defined for COCI: third link (coci_v1.hf)</li>
</ul>
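<p>The deletion step could be sketched by wrapping the content of the N-Triples file in a SPARQL 1.1 update request. The function below is a hypothetical helper, not the code of update.py; it assumes that the file contains no blank nodes, which DELETE DATA does not accept.</p>

```python
def build_update(triples_text, operation):
    """Wrap N-Triples statements in a SPARQL 1.1 update request.

    'operation' is either "DELETE" or "INSERT"; blank lines are skipped.
    """
    if operation not in ("DELETE", "INSERT"):
        raise ValueError("operation must be DELETE or INSERT")
    statements = [line.strip() for line in triples_text.splitlines()
                  if line.strip()]
    if not statements:
        return ""
    return operation + " DATA {\n" + "\n".join(statements) + "\n}"
```

<p>Running the DELETE request built from the first file before the INSERT request built from the second one follows the order described above: first remove the invalidated triples, then load the new ones.</p>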
</div>
</div>
</div>
<div class="card">
<div class="card-header" id="headingTwenty-seven">
<h5 class="mb-0">
<button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseTwenty-seven" aria-expanded="false" aria-controls="collapseTwenty-seven">
Week #27 (10-20-21)
</button>
</h5>
</div>
<div id="collapseTwenty-seven" class="collapse" aria-labelledby="headingTwenty-seven" data-parent="#accordion">
<div class="card-body">
<h2>How to use RAMOSE (test)</h2>
<ol>
<li>Fix the mapping.py code, so as to obtain your own triplestore.</li>
<li>Set up your Blazegraph with your example data and query it. After that, it will be possible to understand where to modify the COCI code in order to develop the NOCI version. In the meantime:</li>
<li>Get acquainted with Ramose. A simplified version of the test was developed for testing purposes and to let new users get acquainted with Ramose. The file is named “m0”; it allows Ramose to be run in the shell and also to start a web server for making requests from the browser. m0 queries Wikidata, so it can be tried without a local triplestore. This example is in the new updated and fixed version of Ramose. Download the most recent version and run: python ramose.py -s test/test_m0.hf -w localhost:8080.</li>
<li>Use the existing files. Until now, m2 has been the simplest version; however, it needs some extra functions in external Python files. It is so simple because it implements one operation only. Another option is to use the COCI API and its configuration file (the one previously linked) and run Ramose with: python ramose.py -s ../api/coci_v1.hf -w localhost:8080. Note that “-s” specifies the specification (configuration) file of the API, while “-w” specifies the host and port of the test web server. Serving on localhost:8080 lets you use Ramose from your browser: after copying and pasting an example call after the URL, the triplestore is queried and the request is handled locally. Pressing Enter without any further specification returns the output in the default format, JSON. When we want to change something, we edit the configuration file. The aim is to understand how Ramose works with an existing triplestore: the endpoint has to be specified at the beginning of the configuration file.</li>
<li>Once we have the NOCI RDF data on a local triplestore, we replace the OC endpoint with that of our local Blazegraph. The various operations must be modified because for now they only accept DOIs as input. The input shape has to change, since it is currently a regex made for matching DOIs rather than PMIDs. Since the pre- and post-processing phases include some encoding/decoding operations specifically meant for DOIs, in the case of PMIDs we can probably skip both steps. The test will then be done by specifying another endpoint (specific to NOCI, to be tested on localhost:8080). To do that, before running Ramose, it is necessary to replace the base in the COCI file (http://.. etc) with: #base http://localhost:8080. All the execution links are clickable and lead to the specific software. </li>