
<!doctype html>
<html lang="en">
  <head>
    <!-- Required meta tags -->
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

    <!-- Bootstrap CSS -->
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css" integrity="sha384-Gn5384xqQ1aoWXA+058RXPxPg6fy4IWvTNh0E263XmFcJlSAwiGgFAW/dAiS6JXm" crossorigin="anonymous">
<link rel="stylesheet" href="style.css" type="text/css">

<script src="https://code.jquery.com/jquery-3.2.1.slim.min.js" integrity="sha384-KJ3o2DKtIkvYIK3UENzmM7KCkRr/rE9/Qpg6aAZGJwFDMVNA/GpGFF93hXpG5KkN" crossorigin="anonymous"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.12.9/umd/popper.min.js" integrity="sha384-ApNbgh9B+Y1QKtv3Rn7W3mgPxhU9K/ScQsAP7hUibX39j7fakFPskvXusvfa0b4Q" crossorigin="anonymous"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/js/bootstrap.min.js" integrity="sha384-JZR6Spejh4U02d8jOt6vLEHfe/JQGiRRSQQxSfFWpi1MquVdAyjUar5+76PVCmYl" crossorigin="anonymous"></script>

    <title>Thesis Repository</title>
  </head>


  <body>
                  <nav class="navbar navbar-expand-lg navbar-light navbar-custom ">
  <button class="navbar-toggler" type="button" data-toggle="collapse" data-target="#navbarSupportedContent" aria-controls="navbarSupportedContent" aria-expanded="false" aria-label="Toggle navigation">
    <span class="navbar-toggler-icon"></span>
  </button>

  <div class="collapse navbar-collapse" id="navbarSupportedContent">
    <ul class="navbar-nav mr-auto">
      <li class="nav-item active">
        <a class="nav-link" href="index.html">Home <span class="sr-only">(current)</span></a>
      </li>
      <li class="nav-item">
        <a class="nav-link" href="diary.html">Diary</a>
      </li>
      <li class="nav-item dropdown">
        <a class="nav-link dropdown-toggle" href="phases.html" id="navbarDropdown" role="button" data-toggle="dropdown" aria-haspopup="true" aria-expanded="false">
          Phases
        </a>
        <div class="dropdown-menu" aria-labelledby="navbarDropdown">
          <a class="dropdown-item" href="phase1.html">Phase 1</a>
          <a class="dropdown-item" href="phase2.html">Phase 2</a>
          <a class="dropdown-item" href="phase3.html">Phase 3</a>
        </div>
      </li>
      <li class="nav-item">
        <a class="nav-link" href="meetings.html">Meetings</a>
      </li>
    </ul>
  </div>
</nav>


    <h1>Meetings</h1>

<div id="accordion">

  <div class="card">
    <div class="card-header" id="headingOne">
      <h5 class="mb-0">
        <button class="btn btn-link" data-toggle="collapse" data-target="#collapseOne" aria-expanded="true" aria-controls="collapseOne">
          Week #1 (01-14-21)
        </button>
      </h5>
    </div>

    <div id="collapseOne" class="collapse show" aria-labelledby="headingOne" data-parent="#accordion">
      <div class="card-body">
        <h2>Overview</h2>
        <p>The first part of the work involves doing something similar to what has been developed for CROCI: we need a class that extends another class, implements a single method, and expects to receive a CSV file. This CSV file defines some citations, and its form depends on the data provider (so we expect the National Institutes of Health to use a specific data format, probably CSV or JSON).</p>
        <h2>get_next_citation_data</h2>
        <div class="container codebg">
        <pre><code>
 def get_next_citation_data(self):
    row = self._get_next_in_file()

    while row is not None:
        citing = self.doi.normalise(row.get("citing_id"))
        cited = self.doi.normalise(row.get("cited_id"))

        if citing is not None and cited is not None:
            created = row.get("citing_publication_date")
            if not created:
                created = None

            cited_pub_date = row.get("cited_publication_date")
            if not cited_pub_date:
                timespan = None
            else:
                c = Citation(None, None, created, None, cited_pub_date, None, None, None, None, "", None, None, None, None, None)
                timespan = c.duration

            self.update_status_file()
            return citing, cited, created, timespan, None, None

        self.update_status_file()
        row = self._get_next_in_file()

    remove(self.status_file)
  </code></pre></div>
        <p>This function is the last step of <a href="https://github.com/opencitations/index/blob/master/croci/crowdsourcedcitationsource.py">index/croci/crowdsourcedcitationsource.py</a>.
 It was developed to handle the particular CSV format that we expect for CROCI, and it gathers the information to return. The process expects a tuple of 6 values derived from the citation source, in this order:</p>
 <ol>
   <li>citing</li>
   <li>cited</li>
   <li>created</li>
   <li>timespan</li>
   <li>None</li>
   <li>None</li>
 </ol>
<p>The last two values are meant to carry additional information about self-citation: whether the citing and cited publications belong to the same journal, and whether they share an author. In both cases the value is None because at the moment we are not interested in this information.
</p>
<h2>The Project</h2>
<p>We have to implement a class that extends the CSV citation source class and implements a get_next_citation_data method tailored to the NIH dataset (and its file formats).</p>
<p>In the aforementioned function we have <em>c</em>: this is the Python class managing the interaction with the data source. Each class relates to a specific format (or "typology"), depending on the index to be created.</p>
<p>We are going to work in an existing environment where introducing a new data source requires creating a new class, which extracts from the new source the same set of information that the process is designed to handle. The system is highly "parameterized": it always expects to receive a 6-value tuple and executes a specific set of actions depending on the given values.
In particular, the plug-in to be developed should extract from the NIH file format the same information required by the system, and place it in the 6-element tuple. This is what COCI also does, but with the JSON format.
</p>
<p>The two primary identifiers identify the citing and the cited entity, and they are not always DOIs. In this case, we may have a PubMed ID or a PubMed Central ID.
One of the problems to handle is that some of these citations may already be present in <a href="https://opencitations.net/">OpenCitations</a>. For example, COCI contains only DOI-to-DOI citations; in the NIH dataset we are going to import, some articles probably have a DOI, but it is exposed in a different way.
However, if an article has a DOI, it should be specified among the article data; alternatively, it can be recovered through an external mapping dataset.</p>
<p>Another significant problem is that the 6-value tuple neither expects nor stores information about the DOI, since it is structured to manage only the 6 aforementioned aspects.</p>
<p>We will need to add the mapping information because, once the dataset has been imported into OpenCitations, the API used for the unification process needs to manage the identifiers. This issue has not been handled so far, since the data imported until now did not require it.</p>
<h2>Phases Of The Project</h2>
<ol>
  <li>Develop software to read the data source. We need the aforementioned class in order to manage the new data source. In this phase the index is not supposed to disambiguate: it just has to work in its own field.</li>
  <li>Develop a sub-index of indexes for the alignment phase, so as to recognise and map resources referring to the same citation despite having different URLs. This plug-in is meant to manage all possible data sources, so that if we need to import a new dataset with a different identifier in the future, we can reuse the same tool.</li>
  <li>API</li>
</ol>
<h2>Test Driven Development</h2>
<p>The ninth file in the test folder, <a href="https://github.com/opencitations/index/blob/master/test/09_croci.py">09_croci.py</a>, provides a good example of how this kind of test works. The procedure starts by creating a new test class, which defines two methods: a set-up method and a test method.</p>
<div class="container codebg">
<pre><code>
import unittest
from index.coci.glob import process
from os import sep, makedirs
from os.path import exists
from shutil import rmtree
from index.storer.csvmanager import CSVManager
from index.croci.crowdsourcedcitationsource import CrowdsourcedCitationSource
from csv import DictReader


class CROCITest(unittest.TestCase):

    def setUp(self):
        self.input_file = "index%stest_data%scroci_dump%ssource.csv" % (sep, sep, sep)
        self.citations = "index%stest_data%scroci_dump%scitations.csv" % (sep, sep, sep)

    def test_citation_source(self):
        ccs = CrowdsourcedCitationSource(self.input_file)
        new = []
        cit = ccs.get_next_citation_data()
        while cit is not None:
            citing, cited, creation, timespan, journal_sc, author_sc = cit
            new.append({
                "citing": citing,
                "cited": cited,
                "creation": "" if creation is None else creation,
                "timespan": "" if timespan is None else timespan,
                "journal_sc": "no" if journal_sc is None else journal_sc,
                "author_sc": "no" if author_sc is None else author_sc
            })
            cit = ccs.get_next_citation_data()

        with open(self.citations) as f:
            old = list(DictReader(f))

        self.assertEqual(new, old)
</code></pre></div>

<h2>To do list</h2>
<ul>
  <li>Create a repository where the weekly report will be kept updated.</li>
  <li>Write an e-mail to DARCH and ask for an internship (subject: first phase of the thesis project, concerning the interaction with the data source).</li>
  <li>Study the <a href="https://docs.python.org/3/library/unittest.html">unittest library</a> for tests and practise with it.</li>
  <li>Have a look at the CROCI test, which is simple and deals with a class similar to the one to be developed. Start with the <a href="https://comp-think.github.io/">CT exercises</a> and reproduce the functions using unittest. Before writing the functions that the class implements, make sure the test is written. Learn how to launch the library and how to use it proficiently.</li>
</ul>

      </div>
    </div>
  </div>

  <div class="card">
    <div class="card-header" id="headingTwo">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseTwo" aria-expanded="false" aria-controls="collapseTwo">
          Week #2 (01-21-21)
        </button>
      </h5>
    </div>
    <div id="collapseTwo" class="collapse" aria-labelledby="headingTwo" data-parent="#accordion">
      <div class="card-body">
          <h2>To do list</h2>
          <ul>
              <li>Extract ten examples of citation data from the NIH library and try to understand the format of the data we are going to work with.</li>
              <li>Start preparing the data mapping: analyse the available information in order to understand the mapping.</li>
              <li>Try to imagine how the main function has to be structured.</li>
              <li>Develop the test case, which is going to fail, since the function to be tested doesn't exist yet.</li>
              <li>Study the basic commands of the command prompt (python -m unittest).</li>
              <li>Clone OC index locally in order to run the tests.</li>
              <li>Revise self citation aspect in COCI article (<a href="Heibi2019_Article_SoftwareReviewCOCITheOpenCitat.pdf">Software review: COCI, the OpenCitations Index of Crossref
open DOI‑to‑DOI citations</a>)</li>
          </ul>
        <h2>Comments to unittest working frame</h2>
          <p><strong>assertRaises():</strong> Checks whether a specific exception is raised by the execution of a specific piece of code. However, in our case, even if we have exceptions, we don't need to test them, since our checks are strictly assertive, on data.</p>
          <p><strong>if __name__ == '__main__':</strong> When we import a module (for example with <strong>import *</strong>), Python executes all of its top-level code, including lines we do not really need, such as prints and test runs (when we import, we are generally interested in functions and methods, not in tests and prints). For this reason, there is a way to specify that some lines of code should run only when a specific file is executed directly: we can write if __name__ == '__main__': do something, which means "do something only if this is the file called by the Python interpreter; otherwise, do not execute it". In this way, the unittest run is executed only if the file itself is the one launched by the interpreter.
          </p>
          <p><strong>assertEqual(first, second, msg=None):</strong>
Test that first and second are equal. If the values do not compare equal, the test will fail. This is the most widely used method of the unittest framework. </p>
          <p>In OC we use unittest because it ships with Python by default, and it is enough for our needs.
          In the development of NOCI it will be necessary to maintain the same structure as the other tests, developing (at least) one test function for each method of the class to be implemented.</p>
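          <p>As a minimal sketch of the two points above (an assertive check with <code>assertEqual</code> and the <code>__main__</code> guard), the following file could be both imported and run directly. The function <code>normalise_year</code>, the test class and <code>run_tests</code> are hypothetical names and are not part of the OC code base. <code>unittest.main()</code> is the more common call inside the guard; an explicit runner is used here only so that the result can be inspected.</p>
<div class="container codebg">
<pre><code>
```python
import unittest


def normalise_year(value):
    """Hypothetical helper: return the 4-character year prefix, or None if empty."""
    if not value:
        return None
    return value.strip()[:4]


class NormaliseYearTest(unittest.TestCase):
    def test_year_is_extracted(self):
        # Assertive check on data: the computed value must equal the expected one.
        self.assertEqual(normalise_year("2019-05-01"), "2019")

    def test_empty_value(self):
        self.assertIsNone(normalise_year(""))


def run_tests():
    """Load and run the test case, returning the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormaliseYearTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    # Executed only when this file is launched directly, not when it is imported.
    run_tests()
```
</code></pre></div>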

          
        <h2> Analysis of get_next_citation_data test</h2>
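         <p><strong>sep, sep, sep:</strong> In the setUp(self) method, <strong>sep</strong> (imported from <strong>os</strong>) is the path separator, and "%" is the string formatting operator that inserts it into the path. The separator problem could also be solved with a file path class.</p>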
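         <p>A small sketch of the alternatives, assuming only the standard library: the same relative path built with the "%" formatting used in the test, with <code>os.path.join</code>, and with <code>pathlib.PurePath</code>. The variable names are illustrative.</p>
<div class="container codebg">
<pre><code>
```python
from os import sep
from os.path import join
from pathlib import PurePath

# Style used in the CROCI test: the separator is inserted by string formatting.
formatted = "index%stest_data%scroci_dump%ssource.csv" % (sep, sep, sep)

# os.path.join inserts the platform separator between the parts.
joined = join("index", "test_data", "croci_dump", "source.csv")

# PurePath builds the same path as an object; str() gives the platform form.
as_path = str(PurePath("index", "test_data", "croci_dump", "source.csv"))
```
</code></pre></div>
         <p>On the same platform the three strings are expected to be identical, so the choice is mostly a matter of readability.</p>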
         <p><strong>setUp(self):</strong> In this method we store the paths of the files that contain our source data and the result we expect to obtain. The input file is passed to the constructor: <strong>self.input_file</strong> is given as input to the <strong>CrowdsourcedCitationSource</strong> class in order to read the initial data.</p>
         <p>At this point, we can create the citations, adapting them to what we have in our source file.</p>
        <p><strong>cit = ccs.get_next_citation_data()</strong> returns our 6-element tuple.</p>
        <p>If creation or timespan (or both) is missing, that is not a big problem, and the field stays empty (e.g.: "creation": "" if creation is None else creation). For the other two fields, journal_sc and author_sc, a missing element is replaced with "no", in order to stay compliant with the OC format.</p>
          <p>One of the final steps loads the file with the expected output in the correct format (with open(self.citations) as f:
              <strong>old</strong> = list(DictReader(f))).</p>
         <p>At this point, we compare old and new with the assertEqual method. We need to compare these two lists of dictionaries in order to check that the computed citations match the expected ones.</p>
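         <p>The conversion performed inside the test loop can be isolated as in the sketch below. The function name <code>citation_to_row</code> and the sample identifiers are hypothetical; the keys and the default values are the ones used in the CROCI test above.</p>
<div class="container codebg">
<pre><code>
```python
def citation_to_row(cit):
    """Turn a 6-element citation tuple into a row comparable with the expected CSV."""
    citing, cited, creation, timespan, journal_sc, author_sc = cit
    return {
        "citing": citing,
        "cited": cited,
        # Missing dates and timespans become empty strings.
        "creation": "" if creation is None else creation,
        "timespan": "" if timespan is None else timespan,
        # Missing self-citation flags become "no".
        "journal_sc": "no" if journal_sc is None else journal_sc,
        "author_sc": "no" if author_sc is None else author_sc,
    }


row = citation_to_row(("id1", "id2", "2019", None, None, None))
```
</code></pre></div>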
        
          <h2>Command Prompt</h2>
          <p>We need to know how to use the command prompt because the tests have to be run from the command line. This makes it possible to execute tests one by one instead of making a single big test that calls all the individual parts. For example, if we clone a git repository locally, from outside the index folder we can call python -m unittest package (where package is the location where the test is placed, e.g. index.test.01_csvmanager). In this way we run that specific test only.</p>
          
          <h2>Open Citation Environment: NOCI</h2>
          <p>The aim of this thesis project is to develop an extension of OC index. For this reason, I need to clone OC index locally, so as to be able to run the tests. We will add a new directory, a package that we can call "NOCI", which will be developed following the CROCI structure. We will need a Python file to interact with the input file containing the NIH data (extend locally with a folder containing a Python file named "nih citation source", or something similar). </p>
          <p>Initially, this won't work: the first thing we need to do once the structure is set is to develop the test case, so as to check that the input returns a specific output. The form of the output has to be defined (emulating CROCI and COCI). This process forces us to think about the format of the input and output of the function to be developed.  </p>
          <p>As we said, the output has to follow the OC six-element tuple format (see test_data/croci_dump/citations.csv). The only big difference is that the value inserted in citing and cited won't be a DOI, but the NIH id, which should be a PubMed ID. The DOI may be (and probably is) present; however, we can't rely on this. In this dataset we will have the "selfcitation" information, but we don't know yet whether it is useful for our purposes.</p>
          <p>Remember that "selfcitation" is set to "yes" if citing and cited belong to the same journal, or if the citing and cited publications have at least one author in common. </p>
          
          
        <h2>Support files</h2>
            <p>Much of the information, if lacking, can be integrated in a later phase of the project, in which we can create support files in order to update the process.</p>
          <p>A support file is a CSV file with a very simple structure. In COCI there is a glob.py file that creates these support files. Its aims:</p>
              <ul>
                  <li>check whether the specified DOIs are valid</li>
                  <li>map identifiers to related data (id - publication date; journal ISSN - other data; article id - related data)</li>
          </ul> 
          <p>With the support files we try to understand whether we can improve the quality of the information in the six-element tuple.</p>
          <p>("id1", "id2", None, None, None, None) <br>
id_date.csv: <br>
"id1","2019" <br>
"id2","2017" <br>
-> <br>
("id1", "id2", "2019", <strong>"P2Y"</strong>, None, None)</p>
          <p>Once we have obtained these data from somewhere else, we know that the first id is associated with the date "2019" and the second one with the date "2017". In this way we can extend the citation data with the creation date and the timespan, which we did not have before. </p>
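          <p>The enrichment step in the example can be sketched as follows. This is a simplified illustration, not the code in glob.py or in the Citation class: the function names are hypothetical, and the timespan is computed on years only (the real process also handles months and days).</p>
<div class="container codebg">
<pre><code>
```python
import csv
import io

# Content of the id_date.csv support file from the example above.
ID_DATE_CSV = '"id1","2019"\n"id2","2017"\n'


def load_id_date(handle):
    """Read a two-column support file (id, publication date) into a dict."""
    return {row[0]: row[1] for row in csv.reader(handle) if len(row) == 2}


def year_timespan(citing_date, cited_date):
    """Year-only simplification of an ISO 8601 duration between two dates."""
    citing_year = int(citing_date[:4])
    cited_year = int(cited_date[:4])
    if citing_year >= cited_year:
        return "P%dY" % (citing_year - cited_year)
    # Assumption: a cited work published after the citing one gives a negative span.
    return "-P%dY" % (cited_year - citing_year)


def enrich(cit, id_date):
    """Fill the missing creation date and timespan using the support data."""
    citing, cited, created, timespan, journal_sc, author_sc = cit
    if created is None:
        created = id_date.get(citing)
    cited_date = id_date.get(cited)
    if timespan is None and created and cited_date:
        timespan = year_timespan(created, cited_date)
    return citing, cited, created, timespan, journal_sc, author_sc


id_date = load_id_date(io.StringIO(ID_DATE_CSV))
enriched = enrich(("id1", "id2", None, None, None, None), id_date)
```
</code></pre></div>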
              
          <p>However, when a support file is missing, the general process tries to reconstruct the information using an API (the mechanism works like a switch: if we don't have the support file, we try with the API). All the tools we have now were developed for DOIs: now we have to do the same for PubMed IDs. We may need another glob for NOCI, in order to generate additional CSV files, or maybe we can send requests to already existing online APIs, so as to obtain local files.  </p>
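          <p>The switch between support file and API can be pictured with the sketch below. Everything here is hypothetical: <code>find_date</code> is an illustrative name, the support data is a plain dictionary, and <code>fake_api</code> stands in for a real remote service, so no network request is made.</p>
<div class="container codebg">
<pre><code>
```python
def find_date(identifier, support_dates, api_lookup):
    """Look in the precomputed support data first, then fall back to the API."""
    date = support_dates.get(identifier)
    if date is None:
        date = api_lookup(identifier)
    return date


api_calls = []


def fake_api(identifier):
    """Stand-in for a remote API: records the call and answers from a local dict."""
    api_calls.append(identifier)
    return {"id3": "2015"}.get(identifier)


support_dates = {"id1": "2019", "id2": "2017"}
from_file = find_date("id1", support_dates, fake_api)  # found locally, no API call
from_api = find_date("id3", support_dates, fake_api)   # missing locally, API used
```
</code></pre></div>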
          
       <h2>Working process overview</h2>
          <ul>
              <li>The citation source works with at least two pieces of information: citing and cited. Even when this is the only material we have, we keep it.</li>
              <li>When some piece of information is missing, the overall process uses precomputed files that make it possible to improve the starting material.</li>
          </ul>
          
        <p>As a first step, we have to understand what we have and what is missing. Then, we have to understand how to develop globs, i.e. global files. At this point we have to reason about the PubMed ID structure, which is not handled by the general OC process for now. All of this relates to the first phase. The following one is the alignment.</p>
          
        <h2>Links and References</h2>
          <ul>
              <li>index.test.01_csvmanager</li>
              <li><a href="https://github.com/opencitations/index">https://github.com/opencitations/index</a></li>
              <li><a href="https://github.com/opencitations/index/blob/master/test_data/croci_dump/citations.csv">https://github.com/opencitations/index/blob/master/test_data/croci_dump/citations.csv</a></li>
              <li><a href="https://github.com/opencitations/index/blob/master/coci/glob.py">https://github.com/opencitations/index/blob/master/coci/glob.py</a></li>    
          </ul>



      </div>
    </div>
  </div>
  <div class="card">
    <div class="card-header" id="headingThree">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseThree" aria-expanded="false" aria-controls="collapseThree">
          Week #3 (02-5-21)
        </button>
      </h5>
    </div>
    <div id="collapseThree" class="collapse" aria-labelledby="headingThree" data-parent="#accordion">
      <div class="card-body">
          <h2>Mapping</h2>
          <h3>Reconstructing Information</h3>
          <p>We can work out the self-citation information later. It would be reasonable to follow the same process adopted for COCI: we have a preprocessing phase where each document is associated with some information extracted from a CSV file containing metadata (e.g. journal ISSN, ORCID). The algorithm is supposed to receive some files, which are then used to extract the self-citation information. The point is that the mapping represents a later phase of the process.
          </p>
          <h3>NIH Mapping file</h3>
          <p>NIH should provide a CSV mapping file. We should then understand whether it makes more sense to use the API or the existing files (we don't know the number of files to be managed, but there are probably many of them: the APIs could be overloaded, and if the same information is already stored in CSV files the process should be faster).</p>
          
          <h2>NIH dataset management</h2>
          <p>The format we found for CROCI citation data was "citing" - "cited", while for NOCI it is "citing" - "referenced". The second column name can't be changed, but must be handled as it comes. Pay attention: <strong>the source file must stay as it is</strong>: the different naming convention is to be handled when building the 6-element tuple.</p>
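          <p>A minimal sketch of how the different column name could be handled without touching the source file. It is an illustration only: the class <code>NIHCitationSketch</code> is hypothetical and does not extend the OC base class, the sample rows are invented, and the only assumption taken from the notes is that the file has a "citing" and a "referenced" column.</p>
<div class="container codebg">
<pre><code>
```python
import csv
import io

# Invented sample in the "citing" - "referenced" layout; the third row is incomplete.
SAMPLE_NIH_CSV = "citing,referenced\n111,222\n111,333\n444,\n"


class NIHCitationSketch:
    """Hypothetical reader mapping the NIH column names onto the 6-element tuple."""

    def __init__(self, handle):
        self._rows = csv.DictReader(handle)

    def get_next_citation_data(self):
        # The reader keeps its position, so each call resumes after the last row read.
        for row in self._rows:
            citing = (row.get("citing") or "").strip()
            cited = (row.get("referenced") or "").strip()
            if citing and cited:
                # Dates, timespan and self-citation flags are not known at this stage.
                return citing, cited, None, None, None, None
        return None


source = NIHCitationSketch(io.StringIO(SAMPLE_NIH_CSV))
first = source.get_next_citation_data()
second = source.get_next_citation_data()
third = source.get_next_citation_data()  # incomplete row skipped, file exhausted
```
</code></pre></div>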
          <h2> citing, cited, creation, timespan, journal_sc, author_sc = cit </h2>
          <p>This assignment works because Python can assign a tuple of values to a tuple of variables: cit is already a tuple of values, while citing, cited, creation, timespan, journal_sc and author_sc are the names of the variables.</p>
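          <p>A short illustration of tuple unpacking with invented values. The helper <code>unpack_fails</code> is hypothetical and only shows that the number of names must match the number of values.</p>
<div class="container codebg">
<pre><code>
```python
cit = ("id1", "id2", "2019", "P2Y", None, None)

# Six names on the left, six values on the right: each name gets one value in order.
citing, cited, creation, timespan, journal_sc, author_sc = cit


def unpack_fails(values):
    """Return True if the values cannot be unpacked into exactly six names."""
    try:
        a, b, c, d, e, f = values
    except ValueError:
        return True
    return False
```
</code></pre></div>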

          <h2>c in the class CrowdsourcedCitationSource (crowdsourcedcitationsource.py)</h2>
          <p>Citation class. The step with c is useful only because it takes the four main values. For example, Citation computes the timespan by default, and the constructor of the class makes the calculations on its own. We keep only the part we need.</p>
          <h2>NOCI main function</h2>
          <p>It is to be developed on the basis of the CROCI main function. It is the function contained in the NOCI class.</p>
          <p>The first aim of this function is to manage the new type of id. This should happen in cnc.py. We are supposed to manage the type of id so as to interact with the correct API. The most important aspect is that the conversion from the source citation to the information stored in the 6-column format works. Once we accomplish this, we have to understand how to improve the quality and the amount of data stored in the tuple. Use the input files of CROCI to understand how the overall process works. <strong>get_next_citation_data</strong> must keep the same name in the new class to be implemented. </p>
          
          <h2>To do</h2>
          <ul>
              <li>Download Metadata File</li>
              <li>Develop the main function</li>
              <li>Correct the test function</li>
              <li>Locally clone and launch OC index</li>
              <li>Read Mail about mapping</li>
          </ul>
      </div>
    </div>
  </div>
  <div class="card">
    <div class="card-header" id="headingFour">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseFour" aria-expanded="false" aria-controls="collapseFour">
          Week #4 (02-11-21)
        </button>
      </h5>
    </div>
    <div id="collapseFour" class="collapse" aria-labelledby="headingFour" data-parent="#accordion">
      <div class="card-body">
          <h2>Normaliser</h2>
        <p><strong>Normaliser:</strong> Takes an ID, assuming it is of a certain type. It then checks it and makes it uniform according to a single scheme (which is slightly customisable, but not much). With DOIs, it takes a string as input, normalises it by turning everything into lowercase, and checks that there are no null characters, which would break the id. So, in general, a normaliser takes an id as input and gives it back in the normalised format.</p>
         
        <h2>Validator</h2>
        <p><strong>Validator:</strong> When the normaliser doesn't return None, the following steps are managed by the validator, which is strictly dependent on the object it works on. Some ID formats follow a progressive logic, which makes it possible to validate them against a particular scheme. However, other ids, like DOIs, can be validated only through an API. </p>
        
        <h2>PubMed Case</h2>
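          <p>Regular expression for PubMed IDs: ^\d+$ --> ^[1-9]\d*$ (not ^[0-9]{1,7}$), i.e. one or more digits with no leading zero and no fixed maximum length.</p>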
        
          <p> <strong>orcid_string = sub("[^X0-9]", "", id_string.upper())</strong>
           --> It may happen that the URL of the ORCID is passed (instead of the bare ORCID, for example). The same thing may happen with PubMed IDs, so it should be handled. In our case, since PubMed IDs are made of digits only, the idea is to remove everything that is not a digit. Moreover, we have to drop any sequence of 0s that we may find before the id. At this point we can normalise (we won't need upper/lowercase normalisation, since we are managing digits only). For the validation we will need the PubMed API, in order to know whether the id exists or not.</p>
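          <p>The normalisation described above could look like the following sketch. It is not the code of the OC managers: <code>normalise_pmid</code> is a hypothetical name, and the function only covers normalisation (checking that the id actually exists would still require the PubMed API).</p>
<div class="container codebg">
<pre><code>
```python
from re import match, sub


def normalise_pmid(id_string):
    """Return the PubMed ID as a string of digits without leading zeros, or None."""
    if id_string is None:
        return None
    # Keep digits only: this also strips prefixes such as a PubMed URL.
    digits = sub(r"[^0-9]", "", id_string)
    # Drop any leading zeros found before the id.
    digits = sub(r"^0+", "", digits)
    # The result must match the agreed pattern: a non-zero digit, then any digits.
    if match(r"^[1-9]\d*$", digits):
        return digits
    return None
```
</code></pre></div>
          <p>Note that stripping every non-digit character would also merge digits that appear elsewhere in the input string, so a stricter version might first isolate the last path segment of a URL.</p>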
         
          <h2>to do</h2>
          <ol>
          <li>Fix the second function developed last time (id case is built on the base of doi case) </li>
          <li>Add a test case also for this latter function</li>
          <li>Look at the overall functioning: where do we need additional code to manage pubmeds?</li>
          <li>Run tests</li>
          <li>Before checking the overall functioning, check the two tests</li>
          </ol> 
      </div>
    </div>
  </div>
  <div class="card">
    <div class="card-header" id="headingFive">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseFive" aria-expanded="false" aria-controls="collapseFive">
          Week #5 (02-18-21)
        </button>
      </h5>
    </div>
    <div id="collapseFive" class="collapse" aria-labelledby="headingFive" data-parent="#accordion">
      <div class="card-body">
          <h2>To Do</h2>
          <ul>
              <li>Run again all tests</li>
              <li>Study the execution process</li>
              <li>Study how to run the process dynamically (dynamic requests for CROCI, request material). It is useful in order to see what happens step by step. This works for CROCI, but understanding how it works is useful to work out how to integrate NOCI into the process. For now cnc.py works with DOIs only, and it does not even check whether the id is a DOI or not. We need to make it work with PMIDs too, so the nature of the id will have to be specified.</li>
              <li>Once we have understood how the process works, we have to understand how to integrate PMIDs.</li>
              <li>Check the information provided when specifying "?format=pubmed" in the API request (e.g.: https://pubmed.ncbi.nlm.nih.gov/47/?format=pubmed). For now we are interested only in the ISSN and the publication date.</li>
              <li>Check corrected functions</li>
          </ul>
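          <p>As a sketch of the ISSN and publication date extraction mentioned in the list above, the function below parses a record in the PubMed (MEDLINE) text format, where the <code>IS</code> field carries an ISSN and the <code>DP</code> field the date of publication. The record is an invented sample embedded in the code, so no request is made; <code>parse_pubmed_fields</code> is a hypothetical name, and it is assumed that the record text has already been extracted from the API response.</p>
<div class="container codebg">
<pre><code>
```python
import re

# Invented record in PubMed text format (tag, padding, dash, value).
SAMPLE_RECORD = (
    "PMID- 12345\n"
    "DP  - 1999 Jan\n"
    "IS  - 1234-5678 (Print)\n"
    "IS  - 8765-4321 (Electronic)\n"
    "TI  - Hypothetical title.\n"
)


def parse_pubmed_fields(text):
    """Extract the ISSNs and the publication year from a PubMed-format record."""
    issns = re.findall(r"^IS\s+-\s+(\d{4}-\d{3}[\dXx])", text, flags=re.MULTILINE)
    date = re.search(r"^DP\s+-\s+(\d{4})", text, flags=re.MULTILINE)
    return {"issn": issns, "pub_year": date.group(1) if date else None}


fields = parse_pubmed_fields(SAMPLE_RECORD)
```
</code></pre></div>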
      </div>
    </div>
  </div>
  <div class="card">
    <div class="card-header" id="headingSix">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseSix" aria-expanded="false" aria-controls="collapseSix">
          Week #6
        </button>
      </h5>
    </div>
    <div id="collapseSix" class="collapse" aria-labelledby="headingSix" data-parent="#accordion">
      <div class="card-body">
          <h2>Dynamic requests for croci</h2>
          <ul>
              <li><strong>-s</strong> : the source I'm supposed to take data from. I specify where the data source is. In our case it is an example, and we don't specifically need to have the correct source. The command is just intended to start the tool. The source is then specified in the provenance information. The system also says where the raw data were taken from and who provided them (the data that allowed the creation of the citation).</li>
              <li><strong>-a</strong> : identifies the agent responsible for these data.</li>
              <li><strong>key for api-orcid</strong>: it is personal and not strictly required. The command can be run even without it.</li>
          </ul>
          <div class="codebg">
              <code>
                  <pre>
                  C:\Users\arimoretti\Documents\GitHub>python -m index.cnc -ib "http://dx.doi.org/" -b "https://w3id.org/oc/index/croci/" -p "C:\Users\arimoretti\Documents\GitHub\index\croci\crowdsourcedcitationsource.py" -c "CrowdsourcedCitationSource" -i "C:\Users\arimoretti\Documents\GitHub\index\test_data\croci_dump\source.csv" -l "C:\Users\arimoretti\Documents\GitHub\index\test_data\tmp_store\lookup_full.csv" -d "C:\Users\arimoretti\Documents" -px "050" -a "https://orcid.org/0000-0003-0530-4305" -s "https://doi.org/10.5281/zenodo.3832935" -sv "OpenCitations Index: CROCI" -v
WARNING: the citation between DOI '10.1002/asi.20755' and DOI '10.1109/wi.2006.164' has been already processed
WARNING: the citation between DOI '10.1002/asi.20755' and DOI '10.5210/fm.v11i11.1413' has been already processed
WARNING: the citation between DOI '10.1002/asi.20755' and DOI '10.5210/fm.v8i12.1108' has been already processed
WARNING: the citation between DOI '10.1002/asi.20755' and DOI '10.5210/fm.v11i9.1400' has been already processed
WARNING: the citation between DOI '10.1002/asi.20755' and DOI '10.1038/438900a' has been already processed
WARNING: the citation between DOI '10.1002/asi.20755' and DOI '10.2307/2529310' has been already processed
WARNING: the citation between DOI '10.1002/asi.20755' and DOI '10.2307/4486062' has been already processed
WARNING: the citation between DOI '10.1002/asi.20755' and DOI '10.5210/fm.v12i4.1763' has been already processed
WARNING: the citation between DOI '10.1002/asi.20755' and DOI '10.1145/503376.503456' has been already processed
WARNING: the citation between DOI '10.1002/asi.20755' and DOI '10.1142/9789812701527_0009' has been already processed
WARNING: the citation between DOI '10.1002/asi.20755' and DOI '10.1145/1501434.1501445' has been already processed
WARNING: the citation between DOI '10.1002/asi.20755' and DOI '10.1007/11839569_35' has been already processed
WARNING: the citation between DOI '10.1002/asi.20755' and DOI '10.2307/1562247' has been already processed

# Summary
Number of new citations added to the OpenCitations Index: 0
Number of citations already present in the OpenCitations Index: 13
Number of citations with invalid DOIs: 0

C:\Users\arimoretti\Documents\GitHub>
                  </pre>
              </code>
          </div>
          <h2>makedir</h2>
          <p>Run some trials on a local file (a Python file generated and executed locally). Something is wrong, and we have to understand what. </p>
      </div>
    </div>
  </div>
      
  <div class="card">
    <div class="card-header" id="headingSeven">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseSeven" aria-expanded="false" aria-controls="collapseSeven">
          Week #7 (03-04-21)
        </button>
      </h5>
    </div>
    <div id="collapseSeven" class="collapse" aria-labelledby="headingSeven" data-parent="#accordion">
      <div class="card-body">
          <h2>To do:</h2>
          <ul>
              <li>Open issues on GitHub (UnicodeDecodeError, failure of test 05_citationstorer)</li>
              <li>Check where oci has to be extended in order to handle PMIDs</li>
              <li>Fix the indentation error in nationalinstituteofhealthsource.py</li>
              <li>Print the serialization of the two graphs (g1 and g2 in 05_citationstorer) in N-Triples</li>
              <li>Fix pmidmanager.py (the API returns an HTML page rather than a string)</li>
          </ul>

      </div>
    </div>
  </div>
  
  <div class="card">
    <div class="card-header" id="headingEight">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseEight" aria-expanded="false" aria-controls="collapseEight">
          Week #8 (03-11-21)
        </button>
      </h5>
    </div>
    <div id="collapseEight" class="collapse" aria-labelledby="headingEight" data-parent="#accordion">
      <div class="card-body">
          <h2>To do:</h2>
          <ul>
              <li>Correct the CSV file manually, so that it can be used when the API is not available</li>
              <li>Check the isomorphism between the two graphs and try to understand why they differ</li>
              <li>Extend oci so that it can handle PMIDs</li>
          </ul>

      </div>
    </div>
  </div>
      

  <div class="card">
    <div class="card-header" id="headingNine">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseNine" aria-expanded="false" aria-controls="collapseNine">
          Week #9 (03-18-21)
        </button>
      </h5>
    </div>
    <div id="collapseNine" class="collapse" aria-labelledby="headingNine" data-parent="#accordion">
      <div class="card-body">
          <p><a href="https://github.com/opencitations/index/blob/master/citation/oci.py">https://github.com/opencitations/index/blob/master/citation/oci.py</a></p>
          <p><a href="https://github.com/opencitations/oci/blob/master/lookup.csv">https://github.com/opencitations/oci/blob/master/lookup.csv</a></p>
          
          
          <h2>To do:</h2>
          <ul>
              <li>Read <a href="https://doi.org/10.6084/m9.figshare.7127816">https://doi.org/10.6084/m9.figshare.7127816</a></li>
              <li>Fix the files and copy the notes into them</li>
              <li>Update the OpenCitations issue</li>
              <li>Check and/or fix: oci + cnc + 05_citationstorer + 04_oci</li>
              <li>Start analysing cnc to understand how it works and what it uses, in order to figure out whether some parts of the code may conflict with the PMID structure to be integrated</li>
          </ul>

      </div>
    </div>
  </div>


      
    <div class="card">
    <div class="card-header" id="headingTen">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseTen" aria-expanded="false" aria-controls="collapseTen">
          Week #10 (04-01-21)
        </button>
      </h5>
    </div>
    <div id="collapseTen" class="collapse" aria-labelledby="headingTen" data-parent="#accordion">
      <div class="card-body">
          
          <h2>To do:</h2>
          <ul>
              <li>In oci.py, change def get_oci(self, doi_1, doi_2, prefix, id_type) so that it also receives the id type (no longer DOIs only)</li>
              <li>Run test 04_oci and add the missing id_type argument to the existing tests</li>
              <li>Add a test for get_oci that also covers the pmid type. Open question: which is the pmid prefix?</li>
              <li>Add the id type to the six-element (now seven-element) tuple returned by get_next_citation_data(self). This information is known in advance, since NationalInstituteOfHealthSource handles PMIDs only, while CrossrefCitationSource and CrowdsourcedCitationSource handle DOIs only</li>
          </ul>
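          <p>A minimal sketch of the signature change described above, not the actual code of oci.py. The lookup table is a tiny made-up stand-in for lookup.csv, the helper encode_id is hypothetical, and the prefixes "020" and "0160" are only example values. It also assumes that a PMID, being numeric, can be placed after the prefix as it is, while a DOI has to be encoded character by character.</p>
          <div class="codebg">
              <code>
                  <pre>
```python
# Hypothetical, heavily reduced lookup table (the real one is lookup.csv).
LOOKUP = {"0": "00", "1": "01", ".": "36", "/": "37", "a": "10", "b": "11"}


def encode_id(identifier, id_type):
    """Turn an identifier into the numeric part of an OCI (sketch only)."""
    if id_type == "pmid":
        # Assumption: PMIDs are already numeric, so no lookup is needed.
        return identifier
    if id_type == "doi":
        return "".join(LOOKUP[char] for char in identifier.lower())
    raise ValueError("Unsupported id type: " + id_type)


def get_oci(id_1, id_2, prefix, id_type):
    """Build an OCI from a citing and a cited identifier of the same type."""
    citing = prefix + encode_id(id_1, id_type)
    cited = prefix + encode_id(id_2, id_type)
    return citing + "-" + cited


print(get_oci("123456", "789", "0160", "pmid"))
print(get_oci("10.1/a", "10.1/b", "020", "doi"))
```
                  </pre>
              </code>
          </div>
          <p>With a signature of this kind, the existing DOI tests only need the extra "doi" argument, and a PMID test can be added next to them.</p>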

      </div>
    </div>
  </div>
      
<div class="card">
    <div class="card-header" id="headingEleven">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseEleven" aria-expanded="false" aria-controls="collapseEleven">
          Week #11 (04-15-21)
        </button>
      </h5>
    </div>
    <div id="collapseEleven" class="collapse" aria-labelledby="headingEleven" data-parent="#accordion">
      <div class="card-body">
          
          <h2>To do:</h2>
          <ul>
              <li>Specify the id type in the tests of CrossrefCitationSource, CrowdsourcedCitationSource and NationalInstituteOfHealthSource</li>
              <li>Study class variables, instance variables and inheritance</li>
              <li>Add a test for get_oci that also covers the pmid type. Open question: which is the pmid prefix?</li>
              <li>Add the id type to the six-element (now seven-element) tuple returned by get_next_citation_data(self). This information is known in advance, since NationalInstituteOfHealthSource handles PMIDs only, while CrossrefCitationSource and CrowdsourcedCitationSource handle DOIs only</li>
          </ul>

      </div>
    </div>
  </div>
      
    <div class="card">
    <div class="card-header" id="headingTwelve">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseTwelve" aria-expanded="false" aria-controls="collapseTwelve">
          Week #12 (04-22-21)
        </button>
      </h5>
    </div>
    <div id="collapseTwelve" class="collapse" aria-labelledby="headingTwelve" data-parent="#accordion">
      <div class="card-body">
          
          <h2>To do:</h2>
          <ul>
              <li><b>citationsource.py</b>: (1) <strong>class CSVFileCitationSource(DirCitationSource):</strong> when it is instantiated in the test, the id type of the CSV in question must also be passed as input. A CSV file must be created for PMIDs as well. (2) <strong>return citing, cited, created, timespan, journal_sc, author_sc, self.id_type</strong>: self.id_type must be passed correctly; the function returns it exactly as it was passed by the user.</li>
              <li><b>06_citationsource.py</b>: check the file and copy over the notes.</li>
              <li><b>cnc.py</b>: check the file and copy over the notes.</li>
              <li><b>07_cnc.py</b>: (1) check the file and copy over the notes. (2) Create the required support files.</li>
              <li><b>oci.py</b>: check the file and copy over the notes.</li>
              <li><b>resourcefinder.py</b>: read the file to understand how to implement the version for NIH</li>
              <li><b>nationalinstituteofhealthresourcefinder.py</b>: create the file, using crossrefresourcefinder.py as a model</li>
              <li><b>03_resourcefinder.py</b>: update the file by adding the code that tests nationalinstituteofhealthresourcefinder.py</li>
              <li>Update the code in accordance with the commits in OpenCitations</li>
              <li>Send the request for the internship</li>
          </ul>

      </div>
    </div>
  </div>
      
    <div class="card">
    <div class="card-header" id="headingThirteen">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseThirteen" aria-expanded="false" aria-controls="collapseThirteen">
          Week #13 (05-13-21)
        </button>
      </h5>
    </div>
    <div id="collapseThirteen" class="collapse" aria-labelledby="headingThirteen" data-parent="#accordion">
      <div class="card-body">
          
          <h2>To do:</h2>
          <ul>
              <li></li>
              <li></li>
              <li></li>
              <li></li>
              <li></li>
              <li></li>
          </ul>

      </div>
    </div>
  </div>
  
  <div class="card">
    <div class="card-header" id="headingFourteen">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseFourteen" aria-expanded="false" aria-controls="collapseFourteen">
          Week #14 (05-20-21)
        </button>
      </h5>
    </div>
    <div id="collapseFourteen" class="collapse" aria-labelledby="headingFourteen" data-parent="#accordion">
      <div class="card-body">
          
          <h2>nihresourcefinder.py</h2>
          <p>The records from which the resource finder has to retrieve its information have this shape:</p>
          <div class="codebg">
              <code>
                  <pre>
PMID- 123456
OWN - NLM
STAT- MEDLINE
DCOM- 19750625
LR  - 20160920
IS  - 0030-0632 (Print)
IS  - 0030-0632 (Linking)
VI  - 78
IP  - 4
DP  - 1975 Apr
TI  - [The laboratory in programs for enteric infection control].
PG  - 318-22
FAU - Grados, O B
AU  - Grados OB
LA  - spa
PT  - Journal Article
TT  - El laboratorio en los programas de control de las infecciones entéricas.
PL  - United States
TA  - Bol Oficina Sanit Panam
JT  - Boletin de la Oficina Sanitaria Panamericana. Pan American Sanitary Bureau
JID - 0414762
SB  - IM
MH  - Bacterial Infections/*prevention and control
MH  - Bacteriological Techniques
MH  - *Communicable Disease Control/methods
MH  - Enteritis/*prevention and control
MH  - Health Planning
MH  - *Laboratories
MH  - Peru
EDAT- 1975/04/01 00:00
MHDA- 1975/04/01 00:01
CRDT- 1975/04/01 00:00
PHST- 1975/04/01 00:00 [pubmed]
PHST- 1975/04/01 00:01 [medline]
PHST- 1975/04/01 00:00 [entrez]
PST - ppublish
SO  - Bol Oficina Sanit Panam. 1975 Apr;78(4):318-22.
                  </pre>
              </code>
              
      </div>
    <h3>Issues to handle</h3>
              <ul>
                  <li>Find documentation for the format and understand the field acronyms</li>
                  <li>Understand which field holds the date of publication</li>
                  <li>Handle the double ISSN (print and linking)</li>
                  <li>The ORCID is missing, so skip this step and always return an empty set</li>
              </ul>
              
              <h3>Meeting: Various notes</h3>
              <h4>NIHResourceFinder._get_orcid(self,pmid)</h4>
              <p>This function always returns an empty set: the NIH does not provide any information about the ORCIDs associated with an article. The ORCID finder will be used instead, to try to retrieve this information elsewhere.</p>
              <h4>NIHResourceFinder._get_issn</h4>
              <p>If the test does not pass, double-check the indentation of the lines containing the ISSN: the number of spaces may need to be changed.</p>
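              <p>A minimal sketch of how a record with the shape shown above could be read, including the double ISSN and the spacing of the tags. It is not the code of the resource finder: parse_record and get_issns are hypothetical helpers, and reading the publication date from the DP field is an assumption to be checked against the format documentation.</p>
              <div class="codebg">
                  <code>
                      <pre>
```python
import re

SAMPLE = """PMID- 123456
IS  - 0030-0632 (Print)
IS  - 0030-0632 (Linking)
DP  - 1975 Apr
TI  - [The laboratory in programs for enteric infection control]."""


def parse_record(text):
    """Collect the values of each tag; a tag may occur more than once."""
    fields = {}
    for line in text.splitlines():
        # A field line is a tag of two to four capital letters, optional
        # padding spaces, then "- " and the value.
        match = re.match(r"^([A-Z]{2,4})\s*-\s(.*)$", line)
        if match:
            fields.setdefault(match.group(1), []).append(match.group(2).strip())
    return fields


def get_issns(fields):
    """Return the distinct ISSNs, whatever their kind (Print, Linking)."""
    issns = []
    for value in fields.get("IS", []):
        found = re.search(r"\d{4}-\d{3}[\dXx]", value)
        if found and found.group(0) not in issns:
            issns.append(found.group(0))
    return issns


record = parse_record(SAMPLE)
print(record["PMID"])
print(get_issns(record))
print(record["DP"])
```
                      </pre>
                  </code>
              </div>
              <p>Since the pattern accepts any number of spaces between the tag and the hyphen, the ISSN lines are matched regardless of their padding, and the print and linking ISSNs collapse into one value when they coincide.</p>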
    </div>
  </div>
      
  <div class="card">
    <div class="card-header" id="headingFifteen">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseFifteen" aria-expanded="false" aria-controls="collapseFifteen">
          Week #15 (05-25-21)
        </button>
      </h5>
    </div>
    <div id="collapseFifteen" class="collapse" aria-labelledby="headingFifteen" data-parent="#accordion">
      <div class="card-body">
          
             
    </div>
  </div>

</div>
      
<div class="card">
    <div class="card-header" id="headingSixteen">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseSixteen" aria-expanded="false" aria-controls="collapseSixteen">
          Week #16 (06-03-21)
        </button>
      </h5>
    </div>
    <div id="collapseSixteen" class="collapse" aria-labelledby="headingSixteen" data-parent="#accordion">
      <div class="card-body">
          <p>Libraries are loaded as packages inside a Python environment, and this approach is what makes it possible to refer to their components. The basic idea is that a package has a reference identifier; the last name in it is a class, a function and so on. When index is made available as an installable Python package, this is how it has to be done.</p>
          <p>Something has changed in the imports: either an error was introduced or the tests should work.</p>
          <p>When launching the tests, move out of the index directory, otherwise the modules are not found. Note: when a single function is launched, index is not recognised as a package.</p>
          <p>If index is defined as the source directory, it is not seen as a package but as a directory that contains the packages.</p>
          <p>The error stack is reconstructed from the most abstract call at the top to the most concrete one at the bottom.</p>
          <p>I can fork index into my own repository and then make changes in my own space.</p>
          
             
    </div>
  </div>

</div>
<div class="card">
    <div class="card-header" id="headingSeventeen">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseSeventeen" aria-expanded="false" aria-controls="collapseSeventeen">
          Week #17 (06-17-21)
        </button>
      </h5>
    </div>
    <div id="collapseSeventeen" class="collapse" aria-labelledby="headingSeventeen" data-parent="#accordion">
      <div class="card-body">
          <h2><strong>03_resourcefinder</strong></h2>
          <p>Try commenting out all the tests and launching just one of them.</p>
           <p>The path to follow to solve the current problem in the finder test should be: get orcid --> get_item --> csv manager --> doi normalization --> none</p>
          <p>The only way to try to fix the code is to keep launching the test from the outside (always) and to add prints at each step, in order to see at which point the problem occurs. So far it seems to be related to the id type extension.</p>
          <p>Use print(abspath(".")) to identify the directory from which the process is run.</p>
          <p>The problem should not relate to any of the following points:</p>
          <ul>
              <li>self.orcid_path: it exists and it is found</li>
              <li>The path from which the test is launched (print(abspath("."), "è il punto in cui si trova") prints what was expected)</li>
              <li>By defining a variable a_csv_manager = CSVManager(self.orcid_path), we also obtain the expected data both with a_csv_manager.data and with a_csv_manager.get_value("doi:10.1108/jd-12-2013-0166")</li>
            </ul>
        <p>Instead, the problem arises with of_2.get_orcid("10.1108/jd-12-2013-0166"). Since of_2 is defined as of_2 = ORCIDResourceFinder(orcid=CSVManager(self.orcid_path), doi=CSVManager(self.doi_path), use_api_service=False), we should look for the error there.</p>
          <p>Regex groups: group 0 is the whole match. Each pair of parentheses added to subdivide the pattern defines one more group in the matched string.</p>
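          <p>A short example of the note on groups, using Python's re module. The pattern is only illustrative (it is not the one used for DOI normalization in the code base); the DOI is the one used in the test above.</p>
          <div class="codebg">
              <code>
                  <pre>
```python
import re

# Two pairs of parentheses, so two groups besides group 0.
pattern = re.compile(r"(10\.\d{4,9})/(\S+)")
match = pattern.search("doi:10.1108/jd-12-2013-0166")

print(match.group(0))   # the whole matched string
print(match.group(1))   # what the first pair of parentheses captured
print(match.group(2))   # what the second pair of parentheses captured
print(pattern.groups)   # number of groups defined by the pattern
```
                  </pre>
              </code>
          </div>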
         
         
          
             
    </div>
  </div>

</div>
    

    <div class="card">
    <div class="card-header" id="headingEighteen">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseEighteen" aria-expanded="false" aria-controls="collapseEighteen">
          Week #18 (07-07-21)
        </button>
      </h5>
    </div>
    <div id="collapseEighteen" class="collapse" aria-labelledby="headingEighteen" data-parent="#accordion">
      <div class="card-body">
          <h2><strong>03_resourcefinder</strong></h2>
          <p>Try commenting out all the tests and launching just one of them, in order to simplify the analysis: it is fairly evident that all the tests that involve loading a file fail for the same reason.</p>
          <p>During the meeting we went through the files involved and compared them with the official OC version in order to spot the file loading problem. No evident mistakes were found, and we agreed that the problem is probably due to some secondary issue, such as a typo or wrong indentation.</p>
          <p>Fixing this error remains the priority, in order to proceed with the workflow.</p>
    </div>
  </div>
</div>

    <div class="card">
    <div class="card-header" id="headingNineteen">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseNineteen" aria-expanded="false" aria-controls="collapseNineteen">
          Week #19 (07-27-21)
        </button>
      </h5>
    </div>
    <div id="collapseNineteen" class="collapse" aria-labelledby="headingNineteen" data-parent="#accordion">
      <div class="card-body">
          <h2><strong>Fork Update</strong></h2>
          <p>A new file was added to OC, but it does not concern the work I have done so far. In addition, the structure of the index directory was modified: index now contains another folder, also named index, which holds all the subfolders. Only cnc was moved out of the inner index folder. The problem was spotted through a test and concerns parallel data ingestion. cnc.py also changed internally: a class (a handler) was developed to perform a specific check. The bugs that emerged during the thesis work were fixed as well. Most of the tests should be the same as before, but the one for cnc was moved outside. Other small changes were made in citationstorer.</p>

          <h2>To do</h2>
          <ol>
              <li>Update the fork with OC changes</li>
              <li>Revise at least five meetings (6-10) and update the fork accordingly</li>
              <li>Fix issn issue in resourcefinder tests</li>
              <li>Fix NIH issue in resourcefinder tests</li>
          </ol>

    </div>
  </div>
</div>

    <div class="card">
    <div class="card-header" id="headingTwenty">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseTwenty" aria-expanded="false" aria-controls="collapseTwenty">
          Week #20 (08-05-21)
        </button>
      </h5>
    </div>
    <div id="collapseTwenty" class="collapse" aria-labelledby="headingTwenty" data-parent="#accordion">
      <div class="card-body">
          <h2>To do:</h2>
          <ol>
              <li>Extend cnc and its test</li>
              <li>Extend the citation storer and the citation source, so that the whole process also works for PMIDs</li>
          </ol>
          
          <h2>Notes</h2>
          <ul>
              <li>lookup_full should be used for DOIs only, not for PMIDs: for PubMed ids we just put the numeric id after the prefix 0XXX0. So I probably won't need lookup_full at all to create my test data.</li>
              <li><strong>Lambda function:</strong> a lambda has no name of its own: it defines an anonymous function on the fly, which can then be assigned to a variable. It is a complex topic, explained at <a href="https://realpython.com/python-lambda/">https://realpython.com/python-lambda/</a>. In short, for implementation reasons we need different functions when the process runs in parallel, but the behaviour of the two versions (parallel and standard) should be the same. The lambda function lets us specify which functions the process needs at the moment cnc is run.</li>
              <li><strong>cnc testing:</strong> the test has to be run to make sure that cnc works as expected. In addition, cnc has to be executed for the new index (NOCI), even with a limited amount of data, to check that it works there too.</li>
              <li>The source for PMIDs is <a href="https://doi.org/10.35092/yhjc.c.4586573.v16">https://doi.org/10.35092/yhjc.c.4586573.v16</a> and it works properly.</li>
          </ul>
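          <p>A minimal sketch of the idea in the note on lambda functions: the function to use is chosen when the process is launched, and a lambda adapts the parallel version to the shape expected by the rest of the code. All the names here (store_standard, store_parallel, run, launch) are made up for the example and do not come from cnc.py.</p>
          <div class="codebg">
              <code>
                  <pre>
```python
def store_standard(citation):
    """Stand-in for the function used in a normal run."""
    return "stored " + citation


def store_parallel(citation, worker):
    """Stand-in for the function used in a parallel run."""
    return "stored " + citation + " (worker " + str(worker) + ")"


def run(citations, store):
    """The process only sees a one-argument callable."""
    return [store(citation) for citation in citations]


def launch(citations, parallel=False, worker=0):
    # The lambda adapts the parallel function to the one-argument shape
    # expected by run(); the choice is made only when the process starts.
    if parallel:
        store = lambda citation: store_parallel(citation, worker)
    else:
        store = store_standard
    return run(citations, store)


print(launch(["a-b"]))
print(launch(["a-b"], parallel=True, worker=3))
```
                  </pre>
              </code>
          </div>
          <p>The body of run() is the same in both cases, which is what keeps the behaviour of the two versions aligned.</p>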

    </div>
  </div>
</div>
 

 <div class="card">
    <div class="card-header" id="headingTwenty-one">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseTwenty-one" aria-expanded="false" aria-controls="collapseTwenty-one">
          Week #21 (08-12-21)
        </button>
      </h5>
    </div>
    <div id="collapseTwenty-one" class="collapse" aria-labelledby="headingTwenty-one" data-parent="#accordion">
      <div class="card-body">
          <h2>To do:</h2>
          <ol>
              <li>Create a glob file for noci</li>
          </ol>
          
          <h2>Notes</h2>
          <ul>
              <li>The glob file creates the support files to be used in the OC process</li>
              <li><strong>Files to download:</strong> NIH files in iCite Database, available on <a href="https://nih.figshare.com/articles/dataset/iCite_Database_Snapshot_2021-07/15148737?file=29101575">figshare</a></li>
          </ul>

    </div>
  </div>
</div>

      
<div class="card">
    <div class="card-header" id="headingTwenty-two">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseTwenty-two" aria-expanded="false" aria-controls="collapseTwenty-two">
          Week #22 (09-09-21)
        </button>
      </h5>
    </div>
    <div id="collapseTwenty-two" class="collapse" aria-labelledby="headingTwenty-two" data-parent="#accordion">
      <div class="card-body">
          <h2>To do:</h2>
          <ol>
              <li>Understand how to open tar.gz files on Windows and how to fix download errors.</li>
              <li>The support cache file mapping journal short names to ISSNs is a good idea: keep it and improve it.</li>
              <li>Process all the data as new, without relying on COCI for the citation data that have already been processed in other indexes.</li>
              <li>Find the pandas parameter that avoids loading into memory all at once the content of the CSV files to be processed. In particular, understand how to handle very large files in reasonably small chunks, which are then easy to process: <a href="https://stackoverflow.com/questions/25962114/how-do-i-read-a-large-csv-file-with-pandas">https://stackoverflow.com/questions/25962114/how-do-i-read-a-large-csv-file-with-pandas</a>.</li>
              <li>Add an empty date for the PMIDs which do not have one.</li>
              <li>Iterate a second time over the rows of the input files to process the cited PMIDs, storing as valid those that can be validated through the pmidmanager and discarding those that turn out to be invalid. Call is_valid and, in case of a positive result, set_valid.</li>
              <li>Next meeting: the 15th at 11:30</li>
          </ol>
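          <p>A minimal sketch of the chunking idea from the to-do list. In pandas, the chunksize parameter of read_csv makes it return an iterator of smaller DataFrames instead of one large one; the sketch below shows the same principle with the standard library only, so it is a stand-in for the pandas solution and not a replacement for it. The helper read_in_chunks and the column names are made up for the example.</p>
          <div class="codebg">
              <code>
                  <pre>
```python
import csv
import io
from itertools import islice


def read_in_chunks(handle, chunk_size):
    """Yield lists of at most chunk_size rows from an open CSV file."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        yield chunk


# In-memory stand-in for a large input file; the column names are made up.
data = io.StringIO("pmid,references\n1,2 3\n2,\n3,1\n4,\n5,1 2\n")

sizes = [len(chunk) for chunk in read_in_chunks(data, 2)]
print(sizes)
```
                  </pre>
              </code>
          </div>
          <p>Only one chunk is held in memory at a time, so the chunk size can be tuned to the machine on which the process runs.</p>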
          
          <h2>Notes</h2>
          <ul>
              <li>Having a mapping between PMIDs and DOIs is good, even if we don't directly need it now: it will be very useful in the following steps. All the information in the iCite metadata will be useful in the future, but some important information is missing. The worst aspect is that journals are identified by their short names.</li>
              <li>A table mapping ORCIDs to DOIs has already been implemented: do not create a new version of it.</li>
              <li>Next meeting: Wednesday 15th, 11:30.</li>
          </ul>
    </div>
  </div>
</div>
      

<div class="card">
    <div class="card-header" id="headingTwenty-three">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseTwenty-three" aria-expanded="false" aria-controls="collapseTwenty-three">
          Week #23 (09-14-21)
        </button>
      </h5>
    </div>
    <div id="collapseTwenty-three" class="collapse" aria-labelledby="headingTwenty-three" data-parent="#accordion">
      <div class="card-body">
          <h2>To do:</h2>
          
          <h3>Multiple cnc tests</h3>
          <ol>
              <li>Call cnc on a citation source dataset (which must exist), but without any of the support files. These must not exist yet: use the names of non-existing files, so that cnc creates them.</li>
              <li>Repeat the call adding the "na" parameter, which means the API is not called, and using the files just created. The result should be the same.</li>
              <li>Call cnc again, this time using the API and the files already generated.</li>
              <li>Call cnc one last time, using "na" and empty files: this should be the only one of the four attempts to give a different result.</li>
          </ol>
          
          <h3>Possible empty spaces in referenced pmids (in noci glob) </h3>
          <p>First, any spaces at the beginning and at the end of the string have to be removed; then we have to make sure that no repeated spaces are left as separators between one pmid and the next. This second step is to be implemented with a regex.</p>
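          <p>A minimal sketch of the two steps, assuming that the referenced pmids arrive as a single space-separated string. The function name split_pmids is hypothetical.</p>
          <div class="codebg">
              <code>
                  <pre>
```python
import re


def split_pmids(raw):
    """Return the PMIDs found in a space-separated string."""
    # Step 1: strip leading and trailing spaces.
    # Step 2: collapse every run of whitespace into a single space.
    cleaned = re.sub(r"\s+", " ", raw.strip())
    return cleaned.split(" ") if cleaned else []


print(split_pmids("  123   456 789  "))
print(split_pmids("   "))
```
                  </pre>
              </code>
          </div>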
          
          <h3>Repeated citations issue</h3>
          <p>The same citation can appear in different indexes. The problem is to be solved with the <strong>URLs of metaids</strong>.</p>
              <ol>
                  <li>Starting from the output of cnc, find a way to create the metaids to associate with the pmids (it could be a CSV)</li>
                  <li>Create a mapping table between pmids and metaids (it could be another CSV)</li>
                  <li>Using a mapping table between pmids and DOIs, complete the previous mapping (pmid - metaid - doi, maybe another CSV)</li>
                  <li>Use Blazegraph to create a triplestore with only one property (here we need RDF), which could be dcterms:relation. The data in the resulting N-Triples file should have this format: <strong>metaid - dcterms:relation - [format generated by cnc]</strong>. Look at the ttl formats.</li>
                  <li>The generated file should be loaded into the triplestore.</li>
              </ol>
              <p>The result is to be queried with SPARQL. Study triplestores and rdflib.</p>
              <p>Remember that the initial data to be processed must be the cnc output.</p>
          
          <h3>Next meeting</h3>
          <p>30th of September, Zamboni 32</p>

    </div>
  </div>
</div>

<div class="card">
    <div class="card-header" id="headingTwenty-four">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseTwenty-four" aria-expanded="false" aria-controls="collapseTwenty-four">
          Week #24 (09-30-21)
        </button>
      </h5>
    </div>
    <div id="collapseTwenty-four" class="collapse" aria-labelledby="headingTwenty-four" data-parent="#accordion">
      <div class="card-body">
          <h2>To do (how to fix last week's work):</h2>
          <ol>
              <li>Convert pandas NaN values to empty strings (df.fillna('', inplace=True))</li>
              <li>The ttl serialization does not come out in the same order every time. Since the result is a sequence of text lines, the lines have to be sorted lexicographically (in both files) and the empty lines removed. At that point each line should contain a triple or a quad, and the two files will differ only in the creation date. Only the files stored in the folder "data" should be compared, in order to see whether the software works properly.</li>
          </ol>
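          <p>A minimal sketch of the comparison described in the second point, assuming one statement per line. The two sample strings use made-up prefixed names (ex:...) in place of real triples, and the helpers normalise and differing_lines are hypothetical.</p>
          <div class="codebg">
              <code>
                  <pre>
```python
def normalise(serialisation):
    """Drop empty lines and sort the remaining ones lexicographically."""
    lines = [line.strip() for line in serialisation.splitlines()]
    return sorted(line for line in lines if line)


def differing_lines(first, second):
    """Return the lines that appear in only one of the two serialisations."""
    return sorted(set(normalise(first)) ^ set(normalise(second)))


first = """
ex:c1 ex:created "2021-09-29" .

ex:c1 ex:cites ex:a .
"""
second = """ex:c1 ex:cites ex:a .
ex:c1 ex:created "2021-09-30" .
"""

print(normalise(first) == normalise(second))
print(differing_lines(first, second))
```
                  </pre>
              </code>
          </div>
          <p>If the software works properly, the only lines reported as different should be the ones carrying the creation date.</p>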
          
          <h2>To do (metaid mapping):</h2>
          <ul>
              <li>In this first phase the bare identifiers can be associated, without the full URLs. For the further steps, the metaid prefix is "060". The formats will be: <strong>https://pubmed.ncbi.nlm.nih.gov/ + pmid</strong>, <strong>https://w3id.org/oc/meta/br/060 + metaid</strong> </li>
              <li>020 and 0160 tell oci where the id comes from: for example NIH, or another provider.</li>
              <li>Metaids will be associated with the citing and cited ids processed in the citations (cnc output). A single pass over each CSV should be enough: keep track of the information already processed.</li>
              <li><strong>Input:</strong> 1) the cnc output containing the citations, 2) the mapping files, 3) a CSV in which to store the metaid mappings; if it already exists, it should be updated. The id type should be recorded somehow: the software must be able to handle both DOIs and PMIDs. Note that cnc has the id_type information.</li>
              <li>The combined doi-pmid-metaid mapping should exploit the information provided in the NIH metadata.</li>
              <li>The RDF must be generated at the end, not updated line by line.</li>

          </ul>
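          <p>A minimal sketch of the single-pass assignment described above, not the final mapping software. It assumes that metaids are consecutive numbers written after the "060" prefix and that the last number used is passed in from outside; assign_metaids, metaid_url and the tuple layout are hypothetical.</p>
          <div class="codebg">
              <code>
                  <pre>
```python
def assign_metaids(citations, mapping, last_metaid):
    """Give a metaid to every id not seen before; return the new counter.

    citations: iterable of (citing, cited, id_type) tuples.
    mapping: dict from (id_type, id) to metaid, updated in place.
    """
    for citing, cited, id_type in citations:
        for identifier in (citing, cited):
            key = (id_type, identifier)
            # Ids already in the mapping keep the metaid they have.
            if key not in mapping:
                last_metaid += 1
                mapping[key] = "060" + str(last_metaid)
    return last_metaid


def metaid_url(metaid):
    """Build the full URL of a metaid that already includes the prefix."""
    return "https://w3id.org/oc/meta/br/" + metaid


mapping = {}
last = assign_metaids([("1", "2", "pmid"), ("3", "1", "pmid")], mapping, 0)
print(last)
print(metaid_url(mapping[("pmid", "3")]))
```
                  </pre>
              </code>
          </div>
          <p>Keeping the id type in the key lets the same table hold both DOIs and PMIDs, and the RDF can be generated from the finished table in one final step.</p>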

    </div>
  </div>
</div>

<div class="card">
    <div class="card-header" id="headingTwenty-five">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseTwenty-five" aria-expanded="false" aria-controls="collapseTwenty-five">
          Week #25 (10-07-21)
        </button>
      </h5>
    </div>
    <div id="collapseTwenty-five" class="collapse" aria-labelledby="headingTwenty-five" data-parent="#accordion">
      <div class="card-body">
          <h2>To do:</h2>
          <ol>
              <li>Find a way to keep track of the fact that an identifier has been changed.</li>
              <li>A metaid cannot be reassigned: it must be marked as deleted.</li>
              <li>Provenance: every entity is accompanied by provenance information, which also keeps track of how that datum was modified over time. Deletion means removing the information about the entity; the entity itself continues to exist. oc_ocdm (a Python library, installable with pip) already handles all of this.</li>
              <li>The misalignment between what we have and what is in Meta is somewhat of a problem. Look at Meta and understand how it works, and at oc_ocdm as well. The point is that the id mappings found through this system must cause something to happen inside Meta: the consequence of an addition must be a change in Meta. The important thing is to have a clear idea of the problem.</li>
              <li>Perhaps create a mapping file in which the mapping information is added, so that oc_ocdm can be exploited to handle everything: a CSV output with the metaid mapping. Use the merge method of oc_ocdm. Another piece of software that could be useful is ocgraph enricher: through SPARQL queries it looks for entities that share the same id and performs the merge itself. If a mapping turns out to be wrong, it is possible to go back to the previous snapshot through Meta, which keeps track of provenance information.</li>
              <li>The last metaid assigned is not necessarily in my file: a shared file containing the last metaid has to be passed as input, and the name of this file must be a mandatory input.</li>
              <li>Look at oc_ocdm more out of curiosity than anything else; for Meta, look at the code passed on by Peroni. Meta is also in the OC GitHub: look at its logic.</li>
          </ol>

    </div>
  </div>
</div>

<div class="card">
    <div class="card-header" id="headingTwenty-six">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseTwenty-six" aria-expanded="false" aria-controls="collapseTwenty-six">
          Week #26 (10-14-21)
        </button>
      </h5>
    </div>
    <div id="collapseTwenty-six" class="collapse" aria-labelledby="headingTwenty-six" data-parent="#accordion">
      <div class="card-body">
          <h2>To do:</h2>
          <ul>
              <li>Complete and correct the code of <strong>mapping.py</strong>, so that two new output files are created: one containing the triples to be removed and one containing the triples to be added. Both files will be in N-Triples format. In this way, in the triplestore, we will first remove the invalidated triples and then load the new ones.</li>
              <li><strong>update.py</strong>: for now it only handles data addition (insertion), since until now the deletion of a datum had never been required. However, data can be deleted through a query, and the triples to be deleted should be very few. These triples will be included in a file passed as a new argument (something like -d): instead of loading them, the new query should delete the specified triples. In this way we will obtain a metaid mapping with all the publications included in OC, without any ambiguity. It is an issue related to the indexes' API, which can query all of them at once. In this way we will be able to recognise the same citation in two different indexes.</li>
              <li><strong>NOCI recap:</strong> cnc.py is ready. The index could already be created now. mapping.py will be used to create the id-metaid mapping in the triplestore, in order to solve the disambiguation problem in the unifying API. At this point we just need to develop the API to query NOCI and the extensions needed to query the unifying API (which covers several indexes). NOCI's API will be developed following the same structure used for COCI. Once all of this is done, only the very last step of the thesis project will remain.</li>
              <li>Read the article about <strong>RAMOSE</strong> in order to understand how COCI's APIs work. RAMOSE is the OC software used for all the APIs. It is a tool that works as a proxy between SPARQL and a normal HTTP request, exposing a REST API. There is a file with the textual specification of how this is handled. For the next meeting: read the article and the GitHub repo, where it is also possible to find some examples of how it works. We need a configuration file for NOCI (noci_v1.hf) which allows the creation of the APIs. In order to do that, it is necessary to have a triplestore somewhere, with the data in it. Create a local Blazegraph: a small amount of data is enough. Once properly configured, RAMOSE will make it possible to test that everything works correctly. Have a look at the article and at the configurations on GitHub.</li>
              <li>Have a look at all the APIs defined for COCI: third link (coci_v1.hf)</li>
          </ul>
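<p>A minimal sketch of how a deletion argument could sit next to the existing loading argument. The real interface of update.py is not shown in these notes, so every flag name here (-s, -i, -d) and the choice of making loading and deletion mutually exclusive are assumptions made only for illustration.</p>

```python
import argparse


def build_parser():
    """Hypothetical command-line interface; the flag names are illustrative only."""
    parser = argparse.ArgumentParser(
        description="Load triples into a triplestore or delete them from it")
    parser.add_argument("-s", "--sparql_endpoint", required=True,
                        help="URL of the SPARQL update endpoint")
    # Assumption: one run either loads or deletes, never both.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("-i", "--input",
                       help="N-Triples file with the triples to add")
    group.add_argument("-d", "--delete",
                       help="N-Triples file with the triples to remove")
    return parser


def planned_operation(args):
    """Return which kind of update the parsed arguments ask for."""
    return "delete" if args.delete else "load"
```

<p>With this layout the removal file would be processed in a first run and the addition file in a second one, which matches the order described above (remove the invalidated triples, then load the new ones).</p>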

    </div>
  </div>
</div>
      
<div class="card">
    <div class="card-header" id="headingTwenty-seven">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseTwenty-seven" aria-expanded="false" aria-controls="collapseTwenty-seven">
          Week #27 (10-20-21)
        </button>
      </h5>
    </div>
    <div id="collapseTwenty-seven" class="collapse" aria-labelledby="headingTwenty-seven" data-parent="#accordion">
      <div class="card-body">
          <h2>How to use RAMOSE (test)</h2>
          <ol>
              <li>Fix the mapping.py code, so as to obtain your own triplestore.</li>
              <li>Set up your Blazegraph with your example data and query it. After that, it will be possible to understand where to modify the COCI code in order to develop the NOCI version. But, in the meantime:</li>
              <li>Get acquainted with RAMOSE. A simplified test configuration was developed for testing purposes and to let new users get acquainted with RAMOSE. The file is called “m0”; it allows RAMOSE to be run from the shell and also to start a web server for making requests from the browser. m0 queries Wikidata, so it can be tried without a local triplestore. This example is in the new, updated and fixed version of RAMOSE. Download the most recent one and run <strong>python ramose.py -s test/test_m0.hf -w localhost:8080</strong>.</li>
              <li>Use of the existing files. Until now, m2 has been the simplest version. However, it needs some extra functions in external Python files. The reason why it is so simple is that it implements one operation only. Another option is to use the COCI API and its configuration file (the one previously linked) and run RAMOSE with: <strong>python ramose.py -s ../api/coci_v1.hf -w localhost:8080</strong>. Note that “-s” specifies the configuration (specification) file of the API, while “-w” specifies the host and port of the local web server. With localhost:8080 you can use RAMOSE in your browser: after copying and pasting the example call after the URL, the request is handled locally and the triplestore is queried. If you press enter without any further specification, the default output format is JSON. When we want to change something, we edit the configuration file. The aim is to understand how RAMOSE works with an existing triplestore: the endpoint has to be specified at the beginning of the configuration file.</li>
              <li>Once we have the NOCI RDF data on a local triplestore, we replace the OC endpoint with that of our local Blazegraph. The various operations must be modified because for now they only accept DOIs as input. The idea is that we need to change the input shape, since it is currently a regex made for matching DOIs rather than PMIDs. Since the pre- and post-processing phases contain some encoding/decoding operations specifically meant for DOIs, in the case of PMIDs we can probably skip both steps. The test will then be done by specifying another endpoint (specific to NOCI, to be tested on localhost:8080). In order to do that, it is necessary to change the base in the COCI file (http://.. etc) to: #base http://localhost:8080, before running RAMOSE. All the execution links are clickable and they go to the specific software.</li>
              <li>In order to implement the main operations for NOCI (all of them except for metadata), simple SPARQL queries will be enough, without any pre- or post-processing. The only operation which will require external functions is metadata. We will save metadata of the links by PubMed. The metadata acquisition will be done on the fly, by making requests to PubMed Central. The procedure will be the same one used when we had to get data such as the ISSN from the txt output given by the APIs. citation, reference, citation count, reference count, citations are the base operations which won’t require external functions. Metadata will be the last one to be implemented.</li>
          </ol>
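<p>The change of input shape discussed above can be illustrated with two regular expressions. This is only a sketch: the two patterns are assumptions chosen for the example (a DOI starting with "10." and a PMID made of digits only) and are not copied from the COCI or NOCI configuration files.</p>

```python
import re

# Assumed shapes, for illustration only.
DOI_SHAPE = re.compile(r"10\..+")        # "10." followed by any suffix
PMID_SHAPE = re.compile(r"[1-9][0-9]*")  # digits only, no leading zero


def matches_doi(value):
    """True if the whole value has the assumed DOI shape."""
    return DOI_SHAPE.fullmatch(value) is not None


def matches_pmid(value):
    """True if the whole value has the assumed PMID shape."""
    return PMID_SHAPE.fullmatch(value) is not None
```

<p>Because a PMID in this shape contains only digits, there is nothing to percent-encode, which is why the DOI-specific encoding and decoding steps can probably be dropped.</p>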

    </div>
  </div>
</div>

<div class="card">
    <div class="card-header" id="headingTwenty-eight">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseTwenty-eight" aria-expanded="false" aria-controls="collapseTwenty-eight">
          Week #28 (10-28-21)
        </button>
      </h5>
    </div>
    <div id="collapseTwenty-eight" class="collapse" aria-labelledby="headingTwenty-eight" data-parent="#accordion">
      <div class="card-body">
          <h2>Last metaid file, structure to correct</h2>
          <p>The part preceding the metaid number must be removed, since another part of the code is structured so as to work with files containing the number only.</p>
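<p>A possible way to keep only the number, sketched under the assumption that the stored value ends with the numeric part and that everything before it can be discarded. The function name and the idea of returning the digits as a string (so that any leading zeros survive) are illustrative choices, not the project's actual code.</p>

```python
import re


def metaid_number(raw_value):
    """Return only the trailing run of digits of a stored metaid value."""
    match = re.search(r"(\d+)\s*$", raw_value)
    if match is None:
        raise ValueError("no trailing number found in: " + repr(raw_value))
    return match.group(1)
```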
          
          <h2>Test the code</h2>
          <p>Create a class which tests all the functionalities of this part of the code. Add a new test file to the directory "tests". For now, test all the existing cases. Later it will be possible to simplify the code and, where appropriate, leave out the unnecessary parts (the if-else branches covering cases that cannot actually occur).</p>
          
          <h2>Installation failure</h2>
          <p>The last attempts failed because all the requirements were being installed at once. Now it works.</p>
          
          <h2>Download Blazegraph and Java</h2>
          <p>Blazegraph is part of AN now. Two distinct versions are maintained separately:</p>
          <ol>
              <li>The open source one</li>
              <li>The internal one</li>
          </ol>
          <p>However, Blazegraph still exists. Find the download link at the end of the page (<a href="https://blazegraph.com">blazegraph.com</a>) and download the file. The file with the extension ".jar" is the triplestore.</p>
          <p>Go to <a href="https://github.com/opencitations/triplestore-index">https://github.com/opencitations/triplestore-index</a> and clone the repository.</p>
          <p>Once you are inside the directory which contains the blazegraph file, call <strong>java -server -Xmx1g -Dbigdata.propertyFile=index.properties -Djetty.port=3001 -Djetty.host=127.0.0.1 -jar blazegraph.jar</strong></p>
          <p>It is advisable to delete the <strong>index.jnl</strong> file, which contains sample triples, in order to avoid confusion. Delete it before running Blazegraph, which recreates it automatically each time it is run.</p>
          <p>This is what we mean when we say that we create a Blazegraph. The crucial point is uploading the triples of the index we want to create. In order to do that, we go to index --&gt; storer --&gt; update. This file allows a file to be uploaded to the triplestore. The ttl files are uploaded to the triplestore with this software, which can be run like this, <strong>python -m index.storer.updatetp -s "http://localhost:3001/blazegraph/sparql" -i (path to the folder containing the ttl files of citation data) -g "https://w3id.org/oc/index/noci/"</strong>, in order to upload the triples to a local triplestore.</p>
          <p>This command takes what it finds in the directory and uploads it to the graph given with -g, which is the graph of the triplestore where the triples are placed.</p>
          <p>It may be a good idea to keep the triples to upload and the triples to remove in separate folders.
              After the first run, every later run of Blazegraph simply restores its last state. At this point, through the Blazegraph interface, you can access your uploaded data. Once you have downloaded Java you can go on and run it from the shell.</p>

          <h2>Next Meeting</h2>
          <p>11-11-21, h. 11.00</p>


    </div>
  </div>
</div>
      
<div class="card">
    <div class="card-header" id="headingTwenty-nine">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseTwenty-nine" aria-expanded="false" aria-controls="collapseTwenty-nine">
          Week #29 (11-11-21)
        </button>
      </h5>
    </div>
    <div id="collapseTwenty-nine" class="collapse" aria-labelledby="headingTwenty-nine" data-parent="#accordion">
      <div class="card-body">
          
      <h2>Correct the blazegraph update error: ExecutionException: org.openrdf.query.MalformedQueryException </h2>
      
      <h2>Make each test for mapping.py independent</h2>


    </div>
  </div>
</div>
      
<div class="card">
    <div class="card-header" id="headingThirty">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseThirty" aria-expanded="false" aria-controls="collapseThirty">
          Week #30 (11-19-21)
        </button>
      </h5>
    </div>
    <div id="collapseThirty" class="collapse" aria-labelledby="headingThirty" data-parent="#accordion">
      <div class="card-body">
          
          <h2>Windows interoperability issues fixed during the meeting:</h2>
      <ul>
          <li>pathlib.Path(abspath(f_n)).as_uri() is used instead of abspath(f_n), so that the path of the file containing the triples to upload does not cause any problems when the code runs on a Windows machine.</li>
          <li>date_str = datetime.now().strftime('%Y-%m-%dT%H%M%S') no longer uses colons as separators, since they caused problems on Windows (colons are not allowed in Windows file names). The update.py code is now interoperable.</li>
          <li>urllib.parse has a function called quote, which escapes all the Unicode characters above ASCII. It is needed only for RDF, and only for DOIs: see how quote is used in CNC. Use quote to obtain a clean encoding. There are some ugly DOIs which need a huge amount of escaping. The other thing, for the DOI</li>
      </ul>
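<p>The three points above can be put together in a small sketch. The calls are standard library functions (os.path.abspath, pathlib.Path.as_uri, datetime.strftime, urllib.parse.quote); the wrapper function names are made up for this example and are not the ones used in update.py or cnc.py.</p>

```python
from datetime import datetime
from os.path import abspath
from pathlib import Path
from urllib.parse import quote


def file_uri(file_name):
    """Absolute path of a file as a file:// URI, valid on Windows and POSIX."""
    return Path(abspath(file_name)).as_uri()


def safe_timestamp(moment=None):
    """Timestamp without colons, so it can be used inside Windows file names."""
    moment = moment or datetime.now()
    return moment.strftime("%Y-%m-%dT%H%M%S")


def escape_doi(doi):
    """Percent-encode the characters of a DOI that are unsafe in an RDF IRI.

    quote() leaves letters, digits, "_", ".", "-", "~" and "/" untouched
    by default and encodes everything else as UTF-8 percent escapes.
    """
    return quote(doi)
```

<p>A DOI containing a space, for instance, comes out of escape_doi with the space replaced by %20, while the slash separating prefix and suffix is kept.</p>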

          <h2>To do:</h2>

      <ol>
          <li>Split the output folders for the csv to rdf function (mapping.py), so that it becomes easier to upload and remove triples.</li>
          <li>Check that, in case a triple should be uploaded and then removed, the extra triple appears neither in the upload file nor in the removal one.</li>
          <li>Create a Glob test file.</li>
          <li>Update update.py with the function for removing triples. If not, ask if you should do it. We need a SPARQL query which does the opposite of loading. It is possible that the only way to do that is to use "DELETE DATA", specifying the triples to remove after having read the file. The query may be too big to be handled as a single input, so it may be necessary to split it into several queries.</li>
          <li>Revise ramose</li>
      </ol>
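<p>One way the splitting of a large removal into several "DELETE DATA" queries could look. This is a sketch under assumptions: the triples are read line by line from an N-Triples file, the function name and the batch_size default are arbitrary, and the queries target the default graph (a named graph would need a GRAPH block around the triples). Sending each query to the endpoint is left out.</p>

```python
def delete_data_queries(triples, batch_size=10000):
    """Yield one SPARQL DELETE DATA query per batch of N-Triples statements.

    `triples` is any iterable of lines, for example an open N-Triples file.
    Empty lines and comment lines are skipped.
    """
    batch = []
    for line in triples:
        statement = line.strip()
        if not statement or statement.startswith("#"):
            continue
        batch.append(statement)
        if len(batch) == batch_size:
            yield "DELETE DATA { " + " ".join(batch) + " }"
            batch = []
    # Flush the last, possibly smaller, batch.
    if batch:
        yield "DELETE DATA { " + " ".join(batch) + " }"
```

<p>Since the function is a generator, only one batch is held in memory at a time, so the size of the removal file does not matter. Note that DELETE DATA does not accept blank nodes, so this approach only fits files made of IRIs and literals.</p>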

    </div>
  </div>
</div>
      
<div class="card">
    <div class="card-header" id="headingThirty-One">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseThirty-One" aria-expanded="false" aria-controls="collapseThirty-One">
          Week #31 (12-03-21)
        </button>
      </h5>
    </div>
    <div id="collapseThirty-One" class="collapse" aria-labelledby="headingThirty-One" data-parent="#accordion">
      <div class="card-body">
      <h2>To do:</h2>
      <ol>
      <li>Check and fix nocimapping.py</li>
      <li>Test the missing functionality of glob1.py for noci (the one associating the orcid to the publication id)</li>
      <li>Generate the concatenated string with all the triples. The problem is that this may not hold up with large files. It makes sense to set a parameter so that the string is written out every x triples, with x configurable. 10,000 could be a good default, but leave the possibility of changing the parameter. Repeat until all the triples of the graph have been processed.</li>
      <li>Test the last thing for glob.</li>
      <li>RAMOSE: the configuration file for NOCI must resemble the one for COCI, but encoding and decoding should not be necessary because the identifier is a numeric string.</li>
      <li>20 January - 21 February: writing. The dataset may or may not arrive in time. What matters is to have, by the 21st and possibly the week before, an expanded table of contents (titles of the chapters and of the subsections of the main chapters, with a three-line paragraph describing what is in each chapter and subsection). The core of the work must be divided into three parts:
        <ul>
          <li>Methodology: it reasons at a general level, not about code but abstractly about how to do things and about all the problems that may arise (possibility of duplication, presence of other indexes, etc.). THE METHODOLOGY MUST NOT BE TECHNICAL/TECHNOLOGICAL.</li>
          <li>Basic implementation: what was implemented, which development technique was used (TDD), how to launch things, what is expected as input and as output. It also includes a functional discussion of how to use index.</li>
          <li>Final chapter explaining some implementation aspects: how various things are done, code.</li>
        </ul>
        In addition, an evaluation part would be nice, along the lines of what Fabio Mariani wrote. It is a matter of timing: with my machine and a reasonable input dataset, how long does it take? How much memory does it use? (Starting from x megabytes of CSV, I obtain y megabytes of CSV, JSON, Scholix, etc.) It is a matter of descriptive statistics, to give a rough idea of how long the runs take. If the dataset arrives, the results could be discussed as well, but this is not certain.
        Literature review: focused on OC; the topic is software to create an index. It is a peculiarity, but indexes do exist. A discussion of the existing databases which contain this information is worthwhile: there are not many, but it is worth doing, with the main focus on what the databases external to ours contain. In the end we will write an article on the indexes, all of them as a whole, after the development of DOCI. The article on the indexes will be based on the thesis.
        Diagrammatic view of all the indexes: standard formats will be needed for this. UML is a standard language for describing the organization of object-oriented code (classes, methods and so on), which is exactly what I did. The diagram I need is the CLASS DIAGRAM: it describes all the classes of the library, including the ones I wrote, and relates them to each other. The PMID example reuses the CSV manager. The UML diagram is for the advanced implementation.</li>
      </ol>
    </div>
  </div>
</div>
      
<div class="card">
    <div class="card-header" id="headingThirty-Two">
      <h5 class="mb-0">
        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapseThirty-Two" aria-expanded="false" aria-controls="collapseThirty-Two">
          Week #32 (12-10-21)
        </button>
      </h5>
    </div>
    <div id="collapseThirty-Two" class="collapse" aria-labelledby="headingThirty-Two" data-parent="#accordion">
      <div class="card-body">
          <ol>
          <li>See Configuration File for NOCI</li>
          <li>Revise Ramose Paper</li>
          </ol>


    </div>
  </div>
</div>

      
      
</div>




  </body>
</html>