Dictionaries: creation from Wikipedia pages
There are at least these ways of creating dictionaries (see https://github.com/petermr/tigr2ess/blob/master/dictionaries/TUTORIAL.md):
- from a list of terms (words and phrases). AMI will then look these up in Wikidata.
- by using SPARQL on Wikidata (see WikiFactMine in the tutorial).
and from Wikipedia:
- from Wikipedia Templates. These are at the bottom of the page and often have a good list of concepts.
- from Wikipedia Categories. These are pages with lists of pages with a common theme. The quality varies widely.
- from Wikipedia Lists. Some pages are titled "List of X" and are generally very useful. Other pages contain in-page lists, which vary in quality.
- from tables in Wikipedia Pages. These are like in-page lists, but may have additional links.
- from Wikipedia Pages. This scrapes all links. There is a lot of noise.
We will use https://en.wikipedia.org/wiki/Medical_procedure . Have a close look before running. This is mainly lists but will need editing.
AMI creates a single dictionary and looks up Wikidata QIDs. This will ALWAYS need editing to remove noise (omit rows) and also remove ambiguities.
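To see why a term lookup can be ambiguous, here is a minimal sketch that queries the public Wikidata search API (`wbsearchentities`) for a term and lists the candidate QIDs. This is not AMI's own lookup code, which is not shown on this page; the function names and the User-Agent string are illustrative assumptions.

```python
import json
import urllib.parse
import urllib.request

API = "https://www.wikidata.org/w/api.php"

def search_url(term, limit=5):
    """Build a wbsearchentities URL for an English-language item search."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": "en",
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return API + "?" + urllib.parse.urlencode(params)

def candidates(response):
    """Extract (QID, description) pairs from a decoded API response."""
    return [(hit["id"], hit.get("description", ""))
            for hit in response.get("search", [])]

def lookup(term):
    """Fetch candidates for a term (needs network access)."""
    request = urllib.request.Request(
        search_url(term),
        headers={"User-Agent": "dictionary-tutorial-sketch/0.1"},  # hypothetical
    )
    with urllib.request.urlopen(request) as resp:
        return candidates(json.load(resp))

# Example (not run here): lookup("Ablation") may return several candidates,
# and choosing among them is where hand-editing comes in.
```

A text search can return more than one item per term, which is one plausible source of the ambiguities mentioned above.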
ami-dictionary has several top-level subcommands: create, display, help, search, translate. We use create. There are several mandatory and optional flags. The command can be put on one line, but we also show it split with \ to make it clearer here.
The complete command on one line:
ami-dictionary create --informat wikipage --input https://en.wikipedia.org/wiki/Medical_procedure --dictionary medproc --directory /Users/pm286/dictionaries --outformats xml
or more readably with line breaks and comments (the comments are annotations only; remove them before running, because a shell does not continue the line when anything follows the \):
ami-dictionary \ # command
create \ # subcommand
--informat wikipage \ # read from Wikipedia page
--input https://en.wikipedia.org/wiki/Medical_procedure \ # full URL of page (note `_` and NOT space)
--dictionary medproc \ # name of dictionary (should be unique, more later)
--directory /Users/pm286/dictionaries \ # where it goes. Again, more later
--outformats xml # output format
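If you prefer to drive the tool from a script, the same command can be assembled as an argument list, which avoids the line-continuation problem entirely. This is a minimal sketch: `build_create_command` is a hypothetical helper, and it uses only the flags shown above.

```python
import shlex

def build_create_command(page_url, dictionary, directory, outformat="xml"):
    """Return the ami-dictionary create command as an argument list."""
    return [
        "ami-dictionary", "create",
        "--informat", "wikipage",      # read from a Wikipedia page
        "--input", page_url,           # full URL of the page
        "--dictionary", dictionary,    # name of the dictionary
        "--directory", directory,      # where the dictionary is written
        "--outformats", outformat,     # output format
    ]

cmd = build_create_command(
    "https://en.wikipedia.org/wiki/Medical_procedure",
    "medproc",
    "/Users/pm286/dictionaries",
)
print(shlex.join(cmd))  # the one-line form of the command

# To actually run it (requires ami-dictionary on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```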
The command
$ ami-dictionary create --informat wikipage --input https://en.wikipedia.org/wiki/Medical_procedure --dictionary medproc --directory /Users/pm286/dictionaries --outformats xml
The input parameters are echoed
Generic values (AMIDictionaryTool)
================================
-v to see generic values
oldstyle true
Specific values (AMIDictionaryTool)
================================
baseUrl https://en.wikipedia.org/wiki
booleanQuery false
descriptions null
dataCols null
dictionary [[medproc]]
dictionaryTop /Users/pm286/dictionaries
hrefCols null
inputs null
input https://en.wikipedia.org/wiki/Medical_procedure
informat wikipage
dictInformat null
linkCol null
log4j null
nameCol null
operation create
outformats [xml]
query 10
search null
searchfile null
splitCol ,
templatea null
termCol null
terms null
termfile null
title null
urlref null
wikiLinks [wikipedia, wikidata]
wptype null
Ignore the following debug lines:
0 [main] DEBUG org.contentmine.ami.tools.AMIDictionaryTool - extracting hyperlinks
0 [main] DEBUG org.contentmine.ami.tools.AMIDictionaryTool - extracting hyperlinks
Diagnostics are printed while each Wikidata item is retrieved. This takes a second or so for each character, i.e. a few minutes in total.
N 171; T 171
.!........!!.........!..........!..!..........!....!!!....!!.!............!.!...!.!....!.!...!..!.!.!!!!.!.!.!.....!........!!!.....!...!!....!...!!..!............!.......++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++>> medproc
>> dict medproc
writing dictionary to /Users/pm286/dictionaries/medproc.xml
I think this is a bug
Missing wikipedia: :
Bloodtest; Cancervaccine; Cardiopulmonaryresuscitation; Celltherapy; Contrast-enhancedultrasound; Diffusion-weightedimaging; Electrical impedancetomography; Electroconvulsivetherapy;
Endoluminal capsulemonitoring; Enzyme replacementtherapy; Esophageal motilitystudy; Extracorporeal membraneoxygenation; Fluid replacementtherapy; Functional magneticresonance imaging; Generalsurgery; Genetherapy;
Gynecologicultrasonography; Handsurgery; Heattherapy; Hormone replacementtherapy; Hyperbaric oxygentherapy; Immunosuppressivetherapy; Insulin potentiationtherapy; Knee cartilagereplacement therapy;
Localanesthesia; Medicalimaging; Monoclonal antibodytherapy; Negative Pressure WoundTherapy; Nicotine replacementtherapy; Obstetricultrasonography; Opiate replacementtherapy; Oxygentherapy;
Phagetherapy; Physical therapy/Physiotherapy; Positron emissiontomography; Protontherapy; Stooltest; Transcutaneouselectrical nerve stimulation; Unsealed sourceradiotherapy; Visiontherapy;
diagnosing;
Missing wikidata: :
; Cancervaccine; Cardiopulmonaryresuscitation; Contrast-enhancedultrasound; Diffusion-weightedimaging; Electrical impedancetomography; Electroconvulsivetherapy; Endoluminal capsulemonitoring;
Enzyme replacementtherapy; Esophageal motilitystudy; Extracorporeal membraneoxygenation; Fluid replacementtherapy; Functional magneticresonance imaging; Gynecologicultrasonography; Heattherapy; Hormone replacementtherapy;
Immunosuppressivetherapy; Insulin potentiationtherapy; Knee cartilagereplacement therapy; Localanesthesia; Medicalimaging; Monoclonal antibodytherapy; Nicotine replacementtherapy; Obstetricultrasonography;
Opiate replacementtherapy; Phagetherapy; Phototerapy; Stooltest; Transcutaneouselectrical nerve stimulation; Unsealed sourceradiotherapy; Visiontherapy; pm286macbook:~
The output in /Users/pm286/dictionaries/medproc.xml is
<?xml version="1.0" encoding="UTF-8"?>
<dictionary title="https://en.wikipedia.org/wiki/Medical_procedure">
<entry term="Ablation" url="/wiki/Ablation" wikidata="Q1806547" name="laser ablation" description="process that removes material from an object by heating it with a laser" id="CM.medproc.0" wikipedia="Ablation"/>
<entry term="Acupuncture" url="/wiki/Acupuncture" wikidata="Q121713" name="acupuncture" description="an alternative medicine practice involving insertion of fine needles" id="CM.medproc.1" wikipedia="Acupuncture"/>
<entry term="Amputation" url="/wiki/Amputation" wikidata="Q477415" name="amputation" description="removal of a body extremity by trauma, prolonged constriction, or surgery" id="CM.medproc.2" wikipedia="Amputation"/>
</dictionary>
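Before hand-editing, it can help to list what the dictionary contains. The sketch below reads entries with Python's standard `xml.etree.ElementTree`; the sample is trimmed from the output above (fewer attributes), and `load_dictionary` and `read_entries` are illustrative names, not part of AMI.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the dictionary shown above (some attributes omitted).
SAMPLE = """<dictionary title="https://en.wikipedia.org/wiki/Medical_procedure">
<entry term="Ablation" wikidata="Q1806547" name="laser ablation" id="CM.medproc.0" wikipedia="Ablation"/>
<entry term="Acupuncture" wikidata="Q121713" name="acupuncture" id="CM.medproc.1" wikipedia="Acupuncture"/>
<entry term="Amputation" wikidata="Q477415" name="amputation" id="CM.medproc.2" wikipedia="Amputation"/>
</dictionary>"""

def load_dictionary(path):
    """Parse a dictionary file and return its root <dictionary> element."""
    return ET.parse(path).getroot()

def read_entries(root):
    """Return (term, wikidata QID) for every <entry>; QID is None if absent."""
    return [(e.get("term"), e.get("wikidata")) for e in root.iter("entry")]

root = ET.fromstring(SAMPLE)
for term, qid in read_entries(root):
    print(term, qid)
```

For the real file, replace `ET.fromstring(SAMPLE)` with `load_dictionary("/Users/pm286/dictionaries/medproc.xml")`.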
I now hand-edit to remove noise. The following entries are removed because they aren't procedures, or because they contain punctuation.
<entry term="blood pressure" url="/wiki/Blood_pressure" wikidata="Q82642" name="blood pressure" description="pressure exerted by circulating blood upon the walls of blood vessels" id="CM.medproc.11" wikipedia="Blood_pressure"/>
<entry term="Physical therapy/Physiotherapy" url="/wiki/Physical_therapy" wikidata="Q186005" name="physiotherapy" description="a health profession that aims to address the illnesses or injuries that limit a person's abilities to function in everyday lives" id="CM.medproc.125" wikipedia="Physical_therapy"/>
<entry term="Screening (medicine)" url="/wiki/Screening_(medicine)" wikidata="Q68422575" name="[Screening medicine]" description="scientific article published on 01 January 1988" id="CM.medproc.140" wikipedia="Screening_(medicine)"/>
<entry term="Targeted therapy" url="/wiki/Targeted_therapy" wikidata="Q492646" name="targeted therapy" description="drug treatment which interacts with or blocks synthesis of specific cellular components, to impede the biochemical dysfunction involved in progression of the disease" id="CM.medproc.150" wikipedia="Targeted_therapy"/>
<entry term="Vital signs" url="/wiki/Vital_signs" wikidata="Q1067560" name="vital signs" description="group of the 4-6 important medical signs that indicate the status of the body’s vital functions" id="CM.medproc.166" wikipedia="Vital_signs"/>
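The punctuation part of this check can be partly automated. The sketch below flags terms containing anything other than letters, digits, whitespace, or hyphens; that definition of "punctuation" is an assumption, and `flag_punctuation` is a hypothetical helper. Deciding whether a term is really a procedure (e.g. "Vital signs") still needs a human.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: hyphens are acceptable inside a term; any other
# non-alphanumeric, non-space character counts as punctuation.
PUNCTUATION = re.compile(r"[^\w\s-]")

SAMPLE = """<dictionary title="https://en.wikipedia.org/wiki/Medical_procedure">
<entry term="Acupuncture" wikipedia="Acupuncture"/>
<entry term="Physical therapy/Physiotherapy" wikipedia="Physical_therapy"/>
<entry term="Screening (medicine)" wikipedia="Screening_(medicine)"/>
<entry term="Vital signs" wikipedia="Vital_signs"/>
</dictionary>"""

def flag_punctuation(root):
    """Return the terms that contain punctuation and so need a manual look."""
    return [
        entry.get("term")
        for entry in root.iter("entry")
        if PUNCTUATION.search(entry.get("term", ""))
    ]

root = ET.fromstring(SAMPLE)
for term in flag_punctuation(root):
    print("check:", term)
```

The function only reports candidates; it does not delete anything, so the final decision stays with the editor.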
Many entries were wrongly looked up in Wikidata as "scientific article". For these, the Wikidata fields were removed but the entry itself was kept.
We should use the "Wikidata item" link on the Wikipedia page instead. Because we currently don't, we have a number of false hits.
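The "scientific article" clean-up can be sketched as follows. Which attributes count as the Wikidata fields is an assumption here (`wikidata`, `name`, `description`), and `strip_article_hits` is a hypothetical helper, not an AMI function.

```python
import xml.etree.ElementTree as ET

# Assumption: these are the attributes filled in from the Wikidata lookup.
WIKIDATA_FIELDS = ("wikidata", "name", "description")

SAMPLE = """<dictionary title="https://en.wikipedia.org/wiki/Medical_procedure">
<entry term="Acupuncture" wikidata="Q121713" name="acupuncture" description="an alternative medicine practice involving insertion of fine needles" wikipedia="Acupuncture"/>
<entry term="Screening (medicine)" wikidata="Q68422575" name="[Screening medicine]" description="scientific article published on 01 January 1988" wikipedia="Screening_(medicine)"/>
</dictionary>"""

def strip_article_hits(root):
    """Remove Wikidata fields from entries matched to a scientific article.

    The entry itself is kept. Returns the terms that were cleaned.
    """
    cleaned = []
    for entry in root.iter("entry"):
        if entry.get("description", "").startswith("scientific article"):
            for field in WIKIDATA_FIELDS:
                entry.attrib.pop(field, None)
            cleaned.append(entry.get("term"))
    return cleaned

root = ET.fromstring(SAMPLE)
print(strip_article_hits(root))

# To save the result:
# ET.ElementTree(root).write("medproc.xml", encoding="UTF-8", xml_declaration=True)
```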
which is the SAME as the filename (without ".xml"). If this isn't there, the search fails completely and silently.