Offshore Leaks as LOD

1 Intro

Day 1: <2016-05-12 Thu>

1.1 Offshore Leaks

The Offshore Leaks database of The International Consortium of Investigative Journalists (ICIJ) is a wonderful resource that allows you to explore the murky world of high-flyer finance and offshore destinations. It includes two sources, identified in the data as “Panama Papers” and “Offshore Leaks”.

The database was released on May 10 (please download it from their torrent link) and on May 12 we decided to explore it using Linked Open Data (LOD).

1.2 The Database

The database is available as 5 CSV files with the following number of records:

>wc -l *.csv
   151128 Addresses.csv
   319422 Entities.csv
    23643 Intermediaries.csv
   345646 Officers.csv
  1269797 all_edges.csv

When parsing it, you should be careful about two aspects:

  • CSV quoting
  • Unicode (UTF-8) encoding
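
As a minimal sketch of both points (not the conversion pipeline described in this README), Python's standard csv module can be used to check that quoted commas and UTF-8 text survive parsing. The sample rows and the read_rows helper are made up for illustration:

```python
import csv
import io


def read_rows(handle):
    """Yield one dict per CSV record, keyed by the header row."""
    # csv.DictReader understands double-quoted fields, so a comma inside
    # quotes stays part of the value instead of starting a new column.
    yield from csv.DictReader(handle)


# Made-up sample held in memory. For a real file, open it explicitly as UTF-8:
#   open("Addresses.csv", encoding="utf-8", newline="")
sample = io.StringIO(
    "address,countries,node_id\n"
    '"1 Main St, Apt 2",Côte d’Ivoire,1\n'
)
rows = list(read_rows(sample))
print(rows[0]["address"])                        # 1 Main St, Apt 2
print(rows[0]["countries"] == "Côte d’Ivoire")   # True
```

Passing encoding="utf-8" explicitly avoids depending on the platform's default encoding.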

The files have the following columns, which you can see with

head -1 *.csv

As we analyze the data, we’ll add more in-depth notes.

1.3 Addresses.csv

| col | notes | example |
|-----+-------+---------|
| address | textual address | #02-01; 14 MOHAMED SULTAN ROAD; SINGAPORE 238963 |
| icij_id | ICIJ guid | 91BB6C910709CE3331D31A89DD97EDAD |
| valid_until | fixed statement | The Panama Papers data is current through 2015 |
| country_codes | ISO alpha3 code | SGP |
| countries | country name | Singapore |
| node_id | unique across all files | 14000015 |
| sourceID | source name | “Panama Papers” or “Offshore Leaks” |

1.4 Entities.csv

Legal bodies such as companies, foundations, trusts…

| col | notes | example |
|-----+-------+---------|
| name | company name | CHEM D-T Corp. |
| original_name | official name | CHEM D-T Corp. EX-CHEM DT Corp. |
| former_name | former name (often empty) | CHEM DT Corp. |
| jurisdiction | where registered (not alpha3 code) | CAYMN |
| jurisdiction_description | where registered (country, US state, …) | Cayman |
| company_type | formal type | “Standard International Company”, “BVI Trust”… |
| address | formal address | GO SHINE MANAGEMENT CO.; LTD. ROOM B; 5F.; NO. 92; SEC. 1NANJING E. RD.; JHONGSHAN DISTRICT; TAIPEI CITY 104; TAIPEI TAIWAN |
| internal_id | ??? | 1000094 |
| incorporation_date | when created | 30-MAR-2004 |
| inactivation_date | when deactivated | 06-NOV-2009 |
| struck_off_date | when removed from register | 15-FEB-2010 |
| dorm_date | when became dormant | |
| status | 29% Active, 29% Defaulted, 7% Dissolved… | Defaulted |
| service_provider | law firm serving the entity | “Mossack Fonseca”, “Portcullis Trustnet” or “Commonwealth Trust Limited” |
| ibcRUC | ??? | 16469 |
| country_codes | where active (alpha3), can be multiple | AUS;BLZ |
| countries | where active (countries), can be multiple | Australia;Belize |
| note | most often empty | |
| valid_until | fixed statement | The Panama Papers data is current through 2015 |
| node_id | unique across all files | 10000018 |
| sourceID | source name | “Panama Papers” or “Offshore Leaks” |

1.5 Intermediaries.csv

Agents that help beneficiaries set up offshore companies.

| col | notes | example |
|-----+-------+---------|
| name | name | SECRETARIAL SERVICES LIMITED |
| internal_id | ??? | 1009 |
| address | address | SECRETARIAL SERVICES LIMITED P.O. BOX 37 ST. ANNE’S HOUSE; VICTORIA STREET ALDERNEY; CHANNEL ISLANDS |
| valid_until | fixed statement | The Panama Papers data is current through 2015 |
| country_codes | where active (alpha3), can be multiple | GGY;GBR |
| countries | where active (countries), can be multiple | Guernsey;United Kingdom |
| status | 46% blank, 30% ACTIVE, 20% SUSPENDED… | SUSPENDED |
| node_id | unique across all files | 11000034 |
| sourceID | source name | “Panama Papers” or “Offshore Leaks” |

1.6 Officers.csv

Agents (people, groups of people, companies) that serve as company officers and beneficiaries, both formal and real

| col | notes | example |
|-----+-------+---------|
| name | name | Wu Chi-Ping and Wu Chou Tsan-Ting |
| icij_id | ICIJ guid | 1B92FDDD451DA8DCA9CD36B0AF797411 |
| valid_until | fixed statement | The Panama Papers data is current through 2015 |
| country_codes | where active (alpha3), can be multiple | TWN |
| countries | where active (countries), can be multiple | Taiwan, Province of China |
| node_id | unique across all files | 12000009 |
| sourceID | source name | “Panama Papers” or “Offshore Leaks” |

1.7 all_edges.csv

Relations between records. Since node_id is unique across files, there’s no need to mention the entity types.

| col | notes |
|-----+-------|
| node_1 | source node |
| rel_type | relation type |
| node_2 | destination node |

1.8 rel_type

The relation type is one of the most interesting fields. The distribution of values is as follows:

| count | rel_type |
|-------+----------|
| 319121 | intermediary of |
| 316472 | registered address |
| 277380 | shareholder of |
| 118589 | Director of |
| 105408 | Shareholder of |
| 46761 | similar name and address as |
| 36318 | Records & Registers of |
| 15151 | beneficiary of |
| 14351 | Secretary of |
| 4031 | Beneficiary of |
| 3146 | same name and registration date as |
| 1847 | Beneficial Owner of |
| 1418 | Trustee of Trust of |
| 1234 | Trust Settlor of |
| 1229 | Authorised Person / Signatory of |
| 1198 | Protector of |
| 1130 | Nominee Shareholder of |
| 960 | same address as |
| 622 | related entity |
| 583 | Assistant Secretary of |
| 409 | Alternate Director of |
| 320 | Co-Trustee of Trust of |
| 281 | Officer of |
| 272 | Resident Director of |
| 207 | Auditor of |
| 173 | Correspondent Addr. of |
| 123 | Bank Signatory of |
| 120 | General Accountant of |
| 101 | Nominated Person of |
| 89 | Legal Advisor of |
| 74 | Reserve Director of |
| 65 | Investment Advisor of |
| 64 | Nominee Director of |
| 48 | Register of Director of |
| 41 | Register of Shareholder of |
| 41 | Joint Settlor of |
| 40 | President of |
| 32 | Auth. Representative of |
| 32 | Appointor of |
| 28 | Owner, director and shareholder of |
| 25 | Beneficial owner of |
| 24 | Nominee Trust Settlor of |
| 20 | Power of Attorney of |
| 18 | Unit Trust Register of |
| 18 | Treasurer of |
| 16 | Owner of |
| 14 | Tax Advisor of |
| 14 | Custodian of |
| 13 | Successor Protector of |
| 11 | Stockbroker of |
| 9 | Power of attorney of |
| 9 | Personal Directorship of |
| 8 | Safekeeping of |
| 8 | Nominee Protector of |
| 7 | Vice President of |
| 7 | Partner of |
| 6 | Director / Shareholder of |
| 6 | Beneficiary, shareholder and director of |
| 5 | Nominee Secretary of |
| 4 | Sole shareholder of |
| 4 | Nominee Beneficial Owner of |
| 4 | Director / Beneficial Owner of |
| 4 | Chairman of |
| 3 | Principal beneficiary of |
| 3 | Member of Foundation Council of |
| 3 | Connected of |
| 2 | Sole signatory of |
| 2 | Signatory of |
| 2 | Nominee Beneficiary of |
| 2 | Director / Shareholder / Beneficial Owner of |
| 2 | Director (Rami Makhlouf) of |
| 2 | Board Representative of |
| 1 | Sole signatory / Beneficial owner of |
| 1 | Shareholder (through Julex Foundation) of |
| 1 | President and director of |
| 1 | President - Director of |
| 1 | Power of Attorney / Shareholder of |
| 1 | Nominee Name of |
| 1 | Nominee Investment Advisor of |
| 1 | Member / Shareholder of |
| 1 | Grantee of a mortgage of |
| 1 | First beneficiary of |
| 1 | Director and shareholder of |
| 1 | Authorized signatory of |
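
Several values in this table differ only in capitalization (eg “shareholder of” and “Shareholder of”, or “Power of Attorney of” and “Power of attorney of”). The following sketch shows one way to tally rel_type values with and without case folding. rel_type_counts is a hypothetical helper and the sample list is made up, not read from all_edges.csv:

```python
from collections import Counter


def rel_type_counts(rel_types, fold_case=False):
    """Count relation types, optionally merging variants that differ only in case."""
    keys = (r.casefold() if fold_case else r for r in rel_types)
    return Counter(keys)


# Made-up sample values, in the style of the rel_type column.
sample = ["shareholder of", "Shareholder of", "Director of", "shareholder of"]

print(rel_type_counts(sample).most_common())
# [('shareholder of', 2), ('Shareholder of', 1), ('Director of', 1)]
print(rel_type_counts(sample, fold_case=True).most_common())
# [('shareholder of', 3), ('director of', 1)]
```

Whether such variants should really be merged is a modelling decision. The sketch only makes the overlap visible.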

2 RDF Conversion

2.1 Date Conversion

The dates in Entities.csv have the form “06-NOV-2009”, but we want to convert them to proper xsd:date, eg “2009-11-06”. We do that with a script ./dates.pl by calling it like

perl dates.pl Entities.csv > Entities-dated.csv

We can find the distribution of years like this:

perl -ne 'print "$1\n" if m{\b[0-9]{2}-[A-Z]{3}-([0-9]{4})\b}' Entities.csv|sort|uniq -c

The most active years were 1999-2009. (There are also 9 invalid dates 1-APR-1001.)
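
The dates.pl script itself is not reproduced here. As a rough Python equivalent (an assumption about what such a script does, not its actual code), the rewrite can be sketched with the same regular expression as above and a fixed month table:

```python
import re

# Month abbreviations as they appear in the data, mapped to month numbers.
MONTHS = {
    name: number
    for number, name in enumerate(
        "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1
    )
}
DATE_RE = re.compile(r"\b([0-9]{2})-([A-Z]{3})-([0-9]{4})\b")


def to_iso(match):
    """Turn one DD-MON-YYYY match into YYYY-MM-DD; leave unknown months alone."""
    day, month, year = match.groups()
    if month not in MONTHS:
        return match.group(0)
    return f"{year}-{MONTHS[month]:02d}-{day}"


def convert_dates(line):
    """Rewrite every DD-MON-YYYY date found in a line of text."""
    return DATE_RE.sub(to_iso, line)


print(convert_dates("CHEM D-T Corp.,30-MAR-2004,06-NOV-2009"))
# CHEM D-T Corp.,2004-03-30,2009-11-06
```

The sketch only rewrites the notation. It does not check that the result is a plausible date, so implausible years would pass through unchanged.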

2.2 Leaks Ontology

First we define all prefixes we use in a single file ./prefixes.ttl, so we won’t have to repeat them many times. In addition to standard prefixes (that you can get from http://prefix.cc/dbr,dbo,dct,rdf,rdfs,skos,owl,xsd.ttl), we also define:

@prefix leak:  <http://data.ontotext.com/resource/leak/>.   # ontology
@prefix leaks: <http://data.ontotext.com/resource/leaks/> . # data

We made an ontology ./leak-ontology.ttl. It has these parts:

  • The prefixes described above
  • A header that describes the ontology itself:
leak: a owl:Ontology;
  rdfs:label "Offshore Leaks ontology";
  rdfs:comment "Describes the ICIJ Offshore Leaks database released on 2016-05-10";
  dct:subject dbr:Offshore_company, dbr:Money_laundering, dbr:Tax_evasion;
  dct:created "2016-05-12"^^xsd:date;
  rdfs:seeAlso
    <https://offshoreleaks.icij.org/>,
    <http://data.ontotext.com/resource/leaks>,
    <https://github.com/Ontotext-AD/leaks>;
  dct:source <https://offshoreleaks.icij.org/pages/database>;
  dct:creator <http://www.ontotext.com>;
  void:sparqlEndpoint <http://data.ontotext.com/sparql>.
  • “Raw” classes and data properties derived directly from the CSVs, eg:
leak:Node a owl:Class;
  rdfs:isDefinedBy leak:;
  rdfs:label "Node";
  rdfs:comment "Any kind of node".

leak:Address a owl:Class;
  rdfs:subClassOf leak:Node;
  rdfs:isDefinedBy leak:;
  rdfs:label "Address";
  rdfs:comment "Address of an entity, intermediary or officer".

leak:address a owl:DatatypeProperty;
  rdfs:isDefinedBy leak:;
  rdfs:label "address";
  rdfs:domain leak:Node;
  rdfs:comment "Textual address".
  • Explicit linking and structuring object properties, eg
leak:hasCountry a owl:ObjectProperty;
  rdfs:isDefinedBy leak:;
  rdfs:label "hasCountry";
  rdfs:domain leak:Node;
  rdfs:range leak:Country;
  rdfs:comment "Country (Countries) of Address, Entity, Intermediary or Officer";
  skos:scopeNote "Obtained by splitting country_codes on ';' and linking".

leak:hasJurisdiction a owl:ObjectProperty;
  rdfs:isDefinedBy leak:;
  rdfs:label "hasJurisdiction";
  rdfs:domain leak:Entity;
  rdfs:range leak:OffshoreJurisdiction;
  rdfs:comment "OffshoreJurisdiction of an Entity".
  • Interpretation object properties, not explicitly present in the CSV files. They are meant to layer further structure based on implicit semantics and inferencing (property generalization).

We make it by concatenating these parts:

cat prefixes.ttl leak.ttl leak-inferred.ttl > leak-ontology.ttl

2.3 tarql

We use tarql (SPARQL processor for Tables) to convert from CSV to Turtle.

2.3.1 tarql Queries

tarql is driven by CONSTRUCT queries. They are fairly straightforward: the columns are mapped to raw data properties of the same name, while the URL is made of a descriptive prefix (eg “address-”) and the node_id:

prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix leak:  <http://data.ontotext.com/resource/leak/>  # ontology
prefix leaks: <http://data.ontotext.com/resource/leaks/> # data

construct {
  ?node a leak:Address;
    leak:address        ?address;
    leak:icij_id        ?icij_id;
    leak:valid_until    ?valid_until;
    leak:country_codes  ?country_codes;
    leak:countries      ?countries;
    leak:node_id        ?node_id;
    leak:sourceID       ?sourceID
  }
from <file:../Addresses.csv#encoding=utf-8>
where {
  bind(uri(concat(str(leaks:),"address-",?node_id)) as ?node)
}

We got ./addresses.rq, ./edges.rq, ./entities.rq, ./intermediaries.rq, ./officers.rq. (These are the only files that include prefixes, since tarql can’t use an extra prefix file.)

2.3.2 tarql Results

The ./addresses.rq query produces Turtle RDF data like this:

leaks:address-14000003
        rdf:type            leak:Address ;
        leak:address        "\"Cantonia\" South Road St Georges Hill Weybridge, Surrey" ;
        leak:icij_id        "240EE44DFB70AF775E6CD02AF8CB889B" ;
        leak:valid_until    "The Panama Papers  data is current through 2015" ;
        leak:country_codes  "GBR" ;
        leak:countries      "United Kingdom" ;
        leak:node_id        "14000003" ;
        leak:sourceID       "Panama Papers" .

The other files are similar. Only edges are a bit different: they use UUIDs, because

  • the same pair <node_1, node_2> may be connected by several edges,
  • yet edges don’t have a unique ID themselves, and tarql’s special variable ?ROWNUM doesn’t work:
leaks:edge-31203a84-a56e-4e2a-8bc6-0921a399b691
        rdf:type       leak:Edge ;
        leak:node_1    "11000001" ;
        leak:rel_type  "intermediary of" ;
        leak:node_2    "10208879" .
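
To illustrate the idea outside tarql, here is a small Python sketch that mints a random UUID per edge row, so that two identical rows still get distinct URLs. The mint_edge_uri helper is hypothetical. Only the leaks: namespace and the “edge-” prefix come from the data above:

```python
import uuid

LEAKS = "http://data.ontotext.com/resource/leaks/"


def mint_edge_uri():
    """Return a fresh edge URL built from a random (version 4) UUID."""
    return f"{LEAKS}edge-{uuid.uuid4()}"


# The same edge row processed twice still yields two different URLs.
first, second = mint_edge_uri(), mint_edge_uri()
print(first.startswith(LEAKS + "edge-"))  # True
print(first != second)                    # True
```

The trade-off is that random UUIDs are not reproducible: re-running the conversion gives every edge a new URL.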

Unicode is handled properly by tarql, eg:

  • Côte d’Ivoire, Curaçao
  • ELÍAS BAYTER MONTENEGRO, MARITZA GARCIA ALCÁNTARA
  • etc

2.3.3 Running tarql

Since the queries designate the input files (assumed to be in a directory one level up), we run tarql simply like this:

tarql addresses.rq      > addresses.ttl
tarql edges.rq          > edges.ttl
tarql entities.rq       > entities.ttl
tarql intermediaries.rq > intermediaries.ttl
tarql officers.rq       > officers.ttl

This easily makes 760 MB of RDF data, so you’d better have a fast disk (SSD). Voila!

tarql skips some rows (unexplained), but the loss is very small. Eg entities.ttl has 319150 entities vs 319421 data rows in Entities.csv, a loss of 0.08%.

2.4 Country Codes

Since the data uses ISO alpha3 country codes, we use those codes to correlate countries to DBpedia.

  • Wikipedia has such a list in the form of a table
  • Geonames has another such list
  • We extracted them to a Google sheet and did a quick check that all codes match (Geonames has 3 more)

The Google sheet almost does what we want, but the first column is a country display name, and not the actual page title

  • Aland Islands !Åland Islands: the first is used for sorting, and the second is the page title
  • Virgin Islands (British) is the display name, but British Virgin Islands is the actual page title

So we wrote a script ./countries-wiki.pl that extracts country links from Wikipedia source (./countries-wiki-source.txt). The result ./countries-wiki.txt looks like this:

ABW	http://dbpedia.org/resource/Aruba
AFG	http://dbpedia.org/resource/Afghanistan
...
XXX	http://dbpedia.org/resource/Undefined

The data uses code “XXX” Undefined, so we’ve added a fake line for it (dbr:Undefined is a disambiguation page, but is good enough to use as a signal value).

It turns out that Addresses.csv has the largest number of country codes (211). We cross-checked, and all codes are covered by Wikipedia (250) and Geonames (252).

We got ./countries-dbpedia.ttl (211) with statements like this:

leak:country-ABW a leak:Country; leak:code "ABW"; leak:name "Aruba";  owl:sameAs dbr:Aruba.
leak:country-AGO a leak:Country; leak:code "AGO"; leak:name "Angola"; owl:sameAs dbr:Angola.
...
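
A sketch of how one line of ./countries-wiki.txt could be turned into such a statement follows. It is not the actual generation script: country_statement is a hypothetical helper, and deriving leak:name from the DBpedia page title is an assumption. It writes the owl:sameAs target as a full IRI rather than a dbr: prefixed name, since some page titles contain characters that are awkward in prefixed names:

```python
DBR = "http://dbpedia.org/resource/"


def country_statement(line):
    """Build one Turtle statement from a 'CODE<TAB>DBpedia URL' line."""
    code, url = line.rstrip("\n").split("\t")
    if not url.startswith(DBR):
        raise ValueError(f"not a DBpedia resource URL: {url}")
    # Assumption: use the page title (underscores as spaces) as the name.
    name = url[len(DBR):].replace("_", " ")
    return (
        f'leak:country-{code} a leak:Country; leak:code "{code}"; '
        f'leak:name "{name}"; owl:sameAs <{url}>.'
    )


print(country_statement("ABW\thttp://dbpedia.org/resource/Aruba"))
# leak:country-ABW a leak:Country; leak:code "ABW"; leak:name "Aruba"; owl:sameAs <http://dbpedia.org/resource/Aruba>.
```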

We also split off ./countries-noleak.ttl (49) with countries that don’t appear as leak:Country (but may appear as leak:OffshoreJurisdiction). It only has coreferences to DBpedia that may be useful in the future, eg:

leak:country-AFG owl:sameAs dbr:Afghanistan.
leak:country-ALA owl:sameAs dbr:Åland_Islands.

2.5 Offshore Jurisdictions

./jurisdictions.ttl includes data about the Offshore Jurisdictions

  • The biggest destination in the Panama Leaks is BVI, but many remain XXX “Undetermined”:
leaks:offshore-BVI    a leak:OffshoreJurisdiction; leak:code "BVI";   leak:name "British Virgin Islands";   skos:exactMatch dbr:British_Virgin_Islands . # 151588
leaks:offshore-XXX    a leak:OffshoreJurisdiction; leak:code "XXX";   leak:name "Undetermined";             skos:exactMatch dbr:Undetermined           . # 55645
  • Many of them are tiny islands and other exotic locations:
leaks:offshore-NIUE   a leak:OffshoreJurisdiction; leak:code "NIUE";  leak:name "Niue";                     skos:exactMatch dbr:Niue                   . # 9611
leaks:offshore-LABUA  a leak:OffshoreJurisdiction; leak:code "LABUA"; leak:name "Labuan";                   skos:exactMatch dbr:Labuan                 . # 421
  • Some are not countries but parts thereof (eg a US state and a UAE emirate):
leaks:offshore-WYO    a leak:OffshoreJurisdiction; leak:code "WYO";   leak:name "Wyoming";                  skos:exactMatch dbr:Wyoming                . # 37
leaks:offshore-RAK    a leak:OffshoreJurisdiction; leak:code "RAK";   leak:name "Ras Al Khaimah";           skos:exactMatch dbr:Ras_al-Khaimah         . # 2

Notably, Luxembourg is missing from the list (see Luxembourg Leaks).

2.6 Data Model

To enrich and use the RDF data efficiently, it’s important to understand how it is laid out, i.e. the data model (or, as it is currently called, the RDF Shape).

Ontotext has developed a tool rdfpuml that creates precise diagrams from actual Turtle. See “Making True RDF Diagrams With rdfpuml”: presentation or continuous HTML.

We made a sample ./model.ttl that describes a few entities, Edges between them, and the associated Countries and Offshore jurisdictions. We generated the following diagram directly from it:

./model.png

We’ll keep enriching the diagram as we add more inferences. Stay tuned.

2.7 Day1 Recap

And looked Onto upon the land, and saw that it was good:

  • CSVs parsed good, the devilish comma betwixt data divined right
  • UTFs looketh right
  • tarql worketh fastly and loseth nearly nought data (0.08%)
  • 760 million ducats of RDF spilt forth
  • Prefixes unified and registered as http://prefix.cc/leak
  • Ontology described by the VOID, and shalt be registered in the LOV (see LOV announcement)
  • Data model lucid and clear
  • Countries and Offshores hast connexion to DBpedia

And there was evening (actually well past midnight), and there was morning–the first day.

3 Inferencing

Day 2: <2016-05-13 Fri> What shall we do today? How about inferring some new data from the basic RDF?

3.1 Linking Countries and Offshore Jurisdictions

In the original data, countries and jurisdictions are represented with codes (eg “AUS;BLZ” for 2 countries and “CAYMN” for 1 offshore destination). It’s easier to query the data if these are made into explicit links, especially if one wants to explore hierarchical links (eg Entities active in Eastern Europe countries).

So we created UPDATE queries ./countries-link.ru, ./jurisdictions-link.ru to make links hasCountry and hasJurisdiction respectively. The first query is more complex since there can be several codes in country_codes (separated with ;):

insert {
  graph leaks:countries-link {
    ?node leak:hasCountry ?country
  }
} where {
  ?node leak:country_codes ?codes.
  ?country a leak:Country; leak:countryCode ?code.
  filter(contains(?codes,?code))
}
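
The contains() test is safe here because every code has exactly three letters and the separator is “;”, so a three-letter substring can never straddle two codes. The equivalent explicit split can be sketched in Python (country_links and contains_links are hypothetical helpers and “n1” is a placeholder node name):

```python
def country_links(node, country_codes, known_codes):
    """Pair a node with each known code in a ';'-separated field."""
    return [(node, code) for code in country_codes.split(";") if code in known_codes]


def contains_links(node, country_codes, known_codes):
    """Same pairing via a substring test, mirroring filter(contains(...))."""
    return [(node, code) for code in sorted(known_codes) if code in country_codes]


known = {"AUS", "BLZ", "GBR"}
print(country_links("n1", "AUS;BLZ", known))
# [('n1', 'AUS'), ('n1', 'BLZ')]
print(sorted(country_links("n1", "AUS;BLZ", known)) == contains_links("n1", "AUS;BLZ", known))
# True
```

With codes of varying length (as in the jurisdiction codes) the substring shortcut would no longer be safe, and an explicit split would be the more careful choice.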

3.2 Linking Entities

The INSERT query ./edges-link.ru makes explicit connections hasSource and hasTarget for every Edge:

insert {
    graph leaks:edges-link {
      ?edge leak:hasSource ?src; leak:hasTarget ?trg
    }
} where {
  ?edge leak:node_1 ?src_id;
        leak:node_2 ?trg_id.
  ?src leak:node_id ?src_id.
  ?trg leak:node_id ?trg_id.
}
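
The same join can be pictured as a dictionary lookup. In this Python sketch (an illustration, not part of the pipeline) the node URLs are hypothetical, while the edge values are taken from the sample edge shown earlier:

```python
def link_edges(nodes_by_id, edges):
    """Resolve node_1/node_2 ids to node URLs, skipping edges with unknown ids."""
    linked = []
    for edge in edges:
        src = nodes_by_id.get(edge["node_1"])
        trg = nodes_by_id.get(edge["node_2"])
        if src is None or trg is None:
            # Like the SPARQL join, an id without a matching node yields no link.
            continue
        linked.append((src, edge["rel_type"], trg))
    return linked


# Hypothetical node URLs keyed by node_id.
nodes_by_id = {
    "11000001": "leaks:intermediary-11000001",
    "10208879": "leaks:entity-10208879",
}
edges = [
    {"node_1": "11000001", "rel_type": "intermediary of", "node_2": "10208879"},
    {"node_1": "11000001", "rel_type": "intermediary of", "node_2": "99999999"},
]
print(link_edges(nodes_by_id, edges))
# [('leaks:intermediary-11000001', 'intermediary of', 'leaks:entity-10208879')]
```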

We also made another INSERT query ./edges-specific.ru that converts the rel_type literals listed in section 1.8 (rel_type) into similarly-named relations:

insert {
  graph leaks:specific-relations {
    ?src ?rel ?trg
  }
} where {
  values (?rel_type ?rel) {
    ("Alternate Director of"  leak:isAlternateDirectorOf)
    ("Appointor of"           leak:isAppointorOf)
    ("Assistant Secretary of" leak:isAssistantSecretaryOf)
    ...
  }
  ?edge leak:hasSource ?src;
        leak:hasTarget ?trg;
        leak:rel_type  ?rel_type .
}
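
The property names follow a fairly regular pattern, which the sketch below tries to capture: drop punctuation, capitalize each word, and prefix “is” when the relation ends in “of”. This is a guess at the naming rule based on the names used in this README, not the way the values list was actually produced. rel_type_to_property and OVERRIDES are hypothetical:

```python
import re

# Names that do not follow the general pattern.
OVERRIDES = {"registered address": "hasRegisteredAddress"}


def rel_type_to_property(rel_type):
    """Derive a property local name from a raw rel_type literal."""
    if rel_type in OVERRIDES:
        return OVERRIDES[rel_type]
    # Keep letters, digits, hyphens and spaces; everything else separates words.
    cleaned = re.sub(r"[^A-Za-z0-9 -]+", " ", rel_type)
    words = [w for w in cleaned.split() if any(c.isalnum() for c in w)]
    if not words:
        raise ValueError(f"no usable words in rel_type: {rel_type!r}")
    camel = "".join(w[0].upper() + w[1:] for w in words)
    if words[-1].lower() == "of":
        return "is" + camel
    return camel[0].lower() + camel[1:]


print(rel_type_to_property("Alternate Director of"))             # isAlternateDirectorOf
print(rel_type_to_property("Authorised Person / Signatory of"))  # isAuthorisedPersonSignatoryOf
print(rel_type_to_property("similar name and address as"))       # similarNameAndAddressAs
```

A generated mapping like this would still need a manual review, which an explicit values list gives for free.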

3.3 Relation Hierarchy

The raw rel_types mentioned in the previous section are hard to understand or query:

  • There are a lot of them (84)
  • Some are very similar, eg “Shareholder (through Julex Foundation) of” and “Shareholder of”
  • Some combine several roles in one raw relation, eg “Director / Shareholder / Beneficial Owner of”

We have therefore created a hierarchy of properties in ./leak-ontology.ttl that groups similar relations together, allowing easier querying.

Raw relations are in camelCase and inferred (“cooked”) relations are in UPPERCASE. The hierarchy goes something like this, and is subject to change. ... indicates there are more raw relations that are skipped for brevity:

hasRegisteredAddress
isIntermediaryOf
RELATED
  relatedEntity
  similarNameAndAddressAs ...
  SAME
    sameNameAndRegistrationDateAs
RELATED_AGENT
  OWNER
    isBeneficialOwnerOf
    isNomineeBeneficialOwnerOf
    isBeneficiaryShareholderAndDirectorOf (1) ...
    REAL_OWNER (3)
  AGENT_OF
    OFFICER
      isOfficerOf
      EXECUTIVE
        isPresidentOf
        isVicePresidentOf
        isPresidentAndDirectorOf
        isTrusteeOfTrustOf
        isCo-TrusteeOfTrustOf ...
    SERVICE_PROVIDER
      isAppointorOf
      isAuditorOf
      isSecretaryOf
      isGranteeOfAMortgageOf
      AUTHORIZED_REPRESENTATIVE
        isAuthRepresentativeOf
        isAuthorisedPersonSignatoryOf
        isBankSignatoryOf ...
    DIRECTOR
      isDirectorOf
      isBeneficiaryShareholderAndDirectorOf (1)
      isDirectorAndShareholderOf
      isMemberOfFoundationCouncilOf
      isNomineeDirectorOf (2) ...
    NOMINEE
      isNominatedPersonOf
      isNomineeDirectorOf
      isNomineeBeneficialOwnerOf (2) ...

Notes:

  1. Combined raw relations (eg isDirectorShareholderBeneficialOwnerOf) appear in several branches, thus contributing to several cooked relations (eg in this case DIRECTOR, OWNER)
  2. NOMINEE is a sort of flag, eg a DIRECTOR can be a real director, or NOMINEE director
  3. Although we distinguish REAL_OWNER as a sub-prop of OWNER, we don’t yet have any instances of it. Indeed the essence of investigative work is to find out the real owner.
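
Property generalization can be sketched as a walk up the sub-property links. The Python below hard-codes a small fragment of the hierarchy, read off the outline above (which is subject to change), and collects every cooked relation implied by a raw one. SUB_PROPERTY_OF and super_properties are illustrative names, not part of the ontology:

```python
# Fragment of the hierarchy: property -> its direct parents.
SUB_PROPERTY_OF = {
    "isBeneficiaryShareholderAndDirectorOf": {"OWNER", "DIRECTOR"},
    "isDirectorOf": {"DIRECTOR"},
    "OWNER": {"RELATED_AGENT"},
    "DIRECTOR": {"AGENT_OF"},
    "AGENT_OF": {"RELATED_AGENT"},
}


def super_properties(prop, hierarchy=SUB_PROPERTY_OF):
    """Return all direct and indirect parents of a property."""
    seen = set()
    stack = [prop]
    while stack:
        for parent in hierarchy.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(super_properties("isBeneficiaryShareholderAndDirectorOf")))
# ['AGENT_OF', 'DIRECTOR', 'OWNER', 'RELATED_AGENT']
print(sorted(super_properties("isDirectorOf")))
# ['AGENT_OF', 'DIRECTOR', 'RELATED_AGENT']
```

In the repository itself this work is left to the inferencer. The sketch only shows why a combined raw relation ends up under several cooked relations.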

3.4 Geonames Place Hierarchy

We want to correlate countries to Geonames (in addition to DBpedia), in order to use the gn:parentFeature hierarchy to group countries by region (eg Eastern Europe) and continent (eg Europe).

We use the Ontotext endpoint http://ff-news.ontotext.com/sparql that has DBpedia and Geonames integrated with owl:sameAs statements between these datasets. The following query returns the places (gn:Feature) above country (gn:A.PCLI):

PREFIX gn: <http://www.geonames.org/ontology#>
PREFIX onto: <http://www.ontotext.com/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select distinct ?x from onto:disable-sameAs {
    ?x a gn:Feature; rdfs:label ?name; gn:featureCode ?feat.
    filter not exists {?x gn:featureCode gn:A.PCLI}
    filter exists {?y gn:featureCode gn:A.PCLI; gn:parentFeature ?x}
}

We filter out gn:A.PCLI itself, because there are some mistakes (eg dbr:Barbados is parent of itself).

The result is as follows:

dbr:Africa
dbr:Arabian_Peninsula
dbr:Asia
dbr:Australia_and_New_Zealand
dbr:Caribbean
dbr:Central_Asia
dbr:Earth
dbr:Eastern_Africa
dbr:Eastern_Asia
dbr:Eastern_Europe
dbr:Europe
dbr:European_Free_Trade_Association
dbr:La_Habana_Province
dbr:Maghreb
dbr:Melanesia
dbr:Micronesia
dbr:Middle_Africa
dbr:North_America
dbr:Northern_Africa
dbr:Northern_Europe
dbr:Oceania
dbr:Polynesia
dbr:South_Eastern_Asia
dbr:Southern_Africa
dbr:Southern_Asia
dbr:Southern_Europe
dbr:W_National_Park
dbr:Western_Africa
dbr:Western_Europe
  • Maghreb is a region of Northwest Africa that includes: Algeria, Morocco, Tunisia
  • W_National_Park is a major trans-national park in West Africa that includes areas of: Niger, Benin, Burkina Faso
  • La_Habana_Province is a mistake in Geonames: the small village America in that province is made parent of South_America and North_America: we’ve replaced it with dbr:Americas

3.5 Geonames Data

We use the following query ./geonames-top-level.rq to extract places at the level of country or above, and the following attributes (geonames-top-level.ttl):

  • URL in the dbr: namespace, eg dbr:Europe
  • gn:name: official name
  • dbo:abstract: description
  • gn:featureCode: place type(s), eg A.PCLI (independent country), L.CONT (continent), L.RGN (region)
  • gn:parentFeature: ancestor places
  • wgs:lat, wgs:long: geographic coordinates
TODO geonames-top-level.rq

3.6 Linking to Source

We make links back to the source (https://offshoreleaks.icij.org) in order to give credit where credit is due, and to allow easy inspection of the ICIJ interactive graphs (./seeAlso.ru):

insert {
  graph leaks:seeAlso {
    ?node rdfs:seeAlso ?icij_org
  }
} where {
  ?node leak:node_id ?node_id
  bind(iri(concat("https://offshoreleaks.icij.org/nodes/",?node_id)) as ?icij_org)
}

3.7 Data Loading Stats

./leaks-load.xlsx includes some stats on loading the data and inferencing. Here is a screenshot, though it is not up to date:

./leaks-load.png

3.8 Day2 Recap

Brushed Onto the sweat from its weary brow, and looketh at the fruit of its day’s work:

  • Relations between Nodes made
  • Relations of entities grouped in an interpretive hierarchy
  • Links back to the source (https://offshoreleaks.icij.org) added
  • Hierarchy above countries obtained from Geonames

A Leaks dataset is borne. Go forth and queriest! Whence did money came from, and whither did it flow?

And there was evening, and there was morning–the second day.

4 TODO Further Ideas

This is a parking place for stuff to do in the future:

  • Network analysis
  • leak:Officer including the word “BEARER” (and variations) should be marked specially as “bearer shares”. These are essentially anonymous shareholders or beneficiaries, often used for money laundering. Most countries have banned registration of Entities with bearer shares
  • Addresses: there are literals Entity.address and Intermediary.address, and also link hasRegisteredAddress to the separate class Address, which has literal Address.address. What nodes have hasRegisteredAddress? What is the relation between these?
  • Look for Entities in the FIBO LEI database?
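
As a starting point for the bearer-shares idea, a case-insensitive word match could flag candidate Officer names. This is only a sketch: is_bearer is a hypothetical helper, the variations that actually occur in the data are not listed here, and the sample names other than the one from the Officers table are made up:

```python
import re

# Matches "bearer" or "bearers" as a whole word, in any letter case.
BEARER_RE = re.compile(r"\bbearers?\b", re.IGNORECASE)


def is_bearer(name):
    """Return True if an officer name looks like a bearer-share placeholder."""
    return bool(BEARER_RE.search(name))


print(is_bearer("THE BEARER"))                         # True
print(is_bearer("Wu Chi-Ping and Wu Chou Tsan-Ting"))  # False
print(is_bearer("Forbearer Holdings"))                 # False
```

Variations in other languages or with typos would need additional patterns, ideally collected from the data itself.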

You can also suggest what we should do: hither on Github or thither on Gitter