Rule engine for data transformation #73
Comments
@sabrycito @jzwerg as discussed, we can start by documenting some examples of taxonomy mappings with different degrees of complexity and attach them to this issue.
@caldeirav well noted - @sabrycito will share the available taxonomy mappings today, together with the documentation explaining our methodology and findings. These taxonomies will include:
We are currently working on adding 1 global framework (TCFD), 1 global standard (ISSB), and 1 regulatory standard (EET), which we will share as well in the coming weeks. We also encourage the broader OS-C community to contribute to this work stream with additional global, regulatory, exchange, or industry standards.
As discussed during yesterday's meeting, please find the presentation attached.
Upon initial analysis we propose two possible approaches for rule-based integration to be demonstrated in a POC:
We have decided with @sabrycito and @jzwerg to prototype the first solution with DBT based on data inputs / outputs provided by U-Reg.
@sabrycito / @jzwerg Could you provide some use cases for us to prototype this, please? Ideally, let's keep this information as a link shared in this issue.
- Case 0-0: As a user, I want to understand the equivalence between "Disclosure A: 'report the percentage of total employees covered by collective bargaining agreements'" and "Disclosure B: 'Percentage of active workforce covered under collective bargaining agreements'", recognizing that they represent identical semantics expressed with different linguistic constructions.
- Case 1-2 (and 2-1):
  - As a user, I need to compare "Disclosure A: '(1) Total energy consumed, (2) percentage heavy fuel oil, (3) percentage renewable'" with "Disclosure B: 'Total energy consumption'", noting that A provides all necessary data for B, but B only partially satisfies A without the heavy fuel oil and renewable components.
  - As a user, I want to understand how to compute the data in "Disclosure B: 'Total number of employees, percentage contractors'" from "Disclosure A: 'Total workforce by gender, employment type (for example, full- or part-time), age group and geographical region'", and recognize that while B can be derived from A, A requires additional information not present in B.
- Case 1-3 (and 3-1):
  - As a user, I am interested in understanding how "Disclosure A: 'Risks and opportunities posed by climate change that have the potential to generate substantive changes in operations, revenue, or expenditure, including a description of the risk or opportunity and its classification as either physical, regulatory, or other'" compares to "Disclosure B: 'TRANSITION RISK: Has the customer faced or expected to face any impact from policy and regulation risk?'", and how the response to B does not provide insight for A, yet A's information can be used for B.
  - Similarly, how "Disclosure A: 'Description of efforts in solar energy system project development to address community and ecological impacts'" relates to "Disclosure B: 'Focus areas of contribution (e.g. education, environmental concerns, labour needs, health, culture, sport)'", noting that B does not aid in fulfilling A.
- Case 2-2:
  - As a user, I want to compare "Disclosure A: 'report the total number of employees, and a breakdown of this total by gender and by region'" with "Disclosure B: 'Number of employees, number of truck drivers'", acknowledging that A to B lacks information on truck drivers, and B to A lacks the breakdown by gender and region.
  - As a user, I aim to understand how "Disclosure A: 'Total fuel consumption within the organization from renewable sources, in joules or multiples, and including fuel types used.'" compares to "Disclosure B: 'Fleet fuel consumed, percentage renewable'", recognizing the need for computation to reconcile the differences in the data provided by each disclosure.
- Case 2-3 (and 3-2): As a user, I need to understand the relationship between "Disclosure A: 'describe the process for designing its remuneration policies and for determining remuneration, including whether remuneration consultants are involved in determining remuneration and, if so, whether they are independent of the organization, its highest governance body and senior executives'" and "Disclosure B: 'Information on: the policies relating to compensation and dismissal, recruitment and promotion, working hours, rest periods, equal opportunity, diversity, anti-discrimination, and other benefits and welfare.'", acknowledging that A provides some but not all the information in B, and vice versa.
- Case 3-3: As a user, I am interested in comparing "Disclosure A: 'Discussion of engagement processes and due diligence practices with respect to human rights, indigenous rights, and operation in areas of conflict'" with "Disclosure B: 'describe its specific policy commitment to respect human rights, including the internationally recognized human rights that the commitment covers'", recognizing that both involve human rights but focus on different aspects, leading to a possible but not guaranteed overlap.
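The case numbers above read like a pair of directional levels (how well A covers B, then how well B covers A). A minimal sketch of that reading follows; the level names and the helper functions are assumptions made for illustration and do not come from the U-Reg methodology.

```python
from enum import IntEnum


class Coverage(IntEnum):
    """Hypothetical names for the four levels implied by the case numbers."""
    EQUIVALENT = 0  # same meaning, different wording (Case 0-0)
    FULL = 1        # source provides everything the target asks for
    PARTIAL = 2     # source provides some of what the target asks for
    UNCERTAIN = 3   # little or no reliable overlap


def parse_case(code: str) -> tuple[Coverage, Coverage]:
    """Split a case code such as "1-2" into (A-to-B, B-to-A) levels."""
    a_to_b, b_to_a = (Coverage(int(part)) for part in code.split("-"))
    return a_to_b, b_to_a


def mirror(code: str) -> str:
    """Return the same case seen from B's side, e.g. "1-2" becomes "2-1"."""
    a_to_b, b_to_a = parse_case(code)
    return f"{int(b_to_a)}-{int(a_to_b)}"
```

Under this reading, `parse_case("1-2")` describes a pair where A fully covers B while B only partially covers A, and `mirror` explains why each asymmetric case is listed with its counterpart, such as "1-2 (and 2-1)".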
Hi Vincent,

In addition to the methodology above, you can find below our implementation and observations. I have attached the SASB, GRI, and EDCI examples for reference.

Implementation

- U-Reg endeavoured to group these disclosures into clusters with high equivalence probability. Using an NLP-based approach, we employed a domain-specific BERT model (BERT being a language model originally developed at Google) for text mining in the sustainable investing sector. This resulted in our ESG disclosures being classified into 26 distinct labels. (see "4. Exploration" below)
- U-Reg adopted a multipronged approach for creating equivalence pairings, in alignment with our corporate philosophy to leverage open-source technologies whenever feasible.
- U-Reg formulated user stories to define 4 levels of equivalence. (refer to Sabry's paragraph above)
- U-Reg is working on algorithms to identify potentially equivalent disclosures, and on sentence similarity algorithms to assess similarity between disclosure pairs on a scale of 0 to 1. (see "4. Exploration" below)
- The culmination of these efforts will be an all-encompassing database (i.e., master model), each entry comprising the source and target framework, source and target disclosure code, and equivalence level. (see "3. Equivalence" below)

1. Original ESG Standards
2. Structured ESG Standards
3. Equivalence
4. Exploration

Observations

Given the unique nature of our project, we acknowledge that the data available for testing our algorithms is finite: 7,000 data points might not be massive in traditional machine learning, but for our context it is significant. Data augmentation through the creation of random ESG disclosures presents challenges, as does sourcing external equivalence data for testing our models, primarily because we are pioneering ESG equivalence and fellow companies rarely share their results openly. Nevertheless, U-Reg has employed this training set to test the developed technology.

By comparing the results of our developed solutions with our training set, we can ascertain the effectiveness of our models. We opted for technologies

Moreover, as less than 1% of disclosure pairs are truly equivalent, we've decided not to rely solely on accuracy as a success criterion. Instead, we've chosen to prioritize the reduction of false negatives over false positives. It is better to flag some non-equivalent disclosure pairs and eliminate them manually than to miss pairs that are indeed equivalent. For this reason, U-Reg prioritizes the Recall score over the Precision score in our algorithm evaluations. We've placed greater weight on the identification of true equivalences than of non-equivalences in our machine learning explorations. This decision will, we believe, ultimately prove the effectiveness of our solution.
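To illustrate why accuracy alone is a weak criterion when so few pairs are equivalent, here is a small sketch that computes accuracy, precision, and recall from confusion-matrix counts. All counts below are hypothetical and chosen only so that fewer than 1% of pairs are positive; they are not U-Reg results.

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        # Guard against division by zero when nothing is predicted positive.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical data set: 7,000 labelled pairs, 50 of them truly equivalent.
# A model that never predicts "equivalent" misses every true pair.
always_no = confusion_metrics(tp=0, fp=0, fn=50, tn=6950)

# A recall-oriented model that over-flags pairs for manual review.
recall_first = confusion_metrics(tp=45, fp=300, fn=5, tn=6650)

for name, scores in [("always_no", always_no), ("recall_first", recall_first)]:
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

With these made-up counts, the model that never finds an equivalence still scores higher on accuracy than the recall-oriented one, even though its recall is zero. That is the situation the comment above describes, and it is why false positives that can be removed by hand are treated as the cheaper error.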
Taxonomies for data should ideally be maintained in a standard format that supports building rules for data equivalence between different data formats. This would be particularly useful for mapping ESG taxonomies. Without the ability to maintain mappings in a one-dimensional format, cross-mappings require a lot of maintenance, for example:
https://github.com/OS-SFT/Taxonomy-Mappings-Library
This issue is to investigate a better way to maintain mappings in order to support the taxonomy equivalence project run within OS-Climate.
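One possible reading of a "one-dimensional" mapping format is a single flat table with one row per directional mapping, using the fields mentioned in the comments (source and target framework, source and target disclosure code, equivalence level). The sketch below assumes that reading; the class name, the lookup helper, and the framework and disclosure codes are placeholders, not real identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """One row of a flat mapping table (field names are assumptions)."""
    source_framework: str
    source_code: str
    target_framework: str
    target_code: str
    level: int  # equivalence level; 0 to 3 in the use cases in this thread


def find_targets(rows: list[Mapping], framework: str, code: str,
                 max_level: int = 3) -> list[Mapping]:
    """Return mappings from one disclosure, lowest level first."""
    hits = [r for r in rows
            if r.source_framework == framework
            and r.source_code == code
            and r.level <= max_level]
    return sorted(hits, key=lambda r: r.level)


# Placeholder rows, not real disclosure identifiers.
rows = [
    Mapping("FRAMEWORK_A", "A-001", "FRAMEWORK_B", "B-010", 1),
    Mapping("FRAMEWORK_B", "B-010", "FRAMEWORK_A", "A-001", 2),
    Mapping("FRAMEWORK_A", "A-001", "FRAMEWORK_C", "C-007", 0),
]

for row in find_targets(rows, "FRAMEWORK_A", "A-001"):
    print(row.target_framework, row.target_code, row.level)
```

Keeping every framework pair in the same table means a new framework adds rows rather than a new cross-mapping file, and a rule engine can filter by `level` to decide which mappings are safe to apply automatically.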
@MichaelTiemannOSC