Using npl_publn_id to merge PatCit to PATSTAT ??? #43

Open
kylehigham opened this issue Apr 27, 2021 · 1 comment

Comments

@kylehigham

I am trying to unnest the front-page npl data on BigQuery so I can merge it into PATSTAT data in Stata. However, both the npl_publn_id and the cited_by variables are repeated. Why is this necessary?

@cverluise changed the title from "Merging PatCit to PATSTAT" to "Using npl_publn_id to merge PatCit to PATSTAT ???" on Apr 27, 2021
@cverluise
Owner

Hello @kylehigham,

To give a bit of context, such a merge is mainly used to obtain the full citation text stored in PATSTAT, which we cannot release in PatCit because it is proprietary.

Why it is not straightforward

The PATSTAT npl_publn_id is not stable from one version to the next (see #17). Merging on npl_publn_id should thus be conditional on using the exact same version of PATSTAT. In our case, that is PATSTAT 2018b (this might change, please check the latest version of the documentation).

All you need is patcit_id

That is why we created the patcit_id: it is the MD5 hash of the raw npl_biblio (i.e. citation) text. Hence, it can be used across PATSTAT versions, while npl_publn_id cannot. A positive side effect is that the number of duplicated citations dropped dramatically: from 40 million rows in the PATSTAT table, we obtained 27 million unique patcit_id (!). We are aware of discussions at the EPO about implementing a similar solution directly in PATSTAT, but we don't know if it's still on their roadmap.
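To illustrate why a text-based hash is stable across versions and collapses duplicates, here is a minimal sketch. It assumes the ID is the hex MD5 digest of the UTF-8 encoded citation string; `text_md5` is a hypothetical stand-in, not the actual `get_md5()` from the repo (which may normalize or encode the text differently, so please check the source). The rows are made up.

```python
import hashlib


def text_md5(text: str) -> str:
    """Hypothetical stand-in for get_md5(): hex MD5 of the UTF-8 bytes (assumption)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Made-up (npl_publn_id, npl_biblio) rows from two imaginary PATSTAT versions.
# The same citation text gets different npl_publn_id values in each version.
version_a = [
    (101, "Smith J., Nature, 1999"),
    (102, "Smith J., Nature, 1999"),  # duplicated citation text
    (103, "Doe A., Science, 2005"),
]
version_b = [
    (9001, "Doe A., Science, 2005"),
    (9002, "Smith J., Nature, 1999"),
]

# Hashing the text ignores the version-specific npl_publn_id entirely.
ids_a = {text_md5(biblio) for _, biblio in version_a}
ids_b = {text_md5(biblio) for _, biblio in version_b}

print(len(version_a), "rows ->", len(ids_a), "unique ids")
print("same ids in both versions:", ids_a == ids_b)
```

Identical citation strings always map to the same ID, which is what removes duplicates. Strings that differ by even one character map to different IDs, so the hash has to be computed on exactly the same raw text as in PatCit.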

In practice

In short, to merge PatCit with PATSTAT, here are the solutions:

  1. Use the same PATSTAT version as we did (check latest doc) and merge on npl_publn_id.
  2. Use another version of PATSTAT, but create the patcit_id. To do so, simply generate the patcit_id using the get_md5() function defined in patcit/serialize/main.py (skip the async prefix in your case) in your version of PATSTAT. Then, you can merge the two datasets on the patcit_id. There you are.
  3. [Future?] Ideally, we would like to be able to release the raw text of the citations. However, it is proprietary data and we have not yet discussed with the EPO whether they would let us do so. That would be my favourite option.
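Option 2 above could look roughly like the sketch below. It is only an illustration under stated assumptions: `text_md5` is a hypothetical stand-in for `get_md5()` (assumed here to be the hex MD5 of the UTF-8 citation text), `merge_on_patcit_id` is a made-up helper, and the rows and the `title` field are toy data rather than the actual PatCit schema.

```python
import hashlib


def text_md5(text: str) -> str:
    """Hypothetical stand-in for get_md5(): hex MD5 of the UTF-8 bytes (assumption)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def merge_on_patcit_id(patstat_rows, patcit_rows):
    """Inner join: compute patcit_id from npl_biblio, then match it against PatCit rows."""
    patcit_by_id = {row["patcit_id"]: row for row in patcit_rows}
    merged = []
    for row in patstat_rows:
        patcit_id = text_md5(row["npl_biblio"])
        match = patcit_by_id.get(patcit_id)
        if match is not None:
            # Keep the PATSTAT columns and add the PatCit columns.
            merged.append({**row, **match})
    return merged


# Toy rows from "your" PATSTAT version (any version, npl_publn_id is not used as key).
patstat_rows = [
    {"npl_publn_id": 9001, "npl_biblio": "Doe A., Science, 2005"},
    {"npl_publn_id": 9002, "npl_biblio": "Smith J., Nature, 1999"},
]

# Toy PatCit rows; "title" is a made-up field for illustration.
patcit_rows = [
    {"patcit_id": text_md5("Smith J., Nature, 1999"), "title": "toy title"},
]

merged = merge_on_patcit_id(patstat_rows, patcit_rows)
for row in merged:
    print(row["npl_publn_id"], row["patcit_id"], row["title"])
```

The same idea carries over to Stata or BigQuery: add a hashed column to your PATSTAT table, then merge on it. If few rows match, the hashing convention (encoding, whitespace) probably differs from the one used in PatCit.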

Hope it helps

Cheers!
