Using npl_publn_id to merge PatCit to PATSTAT ??? #43

kylehigham · 2021-04-27T04:27:03Z

I am trying to unnest the front-page npl data on BigQuery so I can merge it into PATSTAT data in Stata. However, both the npl_publn_id and the cited_by variables are repeated. Why is this necessary?

cverluise · 2021-04-27T07:47:32Z

Hello @kylehigham,

To give a bit of context, such a merge is mainly used to obtain the full-text citation stored in PATSTAT, which we cannot release in PatCit for proprietary reasons.

Why it is not straightforward

The PATSTAT npl_publn_id is not stable from one version to the next (see #17). Merging on npl_publn_id therefore only works if you use exactly the same PATSTAT version as we did. In our case, that is PATSTAT 2018b (this might change, so please check the latest version of the documentation).

All you need is patcit_id

That is why we created the patcit_id. It is the md5 hash of the raw npl_biblio (i.e. citation) text, so it can be used across PATSTAT versions while npl_publn_id cannot. A positive side effect is that the number of duplicated citations dropped dramatically: from 40 million rows in the PATSTAT table, we obtained 27 million unique patcit_id (!). We are aware of discussions at the EPO about implementing a similar solution directly in PATSTAT, but we don't know whether it's still on their roadmap.
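To illustrate the idea, here is a minimal sketch of how hashing the raw citation text yields a version-independent key and collapses duplicates. The function name `md5_id`, the UTF-8 encoding, the absence of any text normalization, and the sample rows are all assumptions made for illustration; the actual get_md5() in patcit/serialize/main.py may differ, so check it before relying on this.

```python
import hashlib


def md5_id(raw_citation: str) -> str:
    """Hypothetical stand-in for get_md5(): hex MD5 of the UTF-8 encoded text."""
    return hashlib.md5(raw_citation.encode("utf-8")).hexdigest()


# Made-up rows: (npl_publn_id, npl_biblio). Ids 1 and 3 carry the same text.
rows = [
    (1, "Smith J., Nature, 1999"),
    (2, "Doe A., Science, 2005"),
    (3, "Smith J., Nature, 1999"),
]

# Group the version-specific ids under the hash of their citation text.
ids_by_hash = {}
for npl_publn_id, npl_biblio in rows:
    ids_by_hash.setdefault(md5_id(npl_biblio), []).append(npl_publn_id)

for patcit_id, npl_publn_ids in ids_by_hash.items():
    print(patcit_id, npl_publn_ids)
```

In this toy example, the two rows with identical text end up under a single hash, which is the same mechanism that reduces the row count when going from npl_publn_id to patcit_id.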

In practice

In short, to merge PatCit with PATSTAT, here are the options:

1. Use the same PATSTAT version as we did (check the latest documentation) and merge on npl_publn_id.
2. Use another version of PATSTAT, but create the patcit_id yourself. To do so, generate the patcit_id in your version of PATSTAT using the get_md5() function defined in patcit/serialize/main.py (skip the async prefix in your case). Then merge the two datasets on patcit_id. There you are.
3. [Future?] Ideally, we would like to be able to release the raw text of the citations. However, they are proprietary data and we have not yet discussed with the EPO whether they would let us do so. That would be my favourite option.
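The second option could look roughly like the sketch below: compute a patcit_id for each row of your own PATSTAT version, then join on it. Everything here is hypothetical except the column names npl_publn_id, npl_biblio and patcit_id: the helper `md5_id` only approximates get_md5() (hex MD5 of the UTF-8 text is an assumption), and the sample rows and `example_field` are invented.

```python
import hashlib


def md5_id(raw_citation: str) -> str:
    """Hypothetical stand-in for get_md5(); the real function may differ."""
    return hashlib.md5(raw_citation.encode("utf-8")).hexdigest()


# Made-up extract of the NPL table from your own PATSTAT version.
patstat_rows = [
    {"npl_publn_id": 101, "npl_biblio": "Smith J., Nature, 1999"},
    {"npl_publn_id": 102, "npl_biblio": "Doe A., Science, 2005"},
]

# Made-up PatCit records, e.g. exported from BigQuery; example_field is invented.
patcit_rows = [
    {"patcit_id": md5_id("Smith J., Nature, 1999"), "example_field": "parsed"},
]

# Index PatCit by patcit_id, then look up each PATSTAT row by its hash.
patcit_by_id = {row["patcit_id"]: row for row in patcit_rows}

merged = []
for row in patstat_rows:
    match = patcit_by_id.get(md5_id(row["npl_biblio"]))
    if match is not None:  # inner join: keep only rows found in both sources
        merged.append({**row, **match})

print(merged)
```

If only a few rows match on real data, the hash inputs probably differ (encoding, whitespace, or other preprocessing), which is a sign to align the sketch with the actual get_md5() implementation.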

Hope it helps

Cheers!

cverluise changed the title ~~Merging PatCit to PATSTAT~~ Using npl_publn_id to merge PatCit to PATSTAT ??? Apr 27, 2021
