I am trying to unnest the front-page npl data on BigQuery so I can merge it into PATSTAT data in Stata. However, both the npl_publn_id and the cited_by variables are repeated. Why is this necessary?
cverluise changed the title from "Merging PatCit to PATSTAT" to "Using npl_publn_id to merge PatCit to PATSTAT ???" on Apr 27, 2021
To give a bit of context, such a merge is mainly used to obtain the full-text citation stored in PATSTAT, which we cannot release in PatCit for proprietary reasons.
Why it is not straightforward
The PATSTAT npl_publn_id is not stable from one version to the next (see #17). Merging on npl_publn_id therefore only works if you use exactly the same version of PATSTAT as we did. In our case, that is PATSTAT 2018b (this might change, so please check the latest version of the documentation).
All you need is patcit_id
That is why we created the patcit_id: it is the md5 hash of the raw npl_biblio (i.e. citation) text. Hence, it can be used across PATSTAT versions, while npl_publn_id cannot. A positive side effect is that the number of duplicated citations dropped dramatically: from 40 million rows in the PATSTAT table, we obtained 27 million unique patcit_id (!). We are aware of discussions at the EPO about implementing a similar solution directly in PATSTAT, but we don't know if it's still on their roadmap.
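To make the idea concrete, here is a minimal sketch of a hash-based identifier. It assumes the hash is taken over the UTF-8 bytes of the raw citation string with no other preprocessing; the actual get_md5() in PatCit may differ, so check the repository before relying on it. The function name md5_id and the toy rows are invented for illustration.

```python
import hashlib


def md5_id(citation_text: str) -> str:
    """Hypothetical stand-in for PatCit's get_md5().

    Assumption: hex MD5 of the UTF-8 encoded raw text, no normalization.
    """
    return hashlib.md5(citation_text.encode("utf-8")).hexdigest()


# Toy (npl_publn_id, npl_biblio) rows; the values are invented.
rows = [
    (1, "Smith J., Nature 2001, 410, 12-15"),
    (2, "Doe A., Science 1999, 283, 100-104"),
    (3, "Smith J., Nature 2001, 410, 12-15"),  # same text as row 1
]

# Map each version-specific npl_publn_id to its text-based identifier.
ids = {npl_publn_id: md5_id(text) for npl_publn_id, text in rows}

# Identical citation texts collapse to a single identifier.
unique_ids = set(ids.values())
```

Because the identifier depends only on the text, rows 1 and 3 share the same value, which is the deduplication effect described above, and the value does not change if PATSTAT renumbers npl_publn_id in a later release.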
In practice
In short, to merge PatCit with PATSTAT, here are the solutions:
Use the same PATSTAT version as we did (check latest doc) and merge on npl_publn_id.
Use another version of PATSTAT and create the patcit_id yourself. To do so, apply the get_md5() function defined in patcit/serialize/main.py (you can drop the async prefix in your case) to the raw citation text in your version of PATSTAT. You can then merge the two datasets on patcit_id.
[Future?] Ideally, we would like to be able to release the raw text of the citations. However, they are proprietary data and we have not yet discussed with the EPO whether they would let us do so. That would be my favourite option.
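The second option can be sketched as follows. This is an illustration only: make_patcit_id is a hypothetical re-implementation that assumes a plain MD5 over the UTF-8 encoded npl_biblio string (verify against get_md5() in patcit/serialize/main.py), and the row contents, including the year field on the PatCit side, are invented.

```python
import hashlib


def make_patcit_id(npl_biblio: str) -> str:
    # Assumption: plain hex MD5 of the UTF-8 encoded raw citation text.
    return hashlib.md5(npl_biblio.encode("utf-8")).hexdigest()


# Rows from *your* PATSTAT version (toy data).
patstat_rows = [
    {"npl_publn_id": 901, "npl_biblio": "Smith J., Nature 2001, 410, 12-15"},
    {"npl_publn_id": 902, "npl_biblio": "Unknown technical report"},
]

# Rows from PatCit, keyed by patcit_id (toy data, hypothetical "year" field).
patcit_rows = [
    {"patcit_id": make_patcit_id("Smith J., Nature 2001, 410, 12-15"),
     "year": 2001},
]

# Index PatCit by patcit_id, then left-join the PATSTAT rows onto it.
patcit_by_id = {row["patcit_id"]: row for row in patcit_rows}

merged = []
for row in patstat_rows:
    patcit_id = make_patcit_id(row["npl_biblio"])
    match = patcit_by_id.get(patcit_id, {})
    merged.append({**row, "patcit_id": patcit_id, "year": match.get("year")})
```

Rows without a match keep a year of None, so it is easy to count how many citations failed to link. The same pattern carries over to Stata or SQL once the patcit_id column has been generated and exported.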