Privilege INSPIRE categories #41

kaplun · 2016-11-18T08:55:48Z

Currently INSPIRE categories are stored at the same level of arXiv categories and other externally provided categories.

However, while catalogers are not supposed to touch externally provided categories, they are expected to curate INSPIRE categories. This puts INSPIRE categories in a privileged position.

It is proposed to move INSPIRE categories to a dedicated field, so that they are easier to edit (e.g. autocomplete can be enforced on the exact INSPIRE categories).

rikirenz · 2016-11-25T12:10:07Z

cc: @david-caro

kaplun · 2016-11-25T12:15:18Z

So we could split it into INSPIRE_categories and external_categories. INSPIRE_categories will not need any scheme or source, so it is just a list of strings.

david-caro · 2016-11-26T08:52:46Z

I'm always confused about caps and lowercase, isn't it better to keep everything in lowercase? (As if they were variable names in python)

kaplun · 2016-11-27T19:04:17Z

Sure!

kaplun · 2016-12-02T14:31:56Z

@inspirehep/inspire-content @inspirehep/inspire-dir

Given the following facts:

In the last 6 months only ~0.01% users queried INSPIRE using arXiv categories (or any category at all)
arXiv categories are not available for all types of documents

Given the following thoughts:

Faceting for ~10 categories (INSPIRE) is nicer than faceting for ~30 (arXiv)
Using a neural network (magpie) to guess ~10 categories (INSPIRE) is more precise and reliable than guessing ~30 categories (arXiv)

~~It is proposed here to simplify the data model and hence curation interface in the following way:~~
* store only INSPIRE categories
* upon ingestion external categories (e.g. arXiv) are mapped to INSPIRE ones
* if no external category we use magpie to guess one based on abstract
* upon migration from legacy we preserve INSPIRE categories and we map existing arXiv ones into INSPIRE.
* Present to users a facet based on INSPIRE categories.

Edit: See below

david-caro · 2016-12-02T14:35:30Z

So we only use magpie if there are no external categories?

kaplun · 2016-12-02T14:44:43Z

If we don't have any INSPIRE category (they might come through a different mapping in hepcrawl, for example)

salmele · 2016-12-02T15:04:15Z

I strongly disagree.

Our user community understands what hep-th means, for instance, and ought to be able to search accordingly

Possibly we could have a 'topic' search, covering all of INSPIRE's present 'categories', and an arXiv-category search, which is meant to work only on arXiv content and not on second-guessed INSPIRE attributions.

jacquerie · 2016-12-02T15:11:32Z

I think @michamos can contribute his idea here.

kaplun · 2016-12-02T15:48:36Z

OK: New proposal:

we store arXiv categories (originated from arXiv) into arXiv categories
we store INSPIRE categories (or topics or subjects) into INSPIRE categories:
- upon migration from legacy: copying them if they already exist
- upon migration from legacy: generating them from arXiv categories when available
- using magpie to guess them from the abstract (or title) in all the other cases.
We propose to present to user 2 separate facets (initially folded, in order not to waste too much space):
- INSPIRE categories/topics/subjects (easy): human-friendly names
- arXiv categories: possibly presented in a foldable tree structure to represent the complex hierarchy of arXiv. Anyway only categories for which we have papers will be displayed. (in principl

Notes: in principle all papers/records of INSPIRE should belong to at least one INSPIRE category. On the contrary, only records harvested from arXiv will belong to a given arXiv category.
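
The fallback order in this proposal could be sketched roughly as follows. This is only an illustration: the function name `derive_inspire_categories`, the record keys, the two-entry `ARXIV_TO_INSPIRE` excerpt and the `guess` callable standing in for magpie are all assumptions, not existing code.

```python
# Illustrative two-entry excerpt of an arXiv -> INSPIRE mapping (assumption).
ARXIV_TO_INSPIRE = {
    "hep-th": "Theory-HEP",
    "hep-ph": "Phenomenology-HEP",
}


def derive_inspire_categories(record, guess):
    """Return INSPIRE categories for a record using the proposed fallback order.

    `record` is a plain dict; `guess` is any callable that takes a text and
    returns a list of categories (a hypothetical stand-in for magpie).
    """
    # 1. Keep categories that already exist (e.g. migrated from legacy).
    existing = record.get("inspire_categories")
    if existing:
        return list(existing)

    # 2. Otherwise map the arXiv categories, dropping duplicates while
    #    preserving their order.
    mapped = []
    for category in record.get("arxiv_categories", []):
        term = ARXIV_TO_INSPIRE.get(category)
        if term and term not in mapped:
            mapped.append(term)
    if mapped:
        return mapped

    # 3. Last resort: guess from the abstract (or the title).
    return guess(record.get("abstract") or record.get("title", ""))
```

With this ordering, a guess is only made when neither a stored INSPIRE category nor a mappable arXiv category is available.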

annetteholtkamp · 2016-12-03T06:52:26Z

I agree that we should definitely store the arXiv categories. I'm not so sure whether they should appear as facets. As Sam already pointed out, only a subset of records has an arXiv category. I assume it wouldn't be evident to all users that focusing on hep-th, e.g., would give only arXiv papers. And why should we single out arXiv papers anyway? - Annette

bing13 · 2016-12-03T16:15:42Z

I agree. But there might be some non-misleading way to expose such a feature with clever UI, etc.

aw-bib · 2016-12-05T08:31:10Z

might be some non-misleading way

Sort of a hierarchical facet comes to mind. First arXiv, then its categories.

arxiv
- hep-ph
- hep-th
- ...
inspire
- ...

BTW: Quite often one does not search for a term, but having the keyword on display gives additional clues about the content. hep-th is usually a different sort of paper than hep-ph or hep-ex. IOW it's not only computers looking at the data, thus IMHO bean counting alone doesn't suffice.

suenjedt · 2016-12-05T08:33:45Z

I think this is a perfect example where more data/evidence is needed, as well as more testing - at least for the user facing part. I would recommend consulting Stella for this.

david-caro · 2016-12-05T09:42:21Z

Okok, for now, we all seem to agree on storing arXiv + inspire categories, just not yet sure how to expose that info to the users right?
(so @rikirenz can keep working on the backend part)

ksachs · 2016-12-05T10:04:03Z

upon migration from legacy: generating them from arXiv categories when available
Only if there is no INSPIRE category!

Florian and I delete INSPIRE categories derived from arXiv categories if they don't fit.
Usually these are cross-listings which on arXiv mean 'also of interest for...' whereas on INSPIRE they are equivalent to the main category.

Concerning facets: I would use only INSPIRE categories (gut feeling, no strong opinion). If people really want to search for arXiv categories they can still do that explicitly - right?

kaplun · 2016-12-05T10:27:10Z

Florian and I delete INSPIRE categories derived from arXiv categories if they don't fit.

Who/what process has added the wrong INSPIRE category in the first place? We currently have a mapping that unambiguously maps an arXiv category to an INSPIRE one:

https://github.com/inspirehep/inspire-next/blob/master/inspirehep/config.py#L1445-L1604

Concerning facets: I would use only INSPIRE categories (gut feeling, no strong opinion). If people really want to search for arXiv categories they can still do that explicitly - right?

In principle both searches are possible, but we have stats demonstrating that users are not able to search with them, via search syntax. On the other hand facets are very easy to manipulate.
Indeed there is no hurry on this, so we can first see the results of user testing the alternatives.

ksachs · 2016-12-05T10:33:57Z

Not wrong in the sense of 'wrong mapping', but wrong in the sense of 'this is not about Experiment-HEP'.
At arXiv articles get cross-listed for better visibility 'also of interest for...', at INSPIRE the subject always means 'is about', we don't have the notion of secondary category.

kaplun · 2016-12-05T10:41:10Z

In the new model a record can have more than one INSPIRE subject, so crosslisting is preserved.

kaplun · 2016-12-05T10:43:04Z

IMHO users will not be damaged by having a record belong to multiple categories. It's just a way for a paper to be reached. Once reached, the user will be able to make their own judgement.
But indeed we are making several assumptions here. So worth doing a user-checking.

ksachs · 2016-12-05T11:16:51Z

side effect: everything with an INSPIRE category *-HEP will (automatically) get CORE.
I don't need / want user-checking to know that The Escaramujo Project: instrumentation courses during a road trip across the Americas is not CORE. And I want to be in control of INSPIRE subjects even for arXiv papers. If you keep arXiv and INSPIRE categories in sync we completely depend on arXiv.
So: derive INSPIRE category automatically from arXiv on ingestion: perfect. But later on they should be independent.

david-caro · 2016-12-05T11:22:23Z

@ksachs I think that is the idea, the INSPIRE categories are only guessed/automatically added if they are not there already, and only on ingestion. After that, they can be manually changed if needed.

The point that @kaplun is trying to make, is that we can have multiple INSPIRE ones, just as we have multiple arXiv ones, with the same treatment to main category and multiple secondary ones.

kaplun · 2016-12-05T12:03:52Z

Yeah, plus I was really curious to know if there was maybe something we could correct, since you were mentioning that you were actually removing INSPIRE categories. Fully understood now.

jacquerie · 2016-12-05T12:29:51Z

In the last 6 months only ~0.01% users queried INSPIRE using arXiv categories (or any category at all)

Correcting myself: the ratio of

SELECT count(*)
FROM piwik_log_action, piwik_log_link_visit_action
WHERE (
  type=8 AND
  idaction_name=idaction AND
  name LIKE '% fc %'
);

and

SELECT count(*)
FROM piwik_log_action, piwik_log_link_visit_action
WHERE (
  type=8 AND
  idaction_name=idaction
);

is 0.07%. Still low, but mostly I wanted to write down how to compute this kind of number if someone else is interested.

Also: this is the ratio of queries containing fc, not users. Computing the ratio of users can be done, but is slightly harder.

jacquerie · 2016-12-05T12:46:40Z

It's actually pretty easy. The ratio of:

SELECT COUNT(DISTINCT idvisitor)
FROM piwik_log_action, piwik_log_link_visit_action
WHERE (
  type=8 AND
  idaction_name=idaction AND
  name LIKE '% fc %'
);

and

SELECT COUNT(DISTINCT idvisitor)
FROM piwik_log_action, piwik_log_link_visit_action
WHERE (
  type=8 AND
  idaction_name=idaction
);

is 0.1%.

That is, of all distinct users that made at least a query on INSPIRE in the last six months, 0.1% of them used the fc keyword.
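
For anyone who wants to try the computation without access to the Piwik database, here is a minimal sketch that runs the same two queries against an in-memory SQLite database. The table layout is reduced to the columns used in the queries, and the rows are made-up toy data, so the resulting percentage is illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, assumed layout: only the columns used by the queries above.
conn.executescript("""
    CREATE TABLE piwik_log_action (idaction INTEGER, name TEXT, type INTEGER);
    CREATE TABLE piwik_log_link_visit_action (idvisitor TEXT, idaction_name INTEGER);
""")
# Toy data: four search actions (type=8), one of which uses "fc".
conn.executemany(
    "INSERT INTO piwik_log_action VALUES (?, ?, ?)",
    [(1, "find a witten", 8), (2, "find fc t and a witten", 8),
     (3, "find t higgs", 8), (4, "find j prl", 8)],
)
conn.executemany(
    "INSERT INTO piwik_log_link_visit_action VALUES (?, ?)",
    [("u1", 1), ("u1", 2), ("u2", 3), ("u3", 4), ("u4", 1)],
)

BASE = """
    SELECT COUNT(DISTINCT idvisitor)
    FROM piwik_log_action, piwik_log_link_visit_action
    WHERE type = 8 AND idaction_name = idaction
"""


def distinct_visitors(extra_condition=""):
    """Count distinct visitors whose searches satisfy the extra condition."""
    return conn.execute(BASE + extra_condition).fetchone()[0]


with_fc = distinct_visitors(" AND name LIKE '% fc %'")
total = distinct_visitors()
ratio_percent = 100.0 * with_fc / total
print(f"{with_fc} of {total} visitors used fc: {ratio_percent:.1f}%")
```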

kaplun · 2016-12-05T14:09:29Z

@annetteholtkamp rightly points out, of course, that it does make quite a difference to know if an INSPIRE category had been:

guessed by magpie (source magpie)
derived automatically from arXiv categories (source arxiv)
overridden by a cataloger (source cataloger).

So in the end it seems that our data structure still needs to capture the source for INSPIRE categories.

david-caro · 2016-12-05T16:26:30Z

If the inspire categories guess is done only on ingestion and only if not there already, I don't think that it adds much value to track the source of the categories. Let me explain (it's very possible that I just don't get your point ;), I don't see @annetteholtkamp's rationale here).

The arxiv keywords come from outside, and will never change, so letting them have their own section already expresses the source (passive info).

The inspire ones, we just care if they are right or not, if they are not right, we just change them, no matter if they come from magpie, if they came from arxiv or from a cataloger no?

I only see a couple of use cases where we would want that info, to train better magpie by comparing the result it gave with the one from the cataloger, or something like getting a list of all the records with automated categories... though I don't see that being very useful, as we can just train magpie with the whole set of data just expecting it to be correct (as it passed validation from a cataloger).

kaplun · 2016-12-05T16:34:22Z

How do you envisage a workflow where a cataloger checks a record only 2 weeks after it arrives on INSPIRE? At that point we have already ingested it, and the INSPIRE category has already been guessed from arXiv/magpie. How can the cataloger know whether to trust the INSPIRE category?

Besides this small point, I agree with you that in general we don't care about arXiv vs. magpie.

david-caro · 2016-12-05T16:40:26Z

Well, it does not matter who put it no? Just if it is correct or not, and for that you don't care of the origin right?
Unless, there's some extra work associated with checking the correctness itself. In that case it makes sense to add that 'already verified' flag.

ksachs · 2016-12-06T09:33:16Z

It's about the trust level. Up to now the only sources for subject categories were catalogers (Annette, Florian and me) or arXiv (obvious to spot). If we have more sources (magpie, journal category, ...) it would be nice to know whether it is worth a second look.

salmele · 2016-12-06T09:39:32Z

Excellent point, @ksachs. This calls for the notion of provenance (for later audit should the case be) in the record. Didn't Invenio offer something on that?

kaplun · 2016-12-06T14:41:01Z

@salmele nothing out of the box. It's up to us to decide what provenance to store for what and where, and to implement support for it.

Invenio supports basic history information, such as "this version of record X was touched by curator Y", but we don't know which field was actually involved.

OK. So all in all, we don't simplify the data model WRT INSPIRE subjects, but we keep them as a list of objects with two attributes: source (enum with curator, magpie, arxiv) and value (enum among the current list of subjects).

BTW how shall we call it? inspire_subjects or _categories or _topics?

david-caro · 2016-12-06T17:52:43Z

@salmele Quick question, what should we do when:

Having multiple inspire categories with different sources but same term?
(I'd say we keep them and let the curator remove the duplicates if need be).
Having categories (coming from legacy) that have no source?
(I'd say put there something like 'undefined' to allow checking them later, should be a small amount if any)

This (and choosing the name) is currently blocking @rikirenz on #47.

salmele · 2016-12-07T00:07:00Z

I am concerned with preserving arXiv categories for arXiv-ingested preprints so that users can (faceted) search for them, as they of course are aware of what arXiv categories are.

I am neutral about what to do with INSPIRE subjects and I recommend @ksachs or @annetteholtkamp decide how they want to define the ontology of sources and their priority

bing13 · 2016-12-07T00:36:41Z

i concur.

kaplun · 2016-12-07T07:58:13Z

@david-caro IMHO there's not going to be arXiv+magpie (since magpie is used only when there is no arXiv category), and cataloger wins over the two. So in case the same term comes from both arXiv and a cataloger, we preserve the cataloger one.
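
A minimal sketch of this "cataloger wins" rule, assuming each category is a dict with `term` and `source` keys (both names are assumptions here). The relative order of arxiv and magpie in `PRECEDENCE` is also an assumption, since only the cataloger's priority has been stated.

```python
# Higher number wins; the ordering between arxiv and magpie is an assumption.
PRECEDENCE = {"cataloger": 2, "arxiv": 1, "magpie": 0}


def dedupe_by_source(categories):
    """Keep one entry per term, preferring the most trusted source."""
    best = {}
    for entry in categories:
        term = entry["term"]
        current = best.get(term)
        if current is None or PRECEDENCE[entry["source"]] > PRECEDENCE[current["source"]]:
            best[term] = entry
    # Dicts preserve insertion order, so terms stay in first-seen order.
    return list(best.values())
```

For example, a record carrying Theory-HEP from both arxiv and cataloger would end up with only the cataloger entry.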

annetteholtkamp · 2016-12-07T08:01:15Z

Actually, we could run magpie also on arXiv papers and keep magpie’s suggestion in case of difference to arXiv. This could be helpful for the content people as a hint to carefully check the subject. - Annette

salmele · 2016-12-07T08:03:57Z

For the avoidance of doubt: using magpie or human classification on arXiv would create a new INSPIRE subject and NOT modify the arXiv category (and cross-listing) as chosen by the submitter (which we'd show somewhere else)... right?

kaplun · 2016-12-07T08:08:20Z

@salmele right: arXiv categories are never going to be touched.

david-caro · 2016-12-07T08:28:13Z

So I propose then having this:

A field arxiv_categories with the categories list from arxiv, that are just populated on ingestion.
A field inspire_categories with a list of term-source tuples that must be unique, in the sense that there's no repeated tuple, though it might be that we have the same term with different sources:
- Where term is the actual name of the category it represents.
- source is one of (this list might get extended in time):
  - cataloger (for manually set categories)
  - arxiv (for categories derived from the arxiv ones)
  - magpie (for automatically guessed categories)
  - undefined (for migrated records that have no source on their categories, not sure if there are any though, but just in case).
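
To make the uniqueness rule concrete, here is a small sketch of what such a record and a validity check might look like. The helper `validate_inspire_categories` and the exact dict layout are hypothetical; only the field names and the list of sources come from the proposal above.

```python
ALLOWED_SOURCES = {"cataloger", "arxiv", "magpie", "undefined"}


def validate_inspire_categories(categories):
    """Raise ValueError if a source is unknown or a (term, source) pair repeats."""
    seen = set()
    for entry in categories:
        if entry["source"] not in ALLOWED_SOURCES:
            raise ValueError(f"unknown source: {entry['source']!r}")
        pair = (entry["term"], entry["source"])
        if pair in seen:
            raise ValueError(f"duplicate entry: {pair!r}")
        seen.add(pair)


record = {
    "arxiv_categories": ["hep-th", "cond-mat.stat-mech"],
    "inspire_categories": [
        {"term": "Theory-HEP", "source": "arxiv"},
        # Same term with a different source: allowed by the proposal.
        {"term": "Theory-HEP", "source": "cataloger"},
    ],
}
validate_inspire_categories(record["inspire_categories"])
```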

kaplun · 2016-12-07T08:42:30Z

Do we want to go for the undefined? In general so far, when something is undefined we simply haven't specified a value. In Invenio 1 this had some implications because it wasn't possible to search for those things having no-value. Not sure for Elasticsearch. @jacquerie ?

salmele · 2016-12-07T08:53:07Z

A field arxiv_categories with the categories list from arxiv, that are just populated on ingestion

...preserving the concept of primary category in arXiv (current PRIMARC in SPIRES syntax and dedicated MARC field)

kaplun · 2016-12-07T08:54:54Z

Yup. This is something that can be implemented at search time. The first arXiv category in the list of arXiv categories will be searchable also with primarc.
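
As a rough illustration of that convention (the helper names and the `arxiv_categories` key are assumptions, and the real search-time implementation may look quite different):

```python
def primary_arxiv_category(record):
    """Return the first arXiv category, treated here as the primary one."""
    categories = record.get("arxiv_categories") or []
    return categories[0] if categories else None


def matches_primarc(record, wanted):
    """True if `wanted` is the record's primary arXiv category."""
    return primary_arxiv_category(record) == wanted
```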

(Ticketized in inspirehep/inspire-next#1791 so we don't forget)

michamos · 2016-12-07T10:02:37Z

TL;DR: adopt arXiv categories instead of INSPIRE categories

My opinion is rather different from those expressed before.
So let me summarize some of the different viewpoints as far as I understand them:

@salmele cares about keeping arXiv categories present and unmodified, as he needs it for SCOAP3 statistics
@ksachs cares about being able to override the INSPIRE category for cross-lists that are only weakly relevant
@kaplun cares about having a small number of categories so that magpie can guess them with high confidence

One point of view is largely missing, the user point of view, which is probably the most important. I was a simple user until not too long ago, and my point of view is that INSPIRE categories are terrible, and I would suggest getting rid of them in their current form, for the following reasons.

nobody knows about them

as @jacquerie showed, ~0.1% of our users ever tried to perform a search with the 'fc' keyword. Of those, I am sure a large fraction is doing it wrong, e.g. fc hep-th instead of fc t. To the contrary, arXiv categories have been user-visible for 25 years and everyone knows what hep-th means. This goes to the extent that even inside INSPIRE, we use arXiv categories for HEPNAMES and jobs instead of the INSPIRE ones. So we don't have any legacy to preserve here, and we can use the migration as an opportunity to rethink the whole concept. I would suggest sticking as closely as possible to the arXiv categories that the people are familiar with. (For anecdotal evidence, I had never heard of INSPIRE categories as a user, and am still struggling to remember the less common ones.)

they are confusing

there are three types of categories:
- small categories are exactly the same conceptually as an arXiv category (Theory-HEP), but with a different name, and additional papers not on the arXiv
- medium categories regroup a limited number of categories (Instrumentation)
- large categories encompass a huge number of arXiv categories (Math and Math Physics)

See the list on the bottom of the INSPIRE categories and the arXiv categories that map to it. They are inconsistent, as General Physics is NOT the same as physics.gen-ph (which is a euphemism for garbage on the arXiv), and Data Analysis and Statistics does not contain the stat.* statistics categories. And who knows what Other means?

they are largely useless

the purpose of categories is to organize the records in such a way that the user can easily filter them out and focus on what she is interested in. Either she is interested in a small category, in which case she is looking conceptually for a single arXiv category, but with another name, or she is not. In the latter case, the category is probably useless anyway. For example, if someone wants to know more about entanglement entropy (which is a hot subject right now in hep-th, but was born in quant-ph and is used quite a lot in cond-mat), he is not interested in whether the paper is General Physics, but in the distinction between quant-ph and cond-mat.*. More importantly, our coverage will probably be quite poor in those areas, but the user is not necessarily aware of this fact.

solution: adopt arXiv categories

So I would suggest overhauling INSPIRE categories to match arXiv categories.
For arXiv papers, we adopt the arXiv category as the INSPIRE category (keeping the distinction primary/secondary also on INSPIRE, and allowing the user to search/facet based on primary only or both primary and secondary). For records that do not come from arXiv, we put them in a specific category (e.g. physics.acc-ph) if we care about it, or we put them in a top-level category (e.g. astro-ph or math) if we don't want to zoom in.

If we keep the primary/secondary distinction, I don't think there would be any need for overriding categories (maybe @ksachs still sees a usecase), as we can trust the arXiv moderators which vastly outnumber the handful of people assigning categories in INSPIRE and are active researchers in their fields. But we could still have the provenance field just to be sure, and for the case in which we add a non-arXiv record that is later added to arXiv.
And assigning articles to top-level categories should not be too difficult, and the number of them still manageable for magpie (with the added benefit that we could train it on arXiv data).

{'Accelerators': ['physics.acc-ph'],
 'Astrophysics': ['physics.space-ph',
  'astro-ph.EP',
  'astro-ph.GA',
  'astro-ph.SR',
  'astro-ph.CO',
  'astro-ph',
  'astro-ph.HE'],
 'Computing': ['cs.ET',
  'cs.RO',
  'cs.CY',
  'cs.NE',
  'cs.DB',
  'cs.DC',
  'cs.SI',
  'cs.DL',
  'cs.CV',
  'cs.FL',
  'cs.SD',
  'cs.PF',
  'cs.LG',
  'cs.DS',
  'cs.OH',
  'cs.OS',
  'cs.LO',
  'cs.MM',
  'cs.AI',
  'physics.comp-ph',
  'cs.GT',
  'cs.IR',
  'cs.NA',
  'cs.SE',
  'cs.CL',
  'cs.CG',
  'cs.DM',
  'cs.SY',
  'cs.GL',
  'cs.IT',
  'cs.CR',
  'cs.MS',
  'cs.SC',
  'cs.CC',
  'cs.AR',
  'cs.GR',
  'cs.NI',
  'cs.MA',
  'cs.PL',
  'cs.CE',
  'cs.HC',
  'cs'],
 'Data Analysis and Statistics': ['physics.data-an'],
 'Experiment-HEP': ['hep-ex'],
 'Experiment-Nucl': ['nucl-ex'],
 'General Physics': ['quant-ph',
  'cond-mat.stat-mech',
  'cond-mat.mes-hall',
  'cond-mat.supr-con',
  'physics.plasm-ph',
  'cond-mat',
  'cond-mat.other',
  'physics.class-ph',
  'nlin',
  'cond-mat.quant-gas',
  'cond-mat.dis-nn',
  'nlin.CD',
  'cond-mat.soft',
  'nlin.CG',
  'cond-mat.str-el',
  'physics.ao-ph',
  'cond-mat.mtrl-sci',
  'physics.gen-ph',
  'nlin.AO',
  'physics.atm-clus',
  'physics.flu-dyn',
  'physics.atom-ph',
  'physics.optics',
  'physics',
  'physics.geo-ph'],
 'Gravitation and Cosmology': ['gr-qc'],
 'Instrumentation': ['astro-ph.IM', 'physics.ins-det'],
 'Lattice': ['hep-lat'],
 'Math and Math Physics': ['patt-sol',
  'math.GT',
  'math.CV',
  'math.MP',
  'math.GM',
  'math.PR',
  'math.GR',
  'math.DG',
  'math.NA',
  'math.AP',
  'math.CA',
  'math.LO',
  'math.NT',
  'math.AG',
  'math.KT',
  'q-alg',
  'math.ST',
  'math.CT',
  'math.QA',
  'alg-geom',
  'math',
  'math.DS',
  'math.FA',
  'math.CO',
  'math.SP',
  'math.MG',
  'math.GN',
  'math.AT',
  'nlin.PS',
  'math.OC',
  'math.SG',
  'math.HO',
  'math.RT',
  'math.IT',
  'math.RA',
  'math.OA',
  'math-ph',
  'dg-ga',
  'math.AC',
  'solv-int',
  'nlin.SI'],
 'Other': ['q-fin.TR',
  'q-bio.PE',
  'q-bio.CB',
  'q-bio.BM',
  'q-fin.GN',
  'q-fin.PR',
  'stat.AP',
  'physics.chem-ph',
  'physics.pop-ph',
  'q-bio.MN',
  'stat.CO',
  'stat.ML',
  'physics.hist-ph',
  'q-fin.CP',
  'stat.OT',
  'q-bio.TO',
  'q-fin.EC',
  'q-bio.GN',
  'q-fin.PM',
  'physics.med-ph',
  'stat.TH',
  'physics.bio-ph',
  'q-bio.SC',
  'physics.soc-ph',
  'physics.ed-ph',
  'q-bio.OT',
  'q-bio.QM',
  'q-fin.ST',
  'q-bio.NC',
  'q-fin.RM',
  'q-fin.MF',
  'stat.ME'],
 'Phenomenology-HEP': ['hep-ph'],
 'Theory-HEP': ['hep-th'],
 'Theory-Nucl': ['nucl-th']}
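
For lookups in the other direction (arXiv category to INSPIRE category), the dictionary can be inverted. The sketch below uses only a small excerpt of the mapping above to stay short; the variable names are illustrative.

```python
# Small excerpt of the INSPIRE -> arXiv mapping listed above.
INSPIRE_TO_ARXIV = {
    "Instrumentation": ["astro-ph.IM", "physics.ins-det"],
    "Theory-HEP": ["hep-th"],
    "Data Analysis and Statistics": ["physics.data-an"],
    "Other": ["stat.AP", "stat.ML"],
}

# Invert it so that each arXiv category points at its single INSPIRE category.
ARXIV_TO_INSPIRE = {}
for inspire_category, arxiv_categories in INSPIRE_TO_ARXIV.items():
    for arxiv_category in arxiv_categories:
        if arxiv_category in ARXIV_TO_INSPIRE:
            # A repeated arXiv category would make the mapping ambiguous.
            raise ValueError(f"{arxiv_category} is mapped twice")
        ARXIV_TO_INSPIRE[arxiv_category] = inspire_category

# The stat.* categories land in "Other", not in "Data Analysis and Statistics".
print(ARXIV_TO_INSPIRE["stat.ML"])
```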

david-caro · 2016-12-07T10:13:31Z

IMO we are guessing too much, the inspire categories have been hidden so far so we don't have usage data, I'd try to get some from the users before doing any big effort either way (that might require some effort too, but having the machinery to easily get that kind of things will help on deciding on other features too).

kaplun · 2016-12-07T10:13:36Z

@michamos for us developers, can you exactly define or point us to the definition of what is:

primary arXiv category
secondary arXiv category
top-level arXiv category

E.g. with some concrete example to reason on.

michamos · 2016-12-07T10:19:40Z

@michamos for us developers, can you exactly define or point us to the definition of what is:
primary arXiv category

the thing we put into 037, e.g. for https://inspirehep.net/record/1501963 it is hep-th

secondary arXiv category

for the same record, they are cond-mat.stat-mech and physics.flu-dyn

top-level arXiv category

a category in bold on https://arxiv.org/, which might have subcategories, so hep-th or physics.* (probably using math* instead of math.* though to take also math-ph).
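
In code, one possible reading of "top-level" is the part before the first dot, with everything starting with math folded into math. This is only a sketch of that rule; the function name is made up and arXiv itself may group categories differently.

```python
def top_level_category(category):
    """Map an arXiv category to its top-level archive (illustrative rule)."""
    archive = category.split(".", 1)[0]
    # Assumption: use "math*" rather than "math.*" so math-ph is included.
    if archive.startswith("math"):
        return "math"
    return archive
```

Under this rule hep-th stays hep-th, physics.flu-dyn becomes physics, and both math.RT and math-ph become math.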

E.g. with some concrete example to reason on.

michamos · 2016-12-07T10:23:25Z

IMO we are guessing too much, the inspire categories have been hidden so far so we don't have usage data

the fact they are hidden and completely unknown is an opportunity to rethink INSPIRE categories IMHO

ksachs · 2016-12-07T10:58:15Z

Hi Micha,

thanks for your detailed input from the user side.
Point taken - visibility is an issue.
A pity that we don't have usage statistics from SPIRES - that's our legacy.
Both field-codes (= subject, category, ...) and keywords are neglected on INSPIRE.

However you are talking about something you barely know (as you say yourself).
The main use-case was that people would browse (we even had printed lists) through the daily inputs in their category. And from the arXiv listings I believe this is still the case.

First some statistics - records added the last 2 years:

  year   2016 / 2015 / 2015core
  all   76795 / 51412 / 28802
  arxiv 22558 / 25454 / 17228
  a     10595 / 10957 / 4982
  b      6667 / 6007 / 1494
  c       912 / 610 / 351
  e      5313 / 5311 / 5087
  g      5941 / 6463 / 4479
  i      5143 / 5086 / 3171
  l      1063 / 1205 / 1202
  m      2718 / 3160 / 1548
  n    19345* / 4451 / 2064
  o       422 / 503 / 169
  p      9304 / 9008 / 8750
  q      3894 / 4194 / 1692
  t      6819 / 7422 / 7304
  x    17809* / 3717 / 1568
  * reharvest of nucl journals

Btw: other = non-physics

We have one big category (with a lot of non-core stuff we don't want to waste time on): astro, two small categories: lattice and computing, all the rest is around 5k/y. Which shows that the balance is not too bad.

You forget that the INSPIRE content is not the same as the arXiv content. Only about half of our records come from arXiv. And we have only a very selective part of arXiv. Where arXiv categories are split in subcategories we barely have records. For core records the big categories are p (9k), t (7k) and e (5k) which we might want to break down into smaller sub-categories; which SPIRES used to have, but that was before my time. Actually SPIRES had categories before arXiv existed.

The assignment of categories has to be feasible, which it is not if the categories are too fine-grained. It's mainly Florian and me who assign categories to the non-arXiv stuff, not a bunch of moderators.

Lesson learned from the reharvest of nucl journals, where we used magpie to guess the subject. 30% I dumped right away, for about 20% I went through and made corrections, about 50% were useful. Which means even if we run magpie over everything that doesn't come from arXiv we still have ~10k/y to assign manually.

michamos · 2016-12-07T12:32:11Z

@ksachs thanks for your comments. I agree that having categories is very useful, I just think those categories should be closer to the arXiv ones that people know. In this way we can avoid having our users learn and remember what from their point of view is a new classification scheme which is very similar yet subtly different in some areas.

> The main use-case was that people would browse (we even had printed lists) through the daily inputs in their category. And from the arXiv listings I believe this is still the case.

People browse the new arXiv listings (e.g. hep-th/new); it would actually be very nice if one could do the same on INSPIRE, with all new papers in a given category. So by adopting arXiv categories on INSPIRE, one could get, say, arXiv hep-th + non-arXiv hep-th additions.

> Btw: other = non-physics

I know that Other = None of the rest, but in order to know what it means precisely, one has to know all the other categories that we have.

> You forget that the INSPIRE content is not the same as the arXiv content. Only about half of our records come from arXiv. And we have only a very selective part of arXiv. Where arXiv categories are split in subcategories we barely have records.

That's why it would be good to facet in a hierarchical way, so that we can have the full math category collapsed into math at first, but with the possibility to expand it if one is interested in math.RT but not math.AP.
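
To make the idea concrete, here is a rough sketch of what hierarchical facet counts could look like. It is only an illustration: the function name `hierarchical_facets` and the sample records are made up, and it assumes the top-level archive is simply whatever precedes the first dot in an arXiv category (so `math.RT` rolls up into `math`, while `hep-th` is its own archive).

```python
from collections import Counter, defaultdict


def hierarchical_facets(records):
    """Return (top_counts, sub_counts) for a list of per-record category lists.

    Hypothetical helper, not part of INSPIRE: top_counts holds the collapsed
    facet (one count per archive), sub_counts[archive] the expanded one.
    """
    top_counts = Counter()
    sub_counts = defaultdict(Counter)
    for categories in records:
        archives_seen = set()
        for category in set(categories):
            # Assumption: the archive is the part before the first dot.
            archive = category.split(".", 1)[0]
            sub_counts[archive][category] += 1
            archives_seen.add(archive)
        # A record counts once per archive, even with several subcategories.
        top_counts.update(archives_seen)
    return top_counts, sub_counts


# Made-up sample records, each with its list of arXiv categories.
records = [
    ["math.RT", "hep-th"],
    ["math.AP"],
    ["math.RT", "math.AP"],
    ["hep-th"],
]
top, sub = hierarchical_facets(records)
print(dict(top))          # collapsed facet: one entry per archive
print(dict(sub["math"]))  # expanded facet for the math archive
```

The collapsed view is what a user would see first; the per-archive counters are only needed once someone expands math to pick math.RT but not math.AP.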

> The assignment of categories has to be feasible which it is not if the categories are too finegrained. It's mainly Florian and me who assign categories to the non-arXiv stuff, not a bunch of moderators.

I was mentioning moderators for the arXiv content. For non-arXiv, we could do it ourselves, but based on the arXiv top-level categories if it's something we don't care about (so physics or math or cs or ...).

> Lesson learned from the reharvest of nucl journals, where we used magpie to guess the subject. 30% I dumped right away, for about 20% I went through and made corrections, about 50% were useful. Which means even if we run magpie over everything that doesn't come from arXiv we still have ~10k/y to assign manually.

What I have seen of magpie was very impressive. I would guess that magpie didn't have good training data in this case. By adopting arXiv categories, we could very easily train magpie on the arXiv corpus.

kaplun assigned rikirenz Nov 25, 2016

david-caro mentioned this issue Dec 5, 2016

Refactoring categories #47

Merged

michamos assigned StellaCh and rikirenz and unassigned rikirenz Jan 16, 2017
