Privilege INSPIRE categories #41

Open
kaplun opened this issue Nov 18, 2016 · 50 comments

@kaplun
Contributor

kaplun commented Nov 18, 2016

Currently INSPIRE categories are stored at the same level as arXiv categories and other externally provided categories.

However, while catalogers are not supposed to touch externally provided categories, they are expected to curate INSPIRE categories. This puts INSPIRE categories in a privileged position.

It is proposed to move INSPIRE categories to a dedicated field, so that they are easier to edit (e.g. autocomplete can be enforced on the exact INSPIRE categories).

@rikirenz
Contributor

cc: @david-caro

@kaplun
Contributor Author

kaplun commented Nov 25, 2016

So we could split it into INSPIRE_categories and external_categories. INSPIRE_categories will not need any scheme or source, so it is just a list of strings.

@david-caro
Contributor

I'm always confused about caps and lowercase, isn't it better to keep everything in lowercase? (As if they were variable names in python)

@kaplun
Contributor Author

kaplun commented Nov 27, 2016

Sure!

@kaplun
Contributor Author

kaplun commented Dec 2, 2016

@inspirehep/inspire-content @inspirehep/inspire-dir

Given the following facts:

  • In the last 6 months only ~0.01% users queried INSPIRE using arXiv categories (or any category at all)
  • arXiv categories are not available for all types of documents

Given the following thoughts:

  • Faceting for ~10 categories (INSPIRE) is nicer than faceting for ~30 (arXiv)
  • Using neural network (magpie) to guess ~10 categories (INSPIRE) is more precise and reliable than ~30 categories (arXiv)

It is proposed here to simplify the data model, and hence the curation interface, in the following way:

  • store only INSPIRE categories
  • upon ingestion, external categories (e.g. arXiv) are mapped to INSPIRE ones
  • if there is no external category, we use magpie to guess one based on the abstract
  • upon migration from legacy, we preserve INSPIRE categories and map existing arXiv ones into INSPIRE
  • present to users a facet based on INSPIRE categories

Edit: See below

@david-caro
Contributor

So we only use magpie if there are no external categories?

@kaplun
Contributor Author

kaplun commented Dec 2, 2016

If we don't have any INSPIRE category (they might come through a different mapping in hepcrawl, for example)

@salmele

salmele commented Dec 2, 2016

I strongly disagree.

Our user community understands what hep-th means, for instance, and ought to be able to search accordingly

Possibly we could have a 'topic' search, covering all of INSPIRE's present 'categories', and an arXiv-category search, which is meant to work only on arXiv content and not on a second-guessed INSPIRE attribution.

@jacquerie
Contributor

I think @michamos can contribute his idea here.

@kaplun
Contributor Author

kaplun commented Dec 2, 2016

OK: New proposal:

  • we store arXiv categories (originated from arXiv) into arXiv categories
  • we store INSPIRE categories (or topics or subjects) into INSPIRE categories:
    • upon migration from legacy: copying them if they already exist
    • upon migration from legacy: generating them from arXiv categories when available
    • using magpie to guess them from the abstract (or title) in all the other cases.
  • We propose to present to user 2 separate facets (initially folded, in order not to waste too much space):
    • INSPIRE categories/topics/subjects (easy): human-friendly names
    • arXiv categories: possibly presented in a foldable tree structure to represent the complex hierarchy of arXiv. Anyway, only categories for which we have papers will be displayed. (in principl

Notes: in principle all papers/records of INSPIRE should belong to at least one INSPIRE category. In contrast, only records harvested from arXiv will belong to a given arXiv category.
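
A minimal sketch of the assignment rule proposed above, for illustration only. The function name `assign_inspire_categories`, the record keys, and the `guess_from_abstract` stand-in for magpie are assumptions, not existing inspire-next code. The three mapping entries are taken from the mapping posted later in this thread.

```python
# Hypothetical excerpt of an arXiv -> INSPIRE mapping (illustration only).
ARXIV_TO_INSPIRE = {
    "hep-th": "Theory-HEP",
    "hep-ph": "Phenomenology-HEP",
    "gr-qc": "Gravitation and Cosmology",
}


def guess_from_abstract(abstract):
    """Stand-in for a magpie-like classifier; always returns one fixed guess."""
    return ["Other"]


def assign_inspire_categories(record, guess=guess_from_abstract):
    """Return INSPIRE categories for a record, in order of preference:

    1. keep INSPIRE categories that already exist,
    2. otherwise derive them from the arXiv categories,
    3. otherwise guess them from the abstract.
    """
    existing = record.get("inspire_categories")
    if existing:
        return list(existing)

    derived = []
    for arxiv_category in record.get("arxiv_categories", []):
        inspire_category = ARXIV_TO_INSPIRE.get(arxiv_category)
        # Skip unmapped categories and avoid repeating a derived category.
        if inspire_category and inspire_category not in derived:
            derived.append(inspire_category)
    if derived:
        return derived

    return guess(record.get("abstract", ""))
```

The arXiv categories themselves are only read here, never modified.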

@annetteholtkamp

annetteholtkamp commented Dec 3, 2016 via email

@bing13

bing13 commented Dec 3, 2016 via email

@aw-bib

aw-bib commented Dec 5, 2016

might be some non-misleading way

Sort of a hierarchical facet comes to mind. First arXiv, then its categories.

  • arxiv
    • hep-ph
    • hep-th
    • ...
  • inspire
    • ...

BTW: Quite often one does not search for a term, but having the keyword on display gives additional clues about the content. hep-th is usually a different sort of paper than hep-ph or hep-ex. IOW it's not only computers looking at the data, thus IMHO bean counting alone doesn't suffice.
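
As a rough sketch of how such a two-level facet could be computed, assuming records that carry `arxiv_categories` and `inspire_categories` as plain lists of strings (both key names and the function `build_facet_tree` are hypothetical):

```python
from collections import Counter, defaultdict


def build_facet_tree(records):
    """Count records per category, grouped under 'arxiv' and 'inspire'.

    Only categories that actually occur in the records show up in the tree.
    """
    tree = defaultdict(Counter)
    for record in records:
        for category in record.get("arxiv_categories", []):
            tree["arxiv"][category] += 1
        for category in record.get("inspire_categories", []):
            tree["inspire"][category] += 1
    return {scheme: dict(counts) for scheme, counts in tree.items()}


records = [
    {"arxiv_categories": ["hep-th"], "inspire_categories": ["Theory-HEP"]},
    {"arxiv_categories": ["hep-th", "hep-ph"], "inspire_categories": ["Theory-HEP"]},
    {"inspire_categories": ["Lattice"]},  # a record that is not on arXiv
]
facets = build_facet_tree(records)
```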

@suenjedt

suenjedt commented Dec 5, 2016

I think this is a perfect example where more data/evidence is needed, as well as more testing - at least for the user facing part. I would recommend consulting Stella for this.

@david-caro
Contributor

Okok, for now, we all seem to agree on storing arXiv + inspire categories, just not yet sure how to expose that info to the users right?
(so @rikirenz can keep working on the backend part)

@ksachs

ksachs commented Dec 5, 2016

  • upon migration from legacy: generating them from arXiv categories when available
    Only if there is no INSPIRE category!

Florian and I delete INSPIRE categories derived from arXiv categories if they don't fit.
Usually these are cross-listings which on arXiv mean 'also of interest for...' whereas on INSPIRE they are equivalent to the main category.

Concerning facets: I would use only INSPIRE categories (gut feeling, no strong opinion). If people really want to search for arXiv categories they can still do that explicitly - right?

@kaplun
Contributor Author

kaplun commented Dec 5, 2016

Florian and I delete INSPIRE categories derived from arXiv categories if they don't fit.

Who/what process has added the wrong INSPIRE category in the first place? We currently have a mapping that unambiguously maps an arXiv category to an INSPIRE one:

https://github.com/inspirehep/inspire-next/blob/master/inspirehep/config.py#L1445-L1604

Concerning facets: I would use only INSPIRE categories (gut feeling, no strong opinion). If people really want to search for arXiv categories they can still do that explicitly - right?

In principle both searches are possible, but we have stats demonstrating that users are not able to search with them, via search syntax. On the other hand facets are very easy to manipulate.
Indeed there is no hurry on this, so we can indeed see the results of user testing the alternatives.

@ksachs

ksachs commented Dec 5, 2016

Not wrong in the sense of 'wrong mapping', but wrong in the sense of 'this is not about Experiment-HEP'.
At arXiv articles get cross-listed for better visibility 'also of interest for...', at INSPIRE the subject always means 'is about', we don't have the notion of secondary category.

@kaplun
Contributor Author

kaplun commented Dec 5, 2016

In the new model a record can have more than one INSPIRE subject, so crosslisting is preserved.

@kaplun
Contributor Author

kaplun commented Dec 5, 2016

IMHO users will not be harmed by having a record belong to multiple categories. It's just a way for a paper to be reached. Once reached, the user will be able to make their own judgement.
But indeed we are making several assumptions here. So worth doing a user-checking.

@ksachs

ksachs commented Dec 5, 2016

side effect: everything with an INSPIRE category *-HEP will (automatically) get CORE.
I don't need / want user-checking to know that The Escaramujo Project: instrumentation courses during a road trip across the Americas is not CORE. And I want to be in control of INSPIRE subjects even for arXiv papers. If you keep arXiv and INSPIRE categories in sync we completely depend on arXiv.
So: derive INSPIRE category automatically from arXiv on ingestion: perfect. But later on they should be independent.

@david-caro
Contributor

@ksachs I think that is the idea, the INSPIRE categories are only guessed/automatically added if they are not there already, and only on ingestion. After that, they can be manually changed if needed.

The point that @kaplun is trying to make, is that we can have multiple INSPIRE ones, just as we have multiple arXiv ones, with the same treatment to main category and multiple secondary ones.

@kaplun
Contributor Author

kaplun commented Dec 5, 2016

Yeah, plus I was really curious to know if there was maybe something we could correct, since you were mentioning that you were actually removing INSPIRE categories. Fully understood now.

@jacquerie
Contributor

jacquerie commented Dec 5, 2016

  • In the last 6 months only ~0.01% users queried INSPIRE using arXiv categories (or any category at all)

Correcting myself: the ratio of

SELECT count(*)
FROM piwik_log_action, piwik_log_link_visit_action
WHERE (
  type=8 AND
  idaction_name=idaction AND
  name LIKE '% fc %'
);

and

SELECT count(*)
FROM piwik_log_action, piwik_log_link_visit_action
WHERE (
  type=8 AND
  idaction_name=idaction
);

is 0.07%. Still low, but mostly I wanted to write down how to compute this kind of number if someone else is interested.

Also: this is the ratio of queries containing fc, not users. Computing the ratio of users can be done, but is slightly harder.

@jacquerie
Contributor

jacquerie commented Dec 5, 2016

It's actually pretty easy. The ratio of:

SELECT COUNT(DISTINCT idvisitor)
FROM piwik_log_action, piwik_log_link_visit_action
WHERE (
  type=8 AND
  idaction_name=idaction AND
  name LIKE '% fc %'
);

and

SELECT COUNT(DISTINCT idvisitor)
FROM piwik_log_action, piwik_log_link_visit_action
WHERE (
  type=8 AND
  idaction_name=idaction
);

is 0.1%.

That is, of all distinct users that made at least a query on INSPIRE in the last six months, 0.1% of them used the fc keyword.
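
For anyone who wants to try the computation without access to the real logs, here is a small self-contained sketch using Python's `sqlite3` with an in-memory database. The two tables contain only the columns used in the queries above, and all rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE piwik_log_action (idaction INTEGER, name TEXT, type INTEGER);
CREATE TABLE piwik_log_link_visit_action (idvisitor TEXT, idaction_name INTEGER);
-- Invented sample data: three distinct queries made by four visitors.
INSERT INTO piwik_log_action VALUES
  (1, 'find fc t and a witten', 8),
  (2, 'find a witten', 8),
  (3, 'find t entropy', 8);
INSERT INTO piwik_log_link_visit_action VALUES
  ('u1', 1), ('u1', 2), ('u2', 2), ('u3', 3), ('u4', 3);
""")

# Same conditions as the queries above, written with an explicit JOIN.
BASE = """
SELECT COUNT(DISTINCT idvisitor)
FROM piwik_log_action
JOIN piwik_log_link_visit_action ON idaction_name = idaction
WHERE type = 8 {extra}
"""

fc_users = conn.execute(BASE.format(extra="AND name LIKE '% fc %'")).fetchone()[0]
all_users = conn.execute(BASE.format(extra="")).fetchone()[0]
ratio = fc_users / all_users
```

With the sample rows, one visitor out of four used fc, so `ratio` is 0.25.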

@kaplun
Contributor Author

kaplun commented Dec 5, 2016

@annetteholtkamp rightly points out that it does make quite a difference to know whether an INSPIRE category has been:

  • guessed by magpie (source magpie)
  • derived automatically from arXiv categories (source arxiv)
  • overridden by a cataloger (source cataloger).

So in the end it seems our data structure still needs to capture the source for INSPIRE categories.

@david-caro
Contributor

If the inspire categories guess is done only on ingestion and only if not there already, I don't think that it adds much value to track the source of the categories, let me explain (it's very possible that I just don't get your point ;), I don't see @annetteholtkamp rationale here).

The arxiv keywords come from outside, and will never change, so letting them have their own section already expresses the source (passive info).

The inspire ones, we just care if they are right or not, if they are not right, we just change them, no matter if they come from magpie, if they came from arxiv or from a cataloger no?

I only see a couple of use cases where we would want that info, to train better magpie by comparing the result it gave with the one from the cataloger, or something like getting a list of all the records with automated categories... though I don't see that being very useful, as we can just train magpie with the whole set of data just expecting it to be correct (as it passed validation from a cataloger).

@kaplun
Contributor Author

kaplun commented Dec 5, 2016

How do you envisage a workflow where a cataloger checks a record only 2 weeks after it arrives on INSPIRE? At that point we have already ingested it, and the INSPIRE category has already been guessed by arXiv/magpie. How can the cataloger know whether to trust the INSPIRE category?

Besides this small point, I agree with you that in general we don't care about arXiv vs. magpie.

@david-caro
Contributor

Well, it does not matter who put it no? Just if it is correct or not, and for that you don't care of the origin right?
Unless, there's some extra work associated with checking the correctness itself. In that case it makes sense to add that 'already verified' flag.

@ksachs

ksachs commented Dec 6, 2016

It's about the trust level. Up to now the only sources for subject categories were catalogers (Annette, Florian and me) or arXiv (obvious to spot). If we have more sources (magpie, journal category, ...) it would be nice to know whether it is worth a second look.

@salmele

salmele commented Dec 6, 2016

Excellent point, @ksachs. This calls for the notion of provenance (for later audit should the case be) in the record. Didn't Invenio offer something on that?

@kaplun
Contributor Author

kaplun commented Dec 6, 2016

@salmele nothing out of the box. It's up to us to decide what provenance to store for what and where, and to implement support for it.

Invenio supports basic history information, such as: this version of record X was touched by curator Y. But we don't know which field is actually involved.

OK. So all in all, we don't simplify the data model WRT INSPIRE subjects, but we keep them as a list of objects with two attributes: source (enum with curator, magpie, arxiv) and value (enum among the current list of subjects).

BTW how shall we call it? inspire_subjects or _categories or _topics?

@david-caro
Contributor

@salmele Quick question, what should we do when:

  • Having multiple inspire categories with different sources but same term?
    (I'd say we keep them and let the curator remove the duplicates if need be).
  • Having categories (coming from legacy) that have no source?
    (I'd say put there something like 'undefined' to allow checking them later, should be a small amount if any)

This (and choosing the name) is currently blocking @rikirenz on #47.

@salmele

salmele commented Dec 7, 2016

I am concerned with preserving arXiv categories for arXiv-ingested preprints so that users can (faceted) search for them, as they are of course aware of what arXiv categories are.

I am neutral about what to do with INSPIRE subjects and I recommend @ksachs or @annetteholtkamp decide how they want to define the ontology of sources and their priority

@bing13

bing13 commented Dec 7, 2016 via email

@kaplun
Contributor Author

kaplun commented Dec 7, 2016

@david-caro IMHO there's not going to be arXiv+magpie (since magpie is used only in case of no arXiv). and cataloger wins over the two. So in case there is same term for arXiv and cataloger we preserve the cataloger one.
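
One way to read this precedence rule, as a sketch only: when the same term appears with several sources, keep the entry with the most trusted source. The dictionary keys `term` and `source`, the name `dedupe_by_term`, and the relative order of arxiv and magpie are assumptions. The thread only states that the cataloger wins over both.

```python
# Lower number = more trusted. Only "cataloger first" comes from the thread;
# the relative order of arxiv and magpie is an assumption.
SOURCE_PRIORITY = {"cataloger": 0, "arxiv": 1, "magpie": 2}


def dedupe_by_term(categories):
    """Keep one entry per term, preferring the most trusted source."""
    best = {}
    for category in categories:
        term = category["term"]
        current = best.get(term)
        if (
            current is None
            or SOURCE_PRIORITY[category["source"]] < SOURCE_PRIORITY[current["source"]]
        ):
            best[term] = category
    return list(best.values())


categories = [
    {"term": "Theory-HEP", "source": "arxiv"},
    {"term": "Theory-HEP", "source": "cataloger"},
    {"term": "Lattice", "source": "magpie"},
]
deduped = dedupe_by_term(categories)
```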

@annetteholtkamp

annetteholtkamp commented Dec 7, 2016 via email

@salmele

salmele commented Dec 7, 2016

For the avoidance of doubt: using magpie or human classification on arXiv would create a new INSPIRE subject and NOT modify the arXiv category (and cross-listing) as chosen by the submitter (which we'd show somewhere else)... right?

@kaplun
Contributor Author

kaplun commented Dec 7, 2016

@salmele right: arXiv categories are never going to be touched.

@david-caro
Contributor

So I propose then having this:

  • A field arxiv_categories with the categories list from arxiv, that are just populated on ingestion.
  • A field inspire_categories with a list of term-source tuples that must be unique, in the sense that there's no repeated tuple, though it might be that we have the same term with different sources:
    • Where term is the actual name of the category it represents.
    • source is one of (this list might get extended in time):
      • cataloger (for manually set categories)
      • arxiv (for categories derived from the arxiv ones)
      • magpie (for automatically guessed categories)
      • undefined (for migrated records that have no source on their categories, not sure if there are any though, but just in case).
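
A sketch of what a record could look like under this proposal, with a small check for the uniqueness constraint. The field names follow the list above; the helper `validate_inspire_categories` and the sample values are hypothetical.

```python
ALLOWED_SOURCES = {"cataloger", "arxiv", "magpie", "undefined"}


def validate_inspire_categories(categories):
    """Check that each source is known and no (term, source) pair repeats."""
    seen = set()
    for category in categories:
        pair = (category["term"], category["source"])
        if category["source"] not in ALLOWED_SOURCES:
            raise ValueError(f"unknown source: {category['source']!r}")
        if pair in seen:
            raise ValueError(f"repeated term/source pair: {pair!r}")
        seen.add(pair)
    return True


record = {
    "arxiv_categories": ["hep-th", "cond-mat.stat-mech"],
    "inspire_categories": [
        # Same term with different sources is allowed.
        {"term": "Theory-HEP", "source": "arxiv"},
        {"term": "Theory-HEP", "source": "cataloger"},
    ],
}
is_valid = validate_inspire_categories(record["inspire_categories"])
```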

@kaplun
Contributor Author

kaplun commented Dec 7, 2016

Do we want to go for the undefined? In general so far, when something is undefined we simply haven't specified a value. In Invenio 1 this had some implications because it wasn't possible to search for those things having no-value. Not sure for Elasticsearch. @jacquerie ?

@salmele

salmele commented Dec 7, 2016

A field arxiv_categories with the categories list from arxiv, that are just populated on ingestion

...preserving the concept of primary category in arXiv (current PRIMARC in SPIRES syntax and dedicated MARC field)

@kaplun
Contributor Author

kaplun commented Dec 7, 2016

Yup. This is something that can be implemented at search time. The first arXiv category in the list of arXiv categories will be searchable also with primarc.

(Ticketized in inspirehep/inspire-next#1791 so we don't forget)

@michamos
Contributor

michamos commented Dec 7, 2016

TL;DR: adopt arXiv categories instead of INSPIRE categories

My opinion is rather different from those expressed before.
So let me summarize some of the different viewpoints as far as I understand them:

  • @salmele cares about keeping arXiv categories present and unmodified, as he needs it for SCOAP3 statistics
  • @ksachs cares about being able to override the INSPIRE category for cross-lists that are only weakly relevant
  • @kaplun cares about having a small number of categories so that magpie can guess them with high confidence

One point of view is largely missing, the user point of view, which is probably the most important. I was a simple user until not too long ago, and my point of view is that INSPIRE categories are terrible, and I would suggest getting rid of them in their current form, for the following reasons.

nobody knows about them

as @jacquerie showed, ~0.1% of our users ever tried to perform a search with the 'fc' keyword. Of those, I am sure a large fraction is doing it wrong, e.g. fc hep-th instead of fc t. To the contrary, arXiv categories have been user-visible for 25 years and everyone knows what hep-th means. This goes to the extent that even inside INSPIRE, we use arXiv categories for HEPNAMES and jobs instead of the INSPIRE ones. So we don't have any legacy to preserve here, and we can use the migration as an opportunity to rethink the whole concept. I would suggest sticking as closely as possible to the arXiv categories that the people are familiar with. (For anecdotal evidence, I had never heard of INSPIRE categories as a user, and am still struggling to remember the less common ones.)

they are confusing

there are three types of categories:

  • small categories are conceptually exactly the same as an arXiv category (Theory-HEP), but with a different name, and additional papers not on the arXiv
  • medium categories regroup a limited number of categories (Instrumentation)
  • large categories encompass a huge number of arXiv categories (Math and Math Physics)

See the list at the bottom for the INSPIRE categories and the arXiv categories that map to each of them. They are inconsistent, as General Physics is NOT the same as physics.gen-ph (which is a euphemism for garbage on the arXiv), and Data Analysis and Statistics does not contain the stat.* statistics categories. And who knows what Other means?

they are largely useless

the purpose of categories is to organize the records in such a way that the user can easily filter them and focus on what she is interested in. Either she is interested in a small category, in which case she is looking conceptually for a single arXiv category, but with another name, or she is not. In the latter case, the category is probably useless anyway. For example, if someone wants to know more about entanglement entropy (which is a hot subject right now in hep-th, but was born in quant-ph and is used quite a lot in cond-mat), he is not interested in whether the paper is General Physics, but in the distinction between quant-ph and cond-mat.*. More importantly, our coverage will probably be quite poor in those areas, but the user is not necessarily aware of this fact.

solution: adopt arXiv categories

So I would suggest overhauling INSPIRE categories to match arXiv categories.
For arXiv papers, we adopt the arXiv category as the INSPIRE category (keeping the distinction primary/secondary also on INSPIRE, and allowing the user to search/facet based on primary only or both primary and secondary). For records that do not come from arXiv, we put them in a specific category (e.g. physics.acc-ph) if we care about it, or we put them in a top-level category (e.g. astro-ph or math) if we don't want to zoom in.

If we keep the primary/secondary distinction, I don't think there would be any need for overriding categories (maybe @ksachs still sees a usecase), as we can trust the arXiv moderators which vastly outnumber the handful of people assigning categories in INSPIRE and are active researchers in their fields. But we could still have the provenance field just to be sure, and for the case in which we add a non-arXiv record that is later added to arXiv.
And assigning articles to top-level categories should not be too difficult, and the number of them still manageable for magpie (with the added benefit that we could train it on arXiv data).


{'Accelerators': ['physics.acc-ph'],
 'Astrophysics': ['physics.space-ph',
  'astro-ph.EP',
  'astro-ph.GA',
  'astro-ph.SR',
  'astro-ph.CO',
  'astro-ph',
  'astro-ph.HE'],
 'Computing': ['cs.ET',
  'cs.RO',
  'cs.CY',
  'cs.NE',
  'cs.DB',
  'cs.DC',
  'cs.SI',
  'cs.DL',
  'cs.CV',
  'cs.FL',
  'cs.SD',
  'cs.PF',
  'cs.LG',
  'cs.DS',
  'cs.OH',
  'cs.OS',
  'cs.LO',
  'cs.MM',
  'cs.AI',
  'physics.comp-ph',
  'cs.GT',
  'cs.IR',
  'cs.NA',
  'cs.SE',
  'cs.CL',
  'cs.CG',
  'cs.DM',
  'cs.SY',
  'cs.GL',
  'cs.IT',
  'cs.CR',
  'cs.MS',
  'cs.SC',
  'cs.CC',
  'cs.AR',
  'cs.GR',
  'cs.NI',
  'cs.MA',
  'cs.PL',
  'cs.CE',
  'cs.HC',
  'cs'],
 'Data Analysis and Statistics': ['physics.data-an'],
 'Experiment-HEP': ['hep-ex'],
 'Experiment-Nucl': ['nucl-ex'],
 'General Physics': ['quant-ph',
  'cond-mat.stat-mech',
  'cond-mat.mes-hall',
  'cond-mat.supr-con',
  'physics.plasm-ph',
  'cond-mat',
  'cond-mat.other',
  'physics.class-ph',
  'nlin',
  'cond-mat.quant-gas',
  'cond-mat.dis-nn',
  'nlin.CD',
  'cond-mat.soft',
  'nlin.CG',
  'cond-mat.str-el',
  'physics.ao-ph',
  'cond-mat.mtrl-sci',
  'physics.gen-ph',
  'nlin.AO',
  'physics.atm-clus',
  'physics.flu-dyn',
  'physics.atom-ph',
  'physics.optics',
  'physics',
  'physics.geo-ph'],
 'Gravitation and Cosmology': ['gr-qc'],
 'Instrumentation': ['astro-ph.IM', 'physics.ins-det'],
 'Lattice': ['hep-lat'],
 'Math and Math Physics': ['patt-sol',
  'math.GT',
  'math.CV',
  'math.MP',
  'math.GM',
  'math.PR',
  'math.GR',
  'math.DG',
  'math.NA',
  'math.AP',
  'math.CA',
  'math.LO',
  'math.NT',
  'math.AG',
  'math.KT',
  'q-alg',
  'math.ST',
  'math.CT',
  'math.QA',
  'alg-geom',
  'math',
  'math.DS',
  'math.FA',
  'math.CO',
  'math.SP',
  'math.MG',
  'math.GN',
  'math.AT',
  'nlin.PS',
  'math.OC',
  'math.SG',
  'math.HO',
  'math.RT',
  'math.IT',
  'math.RA',
  'math.OA',
  'math-ph',
  'dg-ga',
  'math.AC',
  'solv-int',
  'nlin.SI'],
 'Other': ['q-fin.TR',
  'q-bio.PE',
  'q-bio.CB',
  'q-bio.BM',
  'q-fin.GN',
  'q-fin.PR',
  'stat.AP',
  'physics.chem-ph',
  'physics.pop-ph',
  'q-bio.MN',
  'stat.CO',
  'stat.ML',
  'physics.hist-ph',
  'q-fin.CP',
  'stat.OT',
  'q-bio.TO',
  'q-fin.EC',
  'q-bio.GN',
  'q-fin.PM',
  'physics.med-ph',
  'stat.TH',
  'physics.bio-ph',
  'q-bio.SC',
  'physics.soc-ph',
  'physics.ed-ph',
  'q-bio.OT',
  'q-bio.QM',
  'q-fin.ST',
  'q-bio.NC',
  'q-fin.RM',
  'q-fin.MF',
  'stat.ME'],
 'Phenomenology-HEP': ['hep-ph'],
 'Theory-HEP': ['hep-th'],
 'Theory-Nucl': ['nucl-th']}
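
For lookups in the other direction (arXiv category to INSPIRE category), the dictionary above can be inverted. A sketch using a three-entry excerpt of the mapping; the function name `invert` and the variable names are hypothetical, and the duplicate check assumes each arXiv category maps to exactly one INSPIRE category.

```python
# Excerpt of the mapping above, keyed by INSPIRE category.
INSPIRE_TO_ARXIV = {
    "Instrumentation": ["astro-ph.IM", "physics.ins-det"],
    "Lattice": ["hep-lat"],
    "Theory-HEP": ["hep-th"],
}


def invert(mapping):
    """Build an arXiv -> INSPIRE lookup, rejecting ambiguous arXiv categories."""
    inverted = {}
    for inspire_category, arxiv_categories in mapping.items():
        for arxiv_category in arxiv_categories:
            if arxiv_category in inverted:
                raise ValueError(f"{arxiv_category!r} maps to two INSPIRE categories")
            inverted[arxiv_category] = inspire_category
    return inverted


ARXIV_TO_INSPIRE = invert(INSPIRE_TO_ARXIV)
```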

@david-caro
Contributor

IMO we are guessing too much, the inspire categories have been hidden so far so we don't have usage data, I'd try to get some from the users before doing any big effort either way (that might require some effort too, but having the machinery to easily get that kind of things will help on deciding on other features too).

@kaplun
Contributor Author

kaplun commented Dec 7, 2016

@michamos for us developers, can you exactly define or point us to the definition of what is:

  • primary arXiv category
  • secondary arXiv category
  • top-level arXiv category

E.g. with some concrete example to reason about.

@michamos
Contributor

michamos commented Dec 7, 2016

@michamos for us developers, can you exactly define or point us to the definition of what is:

primary arXiv category

the thing we put into 037, e.g. for https://inspirehep.net/record/1501963 it is hep-th

secondary arXiv category

for the same record, they are cond-mat.stat-mech and physics.flu-dyn

top-level arXiv category

a category in bold on https://arxiv.org/, which might have subcategories, so hep-th or physics.* (probably using math* instead of math.* though to take also math-ph).

E.g. with some concrete example to reason about.
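
In code, these three notions could be sketched as follows, using the categories of the record cited above. The function names are hypothetical, and treating the first list entry as the primary category is an assumption based on the earlier comment about primarc.

```python
def split_primary_secondary(arxiv_categories):
    """Treat the first category as primary and the rest as secondary."""
    primary, *secondary = arxiv_categories
    return primary, secondary


def top_level(arxiv_category):
    """Return the part before the first dot, e.g. 'physics' for 'physics.flu-dyn'."""
    return arxiv_category.split(".", 1)[0]


categories = ["hep-th", "cond-mat.stat-mech", "physics.flu-dyn"]
primary, secondary = split_primary_secondary(categories)
top_levels = [top_level(category) for category in categories]
```

Note that `top_level("math-ph")` returns `"math-ph"`, so grouping math-ph under math, as suggested above, would need an extra rule.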

@michamos
Contributor

michamos commented Dec 7, 2016

IMO we are guessing too much, the inspire categories have been hidden so far so we don't have usage data

the fact they are hidden and completely unknown is an opportunity to rethink INSPIRE categories IMHO

@ksachs

ksachs commented Dec 7, 2016

Hi Micha,

thanks for your detailed input from the user side.
Point taken - visibility is an issue.
A pity that we don't have usage statistics from SPIRES - that's our legacy.
Both field-codes (= subject, category, ...) and keywords are neglected on INSPIRE.

However you are talking about something you barely know (as you say yourself).
The main use-case was that people would browse (we even had printed lists) through the daily inputs in their category. And from the arXiv listings I believe this is still the case.

First some statistics - records added the last 2 years:

  year   2016 / 2015 / 2015core
  all   76795 / 51412 / 28802
  arxiv 22558 / 25454 / 17228
  a     10595 / 10957 / 4982
  b      6667 / 6007 / 1494
  c       912 / 610 / 351
  e      5313 / 5311 / 5087
  g      5941 / 6463 / 4479
  i      5143 / 5086 / 3171
  l      1063 / 1205 / 1202
  m      2718 / 3160 / 1548
  n    19345* / 4451 / 2064
  o       422 / 503 / 169
  p      9304 / 9008 / 8750
  q      3894 / 4194 / 1692
  t      6819 / 7422 / 7304
  x    17809* / 3717 / 1568
  * reharvest of nucl journals

Btw: other = non-physics

We have one big category (with a lot of non-core stuff we don't want to waste time on): astro; two small categories: lattice and computing; all the rest is around 5k/y. This shows that the balance is not too bad.

You forget that the INSPIRE content is not the same as the arXiv content. Only about half of our records come from arXiv. And we have only a very selective part of arXiv. Where arXiv categories are split in subcategories we barely have records. For core records the big categories are p (9k), t (7k) and e (5k) which we might want to break down into smaller sub-categories; which SPIRES used to have, but that was before my time. Actually SPIRES had categories before arXiv existed.

The assignment of categories has to be feasible, which it is not if the categories are too fine-grained. It's mainly Florian and me who assign categories to the non-arXiv stuff, not a bunch of moderators.

Lesson learned from the reharvest of nucl journals, where we used magpie to guess the subject. 10% I dumped right away, for about 20% I went through and made corrections, about 70% were useful. Which means even if we run magpie over everything that doesn't come from arXiv we still have ~7k/y to assign manually.

@michamos
Contributor

michamos commented Dec 7, 2016

@ksachs thanks for your comments. I agree that having categories is very useful, I just think those categories should be closer to the arXiv ones that people know. In this way we can avoid having our users learn and remember what from their point of view is a new classification scheme which is very similar yet subtly different in some areas.

The main use-case was that people would browse (we even had printed lists) through the
daily inputs in their category. And from the arXiv listings I believe this is still
the case.

People browse the new arXiv listings (e.g. hep-th/new), it would actually be very nice if one could do the same on INSPIRE, with all new papers in a given category. So by adopting arXiv categories on INSPIRE, one could get, say, arXiv hep-th + non-arXiv hep-th additions.

Btw: other = non-physics

I know that Other = None of the rest, but in order to know what it means precisely one has to know all the other categories that we have

You forget that the INSPIRE content is not the same as the arXiv content.
Only about half of our records come from arXiv. And we have only a very selective
part of arXiv. Where arXiv categories are split in subcategories we barely have records.

That's why it would be good to facet in a hierarchical way, so that we can have the full math category collapsed into math at first, but with the possibility to expand it if one is interested in math.RT but not math.AP.

The assignment of categories has to be feasible which it is not if the
categories are too fine-grained. It's mainly Florian and me who assign
categories to the non-arXiv stuff, not a bunch of moderators.

I was mentioning moderators for the arXiv content. For non-arXiv, we could do it ourselves, but based on the arXiv top-level categories if it's something we don't care about (so physics or math or cs or ...).

Lesson learned from the reharvest of nucl journals, where we used magpie to guess the
subject. 30% I dumped right away, for about 20% I went through and made corrections,
about 50% were useful. Which means even if we run magpie over everything that doesn't
come from arXiv we still have ~10k/y to assign manually.

What I have seen of magpie was very impressive. I would guess that magpie didn't have good training data in this case. By adopting arXiv categories, we could very easily train magpie on the arXiv corpus.

@michamos michamos assigned StellaCh and rikirenz and unassigned rikirenz Jan 16, 2017