Privilege INSPIRE categories #41
Comments
cc: @david-caro |
So we could split it in |
I'm always confused about caps and lowercase, isn't it better to keep everything in lowercase? (As if they were variable names in python) |
Sure! |
@inspirehep/inspire-content @inspirehep/inspire-dir Given the following facts:
Given the following thoughts:
Edit: See below |
So we only use magpie if there are no external categories? |
If we don't have any INSPIRE category (they might come through a different mapping in hepcrawl, for example) |
I strongly disagree. Our user community understands what hep-th means, for instance, and ought to be able to search accordingly. Possibly, we have a 'topic' search, which covers all of INSPIRE's present 'categories', and an arXiv-category search, which is meant to work only on arXiv content and not on second-guessed INSPIRE attribution. |
I think @michamos can contribute his idea here. |
OK: New proposal:
Notes: in principle all papers/records in INSPIRE should belong to at least one INSPIRE category. By contrast, only records harvested from arXiv will belong to any arXiv category. |
I agree that we should definitely store the arXiv categories. I'm not so sure whether they should appear as facets. As Sam already pointed out, only a subset of records has an arXiv category. I assume it wouldn't be evident to all users that focusing on hep-th, e.g., would give only arXiv papers. And why should we single out arXiv papers anyway?
- Annette
On Dec 2, 2016, at 4:48 PM, Samuele Kaplun wrote:
OK: New proposal:
* we store arXiv categories (originated from arXiv) into arXiv categories
* we store INSPIRE categories (or topics or subjects) into INSPIRE categories:
* upon migration from legacy: copying them if they already exist
* upon migration from legacy: generating them from arXiv categories when available
* using magpie to guess them from the abstract (or title) in all the other cases.
* We propose to present the user with 2 separate facets (initially folded, in order not to waste too much space):
* INSPIRE categories/topics/subjects (easy): human-friendly names
* arXiv categories: possibly presented as a foldable tree structure to represent the complex hierarchy of arXiv. In any case, only categories for which we have papers will be displayed. (in principl
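To make the order of the three rules in the proposal concrete, here is a minimal sketch. Everything in it is illustrative: `ARXIV_TO_INSPIRE` is only a three-entry excerpt of the real mapping, and `guess_from_text` is a hypothetical stand-in for magpie, not its actual API.

```python
# Excerpt of the arXiv -> INSPIRE mapping (the full one lives in inspire-next's config).
ARXIV_TO_INSPIRE = {
    "hep-th": "Theory-HEP",
    "hep-ph": "Phenomenology-HEP",
    "hep-ex": "Experiment-HEP",
}


def guess_from_text(abstract):
    """Hypothetical placeholder for a magpie call; returns a dummy guess."""
    return ["Other"] if abstract else []


def derive_inspire_categories(legacy_categories, arxiv_categories, abstract):
    """Apply the three rules of the proposal in order of preference."""
    # 1. Copy the INSPIRE categories from legacy if they already exist.
    if legacy_categories:
        return list(legacy_categories)
    # 2. Otherwise generate them from the arXiv categories, when available.
    mapped = []
    for category in arxiv_categories:
        term = ARXIV_TO_INSPIRE.get(category)
        if term and term not in mapped:
            mapped.append(term)
    if mapped:
        return mapped
    # 3. In all other cases, guess from the abstract (or title).
    return guess_from_text(abstract)
```

Note that the arXiv categories themselves are never modified here; they are only read to produce INSPIRE categories.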
|
I agree. But there might be some non-misleading way to expose such a
feature with clever UI, etc.
|
Sort of a hierarchical facet comes to mind. First arXiv, then its categories.
BTW: Quite often one does not search for a term, but having the keyword on display gives additional clues about the content. hep-th is usually a different sort of paper than hep-ph or hep-ex. IOW it's not only computers looking at the data, thus IMHO bean counting alone doesn't suffice. |
I think this is a perfect example where more data/evidence is needed, as well as more testing - at least for the user facing part. I would recommend consulting Stella for this. |
Okok, for now, we all seem to agree on storing arXiv + inspire categories, just not yet sure how to expose that info to the users right? |
Florian and I delete INSPIRE categories derived from arXiv categories if they don't fit. Concerning facets: I would use only INSPIRE categories (gut feeling, no strong opinion). If people really want to search for arXiv categories they can still do that explicitly - right? |
Who/what process has added the wrong INSPIRE category in the first place? We currently have a mapping that unambiguously maps an arXiv category to an INSPIRE one: https://github.com/inspirehep/inspire-next/blob/master/inspirehep/config.py#L1445-L1604
In principle both searches are possible, but we have stats demonstrating that users are not able to search with them, via search syntax. On the other hand facets are very easy to manipulate. |
Not wrong in the sense of 'wrong mapping', but wrong in the sense of 'this is not about Experiment-HEP'. |
In the new model a record can have more than one INSPIRE subject, so crosslisting is preserved. |
IMHO users will not be harmed by having a record belong to multiple categories. It's just a way for a paper to be reached. Once reached, the user will be able to make their own judgement. |
side effect: everything with an INSPIRE category *-HEP will (automatically) get CORE. |
@ksachs I think that is the idea, the INSPIRE categories are only guessed/automatically added if they are not there already, and only on ingestion. After that, they can be manually changed if needed. The point that @kaplun is trying to make, is that we can have multiple INSPIRE ones, just as we have multiple arXiv ones, with the same treatment to main category and multiple secondary ones. |
Yeah, plus I was really curious to know if there was maybe something we could correct, since you were mentioning that you were actually removing INSPIRE categories. Fully understood now. |
Correcting myself: the ratio of SELECT count(*)
FROM piwik_log_action, piwik_log_link_visit_action
WHERE (
type=8 AND
idaction_name=idaction AND
name LIKE '% fc %'
); and SELECT count(*)
FROM piwik_log_action, piwik_log_link_visit_action
WHERE (
type=8 AND
idaction_name=idaction
); is 0.07%. Still low, but mostly I wanted to write down how to compute this kind of number if someone else is interested. Also: this is the ratio of queries containing |
It's actually pretty easy. The ratio of: SELECT COUNT(DISTINCT idvisitor)
FROM piwik_log_action, piwik_log_link_visit_action
WHERE (
type=8 AND
idaction_name=idaction AND
name LIKE '% fc %'
); and SELECT COUNT(DISTINCT idvisitor)
FROM piwik_log_action, piwik_log_link_visit_action
WHERE (
type=8 AND
idaction_name=idaction
); is 0.1%. That is, of all distinct users that made at least a query on INSPIRE in the last six months, 0.1% of them used the |
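For anyone who wants to try the two ratios above without access to the Piwik database, here is a small sketch using an in-memory SQLite database. The toy tables contain only the columns the queries touch, and the rows are made up; the real Piwik tables have more columns and this is not the production setup.

```python
import sqlite3

# Toy copies of the two Piwik tables, restricted to the columns used in the queries.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE piwik_log_action (idaction INTEGER, name TEXT, type INTEGER);
CREATE TABLE piwik_log_link_visit_action (idvisitor TEXT, idaction_name INTEGER);
INSERT INTO piwik_log_action VALUES
    (1, 'find fc t and a witten', 8),
    (2, 'find a witten', 8),
    (3, 'find t entanglement', 8),
    (4, 'some page title', 1);
INSERT INTO piwik_log_link_visit_action VALUES
    ('alice', 1), ('alice', 2), ('bob', 2),
    ('carol', 3), ('dave', 3), ('dave', 4);
""")


def ratio(conn, count_expr):
    """Ratio of search actions containing ' fc ' to all search actions.

    count_expr is a trusted constant: 'COUNT(*)' counts queries,
    'COUNT(DISTINCT idvisitor)' counts distinct users.
    """
    base = (
        "SELECT {} FROM piwik_log_action, piwik_log_link_visit_action "
        "WHERE type = 8 AND idaction_name = idaction"
    ).format(count_expr)
    with_fc = conn.execute(base + " AND name LIKE '% fc %'").fetchone()[0]
    total = conn.execute(base).fetchone()[0]
    return with_fc / total if total else 0.0


queries_ratio = ratio(conn, "COUNT(*)")
visitors_ratio = ratio(conn, "COUNT(DISTINCT idvisitor)")
```

On the toy data one of five search actions and one of four distinct visitors match, so the two numbers differ, which is the same distinction as between the 0.07% and 0.1% figures above.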
@annetteholtkamp, rightly points out, of course, that it does make quite a difference to know if an INSPIRE category had been:
So finally, it seems we still need to capture the source for INSPIRE categories in our data structure. |
If the INSPIRE categories guess is done only on ingestion and only if they are not there already, I don't think it adds much value to track the source of the categories. Let me explain (it's very possible that I just don't get your point ;), I don't see @annetteholtkamp's rationale here). The arXiv keywords come from outside and will never change, so letting them have their own section already expresses the source (passive info). For the INSPIRE ones, we just care whether they are right or not; if they are not right, we just change them, no matter whether they came from magpie, from arXiv or from a cataloger, no? I only see a couple of use cases where we would want that info: training magpie better by comparing the result it gave with the one from the cataloger, or something like getting a list of all the records with automated categories... though I don't see that being very useful, as we can just train magpie with the whole set of data, expecting it to be correct (as it passed validation from a cataloger). |
How do you envisage a workflow where a cataloger checks a record only 2 weeks after it arrives on INSPIRE? At that point we have already ingested it, and the INSPIRE category has already been guessed from arXiv/magpie. How can the cataloger know whether to trust the INSPIRE category? Besides this small point I agree with you that in general we don't care about arXiv vs. magpie. |
Well, it does not matter who put it no? Just if it is correct or not, and for that you don't care of the origin right? |
It's about the trust level. Up to now the only sources for subject categories were catalogers (Annette, Florian and me) or arXiv (obvious to spot). If we have more sources (magpie, journal category, ...) it would be nice to know whether it is worth a second look. |
Excellent point, @ksachs. This calls for the notion of provenance (for later audit should the case be) in the record. Didn't Invenio offer something on that? |
@salmele nothing out of the box. It's up to us to decide what provenance to store for what and where, and to implement support for it. Invenio supports basic history information, such as: this version of record X was touched by curator Y, but we don't know which field is actually involved. OK. So all in all, we don't simplify the data model WRT INSPIRE subjects, but we keep them as a list of objects with two attributes: BTW how shall we call it? |
@salmele Quick question, what should we do when:
This (and choosing the name) is currently blocking @rikirenz on #47. |
I am concerned to preserve arXiv categories for arXiv-ingested preprints for users to (faceted) search for them, as they of course are aware of what arXiv categories are. I am neutral about what to do with INSPIRE subjects and I recommend @ksachs or @annetteholtkamp decide how they want to define the ontology of sources and their priority |
i concur.
|
@david-caro IMHO there's not going to be arXiv+magpie (since magpie is used only in case of no arXiv), and cataloger wins over the two. So in case there is the same term for arXiv and cataloger, we preserve the cataloger one. |
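A rough sketch of that precedence rule, assuming each INSPIRE subject is stored as an object with a term and a source. The key names `term` and `source` and the source labels are hypothetical (the naming is still open in this thread), and ranking arXiv above magpie is an assumption that only matters if both ever appear for the same term.

```python
# Lower rank wins. "cataloger wins over the two" comes from the discussion;
# arXiv ranking above magpie is an assumption made for this sketch.
SOURCE_RANK = {"cataloger": 0, "arxiv": 1, "magpie": 2}


def merge_subjects(subjects):
    """Keep one entry per term, preferring the most trusted source."""
    best = {}
    for subject in subjects:
        term = subject["term"]
        kept = best.get(term)
        if kept is None or SOURCE_RANK[subject["source"]] < SOURCE_RANK[kept["source"]]:
            best[term] = subject
    # Terms keep the order in which they first appeared.
    return list(best.values())


merged = merge_subjects([
    {"term": "Theory-HEP", "source": "arxiv"},
    {"term": "Theory-HEP", "source": "cataloger"},
    {"term": "Lattice", "source": "magpie"},
])
```

Here `merged` keeps the cataloger entry for Theory-HEP and the magpie entry for Lattice, so a cataloger can still see which terms may deserve a second look.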
Actually, we could run magpie also on arXiv papers and keep magpie’s suggestion in case of difference to arXiv. This could be helpful for the content people as a hint to carefully check the subject.
- Annette
|
For the avoidance of doubt: using magpie or human classification on arXiv would create a new INSPIRE subject and NOT modify the arXiv category (and cross-listing) as chosen by the submitter (which we'd show somewhere else)... right? |
@salmele right: arXiv categories are never going to be touched. |
So I propose then having this:
|
Do we want to go for the |
...preserving the concept of primary category in arXiv (current PRIMARC in SPIRES syntax and dedicated MARC field) |
Yup. This is something that can be implemented at search time. The first arXiv category in the list of arXiv categories will also be searchable with primarc. (Ticketized in inspirehep/inspire-next#1791 so we don't forget.) |
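As a sketch of what "implemented at search time" could look like: the first element of the stored list is exposed under an extra key. The function name and the `arxiv_categories` key are made up for illustration; only `primarc` comes from the discussion.

```python
def arxiv_search_fields(arxiv_categories):
    """Build the searchable fields for a record's arXiv categories.

    The stored list is left untouched; the primary category is simply
    its first element, exposed under the extra 'primarc' key.
    """
    fields = {"arxiv_categories": list(arxiv_categories)}
    if arxiv_categories:
        fields["primarc"] = arxiv_categories[0]
    return fields


fields = arxiv_search_fields(["hep-th", "gr-qc"])
```

This relies on the list order being preserved at ingestion, so that the submitter's primary category always comes first.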
TL;DR: adopt arXiv categories instead of INSPIRE categories
My opinion is rather different from those expressed before.
One point of view is largely missing: the user point of view, which is probably the most important. I was a simple user until not too long ago, and my point of view is that INSPIRE categories are terrible, and I would suggest getting rid of them in their current form, for the following reasons.
nobody knows about them
As @jacquerie showed, ~0.1% of our users ever tried to perform a search with the 'fc' keyword. Of those, I am sure a large fraction is doing it wrong, e.g.
they are confusing
There are three types of categories: See the list at the bottom of the INSPIRE categories and the arXiv categories that map to them. They are inconsistent, as General Physics is NOT the same as physics.gen-ph (which is a euphemism for garbage on the arXiv), and Data Analysis and Statistics does not contain the stat.* statistics categories. And who knows what Other means?
they are largely useless
The purpose of categories is to organize the records in such a way that the user can easily filter them out and focus on what she is interested in. Either she is interested in a small category, in which case she is looking conceptually for a single arXiv category, but with another name, or she is not. In the latter case, the category is probably useless anyway. For example, if someone wants to know more about entanglement entropy (which is a hot subject right now in hep-th, but was born in quant-ph and is used quite a lot in cond-mat), he is not interested in whether the paper is General Physics, but in the distinction between quant-ph and cond-mat.*. More importantly, our coverage will probably be quite poor in those areas, but the user is not necessarily aware of this fact.
solution: adopt arXiv categories
So I would suggest overhauling INSPIRE categories to match arXiv categories.
If we keep the primary/secondary distinction, I don't think there would be any need for overriding categories (maybe @ksachs still sees a use case), as we can trust the arXiv moderators, who vastly outnumber the handful of people assigning categories in INSPIRE and are active researchers in their fields. But we could still have the provenance field just to be sure, and for the case in which we add a non-arXiv record that is later added to arXiv.
{'Accelerators': ['physics.acc-ph'],
'Astrophysics': ['physics.space-ph',
'astro-ph.EP',
'astro-ph.GA',
'astro-ph.SR',
'astro-ph.CO',
'astro-ph',
'astro-ph.HE'],
'Computing': ['cs.ET',
'cs.RO',
'cs.CY',
'cs.NE',
'cs.DB',
'cs.DC',
'cs.SI',
'cs.DL',
'cs.CV',
'cs.FL',
'cs.SD',
'cs.PF',
'cs.LG',
'cs.DS',
'cs.OH',
'cs.OS',
'cs.LO',
'cs.MM',
'cs.AI',
'physics.comp-ph',
'cs.GT',
'cs.IR',
'cs.NA',
'cs.SE',
'cs.CL',
'cs.CG',
'cs.DM',
'cs.SY',
'cs.GL',
'cs.IT',
'cs.CR',
'cs.MS',
'cs.SC',
'cs.CC',
'cs.AR',
'cs.GR',
'cs.NI',
'cs.MA',
'cs.PL',
'cs.CE',
'cs.HC',
'cs'],
'Data Analysis and Statistics': ['physics.data-an'],
'Experiment-HEP': ['hep-ex'],
'Experiment-Nucl': ['nucl-ex'],
'General Physics': ['quant-ph',
'cond-mat.stat-mech',
'cond-mat.mes-hall',
'cond-mat.supr-con',
'physics.plasm-ph',
'cond-mat',
'cond-mat.other',
'physics.class-ph',
'nlin',
'cond-mat.quant-gas',
'cond-mat.dis-nn',
'nlin.CD',
'cond-mat.soft',
'nlin.CG',
'cond-mat.str-el',
'physics.ao-ph',
'cond-mat.mtrl-sci',
'physics.gen-ph',
'nlin.AO',
'physics.atm-clus',
'physics.flu-dyn',
'physics.atom-ph',
'physics.optics',
'physics',
'physics.geo-ph'],
'Gravitation and Cosmology': ['gr-qc'],
'Instrumentation': ['astro-ph.IM', 'physics.ins-det'],
'Lattice': ['hep-lat'],
'Math and Math Physics': ['patt-sol',
'math.GT',
'math.CV',
'math.MP',
'math.GM',
'math.PR',
'math.GR',
'math.DG',
'math.NA',
'math.AP',
'math.CA',
'math.LO',
'math.NT',
'math.AG',
'math.KT',
'q-alg',
'math.ST',
'math.CT',
'math.QA',
'alg-geom',
'math',
'math.DS',
'math.FA',
'math.CO',
'math.SP',
'math.MG',
'math.GN',
'math.AT',
'nlin.PS',
'math.OC',
'math.SG',
'math.HO',
'math.RT',
'math.IT',
'math.RA',
'math.OA',
'math-ph',
'dg-ga',
'math.AC',
'solv-int',
'nlin.SI'],
'Other': ['q-fin.TR',
'q-bio.PE',
'q-bio.CB',
'q-bio.BM',
'q-fin.GN',
'q-fin.PR',
'stat.AP',
'physics.chem-ph',
'physics.pop-ph',
'q-bio.MN',
'stat.CO',
'stat.ML',
'physics.hist-ph',
'q-fin.CP',
'stat.OT',
'q-bio.TO',
'q-fin.EC',
'q-bio.GN',
'q-fin.PM',
'physics.med-ph',
'stat.TH',
'physics.bio-ph',
'q-bio.SC',
'physics.soc-ph',
'physics.ed-ph',
'q-bio.OT',
'q-bio.QM',
'q-fin.ST',
'q-bio.NC',
'q-fin.RM',
'q-fin.MF',
'stat.ME'],
'Phenomenology-HEP': ['hep-ph'],
'Theory-HEP': ['hep-th'],
'Theory-Nucl': ['nucl-th']} |
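The dictionary above goes from INSPIRE category to arXiv categories. For the lookup actually needed at ingestion (arXiv category to INSPIRE category) it has to be inverted; a minimal sketch on a four-entry excerpt, which also checks the claim that the mapping is unambiguous. The function name is made up for illustration.

```python
# Small excerpt of the mapping listed above.
INSPIRE_TO_ARXIV = {
    "Experiment-HEP": ["hep-ex"],
    "Instrumentation": ["astro-ph.IM", "physics.ins-det"],
    "Lattice": ["hep-lat"],
    "Theory-HEP": ["hep-th"],
}


def invert_mapping(mapping):
    """Return {arXiv category: INSPIRE category}, failing on ambiguity."""
    inverted = {}
    for inspire_category, arxiv_categories in mapping.items():
        for arxiv_category in arxiv_categories:
            if arxiv_category in inverted:
                raise ValueError(
                    "%s maps to both %s and %s"
                    % (arxiv_category, inverted[arxiv_category], inspire_category)
                )
            inverted[arxiv_category] = inspire_category
    return inverted


ARXIV_TO_INSPIRE = invert_mapping(INSPIRE_TO_ARXIV)
```

With this excerpt, `ARXIV_TO_INSPIRE["astro-ph.IM"]` gives `"Instrumentation"`, and an arXiv category listed under two INSPIRE categories would raise instead of being silently overwritten.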
IMO we are guessing too much, the inspire categories have been hidden so far so we don't have usage data, I'd try to get some from the users before doing any big effort either way (that might require some effort too, but having the machinery to easily get that kind of things will help on deciding on other features too). |
@michamos for us developers, can you exactly define or point us to the definition of what is:
E.g. with some concrete example to reasoning on. |
the thing we put into 037, e.g. for https://inspirehep.net/record/1501963 it is
for the same record, they are
a category in bold on https://arxiv.org/, which might have subcategories, so hep-th or physics.* (probably using
|
The fact that they are hidden and completely unknown is an opportunity to rethink INSPIRE categories IMHO |
Hi Micha, thanks for your detailed input from the user side. However you are talking about something you barely know (as you say yourself). First some statistics - records added the last 2 years:
Btw: other = non-physics. We have one big category (with a lot of non-core stuff we don't want to waste time on): astro; two small categories: lattice and computing; all the rest is around 5k/y. This shows that the balance is not too bad. You forget that the INSPIRE content is not the same as the arXiv content. Only about half of our records come from arXiv, and we have only a very selective part of arXiv. Where arXiv categories are split into subcategories we barely have records. For core records the big categories are p (9k), t (7k) and e (5k), which we might want to break down into smaller sub-categories; SPIRES used to have those, but that was before my time. Actually SPIRES had categories before arXiv existed. The assignment of categories has to be feasible, which it is not if the categories are too fine-grained. It's mainly Florian and me who assign categories to the non-arXiv stuff, not a bunch of moderators. Lesson learned from the reharvest of nucl journals, where we used magpie to guess the subject: 10% I dumped right away, for about 20% I went through and made corrections, and about 70% were useful. This means that even if we run magpie over everything that doesn't come from arXiv, we still have ~7k/y to assign manually. |
@ksachs thanks for your comments. I agree that having categories is very useful, I just think those categories should be closer to the arXiv ones that people know. In this way we can avoid having our users learn and remember what from their point of view is a new classification scheme which is very similar yet subtly different in some areas.
People browse the new arXiv listings (e.g. hep-th/new), it would actually be very nice if one could do the same on INSPIRE, with all new papers in a given category. So by adopting arXiv categories on INSPIRE, one could get, say, arXiv hep-th + non-arXiv hep-th additions.
I know that Other = None of the rest, but in order to know what it means precisely one has to know all the other categories that we have
That's why it would be good to facet in a hierarchical way, so that we can have the full math category collapsed into math at first, but with possibility to expand it if one is interested in math.RT but not math.AP.
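A possible way to build such a hierarchical facet from flat per-category counts, as a sketch only: the function name and the output shape are invented here, and the counts in the example are made up.

```python
from collections import defaultdict


def group_by_archive(category_counts):
    """Group arXiv categories under their top-level archive.

    'math.RT' and 'math.AP' fold into 'math'; a category without a
    subcategory part (e.g. 'hep-th') forms its own group.
    """
    tree = defaultdict(dict)
    for category, count in category_counts.items():
        archive = category.split(".", 1)[0]
        tree[archive][category] = count
    return {
        archive: {"total": sum(children.values()), "children": children}
        for archive, children in tree.items()
    }


facet = group_by_archive({"math.RT": 3, "math.AP": 2, "hep-th": 10})
```

One caveat: summing the children overcounts records that are cross-listed in two subcategories of the same archive, so a real facet would need the collapsed total computed from distinct records.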
I was mentioning moderators for the arXiv content. For non-arXiv, we could do it ourselves, but based on the arXiv top-level categories if it's something we don't care about (so physics or math or cs or ...).
What I have seen of magpie was very impressive. I would guess that magpie didn't have good training data in this case. By adopting arXiv categories, we could very easily train magpie on the arXiv corpus. |
Currently INSPIRE categories are stored at the same level of arXiv categories and other externally provided categories.
However, while catalogers are not supposed to touch externally provided categories, they are expected to curate INSPIRE categories. This puts INSPIRE categories in a privileged position.
It is proposed to move INSPIRE categories to a dedicated field, so that they are easier to edit (e.g. autocomplete can be enforced on the exact INSPIRE categories).