Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Affiliation info: cites, pubs and fields #27

Open
f-hafner opened this issue Dec 12, 2022 · 6 comments · Fixed by #29
Open

Affiliation info: cites, pubs and fields #27

f-hafner opened this issue Dec 12, 2022 · 6 comments · Fixed by #29
Assignees

Comments

@f-hafner
Copy link
Owner

f-hafner commented Dec 12, 2022

I think we can make the tables for paper/citation counts at affiliation-year-field0 as well as the keyword list at the same level in one go:

  1. Create temporary table with unique PaperId, Year, AffiliationId
  2. summarise paper outcomes at AffiliationId-Field0-Year level
  3. summarise paper keywords at AffiliationId-Field0-Year level

To consider/note

  • We will replace the table affiliation_outcomes, which is only at the AffiliationId level, with the output from step 2 above. I don't know right now where we use this old table as an input, and we should check
  • There are two ways of assigning papers to "departments" at institutions: from the author's main field, or from the paper's main field. The latter is probably more accurate
  • The output from step 3 contains additionally the columns FieldOfStudyId and a Score. We can use the next lower integer of the score as a frequency weight to calculate tf-idf (but not sure how to exactly implement frequency weights)
  • Not sure how to add the count of researcher in this query

Here are the queries, which we can use to replace the query in affiliation_outcomes.py.

-- ## 1. create temp table 
CREATE TEMP TABLE paper_affiliation_year AS 
SELECT DISTINCT AffiliationId, Year, PaperId
FROM (
    SELECT a.AuthorId, a.AffiliationId, a.Year, b.Paperid
    FROM AuthorAffiliation a -- ## if an author has 2 main affiliations in the same year, we count their papers at both institutions
    INNER JOIN (
        SELECT PaperId, AuthorId, Year
        FROM PaperAuthorUnique
        INNER JOIN (
            SELECT PaperId, Year
            FROM Papers
        ) USING(PaperId)
    ) b
    ON a.AuthorId=b.AuthorId AND a.Year=b.Year
    -- reduces size of the data set 
    INNER JOIN (
        SELECT PaperId
        FROM paper_outcomes
    ) USING(PaperId)
)

CREATE INDEX ON idx_paper_temp ON paper_affiliation_year (PaperId)

-- ## 2. create table with citation/paper counts
CREATE TABLE affiliation_outcomes AS  -- this is already defined, where? can we replace it? where do we use it?? 
SELECT COUNT(PaperId) AS PaperCount
    , SUM(CitationCount_y10) AS CitationCount_y10
    , AffiliationId
    , Year 
    , Field0
FROM paper_affiliation_year 
INNER JOIN (
    SELECT PaperId, CitationCount_y10 
    FROM paper_outcomes 
) USING(PaperId)
INNER JOIN ( -- each field is unique per paper, so it is ok to join only here 
    SELECT PaperId, Field0 
    FROM PaperMainFieldsOfStudy
) USING(PaperId)
GROUP BY AffiliationId, Field0, Year

CREATE UNIQUE ON idx_affo_AffilYearField ON affiliation_outcomes (AffiliationId, Year, Field0)


-- ## 3. table with keywords
CREATE TABLE affiliation_fields AS 
SELECT SUM(Score) AS Score
    , FieldOfStudyId
    , AffiliationId 
    , Field0
    , Year 
FROM paper_affiliation_year 
INNER JOIN (
    SELECT PaperId, FieldOfStudyId, Score
    FROM PaperFieldsOfStudy 
    INNER JOIN (
        SELECT FieldOfStudyId 
        FROM FieldsOfStudy 
        WHERE level < 2 -- choose appropriate level 
    ) USING(FieldOfStudyId)
) USING(PaperId)
GROUP BY AffiliationId, Field0, Year

CREATE UNIQUE ON idx_afff_AffilYearField ON affiliation_outcomes (AffiliationId, Year, Field0)
This was referenced Dec 13, 2022
@chrished chrished reopened this Jan 26, 2023
@chrished
Copy link
Collaborator

Add author count, simply count distinct authors with that affiliation-field

@f-hafner
Copy link
Owner Author

f-hafner commented Jan 26, 2023

I suppose you want the size of "departments"? Since we have discarded the author info in paper_affiliation_year, we can summarise directly from the publications. But then, multidisciplinary people are counted separately in each discipline. So the interpretation is not really "department size", but rather "number of potential collaborators".

@f-hafner
Copy link
Owner Author

or, we summarise from AuthorAffiliation, where we have already assigned the main institution for each author-year. if we do this, then the field information will also be from the author-level, which differs from the table on publication outcomes by university.
of course, we can also do both. and if you want a proxy for "peer quality", we can also even join the forward citations of the authors themselves and average at the university-field-year level.

another issue is that authors may not publish every year. so to a get a good proxy of the "department size" in a given year may be best an average with a rolling window?

@chrished
Copy link
Collaborator

I would say we start with what you call: "number of potential collaborators". For this it is not necessarily an issue if they do not publish every year, we can take the average over the last 5 years ex-post, like we do for citations in the analysis.

This can be done directly here without any big issue, right?

@chrished
Copy link
Collaborator

The other measure is good too, and we will almost surely want that, in particular in combination with citation measures... So if you can do it "quickly", please add both

@f-hafner
Copy link
Owner Author

I would say we start with what you call: "number of potential collaborators". For this it is not necessarily an issue if they do not publish every year, we can take the average over the last 5 years ex-post, like we do for citations in the analysis.

This can be done directly here without any big issue, right?

Should be ready tomorrow. I want to add that an author can be counted in multiple cells because they publish in multiple fields at level 0 (more likely) and/or because they publish at multiple affiliations. The latter is less likely, but AuthorAffiliation already uses the affiliation where an author publishes the most papers in a given year, and I know that this is not unique for all authors.

The other measure is good too, and we will almost surely want that, in particular in combination with citation measures... So if you can do it "quickly", please add both

Since we already use AuthorAffiliation, the average citation per publishing author in a field-affiliation cell should be citation count / author count, so I suggest this for now if needed.
Regarding department size, to build a better proxy we need information on main field and main institution of the author. As of now, we can do this only with the main field of the author over their career--we do not have their main field0/field1 for each year. Let me know if it is urgent and important: #35

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants