Missing content (only titles) in Escher Tageblatt #137

Open
simon-clematide opened this issue Jan 1, 2025 · 1 comment
Labels: bug (something isn't working), data (issues that are related to the data), requires re-ingestion

Comments

simon-clematide commented Jan 1, 2025

While spot-checking samples for the status of titles in the rebuilt data:
These content items have only a title (the "t" property) in the rebuilt data, but no full text. This seems to affect many more articles than the ones I sampled from these issues:

https://impresso-project.ch/app/article/tageblatt-1936-12-24-a-i0030
https://impresso-project.ch/app/article/tageblatt-1936-04-03-a-i0050
https://impresso-project.ch/app/article/tageblatt-1936-10-21-a-i0031
https://impresso-project.ch/app/article/tageblatt-1936-02-01-a-i0033

simon-clematide added the bug, data, and requires re-ingestion labels on Jan 1, 2025

e-maud commented Jan 14, 2025

In the newly ingested data, there is also a large number of content items (CI) with content_length = 0: 2,098,902.
With content_length = 1: 485,785 (these can be a glued word).

These may correspond to incorrect page region segments, originating from bad OLR and/or errors in the conversion process.

We need to investigate further, including by comparing with the 'old' ingested data. I noticed that sometimes the title is equal to the full text (I cannot count this in Solr, though).
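
Since the title-equals-full-text case cannot be counted in Solr, one option might be to count it offline over the rebuilt files. Below is a minimal sketch, assuming each content item is a JSON object with the title under "t" (as mentioned above) and the full text under a key called "ft" here. The "ft" key, the `classify_item` and `count_labels` helpers, and the sample records are all assumptions for illustration, not the actual pipeline.

```python
import json
from collections import Counter


def classify_item(item, title_key="t", text_key="ft"):
    """Return a coarse label for one content item (key names are assumptions)."""
    title = (item.get(title_key) or "").strip()
    text = (item.get(text_key) or "").strip()
    if not text:
        # No full text at all: either a title-only item or a fully empty one.
        return "title_only" if title else "empty"
    if title and title == text:
        return "title_equals_text"
    return "ok"


def count_labels(lines):
    """Count labels over an iterable of JSON lines, skipping blank lines."""
    counts = Counter()
    for line in lines:
        if line.strip():
            counts[classify_item(json.loads(line))] += 1
    return counts


# Made-up records for illustration only, not real data.
sample = [
    '{"id": "x-i0001", "t": "Some title", "ft": ""}',
    '{"id": "x-i0002", "t": "Same", "ft": "Same"}',
    '{"id": "x-i0003", "t": "Title", "ft": "Title and a longer body."}',
]
print(count_labels(sample))
```

Running the same function over the 'old' and the newly ingested data would give comparable counts per label.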

The breakdown per journal is available via this Solr query and below.

 "meta_journal_s": [
        "GDL",
        475511,
        "tageblatt",
        448088,
        "LLE",
        357561,
        "EXP",
        145095,
        "IMP",
        132258,
        "lepetitparisien",
        68849,
        "FZG",
        55572,
        "JDG",
        53702,
        "luxwort",
        53068,
        "lematin",
        50904,
        "jdpl",
        43708,
        "DTT",
        36328,
        "legaulois",
        24572,
        "NV2",
        23848,
        "esta",
        22611,
        "oerennes",
        17876,
        "lunion",
        14317,
        "LCE",
        12536,
        "SMZ",
        11324,
        "oecaen",
        7813,
        "handelsztg",
        5922,
        "LLS",
        3303,
        "JV",
        2995,
        "volkfreu1869",
        2976,
        "SDT",
        2972,
        "obermosel",
        2295,
        "dunioun",
        1880,
        "luxland",
        1840,
        "JVE",
        1637,
        "SGZ",
        1636,
        "lepji",
        1300,
        "NZG",
        1173,
        "NTS",
        984,
        "avenirgdl",
        962,
        "EZR",
        811,
        "OIZ",
        705,
        "JH",
        704,
        "GAV",
        666,
        "NV1",
        611,
        "LES",
        599,
        "CL",
        595,
        "courriergdl",
        535,
        "TouSuIl",
        463,
        "RLA",
        422,
        "VHT",
        417,
        "indeplux",
        409,
        "MB",
        331,
        "DLE",
        264,
        "WHD",
        258,
        "FedGazFr",
        241,
        "arbeitgeber",
        229,
        "LCG",
        226,
        "FedGazDe",
        210,
        "NS",
        189,
        "luxembourg1935",
        184,
        "lafronde",
        183,
        "LBP",
        176,
        "ME",
        166,
        "VVS",
        164,
        "EM",
        143,
        "DFS",
        131,
        "AV",
        100,
        "CDV",
        93,
        "BNN",
        82,
        "LCR",
        82,
        "NZZ",
        81,
        "waechtersauer",
        72,
        "FAN",
        71,
        "PAT",
        69,
        "EDA",
        58,
        "armeteufel",
        58,
        "LSE",
        44,
        "HRV",
        41,
        "VVS1",
        41,
        "Cancoire",
        36,
        "GAVi",
        33,
        "excelsior",
        32,
        "MESSAGER",
        28,
        "Moniteur",
        28,
        "JY2",
        25,
        "buergerbeamten",
        24,
        "CharivariCH",
        23,
        "LVE",
        22,
        "luxzeit1858",
        22,
        "PDL",
        20,
        "Guepe1851",
        17,
        "onsjongen",
        16,
        "Fronde",
        15,
        "LTF",
        15,
        "GAZ",
        14,
        "Griffe",
        14,
        "oeuvre",
        14,
        "CON",
        12,
        "Croquis",
        12,
        "JDV",
        12,
        "SRT",
        12,
        "ZBT",
        12,
        "demitock",
        11,
        "FCT",
        9,
        "schmiede",
        8
      ]
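
The facet list above alternates journal name and count in a single flat array. As a hedged sketch of how it could be read into pairs and turned into shares, the snippet below uses two hypothetical helpers (`pair_facets`, `shares`) and only the first five entries of the list. The shares it prints are therefore relative to those five journals, not to the full total.

```python
def pair_facets(flat):
    """Turn a flat [name, count, name, count, ...] list into (name, count) pairs."""
    if len(flat) % 2:
        raise ValueError("expected an even number of entries")
    return list(zip(flat[0::2], flat[1::2]))


def shares(pairs):
    """Attach each journal's share of the summed counts in `pairs`."""
    total = sum(count for _, count in pairs)
    return [(name, count, count / total) for name, count in pairs]


# First five entries of the list above; the full list works the same way.
top5 = ["GDL", 475511, "tageblatt", 448088, "LLE", 357561,
        "EXP", 145095, "IMP", 132258]

for name, count, share in shares(pair_facets(top5)):
    print(f"{name:10s} {count:>8,d} {share:6.1%}")
```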
