fix: update scraping code #6

EverVino · 2024-02-03T16:29:28Z

Pull Request description

Fix scraping for title, abstract, PMID, doi
Update getContent functions from helper module

The PMID and DOI fields were extracting more data than expected; this is fixed with the getContentUnique function.
The abstract and title fields were not extracting their full content; this is fixed with the getAllContent function.

In some tests that I ran locally, I noticed that some articles do not have an abstract, possibly because they are old.
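
As a rough illustration of the two behaviours described above, here is a minimal sketch using only the standard library. The helper names `first_text` and `all_text`, the XPath expressions, and the sample XML are assumptions made for this example; they are not the actual getContentUnique and getAllContent implementations from the helper module.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical, simplified record: the PMID appears twice (once for the
# article itself, once inside a reference) and the title has inline markup.
SAMPLE = """
<PubmedArticle>
  <ArticleTitle>Effects of <i>in vitro</i> exposure</ArticleTitle>
  <ArticleIdList>
    <ArticleId IdType="pubmed">111</ArticleId>
  </ArticleIdList>
  <ReferenceList>
    <ArticleIdList>
      <ArticleId IdType="pubmed">222</ArticleId>
    </ArticleIdList>
  </ReferenceList>
</PubmedArticle>
""".strip()


def first_text(root: ET.Element, path: str, default: Optional[str] = None) -> Optional[str]:
    """Return the text of the first match only (avoids collecting extra IDs)."""
    node = root.find(path)
    if node is None or node.text is None:
        return default
    return node.text.strip()


def all_text(root: ET.Element, path: str, default: Optional[str] = None) -> Optional[str]:
    """Return the full text of the first match, including nested tags."""
    node = root.find(path)
    if node is None:
        return default
    return "".join(node.itertext()).strip()


root = ET.fromstring(SAMPLE)
pmid = first_text(root, ".//ArticleId[@IdType='pubmed']")
title = all_text(root, ".//ArticleTitle")
abstract = all_text(root, ".//AbstractText")  # this record has no abstract
```

Reading only `node.text` for the title would stop at the first nested tag, which is why the second helper joins `itertext()`. A record without an abstract simply yields the default value.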

How to test these changes

Run a simple query

Pull Request checklists

This PR is a:

bug-fix
new feature
maintenance

About this PR:

it includes tests.
the tests are executed on CI.
the tests generate log file(s) (path).
pre-commit hooks were executed locally.
this PR requires a project documentation update.

Author's checklist:

I have reviewed the changes and it contains no misspelling.
The code is well commented, especially in the parts that contain more complexity.
New and old tests passed locally.

Reviewer's Checklist

I managed to reproduce the problem locally from the main branch
I managed to test the new changes locally
I confirm that the issues mentioned were fixed/resolved.

EverVino · 2024-02-03T16:44:00Z

The failing test will need more investigation.
A possibly relevant reference:
https://pypi.org/project/types-lxml/

EverVino · 2024-02-03T18:15:42Z

@xmnlab the test is not passing. Maybe it's about the version. What do you think?

xmnlab · 2024-02-03T19:39:26Z

hey @EverVino I am taking a look into that now

xmnlab · 2024-02-03T19:47:55Z

not really sure about this error, but this is the context:

you changed from

from xml.etree.ElementTree import Element

to

from lxml.etree import Element

the error happens here:

https://github.com/osl-incubator/pymedx/blob/main/src/pymedx/article.py#L165

the error states: FAILED tests/test_article.py::TestArticle::test_toJSON - TypeError: isinstance() arg 2 must be a type, a tuple of types, or a union

so my guess is that the new Element is not a type?

xmnlab · 2024-02-03T19:50:09Z

let me know if you want to debug directly inside the CI and I can enable that for you

EverVino · 2024-02-04T01:32:28Z

@xmnlab ready for a review

EverVino · 2024-02-04T01:34:15Z

Basically, the problem was that Element in lxml is a factory function that creates instances of the class _Element, so it is not a type itself.
For future reference:
https://stackoverflow.com/questions/72226485/mypy-function-lxml-etree-elementtree-is-not-valid-as-a-type-but-why
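
The failure mode can be reproduced without lxml installed. The sketch below uses hypothetical stand-ins (`_Element` and `Element`) that mimic the class/factory split; they are not the lxml objects themselves. With lxml, the equivalent fix would presumably be to check against `lxml.etree._Element` instead of `lxml.etree.Element`.

```python
import xml.etree.ElementTree as ET
from typing import Optional


class _Element:
    """Hypothetical stand-in for the real element class."""

    def __init__(self, tag: str) -> None:
        self.tag = tag


def Element(tag: str) -> _Element:
    """Hypothetical stand-in for a factory function that builds _Element."""
    return _Element(tag)


# In the standard library, Element is a class, so isinstance() accepts it.
std_node = ET.Element("title")
std_check = isinstance(std_node, ET.Element)

# With a factory function, isinstance() raises TypeError because the
# second argument is not a type.
node = Element("title")
factory_error: Optional[str] = None
try:
    isinstance(node, Element)  # type: ignore[arg-type]
except TypeError as exc:
    factory_error = str(exc)

# Checking against the underlying class works as expected.
class_check = isinstance(node, _Element)
```

This matches the TypeError reported by the failing test_toJSON test: the import changed the meaning of the name Element from a class to a callable.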

github-actions · 2024-02-04T02:49:56Z

🎉 This PR is included in version 0.2.1 🎉

The release is available on:

0.2.1
GitHub release

Your semantic-release bot 📦🚀

fix: update scraping code

b428445

EverVino added 2 commits February 3, 2024 12:54

update dependencies

7815ec6

add lxml-stubs as dev dependency

f674349

EverVino added 3 commits February 3, 2024 21:08

add type-lxml as dev dependency

189d235

update type Element

4a53339

fix dependencies

279796b

xmnlab merged commit 220f2bf into main Feb 4, 2024
8 checks passed

xmnlab deleted the update-scraping branch February 4, 2024 02:47

github-actions bot added the released label Feb 4, 2024
