fix: update scraping code #6

Merged
merged 6 commits into from
Feb 4, 2024

Conversation

@EverVino EverVino commented Feb 3, 2024

Pull Request description

  • Fix scraping for title, abstract, PMID, doi
  • Update getContent functions from helper module

The PMID and doi fields were extracting more data than expected; this is fixed with the getContentUnique function.
The abstract and title fields were not extracting the full content; this is fixed with the getAllContent function.

In some tests that I ran locally, I noticed that some articles do not have an abstract, possibly because they are old.
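
To illustrate the two kinds of extraction described above, here is a minimal sketch using only the standard library. The helpers `get_first_text` and `get_all_text` are hypothetical stand-ins, not the actual getContentUnique and getAllContent functions from the helper module, and the sample XML is made up for the example. It shows why a broad match can pick up more than one identifier, why reading only `.text` truncates a title that contains inline markup, and how a missing abstract can fall back to a default.

```python
import xml.etree.ElementTree as ET

# Made-up sample record, loosely shaped like a PubMed article (assumption).
SAMPLE = """
<PubmedArticle>
  <ArticleIdList>
    <ArticleId IdType="pubmed">111</ArticleId>
    <ArticleId IdType="doi">10.1000/example</ArticleId>
  </ArticleIdList>
  <ReferenceList>
    <ArticleId IdType="pubmed">222</ArticleId>
  </ReferenceList>
  <ArticleTitle>Effects of <i>in vitro</i> treatment</ArticleTitle>
</PubmedArticle>
"""


def get_first_text(root, path, default=None):
    """Hypothetical helper: return the text of the first match only."""
    match = root.find(path)
    if match is None:
        return default
    return (match.text or "").strip() or default


def get_all_text(root, path, default=None):
    """Hypothetical helper: return the full text of the first match,
    including text nested inside child elements such as <i>."""
    match = root.find(path)
    if match is None:
        return default
    return "".join(match.itertext()).strip() or default


root = ET.fromstring(SAMPLE.strip())

# Two elements match this path, so joining every match would return too much.
pmid_path = ".//ArticleId[@IdType='pubmed']"
print(len(root.findall(pmid_path)))            # 2
print(get_first_text(root, pmid_path))         # 111

# Reading only .text stops at the first child element.
print(root.find(".//ArticleTitle").text)       # "Effects of "
print(get_all_text(root, ".//ArticleTitle"))   # Effects of in vitro treatment

# Records without an abstract fall back to the default.
print(get_all_text(root, ".//AbstractText"))   # None
```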

How to test these changes

Run a simple query

Pull Request checklists

This PR is a:

  • bug-fix
  • new feature
  • maintenance

About this PR:

  • it includes tests.
  • the tests are executed on CI.
  • the tests generate log file(s) (path).
  • pre-commit hooks were executed locally.
  • this PR requires a project documentation update.

Author's checklist:

  • I have reviewed the changes and they contain no misspellings.
  • The code is well commented, especially in the parts that contain more
    complexity.
  • New and old tests passed locally.

Reviewer's Checklist

  • I managed to reproduce the problem locally from the main branch
  • I managed to test the new changes locally
  • I confirm that the issues mentioned were fixed/resolved.

@EverVino
Author

EverVino commented Feb 3, 2024

The failing test will need more investigation.
Some reference:
https://pypi.org/project/types-lxml/

@EverVino
Author

EverVino commented Feb 3, 2024

@xmnlab the test is not passing. Maybe it's about the version. What do you think?

@xmnlab
Member

xmnlab commented Feb 3, 2024

hey @EverVino I am taking a look into that now

@xmnlab
Member

xmnlab commented Feb 3, 2024

not really sure about this error, but this is the context:

you changed from

from xml.etree.ElementTree import Element

to

from lxml.etree import Element

the error happens here:

https://github.com/osl-incubator/pymedx/blob/main/src/pymedx/article.py#L165

the error states: FAILED tests/test_article.py::TestArticle::test_toJSON - TypeError: isinstance() arg 2 must be a type, a tuple of types, or a union

so my guess is that the new Element is not a type?
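
A minimal sketch of how this kind of TypeError can arise, using only the standard library. `element_factory` is a hypothetical stand-in for a name that looks like a class but is actually a function; it is not lxml code, and whether this matches the situation in article.py is an assumption.

```python
import xml.etree.ElementTree as ET


def element_factory(tag):
    """Hypothetical stand-in for a factory function that builds elements."""
    return ET.Element(tag)


node = element_factory("Article")

# xml.etree.ElementTree.Element is a real class, so this check works.
print(isinstance(node, ET.Element))  # True

# A function is not a type, so using it as the second argument fails.
try:
    isinstance(node, element_factory)
except TypeError as exc:
    print(f"TypeError: {exc}")
```

If the imported name is a function rather than a class, the check needs the class that the function returns instances of.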

@xmnlab
Member

xmnlab commented Feb 3, 2024

let me know if you want to debug directly inside the CI and I can enable that for you

@EverVino
Author

EverVino commented Feb 4, 2024

@xmnlab ready for a review

@EverVino
Author

EverVino commented Feb 4, 2024

Basically, the problem was that lxml's Element is not a class; it is a factory function that creates instances of the class _Element.
For future reference:
https://stackoverflow.com/questions/72226485/mypy-function-lxml-etree-elementtree-is-not-valid-as-a-type-but-why

@xmnlab xmnlab merged commit 220f2bf into main Feb 4, 2024
8 checks passed
@xmnlab xmnlab deleted the update-scraping branch February 4, 2024 02:47

github-actions bot commented Feb 4, 2024

🎉 This PR is included in version 0.2.1 🎉

The release is available on:

Your semantic-release bot 📦🚀

2 participants