Convert to citations/bibliography rather than links #24

Open
retorquere opened this issue Feb 6, 2019 · 10 comments

Comments

@retorquere
Contributor

I'm interested in adding an option for the ODF scanner to use Zotero's embedded citeproc to create a finalized document. The aim is not to remove the existing functionality that creates a Zotero-compatible document, but to let me use Word Online + ODF scan without requiring Word to finalize the document. Would this be:

  • feasible?
  • desirable?

and if so, can you point me to the part of the code that does the current replacement?

@adam3smith
Collaborator

It'd definitely be desirable, but it wouldn't be easy. Currently the thing that makes the tool uncomplicated is that it doesn't need to talk to Zotero at all during the scan. All the citation data is added when setting a citation style in LibreOffice. The scan just converts the markers to LO Reference Marks with Zotero format and Zotero item.uris that allow for updating.

The relevant function is here: https://github.com/Juris-M/zotero-odf-scan-plugin/blob/master/chrome/content/rtfScan.js#L271

Hope you like regular expressions ;)
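
For readers who have not seen the markers before, here is a minimal sketch of splitting one scannable cite into its pipe-separated fields, using the sample marker that appears further down this thread. The function name `parseScannableCite` and the field labels (`prefix`, `cite`, `locator`, `suffix`, `uri`) are hypothetical names chosen for illustration, not the plugin's actual code, and the pattern assumes the marker is already a plain string with no XML tags inside it.

```javascript
// Hypothetical helper (not from rtfScan.js): split a scannable cite into
// its five pipe-separated fields. Assumes the marker is tag-free plain text.
function parseScannableCite(marker) {
  const pattern = /^\{([^{}|]*)\|([^{}|]*)\|([^{}|]*)\|([^{}|]*)\|([^{}|]*)\}$/;
  const match = pattern.exec(marker.trim());
  if (!match) {
    return null; // not a five-field marker
  }
  // Field labels are illustrative assumptions.
  const [prefix, cite, locator, suffix, uri] = match
    .slice(1)
    .map((field) => field.trim());
  return { prefix, cite, locator, suffix, uri };
}

const parsed = parseScannableCite("{ See e.g. | Smith, 2008 | | |zu:6204:P4KXGRZI}");
console.log(parsed);
```

In a real document the marker is rarely this clean, which is what the rest of the discussion turns on.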

@retorquere
Contributor Author

The talking to Zotero bit isn't really too hard. Is https://github.com/Juris-M/zotero-odf-scan-plugin/blob/master/chrome/content/rtfScan.js#L512 the central function that orchestrates the finding and replacing, and https://github.com/Juris-M/zotero-odf-scan-plugin/blob/master/chrome/content/rtfScan.js#L594 the part that does the actual replacements?

If I may ask, why use regexes when FF has XML/XPath functionality built in?

@adam3smith
Collaborator

adam3smith commented Feb 14, 2019

I don't think there's a strong reason to use regex over XML except that Frank likes regex (the original tool this is based on was in Python, I think, but it's not like that would have made using XML/XPath impossible). It might be that it actually ends up being more stable given different interpretations of the ODF XML model, but it's also possible that the reverse is true. Certainly worth testing out.
That looks right wrt the functions, yes.

@fbennett
Contributor

fbennett commented Feb 14, 2019

> If I may ask, why use regexes when FF has XML/XPath functionality built in?

You're not the first to ask that question. 😃 The code was originally rejected for inclusion in Zotero for exactly that reason. (Edit: Dan's third response in this thread on zotero-dev)

The problem is that the target string may be cross-nested with XML tags that capture a larger run of document text. Identifying the string and isolating it for replacement using XML methods would be very hard to do. It would also be slower to run (because you would need to iterate to the top of the XML hierarchy to determine that a given match attempt had failed). I offered that explanation at the time, and it didn't find favor, but that's the reason behind using regex there.

@retorquere
Contributor Author

Cross-nested? I thought XML was strictly hierarchical?

@retorquere
Contributor Author

(that link appears to want to search your mailbox -- I don't think I have access to that 😄)

@fbennett
Contributor

> Cross-nested? I thought XML was strictly hierarchical?

XML is, but the "scannable cites" are not an XML unit, so you get things like this:

<tag>blah<tag>. { See <tag>e.g. | Smith,</tag></tag> 2008 | | |zu:6204:P4KXGRZI}</tag>

Maybe there is an easy way to find the strings and adjust the tag structure to permit insertion of a well structured XML element at their location in DOM context, but it looked pretty daunting to me, and I gave up.
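
To make the cross-nesting concrete, here is a rough sketch using the sample string above: a marker pattern run against the raw markup finds nothing, but the same pattern matches once the tags are removed. The pattern and the `stripTags` helper are illustrative assumptions, not code from rtfScan.js, and the naive tag removal is only adequate for this toy string.

```javascript
const xml =
  "<tag>blah<tag>. { See <tag>e.g. | Smith,</tag></tag> 2008 | | |zu:6204:P4KXGRZI}</tag>";

// Assumed shape of a marker in plain text: braces around five
// pipe-separated fields, with no angle brackets inside.
const markerPattern = /\{[^{}|<>]*(?:\|[^{}|<>]*){4}\}/;

// Naive tag removal for illustration; not a general XML parser.
function stripTags(markup) {
  return markup.replace(/<[^>]+>/g, "");
}

// The tags interrupt the marker, so the raw markup does not match.
const matchesRaw = markerPattern.test(xml);
console.log(matchesRaw); // false

// With the tags gone, the marker is an ordinary contiguous string.
const plainText = stripTags(xml);
const found = plainText.match(markerPattern);
console.log(found[0]);
```

Finding the marker this way is the easy half; the hard half is writing a well-formed replacement back into markup whose tags open and close inside the marker.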

Didn't notice that the Google Groups links worked that way! Here's the relevant bit (from April 16, 2013):

Frank

If it's firm that regular expressions can't be used, this is probably
off the table for mainstream. That approach started as a hack, with
the intention of eventually refactoring the code to use an XML parser.
But as I played with documents, I found that the string is often
chopped up by tag nesting in the internal XML markup. You could
probably identify them, but the code would probably be harder to
follow than the regexp, and might require quite a few debugging
iterations. It's probably not worth attempting.

Dan

Well, it's just the use of regular expressions to actually parse the XML
that we object to. Can you not use XPaths to find the relevant nodes,
and then do regexps on the textContent?
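
A sketch of what matching on the text content runs into, under stated assumptions: the code below builds the tag-free text together with a map from each text character back to its position in the markup, matches the marker in the text, and then looks at the raw markup the match covers. The helper `textWithOffsets` is hypothetical, the sample string is the one quoted earlier in this thread, and entities and attributes containing `>` are ignored for brevity.

```javascript
// Hypothetical helper: return the tag-free text and, for every text
// character, the index it occupies in the original markup.
function textWithOffsets(markup) {
  let text = "";
  const offsets = [];
  let inTag = false;
  for (let i = 0; i < markup.length; i++) {
    const ch = markup[i];
    if (ch === "<") {
      inTag = true;
      continue;
    }
    if (ch === ">" && inTag) {
      inTag = false;
      continue;
    }
    if (!inTag) {
      text += ch;
      offsets.push(i);
    }
  }
  return { text, offsets };
}

const markup =
  "<tag>blah<tag>. { See <tag>e.g. | Smith,</tag></tag> 2008 | | |zu:6204:P4KXGRZI}</tag>";
const { text, offsets } = textWithOffsets(markup);

// Match the marker in the plain text, then map both ends back to the markup.
const m = /\{[^{}]*\}/.exec(text);
const start = offsets[m.index];
const end = offsets[m.index + m[0].length - 1] + 1;
const rawSpan = markup.slice(start, end);
console.log(rawSpan);

// Count the tags that fall inside the raw span.
const opens = (rawSpan.match(/<[^\/][^>]*>/g) || []).length;
const closes = (rawSpan.match(/<\/[^>]+>/g) || []).length;
console.log({ opens, closes });
```

The raw span contains more closing tags than opening ones, so it cannot simply be swapped for a single well-formed element without first adjusting the surrounding tag structure.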

@retorquere
Contributor Author

> <tag>blah<tag>. { See <tag>e.g. | Smith,</tag></tag> 2008 | | |zu:6204:P4KXGRZI}</tag>

Lord Cthulhu almighty, there's kids in the room, you can't just show things like this out in the open... alright, I see your point. The solution would be ugly in any case given this, and the regexen are arguably less ugly than the XML parsing would have been.

Wow.

@fbennett
Contributor

It does look like a plain string in the word processor, though, so by adding LibreOffice as a dependency ...

@paultroop

I'm afraid I do not follow all the technical discussion here, but is this issue linked to the possibility of using the ODF scan as something like a BibTeX-type referencing system? I'm looking at the idea of using LaTeX for writing, but all my research references are in the ODF scan form. I was wondering if there is an easy way of converting them into something that would be recognised in LaTeX.
