Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow packages without repository #36

Merged
merged 21 commits into from
Nov 25, 2021
Merged

Allow packages without repository #36

merged 21 commits into from
Nov 25, 2021

Conversation

zarianec
Copy link
Collaborator

This is the follow-up of #30 and #31. This PR allows packages without a repository to be added to the packages.json with a null value.

resolve #30

@zarianec zarianec requested a review from zeke September 13, 2021 12:52
scripts/update.js Outdated Show resolved Hide resolved
scripts/update.js Outdated Show resolved Hide resolved
scripts/update.js Outdated Show resolved Hide resolved
Copy link
Member

@zeke zeke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a minor refactoring suggestion, but this looks good to me. I will defer to @eridal for a final approval of this.

test.js Outdated Show resolved Hide resolved
@zarianec
Copy link
Collaborator Author

zarianec commented Sep 21, 2021

Updated cache to store packages without repository as well.

@zeke
Copy link
Member

zeke commented Sep 22, 2021

This looks good to me. What do you think, @eridal?

@eridal
Copy link
Contributor

eridal commented Sep 22, 2021

I have a few concerns with this proposal:

The metadata is not handled correctly, and over time the collected stats will deviate from the packages.

Currently the sum of repos in the metadata is exactly the number of packages:

const packages = require('./data/packages.json')
const metadata = require('./data/metadata.json')

Object.keys(packages).length === Object.values(metadata.repos).reduce((a, b) => a + b, 0)

There are a few changes required to get those numbers correctly:

  1. in apply, when a change gets processed we need to update correctly the counter (+1/-1) depending if it is an insert/update/delete. My suggestion is to make extractUrl simply return the string url, so that later we can parse/validate the url and separate invalid/nulls urls.

  2. in updateRepoStats we want to allow the counter to have impact when the url is null. My suggestion is to create a new entry in the repos, lets say unset, and then handle the special case where url === null.

This new feature is quite tricky as there are multiple cases to take into consideration, like a repo that used to have an url but now is null, or the other way around.

@zarianec
Copy link
Collaborator Author

@eridal thanks for your review and suggestions. But I didn't get this one:

My suggestion is to make extractUrl simply return the string URL

Isn't it already return a string URL and we parse/validate it later? I don't see what do you want to change in it.

@zarianec
Copy link
Collaborator Author

It could be easier to implement this if we have tests for it. I already have tests that cover the update process and caching but as I mentioned before it requires some significant changes to the update.js script.

The biggest issue I had is the self-invoking anonymous function which makes it almost impossible to test because of process.exit

Currently, I use a workaround to prevent the script from immediate execution like:

module.exports = update

if (process.env.NODE_ENV !== 'test') {
  /**
   * Main
   */
  update().then(
    () => {
      console.log(metadata)
      process.exit(0)
    },
    (err) => {
      console.error(err)
      process.exit(1)
    }
  )
}

The better way IMO is to split it and move all the update logic to a separate file. update.js should only call this logic and finish the process.

@eridal I can push my tests here and we will work on edge cases/new requirements as it seems this PR will never get finished without it.

@zeke
Copy link
Member

zeke commented Sep 22, 2021

push my tests here and we will work on edge cases/new requirements

👍🏼

# Conflicts:
#	package-lock.json
#	package.json
#	scripts/stats.js
@zarianec
Copy link
Collaborator Author

@zeke @eridal pushed tests. I see we already have a lib folder. I think it's a good idea to split the update.js and move it to lib completely. Scripts in scripts will simply use it and handle the results.

eridal added a commit that referenced this pull request Sep 24, 2021
eridal added a commit that referenced this pull request Sep 24, 2021
eridal added a commit that referenced this pull request Sep 24, 2021
@eridal
Copy link
Contributor

eridal commented Sep 25, 2021

@zarianec I owe you an apology.

You have been putting a good amount of efforts into this feature. It was easy for me to quickly hack a solution, submit that PR, but doing so disregarded your work in the process. I'm sorry about it and I'm gonna take action to fix that.

This is open source, the code I wrote is from the community, and the more people involved the better; so I'd like you to take over this feature and push it to go live. So I'm gonna go ahead and close my PR, please use it as a source of inspiration --or not, you decide :)

Some ideas behind this PR are super good. My only suggestion is try not to change too much the code, but rather make smaller, incremental changes. If changes are not part of this feature, please submit another PR only for that particular change (ie: mocha update) so that this PR will stay minimal while also making progress. Also it'd be nice if you can execute a full-history reprocess, so more people knows how to do so.

Please accept my apologies, I am totally confident you will be able to deliver this feature!

@zeke
Copy link
Member

zeke commented Sep 27, 2021

Thanks for chiming in, @eridal. 💛

My only suggestion is try not to change too much the code, but rather make smaller, incremental changes.

Totally agree.

I'd like you to take over this feature and push it to go live.

Yeah let's do it while the fire is still hot. @zarianec is this PR ready to ship?

@zarianec
Copy link
Collaborator Author

@eridal thank you for your understanding, I really appreciate it. Some of the changes in your PR are really helpful and I moved it there (like handling invalid packages and better stats in the readme).

My only suggestion is try not to change too much the code, but rather make smaller, incremental changes.

I didn't want to refactor the tests as part of this PR and planned to open a new one after we finish with this. So I revert it for now and add the metadata tests you created in #39

Also it'd be nice if you can execute a full-history reprocess, so more people knows how to do so.

I did it already many times and must say that the caching you implemented is amazing - it saved me a LOT of time.

is this PR ready to ship?

Unfortunately no. I ran the full rebuild of the index and the metadata test suddenly failed:

1) metadata repos should match repos:

      AssertionError: expected 1050411 to equal 1050413
      + expected - actual

      -1050411
      +1050413

  2) metadata repos should match urls:

      AssertionError: expected 712688 to equal 712690
      + expected - actual

      -712688
      +712690

It took me some time to figure out why 2 packages are missing and the reason is pretty interesting. There is an NPM package called constructor and packages store is a generic javascript object so doing this:

const curr = packages[name]

Is the same as:

let foo = repos['constructor'] // we simply grab the object's constructor function
console.log(foo) // [Function: Object]

The better (and safer) way to handle the arbitrary list of key-value pairs in javascript is Map so I'm thinking of using it to store the packages list even considering performance overhead.

const packages = new Map(fs.existsSync(files.packages) ? Object.entries(require(files.packages)) : [])

@@ -427,7 +427,7 @@ const writeChanges = (deferred) => {
fs.writeFileSync(files.metadata, toJson(metadata))

if (batch.found > 0) {
fs.writeFileSync(files.packages, toJson(sort(packages)))
fs.writeFileSync(files.packages, toJson(sort(Object.fromEntries(packages))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you're using a Map when collecting data then turning it into an Object when finished? That makes sense to me. 👍🏼

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, basically this is it. I updated the sort function so it will work with the Map as an input instead of an object - no need to convert the map to an object before sorting it.

sorted[key] = object[key]
delete object[key]
sorted[key] = packagesMap.get(key)
delete packagesMap.delete(key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's an extra delete keyword here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, thanks


if (change.deleted) {
if (typeof curr === 'string') {
if (typeof curr !== 'undefined') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can take advantage of the map: package.has(name)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looked strange to me to check with packages.has() and use the curr variable then. And replacing curr with packages.get() everywhere looked too wordy.

@eridal
Copy link
Contributor

eridal commented Oct 1, 2021

Hey, this is looking great!

I wonder if it makes sense that this package exposes the Map to its consumers, so that they wont have the same issue that you found here.

@zarianec
Copy link
Collaborator Author

zarianec commented Oct 1, 2021

The issue with a constructor key happens only when you try to get the value if it wasn't set before. Since the final packages list has it already, it will work fine until the package will get deleted from the NPM 🙂

So exposing the Map instead of Object makes sense but we should consider memory overhead for Object->Map conversion.

Or we might expose the Object without a prototype (Object.create(null)) to keep the interface the same and make migration easier. But it will add performance overhead anyway.

With repository | ${notNull} | ${perc(notNull)}%
Null repository | ${areNull} | ${perc(areNull)}%
**Total** | ${total} | ${perc(total)}%

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: unnecessary whitespace here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's required here to split two tables. See how it looks without it:

Packages Count Percentage
With repository 1004148 64.84%
Null repository 544609 35.16%
Total 1548757 100.00%
Providers Count Percentage
:-------- -----: ----------:
GitHub 986341 63.69%
GitLab 3624 0.23%
Bitbucket 1148 0.07%
Others 13035 0.84%
Total 1004148 64.84%

hostnames.js Outdated
@@ -3,7 +3,7 @@
const URL = require('url')
const countValues = require('count-array-values')
const urls = Object.values(require('.'))
const hostnames = urls.map(url => URL.parse(url).hostname.replace(/^www\./i, ''))
const hostnames = urls.filter(u => u).map(url => URL.parse(url).hostname.replace(/^www\./i, ''))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use .filter(Boolean)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, fixed.

Comment on lines +25 to +26
expect(urls.some(url => isUrl(url)), 'should contain packages with repos').to.equal(true)
expect(urls.some(url => url === null), 'should contain packages without repos').to.equal(true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is expecting at least a null, which will not be the case if this is merged into current master. I'd say use the test to expect what we know it might be there, either a null or an url..

for (const url of Object.values(repos)) {
  expect(url === null || isUrl(url), url).to.equal(true)
}

Also nothing prevents the packages to be all set, and while super low probability there could be no-null at some point :))

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which will not be the case if this is merged into current master

Agree, it will fail if we merge it without updating the data. On the other hand master with these changes and old index should be considered broken so probably it's fine.

while super low probability there could be no-null at some point

The probability of that is so low that I would rather consider it a bug if our index doesn't have any null value. It will prevent the situation when packages without a repository are missing in the index because of a bug in the code.

@eridal
Copy link
Contributor

eridal commented Oct 2, 2021

Leaving aside those small comments, this looks ready to ship. Great work @zarianec!

The Map can be delayed as consumers of this package will always read whatever is exported as part of the object, so the prototype chain make things like constructor or hasOwnProperty safer for them.

const foo = {}
console.log(typeof foo.hasOwnProperty) // function
foo.hasOwnProperty = 'nope'
console.log(typeof foo.hasOwnProperty) // string

Let's talk about release. This a breaking change for others so we need a major version.

@zeke
Copy link
Member

zeke commented Oct 3, 2021

🍿 Great to see the progress here!

@zarianec
Copy link
Collaborator Author

zarianec commented Oct 7, 2021

The release plan should be the same as @eridal described in the #32 (comment) with one additional step - update the major version of the package.

@zarianec
Copy link
Collaborator Author

zarianec commented Oct 7, 2021

@eridal @zeke I can do it as soon as this PR gets approved by everyone and we decide that it's ready to be shipped.

@zeke
Copy link
Member

zeke commented Oct 7, 2021

Sounds good. Wanna have a zoom call soon to pair on the release?

@zarianec
Copy link
Collaborator Author

@zeke sorry, I had a lot of stuff to do last week. I run the update script locally to have metadata and packages ready. Probably it will run all night. I will push updated data to this PR tomorrow and it will be ready for merge.

So the plan is like this:

  1. push the updated data
  2. disable actions (just in case)
  3. merge it to master
  4. update the major version
  5. enable actions

Wanna have a zoom call soon to pair on the release?

good idea but could be tricky since I'm in a CEST timezone. DM me in Twitter so we could find the best time slot.

@zeke
Copy link
Member

zeke commented Oct 21, 2021

DM me in Twitter

Followed you, but don't see an option to DM just yet. You DM me. :)

@zeke
Copy link
Member

zeke commented Oct 25, 2021

Thanks for the updates @zarianec. I'll try to get this released in the new few days.

I see that there's a conflict. Is that expected? Should I force the changes from your updated version of metadata.json here?

@zarianec
Copy link
Collaborator Author

Yes, it's safe to force the changes from metadata.json. They will get updated on the next action run.

@zarianec
Copy link
Collaborator Author

@zeke I just checked if I have all required permissions to release this PR. Looks like I'm good to go so I'm going to release this PR tomorrow (need to rebuild index from scratch to eliminate NPM's private packages leakage)

@zeke
Copy link
Member

zeke commented Nov 23, 2021

Thanks @zarianec

Sorry I totally dropped the ball on this. 🙈

@zarianec zarianec merged commit 8bf0fdc into nice-registry:master Nov 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Include all packages, even if they don't have repos
3 participants