Index of local repository may not reflect its data holdings #104

Open
timronan opened this issue May 2, 2019 · 3 comments

@timronan
Contributor

timronan commented May 2, 2019

After downloading the request
IU ANMO * LHZ 2012-01-01T00:00:00 2012-02-01T00:00:00
and then manually deleting multiple data directories
rm -r ./data/IU/2012/01*
rover list-summary returns:

  IU_ANMO_00_LHZ 2012-01-01T00:00:00.069500 2012-01-31T23:59:59.069500
  IU_ANMO_10_LHZ 2012-01-01T00:00:00.069500 2012-01-31T23:59:59.069500

and rover list-index returns (only location code 00 is shown, for readability):

  IU_ANMO_00_LHZ_M (1 Hz)
    2012-01-01T00:00:00.069500 - 2012-01-01T23:59:59.069500
    2012-01-02T00:00:00.069500 - 2012-01-02T23:59:59.069500
    2012-01-03T00:00:00.069500 - 2012-01-03T23:59:59.069500
    2012-01-04T00:00:00.069500 - 2012-01-04T23:59:59.069500
    2012-01-05T00:00:00.069500 - 2012-01-05T23:59:59.069500
    2012-01-06T00:00:00.069500 - 2012-01-06T23:59:59.069500
    2012-01-07T00:00:00.069500 - 2012-01-07T23:59:59.069500
    2012-01-08T00:00:00.069500 - 2012-01-08T23:59:59.069500
    2012-01-09T00:00:00.069500 - 2012-01-09T23:59:59.069500
    2012-01-10T00:00:00.069500 - 2012-01-10T23:59:59.069500
    2012-01-11T00:00:00.069500 - 2012-01-11T23:59:59.069500
    2012-01-12T00:00:00.069500 - 2012-01-12T23:59:59.069500
    2012-01-13T00:00:00.069500 - 2012-01-13T23:59:59.069500
    2012-01-14T00:00:00.069500 - 2012-01-14T23:59:59.069500
    2012-01-15T00:00:00.069500 - 2012-01-15T23:59:59.069500
    2012-01-16T00:00:00.069500 - 2012-01-16T23:59:59.069500
    2012-01-17T00:00:00.069500 - 2012-01-17T23:59:59.069500
    2012-01-18T00:00:00.069500 - 2012-01-18T23:59:59.069500
    2012-01-19T00:00:00.069500 - 2012-01-19T23:59:59.069500
    2012-01-20T00:00:00.069500 - 2012-01-20T23:59:59.069500
    2012-01-21T00:00:00.069500 - 2012-01-21T23:59:59.069500
    2012-01-22T00:00:00.069500 - 2012-01-22T23:59:59.069500
    2012-01-23T00:00:00.069500 - 2012-01-23T23:59:59.069500
    2012-01-24T00:00:00.069500 - 2012-01-24T23:59:59.069500
    2012-01-25T00:00:00.069500 - 2012-01-25T23:59:59.069500
    2012-01-26T00:00:00.069500 - 2012-01-26T23:59:59.069500
    2012-01-27T00:00:00.069500 - 2012-01-27T23:59:59.069500
    2012-01-28T00:00:00.069500 - 2012-01-28T23:59:59.069500
    2012-01-29T00:00:00.069500 - 2012-01-29T23:59:59.069500
    2012-01-30T00:00:00.069500 - 2012-01-30T23:59:59.069500
    2012-01-31T00:00:00.069500 - 2012-01-31T23:59:59.069500

which does not reflect the local repo's data holdings. Furthermore, there is no rover command that allows the user to reindex the data holdings.

When rover retrieve request.txt is run in this situation, all of the requested data is collected. We should instead expect one of two outcomes: either none of the data is collected (if the local repo's stale index is compared to the availability service), or only the missing data is collected (if the data repo is re-indexed first and then compared to the availability service). The latter is the correct behavior.
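To make the "only the missing data" expectation concrete, here is a minimal sketch of the interval subtraction involved. It is not ROVER's code: the function name `missing_spans` and the day-level spans are assumptions for illustration, based on the `rm -r ./data/IU/2012/01*` example above removing day-of-year directories 010 through 019.

```python
from datetime import datetime

def missing_spans(request, held):
    """Return the parts of the requested (start, end) span not covered by held spans."""
    req_start, req_end = request
    gaps = []
    cursor = req_start
    for start, end in sorted(held):
        if end <= cursor:
            continue  # span lies entirely before the point already covered
        if start >= req_end:
            break  # span lies entirely after the request
        if start > cursor:
            gaps.append((cursor, start))  # uncovered stretch before this span
        cursor = max(cursor, end)
    if cursor < req_end:
        gaps.append((cursor, req_end))  # uncovered tail of the request
    return gaps

# Hypothetical holdings after days 010-019 were removed from a January 2012 request.
request = (datetime(2012, 1, 1), datetime(2012, 2, 1))
held = [
    (datetime(2012, 1, 1), datetime(2012, 1, 10)),
    (datetime(2012, 1, 20), datetime(2012, 2, 1)),
]
gaps = missing_spans(request, held)
print(gaps)
```

With an up-to-date index, only the span from 2012-01-10 to 2012-01-20 would need to be fetched; with an empty index the whole request would be returned as one gap.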

@timronan timronan changed the title from "Index of Local Repository may not reflect its data holdings" to "Index of local repository may not reflect its data holdings" May 2, 2019
@chad-earthscope
Contributor

ROVER's notion of data holdings is in the index database. If a user is manually removing data files from a repository and not modifying the index, I do not think that is an issue for ROVER to solve.

There is a command to (re)index data files (i.e. rover index), but I do not think this is a re-indexing issue. Instead, I think this is an issue about removing data, and this is what #21 is about. Data should be managed by ROVER if you want ROVER's notion of the data to remain consistent.

@timronan
Contributor Author

timronan commented May 2, 2019

I forgot about the index command. Interestingly, if the data holdings are indexed after data is manually removed, running rover retrieve IU_ANMO_*_LHZ 2012-01-01T00:00:00 2012-02-01T00:00:00 only collects the data that was manually removed instead of the entire request. This seems like the behavior that we want. It seems like this indexing step could be added to list-index and list-summary so false information is not being reported to the user when these commands are run.

We could put an indexing step into retrieve._query so we are certain that the local repo's index is up to date before running the Source._new_retrieval.

This patch would bring us a step closer to the delete function outlined in issue #21, and it would also prevent ROVER from presenting false information to the user. Adding this audit step would keep ROVER's index and data holdings aligned no matter what happens to the data. It gives the program a way to check automatically for errors, whether human or computer derived.
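A minimal sketch of what such an audit step could look like, assuming the index can be reduced to a list of file paths relative to the data directory. The function `audit_index` and the file names are hypothetical; this does not use ROVER's actual index schema or internals.

```python
import os
import tempfile

def audit_index(indexed_paths, root):
    """Compare paths recorded in an index against the files actually under root."""
    on_disk = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            on_disk.add(os.path.relpath(full, root))
    indexed = set(indexed_paths)
    return {
        "stale": sorted(indexed - on_disk),      # indexed, but no longer on disk
        "unindexed": sorted(on_disk - indexed),  # on disk, but not in the index
    }

# Hypothetical layout: one file survives, one was removed by hand.
with tempfile.TemporaryDirectory() as root:
    kept = os.path.join("IU", "2012", "020", "ANMO.IU.2012.020")
    removed = os.path.join("IU", "2012", "010", "ANMO.IU.2012.010")
    os.makedirs(os.path.join(root, os.path.dirname(kept)))
    open(os.path.join(root, kept), "w").close()
    report = audit_index([kept, removed], root)
print(report)
```

Listing paths this way reads no waveform data, but it still walks every directory in the repository.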

@chad-earthscope
Contributor

> Interestingly, if the data holdings are indexed after data is manually removed, running rover retrieve IU_ANMO_*_LHZ 2012-01-01T00:00:00 2012-02-01T00:00:00 only collects the data that was manually removed instead of the entire request.

Yes, it appears that the reindexing smartly handles the manually removed data, but see more below. This is not done automatically because it can be a huge operation that checks terabytes of data. If a user manually modifies the data files "under" ROVER, they should not expect ROVER to automagically figure it out. It is perfectly reasonable for the user to issue the index command when they wish to resynchronize the files with the index after some manual modifications.

With that said, I did find an apparent bug in the (re)indexing. Repeatable test case below.

Download 14 days of data and manually remove the first 9 days:

rover init .
rover retrieve IU_ANMO_00_LHZ 2012-01-01T00:00:00 2012-01-15T00:00:00
rm -r data/IU/2012/00*

Now list the index, which still shows all 14 days:

rover list-index net=*

  IU_ANMO_00_LHZ_M (1 Hz)
    2012-01-01T00:00:00.069500 - 2012-01-01T23:59:59.069500
    2012-01-02T00:00:00.069500 - 2012-01-02T23:59:59.069500
    2012-01-03T00:00:00.069500 - 2012-01-03T23:59:59.069500
    2012-01-04T00:00:00.069500 - 2012-01-04T23:59:59.069500
    2012-01-05T00:00:00.069500 - 2012-01-05T23:59:59.069500
    2012-01-06T00:00:00.069500 - 2012-01-06T23:59:59.069500
    2012-01-07T00:00:00.069500 - 2012-01-07T23:59:59.069500
    2012-01-08T00:00:00.069500 - 2012-01-08T23:59:59.069500
    2012-01-09T00:00:00.069500 - 2012-01-09T23:59:59.069500
    2012-01-10T00:00:00.069500 - 2012-01-10T23:59:59.069500
    2012-01-11T00:00:00.069500 - 2012-01-11T23:59:59.069500
    2012-01-12T00:00:00.069500 - 2012-01-12T23:59:59.069500
    2012-01-13T00:00:00.069500 - 2012-01-13T23:59:59.069500
    2012-01-14T00:00:00.069500 - 2012-01-14T23:59:59.069500

Now (re)index and list the index again:

rover index
rover list-index net=*
$

Empty!?!! Oops, something is wrong.

Repeat the exact same steps: (re)index and list the index again:

rover index
rover list-index net=*

  IU_ANMO_00_LHZ_M (1 Hz)
    2012-01-10T00:00:00.069500 - 2012-01-10T23:59:59.069500
    2012-01-11T00:00:00.069500 - 2012-01-11T23:59:59.069500
    2012-01-12T00:00:00.069500 - 2012-01-12T23:59:59.069500
    2012-01-13T00:00:00.069500 - 2012-01-13T23:59:59.069500
    2012-01-14T00:00:00.069500 - 2012-01-14T23:59:59.069500

Now it is back!? This is what I would have expected after the first index.
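The expected result of the first (re)index can be checked independently of ROVER. The sketch below assumes the repository keeps one directory per day of year under `data/IU/2012/`, which is what the `00*` glob in the test case suggests; `surviving_days` is a hypothetical helper, not part of ROVER.

```python
from datetime import date, timedelta
from fnmatch import fnmatch

def surviving_days(start, end, pattern):
    """Return dates in [start, end) whose day-of-year directory does not match pattern."""
    days = []
    current = start
    while current < end:
        dirname = f"{current.timetuple().tm_yday:03d}"  # e.g. "009" for 9 January
        if not fnmatch(dirname, pattern):
            days.append(current)
        current += timedelta(days=1)
    return days

# The retrieve covered 2012-01-01 up to (not including) 2012-01-15; rm used "00*".
remaining = surviving_days(date(2012, 1, 1), date(2012, 1, 15), "00*")
print([d.isoformat() for d in remaining])
# → ['2012-01-10', '2012-01-11', '2012-01-12', '2012-01-13', '2012-01-14']
```

The glob `00*` matches directories 001 through 009, so days 10 through 14 are the ones that should remain in the index, matching the listing after the second (re)index.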
