Running NLLoc with input time grids located in several directories. #46

Open
codeelw opened this issue Apr 19, 2024 · 11 comments


codeelw commented Apr 19, 2024

Hi Anthony,

I am using a 1 km-spaced grid which covers all of New Zealand. Due to the large size of this grid (22 TB), I have split the time grids across three directories and created symbolic links to them within the directory I am running NonLinLoc in. I was wondering how best to approach locating events with the grids split across the directories. Is there a way to scan across the directories recursively?

Currently my directories are set up as follows:
TIME1 -> /Volumes/GeoPhysics_42/users-data/williaco/1_km_grids_ABAZ_to_095A/TIME
TIME2 -> /Volumes/GeoPhysics_43/users-data/williaco/1_km_grids_095B_to_LRAN/TIME
TIME3 -> /Volumes/GeoPhysics_36/users-data/williaco/1km_grids/TIME

All of these symbolic links are stored in /Volumes/GeoPhysics_35/users-data/williaco/NonLinLoc_1km_grid_run_directory/TIME.
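(I created these links with ln -s, e.g.
ln -s /Volumes/GeoPhysics_42/users-data/williaco/1_km_grids_ABAZ_to_095A/TIME TIME1
and similarly for TIME2 and TIME3.)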

I currently have my LOCFILES line of the control file set up as follows:
LOCFILES IN/1920935.nll NLLOC_OBS TIME/TIME1/NZ_3D OUT/located

However, this only searches one directory. Is there a way to adjust this line in the control file to search through TIME1, TIME2 and TIME3 at the same time? I have tried a number of wildcard combinations in place of the TIME1 part with no success. Alternatively, is there a way to run this part of the control file recursively?

Many thanks,

Codee


alomax commented Apr 19, 2024

Hi Codee,

NLL grid I/O is coded to read one buffer file and one corresponding header file per grid. I do not see any way to change this to support reading a single grid split across different file pairs without some very complex, low-level (and difficult to maintain?) C code changes and additions to NLL.

So I think the solution has to be implemented at the OS or higher level, so that NLL "sees" a single file, even if it is distributed on different physical or logical devices.

It looks like one quite simple solution would be a RAID 0 disk array. You may be able to set up a single logical disk, from the point of view of the OS, which is stored across several physical disks under RAID 0. See: https://www.intel.com/content/www/us/en/support/articles/000005867/technologies.html#raid0
This should also increase read speed over using one disk.
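For example, on Linux something like the following could create a striped volume (an untested sketch - mdadm must be installed, and /dev/sdb, /dev/sdc, /dev/sdd are placeholders for three empty disks):

# create a RAID 0 (striped) array from three disks, then format and mount it
sudo mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
sudo mkfs.ext4 /dev/md0
sudo mount /dev/md0 /mnt/time_grids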

I hope this helps, otherwise we can think about other options...

Best regards,

Anthony

@trap000d

Thanks Anthony, that makes sense. Unfortunately (Codee has not mentioned it), all these volumes are network shares, so I guess RAID won't work.
I've just tried a combination of 'mount --bind' + 'mount -t overlay', which seems to work for NFS. Hopefully the three extra layers of mount abstraction will not cause a massive performance degradation.
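For reference, a sketch of the overlay part (a read-only merge of the three TIME directories onto the run directory's TIME path; options untested for this NFS setup):

# With only lowerdir layers (no upperdir/workdir) the overlay is read-only,
# which is all NLLoc needs for the travel-time grids.
sudo mount -t overlay overlay \
  -o lowerdir=/Volumes/GeoPhysics_42/users-data/williaco/1_km_grids_ABAZ_to_095A/TIME:/Volumes/GeoPhysics_43/users-data/williaco/1_km_grids_095B_to_LRAN/TIME:/Volumes/GeoPhysics_36/users-data/williaco/1km_grids/TIME \
  /Volumes/GeoPhysics_35/users-data/williaco/NonLinLoc_1km_grid_run_directory/TIME

NLLoc then sees a single TIME directory, so the LOCFILES line can simply use TIME/NZ_3D.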


alomax commented Apr 22, 2024

OK, this is interesting - please tell me if it works in practice!
On an iMac I find that NLLoc is fastest when the travel-time grids are all in memory, even with a local SSD. But maybe in a more sophisticated network/server environment disk access (and caching) is very fast.

For all of New Zealand, a 1 km grid sounds very fine - the velocity model must be much smoother in many areas, especially at depth, though I suppose there is 1 km (or less) detail at shallow depths. One solution to this situation is a new NLL feature called cascading grids - the travel-times are calculated at full resolution for precision, but the resulting travel-time grids are stored with increasing cell size with depth. See this paper, where this procedure is introduced and described for all of Italy (~1200 x 1200 x 800 km):

(Latorre, D., Di Stefano, R., Castello, B., Michele, M., & Chiaraluce, L. (2023). An updated view of the Italian seismicity from probabilistic location in 3D velocity models: The 1981–2018 Italian catalog of absolute earthquake locations (CLASS). Tectonophysics, 846, 229664. https://doi.org/10.1016/j.tecto.2022.229664)


codeelw commented Oct 20, 2024

Hi Anthony,

Thanks for your help! In the past few months I have managed to work around the large grid size by storing all the grids separately from where I run NonLinLoc, locating events in clusters in time or space, and copying over only the grids needed for those events (approximately 2.5 TB) into the run directory. However, the size of the initial grid is still an issue: each event takes approximately 14 minutes to locate, and with approximately 400,000 events to locate in total, even running 40 instances of NonLinLoc at once it will still take approximately 3.5 months to locate them all. Would using the cascading grids result in a quicker run time, do you think? If so, how long would it take to convert my 1 km-spaced grid files to cascading grids? I also have approximate locations for these events; is there a way to start the initial grid around this localised area, limiting the number of iterations that need to be completed?

My input parameters are as follows:
LOCSEARCH OCT 100 240 60 0.001 4320000 5000 1 0
LOCGRID 1001 2401 601 -400.0 -1200.0 -3.0 1.0 1.0 1.0 PROB_DENSITY SAVE
LOCMETH EDT_OT_WT 99999.9 4 -1 -1 -1 0 -1.0 0
LOCGAU 0.2 0.0
LOCGAU2 0.02 0.05 2

Any advice would be greatly appreciated!
Many thanks,
Codee


alomax commented Oct 21, 2024

Hello Codee,

I am not clear on the scale of your problem and the file sizes - what is the size of one travel-time grid *.buf file?
This seems a tough problem...

You seem to be using maxNum3DGridMemory = 0, so NLLoc should be leaving the grids on disk and accessing the needed travel-times directly within the grid files on disk. I think this is slower, perhaps much slower, than reading all grids into memory, and may be OS dependent. But I suppose you run this way because there is much less memory available than the total size of the needed travel-time grids.
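(For reference, maxNum3DGridMemory is the sixth numeric field of the LOCMETH statement you quoted; as far as I recall, a negative value means no limit on the number of 3D grids read into memory, e.g.:
LOCMETH EDT_OT_WT 99999.9 4 -1 -1 -1 -1 -1.0 0
but check this against the control-file documentation for your NLL version.)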

Would using the cascading grids result in a quicker run time do you think?

The largest cascading grids I have run were 606 MB for each original travel-time .buf file and 22 MB after conversion to cascading grids. This is a size reduction factor of about 28, but it depends on the layer depths at which the cell size doubles. So if, with the grid sizes reduced by a factor of about 30, you can fit in memory all the travel-time grids for the events with the most arrival picks, then the run time should be quicker to much quicker. But this also depends on the OS and hardware speed in shuffling grids from disk to memory, as each subsequent event will in general still need a different set of station grids.

If so how long would it take to convert my 1km spaced grid files to cascading grids?

The conversion from cubic travel-time grids to cascading grids is much faster (~5x?) than the original Grid2Time calculation of travel-times on the cubic grid.

I also have approximate locations for these events, is there a way to start the initial grid around this localised area, limiting the number of iterations that need to be completed?

If you mean can NLLoc read only portions of the travel-time grids around the initial location - no, this functionality is not implemented, and it would still require a lot of disk access.
However, you could divide your whole study area into overlapping regional models, each including all stations needed to locate an event in the central portion of that region (e.g. the non-overlapping part). This might reduce the travel-time grid sizes by a factor that allows the run time (with grids read into memory, or perhaps left on disk) to become reasonable or even much faster.

I hope the above helps you to advance,
Best regards,
Anthony


codeelw commented Oct 22, 2024

Hi Anthony,

Thanks for your quick response. I have been running with maxNum3DGridMemory set to 0 as I found this to be quicker: I could run more instances of NonLinLoc without using up too much memory, which worked out faster overall than running fewer instances with more memory each. However, I am in the process of moving to an environment where I have more memory capacity, and so will be able to use maxNum3DGridMemory more effectively.

Each individual travel-time .buf file is between 4 GB and 6 GB. My current grids go down to 600 km with 1 km spacing. Using the cascading grids, I was going to aim for 1 km spacing down to 50 km and 4 km spacing below that. This would reduce my grid size with depth by a factor of about 3.2, which I estimate would reduce the run time from 14 minutes per event to around 4.5 minutes per event. From the updates page I can see that to create these grids I'll need the tool GridCascadingDecimate. I've been struggling to find documentation explaining how this tool works and what parameters I need to provide it. Do you have any advice on how best to use this tool, or can you point me towards a document where I can learn more about it?

As for dividing the study area into individual models, I think this would be difficult to implement for earthquakes detected on stations spanning multiple regional models. There would also have to be a lot of overlap between the models to capture events on their edges, and I'm not sure this would save time in the long run, though it is definitely worth exploring if I need to focus on smaller regions later on.

Thank you again for your support with this.
Kind regards,
Codee


alomax commented Oct 22, 2024

Hi Codee,

OK, I think I now have a good sense of the scale of your grids, files and calculation times. Wow!

Several minutes per event seems very, very long - usually I see single-event times of sub-second to a few seconds, and in the course of a study I also rerun all locations many times.

I think the ultimate solution to your configuration and the huge size of the travel-time files would be to reduce the effective size of the station travel-time grids, either as stored on disk or through some very elegant and fast preprocessing code, so that NLL only "sees" much smaller grids.

The cascading grids created by GridCascadingDecimate are one way to do this, but I am not sure this will be sufficient for the scale of your grids.
In any case, here is a generic bash script to convert regular travel-time grid files to cascading grids:
run_casc_grid.bash.txt

All you need to do is:

  1. create your regular NLL travel-time grids as usual (as you have already done)
  2. convert the grids to cascading grids using the script
  3. use the path/root name of the cascading grids in your NLLoc control file, just as you would use the path/root name of regular NLL travel-time grids.

You do not need to otherwise change your NLL control file or specify that you are using cascading grids; this is handled automatically by NLL, since there is a CASCADING_GRID line in the cascading grid *.hdr files.
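For example, if the cascading grids are written under a directory TIME_CASC (a hypothetical name), the LOCFILES line from earlier in this thread would simply become:
LOCFILES IN/1920935.nll NLLOC_OBS TIME_CASC/NZ_3D OUT/located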

However, I wonder if an ultimate way to run locations with such a huge model might be to generate travel-time grids for each station only in a limited area centered on the station. This would "only" require chopping your velocity model around each station to create an NLL velocity (slowness-length) grid for that station, and then running Grid2Time (and perhaps GridCascadingDecimate). The same TRANS must be used for all grids and for NLLoc, and a correct orig_grid_x/y/z must be set for each velocity grid to position it correctly in the geographic transform space.
With such station-centric grids, and a custom local/regional LOCGRID centered on the approximate location of each event (or group of nearby events, for efficiency), NLLoc would automatically use only the stations whose travel-time grids fully contain the LOCGRID - the nearby stations.
There are probably some complications and several pre-processing scripts/procedures to create, but perhaps no changes to the base NLL software modules. One complication may be supporting the large depth extent of the model/travel-times and the use of fairly distant stations for deep, subduction-zone events...
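As an illustration of the custom LOCGRID (numbers hypothetical, in the same TRANS coordinates as the full model): for an event with approximate position (x0, y0), a 200 x 200 km search grid keeping the full depth extent might look like
LOCGRID 201 201 601 <x0 - 100> <y0 - 100> -3.0 1.0 1.0 1.0 PROB_DENSITY SAVE
where <x0 - 100> and <y0 - 100> are placeholders for the shifted grid origin.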

Anyway, I hope the above is helpful and may lead to a practical solution.

Best regards,

Anthony


codeelw commented Oct 24, 2024

Hi Anthony,

Thanks for your advice on this issue. I have tested this out on some test stations and events, but I am getting some unexpected results and wonder if you might be able to help. I think maybe I have not built the cascading grids correctly, but I was unable to use nllgrid to plot or print the cascading grid files to check this.

Here is an example of the original 1km header file:
1001 2401 601 -400.000000 -1200.000000 -3.000000 1.000000 1.000000 1.000000 TIME FLOAT
CAW -93.742300 -170.910567 -0.283000
TRANSFORM TRANS_MERC RefEllipsoid GRS-80 LatOrig -41.763800 LongOrig 172.903700 RotCW 140.000000 UseFalseEasting 0 FalseEasting 500000 ScaleFactor 1.000000

Here is an example of a cascading grid header file:
1001 2401 601 -400.000000 -1200.000000 -3.000000 1.000000 1.000000 1.000000 TIME FLOAT
CAW -93.742300 -170.910567 -0.283000
TRANSFORM TRANS_MERC RefEllipsoid GRS-80 LatOrig -41.763800 LongOrig 172.903700 RotCW 140.000000 UseFalseEasting 0 FalseEasting 500000 ScaleFactor 1.000000

CASCADING_GRID 2 50.000000,100.000000,

When I create the cascading grids everything seems to go through okay, taking about four and a half minutes per station/phase (compared to about 15 minutes per station/phase for the original 1 km grids). The grid files are also much smaller, reducing from about 4 GB to 772 MB. It only converts the .time files into cascading files, not the .angle files - is this okay? There aren't any errors or warnings printed when I make the grids either. To make the grids I have been using the bash script you attached, with the following parameters:

MODEL_GRID_ROOT=/Volumes/GeoPhysics_46/users-data/williaco/test_Time/NZ_3D
TIME_GRIDS_PATH=/Volumes/GeoPhysics_46/users-data/williaco/test_Time
GRID_NAME=NZ_3D
STATION_SRCE_FILE=stations_test_2.in
TRANS="TRANS TRANS_MERC GRS-80 -41.7638 172.9037 140"
DEPTHS=50,100 # doubling_depths

Am I right in interpreting the DEPTHS parameter as the depths at which I want to double the cell size? My grids currently go down to 600 km in 1 km cells, and I was hoping that setting this parameter to 50,100 would give 1 km depth cells down to 50 km, 2 km cells from 50-100 km, and 4 km cells from 100 km to 600 km. I anticipated this would change my location parameters as follows:
LOCSEARCH OCT 100 240 18 0.001 1296000 5000 1 0
LOCGRID 1001 2401 175 -400.0 -1200.0 -3.0 1.0 1.0 1.0 PROB_DENSITY SAVE

However, when I use this setup I get locations which are very different to those produced by my original 1 km grids (a change from lat -40.295497, lon 173.482202, depth 237781.25 m to lat -39.813136, lon 173.153869, depth 170697.917 m with the cascading grids). From looking at other people's locations of the same event, the first location (from the 1 km grids) is what I would expect.

I am able to replicate the 1 km location when I use the same LOCSEARCH and LOCGRID as I do with my normal 1 km grids:

LOCSEARCH OCT 100 240 60 0.001 4320000 5000 1 0
LOCGRID 1001 2401 600 -400.0 -1200.0 -3.0 1.0 1.0 1.0 PROB_DENSITY SAVE

However, with this setup I am completing the same number of iterations as previously, and it is still taking 14 minutes per event to locate. Am I correct in assuming that I should be able to reduce the number of iterations with the cascading grids, as I tried above? I have been estimating the number of iterations as initNumCells_x * initNumCells_y * initNumCells_z * 3.

I appreciate any help or suggestions you may have with this.

Thanks,
Codee


alomax commented Oct 24, 2024

Hi Codee,

It only converts the .time files into cascading files, not the .angle files - is this okay?

I think so - the angle files are only used once, at the end of location, to get take-off angles, and are always read directly from disk. But if you need the angles and reading the 1 km angle grids is slow, perhaps cascading angle files can be created and will work correctly (I have not investigated this). In any case there is a complication: the time and angle grids must have the same path/root name.

Am I right in interpreting the DEPTHS parameter as the depths at which I want to double the cell size?

Correct, and I think your description of the new cell sizes with depth is correct.

But your LOCGRID
LOCGRID 1001 2401 175 -400.0 -1200.0 -3.0 1.0 1.0 1.0 PROB_DENSITY SAVE
should be the same with or without cascading grids, so always use
LOCGRID 1001 2401 600 -400.0 -1200.0 -3.0 1.0 1.0 1.0 PROB_DENSITY SAVE
LOCGRID is always defined with constant-size cells; it is never a cascading geometry.
This explains your hypocenter difference, as your cascading setting for LOCGRID only extends to (175 - 1) * 1.0 - 3.0 = 171 km depth.
Similarly for LOCSEARCH, it can stay:
LOCSEARCH OCT 100 240 18 0.001 1296000 5000 1 0

More importantly, I did not previously notice your very large initNumCells for the octree grid and huge maxNumNodes:

LOCSEARCH OCT initNumCells_x initNumCells_y initNumCells_z minNodeSize maxNumNodes numScatter useStationsDensity stopOnMinNodeSize
LOCSEARCH OCT 100 240 60 0.001 4320000 5000 1 0

That likely explains the long location time, perhaps more than the travel-time grid sizes!
I would suggest:
LOCSEARCH OCT 20 48 12 0.001 50000 5000 1 0
or
LOCSEARCH OCT 30 72 18 0.001 100000 5000 1 0
where initNumCells_x * initNumCells_y * initNumCells_z ~ 1/3 to 1/5 of maxNumNodes,
and initNumCells_x, initNumCells_y, initNumCells_z are proportional to the x, y, z extents of LOCGRID.
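As a check of these suggestions: 20 * 48 * 12 = 11520, about 1/4 of maxNumNodes = 50000, and 30 * 72 * 18 = 38880, roughly 1/3 of 100000. Both sets of initNumCells are in the ratio 5:12:3, matching the 1000 x 2400 x 600 km extent of your LOCGRID.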

I hope the above helps advance!

Anthony


codeelw commented Nov 13, 2024

Hi Anthony,

Thanks for your advice on changing my LOCSEARCH parameters - I have managed to optimise them using your suggestions, and have been able to get events running quickly and produce several years of locations. The only issue I am encountering (and have been encountering for a short while) is that with certain input files NLLoc stops running with the following error message:
free(): double free detected in tcache 2
Aborted (core dumped).

Looking into the kernel logs, it seems the following error is given:
error 6 : (data) write to an unmapped area.

I believe this is something to do with memory allocation within NonLinLoc itself? I get the error more frequently when I use the following LOCHYPOUT parameters, but I also get it when using SAVE_NLLOC_ALL on its own:
LOCHYPOUT SAVE_NLLOC_ALL SAVE_NLLOC_EXPECTATION

I wondered if you knew the best approach to avoid this in running my future locations?

Thanks again for your continued support!

Kind regards,
Codee


alomax commented Nov 15, 2024

Hi Codee,

This does not seem an easy problem to resolve - it may be platform, compiler, or version dependent!

One issue with NLL input is that tab characters ("\t") sometimes cause segmentation faults.
If there are any tabs in your NLL control file(s), try replacing them with spaces.
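For example (a sketch assuming GNU grep/sed; nll_control.in is a placeholder for your control file name):

grep -nP '\t' nll_control.in          # list any lines containing tab characters
sed -i.bak 's/\t/ /g' nll_control.in  # replace each tab with a space, keeping a .bak backup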

Otherwise, can you try exactly the same input files on a different machine and/or OS to see if the error persists?

I hope this helps some.

Anthony
