Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Y-axis flipped when reading data with Xarray #650

Open
RY4GIT opened this issue Mar 13, 2023 · 15 comments
Open

Y-axis flipped when reading data with Xarray #650

RY4GIT opened this issue Mar 13, 2023 · 15 comments
Labels
bug Something isn't working upstream Issue is related to a dependency (upstream package).

Comments

@RY4GIT
Copy link

RY4GIT commented Mar 13, 2023

Update

Update on March 15, 2023 based on discussion here pydata/xarray#7621
First posted on March 13, 2023

Description

When reading data with Xarray using a rasterio engine, the Y axis is flipped. This happens with a specific dataset, and the summary is as follows.

Data\Engine NetCDF4 Rasterio
1. NSIDC output ❌ (Y axis is flipped)
2. NASA download

i.e., The issue happens with SMAP L4 data, pre-processed on the NSIDC server, read with rasterio engine. The issue does not happen with NetCDF4 engine. It did not happen with the one directly downloaded from NASA Earth Data.

When reading data with Xarray using both the netcdf4 engine and the rasterio engine, the Y axis is flipped. However, when reading the same data with the netcdf4 package, the issue doesn't seem to be occurring.

It's unclear whether the issue is with Xarray or the data itself.

image

Details

I have created a Jupyter notebook with the details of the issue, including the code used to analyze the data, which is available here

Data

  • The data is SMAP L4 Geophysical data, precipitation (https://nsidc.org/data/spl4smgp/versions/7). The issue happens for different versions of the dataset and files for any dates or times. I tested it with two different versions of datasets.
    • Dataset 1 is pre-processed & subset on NSIDC server
    • Dataset 2 is directly downloaded from NASA Earth Data portal

The data can be downloaded from the links provided by NASA and NSIDC (the links are specified in the notebook), or from my Google Drive.

Environment Information

  • xarray version 2023.2.0 (the issue happened with 2023.1.0 too)
  • rioxarray version 0.13.3
  • rasterio version 1.3.6 (the issue happened with 1.3.4 too)
  • GDAL version 3.6.2
  • Python version 3.11
  • Windows 11
    (But I remember I've encountered this issues since a few years ago for SMAP data with ver 3.9 on Win 10)

Installation method

  • conda forge

Please let me know if you need any further information or have any suggestions on resolving this issue.

I am cross-posting this Xarray issues.

Please let me know if you need any further information or have any suggestions on resolving this issue.
Thank you!

@dcherian
Copy link

Did you try the rioxarray engine instead of rasterio?

@RY4GIT
Copy link
Author

RY4GIT commented Mar 15, 2023

@dcherian Sorry I realized I should have posted this in rasterio repo. I was confused rasterio and rioxarray! I will try them out and let you know soon, but you may close the issue if you think this doesn't fit here.

@dcherian
Copy link

No! This is the place. Just try engine="rioxarray". engine="rasterio" is bundled with xarray and is about be removed. rioxarray is a more full-featured version of the same funcitonality

@snowman2
Copy link
Member

Actually, engine="rasterio" comes from rioxarray ref. It calls rioxarray.open_rasterio.

@dcherian
Copy link

Oh that's confusing! In that case, i guess it's calling rioxarray since its installed.

@RY4GIT
Copy link
Author

RY4GIT commented Mar 15, 2023

So I tried open_rasterio with NSIDC output, but the results are the same.
The issue did not happen with data directly downloaded from NASA Earth data (see updates on pydata/xarray#7621), so I am not sure it's an issue with rioxarray or the NSIDC output.

ds_NSIDC_rioxarray = rioxarray.open_rasterio(os.path.join(input_path, fn_NSIDC_output))
ds_NSIDC_rioxarray = ds_NSIDC_rioxarray.set_coords(["cell_lat", "cell_lon"])
ds_NSIDC_rioxarray

image

ds_NSIDC_rioxarray.precipitation_total_surface_flux[0].plot(y="cell_lat", x="cell_lon", vmin=-0.001)
image

@snowman2
Copy link
Member

snowman2 commented Jul 11, 2023

If you do this:

rds = rioxarray.open_rasterio(os.path.join(input_path, fn_NSIDC_output), group="Geophysical_Data")

or this:

rds = xarray.open_dataset(os.path.join(input_path, fn_NSIDC_output), group="Geophysical_Data", engine="rasterio")

The data plots correctly with rasterio.

EDIT: This does not fix the issue. The plot has the correct orientation. However, the data is missing coordinates.

@snowman2
Copy link
Member

If you do this, the lat/lon data looks correct:

ds_NSIDC_root  = xarray.open_dataset(data_path, variable=["cell_lon", "cell_lat"], engine='rasterio')
ds_NSIDC_root.cell_lat.plot()

@snowman2
Copy link
Member

This appears to be a rasterio/GDAL issue:

import rasterio
import netCDF4
import numpy

with netCDF4.Dataset(data_path) as nfh:
    nc_data = nfh['Geophysical_Data']['precipitation_total_surface_flux'][:]
with rasterio.open('netcdf:SMAP_L4_SM_gph_20150331T013000_Vv7032_001_HEGOUT.nc:/Geophysical_Data/precipitation_total_surface_flux') as rfh:
    rio_data = rfh.read(1, masked=True)
numpy.array_equal(nc_data, rio_data)
False
numpy.array_equal(nc_data, numpy.flip(rio_data, 0))
True

This is not something that can be resolved in rioxarray.

@snowman2 snowman2 added the upstream Issue is related to a dependency (upstream package). label Nov 9, 2023
@andypbarrett
Copy link

Hi,

I'm just found this issue while searching for a related problem. Thank you for documenting it so well.

I'm actually working on a tutorial for NSIDC to demonstrate working with SMAP data. So this is quite timely.

I do not think it is a software/tool problem but how you are trying to read the data. I'm using the original HDF5 file from NSIDC but the result is the same. I'm guessing you used a reprojection service to get the dataset you posted here.

TLDR: I think the problem is that rasterio does not interpret the coordinate information and treats the data as an image - which is what rasterio is for. If you want the features offered by rioxarray you can use:

ds_rioxarray  = xr.merge([
    xr.open_dataset(filepath, decode_coords="all"),
    xr.open_dataset(filepath, group='Geophysical_Data', decode_coords="all")['precipitation_total_surface_flux'],
])

I'm using the following packages

from pathlib import Path

import xarray as xr
import rioxarray

print(xr.__version__)
print(rioxarray.__version__)
2024.2.0
0.15.0
filepath = Path("smap_data/SMAP_L4_SM_gph_20150331T013000_Vv7032_001.h5")

The SMAP datasets you are using are either HDF5 (the original file from NSIDC) or NetCDF4 and follow most of the CF-Conventions, at least for the Level-4 data. So you can use

engine = "netcdf4"
ds = xr.merge([
    xr.open_dataset(filepath, engine=engine),
    xr.open_dataset(filepath, group='Geophysical_Data', engine=engine)['precipitation_total_surface_flux'],
    ])
ds

Or

engine="h5netcdf"
ds = xr.merge([
    xr.open_dataset(filepath, engine=engine, phony_dims="sort"),
    xr.open_dataset(filepath, group='Geophysical_Data', engine=engine, 
                               phony_dims="sort")['precipitation_total_surface_flux'],
    ])
ds

These both return an xarray dataset like

<xarray.Dataset> Size: 175MB
Dimensions:                           (phony_dim_2: 1, y: 1624, x: 3856)
Coordinates:
  * x                                 (x) float64 31kB -1.736e+07 ... 1.736e+07
  * y                                 (y) float64 13kB 7.31e+06 ... -7.31e+06
Dimensions without coordinates: phony_dim_2
Data variables:
    EASE2_global_projection           (phony_dim_2) |S1 1B ...
    cell_column                       (y, x) float64 50MB ...
    cell_lat                          (y, x) float32 25MB ...
    cell_lon                          (y, x) float32 25MB ...
    cell_row                          (y, x) float64 50MB ...
    time                              (phony_dim_2) datetime64[ns] 8B ...
    precipitation_total_surface_flux  (y, x) float32 25MB ...
Attributes:
    Comment:      HDF-5
    Contact:      http://gmao.gsfc.nasa.gov/
    Conventions:  CF
    Filename:     /discover/nobackup/projects/gmao/smap/SMAP_L4/L4_SM/Vv7032/...
    History:      File written by ldas2daac.x
    Institution:  NASA Global Modeling and Assimilation Office
    References:   see SMAP L4_SM Product Specification Documentation
    Source:       v17.11.1
    Title:        SMAP L4_SM Geophysical (GPH) Data Granule

Note that the x and y coordinates are projected-coordinates for the EASE-Grid v2.0 Global CRS. In your example they are geographic coordinates.

What I think is happening when you use either the rasterio backend or rioxarray is that the HDF5 GDAL driver (which is what rasterio is using under-the-hood) is not able to parse the coordinate information.

engine = "rasterio"
ds_rasterio = xr.open_dataset(filepath, engine=engine)

This returns a bunch of warnings about not being able to determine georeferencing

[/home/apbarret/mambaforge/envs/nsidc-tutorials/lib/python3.9/site-packages/rioxarray/_io.py:1132](http://localhost:8890/home/apbarret/mambaforge/envs/nsidc-tutorials/lib/python3.9/site-packages/rioxarray/_io.py#line=1131): NotGeoreferencedWarning: Dataset has no geotransform, gcps, or rpcs. The identity matrix will be returned.
  warnings.warn(str(rio_warning.message), type(rio_warning.message))  # type: ignore

Printing x and y shows that the data are not georeferenced because the grid cell column and row indices - image coordinates - are returned not the projected coordinates.

print(ds_rasterio.x, rasterio.y)
<xarray.DataArray 'x' (x: 3856)> Size: 31kB
array([5.0000e-01, 1.5000e+00, 2.5000e+00, ..., 3.8535e+03, 3.8545e+03,
       3.8555e+03])
Coordinates:
  * x                        (x) float64 31kB 0.5 1.5 ... 3.854e+03 3.856e+03
    EASE2_global_projection  int64 8B ... <xarray.DataArray 'y' (y: 1624)> Size: 13kB
array([5.0000e-01, 1.5000e+00, 2.5000e+00, ..., 1.6215e+03, 1.6225e+03,
       1.6235e+03])
Coordinates:
  * y                        (y) float64 13kB 0.5 1.5 ... 1.622e+03 1.624e+03
    EASE2_global_projection  int64 8B ...

I suspect that rasterio is treating the data as an image, which by convention have the origin of image coordinates at the upper-left corner of the upper-left pixel. So pixel[0.5, 0.5] is the upper-left-most pixel center. However, when you plot the data with the xarray.DataArray.plot method, the coordinate sytem that pcolormesh (the default plotting method) uses is a normal cartesian coordinate system with the origin at the lower left, so the image appears upside down.

Plotting cell_lat shows that this is the case, southern hemisphere grid-cells are plotted at the top of the image.

ds_rasterio.cell_lat.plot()

smap_h5_cell_lat

Using ds_rasterio.cell_lat.squeeze().plot.imshow(origin="upper") to plot cell lat, which sets the origin to be the upper-left cell, gives the expected result: postive latitudes at the top of the image.

If you want to take advantage of the accessors and methods provided by rioxarray then you need to import rioxarray and set decode_coords="all". Note: this throws a warning because there is no coordinate information in the Geophysical_Data group and xarray will only look in the group specified. IMO I don't think nested groups are necessary for these kinds of high-level products. Merging the root and precipitation datasets retains the georeferencing from the first call to open_dataset.

ds_rioxarray  = xr.merge([
    xr.open_dataset(filepath, decode_coords="all"),
    xr.open_dataset(filepath, group='Geophysical_Data', decode_coords="all")['precipitation_total_surface_flux'],
])

print(ds_rioxarray)
<xarray.Dataset> Size: 175MB
Dimensions:                           (phony_dim_2: 1, y: 1624, x: 3856)
Coordinates:
    EASE2_global_projection           (phony_dim_2) |S1 1B ...
  * x                                 (x) float64 31kB -1.736e+07 ... 1.736e+07
  * y                                 (y) float64 13kB 7.31e+06 ... -7.31e+06
Dimensions without coordinates: phony_dim_2
Data variables:
    cell_column                       (y, x) float64 50MB ...
    cell_lat                          (y, x) float32 25MB ...
    cell_lon                          (y, x) float32 25MB ...
    cell_row                          (y, x) float64 50MB ...
    time                              (phony_dim_2) datetime64[ns] 8B ...
    precipitation_total_surface_flux  (y, x) float32 25MB ...
Attributes:
    Source:       v17.11.1
    Institution:  NASA Global Modeling and Assimilation Office
    History:      File written by ldas2daac.x
    Comment:      HDF-5
    Filename:     /discover/nobackup/projects/gmao/smap/SMAP_L4/L4_SM/Vv7032/...
    Title:        SMAP L4_SM Geophysical (GPH) Data Granule
    Conventions:  CF
    References:   see SMAP L4_SM Product Specification Documentation
    Contact:      http://gmao.gsfc.nasa.gov/
[/tmp/ipykernel_230171/3642432009.py:3](http://localhost:8890/tmp/ipykernel_230171/3642432009.py#line=2): UserWarning: Variable(s) referenced in grid_mapping not in variables: ['EASE2_global_projection']
  xr.open_dataset(filepath, group='Geophysical_Data', decode_coords="all")['precipitation_total_surface_flux'],
ds_rioxarray.precipitation_total_surface_flux.plot(vmax=0.001)

smap_h5_precipitation

@vincentsarago
Copy link

vincentsarago commented Oct 24, 2024

@snowman2 I'm seeing the same behaviour

notebook: https://gist.github.com/vincentsarago/81f1e849ca8cef938a342ac49a2a1d49

I'm wondering if there is something going on with how the transform is calculated here

try:
src_left, _, _, src_top = self._unordered_bounds(recalc=recalc)
src_resolution_x, src_resolution_y = self.resolution(recalc=recalc)
except (DimensionMissingCoordinateError, DimensionError):
return Affine.identity() if transform is None else transform
return Affine.translation(src_left, src_top) * Affine.scale(
src_resolution_x, src_resolution_y
)

to me it seems that the transform should be

src_resolution_x, src_resolution_y = da.rio.resolution()
src_left, _, _, src_top = da.rio.bounds(recalc=True)
transform = Affine.translation(src_left, src_top) * Affine.scale(
    src_resolution_x, -src_resolution_y
)

at least for the file we are using the _unordered_bounds will have src_top be the bottom latitude 🤷 , so basically the origin for the transform will be the bottom left (instead of the top left in rasterio/gdal).

ref: cogeotiff/rio-tiler#756

@snowman2
Copy link
Member

@vincentsarago I believe that the transform in rioxarray should be equivalent. Do you have an example file you can share?

@vincentsarago
Copy link

vincentsarago commented Oct 30, 2024

@snowman2 you can get files using

import earthaccess
import xarray

earthaccess.login()

results = earthaccess.search_data(
    concept_id="C2036881735-POCLOUD", count=1, temporal=("2024-09-20", "2024-09-20")
)
fp = earthaccess.download(results, "earthaccess_data")[0]
ds = xarray.open_dataset(fp, decode_coords="all")

you can also check this file https://dap.ceda.ac.uk/neodc/esacci/land_surface_temperature/data/MULTISENSOR_IRCDR/L3S/0.01/v2.00/monthly/2020/11/ESACCI-LST-L3S-LST-IRCDR_-0.01deg_1MONTHLY_DAY-20201101000000-fv2.00.nc

(Note this file has also the overflow bounds issue, with bounds < -90 😓)

import rioxarray
import xarray
import fsspec

src_path = "https://dap.ceda.ac.uk/neodc/esacci/land_surface_temperature/data/MULTISENSOR_IRCDR/L3S/0.01/v2.00/monthly/2020/11/ESACCI-LST-L3S-LST-IRCDR_-0.01deg_1MONTHLY_DAY-20201101000000-fv2.00.nc"

filesystem = fsspec.filesystem("https")
fp = filesystem.open(src_path)
ds = xarray.open_dataset(fp, decode_coords="all", engine="h5netcdf")

da = ds["lst"]
print(da.rio.bounds())
print(da.rio._unordered_bounds())
print(da.rio.transform())

(-179.99999511705187, -90.00000274631076, 179.99999511705187, 89.9999874875217)
(-179.99999511705187, 89.9999874875217, 179.99999511705187, -90.00000274631076)
| 0.01, 0.00,-180.00|
| 0.00, 0.01,-90.00|
| 0.00, 0.00, 1.00|

note we've added a fix in rio-tiler to flip vertically the output array when encountering this case https://github.com/cogeotiff/rio-tiler/blob/9a39c802044b485a26c29065b539ec44c817d8a6/rio_tiler/io/xarray.py#L329-L332

@snowman2
Copy link
Member

we've added a fix in rio-tiler to flip vertically the output array when encountering this case

That sounds like a good solution to me 👍

@RY4GIT
Copy link
Author

RY4GIT commented Dec 20, 2024

Thank you all for joining the discussions and coming up with the solutions!! Glad to hear that it was helpful for developing tutorials too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working upstream Issue is related to a dependency (upstream package).
Projects
None yet
Development

No branches or pull requests

5 participants