Add support to Azure blob sink to configure Content-Encoding and Content-Type #21795

Open
heshanperera-alert opened this issue Nov 14, 2024 · 15 comments
Labels
sink: azure_blob Anything `azure_blob` sink related type: feature A value-adding code addition that introduce new functionality.

Comments

@heshanperera-alert

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

When the Azure blob sink uploads log files in gzip format, Vector adds a 'content-encoding' header. When we download such a file and try to extract it, we run into a "file corrupted" error. However, if we manually remove the content-encoding header from the blob and then download it, everything works as expected. There doesn't seem to be a way to remove this header through the configuration. What should we do? The file properties shown in the Azure portal are as follows.

[Image: file properties in the Azure portal]

Configuration

No response

Version

0.37.1

Debug Output

vector  | 2024-11-13T21:13:00.214520Z DEBUG sink{component_kind="sink" component_id=azstorage_out component_type=azure_blob}:request{request_id=121}:request: azure_core::policies::transport: the following request will be passed to the transport policy: Request {
vector  |     url: Url {
vector  |         scheme: "https",
vector  |         cannot_be_a_base: false,
vector  |         username: "",
vector  |         password: None,
vector  |         host: Some(
vector  |             Domain(
vector  |                 "xxxx.blob.core.windows.net",
vector  |             ),
vector  |         ),
vector  |         port: None,
vector  |         path: "xxxx/2024/11/12/rcs-2024-11-12-21-29-21.json-22dc3472-6d30-4719-a867-23678e88b43a.log.gz",
vector  |         query: None,
vector  |         fragment: None,
vector  |     },
vector  |     method: Put,
vector  |     headers: Headers(
vector  |         {
vector  |             HeaderName(
vector  |                 "user-agent",
vector  |             ): HeaderValue(
vector  |                 "azsdk-rust-storage/0.17.0 (1.77.0; linux; aarch64)",
vector  |             ),
vector  |             HeaderName(
vector  |                 "x-ms-blob-type",
vector  |             ): HeaderValue(
vector  |                 "BlockBlob",
vector  |             ),
vector  |             HeaderName(
vector  |                 "x-ms-version",
vector  |             ): HeaderValue(
vector  |                 "2020-10-02",
vector  |             ),
vector  |             HeaderName(
vector  |                 "x-ms-date",
vector  |             ): HeaderValue(
vector  |                 "Wed, 13 Nov 2024 21:13:00 GMT",
vector  |             ),
vector  |             HeaderName(
vector  |                 "content-length",
vector  |             ): HeaderValue(
vector  |                 "26712",
vector  |             ),
vector  |             HeaderName(
vector  |                 "x-ms-blob-content-encoding",
vector  |             ): HeaderValue(
vector  |                 "gzip",
vector  |             ),
vector  |             HeaderName(
vector  |                 "authorization",
vector  |             ): HeaderValue(
vector  |                 "SharedKey xxxxs=",
vector  |             ),
vector  |             HeaderName(
vector  |                 "x-ms-blob-content-type",
vector  |             ): HeaderValue(
vector  |                 "text/plain",
vector  |             ),
vector  |         },
vector  |     ),
vector  |     body: Bytes(

Example Data

No response

Additional Context

No response

References

No response

@heshanperera-alert heshanperera-alert added the type: bug A code related bug. label Nov 14, 2024
@pront
Member

pront commented Nov 14, 2024

Hi @heshanperera-alert, thanks for creating this issue.

> When we try to download and extract the gzip file we are running in to file corrupted error.

Can you help me understand the following: is the Vector request accepted or rejected?
If the Vector request is successful, does the Azure portal return an error when you attempt to download the file?

@heshanperera-alert
Author

Hello @pront

No, the request does not fail; it's successful. Azure doesn't return an error when downloading either. The download succeeds, but when I try to extract the file, I get this error:
[Image: error shown when extracting the file]

When I use gunzip:

:~/Downloads/ > gunzip rcs-2024-11-13-19-14-1-2f3ffb38-f2d7-40f4-b722-1871c45142ba.log.gz
gunzip: rcs-2024-11-13-19-14-1-2f3ffb38-f2d7-40f4-b722-1871c45142ba.log.gz: not in gzip format

@pront
Member

pront commented Nov 14, 2024

Is this a valid gzip-compressed file, or is your blob raw bytes? If the latter, you have to set https://vector.dev/docs/reference/configuration/sinks/azure_blob/#compression to none.

@heshanperera-alert
Author

@pront I believe it's a valid gzip. If I remove the gzip content-encoding header by editing the blob's properties in the Azure portal, everything works fine: I can decompress the file after downloading it. I don't want to set compression to none, since we are going to send TBs of data and would like to keep the size to a minimum with compression.

@pront
Member

pront commented Nov 14, 2024

I see, thank you for sharing these details. Internally we set the BlobContentEncoding which ultimately determines the value of the "x-ms-blob-content-encoding" header.

Unfortunately, I don't have an Azure environment that I can use to test this myself but I am not convinced that removing the content encoding header is the right thing to do. I wonder if it's an issue with Azure or with the crate version we are using.

@heshanperera-alert
Author

@pront do you have any workaround in mind to get around this in the short run? It's unrealistic to remove the header from each and every file on Azure Blob, as we have millions of files out there, and there's no command to do that from the Azure CLI either.

@jszwedko
Member

jszwedko commented Nov 14, 2024

The note on the compression option on the AWS S3 sink may be relevant here:

> Some cloud storage API clients and browsers handle decompression transparently, so depending on how they are accessed, files may not always appear to be compressed.

https://vector.dev/docs/reference/configuration/sinks/aws_s3/#compression

The same thing may apply to Azure Blob Storage. That is: if you download via the browser or some SDKs the file will be transparently decompressed when downloading.
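The effect described above can be reproduced locally without any cloud access. The following is a minimal sketch, not Azure's or Vector's actual code: `fetch` is a hypothetical stand-in for a download client, and the sample record is made up. It shows why a client that honors `content-encoding: gzip` hands back bytes that `gunzip` then rejects.

```python
import gzip


def fetch(stored_body: bytes, headers: dict, honor_encoding: bool) -> bytes:
    """Hypothetical download client: returns the bytes that end up on disk."""
    if honor_encoding and headers.get("content-encoding") == "gzip":
        # Transparent decompression: the caller never sees the gzip stream.
        return gzip.decompress(stored_body)
    return stored_body


# Made-up sample record standing in for a log line.
original = b'{"Timestamp":"2024-01-01T00:00:00Z"}\n'
stored = gzip.compress(original)           # what the sink uploaded
headers = {"content-encoding": "gzip"}     # metadata stored with the blob

raw = fetch(stored, headers, honor_encoding=False)
decoded = fetch(stored, headers, honor_encoding=True)

print("raw starts with gzip magic:", raw[:2] == b"\x1f\x8b")
print("decoded equals original text:", decoded == original)

# Decompressing the already-decoded bytes fails, much like gunzip does.
try:
    gzip.decompress(decoded)
except gzip.BadGzipFile as err:
    print("second decompression failed:", err)
```

If this is what is happening, the downloaded `.log.gz` file is intact plain text that only carries a misleading extension.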

@heshanperera-alert
Author

@jszwedko interesting. The good thing about the S3 sink is that it can override the content-encoding header. The Azure blob sink doesn't have that capability.

@pront
Member

pront commented Nov 14, 2024

Based on a quick internet search (see this), Jesse is right. Azure decompresses automatically.

Did you inspect the contents of the downloaded file on your host? Let us know; if so, we can close this issue.

(Note that gzip files start with the magic bytes [0x1f, 0x8b])
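A minimal sketch of that check, assuming the downloaded file is available locally. The helper names `looks_like_gzip` and `file_looks_like_gzip` are illustrative and not part of Vector or any Azure SDK.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_like_gzip(data: bytes) -> bool:
    """Return True if the buffer starts with the gzip magic bytes."""
    return data[:2] == GZIP_MAGIC


def file_looks_like_gzip(path: str) -> bool:
    """Read only the first two bytes of a file and test them."""
    with open(path, "rb") as f:
        return looks_like_gzip(f.read(2))
```

Calling `file_looks_like_gzip` on the downloaded `.log.gz` file would return `False` if the download was already decompressed on the way down.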

@heshanperera-alert
Author

@pront shouldn't the blob sink have the same capability as the S3 sink, so that we could override the header?

@pront
Member

pront commented Nov 15, 2024

Are you referring to these?

We can add these to the Azure Blob Storage sink as well. Not opposed to that 👍
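To make the request concrete, here is a purely illustrative sketch of how such overrides could resolve into the two headers seen in the debug output. It is not Vector's implementation: the function `blob_headers`, its parameters, and the rule that an empty string suppresses the header are all assumptions. The defaults (`gzip` derived from compression, `text/plain` as content type) are taken from the request shown in the debug output, and only the `gzip` and `none` compression values are modeled.

```python
from typing import Dict, Optional


def blob_headers(
    compression: str,
    content_encoding: Optional[str] = None,
    content_type: Optional[str] = None,
) -> Dict[str, str]:
    """Hypothetical resolution of blob headers from sink settings.

    None means "not configured, use the default"; an empty string for
    content_encoding means "send no encoding header at all" (an assumption).
    """
    if content_encoding is not None:
        encoding = content_encoding
    else:
        # Simplification: only gzip is mapped to an encoding here.
        encoding = "gzip" if compression == "gzip" else ""

    headers = {"x-ms-blob-content-type": content_type or "text/plain"}
    if encoding:
        headers["x-ms-blob-content-encoding"] = encoding
    return headers


# Current behaviour as seen in the debug output.
print(blob_headers("gzip"))
# Requested behaviour: keep gzip compression but drop the encoding header.
print(blob_headers("gzip", content_encoding="", content_type="application/gzip"))
```

With the encoding header omitted, clients would have no reason to decompress transparently, so the downloaded file would stay a real gzip archive.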


What I am trying to understand is whether we have a Vector bug or not. If I am reading the above correctly, the downloaded blob is already decompressed but has a gz extension. This should be easy to verify on your side. This comment explains in detail how to get the raw data without decompressing, using Python APIs.

@heshanperera-alert
Author

Oh yeah, sorry, I hadn't answered your question about the magic bytes. The file does not start with 0x1f, 0x8b.

~/Documents/git/vector-eventhub-poc/ > hexdump -C -n 16 ~/Downloads/rcs-2024-11-13-19-14-1-2f3ffb38-f2d7-40f4-b722-1871c45142ba.log.gz
00000000  7b 22 54 69 6d 65 73 74  61 6d 70 22 3a 22 32 30  |{"Timestamp":"20|
00000010
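As a small cross-check, the 16 bytes from the hexdump above can be decoded directly. This sketch only uses the byte values shown in that output; the variable names are illustrative.

```python
# The 16 bytes printed by hexdump, copied from the output above.
hexdump_bytes = "7b 22 54 69 6d 65 73 74 61 6d 70 22 3a 22 32 30"

head = bytes.fromhex(hexdump_bytes)   # spaces between byte pairs are ignored
is_gzip = head[:2] == b"\x1f\x8b"     # gzip magic bytes
text = head.decode("ascii")           # readable text if the file is plain JSON

print("starts with gzip magic:", is_gzip)
print("decoded prefix:", text)
```

The file begins with `7b 22`, the characters `{"`, so it is readable JSON text rather than a gzip stream.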

@heshanperera-alert
Author

@pront do you know when the feature to overwrite the headers can be added to blob sink?

@pront
Member

pront commented Nov 15, 2024

> @pront do you know when the feature to overwrite the headers can be added to blob sink?

Unfortunately this is not on our radar, there's on open feature request for this. If you are motivated, you are welcome to submit a PR and we will review it.

@pront
Member

pront commented Nov 15, 2024

> oh yeah sorry havent answered your question regarding the magic bytes. Its not having the 0x1f, 0x8b.
>
> ~/Documents/git/vector-eventhub-poc/ > hexdump -C -n 16 ~/Downloads/rcs-2024-11-13-19-14-1-2f3ffb38-f2d7-40f4-b722-1871c45142ba.log.gz
> 00000000  7b 22 54 69 6d 65 73 74  61 6d 70 22 3a 22 32 30  |{"Timestamp":"20|
> 00000010

Thank you for confirming. You can also inspect the contents to see if it matches what you published as one more verification step.

@jszwedko jszwedko added sink: azure_blob Anything `azure_blob` sink related type: feature A value-adding code addition that introduce new functionality. and removed type: bug A code related bug. labels Nov 15, 2024
@jszwedko jszwedko changed the title from "Azure blob sink adding Content-Encoding header to the file corrupts upon downloading." to "Add support to Azure blob sink to configure Content-Encoding and Content-Type" Nov 15, 2024
3 participants