
Bug: Can't handle "date" format from json schema #77

Closed
balmasi opened this issue Feb 17, 2022 · 14 comments

balmasi (Contributor) commented Feb 17, 2022

Issue

It seems like the target can't handle a field with the following schema:

"some_date": {"format": "date", "type": ["string", "null"]}

It does, however, handle the date-time format.

Here's the traceback:

Traceback (most recent call last):
  File "/opt/venv_bigquery/bin/target-bigquery", line 8, in <module>
    sys.exit(main())
  File "/opt/venv_bigquery/lib/python3.7/site-packages/target_bigquery/__init__.py", line 338, in main
    persist_lines(config, singer_messages)
  File "/opt/venv_bigquery/lib/python3.7/site-packages/target_bigquery/__init__.py", line 219, in persist_lines
    flushed_state = flush_streams(records_to_load, row_count, stream_to_sync, config, state, flushed_state)
  File "/opt/venv_bigquery/lib/python3.7/site-packages/target_bigquery/__init__.py", line 274, in flush_streams
    ) for stream in streams_to_flush)
  File "/opt/venv_bigquery/lib/python3.7/site-packages/joblib/parallel.py", line 1056, in __call__
    self.retrieve()
  File "/opt/venv_bigquery/lib/python3.7/site-packages/joblib/parallel.py", line 935, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
  File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/opt/venv_bigquery/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 595, in __call__
    return self.func(*args, **kwargs)
  File "/opt/venv_bigquery/lib/python3.7/site-packages/joblib/parallel.py", line 263, in __call__
    for func, args, kwargs in self.items]
  File "/opt/venv_bigquery/lib/python3.7/site-packages/joblib/parallel.py", line 263, in <listcomp>
    for func, args, kwargs in self.items]
  File "/opt/venv_bigquery/lib/python3.7/site-packages/target_bigquery/__init__.py", line 301, in load_stream_batch
    flush_records(stream, records_to_load, row_count[stream], db_sync)
  File "/opt/venv_bigquery/lib/python3.7/site-packages/target_bigquery/__init__.py", line 315, in flush_records
    writer(out, parsed_schema, db_sync.records_to_avro(records_to_load.values()))
  File "fastavro/_write.pyx", line 592, in fastavro._write.writer
  File "fastavro/_write.pyx", line 547, in fastavro._write.Writer.write
  File "fastavro/_write.pyx", line 335, in fastavro._write.write_data
  File "fastavro/_write.pyx", line 285, in fastavro._write.write_record
  File "fastavro/_write.pyx", line 333, in fastavro._write.write_data
  File "fastavro/_write.pyx", line 249, in fastavro._write.write_union
ValueError: datetime.datetime(2021, 10, 27, 0, 0) (type <class 'datetime.datetime'>) do not match ['null', 'string']
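
For reference, here's a minimal sketch (an assumption about the failure mode, not the target's actual code) that reproduces the same fastavro error: the record value has already been turned into a datetime, but the Avro schema still declares the field as a plain nullable string.

import datetime
import io

import fastavro

# Nullable-string Avro field, matching the union in the error above.
schema = fastavro.parse_schema({
    "type": "record",
    "name": "my_stream",
    "fields": [{"name": "some_date", "type": ["null", "string"]}],
})

# Writing a datetime into that union raises the same ValueError.
fastavro.writer(io.BytesIO(), schema, [{"some_date": datetime.datetime(2021, 10, 27)}])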

Expected Outcome

Well, the writer should just convert that to the DATE type and happily write it to BQ.
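
A rough sketch of the kind of mapping I'd expect (a hedged guess, not the target's internals): treat format "date" as an Avro logical date and coerce the incoming string to datetime.date, which BigQuery then loads as a DATE column.

import datetime

def avro_type_for(prop):
    # Hypothetical helper: pick an Avro type from the JSON Schema "format".
    if prop.get("format") == "date":
        return {"type": "int", "logicalType": "date"}               # -> BigQuery DATE
    if prop.get("format") == "date-time":
        return {"type": "long", "logicalType": "timestamp-micros"}  # -> BigQuery TIMESTAMP
    return "string"

def coerce_date(value):
    # Convert the incoming ISO string to datetime.date so fastavro can serialize it.
    return None if value is None else datetime.date.fromisoformat(value)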

Workaround

Use the date-time format (or a plain string type) in the schema message for the stream.
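
For example, declaring the same field with date-time instead avoids the crash:

"some_date": {"format": "date-time", "type": ["string", "null"]}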

jmriego (Owner) commented Feb 19, 2022

Hi @balmasi, do you mind sending an example of what the tap is exporting that is causing this to fail? I mean, not just the schema but also the data; basically a complete example I can pipe into this target. I'll fix that and add it to the tests.

judahrand (Contributor):
This should be easy to fix if #86 is merged.

balmasi (Contributor, Author) commented Feb 22, 2022

> Hi @balmasi, do you mind sending an example of what the tap is exporting that is causing this to fail? I mean, not just the schema but also the data; basically a complete example I can pipe into this target. I'll fix that and add it to the tests.

Sure, here's some minimal stream output for you to cat into the target:

{"type": "SCHEMA", "stream": "my_stream", "schema": {"properties": {"id": {"type": ["integer"]},"some_date": {"format": "date", "type": ["string", "null"]}}, "type": "object"}, "key_properties": ["id"], "bookmark_properties": ["id"]}
{"type": "RECORD", "stream": "my_stream", "record": {"id": 1, "some_date": null}, "time_extracted": "2022-02-22T18:25:53.171486Z"}
{"type": "RECORD", "stream": "my_stream", "record": {"id": 2, "some_date": "2021-03-24"}, "time_extracted": "2022-02-22T18:25:53.186510Z"}
{"type": "STATE", "value": {"bookmarks": {"my_stream": {"replication_key": "id", "replication_key_value": 2}}}}

jmriego (Owner) commented Feb 26, 2022

Thanks for that @balmasi! This is almost ready; I'm just doing another round of testing to see if there's something I missed.
Can you tell me which tap you are running that gave you this format? I mostly work with other TransferWise taps day to day and have always seen everything extracted as a timestamp instead.

gnilrets:
I ran into this issue today when developing a custom tap for Quickbase. Plenty of date-only fields there.

jmriego mentioned this issue May 23, 2022
jmriego (Owner) commented May 23, 2022

Sorry about that. I just merged the fix for this. Do you mind checking the newer code and letting me know how it goes for you? Thanks!

gnilrets:
@jmriego - Looks like that commit is mapping date fields to BigQuery TIMESTAMP columns. Can you fix it so they map to regular DATE columns, so the target better aligns with the source data?
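
To illustrate the distinction with the google-cloud-bigquery client (a hedged example, not the target's own schema-generation code):

from google.cloud import bigquery

# What the commit reportedly produces for "some_date"...
current = bigquery.SchemaField("some_date", "TIMESTAMP", mode="NULLABLE")
# ...versus the column type that would match the source data.
desired = bigquery.SchemaField("some_date", "DATE", mode="NULLABLE")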

jmriego (Owner) commented May 24, 2022

Hi @gnilrets! Do you mind testing with the changedatetype branch? I have made the change but still need to add the right integration tests.

gnilrets commented May 24, 2022

@jmriego, I tried it, and it is loading dates, but it has some strange behavior: dates that are missing in the source system are being parsed as the maximum allowable date. Here's the warning I'm getting:

2022-05-24T17:47:17.809888Z [info     ] time=2022-05-24 10:47:17 name=target_bigquery level=WARNING message=Parsing the date "" in key "scheduled_completion_date" has failed, thus defaulting to max acceptable value of date in BigQuery cmd_type=loader job_id=testbq name=target-bigquery run_id=6c7e3f0c-1197-41f2-a36e-9ed72acac99e stdio=stderr

jmriego (Owner) commented May 24, 2022

My guess is that the tap you are developing outputs an empty string instead of a date in the case of nulls. Copying a message from above, this is what a null value in a date field looks like, and it is processed correctly:

{"type": "RECORD", "stream": "my_stream", "record": {"id": 1, "some_date": null}, "time_extracted": "2022-02-22T18:25:53.171486Z"}

As you have found out, if a date is impossible to process it defaults to the max value. This should happen very rarely; I've seen it mostly with Postgres, which accepts dates way past 9999-12-31, the maximum BigQuery can read.
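
In other words, something along these lines (a sketch of the behaviour described above, not the target's actual code):

import datetime

BIGQUERY_MAX_DATE = datetime.date(9999, 12, 31)  # largest DATE value BigQuery accepts

def parse_date(value):
    if value is None:                 # a real JSON null stays NULL
        return None
    try:
        return datetime.date.fromisoformat(value)
    except ValueError:                # "" or out-of-range values like 255999-01-01
        return BIGQUERY_MAX_DATE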

gnilrets:
@jmriego - Yeah, that's definitely it. I'm developing a tap and trying to feed it into BigQuery. I was able to correct it in my tap so that dates get output as null and when I do that it works as expected.

However, I'm new to using Singer taps/targets and I know that I'm not always going to have control over what the taps do, particularly when it comes to these kinds of edge cases. I wouldn't ever expect a blank string to be interpreted as the largest date available in the database (why not the smallest?). Is there some reason you would want to handle a null vs an empty string differently? Or is this just something I should file as an issue to be resolved?

jmriego (Owner) commented May 25, 2022

Basically, what is happening is that I'm using the standard Python libraries for dates, and those libraries don't accept the same range of dates as some databases. MySQL, for example, can have things like 2022-05-00, and Postgres accepts dates really far in the future like 255999-01-01.
In my own experience of using this target at my company for several years, massive dates outside the range Python can handle were the typical exception I was getting, so that's what I'm accounting for. I've never seen an empty string as a date so far, but I understand your tap is still in development, so maybe that's not a typical case?

So we end up with two questions:

  1. What date range should taps and targets support, and what should they do when something falls outside that range?
  2. Is it the job of the tap to provide correct dates, or is it the job of the target?

In my opinion it's the job of the tap, as I don't think adding new taps to the Singer ecosystem should mean all targets need to account for quirks in the new database/API we are reading from. But this sounds to me like a great discussion to have around Singer standards.

Are you in the Meltano Slack community? It would be great to have this discussion either there or in their repo so we can get more input from them, as they are working on defining the Singer standard.

gnilrets:
I guess I'm just of the opinion that if you're going to fall back to some value when a date string can't be parsed, using NULL would be more appropriate than the largest date the database can support. 255999-01-01 looks more like a data validation issue than any representation of a real future date.

I'm on the Meltano Slack and would be happy to continue that conversation there.

jmriego (Owner) commented Jun 3, 2022

The fix for the date type is already in prod. When a tap sends a date, it will no longer appear as a timestamp.

On the other point about wrong dates: we could make the parsing of wrong dates configurable. For now I'm going to leave it defaulting to the max date, the same way other Meltano targets do it.
It would be great if you could send a pull request, but if not I'll try to work on this as soon as I have the time.
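
Rough idea of what that configurable parsing could look like; the invalid_date_fallback knob is purely hypothetical, not an existing config option:

import datetime

BIGQUERY_MAX_DATE = datetime.date(9999, 12, 31)

def parse_date(value, invalid_date_fallback="max"):
    # "max" keeps the current behaviour; "null" would load a NULL instead.
    if value is None:
        return None
    try:
        return datetime.date.fromisoformat(value)
    except ValueError:
        return BIGQUERY_MAX_DATE if invalid_date_fallback == "max" else None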

Please let me know if the date parsing is fixed for you after this. Thanks!

jmriego closed this as completed Jun 24, 2022