Possible Performance Issue #29
Benchmarks have been conservatively around 20,000 segments per minute in a single file. Can you share how long the run is taking and how many rows end up in the claims DataFrame? Another useful benchmark is to simply load the file as a raw DataFrame, as described here: https://github.com/databricks-industry-solutions/x12-edi-parser?tab=readme-ov-file#raw-edi-as-a-table. Is there a significant time difference between the two runs? Currently the parser does not split EDI files (so a single 2 MB file will take the same time regardless of cluster size). We could improve performance by allowing a split by Functional Groups in cases where a file contains many of them.
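For context, a rough sketch of what that Functional Group split could look like if done outside the parser with plain string handling. This is only an illustration: it assumes the usual X12 conventions of ~ as the segment terminator and * as the element separator, with GS/GE as the group envelope; real files may use different delimiters.

def split_functional_groups(edi_text, segment_terminator="~", element_separator="*"):
    # Split the raw EDI text into segments and regroup them by GS...GE envelopes
    segments = [s.strip() for s in edi_text.split(segment_terminator) if s.strip()]
    groups, current = [], []
    for seg in segments:
        seg_id = seg.split(element_separator)[0]
        if seg_id == "GS":
            current = [seg]                # start a new functional group
        elif seg_id == "GE":
            current.append(seg)
            groups.append(segment_terminator.join(current) + segment_terminator)
            current = []
        elif current:
            current.append(seg)            # segment inside the current group
    return groups                          # each group could then be parsed independently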
Thanks for your quick response. 20K segments per minute isn't bad at all. I am only processing a single file, and after trying out the Raw EDI as a table option, I see 88,775 rows. It finishes in 6 seconds.
Good to know. 88K rows is not bad, but each file can take a different length of time depending on how tightly wound the HL segments are. The parser essentially unwinds the loops, which results in easier-to-read data but requires duplication and flattening. Can you let me know how many segments in the Raw EDI as a table start with GS and ST in the file? That will help determine whether splitting processing by Functional Group and Transaction will speed up processing time, and by how much.
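If it's easier, a quick way to get those counts straight from the raw text (assuming the common ~ segment terminator and * element separator; adjust if your file differs):

with open("path/to/your_837_file.txt") as f:       # hypothetical path, point at your own file
    segments = [s.strip() for s in f.read().split("~") if s.strip()]

segment_ids = [s.split("*")[0] for s in segments]
print("GS segments:", segment_ids.count("GS"))     # functional groups
print("ST segments:", segment_ids.count("ST"))     # transaction sets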
There is only one GS and one ST segment in my file that times out.
Hi Aaron, quick update here. I tried other data files as well, with sizes varying between 1.5 and 4 MB, but none of them are getting parsed. I killed the process after around 70 minutes, otherwise our cluster costs would keep increasing :).
It turns out breaking an individual file up for parallel processing was not so bad. I added a hotfix-large-files branch that just reorders where the JSON parsing happens. Install the hotfix and try:

#df = spark.read....
hm = HealthcareManager()

# Parse each whole-file row into an EDI object and flatten it into individual claim records
rdd = (df
  .rdd
  .map(lambda row: EDI(row.value))
  .map(lambda edi: hm.flatten(edi))
  .flatMap(lambda x: x)
)

# Repartition so the per-claim JSON work runs in parallel across executor cores
claims_rdd = (
  rdd
  .repartition(<NUMBER OF EXECUTOR CORES ON YOUR CLUSTER>)
  .map(lambda x: hm.flatten_to_json(x))
  .map(lambda x: json.dumps(x))
)

claims_df = spark.read.json(claims_rdd)

I tested on remittance files of 18 MB and 192 MB on my local machine; the 192 MB file ran in under 3 minutes.
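Side note on the repartition placeholder: if you would rather not hard-code the core count, Spark's default parallelism is a reasonable starting point. A sketch only, not part of the hotfix branch itself:

num_partitions = spark.sparkContext.defaultParallelism   # roughly the total cores available
claims_rdd = (
  rdd
  .repartition(num_partitions)
  .map(lambda x: hm.flatten_to_json(x))
  .map(lambda x: json.dumps(x))
)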
Hi Aaron, thanks for the hotfix. I tried it in a separate repo in my Azure Databricks environment, but unfortunately I don't see any improvement. I double-checked that I pulled the correct code and used the new approach above, and I also played around with the number of executor cores to see if that makes a difference, but no luck so far. The line claims_df = spark.read.json(claims_rdd) still hadn't finished after I waited around 5 minutes. My cluster config: Runtime 15.4 LTS (Scala 2.12, Spark 3.5.0).
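One thing that might help narrow this down (just a diagnostic sketch on top of the hotfix pipeline above, not a fix): spark.read.json has to evaluate the RDD to infer a schema, so forcing the upstream RDD first shows whether the EDI parsing or the JSON read is the slow part.

import time

start = time.time()
n_claims = claims_rdd.cache().count()              # forces the EDI parsing and JSON serialization
print(f"{n_claims} claim records in {time.time() - start:.1f}s "
      f"across {claims_rdd.getNumPartitions()} partitions")

start = time.time()
claims_df = spark.read.json(claims_rdd)            # schema inference now reads the cached RDD
print(f"spark.read.json took {time.time() - start:.1f}s")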
Hello,
Thanks for building this parser. I tested it with a sample 837 file and it worked fine. However, when I use a real data file, the line claims = spark.read.json(rdd) takes a long time to complete. I tried different cluster sizes, but that operation doesn't seem to finish. The data file isn't that big (around 2 MB), so I wanted to know if there is a quick fix or if you have seen this issue during testing. Unfortunately, I cannot share the file due to privacy reasons.
Any help would be appreciated.
Thanks!
from databricksx12 import *
from databricksx12.hls import *
import json
from pyspark.sql.functions import input_file_name

hm = HealthcareManager()

# Each file is read as a single row (wholetext), so one row == one EDI file
df = spark.read.text("sampledata/837/*txt", wholetext=True)

rdd = (
  df.withColumn("filename", input_file_name()).rdd
  .map(lambda x: (x.asDict().get("filename"), x.asDict().get("value")))
  .map(lambda x: (x[0], EDI(x[1])))                           # parse the raw EDI text
  .map(lambda x: {**{'filename': x[0]}, **hm.to_json(x[1])})  # convert to a dict tagged with the filename
  .map(lambda x: json.dumps(x))
)

claims = spark.read.json(rdd)