Possible Performance Issue #29

Open
rushi-lonkar opened this issue Jan 7, 2025 · 7 comments

@rushi-lonkar

Hello,

Thanks for building this parser. I tested it with a sample 837 file and it worked fine. However, when I use a real data file, the line claims = spark.read.json(rdd) takes a very long time; I tried different cluster sizes, but the operation never seems to complete. The data file isn't that big (around 2 MB), so I wanted to know if there is a quick fix or if you have seen this issue during testing. Unfortunately, I cannot share the file due to privacy reasons.

Any help would be appreciated.

Thanks!

from databricksx12 import *
from databricksx12.hls import *
import json
from pyspark.sql.functions import input_file_name

hm = HealthcareManager()
df = spark.read.text("sampledata/837/*txt", wholetext=True)

# Parse each file into an EDI object, convert it to JSON, and tag it with its filename
rdd = (
    df.withColumn("filename", input_file_name()).rdd
      .map(lambda x: (x.asDict().get("filename"), x.asDict().get("value")))
      .map(lambda x: (x[0], EDI(x[1])))
      .map(lambda x: {**{"filename": x[0]}, **hm.to_json(x[1])})
      .map(lambda x: json.dumps(x))
)
claims = spark.read.json(rdd)

@zavoraad zavoraad self-assigned this Jan 7, 2025
@zavoraad
Contributor

zavoraad commented Jan 7, 2025

Benchmarks have conservatively been around 20,000 segments per minute for a single file.

Can you share how long it is taking and how many rows end up in the claims DataFrame?

Another benchmark is to simply load the raw file into a DataFrame, as shown here: https://github.com/databricks-industry-solutions/x12-edi-parser?tab=readme-ov-file#raw-edi-as-a-table. Is there a significant time difference between the two runs?

Currently the parser does not split EDI files (so a single 2 MB file will take the same time regardless of cluster size). We could improve performance by splitting on Functional Groups when a file contains many of them.
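
If it helps, here is a rough way to time the two stages separately (it reuses hm and the imports from your snippet above; the timings are just a ballpark, not a formal benchmark):

import time

# Stage 1: plain text read of the file(s) -- should be fast
t0 = time.perf_counter()
df = spark.read.text("sampledata/837/*txt", wholetext=True)
print(f"raw read: {df.count()} file(s) in {time.perf_counter() - t0:.1f}s")

# Stage 2: EDI parse + JSON conversion + schema inference -- the step in question
t0 = time.perf_counter()
parsed = (
    df.rdd
      .map(lambda row: EDI(row.value))
      .map(lambda edi: json.dumps(hm.to_json(edi)))
)
claims = spark.read.json(parsed)
print(f"parse + read.json: {claims.count()} rows in {time.perf_counter() - t0:.1f}s")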

@rushi-lonkar
Author

Thanks for your quick response.

20K segments per minute isn't bad at all. I am only processing a single file; after trying the Raw EDI as a table option, I see 88,775 rows and it finishes in 6 seconds.
The other process, which builds the claims DataFrame, ran for more than 15 minutes before I killed it. I tried multiple times and it never finished, so unfortunately I cannot give you an exact time. I will try again and let it run to see whether it ever completes. Maybe there is something odd with this data file, although it was parsed successfully by our in-house C# parser. I will also try other real data files to see if they get processed.

@zavoraad
Contributor

zavoraad commented Jan 7, 2025

Good to know. 88K segments is not bad, but each file can take a different amount of time depending on how tightly wound the HL segments are. The parser essentially unwinds the loops, which produces easier-to-read data but requires duplication and flattening.

Can you let me know how many segments in the Raw EDI as a table output start with GS and how many start with ST? That will help determine whether splitting processing by Functional Group and Transaction would speed things up, and by how much.
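
If it is easier to count them programmatically, something like this should do it (it assumes the common "~" segment terminator and "*" element separator; the actual delimiters are declared in your ISA segment):

# Count functional group (GS) and transaction set (ST) segments in the raw file
raw = spark.read.text("sampledata/837/*txt", wholetext=True)
segments = raw.rdd.flatMap(lambda row: row.value.split("~"))
gs_count = segments.filter(lambda s: s.strip().startswith("GS*")).count()
st_count = segments.filter(lambda s: s.strip().startswith("ST*")).count()
print(f"GS segments: {gs_count}, ST segments: {st_count}")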

@rushi-lonkar
Author

There is only one GS segment and one ST segment in the file that times out.

@rushi-lonkar
Author

Hi Aaron, quick update. I tried other data files as well, ranging from 1.5 to 4 MB, but none of them get parsed. I killed the process after around 70 minutes, otherwise our cluster costs would keep climbing :).
All of them have a single GS and a single ST segment. Could the recursion-style approach in claim.py be causing the delay for files that have a lot of CLM loops? I am just guessing here, so please ignore this if it doesn't make sense, but maybe the segments/loops should be cached.
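
To illustrate what I mean (this is just a guess at the shape of the problem, not the actual claim.py code), I was picturing something like an index of segment positions so each CLM loop doesn't rescan the whole file:

# Purely illustrative -- not the parser's real code or API
# If every CLM loop rescans the full segment list, precomputing an index of
# segment positions by ID turns repeated linear scans into dictionary lookups
def build_segment_index(segments):
    index = {}
    for pos, seg in enumerate(segments):
        seg_id = seg.split("*")[0]
        index.setdefault(seg_id, []).append(pos)
    return index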

@zavoraad
Contributor

zavoraad commented Jan 8, 2025

It turns out breaking up an individual file for parallel processing was not so bad. I added a branch, hotfix-large-files, and just reframed the order in which the JSON parsing happens.

Install the hotfix and try:

#df = spark.read....
hm = HealthcareManager()

# Parse each file and flatten it into individual claim-level pieces
rdd = (
    df.rdd
      .map(lambda row: EDI(row.value))
      .map(lambda edi: hm.flatten(edi))
      .flatMap(lambda x: x)
)

# Spread the flattened pieces across the cluster before converting each one to JSON
claims_rdd = (
    rdd
      .repartition(<NUMBER OF EXECUTOR CORES ON YOUR CLUSTER>)
      .map(lambda x: hm.flatten_to_json(x))
      .map(lambda x: json.dumps(x))
)

claims_df = spark.read.json(claims_rdd)
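
The key change is that hm.flatten(edi) breaks the file into claim-level pieces before any JSON conversion happens, so the repartition spreads those pieces across your executor cores and each piece is converted to JSON in parallel, instead of the whole file being processed as one record.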

I tested on remittance files of 18 MB and 192 MB on my local machine. The 192 MB file ran in under 3 minutes.


@rushi-lonkar
Author

Hi Aaron, thanks for the hotfix. I tried it in a separate repo in my Azure Databricks environment, but unfortunately I don't see any improvement. I double-checked that I pulled the correct code and used the new approach above, and I also played around with the number of executor cores, but no luck so far.

The line claims_df = spark.read.json(claims_rdd) did not finish after I waited for around 5 minutes.

My cluster config is below:

Runtime - 15.4 LTS (Scala 2.12, Spark 3.5.0)
Node Type - 14 GB Memory, 4 Cores
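
To narrow down where the time goes, I am going to time each stage separately next (rough sketch, using the same names as in the hotfix snippet above; nothing is cached, so the second step recomputes the flatten, but it should be good enough for a rough split):

import time

# Force the flatten stage on its own
t0 = time.perf_counter()
print(f"flatten: {rdd.count()} pieces across {rdd.getNumPartitions()} partitions "
      f"in {time.perf_counter() - t0:.1f}s")

# Then force the JSON conversion + schema inference
t0 = time.perf_counter()
claims_df = spark.read.json(claims_rdd)
print(f"read.json: {claims_df.count()} rows in {time.perf_counter() - t0:.1f}s")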
