Possible Performance Issue #29

Open
rushi-lonkar opened this issue Jan 7, 2025 · 7 comments

@rushi-lonkar

Hello,

Thanks for building this parser. I tested it with a sample 837 file and it worked fine. However, when I use a real data file, the line claims = spark.read.json(rdd) takes a very long time; I tried different cluster sizes, but the operation never seems to complete. The data file isn't that big (around 2 MB), so I wanted to know if there is a quick fix or if you have seen this issue during testing. Unfortunately, I cannot share the file due to privacy reasons.

Any help would be appreciated.

Thanks!

from databricksx12 import *
from databricksx12.hls import *
import json
from pyspark.sql.functions import input_file_name

hm = HealthcareManager()
df = spark.read.text("sampledata/837/*txt", wholetext=True)

# Parse each file into an EDI object, convert it to JSON, and tag it with its filename
rdd = (
    df.withColumn("filename", input_file_name()).rdd
      .map(lambda x: (x.asDict().get("filename"), x.asDict().get("value")))
      .map(lambda x: (x[0], EDI(x[1])))
      .map(lambda x: {**{"filename": x[0]}, **hm.to_json(x[1])})
      .map(lambda x: json.dumps(x))
)
claims = spark.read.json(rdd)

@zavoraad zavoraad self-assigned this Jan 7, 2025
@zavoraad
Contributor

zavoraad commented Jan 7, 2025

Benchmarks have conservatively been around 20,000 segments per minute for a single file.

Can you share how long it is taking and how many rows end up in the claims DataFrame?

Another benchmark is to simply load the raw file into a DataFrame, as shown here: https://github.com/databricks-industry-solutions/x12-edi-parser?tab=readme-ov-file#raw-edi-as-a-table. Is there a significant time difference between the two runs?

Currently the parser does not split EDI files (so a single 2 MB file will take the same time regardless of cluster size). We could improve performance by splitting on Functional Groups when a file contains many of them.
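
If it helps, here is a rough way to time the two stages separately (it reuses hm and the imports from your snippet above; the timings are just a ballpark, not a formal benchmark):

import time

# Stage 1: plain text read of the file(s) -- should be fast
t0 = time.perf_counter()
df = spark.read.text("sampledata/837/*txt", wholetext=True)
print(f"raw read: {df.count()} file(s) in {time.perf_counter() - t0:.1f}s")

# Stage 2: EDI parse + JSON conversion + schema inference -- the step in question
t0 = time.perf_counter()
parsed = (
    df.rdd
      .map(lambda row: EDI(row.value))
      .map(lambda edi: json.dumps(hm.to_json(edi)))
)
claims = spark.read.json(parsed)
print(f"parse + read.json: {claims.count()} rows in {time.perf_counter() - t0:.1f}s")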

@rushi-lonkar
Author

Thanks for your quick response.

20K segments per minute isn't bad at all. I am only processing a single file; after trying the Raw EDI as a table option, I see 88,775 rows and it finishes in 6 seconds.
The other process, which builds the claims DataFrame, ran for more than 15 minutes before I killed it. I tried multiple times and it never finished, so unfortunately I cannot give you an exact time. I will try again and let it run to see whether it ever completes. Maybe there is something odd with this data file, although it was parsed successfully by our in-house C# parser. I will also try other real data files to see if they get processed.

@zavoraad
Contributor

zavoraad commented Jan 7, 2025

Good to know. 88K segments is not bad, but each file can take a different amount of time depending on how tightly wound the HL segments are. The parser essentially unwinds the loops, which produces easier-to-read data but requires duplication and flattening.

Can you let me know how many segments in the Raw EDI as a table output start with GS and how many start with ST? That will help determine whether splitting processing by Functional Group and Transaction would speed things up, and by how much.
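
If it is easier to count them programmatically, something like this should do it (it assumes the common "~" segment terminator and "*" element separator; the actual delimiters are declared in your ISA segment):

# Count functional group (GS) and transaction set (ST) segments in the raw file
raw = spark.read.text("sampledata/837/*txt", wholetext=True)
segments = raw.rdd.flatMap(lambda row: row.value.split("~"))
gs_count = segments.filter(lambda s: s.strip().startswith("GS*")).count()
st_count = segments.filter(lambda s: s.strip().startswith("ST*")).count()
print(f"GS segments: {gs_count}, ST segments: {st_count}")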

@rushi-lonkar
Author

There is only one GS segment and one ST segment in the file that times out.

@rushi-lonkar
Author

Hi Aaron, quick update. I tried other data files as well, ranging from 1.5 to 4 MB, but none of them get parsed. I killed the process after around 70 minutes, otherwise our cluster costs would keep climbing :).
All of them have a single GS and a single ST segment. Could the recursion-style approach in claim.py be causing the delay for files that have a lot of CLM loops? I am just guessing here, so please ignore this if it doesn't make sense, but maybe the segments/loops should be cached.
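
To illustrate what I mean (this is just a guess at the shape of the problem, not the actual claim.py code), I was picturing something like an index of segment positions so each CLM loop doesn't rescan the whole file:

# Purely illustrative -- not the parser's real code or API
# If every CLM loop rescans the full segment list, precomputing an index of
# segment positions by ID turns repeated linear scans into dictionary lookups
def build_segment_index(segments):
    index = {}
    for pos, seg in enumerate(segments):
        seg_id = seg.split("*")[0]
        index.setdefault(seg_id, []).append(pos)
    return index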

@zavoraad
Contributor

zavoraad commented Jan 8, 2025

It turns out breaking up an individual file for parallel processing was not so bad. I added a branch, hotfix-large-files, and just reframed the order in which the JSON parsing happens.

Install the hotfix and try:

#df = spark.read....
hm = HealthcareManager()

# Parse each file and flatten it into individual claim-level pieces
rdd = (
    df.rdd
      .map(lambda row: EDI(row.value))
      .map(lambda edi: hm.flatten(edi))
      .flatMap(lambda x: x)
)

# Spread the flattened pieces across the cluster before converting each one to JSON
claims_rdd = (
    rdd
      .repartition(<NUMBER OF EXECUTOR CORES ON YOUR CLUSTER>)
      .map(lambda x: hm.flatten_to_json(x))
      .map(lambda x: json.dumps(x))
)

claims_df = spark.read.json(claims_rdd)
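
The key change is that hm.flatten(edi) breaks the file into claim-level pieces before any JSON conversion happens, so the repartition spreads those pieces across your executor cores and each piece is converted to JSON in parallel, instead of the whole file being processed as one record.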

I tested on remittance files of 18 MB and 192 MB on my local machine. The 192 MB file ran in under 3 minutes.


@rushi-lonkar
Author

Hi Aaron, thanks for the hotfix. I tried it in a separate repo in my Azure Databricks environment, but unfortunately I don't see any improvement. I double-checked that I pulled the correct code and used the new approach above, and I also played around with the number of executor cores, but no luck so far.

The line claims_df = spark.read.json(claims_rdd) did not finish after I waited for around 5 minutes.

My cluster config is below:

Runtime - 15.4 LTS (Scala 2.12, Spark 3.5.0)
Node Type - 14 GB Memory, 4 Cores
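
To narrow down where the time goes, I am going to time each stage separately next (rough sketch, using the same names as in the hotfix snippet above; nothing is cached, so the second step recomputes the flatten, but it should be good enough for a rough split):

import time

# Force the flatten stage on its own
t0 = time.perf_counter()
print(f"flatten: {rdd.count()} pieces across {rdd.getNumPartitions()} partitions "
      f"in {time.perf_counter() - t0:.1f}s")

# Then force the JSON conversion + schema inference
t0 = time.perf_counter()
claims_df = spark.read.json(claims_rdd)
print(f"read.json: {claims_df.count()} rows in {time.perf_counter() - t0:.1f}s")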
