
Noise analysis #22

Merged: 9 commits merged into main on Oct 1, 2024
Conversation

dzhingarska (Collaborator)

Comparing the shock positions over time of flow data and flow data with additional noise.

@warisa-r self-requested a review on September 24, 2024, 16:31
# Returns
- A new `FlowData` struct with noise added to the relevant fields.
"""
function apply_noise_to_flow(flow_data::FlowData, noise_data::NoiseData)
warisa-r (Owner):
Could we try adding noise to only some properties in some cases and see if things go differently?

dzhingarska (Collaborator, Author):
What do you mean exactly?

warisa-r (Owner):
For example: try adding noise to pressure and density and leave the velocity field alone.
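
A self-contained sketch of that idea on plain arrays (the function name add_selective_noise and the keyword arguments are hypothetical, not the FlowData/NoiseData API of this PR):

using Random

# Add Gaussian noise only to the scalar fields (density, pressure) and
# leave the velocity field untouched, as suggested above.
function add_selective_noise(density, velocity, pressure;
                             intensity = 0.01, rng = Random.default_rng())
    noisy_density  = density  .+ intensity .* randn(rng, size(density)...)
    noisy_pressure = pressure .+ intensity .* randn(rng, size(pressure)...)
    return noisy_density, velocity, noisy_pressure  # velocity deliberately unchanged
end

# Example on dummy 64x64 fields:
rho, v, p = add_selective_noise(rand(64, 64), rand(64, 64, 2), rand(64, 64); intensity = 0.02)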

@warisa-r changed the title from "Ina test branch" to "Noise analysis" on Sep 24, 2024
@warisa-r linked an issue on Sep 24, 2024 that may be closed by this pull request
@@ -4,6 +4,9 @@ using ShockwaveProperties
using Unitful
using ShockwaveDetection

# Create a NoiseData instance
warisa-r (Owner):
Can u leave this out :)

dzhingarska (Collaborator, Author):
ok

dzhingarska (Collaborator, Author):
I left it there because you asked not to change the file last time, but I can remove the noise data instance

warisa-r (Owner), Sep 28, 2024:
I meant, please don’t commit to this file since it’s not relevant to this PR? Sorry for the confusion.

Compares shock fits over time between original and noisy data.
# Pad the shorter array with zeroes to match lengths
if length(original_params) > length(noisy_params)
padding = fill(0.0, length(original_params) - length(noisy_params))
warisa-r (Owner):
The issue is that this would only work if all of our models had only 1 important parameter. But that is not the case for most of our essential models (the majority, and the ones that are fitted most often, are parabola, circle, and line, all of which have 2-3 parameters).

Also, what about the case where the model really does have only 1 parameter, like hline and vline (check out the sod tube 2d case), and that parameter is 0.0? Then the padded value is a spot-on guess and doesn't signify at all that the algorithm failed to fit something, even though it did.

Even though the number in missing_count does vaguely say something, the mean you get from this part of the code is tampered with.

dzhingarska (Collaborator, Author):
I will fix the issue with hline and vline by padding with a special value like NaN. This now gives a clear distinction between models that fit poorly (high RMSE) and models where parameters were simply missing. Additionally, I now extract all the parameters, not just the first one; that was my bad.
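
A minimal sketch of the NaN-padding idea described above (pad_with_nan is a hypothetical helper, not necessarily the code that ended up in analysis/noise.jl):

function pad_with_nan(original_params::Vector{Float64}, noisy_params::Vector{Float64})
    n = max(length(original_params), length(noisy_params))
    # NaN marks "parameter missing", so it can never be confused with a genuine
    # 0.0 (the hline/vline case discussed above).
    pad(v) = vcat(v, fill(NaN, n - length(v)))
    return pad(original_params), pad(noisy_params)
end

# Only entries present in both vectors take part in the arithmetic comparison:
orig, noisy = pad_with_nan([1.0, 2.0, 3.0], [1.1])
valid = .!(isnan.(orig) .| isnan.(noisy))
differences = orig[valid] .- noisy[valid]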

println("No valid differences calculated.")
return NaN
end

mean_diff = mean(valid_diffs)
inf_ratio = inf_count / length(original_positions) # Calculate ratio based on original count
warisa-r (Owner):
What if the length of the original positions is 0? This isn't a priority, but it is something to look out for.

dzhingarska (Collaborator, Author):
It practically never happens, but I can still handle this case if necessary; I don't think it is needed for our simulations, though.
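
A tiny sketch of the guard being discussed (hypothetical helper name; the PR computes inf_ratio inline):

function safe_inf_ratio(inf_count::Integer, original_positions)
    # Nothing to compare against: return NaN instead of dividing by zero.
    isempty(original_positions) && return NaN
    return inf_count / length(original_positions)
end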

"""
compare_shock_fits_over_time(original_fits, noisy_fits)
# Ensure both parameter vectors are not empty
if isempty(original_params) || isempty(noisy_params)
warisa-r (Owner):
The fit result will never be empty if there is a cluster of points. If no shock has formed in a frame, e.g. because the supersonic inflow in some simulation instances is only introduced later, does this mean we can't compare fits over the entire simulation run?

What would be the reason to throw an error here?

dzhingarska (Collaborator, Author):
The fit result is not going to be empty and the condition is never met; I put it there mostly for the overall look of the function. And yes, we cannot compare flow without shocks, and we also do not want to; that is not the purpose of this function either.

warisa-r (Owner) commented on Sep 27, 2024:

@dzhingarska Could you please resolve the things you already fixed? I also think this task will take at least a while until we can get a sensible result not only with the sod 2d right case, but with shock orb and obstacles too (if I were to do it; maybe it takes you less time). Or at least shock orb, since sod 2d right is simply a line. (For the obstacle case the detection works as well as in any other case: no shocks are detected on NaN cells, so it's like working with normal sim scenarios, you just have more diverse fits.)

In my humble opinion, it isn't easy to quantify false detection correctly and logically once we take a close look at all the false-detection scenarios, be it missing points or too many points. Initially, I thought this would take around a month, and no shorter than that. It has been a month, and maybe it's better if we view what we want to do more realistically. I don't know if this is something we can report on yet.

We could move on to the report with what we have, so we can finally be done with the software; we have been doing this since the beginning of June, which is quite a long time already.

It is not a big deal, but if we can get this working, that'd be cool.

Let me know what you think and plan to do :)

@warisa-r closed this on Sep 27, 2024
@warisa-r reopened this on Sep 27, 2024
@@ -10,4 +10,4 @@ dbscan_algo = DBSCANAlgo(0.25, 3, 10)
detection = detect(flow_data, point_detect_algo, dbscan_algo, fitting_algo)

plot_shock_fits_over_time(flow_data, detection, true)
#create_heatmap_evo_with_shock(flow_data, detection, :density_field, true, false)
#create_heatmap_evo_with_shock(flow_data, detection, :density_field, true, false)
warisa-r (Owner):
Can we not commit to the examples in src? Since we are not doing anything with this example right?

dzhingarska (Collaborator, Author):
Can I just copy it from main and commit, since it would be complicated reverting commits just because of this?

warisa-r (Owner):
I don't think this matters much so I'm probably going to merge despite this minor difference

println("Padded Noisy Parameters: ", noisy_params)

# Calculate the differences while ignoring NaNs
differences = original_params .- noisy_params
warisa-r (Owner):
I don't quite understand what padding is doing here in terms of dimensions. You mentioned missing parameters, and I don't really understand what that means.

  1. You are talking about cases where there isn't any fit for a certain cluster or, even worse, frame. In that case, you are right that the parameter is missing: since there is no fit, there can be no parameter. You won't be able to perform the arithmetic operation anyway whenever the two runs detect different models at the same spot in your params arrays.

  2. You are talking about different models with different numbers of parameters, like

  • Line: 2 parameters, e.g. 3x - 1
  • Parabola: 3 parameters, e.g. 4x^2 - 0.1x - 0.5
    These two are very close to each other between 0 and 1, and it is very likely that our algorithm fits these two different models for the scenarios with and without noise, respectively. You can plot this. So intuitively, the difference between these two models in this context should be very small, right?

But with your padding, the norm of original_params .- noisy_params is going to be quite big. (Consider that we are dealing with small bounds, which we do for every simulation case we have; we are talking (-1,1) x (-1,1) or (-2,1) x (1,1).)

Even more severe: say you have two models with exactly the same parameters,
1, 1, 0.5,
but one is the circle model and the other the parabola model. Then they are two completely different shapes, but your algorithm suggests they are the same.

I don't think this is a good way to make the arithmetic operations work. Are there other things we can compare if the models don't match? For example, how each model deviates from the original cluster of points (the noiseless points) it was trying to fit.

warisa-r (Owner):
I would advocate for an approach that is not solely based on making arithmetic operations with the broadcasting operator work, since our curves are parametrized starkly differently. Comparing the parameters can work if they come from the exact same model, the exact same time frame, and the exact same cluster (but looking at the way you store the parameters, it is hard to tell whether parameters at the same position of the two arrays (noisy and original) come from the same cluster and frame, since push! is the storage mechanism).

dzhingarska (Collaborator, Author):
You have a point there; comparing the parameters of the fit actually does not make as much sense as I was expecting, and it seems another approach is needed. Instead of focusing on the parameters themselves, we can compare how well the models fit the data points (original_shock_positions_over_time and noisy_shock_positions_over_time). The new solution compares the model residuals (the difference between the original data points and the fitted data) rather than the raw parameters. Check it out and suggest improvements, or feel free to adjust it as well.
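
A self-contained sketch of the residual idea (the model evaluators below are stand-ins; the PR's calculate_residuals works on its own fit objects):

using Statistics: mean

# Residual = difference between the observed points and the fitted curve.
residuals(model, params, xs, ys) = ys .- model.(xs, Ref(params))

line(x, p)     = p[1] * x + p[2]
parabola(x, p) = p[1] * x^2 + p[2] * x + p[3]

xs = collect(range(0.0, 1.0; length = 20))
ys = 3.0 .* xs .- 1.0                     # points generated by the line 3x - 1

r_line     = residuals(line,     [3.0, -1.0],       xs, ys)
r_parabola = residuals(parabola, [4.0, -0.1, -0.5], xs, ys)

# Both residual vectors have the same length as the data, so models with
# different parameter counts can still be compared, e.g. via their RMSE.
rmse(r) = sqrt(mean(r .^ 2))
rmse_line, rmse_parabola = rmse(r_line), rmse(r_parabola)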

diff = (orig_pos - closest_noisy_pos)^2
push!(diffs, diff)
# Remove the matched noisy position to avoid double matching
deleteat!(noisy, closest_idx)
warisa-r (Owner):
This algorithm looks fine to me. My problem with this kind of penalty (even though it is fair) is that the difference is now bloated by the large penalty, which renders the difference between the closest noisy positions and the original positions trivial (the penalty dominates the floating-point values and pushes the smaller contributions out of significance).

Could we not add large penalties and only count missing shock positions? That would lessen the meaning of the diffs calculated here, but I think there is a limit to what we can get out of arithmetic operations.

dzhingarska (Collaborator, Author), Sep 28, 2024:
Done, no more large penalties, just missing_count. If you still want some refinements, let me know, or feel free to edit as you like as well.

warisa-r (Owner), Sep 29, 2024:
This looks good! One more thing: I think it might be a good idea if we had separate counts of how many shocks our algorithm falsely detected and how many it failed to detect.
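
A sketch of how the matching loop could report both counts separately (hypothetical helper; the merged code keeps a single missing_count):

function match_positions(original::Vector{Float64}, noisy::Vector{Float64})
    remaining = copy(noisy)                # don't mutate the caller's array
    diffs = Float64[]
    failed_to_detect = 0                   # original positions left without a partner
    for orig_pos in original
        if isempty(remaining)
            failed_to_detect += 1
            continue
        end
        closest_idx = argmin(abs.(remaining .- orig_pos))
        push!(diffs, (orig_pos - remaining[closest_idx])^2)
        deleteat!(remaining, closest_idx)  # avoid double matching
    end
    falsely_detected = length(remaining)   # leftover noisy positions with no partner
    return diffs, failed_to_detect, falsely_detected
end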

# Calculate the RMSE (ignoring NaN values)
rmse = sqrt(mean(filtered_differences .^ 2))

return rmse, missing_count
warisa-r (Owner):
Please try this with funky_square.celltape with a gradient threshold of 0.05. There are all sorts of clusters, and fits should show up. If this algorithm is working, then it should run fine with this case.


y = range(bounds[2][1], bounds[2][2], length=ncells[2])

for t in eachindex(nsteps) # Use eachindex for time steps
if isempty(original_shock_positions_over_time[t])
warisa-r (Owner):
A fit should be compared to its respective cluster, not to all the shock points in one time frame. In one frame we could be trying to fit 4 curves, and comparing all the points to only one of those curves will produce a falsely bloated residual (see the last frames of sod shock orb, for example).
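
A sketch of the per-cluster comparison being suggested, under an assumed container layout (fits_per_frame[t][i] is the fit for cluster i in frame t, clusters_per_frame[t][i] the corresponding points, and fit_residuals a stand-in for calculate_residuals):

function per_cluster_rmse(fits_per_frame, clusters_per_frame, fit_residuals)
    rmse_per_frame = Vector{Vector{Float64}}()
    for t in eachindex(clusters_per_frame)
        frame_rmse = Float64[]
        for (i, cluster) in enumerate(clusters_per_frame[t])
            # Score each fit only against the cluster it was fitted to, never
            # against all shock points of the frame.
            r = fit_residuals(fits_per_frame[t][i], cluster)
            push!(frame_rmse, sqrt(sum(abs2, r) / length(r)))
        end
        push!(rmse_per_frame, frame_rmse)
    end
    return rmse_per_frame
end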

warisa-r (Owner) left a comment:
I think things should be working quite smoothly now if you test it with shock orb. But if it doesn't, it's totally up to you whether you want to test it. Apart from some comments here about the comparison logic (which might be wrong on my side), I think everything should be functioning :D


# Calculate residuals for original and noisy fits
original_residuals = calculate_residuals(original_fit[t][i], original_data_points)
noisy_residuals = calculate_residuals(noisy_fit[t][i], noisy_data_points)
warisa-r (Owner):
Correct me if I'm wrong here, but isn't this line supposed to be

noisy_residuals = calculate_residuals(noisy_fit[t][i], original_data_points) ?

We want to see how much the curve deviates from its original points (the original data), i.e. how sensitive the curve is compared to the original findings. If we compared the noisy fit to the noisy data points, it is very likely we would get a new, far-off curve that fits the noise better, as the Levenberg-Marquardt algorithm naturally does.

What we rather want to see is how clustering, or different kernels, or even the concept of curves itself, can help preserve the original shock curves.

dzhingarska (Collaborator, Author):
Oh, I totally overlooked that; I think you are right. Thanks.

end

# Ensure residuals have the same length
if length(original_residuals) != length(noisy_residuals)
warisa-r (Owner):
This is very cautious of you :) However, I haven't encountered this before and am not sure where it can come from, so I'm also not sure whether this is a case we can hit if we are comparing the same number of data points. But in case the noisy shock points and original shock points don't match, maybe we can take the difference of the second norms of both residual vectors,

namely

norm(original_residuals, 2) - norm(noisy_residuals, 2)

This way we are performing a scalar operation, but as mentioned before, I am not so sure about this as I haven't experienced this problem before.

dzhingarska (Collaborator, Author):
Sounds reasonable, I will implement this difference.
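
A minimal sketch of that fallback (hypothetical helper name):

using LinearAlgebra: norm
using Statistics: mean

function residual_score(original_residuals, noisy_residuals)
    if length(original_residuals) == length(noisy_residuals)
        d = original_residuals .- noisy_residuals
        return sqrt(mean(d .^ 2))                            # elementwise RMSE
    else
        # Length mismatch: compare the 2-norms instead, a scalar operation.
        return norm(original_residuals, 2) - norm(noisy_residuals, 2)
    end
end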

residual_length_mismatch_count = 0 # Number of times residuals had different lengths

# Cluster the shock points for original and noisy data
original_clusters = ShockwaveDetection.cluster_shock_points(dbscan_algo, original_shock_positions_over_time, flow_data)
warisa-r (Owner):
Not necessarily a priority but shock clusters can already be accessed from ShockDetectionResult as an attribute called

shock_clusters_over_time
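
A short sketch of that reuse, assuming the detect call from examples/sod_shock_tube_1d.jl earlier in this PR and reusing its variables:

using ShockwaveDetection

detection = detect(flow_data, point_detect_algo, dbscan_algo, fitting_algo)

# The clusters are already stored on the ShockDetectionResult, so there is no
# need to call cluster_shock_points again:
original_clusters = detection.shock_clusters_over_time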

dzhingarska (Collaborator, Author):
I think we can upload to the registry now; if there is anything more, it won't be relevant for the user.

@warisa-r merged commit 8092273 into main on Oct 1, 2024
4 checks passed
Successfully merging this pull request may close these issues: test with noise
2 participants