Noise analysis #22
Conversation
# Returns
- A new `FlowData` struct with noise added to the relevant fields.
"""
function apply_noise_to_flow(flow_data::FlowData, noise_data::NoiseData)
Could we try adding noise to only some properties in some cases and see if things go differently?
What do you mean exactly?
For example: try adding noise to pressure and density and leave the velocity field alone.
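A minimal sketch of what selective noise injection could look like. The `NoiseData` layout, `add_noise`, and `apply_noise_selectively` names here are assumptions for illustration, not the PR's actual API:

```julia
# Hypothetical noise container; the real NoiseData may differ.
struct NoiseData
    intensity::Float64
end

# Perturb a field with Gaussian noise scaled by the intensity.
add_noise(field, noise::NoiseData) = field .+ noise.intensity .* randn(size(field)...)

# Add noise to density and pressure only; return velocity untouched.
function apply_noise_selectively(density, pressure, velocity, noise::NoiseData)
    return add_noise(density, noise), add_noise(pressure, noise), velocity
end
```

Returning the velocity array unchanged (same object, not a copy) makes the "leave the velocity field alone" case easy to verify.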
examples/sod_shock_tube_1d.jl (outdated)
@@ -4,6 +4,9 @@ using ShockwaveProperties
using Unitful
using ShockwaveDetection

# Create a NoiseData instance
Can you leave this out? :)
ok
I left it there because you asked not to change the file last time, but I can remove the noise data instance
I meant, please don’t commit to this file since it’s not relevant to this PR? Sorry for the confusion.
analysis/noise.jl (outdated)
Compares shock fits over time between original and noisy data.
# Pad the shorter array with zeroes to match lengths
if length(original_params) > length(noisy_params)
    padding = fill(0.0, length(original_params) - length(noisy_params))
The issue is that this would only work if all of our models had exactly one important parameter. That is not the case for most of our essential models: the ones fitted most often (parabola, circle, and line) all have 2-3 parameters.
And also consider the case where the model really does have only one parameter, like hline and vline (check out the sod tube 2d case), and that parameter is 0.0. Then the padded value is a spot-on guess and gives no indication whatsoever that the algorithm failed to fit something, even though it did. The missing_count does vaguely say something, but the mean you get from this part of the code is skewed.
I will fix the issue with hline and vline by padding with a special value like NaN. This now gives a clear distinction between models that fit poorly (high RMSE) and models whose parameters were simply missing. Additionally, I now extract all the parameters, not just the first one; that was my mistake.
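A sketch of the NaN-padding idea described above (the helper name `pad_with_nan` is an assumption, not the PR's actual function). Unlike 0.0, a NaN slot can never be mistaken for a legitimate hline/vline parameter value:

```julia
# Pad the shorter parameter vector with NaN so a padded slot is
# distinguishable from a real parameter that happens to be 0.0.
function pad_with_nan(a::Vector{Float64}, b::Vector{Float64})
    n = max(length(a), length(b))
    pad(v) = vcat(v, fill(NaN, n - length(v)))
    return pad(a), pad(b)
end
```

Downstream code can then filter with `isnan` before computing means, so padded slots never contaminate the statistics.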
analysis/noise.jl (outdated)
println("No valid differences calculated.")
return NaN
end

mean_diff = mean(valid_diffs)
inf_ratio = inf_count / length(original_positions) # Calculate ratio based on original count
What if the length of original_positions is 0? This isn't a priority, but it is something to look out for.
It practically never happens, but I can still handle this case if necessary. I don't think it is needed for our simulation, though.
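If the guard is added, one defensive option is to make the degenerate case explicit rather than letting an `Inf`/`NaN` quotient propagate silently. A sketch (the function name `safe_inf_ratio` is hypothetical):

```julia
# Return NaN explicitly when there are no original positions, so
# downstream statistics can detect the degenerate case.
function safe_inf_ratio(inf_count::Integer, n_original::Integer)
    n_original == 0 && return NaN
    return inf_count / n_original
end
```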
analysis/noise.jl (outdated)
"""
compare_shock_fits_over_time(original_fits, noisy_fits)
# Ensure both parameter vectors are not empty
if isempty(original_params) || isempty(noisy_params)
A fit result will never be empty if there is a cluster of points. But if no shock has formed in a frame, e.g. because the supersonic inflow in some simulation instances is only introduced later, does this mean we can't compare fits over the entire simulation run?
What would be the reason to throw an error here?
The fit result is not going to be empty, so the condition is never met; I added it rather for the overall completeness of the function. And yes, we cannot compare flow without shocks, and we also do not want to; that is not the purpose of this function anyway.
@dzhingarska Could you please resolve the things you already fixed? I think this task will take at least a while before we get a sensible result not only for the sod 2d right case, but for shock orb and the obstacle cases too (if I were to do it; maybe it takes you less time). Or at least shock orb, since sod 2d right is simply a line. (For the obstacle case the detection works as well as in any other case: no shocks are detected on NaN cells, so it is like working with normal simulation scenarios, just with more diverse fits.)

In my humble opinion, it isn't easy to quantify false detection correctly and logically once we take a close look at all the false-detection scenarios, be it missing points or too many points. Initially I thought this would take around a month, no shorter than that. It has been a month, and maybe it's better if we view what we want to do more realistically. I don't know if this is something we can report on yet. We could move on to the report with what we have, so we can finally be done with the software; we have been doing this since the beginning of June, which is quite a long time already. It is not a big deal, but if we can get this working, that'd be cool. Let me know what you think and plan to do :)
@@ -10,4 +10,4 @@ dbscan_algo = DBSCANAlgo(0.25, 3, 10)
detection = detect(flow_data, point_detect_algo, dbscan_algo, fitting_algo)

plot_shock_fits_over_time(flow_data, detection, true)
#create_heatmap_evo_with_shock(flow_data, detection, :density_field, true, false)
Can we avoid committing to the examples in src, since we are not doing anything with this example?
Can I just copy it from main and commit, since it would be complicated reverting commits just because of this?
I don't think this matters much so I'm probably going to merge despite this minor difference
analysis/noise.jl (outdated)
println("Padded Noisy Parameters: ", noisy_params)

# Calculate the differences while ignoring NaNs
differences = original_params .- noisy_params
I don't quite understand what the padding is doing here in terms of dimensions. You mentioned missing parameters, which could mean two things:
- Cases where there isn't any fit for a certain cluster or, even worse, a whole frame. In that case you are right that the parameter is missing: since there is no fit, there can be no parameter. But then you won't be able to perform the arithmetic operation anyway whenever the two runs detect different models at the same position in your params arrays.
- Different models with different numbers of parameters, like:
  - a line with 2 parameters, e.g. 3x - 1
  - a parabola with 3 parameters, e.g. 4x^2 - 0.1x - 0.5

These two curves are very close to each other between 0 and 1, and it is very likely that our algorithm fits these two different models for the scenarios with and without noise, respectively. You can plot this. So intuitively, the difference between these two models in this context should be very small, right? But with your padding, the norm of original_params .- noisy_params is going to be quite big. (Consider that we are dealing with small bounds, which we do for every simulation case we have: we are talking (-1,1) x (-1,1) or (-2,1) x (1,1).)

Even more severe: say you have two models with exactly the same parameters 1, 1, 0.5, but one is a circle model and the other a parabola model. Then they are two completely different shapes, but your algorithm suggests they are the same.

I don't think this is a good way to make the arithmetic operations work. Are there other things we can compare when the models don't match? Like how each model deviates from the original cluster of points (noiseless points) it was trying to fit.
I would advocate for an approach that is not solely based on making the broadcasting arithmetic work, since our curves are parametrized starkly differently. Comparing the parameters can work if they come from the exact same model, the exact same time frame, and the exact same cluster; but looking at the way you store the parameters, it is hard to tell whether parameters at the same position in the two arrays (noisy and original) come from the same cluster and frame, since push! is the storage mechanism.
You have a point there. Actually, comparing the parameters of the fit does not make as much sense as I was expecting, and it seems another approach is needed. Instead of focusing on the parameters themselves, we can compare how well the models fit the data points (original_shock_positions_over_time and noisy_shock_positions_over_time). The new solution compares the model residuals (the differences between the data points and the fitted curve) rather than the raw parameters. Check it out and suggest improvements, or feel free to adjust it as well.
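A minimal sketch of the residual idea: score a fit by how far each data point lies from the fitted curve, which is model-agnostic and sidesteps the parameter-mismatch problem. Representing a fit as a callable and points as `(x, y)` tuples is an assumption for illustration, not the PR's actual data layout:

```julia
# Residual of each data point relative to a fitted curve.
calculate_residuals(model, points) = [abs(y - model(x)) for (x, y) in points]

line(x) = 2x + 1                          # a hypothetical fitted line
points = [(0.0, 1.0), (1.0, 3.5)]
res = calculate_residuals(line, points)   # per-point residuals
```

Because the output is one residual per data point regardless of whether the model was a line, parabola, or circle, residuals from different model types remain directly comparable.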
diff = (orig_pos - closest_noisy_pos)^2
push!(diffs, diff)
# Remove the matched noisy position to avoid double matching
deleteat!(noisy, closest_idx)
This algorithm looks fine to me. My problem with this kind of penalty (even though it is fair) is that the difference is now bloated by a large penalty, which renders the differences between the close noisy positions and the original positions trivial (occupying the floating-point range and pushing the smaller values out of significance).
Could we not add large penalties and only count missing shock positions? That would lessen the meaning of the diffs calculated here, but I think there is a limit to what we can get out of arithmetic operations.
Done, no more large penalties, just missing_count. If you still want some refinements, let me know, or feel free to edit as you like.
This looks good! One more thing is that I think it might be a good idea if we have a separate count of how many our algorithm falsely detected and failed to detect
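The two counts suggested above can be derived from the same greedy nearest-neighbour matching already used in the PR: unmatched original positions are "failed to detect", leftover noisy positions are "falsely detected". A sketch, where the function name and the matching tolerance are assumptions:

```julia
# Greedy matching with a tolerance; returns (missed, false_detections).
function detection_counts(original, noisy; tol = 0.1)
    remaining = copy(noisy)          # don't mutate the caller's vector
    missed = 0
    for p in original
        if isempty(remaining)
            missed += 1              # nothing left to match against
            continue
        end
        d, i = findmin(abs.(remaining .- p))
        # Match within tolerance, otherwise count as a miss.
        d <= tol ? deleteat!(remaining, i) : (missed += 1)
    end
    return missed, length(remaining) # leftovers are false detections
end
```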
analysis/noise.jl (outdated)
# Calculate the RMSE (ignoring NaN values)
rmse = sqrt(mean(filtered_differences .^ 2))

return rmse, missing_count
Please try this with funky_square.celltape with a gradient threshold of 0.05. There are all sorts of clusters, and fits should show up. If this algorithm is working, then it should run fine with this case.
analysis/noise.jl (outdated)
y = range(bounds[2][1], bounds[2][2], length=ncells[2])

for t in eachindex(nsteps) # Use eachindex for time steps
if isempty(original_shock_positions_over_time[t])
A fit should be compared to its respective cluster, not to all the shock points in one time frame. In one time frame we could be trying to fit, say, 4 curves, and comparing all the points to only one leftover curve will produce a falsely bloated residual (see the last frames of sod shock orb, for example).
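A sketch of what per-cluster scoring could look like. The assumed shapes, that `fits_t[i]` is a callable model and `clusters_t[i]` is that fit's own points, are illustrative, not the PR's actual structures:

```julia
# Residuals of one model against one cluster's points.
cluster_residuals(model, points) = [abs(y - model(x)) for (x, y) in points]

# Score each fit i in a frame only against its OWN cluster's points.
per_cluster(fits_t, clusters_t) =
    [cluster_residuals(fits_t[i], clusters_t[i]) for i in eachindex(fits_t)]

fits_t = [x -> x, x -> -x]                 # two fits in one frame
clusters_t = [[(1.0, 1.5)], [(2.0, -2.0)]] # each fit's own points
res = per_cluster(fits_t, clusters_t)
```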
I think things should work quite smoothly now if you test it with shock orb; but it's totally up to you whether you want to test it. Apart from some comments here about the comparison logic (which might be wrong on my side), I think everything should be functioning :D
analysis/noise.jl (outdated)

# Calculate residuals for original and noisy fits
original_residuals = calculate_residuals(original_fit[t][i], original_data_points)
noisy_residuals = calculate_residuals(noisy_fit[t][i], noisy_data_points)
Correct me if I'm wrong here, but isn't this line supposed to be
noisy_residuals = calculate_residuals(noisy_fit[t][i], original_data_points)?
We want to see how much the curve deviated from its original points (the original data), i.e. how sensitive the curve is compared to the original findings. If we compared the noisy fit to the noisy data points, it is very likely we would get a new far-off curve that fits the noise better, as the Levenberg-Marquardt algorithm naturally does.
What we rather want to see is how clustering, different kernels, or even the concept of curves themselves can help preserve the original shock curves.
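The suggested swap, sketched with the same hypothetical callable-model representation as above (the names `noisy_fit_curve` and `calculate_residuals` are illustrative): score the curve fitted to noisy data against the original, noiseless points, so the residual measures how far noise pulled the curve away.

```julia
calculate_residuals(model, points) = [abs(y - model(x)) for (x, y) in points]

noisy_fit_curve(x) = 2x + 1.25              # curve fitted to noisy data
original_points = [(0.0, 1.0), (1.0, 3.0)]  # the noiseless shock points

# Residuals against the ORIGINAL points reflect sensitivity to noise;
# scoring against the noisy points would instead reward chasing the noise.
res = calculate_residuals(noisy_fit_curve, original_points)
```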
Oh, I had totally overlooked that; I think you are right. Thanks.
end

# Ensure residuals have the same length
if length(original_residuals) != length(noisy_residuals)
This is very cautious of you :) However, I haven't encountered this before and am not sure where it could come from, so I'm also not sure whether this is a case we can hit if we compare the same number of data points. But in the case that the noisy and original shock points don't match, maybe we can take the difference of the 2-norms of the two residual vectors, namely
norm(original_residuals, 2) - norm(noisy_residuals, 2)
This way we are performing a scalar operation; but as mentioned, I am not so sure about this, as I haven't experienced this problem before.
Sounds reasonable; I will implement this difference.
analysis/noise.jl (outdated)
residual_length_mismatch_count = 0 # Number of times residuals had different lengths

# Cluster the shock points for original and noisy data
original_clusters = ShockwaveDetection.cluster_shock_points(dbscan_algo, original_shock_positions_over_time, flow_data)
Not necessarily a priority, but the shock clusters can already be accessed from ShockDetectionResult via the attribute shock_clusters_over_time.

I think we can upload to the registry now; anything more won't be relevant for the user.
Comparing the shock positions over time of the flow data and the flow data with additional noise.