How to understand why LUT resources are so extreme in HLS estimations? #1106
There are several techniques to reduce the area footprint of your model. First, quantise the model if you can. At the very least, do post-training quantisation: most applications don't require 32-bit fixed point, and 16 bits usually has no impact on accuracy. Ideally, though, do quantisation-aware training as shown here: https://github.com/fastmachinelearning/hls4ml-tutorial/blob/main/part4_quantization.ipynb Secondly, consider using io_stream instead of io_parallel for CNNs; it scales better for computationally heavier models. See the tutorial on CNNs here: https://github.com/fastmachinelearning/hls4ml-tutorial/blob/main/part6_cnns.ipynb
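As a concrete sketch of the post-training route (the precision string, part number, and output directory below are placeholders for illustration, not recommendations for this particular model):

```python
import hls4ml

# Generate a config from the trained Keras model; 'name' granularity
# exposes per-layer settings in addition to the model-wide defaults.
config = hls4ml.utils.config_from_keras_model(model, granularity='name')

# Post-training quantisation: drop the default precision from 32-bit
# fixed point to something narrower (re-check accuracy after this change).
config['Model']['Precision'] = 'ap_fixed<16,6>'

hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    io_type='io_stream',          # streaming I/O scales better for CNNs
    output_dir='my_hls_prj',      # placeholder output directory
    part='xczu9eg-ffvb1156-2-e',  # ZCU102 device; check the exact part string for your board
)
hls_model.compile()
```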
Hello,
I am trying to implement the following network using the hls4ml flow with Vivado HLS 2020.1 and TensorFlow:
I am not sure whether these numbers should be a surprise or not, so I have also been searching for estimates of the hardware utilization needed to synthesize such a network on a ZCU102 at 200 MHz.
I am applying 32-bit fixed point for all layers, weights, results, and biases, and the Resource strategy for every layer as well as for the network as a whole (by the way, I'm not sure what the difference is between specifying the model strategy as Resource and manually setting it to Resource for each layer; I am doing both to save as many resources as I can). Moreover, I am setting a high ReuseFactor equal to 4096, which the HLS tool modifies to different values based on the supported reuse factors of each operation/layer (see the configuration sketch after the log below).

The HLS flow takes a long time, ~4 hours, and it gets stuck at the following warning for the softmax activation:
INFO: [HLS 200-42] -- Implementing module 'init_exp_table_ap_fixed_32_16_4_0_0_softmax_config10_s'
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [SCHED 204-11] Starting scheduling ...
INFO: [SCHED 204-61] Pipelining function 'init_exp_table<ap_fixed<32, 16, 4, 0, 0>, softmax_config10>'.
WARNING: [SCHED 204-69] Unable to schedule 'store' operation ('table_out_1_V_addr_18_write_ln160', firmware/nnet_utils/nnet_activation.h:160) of variable 'select_ln340_1090', firmware/nnet_utils/nnet_activation.h:159 on array 'table_out_1_V' due to limited memory ports. Please consider using a memory core with more ports or partitioning the array 'table_out_1_V'.
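For reference, a minimal sketch of how this strategy/reuse configuration might look in the hls4ml Python API (key names follow the hls4ml tutorials; as far as I understand, a per-layer entry simply overrides the model-level default, so setting both to the same value is redundant rather than additive):

```python
import hls4ml

config = hls4ml.utils.config_from_keras_model(model, granularity='name')

# Model-level defaults, applied to every layer without an override.
config['Model']['Strategy'] = 'Resource'
config['Model']['ReuseFactor'] = 4096

# Per-layer overrides; with granularity='name' each layer has its own entry,
# so repeating the same values here does not save anything extra.
for layer in config['LayerName']:
    config['LayerName'][layer]['Strategy'] = 'Resource'
    config['LayerName'][layer]['ReuseFactor'] = 4096
```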
I have used a complete array-partitioning pragma on the mentioned array in the HLS code, which resulted in very high resource estimates for this function alone (see the attached results). It looks to me like this array partitioning makes the softmax function consume so many resources that the design would only fit on a much larger FPGA. I have also tried cyclic partitioning for the same array, but it also did not pass the HLS scheduling phase. I would like to hear your thoughts on my progress, what else I can try, and whether this is normal for this type of network.
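One thing I could try at the hls4ml level, instead of partitioning the table array in the generated HLS by hand, is to shrink the softmax layer's precision through the config, since the exp/inv lookup tables built by init_exp_table inherit it. This is only a sketch: the layer key 'softmax' and the precisions below are assumptions, and the table-specific keys depend on the hls4ml version.

```python
import hls4ml

config = hls4ml.utils.config_from_keras_model(model, granularity='name')

# Narrower precision for the softmax layer shrinks the lookup tables
# that show up in the scheduling warning above. 'softmax' is assumed
# to be the Keras layer name; check the printed config for the real key.
config['LayerName']['softmax']['Precision'] = 'ap_fixed<18,8>'

# Some hls4ml versions also expose the table types and size directly;
# these keys are assumptions, verify them against your installed version.
# config['LayerName']['softmax']['exp_table_t'] = 'ap_fixed<18,8>'
# config['LayerName']['softmax']['inv_table_t'] = 'ap_fixed<18,8>'
# config['LayerName']['softmax']['table_size'] = 1024
```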
Are there more straightforward ways, either directly from hls4ml or in HLS, to save a large number of LUTs for such a design? I think quantization may reduce the resources, but the final layer will probably still show a large jump in LUT usage compared to the other layers in the design.
I am attaching the current HLS code for Softmax and its estimated area and timing results.
softmax_hls_and_results.zip