
Work hero doc #8

Merged
merged 55 commits into from
Oct 7, 2024
Conversation

mgheorghe
Owner

No description provided.

https://www.keysight.com/us/en/product/944-1188/uhd400t.html
https://www.keysight.com/us/en/products/network-test/network-test-hardware/xgs12-chassis-platform.html

The amount of hardware needed varies with device performance. The current DASH requirement specifies 24M CPS as the minimum, but each vendor wants to showcase how much more they can do; based on that target, plus 10%-20% headroom, we can calculate the amount of hardware needed.
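The sizing calculation can be sketched as follows. This is a hypothetical illustration: the per-generator CPS figure is an assumption, not a real spec; take the actual number from the vendor data sheet.

```python
# Hypothetical sizing sketch -- CPS_PER_GENERATOR is an assumed figure for
# illustration; use the real capability from the vendor data sheet.
import math

TARGET_CPS = 24_000_000          # minimum CPS from the current DASH requirement
HEADROOM = 0.20                  # 10%-20% headroom; 20% assumed here
CPS_PER_GENERATOR = 10_000_000   # assumed capability of one traffic generator

required_cps = round(TARGET_CPS * (1 + HEADROOM))
generators = math.ceil(required_cps / CPS_PER_GENERATOR)

print(f"target with headroom: {required_cps:,} CPS")   # 28,800,000 CPS
print(f"traffic generators needed: {generators}")      # 3
```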
@chrispsommers chrispsommers Aug 23, 2024


Do you provide the BOM for what is needed? Why not spell it out here along with the calculations?


#### Test tools packet generator (Keysight version)

One solution for testing the smart switch is to use Keysight (Ixia) packet generators: for TCP traffic we use CloudStorm & IxLoad, for UDP traffic we use Novus & IxNetwork, and it is all mixed together by the UHD Connect.


Add a sentence explaining what UHD Connect is. The UHD400T data sheet is for a packet blaster. Do we have a data sheet for UHD-C?

- DASH device port speeds are 100G, 200G, or 400G, PAM4 or NRZ; UHD400C device port speeds are also 100G, 200G, or 400G, PAM4 or NRZ, so the two should interface with no issues.
- IEEE default autoneg is preferable, but at a minimum, if AN is disabled, please ensure FEC is enabled. With FEC disabled we observed a few packet drops on the DACs, and that can create a lot of hassle hunting down a lost packet that has nothing to do with DASH performance.

#### Testbed examples


It would be nice to have some bullets summarizing the key features of each testbed. How does someone determine which one is appropriate for their needs?

Owner Author


added some bullets but it may need more work


##### validate the hardware and software.

It ensures we can program the DPU via private API, SAI, or DASH, and that we can pass 1 packet end to end from the traffic generator through the device under test and back.


It might be helpful to expand a bit on these various APIs and under what conditions we'd use one or the other.

Owner Author


I wiped the sentence and added a whole new text/paragraph about loading the DASH configuration.


##### can also provide best case scenario performance numbers

It's a maybe because 1 packet replicated millions of times may not necessarily work best for all hardware implementations.


This is very ambiguous. Do you mean, we could take the 1P test case but somehow send more packets to simulate worst-case?

Owner Author


rephrased "Not always, but occasionally, this test also yields the best case scenario values because the best case scenario is frequently reached at the lowest scale."


### In between scale

If the Hero test scale numbers cannot be met, we can add another checkpoint to gather additional data before the final implementation is ready.


Do you have such a configuration?

Owner Author


no such configuration can be shared; it is just an intermediary phase in the development cycle, explained better

Before the final solution is finished, we can add another checkpoint to collect further data if the Hero test scale numbers are not fulfilled.
This checkpoint will use custom scale values agreed upfront by all parties and constitutes an intermediate point in DASH development.
It usually becomes irrelevant as soon as the Hero test scale is achieved.


### Best case scenario

If any of the previous tests have not shown the best case scenario, we can run a test with the best case scenario in mind.


What does this mean?

Owner Author


See the results section; it explains the best case scenario.
What I wanted to say here is that the 1ip test, baby hero test, or hero test may not show the best case scenario, and in that case we can add one more datapoint showcasing the best case scenario.

i rephrased the sentence


### Worst case scenario

If any of the previous tests have not shown the worst case scenario, we add this test as well (without exceeding hero test scale).


I don't know what this means.

Owner Author


rephrased "If we can find a scenario where we obtain lower performance numbers than the numbers previously obtained during the earlier tests (1ip, baby hero, hero test ...), this will be added as a new data point to the results."


The latency value is most accurate when we have the highest PPS, the smallest packet, and zero packet loss; it is measured using IxNetwork and a Novus card.

The aim for the DPU is 2us; for the smart switch we have to consider that the packet also travels twice through the NPU.


Latency through a switch is very dependent upon the queuing and congestion. Are you trying to find minimum latency through the switch? Do you have a way to measure just the switch latency w/o DPUs?

Owner Author


Yes, we can find the NPU/switch latency; it is considered a known variable. Since the NPU is usually a 32x400G ASIC and we use only 8x400G for the test, the NPU is usually not a bottleneck or a point of issues.

rephrased

"When testing the smart switch we have to run a test to get the switch latency without running traffic through the DPU, and then get the total system latency, with the understanding that each packet travels once through the NPU to reach the DPU, then it travels through the DPU, and it will travel through the NPU once more after it leaves the DPU.

smart switch latency = 2 x NPU latency + DPU latency"
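The formula above can be rearranged to extract the DPU contribution once the NPU-only latency is measured. A minimal sketch, where the microsecond values are illustrative assumptions rather than real measurements:

```python
def dpu_latency_us(system_latency_us: float, npu_latency_us: float) -> float:
    """Solve smart switch latency = 2 x NPU latency + DPU latency for the DPU."""
    return system_latency_us - 2 * npu_latency_us

# Assumed measurements: 3.2us end to end through the smart switch,
# 0.6us through the NPU alone (traffic not steered through the DPU).
print(round(dpu_latency_us(3.2, 0.6), 3))  # 2.0 -- meets the 2us DPU aim
```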


The clock PPM may need to be adjusted between the test gear and the device under test to hit that perfect 100G, 200G, or 400G number.

Consider looking at the UHD400C stats: the IxNetwork/IxLoad stats will show less because the VXLAN header is added later by the UHD, and we are interested in the packet size as it enters the DPU multiplied by PPS to get the throughput.
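As a sanity check on those stats, the DPU-side throughput can be computed from the inner frame size plus the VXLAN encapsulation overhead. A sketch assuming a 50-byte overhead (outer Ethernet + IPv4 + UDP + VXLAN); the frame size and rate are illustrative:

```python
VXLAN_OVERHEAD_BYTES = 50  # assumed: outer Ethernet 14 + IPv4 20 + UDP 8 + VXLAN 8

def dpu_throughput_gbps(inner_frame_bytes: int, pps: float) -> float:
    """Throughput as the packet enters the DPU: (frame + VXLAN overhead) x PPS."""
    return (inner_frame_bytes + VXLAN_OVERHEAD_BYTES) * 8 * pps / 1e9

# 64B frames generated at 100 Mpps: the generator-side stats count 51.2 Gbps,
# but the DPU sees the encapsulated packets:
print(dpu_throughput_gbps(64, 100e6))  # 91.2
```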


This is a lot to unpack, perhaps explain a little better.


For TCP we use IxLoad, since it has a full TCP stack that is very configurable and can simulate a lot of different scenarios.

While the hero test calls for 6 TCP packets (SYN/SYNACK/ACK/FIN/FINACK/ACK), we use HTTP as the application that runs over TCP, and on the wire we end up with 7 packets for every connection.


Maybe mention we do a 1-byte GET?


The PPS used for the CPS test can be seen in the L23 stats in IxLoad.
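As a rough cross-check of those stats, the expected PPS follows directly from the 7 packets per connection mentioned earlier:

```python
PACKETS_PER_CONNECTION = 7  # TCP handshake/teardown plus the HTTP exchange

def expected_pps(cps: float) -> float:
    """Wire PPS to expect for a given connections-per-second rate."""
    return cps * PACKETS_PER_CONNECTION

print(f"{expected_pps(24e6):,.0f} PPS at the 24M CPS minimum")  # 168,000,000 PPS ...
```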

Keep an eye on TCP failures on both client and server. A retransmit is bad: it signals a packet drop that was detected, and the TCP stack had to retransmit. A connection drop is even worse: it means that even after 3-5 retries the packet did not make it.
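That triage rule can be sketched as a simple check. This is a hypothetical helper; the counter names are assumptions, to be fed from whatever the IxLoad client/server stats actually report:

```python
# Hypothetical triage of TCP stats per the advice above; the two counters are
# assumed to come from the IxLoad client/server statistics views.
def triage_tcp_stats(retransmits: int, connection_failures: int) -> str:
    if connection_failures > 0:
        return "FAIL: connections dropped -- packets lost even after retries"
    if retransmits > 0:
        return "WARN: retransmits seen -- packet drops detected by the TCP stack"
    return "OK: no retransmits, no connection failures"

print(triage_tcp_stats(0, 0))   # OK: ...
print(triage_tcp_stats(12, 0))  # WARN: ...
```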


Maybe put these kinds of tips and techniques in a quote > to stand out.

@chrispsommers

My overall impression is that there is a lot of expertise, experience and practical advice and rationale in here, kind of jotted down quickly to get the big picture w/o worrying too much about making it clear and readable. Besides just spelling and grammar, I think it needs to be more readable in general and explain some more along the way. I think it could be a very valuable document.


@chrispsommers chrispsommers left a comment


Other than minor spelling/grammar, LGTM.

@mgheorghe mgheorghe merged commit 79ae65e into pr-hero-doc Oct 7, 2024
3 checks passed