Abstract
This document is intended to guide a new ODP application developer. Further details about ODP may be found at the ODP home page.
ODP is an API specification that allows many implementations to provide platform independence, automatic hardware acceleration and CPU scaling to high performance networking applications. This document describes how to write an application that can successfully take advantage of the API.
1. Introduction
ODP consists of three separate but related component parts. First, ODP is an
abstract API specification that describes a functional model for
data plane applications. This specification covers many common data plane
application programming needs, such as the ability to receive, manipulate, and
transmit packet data, without specifying how these functions are performed. This
is quite intentional. It is precisely because ODP APIs do not have a preferred
embodiment that they permit innovation in how these functions can
be realized on various platforms that offer implementations of ODP. To achieve
this goal, ODP APIs are described using abstract data types whose definition
is left up to the ODP implementer. For example, in ODP packets are referenced
by abstract handles of type odp_packet_t, and packet-related APIs take arguments of this type. What an odp_packet_t actually is, is not part of the ODP API specification—that is the responsibility of each ODP implementation.
The ODP API specification is designed to permit wide latitude on the part of implementations while at the same time supporting highly efficient processing, especially for APIs that are executed frequently.
Both applications and implementations must comply with the API specification. If not otherwise documented, results are undefined if an application acts against the specification. For example, if an application passes bad parameters to an ODP API, one implementation may report an error, while another may not check them (to maximize performance) and would simply crash while using the bad values.
Note that many ODP component areas provide an odp_xxx_capability() API that returns platform-specific information regarding valid input to other APIs in that component. For best portability, applications should always use these capability APIs to determine valid parameter input.
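As an illustration, the sketch below clamps a requested packet count to the limit reported by odp_pool_capability() before creating a pool. MY_DESIRED_PKTS is a hypothetical application constant, and the exact capability fields available may vary by ODP version.

odp_pool_capability_t pool_capa;

if (odp_pool_capability(&pool_capa) == 0) {
    uint32_t num = MY_DESIRED_PKTS;   /* hypothetical application constant */

    /* a max_num of 0 means the implementation reports no fixed limit */
    if (pool_capa.pkt.max_num && num > pool_capa.pkt.max_num)
        num = pool_capa.pkt.max_num;

    /* num is now a valid value for odp_pool_param_t.pkt.num on this platform */
}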
- Open Source, open contribution, BSD-3 licensed.
- Vendor and platform neutral.
- Application-centric. Covers functional needs of data plane applications.
- Ensures portability by specifying the functional behavior of ODP.
- Both applications and implementations must conform to the API specification.
- Defined jointly and openly by application writers and platform implementers.
- Architected to be implementable on a wide range of platforms efficiently.
- Sponsored, governed, and maintained by the Linaro Networking Group (LNG).
Second, ODP consists of multiple implementations of this API specification, each tailored to a specific target platform. ODP implementations determine how each ODP abstract type is represented on that platform and how each ODP API is realized. On some platforms, ODP APIs will be realized using specialized instructions that accelerate the functional behavior specified by the API. On others, hardware co-processing engines may completely offload the API so that again it can be performed with little or no involvement by a CPU. In all cases, the application sees the same functional behavior independent of how a given platform has chosen to realize it. By allowing each platform the freedom to determine how best to realize each API’s specified functional behavior in an optimal manner, ODP permits applications written to its APIs to take full advantage of the unique capabilities of each platform without the application programmer needing to have specialist knowledge of that platform or to be concerned with how best to tune the application to a particular platform. This latter consideration is particularly important in Network Function Virtualization (NFV) environments where the application will run on a target platform chosen by someone else.
- One size does not fit all—supporting multiple implementations allows ODP to adapt to widely differing internals among platforms.
- Anyone can create an ODP implementation tailored to their platform.
- Distribution and maintenance of each implementation is as the owner wishes.
- Open source or closed source as business needs determine.
- Have independent release cycles and service streams.
- Allows HW and SW innovation in how ODP APIs are implemented on each platform.
To make it easy to get started with implementing ODP on a new platform, ODP supplies a number of reference implementations that can serve as a starting point. The two primary reference implementations supplied by ODP are odp-linux and odp-dpdk.
The odp-linux reference implementation is a pure SW implementation of the ODP API that relies only on the Linux programming API. As a functional model for ODP, it enables ODP to be bootstrapped easily to any platform that supports a Linux kernel.
The odp-dpdk reference implementation is a pure SW implementation of the ODP API that uses DPDK as a SW accelerator. In particular, odp-dpdk offers superior I/O performance for systems that use NICs, allowing ODP applications to take immediate full advantage of the various NIC device drivers supported by DPDK.
- Open source, open contribution, BSD-3 licensed.
- Provide easy bootstrapping of ODP onto new platforms.
- Implementers free to borrow or tailor code as needed for their platform.
- Implementers retain full control over their implementations whether or not they are derived from a reference implementation.
Third, to ensure consistency between different ODP implementations, ODP consists of a validation suite that verifies that any given implementation of ODP faithfully provides the specified functional behavior of each ODP API. As a separate open source component, the validation suite may be used by application writers, system integrators, and platform providers alike to confirm that any purported implementation of ODP does indeed conform to the ODP API specification.
- Synchronized with ODP API specification.
- Maintained and distributed by LNG.
- Open source, open contribution, BSD-3 licensed.
- Key to ensuring application portability across all ODP implementations.
- Tests that ODP implementations conform to the specified functional behavior of ODP APIs.
- Can be run at any time by users and vendors to validate implementations of ODP.
1.1. ODP API Specification Versioning
As an evolving standard, the ODP API specification is released under an incrementing version number, and corresponding implementations of ODP, as well as the validation suite that verifies API conformance, are linked to this version number. ODP versions are specified using a standard three-level number (major.minor.fixlevel) whose levels are incremented according to the degree of change they represent. Increments to the fix level represent clarification of the specification or other minor changes that do not affect either the syntax or semantics of the specification. Such changes in the API specification are expected to be rare. Increments to the minor level represent the introduction of new APIs or functional capabilities, or changes to the specified syntax or functional behavior of APIs, and thus may require application source code changes. Such changes are well documented in the release notes for each revision of the specification. Finally, increments to the major level represent significant structural changes that most likely require some level of application source code change, again as documented in the release notes for that version.
1.2. ODP Implementation Versioning
ODP implementations are free to use whatever release naming/numbering conventions they wish, as long as it is clear what level of the ODP API a given release implements. A recommended convention is to use the same three level numbering scheme where the major and minor numbers correspond to the ODP API level and the fix level represents an implementation-defined service level associated with that API level implementation. The LNG-supplied ODP reference implementations follow this convention.
1.3. ODP Validation Test Suite Versioning
The ODP validation test suite follows these same naming conventions. The major and minor release numbers correspond to the ODP API level that the suite validates and the fix level represents the service level of the validation suite itself for that API level.
1.4. ODP Design Goals
ODP has three primary goals that follow from its component structure. The first is application portability across a wide range of platforms. These platforms differ in terms of processor instruction set architecture, number and types of application processing cores, memory organization, as well as the number and type of platform specific hardware acceleration and offload features that are available. ODP applications can move from one conforming implementation to another with at most a recompile.
Second, ODP is designed to permit data plane applications to avail themselves of platform-specific features, including specialized hardware accelerators, without specialized programming. This is achieved by separating the API specification from its implementation on individual platforms. Since each platform implements each ODP API in a manner optimal to that platform, applications automatically gain the benefit of such optimizations without the need for explicit programming.
Third, ODP is designed to allow applications to scale out automatically to support many core architectures. This is done using an event based programming model that permits applications to be written to be independent of the number of processing cores that are available to realize application function. The result is that an application written to this model does not require redesign as it scales from 4, to 40, to 400 cores.
2. Organization of this Document
This document is organized into several sections. The first presents a high level overview of ODP applications, the ODP API component areas, and their associated abstract data types. This section introduces ODP APIs at a conceptual level. The second provides a tutorial on the programming model(s) supported by ODP, paying particular attention to the event model as this represents the preferred structure for most ODP applications. This section builds on the concepts introduced in the first section and shows how ODP applications are structured to best realize the three ODP design goals mentioned earlier. The third section provides a more detailed overview of the major ODP API components and is designed to serve as a companion to the full reference specification for each API. The latter is intended to be used by ODP application programmers, as well as implementers, to understand the precise syntax and semantics of each API.
3. ODP Applications and Packet Flow
Data plane applications are fundamentally concerned with receiving, examining, manipulating, and transmitting packets. The distinguishing feature of the data plane is that these applications are mostly concerned with the lowest layers of the ISO stack (Layers 2 and 3) and they have very high to extreme performance requirements. ODP is designed to provide a portable framework for such applications.
At the highest level, an ODP Application is a program that uses one or more ODP APIs. Because ODP is a framework rather than a programming environment, applications are free to also use other APIs that may or may not provide the same portability characteristics as ODP APIs.
ODP applications vary in terms of what they do and how they operate, but in general all share the following characteristics:
- They are organized into one or more threads that execute in parallel.
- These threads communicate and coordinate their activities using various synchronization mechanisms.
- They receive packets from one or more packet I/O interfaces.
- They examine, transform, or otherwise process packets.
- They transmit packets to one or more packet I/O interfaces.
At the highest level, an ODP application looks as follows:
Packets arrive and are received (RX) from a network interface represented by a PktIO abstraction. From here they go either directly to Queues that are polled by ODP Threads, or can pass through the Classifier and sorted into Queues that represent individual flows. These queues can then be dispatched to application threads via the Scheduler.
Threads, in turn, can invoke various ODP APIs to manipulate packet contents prior to disposing of them. For output processing, packets may be directly queued to a PktIO output queue or they may be handed to the Traffic Manager for programmatic Quality of Service (QoS) processing before winding up being transmitted (TX). Note that output interfaces may operate in loopback mode, in which case packets sent to them are re-routed back to the input lines for "second pass" processing. For example, an incoming IPsec packet cannot be properly classified (beyond being IPsec traffic) until it is decrypted. Once decrypted and its actual contents made visible, it can then be classified into its real flow.
What is important to note is that the only parts of the above diagram that need to be written are the yellow boxes containing the application logic. Everything else shown here is provided by the ODP framework and available for use by any ODP application. This framework represents the "machinery" of a data plane application and is structured to allow applications written to the ODP APIs to be both portable and optimized for each platform that offers an ODP implementation without additional programming effort.
4. ODP API Concepts
ODP programs are built around several conceptual structures that every application programmer needs to be familiar with to use ODP effectively. The main ODP concepts are: Thread, Event, Queue, Pool, Shared Memory, Buffer, Packet, PktIO, Time, Timer, and Synchronizer.
4.1. Thread
The thread is the fundamental programming unit in ODP. ODP applications are
organized into a collection of threads that perform the work that the
application is designed to do. ODP threads may or may not share memory with
other threads—that is up to the implementation. Threads come in two "flavors", control and worker, which are represented by the abstract type odp_thread_type_t.
A control thread is a supervisory thread that organizes the operation of worker threads. Worker threads, by contrast, exist to perform the main processing logic of the application and employ a run-to-completion model. Worker threads, in particular, are intended to operate on dedicated processing cores, especially in many-core processing environments; however, a given implementation may multitask multiple threads on a single core if desired (typically on smaller and lower performance target environments).
In addition to thread types, threads have associated attributes such as thread mask and scheduler group that determine where they can run and the type of work that they can handle. These will be discussed in greater detail later.
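For illustration, a thread can query its own identity and type at run time; this sketch assumes ODP has already been initialized for the calling thread.

int tid = odp_thread_id();   /* unique id of the calling ODP thread, e.g. for per-thread stats */

if (odp_thread_type() == ODP_THREAD_WORKER) {
    /* worker threads run the main event processing loop */
} else {
    /* ODP_THREAD_CONTROL threads perform supervisory work */
}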
4.2. Event
Events are what threads process to perform their work. Events can represent
new work, such as the arrival of a packet that needs to be processed, or they
can represent the completion of requests that have executed asynchronously.
Events can also represent notifications of the passage of time, or of status
changes in various components of interest to the application. Events have an event type that describes what each event represents. Threads can create new events, consume events that they have finished processing, or perform some processing on an event and then pass it along to another component for further processing.
References to events are via handles of abstract type odp_event_t. Cast functions are provided to convert these into specific handles of the appropriate type represented by the event.
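For instance, a dispatch loop commonly switches on the event type and then converts the generic handle into the specific one; a minimal sketch:

odp_event_t ev = odp_schedule(NULL, ODP_SCHED_WAIT);

switch (odp_event_type(ev)) {
case ODP_EVENT_PACKET: {
    odp_packet_t pkt = odp_packet_from_event(ev);
    /* ...process the packet... */
    break;
}
case ODP_EVENT_TIMEOUT: {
    odp_timeout_t tmo = odp_timeout_from_event(ev);
    /* ...handle the timer expiration... */
    break;
}
default:
    odp_event_free(ev);   /* not interested in other event types */
    break;
}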
4.3. Queue
A queue is a message passing channel that holds events. Events can be added to a queue via enqueue operations or removed from a queue via dequeue operations. The endpoints of a queue will vary depending on how it is used. Queues come in two major types: plain and scheduled, which will be discussed in more detail when the event model is introduced. Queues may also have an associated context, which represents a persistent state for all events that make use of it. These states are what permit threads to perform stateful processing on events as well as stateless processing.
Queues are represented by handles of abstract type odp_queue_t.
4.4. Pool
A pool is a shared memory area from which elements may be drawn. Pools represent the backing store for events, among other things. Pools are typically created and destroyed by the application during initialization and termination, respectively, and then used during processing. Pools may be used by ODP components exclusively, by applications exclusively, or their use may be shared between the two. Pools have an associated type that characterizes the elements that they contain. The two most important pool types are Buffer and Packet.
Pools are represented by handles of abstract type odp_pool_t.
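A packet pool might be created during application initialization roughly as follows; the pool name and sizing values here are illustrative only.

odp_pool_param_t params;
odp_pool_t pkt_pool;

odp_pool_param_init(&params);
params.type    = ODP_POOL_PACKET;
params.pkt.num = 8192;    /* illustrative: number of packets in the pool */
params.pkt.len = 1518;    /* illustrative: default packet length in bytes */

pkt_pool = odp_pool_create("pkt_pool", &params);
if (pkt_pool == ODP_POOL_INVALID) {
    /* handle pool creation failure */
}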
4.5. Shared Memory
Shared memory represents raw blocks of storage that are sharable between threads. They are the building blocks of pools but can be used directly by ODP applications if desired.
Shared memory is represented by handles of abstract type odp_shm_t.
4.6. Buffer
A buffer is a fixed-size block of shared storage that is used by ODP components and/or applications to realize their function. Buffers contain zero or more bytes of application data as well as system maintained metadata that provide information about the buffer, such as its size or the pool it was allocated from. Metadata is an important ODP concept because it allows for arbitrary amounts of side information to be associated with an ODP object. Most ODP objects have associated metadata and this metadata is manipulated via accessor functions that act as getters and setters for this information. Getter access functions permit an application to read a metadata item, while setter access functions permit an application to write a metadata item. Note that some metadata is inherently read only and thus no setter is provided to manipulate it. When objects have multiple metadata items, each has its own associated getter and/or setter access function to inspect or manipulate it.
Buffers are represented by handles of abstract type odp_buffer_t.
4.7. Packet
Packets are received and transmitted via I/O interfaces and represent
the basic data that data plane applications manipulate.
Packets are drawn from pools of type ODP_POOL_PACKET.
Unlike buffers, which are simple objects,
ODP packets have a rich set of semantics that permit their inspection
and manipulation in complex ways to be described later. Packets also support
a rich set of metadata as well as user metadata. User metadata permits
applications to associate an application-determined amount of side information
with each packet for its own use.
Packets are represented by handles of abstract type odp_packet_t.
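A few of the metadata accessors mentioned above, applied to a received packet (a sketch, assuming pkt is an odp_packet_t obtained, for example, via odp_packet_from_event()):

uint32_t   len = odp_packet_len(pkt);    /* total packet length in bytes */
odp_pool_t src = odp_packet_pool(pkt);   /* pool the packet was drawn from */

if (odp_packet_has_ipv4(pkt)) {
    /* l3_offset is the byte offset of the IP header within the packet */
    uint32_t l3_offset = odp_packet_l3_offset(pkt);
    /* ... */
}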
4.8. Packet I/O (PktIO)
PktIO is how ODP represents I/O interfaces. A pktio object is a logical port capable of receiving (RX) and/or transmitting (TX) packets. This may be directly supported by the underlying platform as an integrated feature, or may represent a device attached via a PCIE or other bus.
PktIOs are represented by handles of abstract type odp_pktio_t.
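Opening a pktio typically looks like the following sketch, where the interface name "eth0" and the choice of scheduled input are illustrative, and pkt_pool is assumed to be a previously created packet pool:

odp_pktio_param_t pktio_param;
odp_pktio_t pktio;

odp_pktio_param_init(&pktio_param);
pktio_param.in_mode = ODP_PKTIN_MODE_SCHED;   /* received packets delivered via the scheduler */

pktio = odp_pktio_open("eth0", pkt_pool, &pktio_param);
if (pktio == ODP_PKTIO_INVALID) {
    /* handle open failure */
}

/* configure default input/output queues, then start the interface */
odp_pktin_queue_config(pktio, NULL);
odp_pktout_queue_config(pktio, NULL);
odp_pktio_start(pktio);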
4.9. Time
The time API is used to measure time intervals and track time flow of an application and presents a convenient way to get access to an implementation-defined time source. The time API consists of two main parts: local time API and global time API.
4.9.1. Local time
The local time API is designed to be used within one thread and obtaining
local time may be more efficient in some implementations than global
time. Local time stamps are local to the calling thread and should not be
shared with other threads, as local time is not guaranteed to be consistent
between threads. Current local time can be read with odp_time_local().
4.9.2. Global time
The global time API is designed to be used for tracking time between threads.
So, global time stamps may safely be shared between threads. Current global
time can be read with odp_time_global().
Neither local nor global time wraps during the application life cycle.
The time API includes functions to operate with time, such as odp_time_diff(), odp_time_sum(), and odp_time_cmp(), and conversion functions like odp_time_to_ns(), odp_time_local_from_ns(), and odp_time_global_from_ns(). To get the rate of the time source, odp_time_local_res() and odp_time_global_res() are used. To wait, odp_time_wait_ns() and odp_time_wait_until() are used, during which a thread potentially busy loops for the entire wait time.
The odp_time_t opaque type represents local, global and PktIO timestamps.
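Measuring an interval with local time might look like this sketch:

odp_time_t t1, t2, diff;
uint64_t nsec;

t1 = odp_time_local();
/* ...work to be measured... */
t2 = odp_time_local();

diff = odp_time_diff(t2, t1);   /* t2 - t1 as an odp_time_t */
nsec = odp_time_to_ns(diff);    /* interval converted to nanoseconds */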
4.9.3. Portability Considerations
The ODP Time APIs are designed to permit high-precision relative time
measurement within an ODP application. No attempt is made to correlate an
odp_time_t object with "wall time" or any other external time reference. As defined by the ODP specification, odp_time_t values are required to be unique over a span of at least 10 years. Most implementations will choose to implement time values using 64-bit values, whose wrap times exceed 500 years, making wrapping concerns not relevant to ODP applications.
4.10. Timer
Timers are how ODP applications measure and respond to the passage of time.
Timers are drawn from specialized pools called timer pools that have their own abstract type (odp_timer_pool_t). Applications may have many timers active at the same time and can set them to use either relative or absolute time. When timers expire they create events of type odp_timeout_t, which serve as notifications of timer expiration.
4.11. Synchronizer
Multiple threads operating in parallel typically require various synchronization services to permit them to operate in a reliable and coordinated manner. ODP provides a rich set of locks, barriers, and similar synchronization primitives, as well as abstract types for representing various types of atomic variables. The ODP event model also makes use of queues to avoid the need for explicit locking in many cases. This will be discussed in the next section.
5. ODP Components
Building on ODP concepts, ODP offers several components that relate to the flow of work through an ODP application. These include the Classifier, Scheduler, and Traffic Manager. These components relate to the three main stages of packet processing: Receive, Process, and Transmit.
5.1. Classifier
The Classifier provides a suite of APIs that control packet receive (RX) processing.
The classifier provides two logically related services:
- Packet parsing: verifying and extracting structural information from a received packet.
- Packet classification: applying Pattern Matching Rules (PMRs) to the parsed results to assign an incoming packet to a Class of Service (CoS).
Combined, these permit incoming packets to be sorted into flows, which are logically related sequences of packets that share common processing requirements. While many data plane applications perform stateless packet processing (e.g., for simple forwarding) others perform stateful packet processing. Flows anchor state information relating to these groups of packets.
A CoS determines two variables for packets belonging to a flow:
- The pool that they will be stored in on receipt
- The queue that they will be added to for processing
The PMRs supported by ODP permit flow determination based on combinations of packet field values (tuples). The main advantage of classification is that on many platforms these functions are performed in hardware, meaning that classification occurs at line rate as packets are being received without any explicit processing by the ODP application.
Note that the use of the classifier is optional. Applications may directly receive packets from a corresponding PktIO input queue via direct polling if they choose.
5.2. Scheduler
The Scheduler provides a suite of APIs that control scalable event processing.
The Scheduler is responsible for selecting and dispatching one or more events
to a requesting thread. Event selection is based on several factors involving
both the queues containing schedulable events and the thread making an
odp_schedule() or odp_schedule_multi() call.
ODP queues have a scheduling priority that determines how urgently events on them should be processed relative to events contained in other queues. Queues also have a scheduler group id associated with them that must match the associated scheduler group thread mask of the thread calling the scheduler. This permits events to be grouped for processing into classes and have threads that are dedicated to processing events from specified classes. Threads can join and leave scheduler groups dynamically, permitting easy application response to increases in demand.
When a thread receives an event from the scheduler, it in turn can invoke other processing engines via ODP APIs (e.g., crypto processing) that can operate asynchronously. When such processing is complete, the result is that a completion event is added to a schedulable queue where it can be scheduled back to a thread to continue processing with the results of the requested asynchronous operation.
Threads themselves can enqueue events to queues for downstream processing by other threads, permitting flexibility in how applications structure themselves to maximize concurrency.
5.3. Traffic Manager
The Traffic Manager provides a suite of APIs that control traffic shaping and Quality of Service (QoS) processing for packet output.
The final stage of packet processing is to transmit it. Here, applications have several choices. As with RX processing, applications may send packets directly to PktIO TX queues for direct transmission. Often, however, applications need to perform traffic shaping and related Quality of Service (QoS) processing on the packets comprising a flow as part of transmit processing. To handle this need, ODP provides a suite of Traffic Manager APIs that permit programmatic establishment of arbiters, shapers, etc. that control output packet processing to achieve desired QoS goals. Again, the advantage here is that on many platforms traffic management functions are implemented in hardware, permitting transparent offload of this work.
6. ODP Application Programming Structure
6.1. The include structure
Applications only include the 'include/odp_api.h' file, which includes the
'platform/<implementation name>/include/odp/api' files to provide a complete
definition of the API on that platform. The doxygen documentation defining
the behavior of the ODP API is all contained in the public API files, and the
actual definitions for an implementation will be found in the per platform
directories. Per-platform data that might normally be a #define can be recovered via the appropriate access function if the #define is not directly visible to the application.
./
├── include/
│   ├── odp/
│   │   └── api/
│   │       └── spec/
│   │           └── The Public API and the documentation.
│   │
│   ├── odp_api.h    This file should be the only file included by the
│   │                application.
6.2. Initialization
ODP depends on the application to perform a graceful shutdown: calling the terminate functions should only be done when the application is sure it has closed the ingress and subsequently drained all queues, etc.
6.3. Startup
The first API that must be called by an ODP application is odp_init_global():
int odp_init_global(odp_instance_t *instance,
const odp_init_t *param,
const odp_platform_init_t *platform_param);
In addition to the odp_instance_t output parameter, this call takes two pointers. The first, odp_init_t, contains ODP initialization data that is platform independent and portable, while the second, odp_platform_init_t, is passed unparsed to the implementation to be used for platform specific data that is not yet, or may never be, suitable for the ODP API. Each of these parameters is optional and may be specified as NULL to accept the implementation-defined default initialization values.
Calling odp_init_global() establishes the ODP API framework and MUST be called before any other ODP API may be called. Note that it is only called once per application. A successful call to odp_init_global() returns rc = 0 and sets the instance variable supplied as input to the call to a handle representing this unique ODP instance.
The odp_init_t parameter is used to specify various customizations to the ODP environment being established by this call. For example, the caller can specify the maximum number of worker threads it will use, the thread masks associated with these threads, as well as whether the default logging or abort functions are to be overridden with an application-supplied handler.
The application may also provide optimization hints to the ODP implementation if it knows that it will never use specific ODP feature sets, such as the packet classifier or traffic manager. Implementations may use such hints to provide optimized behavior to applications that are known not to need these features.
Following global initialization, each thread in turn calls 'odp_init_local()'. This establishes the local ODP thread context for that thread and MUST be called before other ODP APIs may be called by that thread. The arguments to this call are the instance variable returned by odp_init_global() and the thread's type (worker or control).
6.4. Shutdown
Shutdown is the logical reverse of the initialization procedure, with odp_term_local() called for each thread before odp_term_global() is called to terminate ODP.
6.5. Application Initialization/Termination Structure
ODP Applications follow the general structure flow shown below:
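A minimal sketch of this flow for a single (control) thread is shown here; real applications would also create worker threads between the local initialization and termination steps, and would add proper error handling.

#include <odp_api.h>

int main(void)
{
    odp_instance_t instance;

    if (odp_init_global(&instance, NULL, NULL))
        return -1;
    if (odp_init_local(instance, ODP_THREAD_CONTROL))
        return -1;

    /* ...create pools, queues, pktios; launch worker threads; do work... */

    odp_term_local();
    odp_term_global(instance);
    return 0;
}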
7. Common Conventions
Many ODP APIs share common conventions regarding their arguments and return types. This section highlights some of the more common and frequently used conventions.
7.1. Handles and Special Designators
ODP resources are represented via handles that have abstract type odp_resource_t. So pools are represented by handles of type odp_pool_t, queues by handles of type odp_queue_t, etc. Each such type has a distinguished value ODP_RESOURCE_INVALID that is used to indicate a handle that does not refer to a valid resource of that type. Resources are typically created via an API named odp_resource_create() that returns a handle of type odp_resource_t that represents the created object. This returned handle is set to ODP_RESOURCE_INVALID if, for example, the resource could not be created due to resource exhaustion. Invalid resources do not necessarily represent error conditions. For example, ODP_EVENT_INVALID in response to an odp_queue_deq() call to get an event from a queue simply indicates that the queue is empty.
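For example (a sketch):

odp_queue_t q = odp_queue_create("my queue", NULL);

if (q == ODP_QUEUE_INVALID) {
    /* creation failed, e.g. due to resource exhaustion */
}

odp_event_t ev = odp_queue_deq(q);
if (ev == ODP_EVENT_INVALID) {
    /* not an error: the queue is simply empty right now */
}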
7.2. Addressing Scope
Unless specifically noted in the API, all ODP resources are global to the ODP application, whether it runs as a single process or multiple processes. ODP handles therefore have common meaning within an ODP application but have no meaning outside the scope of the application.
7.3. Resources and Names
Many ODP resource objects, such as pools and queues, support an application-specified character string name that is associated with an ODP object at create time. This name serves two purposes: documentation, and lookup. The lookup function is particularly useful to allow an ODP application that is divided into multiple processes to obtain the handle for the common resource.
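For instance, a secondary process might locate a queue created elsewhere in the application by name (a sketch, assuming a queue named "rx_queue" was created earlier):

odp_queue_t q = odp_queue_lookup("rx_queue");

if (q == ODP_QUEUE_INVALID) {
    /* no queue by that name exists (yet) in this ODP application */
}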
8. Application Portability Considerations
ODP is designed to support the creation of portable data plane applications that can easily be run on multiple target platforms while at the same time fully exploiting the hardware acceleration capabilities native to whatever platform they are running on. This section discusses tradeoffs that application writers should consider when using ODP.
First, it should be noted that portability is not an absolute good nor is it a single-valued attribute (application is portable or is not portable). While any application can be ported from one platform to another the real question is: at what cost? Costs can be measured in two dimensions: The level of effort needed to port, and the resulting performance differences seen due to the port. Ideally an application should be portable between platforms with minimal effort and equally minimal performance impact. While ODP is designed to support this ideal, each application must assess what its goals are in this area and how best to use ODP to achieve these goals.
8.1. Portability and Coexistence
Because ODP offers a programming framework rather than a programming environment, it is designed to be able to work alongside APIs offered by other frameworks with minimal interference. Therefore when we speak of portability in an ODP context, we of necessity speak of portability of those portions of the application that make use of ODP APIs. If an application uses non-ODP APIs then those must be taken into consideration as well when assessing the portability of the entire application. For many applications, it suffices to isolate certain non-portable code to a few areas of the application with the result that the application is significantly more portable than it would be without using ODP. Especially when dealing with existing applications that run in production environments, ODP may well be introduced in an incremental manner with the result being that the application becomes more portable only over time.
8.2. Source vs. Binary Portability
ODP has been designed to support both source and binary portability. Source portability is intrinsic to the ODP API specification itself. Any application written to the ODP API specification will be source portable between any conforming ODP implementation with at most a recompile. This is because ODP APIs do not expose implementation details or internal structures that may vary from platform to platform.
For platforms that share a common Instruction Set Architecture (ISA), ODP can also offer binary portability via the specification of an Application Binary Interface (ABI). This is especially useful in a Network Function Virtualization (NFV) environment where a data plane application may be developed and compiled on one platform for distribution and then deployed on many different platforms by an NFV Orchestrator function.
8.3. ODP Application Profiles
To assist in meeting these needs, ODP offers two distinct application profiles that are designed to characterize the needs of different types of data plane applications: the Embedded Profile and the Cloud Profile.
8.3.1. Embedded Profile
The ODP Embedded Profile is designed to support applications that wish to target a specific platform and achieve optimal performance on that platform and where source code portability is sufficient. If such applications need to support more than one platform then they simply need to be recompiled against the ODP implementation for that platform.
Embedded applications will typically work with a copy of ODP downloaded from a git repository so that it can be configured for the application’s precise needs. To specify that the application wishes to use the embedded profile:
./configure --disable-abi-compat …
should be used as part of the ODP configuration options. This allows applications to use inline forms of ODP APIs to give optimal performance on this platform, and may include additional optimizations that preclude binary portability to other platforms. The result is a binary that will achieve maximum performance on a given target platform and that can be ported to other platforms with a recompile.
8.3.2. Cloud Profile
By contrast, the ODP Cloud Profile is designed to support applications that
wish to be platform-agnostic and be binary compatible across all platforms
sharing this ABI. Any ODP implementation included in a Linux distribution will
be configured for the cloud profile, so no additional action is required on
the part of applications when compiling against a distributed copy of ODP (one that is installed via sudo apt-get install or an equivalent command).
When using a copy of ODP downloaded from a repository, the cloud profile is selected at configure time:
./configure --enable-abi-compat …
8.4. ABI Characteristics
An ABI consists of several conventions that ensure that a program compiled against one ODP implementation can run unchanged on another platform that has a possibly very different ODP implementation without requiring recompilation. These include:
- A set of function calling conventions that define how functions call other functions, pass parameters, and receive returned results. These are typically specified by the Operating System (e.g., Linux) and are independent of ODP.
- Avoiding the use of inline expansions for any ODP API. This ensures that differing ODP implementations can maintain their different internals without these differences being visible to the application.
- Agreement as to the size and alignment of ODP abstract datatypes used by all ODP implementations sharing this ABI definition. This means that, for example, the size of an odp_packet_t handle is the same across all members of the ABI. Since these handles are opaque, it doesn’t matter if their structure differs between ODP implementations since applications never reference these possibly different internals.
Note that an ABI definition exists within a specific Instruction Set Architecture (ISA), such as x86-64 or AArch64. Binaries cannot directly port between ISAs—that requires a recompilation.
Each ODP implementation will identify which ABI definition it supports, if any. When compiling against an ODP implementation in ABI compatibility mode, the resulting binary is automatically binary compatible with all other ODP implementations that share this ABI. For example, for the x86-64 ISA, both the odp-linux and odp-dpdk implementations share a common ABI.
9. Shared memory
9.1. Allocating shared memory
Blocks of shared memory can be created using the odp_shm_reserve() API call. The call expects a shared memory block name, a block size, an alignment requirement, and optional flags as parameters. It returns an odp_shm_t handle. The size and alignment requirement are given in bytes. The provided name does not have to be unique, i.e. a given name can be used multiple times when reserving different blocks.
#define ALIGNMENT 128
#define BLKNAME "shared_items"
odp_shm_t shm;
uint32_t shm_flags = 0;
typedef struct {
...
} shared_data_t;
shm = odp_shm_reserve(BLKNAME, sizeof(shared_data_t), ALIGNMENT, shm_flags);
9.2. Getting the shared memory block address
The returned odp_shm_t handle can then be used to retrieve the actual address (in the caller’s ODP thread virtual address space) of the created shared memory block.
shared_data_t *shared_data;
shared_data = odp_shm_addr(shm);
The address returned by odp_shm_addr() is normally valid only in the calling ODP thread space: odp_shm_t handles can be shared between ODP threads and remain valid within any of these threads, whereas the address returned by odp_shm_addr(shm) may differ from ODP thread to ODP thread (for the same 'shm' block), and should therefore not be shared between ODP threads. For instance, it would be correct to send a shm handle using IPC between two ODP threads and let each of these threads do their own odp_shm_addr() call to get the block address. Directly sending the address returned by odp_shm_addr() from one ODP thread to another could however fail (the address may make no sense in the receiver's address space).
The address returned by odp_shm_addr() is nevertheless guaranteed to be aligned according to the alignment requirements provided at block creation time, even if the call to odp_shm_addr() is performed by a different ODP thread than the one which originally called odp_shm_reserve().
All shared memory blocks are contiguous in any ODP thread addressing space: the range from 'address' to 'address'+'size' (where 'size' is the shared memory block size, as provided in the odp_shm_reserve() call) is readable and writable and maps the shared memory block. There is no fragmentation.
The exception to this rule is if the odp_shm_t is created with the ODP_SHM_SINGLE_VA flag. This requests that odp_shm_addr() return the same virtual address for all ODP threads in this instance. Note that there may be a performance cost or shm size limit associated with providing this function in some implementations.
9.3. Memory behavior
By default ODP threads are assumed to behave as a cache coherent system: any change performed on a shared memory block is guaranteed to eventually become visible to other ODP threads sharing this memory block. Nevertheless, there is no implicit memory barrier associated with any action on shared memory: when a change performed by one ODP thread becomes visible to another ODP thread is not known. An application using shared memory blocks has to use memory barriers provided by ODP to guarantee shared data validity between ODP threads.
The virtual address at which a given memory block is mapped in different ODP
threads may differ from ODP thread to ODP thread, if ODP threads have separate
virtual spaces (for instance if ODP threads are implemented as processes).
However, the ODP_SHM_SINGLE_VA flag can be used at odp_shm_reserve() time to guarantee address uniqueness in all ODP threads, regardless of their implementation or creation time.
9.4. Lookup by name
As mentioned, shared memory handles can be sent from ODP thread to ODP thread using any IPC mechanism, and the block address then retrieved. A simpler approach to get the shared memory block handle of an already created block is to use the odp_shm_lookup() API function call. This nevertheless requires the calling ODP thread to provide the name of the shared memory block: odp_shm_lookup() will return ODP_SHM_INVALID if no shared memory block with the provided name is known by ODP. When multiple blocks were reserved using the same name, the lookup function will return the handle of any one of these blocks.
#define BLKNAME "shared_items"
odp_shm_t shm;
shared_data_t *shared_data;
shm = odp_shm_lookup(BLKNAME);
if (shm != ODP_SHM_INVALID) {
shared_data = odp_shm_addr(shm);
...
}
9.5. Freeing memory
Freeing shared memory is performed using the odp_shm_free() API call. odp_shm_free() takes one single argument, the shared memory block handle. Any ODP thread is allowed to perform an odp_shm_free() on a shared memory block (i.e. the thread performing the odp_shm_free() may be different from the thread which did the odp_shm_reserve()). Shared memory blocks should be freed only once, and once freed, a shared memory block should no longer be referenced by any ODP thread.
if (odp_shm_free(shm) != 0) {
    /* ...handle error */
}
9.6. Sharing memory with the external world
ODP provides ways of sharing memory with entities located outside ODP instances:
Sharing a block of memory with an external (non ODP) thread is achieved by setting the ODP_SHM_PROC flag at odp_shm_reserve() time. How the memory block is retrieved on the Operating System side is implementation and Operating System dependent.
Sharing a block of memory with an external ODP instance (running on the same Operating System) is achieved by setting the ODP_SHM_EXPORT flag at odp_shm_reserve() time. A block of memory created with this flag in an ODP instance A can be "mapped" into a remote ODP instance B (on the same OS) by using odp_shm_import() on ODP instance B:
odp_shm_t shmA;
shmA = odp_shm_reserve("memoryA", size, 0, ODP_SHM_EXPORT);

odp_shm_t shmB;
odp_instance_t odpA;

/* get ODP A instance handle by some OS method */
odpA = ...

/* get the shared memory exported by A: */
shmB = odp_shm_import("memoryA", odpA, "memoryB", 0, 0);
Note that the handles shmA and shmB are scoped by each ODP instance
(you can not use them outside the ODP instance they belong to).
Also note that both ODP instances have to call odp_shm_free()
when done.
9.7. Memory creation flags
The last argument to odp_shm_reserve() is a set of ORed flags. The following flags are supported:
9.7.1. ODP_SHM_PROC
When this flag is given, the allocated shared memory will become visible outside ODP. Non ODP threads (e.g. ordinary Linux processes or Linux threads) will be able to access the memory using native (non ODP) OS calls such as 'shm_open()' and 'mmap()' (on Linux). Each ODP implementation should provide a description of exactly how this mapping is done on that specific platform.
9.7.2. ODP_SHM_EXPORT
When this flag is given, the allocated shared memory will become visible
to other ODP instances running on the same OS.
Other ODP instances willing to see this exported memory should use the odp_shm_import() ODP function.
9.7.3. ODP_SHM_SINGLE_VA
This flag is used to guarantee the uniqueness of the address at which the shared memory is mapped: without this flag, a given memory block may be mapped at different virtual addresses (assuming the target has virtual addresses) by different ODP threads. This means that the value returned by odp_shm_addr() would differ between threads in this case. Setting this flag guarantees that all ODP threads sharing this memory block will see it at the same address (odp_shm_addr() would return the same value on all ODP threads for a given memory block). Note that ODP implementations may have restrictions on the amount of memory which can be allocated with this flag.
10. Queues and the Scheduler
Queues are the fundamental event sequencing mechanism provided by ODP and all ODP applications make use of them either explicitly or implicitly. Queues are created via the 'odp_queue_create()' API that returns a handle of type odp_queue_t that is used to refer to this queue in all subsequent APIs that reference it. Queues have one of two ODP-defined types, PLAIN and SCHED, that determine how they are used. PLAIN queues are directly managed by the ODP application while SCHED queues make use of the ODP scheduler to provide automatic scalable dispatching and synchronization services.
odp_queue_param_t qp;
odp_queue_param_init(&qp);
qp.type = ODP_QUEUE_TYPE_PLAIN;
odp_queue_t plain_q1 = odp_queue_create("poll queue 1", &qp);
odp_queue_t plain_q2 = odp_queue_create("poll queue 2", &qp);
...
odp_event_t ev = odp_queue_deq(plain_q1);
...do something
int rc = odp_queue_enq(plain_q2, ev);
The key distinction is that dequeueing events from PLAIN queues is an application responsibility while dequeueing events from SCHED queues is the responsibility of the ODP scheduler.
odp_queue_param_t qp;
odp_queue_param_init(&qp);
qp.type = ODP_QUEUE_TYPE_SCHED;
qp.sched.group = ODP_SCHED_GROUP_WORKER;
qp.sched.prio = odp_schedule_max_prio();
qp.sched.sync = ODP_SCHED_SYNC_PARALLEL;
odp_queue_t sched_q1 = odp_queue_create("sched queue 1", &qp);
...thread init processing
while (1) {
odp_event_t ev;
odp_queue_t src_q;
ev = odp_schedule(&src_q, ODP_SCHED_WAIT);
...process the event
}
With scheduled queues, events are sent to a queue, and the sender chooses a queue based on the service it needs. The sender does not need to know which ODP thread (on which core) or hardware accelerator will process the event, but all the events on a queue are eventually scheduled and processed.
As can be seen, SCHED queues have additional attributes that are specified at queue create that control how the scheduler is to process events contained on them. These include group, priority, and synchronization class.
10.1. Scheduler Groups
The scheduler’s dispatching job is to return the next event from the highest priority SCHED queue that the caller is eligible to receive events from. This latter consideration is determined by the queue’s scheduler group, which is set at queue create time, and by the caller’s scheduler group mask that indicates which scheduler group(s) it belongs to. Scheduler groups are represented by handles of type odp_schedule_group_t and are created by the odp_schedule_group_create() API. A number of scheduler groups are predefined by ODP. These include ODP_SCHED_GROUP_ALL (all threads), ODP_SCHED_GROUP_WORKER (all worker threads), and ODP_SCHED_GROUP_CONTROL (all control threads). The application is free to create additional scheduler groups for its own purposes and threads can join or leave scheduler groups using the odp_schedule_group_join() and odp_schedule_group_leave() APIs.
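Creating a dedicated scheduler group and joining it from the calling thread might look like this sketch (the group name "fast path" is illustrative):

odp_thrmask_t mask;
odp_schedule_group_t grp;

odp_thrmask_zero(&mask);
odp_thrmask_set(&mask, odp_thread_id());   /* start with just the calling thread */

grp = odp_schedule_group_create("fast path", &mask);
if (grp == ODP_SCHED_GROUP_INVALID) {
    /* handle creation failure */
}

/* later, another thread can add itself to the group */
odp_thrmask_zero(&mask);
odp_thrmask_set(&mask, odp_thread_id());
odp_schedule_group_join(grp, &mask);

A SCHED queue can then be placed in this group by setting the sched.group field of its odp_queue_param_t to grp when the queue is created.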
10.2. Scheduler Priority
The prio field of the odp_queue_param_t specifies the queue’s scheduling priority, which is how queues within eligible scheduler groups are selected for dispatch. Queues have a default scheduling priority of NORMAL but can be set to HIGHEST or LOWEST according to application needs.
10.3. Scheduler Synchronization
In addition to its dispatching function, which provides automatic scalability to ODP applications in many-core environments, the other main function of the scheduler is to provide event synchronization services that greatly simplify application programming in a parallel processing environment. A queue’s SYNC mode determines how the scheduler handles the synchronization processing of multiple events originating from the same queue.
Three types of queue scheduler synchronization are supported: Parallel, Atomic, and Ordered.
10.3.1. Parallel Queues
SCHED queues that specify a sync mode of ODP_SCHED_SYNC_PARALLEL are unrestricted in how events are processed.
All events held on parallel queues are eligible to be scheduled simultaneously and any required synchronization between them is the responsibility of the application. Events originating from parallel queues thus have the highest throughput rate, however they also potentially involve the most work on the part of the application. In the Figure above, four threads are calling odp_schedule() to obtain events to process. The scheduler has assigned three events from the first queue to three threads in parallel. The fourth thread is processing a single event from the third queue. The second queue might either be empty, of lower priority, or not in a scheduler group matching any of the threads being serviced by the scheduler.
10.4. Atomic Queues
Atomic queues simplify event synchronization because only a single thread may process event(s) from a given atomic queue at a time. Events scheduled from atomic queues thus can be processed lock free because the locking is being done implicitly by the scheduler. Note that the caller may receive one or more events from the same atomic queue if odp_schedule_multi() is used. In this case these multiple events all share the same atomic scheduling context.
In this example, no matter how many events may be held in an atomic queue, only one calling thread can receive scheduled events from it at a time. Here two threads process events from two different atomic queues. Note that there is no synchronization between different atomic queues, only between events originating from the same atomic queue. The queue context associated with the atomic queue is held until the next call to the scheduler or until the application explicitly releases it via a call to odp_schedule_release_atomic().
Note that while atomic queues simplify programming, the serial nature of atomic queues may impair scaling.
10.5. Ordered Queues
Ordered queues provide the best of both worlds, combining the inherent scalability of parallel queues with the easy synchronization of atomic queues.
When scheduling events from an ordered queue, the scheduler dispatches multiple events from the queue in parallel to different threads, however the scheduler also ensures that the relative sequence of these events on output queues is identical to their sequence from their originating ordered queue.
As with atomic queues, the ordering guarantees associated with ordered queues refer to events originating from the same queue, not to those originating on different queues. Thus in this figure three threads are processing events 5, 3, and 4, respectively, from the first ordered queue. Regardless of how these threads complete processing, these events will appear in their original relative order on their output queue.
10.5.1. Order Preservation
Relative order is preserved independent of whether events are being sent to different output queues. For example, if some events are sent to output queue A while others are sent to output queue B then the events on these output queues will still be in the same relative order as they were on their originating queue. Similarly, if the processing consumes events so that no output is issued for some of them (e.g., as part of IP fragment reassembly processing) then other events will still be correctly ordered with respect to these sequence gaps. Finally, if multiple events are enqueued for a given order (e.g., as part of packet segmentation processing for MTU considerations), then each of these events will occupy the originator’s sequence in the target output queue(s). In this case the relative order of these events will be in the order that the thread issued odp_queue_enq() calls for them.
The ordered context associated with the dispatch of an event from an ordered queue lasts until the next scheduler call or until explicitly released by the thread calling odp_schedule_release_ordered(). This call may be used as a performance advisory that the thread no longer requires ordering guarantees for the current context. As a result, any subsequent enqueues within the current scheduler context will be treated as if the thread was operating in a parallel queue context.
10.5.2. Ordered Locking
Another powerful feature of the scheduler’s handling of ordered queues is ordered locks. Each ordered queue has associated with it a number of ordered locks as specified by the lock_count parameter at queue create time.
Ordered locks provide an efficient means to perform in-order sequential processing within an ordered context. For example, suppose events with relative order 5, 6, and 7 are being processed in parallel by three different threads. An ordered lock will enable these threads to synchronize such that they can perform some critical section in their originating queue order. The number of ordered locks supported for each ordered queue is implementation dependent (and queryable via the odp_config_max_ordered_locks_per_queue() API). If the implementation supports multiple ordered locks then these may be used to protect different ordered critical sections within a given ordered context.
10.5.3. Summary: Ordered Queues
To see how these considerations fit together, consider the following code:
void worker_thread()
{
    odp_event_t ev;
    odp_queue_t which_q;

    /* instance is the handle returned by odp_init_global() */
    odp_init_local(instance, ODP_THREAD_WORKER);
    ...other initialization processing

    while (1) {
        ev = odp_schedule(&which_q, ODP_SCHED_WAIT);
        ...process events in parallel
        odp_schedule_order_lock(0);
        ...critical section processed in order
        odp_schedule_order_unlock(0);
        ...continue processing in parallel
        odp_queue_enq(dest_q, ev);
    }
}
This represents a simplified structure for a typical worker thread operating
on ordered queues. Multiple events are processed in parallel and the use of
ordered queues ensures that they will be placed on dest_q in the same order as they originated. While processing in parallel, the use of ordered locks
as they originated. While processing in parallel, the use of ordered locks
enables critical sections to be processed in order within the overall parallel
flow. When a thread arrives at the odp_schedule_order_lock() call, it waits
until the locking order for this lock for all prior events has been resolved
and then enters the critical section. The odp_schedule_order_unlock() call
releases the critical section and allows the next order to enter it.
10.6. Scheduler Capabilities and Configuration
As with other ODP components, the ODP scheduler offers a range of capabilities and configuration options that are used by applications to control its behavior.
The sequence of API calls used by applications that make use of the scheduler is as follows:
odp_schedule_capability()
odp_schedule_config_init()
odp_schedule_config()
odp_schedule()
The odp_schedule_capability() API returns an odp_schedule_capability_t struct that defines various limits and capabilities offered by this implementation of the ODP scheduler. Of note is the max_flow_id capability, which indicates whether this implementation is able to operate in flow aware mode.
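For example, an application can check whether flow aware mode is available before relying on it (a sketch):

odp_schedule_capability_t sched_capa;

if (odp_schedule_capability(&sched_capa) == 0) {
    if (sched_capa.max_flow_id == 0) {
        /* flow aware mode is not supported by this implementation */
    }
}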
10.6.1. Flow Aware Scheduling
A flow is a sequence of events that share some application-specific meaning and context. A good example of a flow might be a TCP connection. Various events associated with that connection, such as packets containing connection data, as well as associated timeout events used for transmission control, are logically connected and meaningful to the application processing that TCP connection.
Normally a single flow is associated with an ODP queue. That is, all events
on a given queue belong to the same flow. So the queue id is synonymous with
the flow id for those events. However, this is not without drawbacks. Queues
are relatively heavyweight objects and provide both synchronization as well as
user contexts. The number of queues supported by a given implementation
(max_queues
) may be less than the number of flows an application needs to
be able to process.
To address these needs, ODP allows schedulers to operate in flow aware mode in which flow id is maintained separately as part of each event. Two new APIs:
-
odp_event_flow_id()
-
odp_event_flow_id_set()
are used to query and set a 32-bit flow id associated with individual events. The assignment and interpretation of individual flow ids is under application control.
When operating in flow aware mode, it is the combination of flow id and queue id that is used by the scheduler in making scheduling decisions. So, for example, an Atomic queue would normally be able to dispatch events to only a single thread at a time. When operating in flow aware mode, however, the scheduler will provide this exclusion only when two events on the same atomic queue have the same flow id. If they have different flow ids, then they can be scheduled concurrently to different threads.
Note that when operating in this mode, any sharing of queue context must be done with application-provided synchronization controls (similar to how parallel queues behave).
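To illustrate, a minimal sketch of how an application operating in flow aware mode might tag received packet events with its own flow id; flow_hash() is a hypothetical application function and is not part of the ODP API:
#include <odp_api.h>

/* flow_hash() is a hypothetical application function that maps a packet
 * to a flow identifier (e.g., a 5-tuple hash). */
extern uint32_t flow_hash(odp_packet_t pkt);

/* Sketch: tag an event with an application-defined flow id so that the
 * scheduler, when operating in flow aware mode, can schedule different
 * flows from the same atomic queue concurrently. */
static void tag_event_flow(odp_event_t ev, uint32_t max_flow_id)
{
	odp_packet_t pkt = odp_packet_from_event(ev);
	uint32_t flow = flow_hash(pkt) % (max_flow_id + 1);

	odp_event_flow_id_set(ev, flow);
}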
10.6.2. Scheduler Configuration
After determining the scheduler’s capabilities, but before starting to use
the scheduler to process events, applications must configure the scheduler
by calling odp_schedule_config()
API:
int odp_schedule_config(const odp_schedule_config_t *config);
The odp_schedule_config_t
struct argument must first be initialized to its
default values with odp_schedule_config_init()
API. An application can then
set whatever overrides it needs prior to calling odp_schedule_config()
to
activate them. Note that NULL
is a valid value for the argument if the
application simply wants to use the implementation-defined default
configuration. In the default configuration, the scheduler does not operate in
flow aware mode.
Once configured, odp_schedule()
calls can be made to get events. It is
a programming error to attempt to use the scheduler before it has been
configured.
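Putting these calls together, a hedged sketch of scheduler start-up might look as follows; the max_flow_id field of odp_schedule_config_t is assumed here as the means of enabling flow aware mode:
#include <odp_api.h>

/* Sketch: configure the scheduler after querying its capabilities.
 * Assumes odp_schedule_config_t exposes a max_flow_id field for
 * enabling flow aware mode; passing NULL instead selects defaults. */
static int configure_scheduler(odp_bool_t want_flow_aware)
{
	odp_schedule_capability_t capa;
	odp_schedule_config_t config;

	if (odp_schedule_capability(&capa))
		return -1;

	odp_schedule_config_init(&config);

	if (want_flow_aware && capa.max_flow_id)
		config.max_flow_id = capa.max_flow_id;

	return odp_schedule_config(&config);
}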
10.7. Queue Scheduling Summary
Both ordered and parallel queues improve throughput over atomic queues due to parallel event processing, but require that the application take steps to ensure context data synchronization if needed. The same is true for atomic queues when the scheduler is operating in flow aware mode.
11. Packet Processing
ODP applications are designed to process packets, which are the basic unit of data of interest in the data plane. To assist in processing packets, ODP provides a set of APIs that enable applications to examine and manipulate packet data and metadata. Packets are referenced by an abstract odp_packet_t handle defined by each implementation.
Packet objects are normally created at ingress when they arrive at a source
odp_pktio_t and are received by an application either directly or (more
typically) via a scheduled receive queue. They MAY be implicitly freed when
they are transmitted to an output odp_pktio_t via an associated transmit
queue, or freed directly via the odp_packet_free()
API.
Occasionally an application may originate a packet itself, either directly or by deriving it from an existing packet, and APIs are provided to assist in these cases as well. Application-created packets can be recycled back through a loopback interface to reparse and reclassify them, or the application can do its own parsing as desired.
Various attributes associated with a packet, such as parse results, are stored as metadata and APIs are provided to permit applications to examine and/or modify this information.
11.1. Packet Structure and Concepts
A packet consists of a sequence of octets conforming to an architected format, such as Ethernet, that can be received and transmitted via the ODP pktio abstraction. Packets have a length, which is the number of bytes in the packet. Packet data in ODP is referenced via offsets since these reflect the logical contents and structure of a packet independent of how particular ODP implementations store that data.
These concepts are shown in the following diagram:
Packet data consists of zero or more headers, followed by zero or more bytes of payload, followed by zero or more trailers. Shown here are various APIs that permit applications to examine and navigate various parts of a packet and to manipulate its structure.
To support packet manipulation, predefined headroom and tailroom
areas are logically associated with a packet. Packets can be adjusted by
pulling and pushing these areas. Typical packet processing might consist
of stripping headers from a packet via odp_packet_pull_head()
calls as part of
receive processing and then replacing them with new headers via
odp_packet_push_head()
calls as the packet is being prepared for transmit.
Note that while headroom and tailroom represent reserved areas of memory, these
areas are not addressable or directly usable by ODP applications until they are
made part of the packet via associated push operations. Similarly, bytes
removed via pull operations become part of a packet’s headroom or tailroom
and are again no longer accessible to the application.
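For example, a hedged sketch of the pattern described above, stripping an L2 header on receive and prefixing a replacement header before transmit; ETH_HDR_LEN and new_hdr are assumed application-provided values:
#include <string.h>
#include <odp_api.h>

#define ETH_HDR_LEN 14 /* assumed application constant */

/* Sketch: strip the L2 header on receive and later prepend a new one
 * before transmit. Pulled bytes move back into headroom; push_head
 * makes headroom bytes part of the packet again and returns a pointer
 * to the new start of the packet data. */
static void *strip_and_replace_l2(odp_packet_t pkt, const void *new_hdr)
{
	void *data;

	if (odp_packet_pull_head(pkt, ETH_HDR_LEN) == NULL)
		return NULL; /* packet shorter than the header */

	/* ...application processing of the exposed L3+ payload... */

	data = odp_packet_push_head(pkt, ETH_HDR_LEN);
	if (data == NULL)
		return NULL; /* insufficient headroom */

	memcpy(data, new_hdr, ETH_HDR_LEN);
	return data;
}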
11.2. Packet Segments and Addressing
ODP platforms use various methods and techniques to store and process packets efficiently. These vary considerably from platform to platform, so to ensure portability across them ODP adopts certain conventions for referencing packets.
ODP APIs use a handle of type odp_packet_t to refer to packet objects. Associated with packets are various bits of system metadata that describe the packet. By referring to the metadata, ODP applications accelerate packet processing by minimizing the need to examine packet data. This is because the metadata is populated by parsing and classification functions that are coupled to ingress processing that occurs prior to a packet being presented to the application via the ODP scheduler.
When an ODP application needs to examine the contents of a packet, it requests addressability to it via an API call that makes the packet (or a contiguously addressable segment of it) available for coherent access by the application. To ensure portability, ODP applications assume that the underlying implementation stores packets in segments of implementation-defined and managed size. These represent the contiguously addressable portions of a packet that the application may refer to via normal memory accesses. ODP provides APIs that allow applications to operate on packet segments in an efficient and portable manner as needed. By combining these with the metadata provided by packets, ODP applications can operate in a fully platform-independent manner while still achieving optimal performance across the range of platforms that support ODP.
The use of segments for packet addressing and their relationship to metadata is shown in this diagram:
The packet metadata is set during parsing and identifies the starting offsets
of the various headers in the packet. The packet itself is physically stored
as a sequence of segments that are managed by the ODP implementation.
Segment 0 is the first segment of the packet and is where the packet’s headroom
and headers typically reside. Depending on the length of the packet,
additional segments may be part of the packet and contain the remaining packet
payload and tailroom. The application need not concern itself with segments
except that when the application requires addressability to a packet it
understands that addressability is provided on a per-segment basis. So, for
example, if the application makes a call like odp_packet_l4_ptr()
to obtain
addressability to the packet’s Layer 4 header, the returned length from that
call is the number of bytes from the start of the Layer 4 header that are
contiguously addressable to the application from the returned pointer address.
This is because the following byte may occupy a different segment and be
stored elsewhere. To obtain access to those bytes, the application simply
requests addressability to that offset and it will be able to address the
packet bytes that occupy the next segment, etc. Note that the returned
length for any packet addressability call is always the lesser of the remaining
packet length or size of its containing segment. So a mapping for segment 2
in the above figure, for example, would return a length that extends only to
the end of the packet since the remaining bytes are part of the tailroom
reserved for the packet and are not usable by the application until made
available to it by an appropriate API call.
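A hedged sketch of how an application might walk all segments of a packet using odp_packet_offset(), which returns a pointer plus the number of contiguously addressable bytes at a given offset (the segment argument is assumed to accept NULL when not needed):
#include <odp_api.h>

/* Sketch: visit every byte of a packet one segment at a time.
 * odp_packet_offset() returns the address of the byte at 'offset' and
 * stores the contiguously addressable length in 'seg_len'. */
static void walk_segments(odp_packet_t pkt)
{
	uint32_t offset = 0;
	uint32_t pkt_len = odp_packet_len(pkt);

	while (offset < pkt_len) {
		uint32_t seg_len;
		void *data = odp_packet_offset(pkt, offset, &seg_len, NULL);

		if (data == NULL)
			break; /* should not happen for offset < pkt_len */

		/* ...process seg_len contiguous bytes starting at data... */

		offset += seg_len;
	}
}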
While the push/pull APIs permit applications to perform efficient manipulation
of packets within the current segment structure, ODP also provides APIs that
permit segments to be added or removed. The odp_packet_extend_head()
and
odp_packet_trunc_head()
APIs permit segments to be added or removed from
the beginning of a packet, while odp_packet_extend_tail()
and
odp_packet_trunc_tail()
permit segments to be added or removed from the end
of a packet. Extending a packet adds one or more segments to permit packets to
grow up to implementation-defined limits. Truncating a packet removes one or
more segments to shrink the size of a packet beyond its initial or final
segment.
11.3. Metadata Processing
As noted, packet metadata is normally set by the parser as part of
classification that occurs during packet receive processing. It is important
to note that this metadata may be changed by the application to reflect
changes in the packet contents and/or structure as part of its processing of
the packet. While changing this metadata may affect some ODP APIs, changing
metadata is designed to document application changes to the packet but
does not in itself cause those changes to be made. For example, if an
application changes the Layer 3 offset by using the odp_packet_l3_offset_set()
API, subsequent calls to odp_packet_l3_ptr() will return an address starting from that changed offset. However, changing an attribute like
odp_packet_has_udp_set()
will not, by itself, turn a non-UDP packet into
a valid UDP packet. Applications are expected to exercise appropriate care
when changing packet metadata to ensure that the resulting metadata changes
reflect the actual changed packet structure that the application has made.
11.4. Packet Manipulation
ODP Packet manipulation APIs can be divided into two categories: Those that do not change a packet’s segment structure, and those that potentially do change this structure. We’ve already seen one example of this. The push/pull APIs permit manipulation of packet headroom/tailroom that does not result in changes to packet segmentation, while the corresponding extend/trunc APIs provide the same functionality but with the potential that segments may be added to or removed from the packet as part of the operation.
The reason for having two different types of APIs that perform similar functions is that it is expected that on most implementations operations that do not change packet segment structure will be more efficient than those that do. To account for this, APIs that potentially involve a change in packet segmentation always take an output odp_packet_t parameter or return value. Applications are expected to use this new handle for the resulting packet instead of the old (input) handle as the implementation may have returned a new handle that now represents the transformed packet.
To enable applications that manipulate packets this way to operate most
efficiently the return codes from these APIs follow a standard convention. As
usual, return codes less than zero indicate error and result in no change to
the input packet. A return code of zero indicates success, but also indicates
that any cached addressability to the packet is still valid. Return codes
greater than zero also indicate success but with a potential change to packet
addressability. For example, if an application had previously obtained
addressability to a packet’s Layer 3 header via the odp_packet_l3_ptr()
API,
a return code of zero would mean that the application may continue to use that
pointer for access to the L3 header, while a return code greater than zero
would mean that the application should reissue that call to re-obtain
addressability as the packet segmentation may have changed and hence the old
pointer may no longer be valid.
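The following hedged sketch illustrates this convention using odp_packet_extend_head(); note that the packet handle is passed by reference and may be updated by the call:
#include <odp_api.h>

/* Sketch: prepend 'len' bytes to a packet, honoring the return code
 * convention: <0 error, 0 success with addressability unchanged,
 * >0 success but cached pointers must be re-obtained. */
static int prepend_bytes(odp_packet_t *pkt, uint32_t len, void **l3_ptr)
{
	int rc = odp_packet_extend_head(pkt, len, NULL, NULL);

	if (rc < 0)
		return -1; /* extend failed, *pkt is unchanged */

	if (rc > 0) {
		/* Segmentation may have changed: refresh cached pointers */
		*l3_ptr = odp_packet_l3_ptr(*pkt, NULL);
	}

	return 0;
}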
11.4.1. Packet Copying
One of the simplest manipulations that can be done is to make a copy of all or
part of a packet. The odp_packet_copy()
and odp_packet_copy_part()
APIs
are used to return a new packet that contains either the entirety or a
selected part of an existing packet. Note that these operations also specify
the packet pool from which the new packet is to be drawn.
11.4.2. Packet Data Copying and Moving
ODP provides several APIs to enable portions of a packet to be copied either to or from a memory area, another packet, or within a single packet, as illustrated below:
These APIs provide bounds checking when the source or destination is an ODP packet. This means that data must lie in the offset range 0..odp_packet_len()-1. For operations involving memory areas, the caller takes responsibility for ensuring that the memory areas referenced by odp_packet_copy_to/from_mem() are valid.
When manipulating data within a single packet, two similar APIs are provided:
odp_packet_copy_data()
and odp_packet_move_data()
. Of these, the move
operation is more general and may be used even when the source and destination
data areas overlap. The copy operation must only be used if the caller knows
that the two areas do not overlap, and may result in more efficient operation.
When dealing with overlapping memory areas, odp_packet_move_data()
operates
as if the source area was first copied to a non-overlapping separate memory
area and then copied from that area to the destination area.
11.4.3. Adding and Removing Packet Data
The various copy/move operations discussed so far only affect the data contained in a packet and do not change its length. Data can also be added to
or removed from a packet via the odp_packet_add_data()
and
odp_packet_rem_data()
APIs as shown below:
Adding data simply creates the requested amount of "space" within the packet
at the specified offset. The length of the packet is increased by the number
of added bytes. The contents of this space upon successful completion
of the operation are unspecified. It is the application’s responsibility to then
fill this space with meaningful data, e.g., via a subsequent
odp_packet_copy_from_mem()
or odp_packet_copy_from_pkt()
call.
Removing data from a packet has the opposite effect. The specified number of bytes at the designated offset are removed from the packet and the resulting "hole" is collapsed so that the remainder of the packet immediately follows the removal point. The resulting packet length is decreased by the number of removed bytes.
Note that adding or removing data from a packet may affect packet segmentation,
so the application must use the returned packet handle and abide by the
return code results of the operation. Whether or not segmentation is
changed by these operations, the amount of available packet headroom and/or
tailroom may also be changed by these operations, so again applications should
not attempt to cache the results of prior odp_packet_headroom()
or
odp_packet_tailroom()
calls across these APIs.
11.4.4. Packet Splitting and Concatenation
Another type of manipulation is to split a packet into two packets as shown below:
The odp_packet_split()
API indicates the split point by specifying the
resulting desired length of the original packet. Upon return, the original
packet ends at the specified split point and the new "tail" is returned as
its own separate packet. Note that this new packet will always be from the same
packet pool as the original packet.
The opposite operation is performed by the odp_packet_concat()
API. This API
takes a destination and source packet as arguments and the result is that
the source packet is concatenated to the destination packet and ceases to
have any separate identity. Note that it is legal to concatenate a packet to
itself, in which case the result is a packet with double the length of the
original packet.
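A hedged sketch of splitting a packet at a given length and later rejoining the two pieces:
#include <odp_api.h>

/* Sketch: split 'pkt' so that it retains 'head_len' bytes and the
 * remainder becomes a new packet, then concatenate them back. Handles
 * are passed by reference because both calls may return new handles. */
static int split_and_rejoin(odp_packet_t *pkt, uint32_t head_len)
{
	odp_packet_t tail;

	if (odp_packet_split(pkt, head_len, &tail) < 0)
		return -1;

	/* ...process the two packets independently... */

	if (odp_packet_concat(pkt, tail) < 0) {
		odp_packet_free(tail); /* concat failed; tail is still separate */
		return -1;
	}

	return 0; /* tail handle is no longer valid after a successful concat */
}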
11.4.5. Packet Realignment
As previously discussed, packets are divided into implementation-defined
segments that normally don’t concern applications since contiguous
addressability extents are returned as part of APIs such as
odp_packet_offset()
. However, if the application has performed a lot of
manipulation or processing on a packet, this can sometimes result in segment
boundaries appearing at inconvenient locations, such as in the middle of
headers or individual fields, or for headers to become misaligned with respect
to their addresses in memory. This can make subsequent processing of the
packet inefficient.
To address these issues, ODP provides a means of realigning a packet to allow for more efficient processing as shown below:
Input to odp_packet_align()
specifies the number of contiguous bytes that
are needed at a given packet offset as well as the memory alignment required
for that offset. A value of zero may be specified for either as a "don’t care"
value. If these criteria are already satisfied then the call is an effective
no-op and will result in a return code of zero to tell the caller that all is
well. Otherwise, the packet will be logically "shifted" within its containing
segment(s) to achieve the requested addressability and alignment constraints,
if possible, and a return code greater than zero will result.
The requested operation may fail for a number of reasons. For example, the call may fail if the caller requests contiguous addressability to a portion of the packet larger than the underlying segment size. The call may also fail if the requested alignment is too high. Alignment limits will vary among different ODP implementations, however ODP requires that all implementations support requested alignments of at least 32 bytes.
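A hedged sketch of using odp_packet_align() to guarantee that an area of interest is contiguously addressable and suitably aligned:
#include <odp_api.h>

/* Sketch: ensure 'len' contiguous bytes at 'offset', aligned to 'align'
 * bytes (requests of at least 32-byte alignment must be supported).
 * Returns a pointer to the realigned area, or NULL on failure. */
static void *make_addressable(odp_packet_t *pkt, uint32_t offset,
			      uint32_t len, uint32_t align)
{
	uint32_t seg_len;

	if (odp_packet_align(pkt, offset, len, align) < 0)
		return NULL; /* request could not be satisfied */

	/* Success (rc 0 or >0): the area is now contiguous at 'offset' */
	return odp_packet_offset(*pkt, offset, &seg_len, NULL);
}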
11.5. Packet References
To support efficient multicast, retransmit, and related processing, ODP supports two additional types of packet manipulation: static and dynamic references. A reference is a lightweight mechanism for creating aliases to packets as well as to create packets that share data bytes with other packets to avoid unnecessary data copying.
11.5.1. Static References
The simplest type of reference is the static reference. A static reference is created by the call:
ref_pkt = odp_packet_ref_static(pkt);
If the reference fails, ODP_PACKET_INVALID
is returned and pkt
remains unchanged.
The effect of this call is shown below:
A static reference provides a simple and efficient means of creating an alias
for a packet handle that prevents the packet itself from being freed until all
references to it have been released via odp_packet_free()
calls. This is
useful, for example, to support retransmission processing, since as part of
packet TX processing, odp_pktout_send()
or odp_tm_enq()
will free
the packet after it has been transmitted.
odp_packet_ref_static()
might be used in a transmit routine wrapper
function like:
int xmit_pkt(odp_pktout_queue_t queue, odp_packet_t pkt)
{
odp_packet_t ref = odp_packet_ref_static(pkt);
return ref == ODP_PACKET_INVALID ? -1 : odp_pktout_send(queue, &ref, 1);
}
This transmits a reference to pkt
so that pkt
is retained by the caller,
which means that the caller is free to retransmit it if needed at a later
time. When a higher level protocol (e.g., receipt of a TCP ACK packet)
confirms that the transmission was successful, pkt
can then be discarded via
an odp_packet_free()
call.
The key characteristic of a static reference is that because there are multiple independent handles that refer to the same packet, the caller should treat the packet as read only following the creation of a static reference until all other references to it are freed. This is because all static references are simply aliases of the same packet, so if multiple threads were independently manipulating the packet this would lead to unpredictable race conditions.
To assist in determining whether there are other references to a packet, ODP provides the API:
int odp_packet_has_ref(odp_packet_t pkt);
that indicates whether other packets exist that share bytes with this packet. If this routine returns 0 then the caller can be assured that it is safe to modify the packet, as this handle is the only reference to it.
11.5.2. Dynamic References
While static references are convenient and efficient, they are limited by the need to be treated as read only. For example, consider an application that needs to multicast a packet. Here the same packet needs to be sent to two or more different destinations. While the packet payload may be the same, each sent copy of the packet requires its own unique header to specify the destination that is to receive the packet.
To address this need, ODP provides dynamic references. These are created by the call:
ref_pkt = odp_packet_ref(pkt, offset);
The offset
parameter specifies the byte offset into pkt
at which the
reference is to begin. This must be in the range 0..odp_packet_len(pkt)-1. As before, if the reference cannot be
created ODP_PACKET_INVALID
is returned and pkt
is unchanged, otherwise the
result is as shown below:
Following a successful reference creation, the bytes of pkt
beginning at
offset offset
are shared with the created reference. These bytes should be
treated as read only since multiple references point to them. Each reference,
however, still retains its own individual headroom and metadata that is not
shared with any other reference. This allows unique headers to be created by
calling odp_packet_push_head()
or odp_packet_extend_head()
on either
handle. This allows multiple references to the same packet to prefix unique
headers onto common shared data so that they can be properly multicast
using code such as:
int pkt_fanout(odp_packet_t payload, odp_queue_t fanout_queue[], int num_queues)
{
int i;
for (i = 0; i < num_queues; i++)
odp_queue_enq(fanout_queue[i], odp_packet_ref(payload, 0));
}
Receiver worker threads can then operate on each reference to the packet in parallel to prefix a unique transmit header onto it and send it out.
11.5.3. Dynamic References with Headers
The dynamic references discussed so far have one drawback in that the headers needed to make each reference unique must be constructed individually after the reference is created. To address this problem, ODP allows these headers to be created in advance and then simply prefixed to a base packet as part of reference creation:
ref_pkt = odp_packet_ref_pkt(pkt, offset, hdr_pkt);
Here rather than creating a reference with a null header, a header packet is supplied that is prefixed onto the reference. The result looks like this:
So now multicasting can be more efficient using code such as:
int pkt_fanout_hdr(odp_packet_t payload, odp_queue_t fanout_queue[],
odp_packet_t hdr[], int num_queues)
{
int i;
for (i = 0; i < num_queues; i++)
odp_queue_enq(fanout_queue[i],
odp_packet_ref_pkt(payload, 0, hdr[i]));
}
Now each individual reference has its own header already prefixed to it ready for transmission.
Note that when multiple references like this are made they can each have their own offset. So if the following code is executed:
ref_pkt1 = odp_packet_ref_pkt(pkt, offset1, hdr_pkt1);
ref_pkt2 = odp_packet_ref_pkt(pkt, offset2, hdr_pkt2);
the result will look like:
Here two separate header packets are prefixed onto the same shared packet, each at their own specified offset, which may or may not be the same. The result is three packets visible to the application:
-
The original pkt, which can still be accessed and manipulated directly.
-
The first reference, which consists of hdr_pkt1 followed by the bytes contained in pkt starting at offset1.
-
The second reference, which consists of hdr_pkt2 followed by the bytes contained in pkt starting at offset2.
Only a single copy of the bytes in pkt that are common to the references exists.
Data Sharing with References
Because a referenced packet is a shared object, applications must observe certain disciplines when working with references. For best portability and
reliability, the shared data contained in any packet referred to by references
should be treated as read only once it has been successfully referenced until
it is known that all references to it have been freed.
To assist applications in working with references, ODP provides the additional API:
int odp_packet_has_ref(odp_packet_t pkt);
The odp_packet_has_ref()
API indicates whether any other packets
exist that share any bytes with this packet.
Compound References
Note that architecturally ODP does not limit referencing and so it is possible that a reference may be used as a basis for creating another reference. The result is a compound reference that should still behave as any other reference.
As noted earlier, the intent behind references is that they are lightweight objects that can be implemented without requiring data copies. The existence of compound references may complicate this goal for some implementations. As a result, implementations are always free to perform partial or full copies of packets as part of any reference creation call.
Note also that a packet may not reference itself, nor may circular reference relationships be formed, e.g., packet A is used as a header for a reference to packet B and B is used as a header for a reference to packet A. Results are undefined if such circular references are attempted.
11.6. Packet Parsing, Checksum Processing, and Overrides
Packet parsing is normally triggered automatically as part of packet RX processing. However, the application can trigger parsing explicitly via the API:
int odp_packet_parse(odp_packet_t pkt, uint32_t offset,
const odp_packet_parse_param_t *param);
This is typically done following packet decapsulation or other preprocessing
that would prevent RX parsing from "seeing" the relevant portion of the
packet. The odp_packet_parse_param_t
struct that is passed controls the
depth of the desired parse, as well as whether checksum validation should be
performed as part of the parse, and if so which checksums require this
processing.
Packets containing Layer 3 (IPv4) and Layer 4 (TCP, UDP, SCTP) checksums can have these validated (on RX) and generated (on TX) automatically. This is normally controlled by the settings on the PktIOs that receive/transmit them, however they can also be controlled on an individual packet basis.
Packets have associated odp_packet_chksum_status_t
metadata that indicates
the state of any checksums contained in that packet. These can be queried via
the APIs odp_packet_l3_chksum_status()
and odp_packet_l4_chksum_status()
,
respectively. Checksums can either be known good, known bad, or unknown, where
unknown means that checksum validation processing has not occurred or the
attempt to validate the checksum failed.
Similarly, the odp_packet_l3_chksum_insert()
and
odp_packet_l4_chksum_insert()
APIs may be used to override default checksum
processing for individual packets prior to transmission. If no explicit
checksum processing is specified for a packet, then any checksum generation
is controlled by the PktIO configuration of the interface used to transmit it.
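As a hedged sketch, checking the validated checksum status on receive and requesting checksum insertion on transmit for a single packet might look like this; the exact enum names are assumptions based on the odp_packet_chksum_status_t type:
#include <odp_api.h>

/* Sketch: drop packets with known-bad L3/L4 checksums on RX and request
 * checksum generation for this packet on TX, overriding the PktIO
 * default. Enum values are assumptions per the description above. */
static int prepare_for_forwarding(odp_packet_t pkt)
{
	if (odp_packet_l3_chksum_status(pkt) == ODP_PACKET_CHKSUM_BAD ||
	    odp_packet_l4_chksum_status(pkt) == ODP_PACKET_CHKSUM_BAD) {
		odp_packet_free(pkt);
		return -1;
	}

	/* Request L3 and L4 checksum insertion for this packet on TX */
	odp_packet_l3_chksum_insert(pkt, 1);
	odp_packet_l4_chksum_insert(pkt, 1);
	return 0;
}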
12. PktIO Processing
Before packets can be manipulated they typically need to be received and after they are manipulated they need to be transmitted. The ODP abstraction that captures these operations is the Packet I/O (PktIO). PktIOs are represented by handles of type odp_pktio_t and represent a logical I/O interface that is mapped in an implementation-defined manner to an underlying integrated I/O adapter or NIC.
PktIO objects are manipulated through various state transitions via
odp_pktio_xxx()
API calls as shown below:
PktIOs begin in the Unallocated state. From here a call to odp_pktio_open()
is used to create an odp_pktio_t handle that is used in all subsequent calls
to manipulate the object. This call puts the PktIO into the Unconfigured
state. To become operational, a PktIO must first be
configured for Input, Output, or both Input and Output via the
odp_pktin_queue_config()
and/or odp_pktout_queue_config()
APIs, and then
started via the odp_pktio_start()
API to make it Ready.
Following the completion of I/O processing, the odp_pktio_stop()
API returns
the PktIO to the Configured state. From here it may be Reconfigured via
additional odp_pktin_queue_config()
and/or odp_pktout_queue_config()
calls,
or Closed via the odp_pktio_close()
API to return the PktIO to the
Unallocated state.
12.1. PktIO Allocation
PktIO objects begin life by being opened with odp_pktio_open()
API:
odp_pktio_t odp_pktio_open(const char *name,
odp_pool_t pool,
const odp_pktio_param_t *param)
The function has three arguments: a name, which is an implementation-defined string that identifies the logical interface to be opened, a pool that identifies the ODP pool that storage for received packets should be allocated from, and a param structure that specifies I/O options to be associated with this PktIO instance.
ODP defines "loop" as a reserved name to indicate that this PktIO represents a loopback interface. Loopback interfaces are useful as a means of recycling packets back for reclassification after decryption or decapsulation, as well as for diagnostic or testing purposes. For example, when receiving IPsec traffic, the classifier is able to recognize that the traffic is IPsec, however until the traffic is decrypted it is unable to say what that traffic contains. So following decryption, sending the decrypted packet back to a loopback interface allows the classifier to take a "second look" at the packet and properly classify the decrypted payload. Similar considerations apply to tunneled packets that must first be decapsulated to reveal the true payload.
The pool specifies the default pool to use for packet allocation if not overridden by the classifier due to a specific or default Class-of-Service (CoS) match on the packet. The param struct, in turn, specifies the input and output modes of the PktIO.
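A hedged sketch of opening a PktIO for direct-mode input and output; "eth0" is a placeholder interface name and pkt_pool an already-created packet pool:
#include <odp_api.h>

/* Sketch: open a PktIO for DIRECT-mode RX and TX. The interface name
 * "eth0" is a placeholder; pkt_pool must be an existing packet pool. */
static odp_pktio_t open_interface(odp_pool_t pkt_pool)
{
	odp_pktio_param_t param;

	odp_pktio_param_init(&param);
	param.in_mode  = ODP_PKTIN_MODE_DIRECT;
	param.out_mode = ODP_PKTOUT_MODE_DIRECT;

	return odp_pktio_open("eth0", pkt_pool, &param);
}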
12.2. PktIO Capabilities and PktIn/PktOut Configuration
Associated with each PktIO is a set of capabilities that provide information
such as the maximum number of input/output queues it supports, its configuration
options, and the operations it supports. These are aggregated into the odp_pktio_capability_t struct, which is returned by the odp_pktio_capability() API.
12.2.1. PktIn Configuration
For PktIOs that will receive packets, the odp_pktin_config_opt_t
struct
controls the RX processing to be performed on received packets. For example,
odp_pktin_config_opt_t
includes options for controlling packet timestamping as
well as default packet checksum verification processing.
12.2.2. PktIO Parsing Configuration
For RX processing, packets may also be parsed automatically as part of receipt
as controlled by the odp_pktio_parser_config_t
struct.
12.2.3. PktOut Configuration
For PktIOs that will transmit packets, the odp_pktout_config_opt_t
struct
controls the TX processing to be performed on transmitted packets. For example,
odp_pktout_config_opt_t
includes options for controlling checksum insertion
for transmitted packets.
12.3. PktIO Input and Output Modes
PktIO objects support four different Input and Output modes that may be specified independently at open time.
-
ODP_PKTIN_MODE_DIRECT
-
ODP_PKTIN_MODE_QUEUE
-
ODP_PKTIN_MODE_SCHED
-
ODP_PKTIN_MODE_DISABLED
-
ODP_PKTOUT_MODE_DIRECT
-
ODP_PKTOUT_MODE_QUEUE
-
ODP_PKTOUT_MODE_TM
-
ODP_PKTOUT_MODE_DISABLED
The DISABLED modes indicate that either input or output is prohibited on this
PktIO. Attempts to receive packets on a PktIO whose in_mode
is DISABLED
return no packets while packets sent to a PktIO whose out_mode
is DISABLED
are discarded.
12.3.1. Direct I/O Modes
DIRECT I/O is the default mode for PktIO objects. It is designed to support poll-based packet processing, which is often found in legacy applications being ported to ODP, and can also be a preferred mode for some types of packet processing. By supporting poll-based I/O processing, ODP provides maximum flexibility to the data plane application writer.
Direct RX Processing
The processing of DIRECT input is shown below:
In DIRECT mode, received packets are stored in one or more special PktIO queues
of type odp_pktin_queue_t and are retrieved by threads calling the
odp_pktin_recv()
API.
Once opened, setting up a DIRECT mode PktIO is performed by the
odp_pktin_queue_config()
API, whose purpose is to specify the number of PktIn
queues to be created and to set their attributes.
It is important to note that while odp_pktin_queue_config()
creates a
requested number of RX queues that are associated with the PktIO and accepts
optimization advice as to how the application intends to use them, i.e.,
whether the queues need to be safe for concurrent use by multiple threads
(OP_MT) or only one thread at a time (OP_MT_UNSAFE), these queues are not
associated with any specific thread. Applications use a discipline
appropriate to their design, which may involve restricting PktIn queue use
to separate threads, but that is an aspect of the application design. ODP
simply provides a set of tools here, but it is the application that determines
how those tools are used.
Hash Processing
Another feature of DIRECT mode input is the provision of a hash function used
to distribute incoming packets among the PktIO’s PktIn queues. If the
hash_enable
field of the odp_pktin_queue_param_t is true
,
then the hash_proto
field is used to specify which field(s) of incoming
packets should be used as input to an implementation-defined packet
distribution hash function.
Note that the hash function used in PktIO poll mode operation is intended to provide simple packet distribution among multiple PktIn queues associated with the PktIO. It does not have the sophistication of the ODP Classifier, however it also does not incur the setup requirements of pattern matching rules, making it a simpler choice for less sophisticated applications. Note that ODP does not specify how the hash is to be performed. That is left to each implementation. The hash only specifies which input packet fields are of interest to the application and should be considered by the hash function in deciding how to distribute packets among PktIn queues. The only expectation is that packets that have the same hash values should all be mapped to the same PktIn queue.
PktIn Queues
A PktIn Queue is a special type of queue that is used internally by PktIOs
operating in DIRECT mode. Applications cannot perform enqueues to these queues,
however they may obtain references to them via the odp_pktin_queue()
API.
Once configured, prior to receiving packets the PktIO must be placed into the
Ready state via a call to odp_pktio_start()
. Once started, the PktIn queue
handles are used as arguments to odp_pktin_recv()
to receive packets from the
PktIO.
Note that it is the caller’s responsibility to ensure that PktIn queues are used correctly. For example, it is an error for multiple threads to attempt to perform concurrent receive processing on the same PktIn queue if that queue has been marked MT_UNSAFE. Performance MAY be improved if the application observes the discipline of associating each PktIn queue with a single RX thread (in which case the PktIn queue can be marked MT_UNSAFE), however this is up to the application to determine how best to structure itself.
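A hedged sketch of a poll-mode RX loop built on these calls; MAX_BURST is an assumed application constant and error handling is omitted for brevity:
#include <odp_api.h>

#define MAX_BURST 32 /* assumed application constant */

/* Sketch: retrieve the first PktIn queue of a started DIRECT-mode PktIO
 * and poll it for bursts of packets. */
static void rx_poll_loop(odp_pktio_t pktio)
{
	odp_pktin_queue_t inq;
	odp_packet_t pkts[MAX_BURST];

	if (odp_pktin_queue(pktio, &inq, 1) < 1)
		return; /* no PktIn queues configured */

	while (1) {
		int n = odp_pktin_recv(inq, pkts, MAX_BURST);

		for (int i = 0; i < n; i++) {
			/* ...process pkts[i]... */
			odp_packet_free(pkts[i]);
		}
	}
}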
Direct TX Processing
A PktIO operating in DIRECT mode performs TX processing as shown here:
Direct TX processing operates similarly to Direct RX processing. Following
open, the odp_pktout_queue_config()
API is used to create and configure
one or more PktOut queues of type odp_pktout_queue_t to be used for packet
transmission by this PktIO. As with PktIn queues, the handles for these created
PktOut queues may be retrieved by the odp_pktout_queue()
API.
Once the PktIO has been configured for output and started via
odp_pktio_start()
, packets may be transmitted to the PktIO by calling
odp_pktout_send()
:
int odp_pktout_send(odp_pktout_queue_t queue,
const odp_packet_t packets[],
int num)
Note that the first argument to this call specifies the PktOut queue that the packet is to be added to rather than the PktIO itself. This permits multiple threads (presumably operating on different cores) a more efficient means of separating I/O processing destined for the same interface.
12.3.2. Queued I/O Modes
To provide additional flexibility when operating in poll mode, PktIOs may also be opened in QUEUE Mode. The difference between DIRECT and QUEUE mode is that QUEUE mode uses standard ODP event queues to service packets.
Queue RX Processing
The processing for QUEUE input processing is shown below:
In QUEUE mode, received packets are stored in one or more standard ODP queues.
The difference is that these queues are not created directly by the
application. Instead, they are created in response to an
odp_pktin_queue_config()
call.
As with DIRECT mode, the odp_pktin_queue_param_t
specified to this call
indicates whether an input hash should be used and if so which field(s) of
the packet should be considered as input to the hash function.
The main difference between DIRECT and QUEUE RX processing is that because
the PktIO uses standard ODP event queues, other parts of the application can
use odp_queue_enq()
API calls to enqueue packets to these queues for
"RX" processing in addition to those originating from the PktIO interface
itself. To obtain the handles of these input queues, the
odp_pktin_event_queue()
API is used.
Similarly, threads receive packets from PktIOs operating in QUEUE mode by
making standard odp_queue_deq()
calls to one of the event queues associated
with the PktIO.
Queue TX Processing
Transmit processing for PktIOs operating in QUEUE mode is shown below:
For TX processing, QUEUE mode behaves similarly to DIRECT mode except that
output queues are regular ODP event queues that receive packets via
odp_queue_enq()
calls rather than special PktOut queues that use
odp_pktout_send()
. Again, these queues are created via a call to
odp_pktout_queue_config()
following odp_pktio_open()
.
The main reason for selecting QUEUE mode for output is flexibility. If an application is designed to use a pipeline model where packets flow through a series of processing stages via queues, then having the PktIO in QUEUE mode means that the application can always use the same enq APIs to pass packets from one stage to the next, including the final transmit output stage.
12.3.3. Scheduled I/O Modes
The final PktIO mode supported integrates RX and TX processing with the ODP event model. For RX processing this involves the use of the Scheduler while for TX processing this involves the use of the Traffic Manager.
Scheduled RX Processing is further divided based on whether or not the Classifier is used.
Scheduled RX Processing
When a PktIO is opened with ODP_PKTIN_MODE_SCHED
, it indicates that the
input queues created by a subsequent odp_pktin_queue_config()
call are to
be used as input to the ODP Scheduler.
For basic use, SCHED mode simply associates the PktIO input event queues
created by odp_pktin_queue_config()
with the scheduler. Hashing may still be
employed to distribute input packets among multiple input queues. However
instead of these being plain queues they are scheduled queues and have
associated scheduling attributes like priority, scheduler group, and
synchronization mode (parallel, atomic, ordered). SCHED mode thus provides
both packet distribution (via the optional hash) as well as scalability via
the ODP event model.
In its fullest form, PktIOs operating in SCHED mode use the ODP Classifier to permit fine-grained flow separation on Class of Service (CoS) boundaries.
In this mode of operation, the hash function of odp_pktin_queue_config()
is
typically not used. Instead, the event queues created by this call,
as well as any additional event queues created via separate
odp_queue_create()
calls are associated with classes of service via
odp_cls_cos_create()
calls. Classification is enabled for the PktIO as a
whole by assigning a default CoS via the odp_pktio_default_cos_set()
API.
When operating in SCHED mode, applications do not call PktIn receive functions. Instead the PktIn queues are scanned by the scheduler and, if classification is enabled on the PktIO, inbound packets are classified and put on queues associated with their target class of service, which are themselves scheduled to threads. Note that on platforms that support hardware classification and/or scheduling these operations will typically be performed in parallel as packets are arriving, so this description refers to the logical sequence of classification and scheduling, and does not imply that this is a serial process.
Scheduled TX Processing
Scheduled transmit processing is performed via the ODP Traffic Manager and
is requested when a PktIO is opened with an out_mode
of ODP_PKTOUT_MODE_TM
.
For TX processing via the Traffic Manager, applications use the odp_tm_enq()
API. See the Traffic Manager section of this document for more information
about Traffic Manager configuration and operation.
13. Timers and Timeout Events
The ODP Timer APIs offer a set of functions that permit applications to react to the passage of time, and are designed to reflect the underlying hardware timing features found in various platforms that support ODP implementations.
Timers are drawn from specialized pools called timer pools that have their
own abstract type (odp_timer_pool_t
). Each timer pool is a logically
independent time source with its own resolution measured in nanoseconds (ns)
and a maximum number of timers that it can support. The maximum resolution can be obtained via the odp_timer_capability() API. Applications can have many
timers active at the same time and can set them to use either relative or
absolute time. Associated with each timer is a queue that is to receive events
when this timer expires. This queue is created by a separate
odp_queue_create()
call that is passed as a parameter to odp_timer_alloc()
.
Timeouts are specialized events of type odp_timeout_t
that are used to
represent the expiration of timers. Timeouts are drawn from pools of type
ODP_POOL_TIMEOUT
that are created by the standard odp_pool_create()
API.
Timeout events are associated with timers when those timers are set and are
enqueued to their timer’s associated queue whenever a set timer expires. So the
effect of timer expiration is a timeout event being added to a queue and
delivered via normal ODP event scheduling.
The following diagrams show the life cycle of timers and timeout events. Transitions in these finite state machines are marked by the event triggering them. Events marked in green are common to both state machines, i.e., trigger both state machines.
Reminder:
On a timer expire
event, the related timeout event is enqueued to the timer’s associated queue.
Timers measure time in ticks rather than nanoseconds because each timer pool
may have its own time source and associated conversion ratios. It is thus more
efficient to manipulate time in these native tick values. As a result, time measured in nanoseconds must be converted to and from timer-pool-specific tick
values via the conversion functions odp_timer_ns_to_tick()
and
odp_timer_tick_to_ns()
as needed. Both of these functions take a timer pool
as an input parameter to enable the pool-specific conversion ratios to be
used.
Associated with each timer pool is a free running tick counter that can be
sampled at any time via the odp_timer_current_tick()
API. Timers are started
with odp_timer_start()
and the expiration time can be an absolute future tick
value or a future tick value relative to the current tick. Implementations may
impose minimum and maximum future values supported by a given timer pool and
timer start operations will fail if the requested value is outside of the
supported range.
Before a set timer expires, it can be canceled via the odp_timer_cancel()
API. A successful cancel has the same effect as if the timer were never set.
An attempted cancel will fail if the timer is not set or if it has already
expired.
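A hedged sketch of arming a one-shot relative timer; the field names of the odp_timer_start_t descriptor (tick_type, tick, tmo_ev) are assumptions based on the API description above and should be confirmed against the target ODP release:
#include <odp_api.h>

/* Sketch: arm 'timer' to expire 'ns' nanoseconds from now, delivering a
 * timeout event allocated from 'tmo_pool'. Field names of
 * odp_timer_start_t are assumptions per the description above. */
static int arm_timer(odp_timer_pool_t tp, odp_timer_t timer,
		     odp_pool_t tmo_pool, uint64_t ns)
{
	odp_timeout_t tmo = odp_timeout_alloc(tmo_pool);
	odp_timer_start_t start;

	if (tmo == ODP_TIMEOUT_INVALID)
		return -1;

	start.tick_type = ODP_TIMER_TICK_REL;
	start.tick      = odp_timer_ns_to_tick(tp, ns);
	start.tmo_ev    = odp_timeout_to_event(tmo);

	return odp_timer_start(timer, &start);
}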
13.1. Timer Pool Management
To facilitate implementation of the ODP timer APIs, an additional timer API is
provided. During initialization, applications are expected to create the timer
pools they need and then call odp_timer_pool_start_multi()
. Following start,
applications may allocate, set, cancel, and free timers from their associated
timer pools. During termination processing, after all timers allocated from a
timer pool have been freed, the pool itself should be released via a call to
odp_timer_pool_destroy()
.
13.2. Timeout Event Management
The purpose of ODP timers is to schedule their associated timeout events, which are how applications actually react to the passage of time. To help with this, several additional APIs and conventions are provided.
Timer allocation is performed via the odp_timer_alloc()
API:
odp_timer_t odp_timer_alloc(odp_timer_pool_t timer_pool,
odp_queue_t queue,
const void *user_ptr)
Note that in addition to the timer pool and queue, a user pointer is provided.
This is to allow context associated with the timeout to be communicated. Upon
receiving a timeout event, the application can use the odp_timeout_user_ptr()
API to retrieve the user pointer associated with the timer that triggered this
event.
A worker thread receiving events that may include timeouts might be structured as follows:
while (1) {
    ev = odp_schedule(&from, ODP_SCHED_WAIT);

    switch (odp_event_type(ev)) {
    case ODP_EVENT_TIMEOUT: {
        odp_timeout_t timeout = odp_timeout_from_event(ev);
        odp_timer_t timer = odp_timeout_timer(timeout);
        void *userptr = odp_timeout_user_ptr(timeout);
        uint64_t expiration = odp_timeout_tick(timeout);

        ...process the timeout event
        break;
    }
    ...process other event types
    }
}
14. Cryptographic services
ODP provides APIs to perform cryptographic operations required by applications. ODP cryptographic APIs are session based and provide cryptographic algorithm offload services. ODP also offers cryptographic protocol offload services for protocols such as IPsec using a different set of APIs. This section covers the main crypto APIs.
ODP provides APIs for the following cryptographic services:
-
Ciphering
-
Authentication/data integrity via Keyed-Hashing (HMAC)
-
Random number generation
-
Crypto capability inquiries
14.1. Crypto Sessions
To apply a cryptographic operation to a packet a session must be created. All packets processed by a session share the parameters that define the session.
ODP supports synchronous and asynchronous crypto sessions. For asynchronous sessions, the output of crypto operation is posted in a queue defined as the completion queue in its session parameters.
ODP crypto APIs support chained operation sessions in which hashing and
ciphering both can be achieved using a single session and operation call. The
order of cipher and hashing can be controlled by the auth_cipher_text
session parameter.
Other session parameters include algorithms, keys, initialization vector lengths, encode or decode, the output queue for async mode, and the output packet pool for allocation of an output packet if required.
The parameters that describe the characteristics of a crypto session are
encoded in the odp_crypto_session_param_t
struct that is passed to the
odp_crypto_session_create()
API. A successful call returns an
odp_crypto_session_t
object that in turn is passed as an input parameter to
crypto operation calls.
When an application is finished with a crypto session the
odp_crypto_session_destroy()
API is used to release the resources associated
with an odp_crypto_session_t
.
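A hedged sketch of creating an asynchronous encode session; the key material, completion queue, and output pool are application-provided placeholders, and the algorithm enum names are assumptions based on the ODP crypto API:
#include <odp_api.h>

/* Sketch: create an async encode session. Key data, the completion
 * queue, and the output pool are application-provided; algorithm enum
 * names are assumptions based on the ODP crypto API. */
static odp_crypto_session_t create_session(odp_queue_t compl_queue,
					    odp_pool_t out_pool,
					    uint8_t *cipher_key, uint32_t cipher_key_len,
					    uint8_t *auth_key, uint32_t auth_key_len)
{
	odp_crypto_session_param_t param;
	odp_crypto_session_t session;
	odp_crypto_ses_create_err_t status;

	odp_crypto_session_param_init(&param);
	param.op = ODP_CRYPTO_OP_ENCODE;
	param.op_mode = ODP_CRYPTO_ASYNC;
	param.cipher_alg = ODP_CIPHER_ALG_AES_CBC;
	param.cipher_key.data = cipher_key;
	param.cipher_key.length = cipher_key_len;
	param.auth_alg = ODP_AUTH_ALG_SHA256_HMAC;
	param.auth_key.data = auth_key;
	param.auth_key.length = auth_key_len;
	param.compl_queue = compl_queue;
	param.output_pool = out_pool;

	if (odp_crypto_session_create(&param, &session, &status) < 0)
		return ODP_CRYPTO_SESSION_INVALID;

	return session;
}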
14.2. Crypto operations
After session creation, a cryptographic operation can be applied to a packet
synchronously or asynchronously. odp_crypto_op()
is the synchronous API
while odp_crypto_op_enq()
is the asynchronous API. To check which of these
are supported by the ODP implementation, examine the sync_mode
and
async_mode
fields in the odp_crypto_capability_t
struct returned by the
odp_crypto_capability()
API.
Both forms take an input array of packets, an optional output array of packets
to receive the results, and an array of odp_crypto_packet_op_param_t
structs
that describe the operation to be performed on each input packet. The output
array may specify the same packets to request in-place operation, or may be
specified as ODP_PACKET_INVALID
to request that ODP allocate output packets
from the pool associated with the odp_crypto_session_t
being used.
The op_mode field of the odp_crypto_session_param_t
indicates whether asynchronous
or synchronous operations are used with the session. If op_mode
is set
to ODP_CRYPTO_SYNC
then the synchronous API must be used and if op_mode
is set to ODP_CRYPTO_ASYNC
then the asynchronous API must be used. It is
an error to use a form of the API that does not match the mode of the crypto
session.
The output of a crypto operation is an odp_packet_t
(one for each input
packet) that is returned either synchronously or asynchronously. Asynchronous
return is in the form of ODP_EVENT_PACKET
events that have event subtype
ODP_EVENT_PACKET_CRYPTO
. The packet associated with such events is obtained
via the odp_crypto_packet_from_event()
API. The odp_crypto_result()
API,
in turn, retrieves the odp_crypto_packet_result_t
from this odp_packet_t
that contains:
-
An indication of whether the crypto operation was successful or not
-
The
odp_crypto_op_status_t
for the requested cipher operation -
The
odp_crypto_op_status_t
for the requested authentication operation
14.3. Random number Generation
ODP provides two APIs to generate various kinds of random data bytes. Random data is characterized by kind, which specifies the "quality" of the randomness required. ODP supports three kinds of random data:
- ODP_RANDOM_BASIC
-
No specific requirement other than the data appear to be uniformly distributed. Suitable for load-balancing or other non-cryptographic use.
- ODP_RANDOM_CRYPTO
-
Data suitable for cryptographic use. This is a more stringent requirement that the data pass tests for statistical randomness.
- ODP_RANDOM_TRUE
-
Data generated from a hardware entropy source rather than any software generated pseudo-random data. May not be available on all platforms.
These form a hierarchy with BASIC being the lowest kind of random and TRUE being the highest. The main API for accessing random data is:
int32_t odp_random_data(uint8_t *buf, uint32_t len, odp_random_kind_t kind);
The expectation is that lesser-quality random is easier and faster to generate while higher-quality random may take more time. Implementations are always free to substitute a higher kind of random than the one requested if they are able to do so more efficiently, however calls must return a failure indicator (rc < 0) if a higher kind of data is requested than the implementation can provide. This is most likely the case for ODP_RANDOM_TRUE since not all platforms have access to a true hardware random number generator.
The odp_random_max_kind()
API returns the highest kind of random data
available on this implementation.
For testing purposes it is often desirable to generate repeatable sequences of "random" data. To address this need ODP provides the additional API:
int32_t odp_random_test_data(uint8_t *buf, uint32_t len, uint64_t *seed);
This operates the same as odp_random_data()
except that it always returns
data of kind ODP_RANDOM_BASIC
and an additional thread-local seed
parameter is provided that specifies a seed value to use in generating the
data. This value is updated on each call, so repeated calls with the same
variable will generate a sequence of random data starting from the initial
specified seed. If another sequence of calls is made starting with the same
initial seed value, then odp_random_test_data()
will return the same
sequence of data bytes.
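For instance, a hedged sketch showing that two sequences generated from the same initial seed are identical:
#include <string.h>
#include <odp_api.h>

/* Sketch: generate two 64-byte sequences from the same seed value and
 * verify that they match. Returns 0 when the sequences are identical. */
static int check_repeatable_random(void)
{
	uint8_t a[64], b[64];
	uint64_t seed1 = 42, seed2 = 42;

	if (odp_random_test_data(a, sizeof(a), &seed1) != (int32_t)sizeof(a))
		return -1;
	if (odp_random_test_data(b, sizeof(b), &seed2) != (int32_t)sizeof(b))
		return -1;

	return memcmp(a, b, sizeof(a)) == 0 ? 0 : -1;
}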
14.4. Capability inquiries
ODP provides the API odp_crypto_capability()
to inquire the implementation’s
crypto capabilities. This interface returns the maximum number of crypto
sessions supported as well as bitmasks for supported algorithms and hardware
backed algorithms.
15. IPsec services
In addition to general cryptographic services, ODP offers offload support for the IPsec protocol. IPsec is a general term referencing a suite of protocols and packet formats and as such a full discussion of IPsec is beyond the scope of this document. See RFC 4301 and related RFCs for more detail. This section assumes the reader is already familiar with IPsec and focuses on explaining the ODP APIs that support it.
ODP provides APIs for the following IPsec services:
-
General IPsec configuration
-
Security Association (SA) configuration and lifecycle management
-
Synchronous and Asynchronous IPsec lookaside processing
-
Inline processing for full IPsec RX and/or TX offload
-
Pipelining for RX traffic
-
Fragmentation support for TX traffic
-
IPsec event management
15.1. IPsec Capabilities and Configuration
As with other features, ODP provides APIs that permit applications to query
platform-specific IPsec capabilities. The odp_ipsec_capability()
API queries
the general IPsec features available while the odp_ipsec_cipher_capability()
and odp_ipsec_auth_capability()
APIs provide detail on the range of
cipher and authentication algorithms supported by IPsec on this platform.
General IPsec capabilities that are reported include:
-
The IPsec operation modes supported by this implementation. Each operation mode may be unsupported, supported, or preferred. A preferred form means that this mode takes advantage of hardware acceleration features to achieve best performance.
-
Whether IPsec AH processing is supported. All ODP platforms must provide support for IPsec ESP processing, however since AH is relatively rare, it may not be supported, or supported only via software emulation (e.g., be non-preferred).
-
Whether IPsec headers can be retained on decrypt for inbound inline operations.
-
Whether classification pipelining is supported (to be discussed below).
In addition, capabilities also inform the application of the maximum number of destination queues and classification CoS targets supported. These will be discussed further later.
15.1.1. IPsec Operation Modes
IPsec operates in one of three modes: Synchronous, Asynchronous, and Inline.
15.1.2. Lookaside Processing
Synchronous and Asynchronous are types of lookaside processing. Which of these forms may be used depends on the IPsec operation mode. So synchronous APIs may only be used when operating in synchronous mode, and asynchronous APIs may only be used when operating in asynchronous mode.
In lookaside mode, the application receives (or creates) an IPsec packet and then uses ODP to perform one of two functions:
-
To decrypt an IPsec packet into a "normal" packet
-
To take a "normal" packet and encrypt it into an IPsec packet.
This process may be performed synchronously with the APIs odp_ipsec_in()
(to decrypt) and odp_ipsec_out()
(to encrypt). Upon return from these calls
the requested packet transformation is complete, or an error return code
indicates that it could not be performed (e.g., packet decryption failed).
Synchronous processing may be preferred if the application has a large number of worker threads so that blocking any individual worker while IPsec processing is performed represents a reasonable design. The alternative is to use asynchronous forms of these APIs:
-
odp_ipsec_in_enq()
for decrypt -
odp_ipsec_out_enq()
for encrypt
These simply pass packets to IPsec for processing. When this processing is complete, the resulting packets are sent to the completion queue associated with the SA used by the operation, serving as IPsec completion events as shown here:
If the operation fails because SA lookup failed for inbound processing, then
these result packets are sent to the default queue specified as part of the
odp_ipsec_inbound_config_t
used in the odp_ipsec_config()
call.
Following an asynchronous IPsec call, the worker thread moves on to process
other events until the IPsec completion shows up. At that point the worker
thread sees whether the operation was successful or not and continues
processing for that packet. These events may be direct-polled with
odp_queue_deq()
if the completion queue was created as a plain queue, or
processed via the ODP scheduler if the completion queue was created as a
scheduled queue.
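A hedged sketch of how a worker might recognize and unpack an IPsec completion event delivered by the scheduler; the field names of odp_ipsec_packet_result_t used here are assumptions based on the ODP IPsec API:
#include <odp_api.h>

/* Sketch: handle one scheduled event that may be an IPsec completion.
 * Returns the resulting packet, or ODP_PACKET_INVALID if the event was
 * not a successful IPsec completion. Result-struct field names are
 * assumptions. */
static odp_packet_t handle_ipsec_completion(odp_event_t ev)
{
	odp_packet_t pkt;
	odp_ipsec_packet_result_t result;

	if (odp_event_type(ev) != ODP_EVENT_PACKET ||
	    odp_event_subtype(ev) != ODP_EVENT_PACKET_IPSEC)
		return ODP_PACKET_INVALID;

	pkt = odp_ipsec_packet_from_event(ev);

	if (odp_ipsec_result(&result, pkt) < 0 || result.status.error.all) {
		odp_packet_free(pkt); /* operation failed */
		return ODP_PACKET_INVALID;
	}

	return pkt;
}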
15.1.3. Inline Processing
While lookaside processing offers flexibility, it still requires extra processing steps not required by modern hardware. To avoid this overhead ODP also offers inline processing support for IPsec. In this mode the processing of IPsec packets on the RX and TX paths is fully offloaded as shown here:
It is worth noting that, depending on the implementation and application needs, inline processing may be enabled only for one direction (inbound or outbound) or for both directions.
On the receive side, once configured for inline processing, arriving IPsec
packets that are recognized at the PktIO interface are decrypted automatically
before the application ever sees them. On the transmit side, the application
calls odp_ipsec_out_inline()
and the packet is encrypted and queued for
transmission as a single operation without further application involvement.
Note that if an inbound IPsec packet is not recognized (e.g., it belongs to
an unknown SA) then it will be presented to the application as-is without
further processing. The application may then use a lookaside call to process
the packet if it is able to supply a matching SA by other means.
On the receive side, after an IPsec packet is decrypted, it may be pipelined to the ODP classifier or added to a poll queue, as the application wishes. The advantage of classification pipelining is that inbound IPsec traffic is automatically decrypted and classified into appropriate flow-based queues for ease of processing.
On the transmit side, since IPsec encryption and tunneling may exceed an output MTU, ODP also offers support for MTU configuration and automatic IPsec TX fragmentation.
Both classification pipelining and TX fragmentation are optional features whose availability is indicated by odp_ipsec_capability().
Note that at present inline IPsec output support sends resulting packets
directly to an output PktIO. If it’s desired to send them to the ODP
Traffic Manager for shaping prior to transmission, use the lookaside APIs
to perform the IPsec encrypt and then call odp_tm_enq()
on the resulting
packet.
15.2. IPsec Configuration
Prior to making use of IPsec services, the odp_ipsec_config()
API is used to
configure IPsec processing options. This API takes a pointer to an
odp_ipsec_config_t
struct as its argument.
The odp_ipsec_config_t
struct specifies the inbound and outbound processing
modes (SYNC, ASYNC, or INLINE) that the application plans to use, the maximum
number of Security Associations it will use, and sets inbound and outbound
processing options.
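As a minimal sketch (assuming the typical odp_ipsec_config_t field names inbound_mode, outbound_mode, max_num_sa and inbound.default_queue found in recent ODP headers, and a queue ipsec_default_queue created earlier by the application), configuring IPsec for asynchronous inbound and inline outbound processing might look like this:
odp_ipsec_config_t ipsec_cfg;

odp_ipsec_config_init(&ipsec_cfg);
ipsec_cfg.inbound_mode  = ODP_IPSEC_OP_MODE_ASYNC;
ipsec_cfg.outbound_mode = ODP_IPSEC_OP_MODE_INLINE;
ipsec_cfg.max_num_sa    = 64;

/* Queue for global events such as inbound SA lookup failures */
ipsec_cfg.inbound.default_queue = ipsec_default_queue;

if (odp_ipsec_config(&ipsec_cfg) != 0) {
	/* IPsec could not be configured as requested */
}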
15.2.1. IPsec Inbound Configuration
Inbound configuration options for IPsec specify the default odp_queue_t
to
be used for processing global events like SA lookup failures, how Security
Parameter Index (SPI) lookup is to be performed, and whether the application
requires ODP to retain outer headers for decrypted IPsec packets.
Parsing options specify how "deep" decrypted packets are to be parsed after IPsec processing by specifying the packet layers of interest to the application (None, L2, L3, L4, or All), as well as which checksums should be verified on decrypted packets.
15.2.2. IPsec Outbound Configuration
Outbound configuration options for IPsec specify checksum insertion processing that should be performed prior to encryption.
15.3. IPsec Events
IPsec introduces one new event type and one new event subtype. These are:
-
IPsec packet events. These are events of type
ODP_EVENT_PACKET
that have subtypeODP_EVENT_PACKET_IPSEC
. These are packets that carry additional IPsec-related metadata in the form of anodp_ipsec_packet_result_t
struct that can be retrieved from the packet via theodp_ipsec_result()
API. -
IPsec status notifications. These are events of type
ODP_EVENT_IPSEC_STATUS
that indicate status events not associated with any particular IPsec packet. Such events carry status in the form of anodp_ipsec_status_t
struct that is retrieved from the event via theodp_ipsec_status()
API.
IPsec-related events are thus part of normal and exception processing when working with IPsec.
15.4. Security Associations (SAs)
The fundamental "building block" for IPsec processing is the Security
Association (SA). Similar to a crypto session, the SA encapsulates the keying
material and context needed to perform IPsec protocol processing for inbound
or outbound packets on a given flow, as well as additional processing options
that control how IPsec is to be used for packets processed under this
SA. Security Associations are unidirectional (RX or TX) so a flow that
requires both inbound (decrypt) and outbound (encrypt) IPsec functions will
have two SAs associated with it. SAs in ODP are represented by the
abstract type odp_ipsec_sa_t
.
After ODP initialization, IPsec support is dormant until it is configured
by a call to odp_ipsec_config()
as described earlier. Once configured,
SAs may be created by calling odp_ipsec_sa_create()
.
15.4.1. SA Creation and Configuration
The odp_ipsec_sa_create()
API takes an odp_ipsec_sa_param_t
argument that
describes the SA to be created. Use the odp_ipsec_sa_param_init()
API to
initialize this to its default state and then override selected fields within
the param struct as needed.
Items specified in the odp_ipsec_sa_param_t
struct include:
-
The direction of the SA (inbound or outbound).
-
The IPsec protocol being used (ESP or AH).
-
The IPsec protocol mode (Transport or Tunnel).
-
The parameters needed for the crypto and authentication algorithms to be used by this SA.
-
Miscellaneous SA options that control behavior such as use of Extended Sequence Numbers (ESNs), the use of UDP encapsulation, various copy options for header fields, and whether the TTL (Hop Limit) field should be decremented when operating in tunnel mode.
-
Parameters controlling the SA lifetime.
-
The Security Parameter Index (SPI) that packets will use to indicate that they belong to this SA.
-
The pipeline mode used by this SA.
-
The destination
odp_queue_t
to be used for events associated with this SA. -
The user context pointer (and length) associated with this SA for application use.
In addition, there are direction-specific parameters that vary based on whether the SA is for inbound or outbound use. For inbound SAs:
-
Controls for how this SA is to be looked up.
-
The minimum size of the anti-replay window to be used.
-
The default CoS to use when classification pipelining is used for packets matching this SA.
For outbound SAs:
-
Tunnel parameters to use when doing outbound processing in tunnel mode.
-
The fragmentation mode to be used.
-
The MTU used to control the maximum length of IP packets that outbound IPsec operations may produce. This can be changed dynamically by the
odp_ipsec_sa_mtu_update()
API.
As can be seen, SAs have a large degree of configurability.
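To make the above concrete, here is a hedged sketch of creating an inbound ESP tunnel-mode SA. The exact layout of odp_ipsec_sa_param_t (for example, whether spi and dest_queue are top-level fields) varies slightly between ODP versions, so treat the field names as illustrative; cipher_key, auth_key and sa_queue are assumed to have been prepared by the application:
odp_ipsec_sa_param_t sa_param;
odp_ipsec_sa_t sa;

odp_ipsec_sa_param_init(&sa_param);

sa_param.dir        = ODP_IPSEC_DIR_INBOUND;
sa_param.proto      = ODP_IPSEC_ESP;
sa_param.mode       = ODP_IPSEC_MODE_TUNNEL;
sa_param.spi        = 0x1000;            /* SPI negotiated out of band */
sa_param.dest_queue = sa_queue;          /* completion/event queue for this SA */

sa_param.crypto.cipher_alg = ODP_CIPHER_ALG_AES_CBC;
sa_param.crypto.cipher_key = cipher_key; /* odp_crypto_key_t prepared earlier */
sa_param.crypto.auth_alg   = ODP_AUTH_ALG_SHA256_HMAC;
sa_param.crypto.auth_key   = auth_key;

sa = odp_ipsec_sa_create(&sa_param);
if (sa == ODP_IPSEC_SA_INVALID) {
	/* SA could not be created, e.g., unsupported algorithm */
}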
15.4.2. SA Lifecycle Management
In discussing the lifecycle of an SA and the operations it supports, it is useful to refer to the following sequence diagram for IPsec configuration, SA management, and IPsec operations:
After creation, IPsec services are active for this Security Association. The specific APIs that can be used on this SA depends on the IPsec operating mode that has been configured.
IPsec Lookaside Processing
If IPsec is operating in lookaside mode for the SA’s direction (the
odp_ipsec_op_mode_t
is ODP_IPSEC_OP_MODE_SYNC
or ODP_IPSEC_OP_MODE_ASYNC
),
then inbound or outbound lookaside operations may be performed. Asynchronous
lookaside operations are also permitted if the SA is operating in inline
mode, as described in the next section.
The synchronous forms of these APIs are:
-
odp_ipsec_in()
-
odp_ipsec_out()
Upon return from these calls, the return code tells the application the number of input packets that were consumed by the operation. The result of the operation is determined by calling the odp_ipsec_result() API for each output packet to retrieve its associated odp_ipsec_packet_result_t.
The asynchronous forms of these APIs are:
-
odp_ipsec_in_enq()
-
odp_ipsec_out_enq()
Here again, the return code indicates how many input packets were
processed. The success or failure is determined by inspecting the
odp_ipsec_packet_result_t
associated with each packet completion event. These are
presented as events of type ODP_EVENT_PACKET
with subtype
ODP_EVENT_PACKET_IPSEC
.
For both synchronous and asynchronous IPsec operations an input packet array
is transformed into an output packet array as specified by a controlling
parameter struct. For inbound operations, the odp_ipsec_in_param_t
is
used to specify how SA processing is to be performed for the requested
operation. The caller may say that SA lookup processing should be performed
for each input packet, a single (specified) SA should be used for all packets,
or that each packet has a specified individual SA.
For outbound lookaside operations, a corresponding odp_ipsec_out_param_t
serves a similar role, but here the SA must be specified since the input
packet(s) are non-IPsec packets. Again the option is to use a single SA for
all input packets or one per input packet.
For outbound operations, an associated array of odp_ipsec_out_opt_t
structs
is also used to control the fragmentation mode to be used as part of the
outbound processing. Options here are to not fragment, to fragment before
IPsec processing, after IPsec processing, or to only check whether IP
fragmentation is needed but not to perform it. For check processing, the mtu
status error bit in the odp_ipsec_packet_result_t
is set if check processing
detects that the resulting packet will not fit into the configured MTU. Note
that the MTU associated with a given SA is set at SA creation and can be
changed at any time via the odp_ipsec_sa_mtu_update()
API.
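A minimal synchronous outbound example, assuming the odp_ipsec_out_param_t layout (num_sa, sa, num_opt, opt) found in current ODP headers and an SA handle sa created as shown earlier:
odp_packet_t pkt_in[1];          /* plain packet to be encrypted, prepared earlier */
odp_packet_t pkt_out[1];
odp_ipsec_out_param_t param;
int num_out = 1;

param.num_sa  = 1;               /* a single SA applies to all input packets */
param.sa      = &sa;
param.num_opt = 0;               /* use SA defaults, no per-packet options */
param.opt     = NULL;

if (odp_ipsec_out(pkt_in, 1, pkt_out, &num_out, &param) < 0) {
	/* operation failed; pkt_in[0] is still owned by the application */
}

/* On success, inspect each pkt_out[i] via odp_ipsec_result() */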
Once an asynchronous lookaside operation has been initiated, the worker thread
that issued the asynchronous call can handle other events while waiting for
the operation to complete. Completion of an asynchronous operation is
indicated by the worker receiving an ODP_EVENT_PACKET
that has subtype
ODP_EVENT_PACKET_IPSEC
. These events can be retrieved directly by polling
the completion queue associated with the SA, or (more typically) via the ODP
scheduler. Typical code for such completion processing would look as follows:
while (1) {
	ev = odp_schedule(&queue, ODP_SCHED_WAIT);
	ev_type = odp_event_types(ev, &ev_subtype);

	switch (ev_type) {
	case ODP_EVENT_PACKET:
		switch (ev_subtype) {
		case ODP_EVENT_PACKET_IPSEC:
			pkt = odp_packet_from_event(ev);

			if (odp_unlikely(odp_ipsec_result(&result, pkt) != 0)) {
				/* Stale event, discard */
				odp_event_free(ev);
				continue;
			}

			if (odp_unlikely(result.status.all != ODP_IPSEC_OK)) {
				if (result.status.error != ODP_IPSEC_OK) {
					...process error result
					odp_event_free(ev);
					continue;
				} else {
					...process packet warnings
				}
			}

			my_context = odp_ipsec_sa_context(result.sa);

			if (result.flag.inline_mode) {
				...process inline inbound packet
			} else {
				...process the async completion event
			}
			...
			break;

		case ...
		}
		break;

	case ODP_EVENT_IPSEC_STATUS:
		...process IPsec status event
		break;
	}
}
IPsec Inline Processing
When IPsec is configured to operate in ODP_IPSEC_OP_MODE_INLINE
mode,
inbound processing is implicit. The application never sees these packets until
after IPsec has already decrypted them. As shown in the code sketch above,
such packets appear as events of subtype ODP_EVENT_PACKET_IPSEC
and the
flag
field in the associated odp_ipsec_packet_result_t
indicates
inline_mode
.
For outbound IPsec processing, the odp_ipsec_out_inline()
API operates as
a "fire and forget" API. A success return code from this call indicates that
the packet will be encrypted and transmitted to the odp_pktio_t
indicated
in the odp_ipsec_out_inline_param_t
specified at the time of the call without
any further application involvement. Only if a problem arises will the packet
be returned to the application with an odp_ipsec_packet_result_t
indicating
the nature of the problem.
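A hedged sketch of an inline transmit follows, assuming the odp_ipsec_out_inline_param_t layout (a pktio field plus an outer_hdr pointer/length pair) present in current ODP releases; pkt, sa, out_pktio, hdr and hdr_len are assumed to have been set up by the application:
odp_ipsec_out_param_t param;
odp_ipsec_out_inline_param_t inline_param;

param.num_sa  = 1;
param.sa      = &sa;
param.num_opt = 0;
param.opt     = NULL;

inline_param.pktio         = out_pktio;   /* interface to transmit on */
inline_param.outer_hdr.ptr = hdr;         /* prebuilt outer (e.g., Ethernet) header */
inline_param.outer_hdr.len = hdr_len;

/* "Fire and forget": on success the packet is encrypted and queued
 * for transmission without further application involvement */
if (odp_ipsec_out_inline(&pkt, 1, &param, &inline_param) < 0) {
	/* packet was not accepted for inline processing */
}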
Note that while operating in inline mode, asynchronous lookaside operations are also permitted. This provides the application with additional flexibility if, for example, some packets need additional handling that cannot be supported directly with inline IPsec processing.
15.4.3. SA Lifetimes
A fundamental principle of good security is that the keying material associated with sessions has a limited lifetime. In effect, keys grow "stale" over time or due to being used to encrypt too much data. The metrics used to limit effective SA lifetimes are:
-
Duration (time)
-
Usage (volume of traffic using the keys)
Associated with each of these metrics are "soft" and "hard" limits. When a hard limit is reached, the SA is expired and cannot be used further. To support graceful transition to a replacement SA, soft limits are used. A soft limit is similar to a "low fuel" warning light on a car. It alerts the application that the SA is nearing the end of its useful life and should be renegotiated even as the SA continues to work normally.
ODP support for SA limits is based on packet/byte counts. Applications that wish to use time-based SA limits may do so on their own using the timing facilities that ODP provides. However, since the application may not have explicit knowledge of the traffic volumes associated with a given SA (especially with inline IPsec processing), support for usage-based limits is integrated into ODP IPsec support.
At odp_ipsec_sa_create()
time, one of the fields in the
odp_ipsec_sa_param_t
struct is the odp_ipsec_lifetime_t
sub-structure.
This struct allows hard and/or soft limits to be specified in terms of total
bytes encrypted/decrypted, total packet count, or both. A limit specification
of 0 indicates no limit for that metric. If both are specified, the limit
is triggered by whichever is reached first. Given the defined behavior of hard vs.
soft limits, the soft limits, if used, should always be specified as lower
than the hard limits. These should be sufficiently lower to enable adequate
time to switch over to a replacement SA before the hard limit is reached.
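For example, a byte-based lifetime with a soft limit well below the hard limit might be set at SA creation time as follows. This is a sketch; the soft_limit/hard_limit field names follow odp_ipsec_lifetime_t as defined in recent ODP headers:
odp_ipsec_sa_param_t sa_param;

odp_ipsec_sa_param_init(&sa_param);

/* Warn after ~800 MB, hard-expire after 1 GB; 0 leaves packet counts unlimited */
sa_param.lifetime.soft_limit.bytes   = 800ULL * 1024 * 1024;
sa_param.lifetime.soft_limit.packets = 0;
sa_param.lifetime.hard_limit.bytes   = 1024ULL * 1024 * 1024;
sa_param.lifetime.hard_limit.packets = 0;

/* ... other SA parameters as shown earlier, then odp_ipsec_sa_create() ... */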
As noted, when an SA hard limit is reached the SA immediately enters the
expired state and attempts to use it further are failed with an
odp_ipsec_result_t
that indicates a hard expiration limit. When a soft
limit is reached for packets sent via odp_ipsec_out_inline()
, this results
in an ODP_EVENT_IPSEC_STATUS
event being sent to the application on the
queue associated with the SA that has reached the soft limit. This status
event has an odp_ipsec_status_id_t
of ODP_IPSEC_STATUS_WARN
with the
odp_ipsec_warn_t
bits set to indicate the type of soft expiration reached.
Receipt of this event alerts the application that the SA is nearing the end of
its useful life and that it should be replaced. It is the application’s
responsibility to heed this warning. It is implementation-defined how many
such warnings are issued when a soft limit is exceeded (once, first N packets,
or all packets beyond the limit), so applications should be written to
allow for possible repeated warnings.
When operating in lookaside mode, expiration limits are carried as a warning
in the odp_op_status_t
section of the odp_ipsec_result_t
struct. The same
is true for inline inbound packets. When the soft limit is reached, these
packets will carry a warning flag indicating this condition.
15.4.4. SA Disablement and Destruction
When it is time to retire an SA, the application does so by first issuing a
call to the odp_ipsec_sa_disable()
API. This call initiates termination
processing for an SA by stopping use of the SA for new operations while still
allowing those that are "in flight" to complete processing. Following this call
the application continues to receive and process IPsec events as normal.
Disable completion is indicated by the application seeing an event of type
ODP_EVENT_IPSEC_STATUS
for this SA that contains an odp_ipsec_status_id_t
of ODP_IPSEC_STATUS_SA_DISABLE
. For inbound SAs, receipt of this event means
that the application has seen all IPsec packets associated with this SA that
were pending at the time of the disable call. For outbound SAs, receipt of
this event means that the application has seen all result events associated
with packets sent via this SA.
Note that once a packet has been "seen" by the application, it becomes the application’s responsibility to ensure that it is fully processed before attempting to destroy its associated SA. The disable call exists to give the application assurance that there are no pending IPsec events for this SA associated with packets that it has not seen before.
So after disabling the SA, the application can process pending packets
normally until it sees the disable status event. At that point it knows that
all pending packets that arrived before the disable have been seen and it is
safe for the application to destroy it via odp_ipsec_sa_destroy()
, thus
completing the SA lifecycle.
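The disable/destroy handshake might be sketched as follows, assuming odp_ipsec_status() fills an odp_ipsec_status_t with id and sa fields as in current ODP releases:
/* Step 1: stop new use of the SA; in-flight operations continue */
if (odp_ipsec_sa_disable(sa) != 0) {
	/* handle error */
}

/* Step 2: keep processing events until the disable completion arrives */
if (odp_event_type(ev) == ODP_EVENT_IPSEC_STATUS) {
	odp_ipsec_status_t status;

	if (odp_ipsec_status(&status, ev) == 0 &&
	    status.id == ODP_IPSEC_STATUS_SA_DISABLE &&
	    status.sa == sa) {
		/* No more pending events for this SA: safe to destroy it */
		odp_ipsec_sa_destroy(sa);
	}
	odp_event_free(ev);
}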
16. Compression services
ODP provides APIs to perform compression and decompression operations required by applications. ODP compression APIs are session based and provide compression algorithm offload services, with and without associated integrity hashing. This section covers the main compression APIs.
ODP provides support for the following compression algorithms:
ODP_COMP_ALG_NONE
-
The null compression algorithm. Used for testing as well as to specify hash-only operations.
ODP_COMP_ALG_DEFLATE
-
The deflate compression algorithm specified by RFC 1951.
ODP_COMP_ALG_ZLIB
-
The ZLIB compression algorithm specified by RFC 1950.
ODP_COMP_ALG_LZS
-
The LZS compression algorithm as specified by ANSI X3.241.
The following hash algorithms are also defined to be used in conjunction with these compression algorithms:
ODP_COMP_HASH_ALG_NONE
-
A dummy that specifies no associated hashing is to be performed.
ODP_COMP_HASH_ALG_SHA1
-
SHA-1 hashing with a 160-bit digest length.
ODP_COMP_HASH_ALG_SHA256
-
SHA-2 hashing with a 256-bit digest length.
16.1. Compression Sessions
ODP compression services are session based and operate on input packets and
deliver output packets. A compression session (odp_comp_session_t
) provides
the context for controlling the operations performed on packets. All of the
packets processed by a session share the parameters that define the
session.
ODP supports synchronous and asynchronous compression sessions. For asynchronous sessions, the output of a compression operation is posted to a queue defined as the completion queue in its session parameters.
Other session parameters include: the type of operation (compression or decompression), the operating mode (synchronous or asynchronous), the compression and hashing algorithms to be used, as well as any parameters needed by those algorithms to configure them. For asynchronous compression sessions, the application also specifies whether queue order must be maintained. Additional throughput may be achieved in some implementations if strict ordering is not required.
The parameters that describe the characteristics of a compression session
are encoded in the odp_comp_session_param_t
struct that is passed to the
odp_comp_session_create()
API. A successful call returns an
odp_comp_session_t
handle that is then used as an input parameter to
compression operation calls.
When an application is finished with a compression session, the
odp_comp_session_destroy()
API is used to release the resources
associated with an odp_comp_session_t
.
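A hedged sketch of creating a synchronous DEFLATE compression session follows. The parameter field names (op, mode, comp_algo, hash_algo) and the create call returning the session handle are assumptions based on the description above and should be checked against the odp_comp_session_param_t definition of the target ODP release:
odp_comp_session_param_t param;
odp_comp_session_t session;

odp_comp_session_param_init(&param);
param.op        = ODP_COMP_OP_COMPRESS;     /* compress (vs. decompress) */
param.mode      = ODP_COMP_OP_MODE_SYNC;    /* synchronous operation */
param.comp_algo = ODP_COMP_ALG_DEFLATE;     /* RFC 1951 deflate */
param.hash_algo = ODP_COMP_HASH_ALG_NONE;   /* no associated hash */

session = odp_comp_session_create(&param);
if (session == ODP_COMP_SESSION_INVALID) {
	/* session could not be created */
}

/* ... perform odp_comp_op() calls using this session ... */

odp_comp_session_destroy(session);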
16.2. Compression operations
After session creation, a compression operation can be applied to a packet either synchronously or asynchronously, depending on how the session was created.
16.2.1. Synchronous compression operations
Synchronous compression operations take the following form:
int odp_comp_op(const odp_packet_t pkt_in[], odp_packet_t pkt_out[],
int num_pkt, const odp_comp_packet_op_param_t param[]);
An input packet array is compressed/decompressed into a supplied output
packet array under the control of a supplied parameter struct
odp_comp_packet_op_param_t
. Note that this struct points to the session used
to control the operation and specifies the input and output packet data ranges
to be used for the operation. On input, the output data range must be
sized sufficiently to contain the result of the operation to avoid an out of
space error. On output, this range is updated to reflect the actual data
written. This information can then be used to trim off any excess padding before
continuing processing of the output packet(s).
16.2.2. Asynchronous compression operations
Asynchronous compression operations are invoked with a slightly different API:
int odp_comp_op_enq(const odp_packet_t pkt_in[], odp_packet_t pkt_out[],
int num_pkt, const odp_comp_packet_op_param_t param[]);
Here the session pointed to by the odp_comp_packet_op_param_t
indicates
the completion queue to be used for the operation, so a zero return from
odp_comp_op_enq()
means only that the operation was successfully
initiated.
The resulting completion queue can then be polled either directly
via odp_queue_deq()
or indirectly via the scheduler. The result is
presented as an event of type ODP_EVENT_PACKET
with subtype
ODP_EVENT_PACKET_COMP
.
When receiving this event, the odp_comp_packet_from_event()
API is used to
convert the event into a usable odp_packet_t
, and the odp_comp_result()
API is used to retrieve the odp_comp_packet_result_t
metadata associated
with this packet.
Note that if the originating odp_comp_op_enq()
call specified an array of
input packets, each of these generates a separate result event. The order of
these events on the completion queue associated with the compression session is
controlled by the session’s packet_order
flag. If this flag is set then the
results will be in the same order as the original input list. If not, then
results are free to be reordered to make them available as soon as possible.
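Completion handling for asynchronous sessions might then look like the following sketch, using the odp_comp_packet_from_event() and odp_comp_result() APIs mentioned above; the parameter order of odp_comp_result() is assumed to mirror odp_ipsec_result(), and the result fields are assumptions:
if (odp_event_type(ev) == ODP_EVENT_PACKET &&
    odp_event_subtype(ev) == ODP_EVENT_PACKET_COMP) {
	odp_packet_t pkt = odp_comp_packet_from_event(ev);
	odp_comp_packet_result_t result;

	if (odp_comp_result(&result, pkt) == 0) {
		...check the result status and the output data range,
		   then continue processing pkt
	}
}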
17. Traffic Manager (TM)
The TM subsystem is a general packet scheduling system that accepts packets from input queues and applies strict priority scheduling, weighted fair queueing scheduling and/or bandwidth controls to decide which input packet should be chosen as the next output packet and when this output packet can be sent onwards.
A given platform supporting this TM API could support one or more pure hardware based packet scheduling systems, one or more pure software based systems or one or more hybrid systems - where because of hardware constraints some of the packet scheduling is done in hardware and some is done in software. In addition, there may also be additional APIs beyond those described here for:
-
controlling advanced capabilities supported by specific hardware, software or hybrid subsystems
-
dealing with constraints and limitations of specific implementations.
The intention here is to be the simplest API that covers the vast majority of packet scheduling requirements.
Often a TM subsystem’s output(s) will be directly connected to a device’s physical (or virtual) output interfaces/links, in which case such a system is sometimes called an Egress Packet Scheduler or an Output Link Shaper. While the TM subsystems configured by this API can be used in such a way, this API equally well supports the ability to have the TM subsystem’s outputs connect to other TM subsystem input queues, to general software queues, or to some combination of these three cases.
17.1. TM Algorithms
The packet scheduling/dropping techniques that can be applied to input traffic include any mixture of the following:
-
Strict Priority scheduling.
-
Weighted Fair Queueing scheduling (WFQ).
-
Bandwidth Shaping.
-
Weighted Random Early Discard (WRED).
Note that Bandwidth Shaping is the only feature that can cause packets to be "delayed", and Weighted Random Early Discard is the only feature (other than input queues becoming full) that can cause packets to be dropped.
17.1.1. Strict Priority Scheduling
Strict Priority Scheduling (or just priority for short) is a technique where input queues, and the packets from them, are assigned a priority value in the range 0 .. ODP_TM_MAX_PRIORITIES - 1. At all times packets with the smallest priority value will be chosen ahead of packets with a numerically larger priority value. This is called strict priority scheduling because the algorithm strictly enforces the scheduling of higher priority packets over lower priority packets.
17.1.2. Bandwidth Shaping
Bandwidth Shaping (or often just Shaping) is the term used here for the idea of controlling packet rates using single rate and/or dual rate token bucket algorithms. For single rate shaping a rate (the commit rate) and a "burst size" (the maximum commit count) are configured. Then an internal signed integer counter called the commitCnt is maintained such that if the commitCnt is positive then packets are eligible to be sent. When such a packet is actually sent, the commitCnt is decremented (usually by the packet’s length, but one could decrement by 1 for each packet instead). The commitCnt is then incremented periodically based upon the configured rate, so that this technique causes the traffic to be limited to the commit rate over the long term, while allowing some ability to exceed this rate for a very short time (based on the burst size) in order to catch up if the traffic input temporarily drops below the commit rate.
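The single rate algorithm described above can be sketched in a few lines of C (illustrative only; real implementations batch the periodic credit updates and handle clock wrap, and may account in bytes or packets):
/* Single rate token bucket sketch; names follow the text above.
 * commitRate is in bytes per second, commitCnt and maxCommitCnt in bytes. */
static uint64_t commitRate;      /* configured commit rate */
static int64_t  maxCommitCnt;    /* configured "burst size" */
static int64_t  commitCnt;       /* current credit */

/* Credit the bucket periodically (or lazily, based on elapsed time) */
static void add_credit(uint64_t elapsed_ns)
{
	commitCnt += (int64_t)((commitRate * elapsed_ns) / 1000000000ULL);
	if (commitCnt > maxCommitCnt)
		commitCnt = maxCommitCnt;    /* never exceed the burst size */
}

/* A queued packet is eligible to be sent only while commitCnt is positive */
static int eligible(void)
{
	return commitCnt > 0;
}

/* Charge the bucket when a packet is actually sent */
static void on_send(uint32_t pkt_len)
{
	commitCnt -= pkt_len;            /* or -= 1 for per-packet accounting */
}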
Dual Rate Shaping is designed to allow certain traffic flows to fairly send more than their assigned commit rate when the scheduler has excess capacity. The idea being that it may be better to allow some types of traffic to send more than their committed bandwidth rather than letting the TM outputs be idle. The configuration of Dual Rate Shaping requires additionally a peak rate and a peak burst size. The peak rate must be greater than the related commit rate, but the burst sizes have no similar constraint. Also for every input priority that has Dual Rate shaping enabled, there needs to be an additional equal or lower priority (equal or higher numeric priority value) assigned. Then if the traffic exceeds its commit rate but not its peak rate, the "excess" traffic will be sent at the lower priority level - which by the strict priority algorithm should cause no degradation of the higher priority traffic, while allowing for less idle outputs.
17.1.3. Weighted Fair Queuing
Weighted Fair Queuing (WFQ) is used to arbitrate among multiple input packets with the same priority. Each input can be assigned a weight in the range MIN_WFQ_WEIGHT..MAX_WFQ_WEIGHT (nominally 1..255) that affects the way the algorithm chooses the next packet. If all of the weights are equal AND all of the input packets are the same length then the algorithm is equivalent to a round robin scheduling. If all of the weights are equal but the packets have different lengths then the WFQ algorithm will attempt to choose the packet such that inputs each get a fair share of the bandwidth - in other words it implements a weighted round robin algorithm where the weighting is based on frame length.
When the input weights are not all equal and the input packet lengths vary then the WFQ algorithm will schedule packets such that the packet with the lowest "Virtual Finish Time" is chosen first. An input packet’s Virtual Finish Time is roughly calculated based on the WFQ object’s base Virtual Finish Time when the packet becomes the first packet in its queue plus its frame length divided by its weight.
virtualFinishTime = wfqVirtualTimeBase + (pktLength / wfqWeight)
In a system running at full capacity with no bandwidth limits - over the long term - each input fan-in’s average transmit rate will be the same fraction of the output bandwidth as the fraction of its weight divided by the sum of all of the WFQ fan-in weights. Hence larger WFQ weights result in better "service" for a given fan-in.
totalWfqWeight = 0;
for (each fan-in entity - fanIn - feeding this WFQ scheduler)
	totalWfqWeight += fanIn->sfqWeight;

fanIn->avgTransmitRate = avgOutputRate * fanIn->sfqWeight / totalWfqWeight;
17.1.4. Weighted Random Early Discard
The Weighted Random Early Discard (WRED) algorithm deals with the situation where an input packet rate exceeds some output rate (including the case where Bandwidth Shaping limits some output rates). Without WRED enabled and configured, the TM system will just implement a tail dropping scheme whereby whichever packet is unlucky enough to arrive when a TM input queue is full will be discarded regardless of priority or any other consideration. WRED allows one to configure the system to use a better/fairer algorithm than simple tail dropping. It works by measuring the "fullness" of various packet queues and converting this percentage into a probability of random packet dropping with the help of some configurable parameters. Then a random number is picked and, together with the drop probability, a decision is made to accept the packet or drop it. A basic parameterization of WRED requires three parameters (a sketch of the resulting drop decision follows the list below):
-
the maximum queue level (which could be either a maximum number of packets or a maximum amount of memory (i.e. bytes/buffers) used),
-
a starting threshold - which is a number in the range 0..100 representing a percentage of the maximum queue level at which the drop probability becomes non-zero,
-
a drop probability - which is a number in the range 0..100 representing a probability (0 means no drop and 100 means certain drop) - which is used when the queue is near 100% full.
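Using these three parameters, the drop decision described above might be sketched as follows. This is illustrative pseudocode, not an ODP API, and it assumes the starting threshold is below 100%:
#include <stdint.h>
#include <stdlib.h>   /* rand() */

/* Returns nonzero if the arriving packet should be dropped */
static int wred_drop(uint32_t cur_level, uint32_t max_level,
		     uint32_t start_threshold_pct, uint32_t drop_prob_pct)
{
	uint32_t fullness_pct = (100 * cur_level) / max_level;

	if (fullness_pct < start_threshold_pct)
		return 0;   /* queue not full enough: always accept */

	/* Scale the configured drop probability linearly between the
	 * starting threshold and 100% fullness */
	uint32_t p = drop_prob_pct * (fullness_pct - start_threshold_pct) /
		     (100 - start_threshold_pct);

	/* Random accept/drop decision weighted by p */
	return (uint32_t)(rand() % 100) < p;
}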
Note that all packet drops for a TM system occur only when a new packet arrives at a given TM system input queue. At that time either the WRED algorithm, if enabled for this input queue, or the "input queue full" tail drop algorithm will make a drop/no drop decision. After this point, any packets not dropped will at some point be sent out a TM output - assuming that the topology is fully connected and enabled.
17.2. Hierarchical Scheduling and tm_nodes
This API supports the ability to do Hierarchical Scheduling whereby the final scheduling decision is controlled by equal priority schedulers, strict priority multiplexers, bandwidth shapers - at multiple levels - all forming a tree rooted at a single egress object. In other words, all tm_queues and tm_nodes have the property that their logical "output" feeds into one fan-in of a subsequent tm_node or egress object - forming a proper tree.
Multi-level/hierarchical scheduling adds both great control and significant complexity. Logically, despite the implication of the tm_node tree diagrams, there are no queues between the levels of hierarchy. Instead all packets are held in their input queue, until such time that the totality of all of the tm_nodes in the single path from input queue to output object agrees that this packet should be the next to be chosen to leave the TM system through the output object "portal". Hence what flows from level to level is the "local choice" of what packet/tm_queue should next be serviced.
17.2.1. tm_nodes
Tm_nodes are the main "entity"/object that a TM system is composed of. Each tm_node is a mini-TM subsystem of its own, but the interconnection and interplay of a multi-level "tree" of tm_nodes allows the user to specify some very sophisticated behaviors. Each tm_node can contain a set of schedulers (one per strict priority level), a strict priority multiplexer, a bandwidth shaper and a WRED component - or a subset of these.
In its full generality a tm_node consists of a set of "fan-in" connections to preceding tm_queues or tm_nodes. The fan-in for a single tm_node can range from one to many thousands. This fan-in is divided first into a WFQ scheduler per priority level. So if 4 priority levels are implemented by this tm_node, there would be 4 WFQ schedulers - each with its own unique fan-in. After the WFQ schedulers comes a priority chooser, which always chooses the highest priority WFQ output available. The output of the priority chooser then feeds a bandwidth shaper function which finally uses the shaper’s propagation table to determine its output packet and its priority. This output could then be remapped via a priority map profile and becomes one of the fan-in inputs to perhaps another level of tm_nodes, and so on.
During this process it is important to remember that the bandwidth shaping function never causes packets to be dropped. Instead all packet drops occur because of tm_queue fullness or by running the WRED algorithm at the time a new packet attempts to be appended to the end of some input queue.
The WRED profile associated with a tm_node considers the entire set of tm_queues feeding directly or indirectly into it as its measure of queue fullness.
17.2.2. tm_queues
tm_queues are the second major type of "entity"/object that a TM system is composed of. All packets MUST first enter the TM system via some tm_queue. Then logically, the head packets of all of the tm_queues are examined simultaneously by the entire TM system, and ONE tm_queue is chosen to send its head packet out of the TM system’s egress. Abstractly packets stay in the tm_queue until they are chosen, at which time they are instantly transferred from tm_queue to/through the corresponding TM egress. It is also important to note that packets in the same tm_queue MUST always stay in order. In other words, the second packet in a tm_queue must never leave the TM system through a TM egress spigot before the first packet has left the system. So tm_queue packet order must always be maintained.
17.2.3. TM egress
Note that TM egress objects are NOT referred to as queues, because in many/most cases they don’t have multi-packet structure but instead are viewed as a port/spigot through which the TM system schedules and finally transfers input packets.
17.3. Ideal versus Actual Behavior
It is important to recognize the difference between the "abstract" mathematical model of the prescribed behavior and real implementations. The model describes the Ideal, but theoretically desired behavior, but such an Ideal is generally not practical to implement. Instead, one understands that virtually all Real TM systems attempt to approximate the Ideal behavior as given by the TM configuration as best as they can - while still attaining high packet processing performance. The idea is that instead of trying too hard to be "perfect" at the granularity of say microseconds, it may be better to instead try to match the long term Ideal behavior over a much more reasonable period of time like a millisecond. It is generally better to have a stable implementation that when averaged over a period of several milliseconds matches the Ideal behavior very closely than to have an implementation that is perhaps more accurate over a period of microseconds, but whose millisecond averaged behavior drifts away from the Ideal case.
17.4. Other TM Concepts
17.4.1. Profiles
This specification often packages related TM system parameters into records/objects called profiles. These profiles can then be associated with various entities like tm_nodes and tm_queues. This reduces the amount of storage needed to hold related parameters, makes it easy to re-use the same parameter set over and over again, and allows a parameter set to be changed once and have the change affect all of the entities with which it is associated.
17.4.2. Absolute Limits versus odp_tm_capability_t
The TM API defines some constants representing the absolute maximum settings for any TM system, though in most cases a TM system can (and should) be created/instantiated with smaller values, since lower values will often result in faster operation and/or less memory used.
17.4.3. Packet Marking
The Packet Marking API is used to mark the packet based upon the final packet color assigned to it when it reaches the egress node. This is an optional feature and, if available on the platform, it is used to reflect the packet color in the IPv4/IPv6 DiffServ field in accordance with RFC 2474. Three different packet marking fields are supported:
-
Assured Forwarding in accordance with RFC 2597: the DSCP is marked to set the packet Drop Precedence in accordance with the color, i.e., High Drop Precedence for RED, Medium Drop Precedence for YELLOW, and the DSCP is left unchanged if the color is GREEN.
-
Explicit Congestion Notification per RFC 3168: a router encountering congestion can signal it by setting the lower 2 bits of the DiffServ field to the "11" Congestion Encountered code, which will ultimately reduce the transmission rate of the packet sender.
-
The IEEE 802.1q VLAN tag header contains a DE (Drop Eligibility) bit for marking a packet for downstream switches; this is valid only for Ethernet packets containing a VLAN tag.
RFC 3168 is only valid for TCP packets whereas RFC 2597 is valid for IPv4/IPv6 traffic.
The values are set per color and hence the implementation may support these parameters only for specific colors. The marking_colors_supported field in the capabilities structure can be used to check which colors are supported for marking.
17.4.4. VLAN Marking.
VLAN marking is used to enable drop eligibility marking on a packet
based on the packet color. If drop eligibility is enabled then the
implementation will set the one bit VLAN Drop Eligibility Indicator (DEI)
field (but only for packets that already carry a VLAN tag) of a packet based
upon the final packet color assigned to the packet when it reaches the egress
node. When drop_eligible_enabled is false, then the given color has
no effect on the VLAN fields. See IEEE 802.1q for more details.
vlan_marking_supported
boolean in capability structure indicates whether this
feature is supported by the implementation.
17.4.5. Explicit Congestion Notification Marking.
The odp_tm_ecn_marking()
function allows one to configure the TM
egress so that the two bit ECN subfield of the eight bit TOS field of an
IPv4 packet, OR of the eight bit Traffic Class (TC) field of an IPv6 packet, can be
selectively modified based upon the final color assigned to the packet when it
reaches the egress. Note that the IPv4 header checksum will be updated -
but only if the IPv4 TOS field actually changes as a result of this
setting or the odp_tm_drop_prec_marking()
setting. For IPv6, since there is
no header checksum, nothing needs to be done. If ECN marking is enabled for a
particular color then the ECN subfield will be set to ECN_CE, i.e., Congestion
Experienced.
ecn_marking_supported
boolean in capability structure indicates whether this
feature is supported by the implementation.
17.4.6. Drop Precedence Marking.
The Drop Precedence marking allows one to configure the TM egress to support Assured Forwarding in accordance with RFC 2597. The Drop Precedence bits are contained within the six bit Differentiated Services Code Point subfield of the IPv4 TOS field or the IPv6 Traffic Class (TC) field. Specifically the Drop Precedence sub-subfield can be accessed with a DSCP bit mask of 0x06. When enabled for a given color, these two bits will be set to Medium Drop Precedence (value 0x4) if the color is ODP_PACKET_YELLOW, or to High Drop Precedence (value 0x6) if the color is ODP_PACKET_RED.
Note that the IPv4 header checksum will be updated - but only if the
IPv4 TOS field actually changes as a result of this setting or the
odp_tm_ecn_marking()
setting. For IPv6, since there is no header checksum,
nothing else needs to be done.
drop_prec_marking_supported
boolean in capability structure indicates whether
this feature is supported by the implementation.
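Taken together, the three marking features might be enabled per color roughly as follows. The function signatures (a TM handle, a color, and an enable flag) are assumptions based on the API names above and should be verified against the traffic manager header of the ODP release in use; tm is assumed to be an odp_tm_t created earlier:
/* Mark "excess" (YELLOW/RED) traffic on this TM egress */

/* VLAN DEI bit, only on packets that already carry a VLAN tag */
odp_tm_vlan_marking(tm, ODP_PACKET_YELLOW, 1);
odp_tm_vlan_marking(tm, ODP_PACKET_RED, 1);

/* ECN: set ECN_CE on ECN-capable IP packets of these colors */
odp_tm_ecn_marking(tm, ODP_PACKET_YELLOW, 1);
odp_tm_ecn_marking(tm, ODP_PACKET_RED, 1);

/* DSCP Drop Precedence per RFC 2597 (Medium for YELLOW, High for RED) */
odp_tm_drop_prec_marking(tm, ODP_PACKET_YELLOW, 1);
odp_tm_drop_prec_marking(tm, ODP_PACKET_RED, 1);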
18. Classification (CLS)
ODP is a framework for software-based packet forwarding/filtering applications, and the purpose of the Packet Classification API is to enable applications to program the platform hardware or software implementation to assist in prioritization, classification and scheduling of each packet, so that the software application can run faster, scale better and adhere to QoS requirements.
The following API abstractions are not modeled after any existing product implementation, but are instead defined in terms of what a typical data-plane application may require from such a platform, without sacrificing simplicity or introducing ambiguity. Certain terms used within the context of existing products in relation to packet parsing and classification, such as access lists, are avoided so as not to suggest any relationship between the abstractions used within this API and any particular manner in which they may be implemented in hardware.
18.1. Functional Description
Following is the functionality that is required of the classification API and its underlying implementation. The details and order of the following list are informative, and are only intended to help convey the functional scope of a classifier and provide context for the API. In reality, implementations may execute many of these steps concurrently, or in a different order, while maintaining the evident dependencies:
1.
Apply a set of classification rules to the header of an incoming packet, identify the header fields, e.g. ethertype, IP version, IP protocol, transport layer port numbers, IP DiffServ, VLAN id, 802.1p priority.
2.
Store these fields as packet meta data for application use, and for the remainder of parser operations. The odp_pktio is also stored as one of the meta data fields for subsequent use.
3.
Compute an odp_cos (Class of Service) value from a subset of supported fields from 1) above.
4.
Based on the odp_cos from 3) above, select the odp_queue through which the packet is delivered to the application.
5.
Validate the packet data integrity (checksums, FCS) and correctness (e.g., length fields) and store the validation result, along with optional error layer and type indicator, in packet meta data. Optionally, if a packet fails validation, override the odp_cos selection in step 3 to a class of service designated for errored packets.
6.
Based on the odp_cos from 3) above, select the odp_buffer_pool that should be used to acquire a buffer to store the packet data and meta data.
7.
Allocate a buffer from the odp_buffer_pool selected in 6) above and logically[1] store the packet data and meta data to the allocated buffer, or in accordance with class-of-service drop policy and subject to pool buffer availability, optionally discard the packet.
8.
Enqueue the buffer into the odp_queue selected in 4) above.
The above is an abstract description of the classifier functionality, and may be applied to a variety of applications in many different ways. The ultimate meaning of how this functionality applies to an application also depends on other ODP modules, so the above may not complete a full depiction. For instance, the exact meaning of priority, which is a per-queue attribute is influenced by the ODP scheduler semantics, and the system behavior under stress depends on the ODP buffer pool module behavior.
For the sole purpose of illustrating the above abstract functionality, here is an example of a Layer-2 (IEEE 802.1D) bridge application: Such a forwarding application that also adheres to IEEE 802.1p/q priority, which has 8 traffic priority levels, might create 8 odp_buffer_pool instances, one for each PCP priority level, and 8 odp_queue instances one per priority level. Incoming packets will be inspected for a VLAN header; the PCP field will be extracted, and used to select both the pool and the queue. Because each queue will be assigned a priority value, the packets with highest PCP values will be scheduled before any packet with a lower PCP value. Also, in a case of congestion, buffer pools for lower priority packets will be depleted earlier than the pools containing packets of the high priority, and hence the lower priority packets will be dropped (assuming that is the only flow control method that is supported in the platform) while higher priority packets will continue to be received into buffers and processed.
18.2. Class of Service Creation and Binding
To program the classifier, a class-of-service instance must be created, which will contain the packet filtering resources that it may require. All subsequent calls refer to one or more of these resources.
Each class of service instance must be associated with a single queue or queue group, which will be the destination of all packets matching that particular filter. The queue assignment is implemented as a separate function call such that the queue may be modified at any time, without tearing down the filters that define the class of service. In other words, it is possible to change the destination queue for a class of service defined by its filters quickly and dynamically.
Optionally, on platforms that support multiple packet buffer pools, each class of service may be assigned a different pool such that when buffers are exhausted for one class of service, other classes are not negatively impacted and continue to be processed.
18.3. Default packet handling
There is an odp_cos_t
assigned to each port with the
odp_pktio_default_cos_set() function, which will function as the default
class-of-service for all packets received from an ingress port,
that do not match any of the filters defined subsequently.
At minimum this default class-of-service must have a queue and a
buffer pool assigned to it on platforms that support multiple packet buffer
pools. Multiple odp_pktio instances (i.e., multiple ports) may each have their
own default odp_cos, or may share an odp_cos with other ports, based on
application requirements.
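As a hedged sketch of the default-CoS setup described above (the odp_cls_cos_param_t field names queue, pool and drop_policy follow current ODP headers; pktio, dflt_queue and dflt_pool are assumed to have been created earlier by the application):
odp_cls_cos_param_t cos_param;
odp_cos_t default_cos;

odp_cls_cos_param_init(&cos_param);
cos_param.queue       = dflt_queue;      /* destination for unmatched packets */
cos_param.pool        = dflt_pool;       /* buffer pool for unmatched packets */
cos_param.drop_policy = ODP_COS_DROP_NEVER;

default_cos = odp_cls_cos_create("default-cos", &cos_param);
if (default_cos == ODP_COS_INVALID) {
	/* handle error */
}

/* Bind as the default CoS of the ingress port */
if (odp_pktio_default_cos_set(pktio, default_cos) != 0) {
	/* handle error */
}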
18.4. Error packet handling
Error class of service is assigned to an ingress port using the function
odp_pktio_error_cos_set()
. All the packets received with error from this
specific ingress port are assigned to this error class-of-service.
At minimum this error class-of-service must have a queue and a buffer pool
assigned to it. Multiple pktio instances (i.e., multiple ports) may each have
their own error class of service, or may share an error CoS with other ports,
based on application requirements.
18.5. Packet dropping
Each class of service has a drop_policy
configured during creation. The
valid values are ODP_COS_DROP_POOL and ODP_COS_DROP_NEVER. If the drop_policy
is set to ODP_COS_DROP_POOL then packets assigned to the CoS follow the
drop policy of the associated pool, i.e., depending on the Random Early Discard
or any other configuration of the pool the packet might get dropped. If the
drop_policy
is set to ODP_COS_DROP_NEVER then the Random Early Discard of the
pool is ignored.
During creation of the class of service, if the pool or queue is set to ODP_POOL_INVALID or ODP_QUEUE_INVALID, then any packet assigned to that specific CoS is dropped.
18.6. Packet Classification
For each odp_pktio port, the API allows the assignment of a class-of-service to a packet. The application can program a number of pattern matching rules that assign a class-of-service to packets with header fields matching specified values. Using these matching rules the application should be able, for example, to identify all packets containing VoIP traffic based on the protocol being UDP and a specific destination or source port number, and appropriately assign these packets a class-of-service that maps to a higher priority queue, assuring voice packets a low and bounded latency.
18.7. Packet meta data Elements
Here are the specific information elements that are stored within the packet meta data structure:
-
Protocol fields that are decoded and extracted by the parsing phase
-
The pool identifier that is selected for the packet
-
The ingress port identifier
-
The result of packet validation, including an indication of the type of error detected, if any
The ODP packet API module provides accessors for retrieving the above meta data fields from the container buffer in an implementation-independent manner.
18.8. Example configuration
CoS configuration can be best illustrated by drawing a tree, where each CoS is the vertex, and each link between any two vertices is a PMR. The root node for the tree is the default CoS which is attached with the pktio interface. All of the CoS vertices can be final for some packets, if these packets do not match any of the link PMRs.
odp_pktio_default_cos_set(odp_pktio_t pktio, odp_cos_t default_cos);
pmr1 = odp_cls_pmr_create(pmr_match1, default_cos, cos1);
pmr2 = odp_cls_pmr_create(pmr_match2, default_cos, cos2);
pmr3 = odp_cls_pmr_create(pmr_match3, default_cos, cos3);
pmr11 = odp_cls_pmr_create(pmr_match11, cos1, cos11);
pmr12 = odp_cls_pmr_create(pmr_match12, cos1, cos12);
pmr21 = odp_cls_pmr_create(pmr_match11, cos2, cos21);
pmr31 = odp_cls_pmr_create(pmr_match11, cos3, cos31);
The above configuration DOES imply order - a packet that matches pmr_match1 will then be applied to pmr_match11 and pmr_match12, and as a result could terminate with either cos1, cos11, or cos12. In this case the packet was subjected to two levels of match attempts in total.
The remaining two lines illustrate how a packet that matches pmr_match11 could end up with either cos11, cos21 or cos31, depending on whether it matches pmr_match1, pmr_match2 or pmr_match3.
18.9. Practical example
Let’s look at DNS packets. These are identified by UDP port 53, but each UDP packet may run atop IPv4 or IPv6, and in turn an IP packet might be received as either multicast or unicast:
PMR-L2 = match all multicast/broadcast packets based on DMAC address
PMR_L3_IP4 = match all IPv4 packets
PMR_L3_IP6 = match all IPv6 packets
PMR_L4_UDP = match all UDP packets
PMR_L4_53 = match all packets with dest port = 53
odp_cls_pmr_create(PMR_L2, default_cos, default_cos_mc);
odp_cls_pmr_create(PMR_L3_IP4, default_cos, default_cos_ip4_uc);
odp_cls_pmr_create(PMR_L3_IP6, default_cos, default_cos_ip6_uc);
odp_cls_pmr_create(PMR_L3_IP4, default_cos_mc, default_cos_ip4_mc);
odp_cls_pmr_create(PMR_L3_IP6, default_cos_mc, default_cos_ip6_mc);
odp_cls_pmr_create(PMR_L4_UDP, default_cos_ip4_uc, cos_udp4_uc);
odp_cls_pmr_create(PMR_L4_UDP, default_cos_ip4_mc, cos_udp4_mc);
odp_cls_pmr_create(PMR_L4_UDP, default_cos_ip6_uc, cos_udp6_uc);
odp_cls_pmr_create(PMR_L4_UDP, default_cos_ip6_mc, cos_udp6_mc);
odp_cls_pmr_create(PMR_L4_53, cos_udp4_uc, dns4_uc);
odp_cls_pmr_create(PMR_L4_53, cos_udp4_mc, dns4_mc);
odp_cls_pmr_create(PMR_L4_53, cos_udp6_uc, dns6_uc);
odp_cls_pmr_create(PMR_L4_53, cos_udp6_mc, dns6_mc);
In this case, a packet may change CoS between 0 and 5 times, meaning that up to 5 PMRs may be applied in series, and the order of application follows the tree from the default CoS toward the leaf CoS vertices.
Another interesting point is that an implementation will probably impose a limit on how many PMRs can be applied to a packet in series, so in the above example, if an implementation limit on the number of consecutive classification steps is 4, then all the DNS packets may only reach the cos_udp?_?c set of vertices.
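Note that the odp_cls_pmr_create() calls above are written in a shorthand form; a fuller sketch of building the PMR_L4_53 match term might look like the following. The odp_pmr_param_t field names (term, match.value, match.mask, val_sz) follow current ODP headers, and the value is given in network byte order in this sketch:
odp_pmr_param_t pmr_param;
odp_pmr_t pmr;
uint16_t dport = odp_cpu_to_be_16(53);     /* DNS */
uint16_t mask  = odp_cpu_to_be_16(0xffff);

odp_cls_pmr_param_init(&pmr_param);
pmr_param.term        = ODP_PMR_UDP_DPORT;
pmr_param.match.value = &dport;
pmr_param.match.mask  = &mask;
pmr_param.val_sz      = sizeof(dport);

/* One match term; matching packets move from cos_udp4_uc to dns4_uc */
pmr = odp_cls_pmr_create(&pmr_param, 1, cos_udp4_uc, dns4_uc);
if (pmr == ODP_PMR_INVALID) {
	/* handle error */
}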
19. Utilities and examples
19.1. PcapNg capture
If compiled using --enable-pcapng-support
ODP will offer packet capturing
functionality in PcapNg format. If the /var/run/odp directory exists prior to
launching the application ODP will create a fifo for each NIC queue.
Queue naming will be of the following format: <odp global pid>-<NIC
name>-flow-<queue number>. The Linux dd utility can be used for capturing a
sample of the live stream from the fifo. Killing either the application or dd
will stop the capturing process.
-
./configure --enable-pcapng-support
-
sudo mkdir /var/run/odp
-
sudo ./test/performance/odp_packet_gen -i enp2s0 --eth_dst A0:F6:FD:AE:62:6C --ipv4_dst 192.168.49.20 --ipv4_src 192.168.49.4 -g 0
-
sudo dd if=/var/run/odp/26737-enp2s0-flow-0 of=~/test.pcap
-
ctrl^c
-
wireshark ~/test.pcap
20. Glossary
- worker thread
-
A worker is a type of ODP thread. It will usually be isolated from the scheduling of any host operating system and is intended for fast-path processing with a low and predictable latency. Worker threads will not generally receive interrupts and will run to completion.
- control thread
-
A control thread is a type of ODP thread. It will be isolated from the host operating system housekeeping tasks but will be scheduled by it and may receive interrupts.
- ODP instantiation process
-
The process calling
odp_init_global()
, which is typically the first process started when an ODP application is launched. There is a single such process per ODP instantiation. - thread
-
The word thread (without any further specification) refers to an ODP thread.
- ODP thread
-
An ODP thread is a flow of execution that belongs to ODP: Any "flow of execution" (i.e. OS process or OS thread) calling
odp_init_global()
, or odp_init_local()
becomes an ODP thread. This definition currently limits the number of ODP instances on a given machine to one. In the future odp_init_global()
will return something like an ODP instance reference and odp_init_local()
will take such a reference as a parameter, allowing threads to join any running ODP instance. Note that, in a Linux environment an ODP thread can be either a Linux process or a Linux thread (i.e. a Linux process calling odp_init_local()
will be referred to as an ODP thread, not an ODP process). - event
-
An event is a notification that can be placed in a queue.
- queue
-
A communication channel that holds events.