U.S. patent application number 14/182229, "Multi-Stage Large Send
Offload," was published by the patent office on 2014-06-12 as United
States Patent Application 20140161123 (Kind Code A1). The application
is assigned to Microsoft Corporation, which is also the listed
applicant. The invention is credited to John A. Starks and Keith L.
Mange.
MULTI-STAGE LARGE SEND OFFLOAD
Abstract
A network stack sends very large packets with large segment
offload (LSO) by performing multi-pass LSO. A first-stage LSO
filter is inserted between the network stack and the physical NIC.
The first-stage filter splits very large LSO packets into LSO
packets that are small enough for the NIC. The NIC then performs a
second pass of LSO by splitting these sub-packets into standard
MTU-sized networking packets for transmission on the network.
Inventors: Starks; John A. (Seattle, WA); Mange; Keith L. (Bellevue, WA)
Applicant: Microsoft Corporation, Redmond, WA, US
Assignee: Microsoft Corporation, Redmond, WA
Family ID: 44559934
Appl. No.: 14/182229
Filed: February 17, 2014
Related U.S. Patent Documents

Application Number | Filing Date  | Patent Number
12722434           | Mar 11, 2010 | 8654784
14182229           |              |
Current U.S. Class: 370/389
Current CPC Class: H04L 47/365 20130101; H04L 47/36 20130101
Class at Publication: 370/389
International Class: H04L 12/805 20060101 H04L012/805
Claims
1. A method for transmitting packets over a network, comprising:
receiving, at a first operating system operating on a computing
device, an indicator of a first large segment offload (LSO) packet
size, wherein the first LSO packet size is a multiple of a second
LSO packet size that is supported by a network interface card
connected to the computing device; formatting data into a first
packet of the first LSO packet size; transferring the first packet to
a second operating system on the same computing device; splitting
the first packet on the second operating system into multiple LSO
packets of the second LSO packet size; and sending the multiple LSO
packets to the network interface card for transmission on the
network in packets of a size supported by the network.
Description
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application is a continuation of U.S. patent
application Ser. No. 12/722,434, Attorney Docket No.
328846.01/104709.000629, entitled "MULTI-STAGE LARGE SEND OFFLOAD",
filed on Mar. 11, 2010, the entirety of which is incorporated
herein by reference.
BACKGROUND
[0002] When a guest computer system is emulated on a host computer
system, the guest computer system is called a "virtual machine" as
the guest computer system only exists in the host computer system
as a software representation of the operation of one specific
hardware configuration that may diverge from the native machine.
The virtual machine presents to the software operating on the
virtual machine an emulated hardware configuration.
[0003] A virtual machine management system (sometimes referred to
as a virtual machine monitor or a hypervisor) is also often
employed to manage one or more virtual machines so that multiple
virtual machines can run on a single computing device concurrently.
The virtual machine management system runs directly on the native
hardware and virtualizes the resources of the machine by exposing
interfaces to virtual machines for access to the underlying
hardware. A host operating system and a virtual machine management
system may run side-by-side on the same physical hardware. For
purposes of clarity, we will use the abbreviation VMM to refer to
all incarnations of a virtual machine management system.
[0004] One problem that occurs in the operating system
virtualization context relates to computing resources such as data
storage devices, data input and output devices, networking devices
etc. Because each of the host computing device's multiple operating
systems may have different functionality, there is a question as to
which computing resources should be apportioned to which of the
multiple operating systems. For example, a virtualized host
computing device may include only a single network interface card
(NIC) that enables the host computing device to communicate with
other networked computers. This scenario raises the question of
which of the multiple operating systems on the virtualized host
should be permitted to interact with and control the NIC.
[0005] When one of the operating systems controls the NIC, the
other operating systems send their packets to the network through the
operating system that controls the NIC. In such a case, the packet
size accepted by the NIC may not be known. However, sending network
TCP packets through a network stack is computationally expensive.
Resources must be allocated for each packet, and each component in
the networking stack typically examines each packet. This problem
is compounded in a virtualization environment, because each packet
is also transferred from the guest VM to the root operating
system. This entails a fixed overhead per packet that can be quite
large. On the other hand, the networking stack packet size is
normally limited by the maximum transmission unit (MTU) size of the
connection, e.g., 1500 bytes. It is not typically feasible to
increase the MTU size since it is limited by network
infrastructure.
[0006] Hardware NICs provide a feature called "Large Send Offload"
(LSO) that allows larger TCP packets to travel through the stack
all the way to the NIC. Since most of the per-packet cost is fixed,
this helps considerably, but NICs typically support LSO packets
that are fairly small, around 62 KB. There is a need for the
transmission between operating systems of larger packets to reduce
overhead.
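The savings LSO buys can be sketched with some back-of-the-envelope arithmetic (the byte counts here are assumptions for illustration, not figures from the text): with a 1500-byte MTU and 40 bytes of TCP/IP headers per wire packet, each packet carries about 1460 bytes of payload, so one LSO packet near the 62 KB limit replaces dozens of per-packet traversals of the stack.

```python
# Illustrative arithmetic: how many MTU-sized wire packets a single
# ~62 KB LSO send replaces in the stack. Header sizes assume a plain
# 20-byte IPv4 header plus a 20-byte TCP header (no options).
MTU = 1500
HEADER_BYTES = 40          # 20-byte IP header + 20-byte TCP header
MSS = MTU - HEADER_BYTES   # 1460 bytes of payload per wire packet

lso_packet_size = 62 * 1024  # roughly the LSO size many NICs support

# Without LSO, the stack (and the guest/host boundary) pays the fixed
# per-packet cost once per MSS-sized segment; with LSO, once per send.
segments = -(-lso_packet_size // MSS)  # ceiling division
print(f"One {lso_packet_size}-byte LSO packet replaces {segments} "
      f"MTU-sized packets in the stack")
```

The multi-stage scheme described below raises the first number further, so the fixed guest-to-host transfer cost is amortized over even more data.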
SUMMARY
[0007] The embodiments described allow a network stack to send very
large packets, larger than a physical NIC typically supports with
large segment offload (LSO). In general, this is accomplished by
performing multi-pass LSO. A first-stage LSO filter is inserted
somewhere between the network stack and the physical NIC. The
first-stage filter splits very large LSO packets into LSO packets
that are small enough for the NIC. The NIC then performs a second
pass of LSO by splitting these sub-packets into standard MTU-sized
networking packets for transmission on the network.
[0008] To that end, a first operating system operating on a
computing device receives an indicator of a first LSO packet size.
The first LSO packet size is a multiple of a second LSO packet size
that is supported by a network interface card connected to the
computing device. The first operating system formats data (e.g.,
from an application) into a first packet of a first LSO packet
size. The first packet is then transferred to a second operating
system on the same computing device that has access to a network
interface card. The first packet is then split on the second
operating system into multiple LSO packets of a second LSO packet
size that can be consumed by the network interface card. The
multiple LSO packets are sent to the network interface card for
transmission on the network in packets of a size supported by the
network.
[0009] In general, the first operating system is executing on a
virtual machine and the indicator of a first LSO packet size is
received from a hypervisor operating on the same computing device.
The virtual machine can be migrated to a second computing device
and another indicator of a first LSO packet size is received from a
hypervisor operating on the second computing device. The indicator
of the first LSO packet size received from the hypervisor operating
on the second computing device can differ from the indicator of
the first LSO packet size received from the hypervisor on the first
computing device. Consequently, the indicator of the first LSO size
received from each hypervisor can be tuned for the specific computing
device's CPU usage, throughput, latency, or any combination thereof.
[0010] In general, the first packet has a TCP header. The packet
header from the first packet is copied to each of the second
LSO-sized packets when they are split out.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] The foregoing summary, as well as the following detailed
description of preferred embodiments, is better understood when
read in conjunction with the appended drawings. For the purpose of
illustrating the invention, there is shown in the drawings
exemplary constructions of the invention; however, the invention is
not limited to the specific methods and instrumentalities
disclosed. In the drawings:
[0012] FIG. 1 is a block diagram representing a computer system in
which aspects of the present invention may be incorporated;
[0013] FIG. 2 illustrates a virtualized computing system
environment;
[0014] FIG. 3 illustrates the communication of networking data across
a virtualization boundary;
[0015] FIG. 4 is a flow diagram of the packet processing in
accordance with an aspect of the invention; and
[0016] FIG. 5 is a flow diagram of the processing performed by the
virtual switch according to an aspect of the invention.
DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS
[0017] The inventive subject matter is described with specificity
to meet statutory requirements. However, the description itself is
not intended to limit the scope of this patent. Rather, the
inventor has contemplated that the claimed subject matter might
also be embodied in other ways, to include different combinations
similar to the ones described in this document, in conjunction with
other present or future technologies.
[0018] Numerous embodiments of the present invention may execute on
a computer. FIG. 1 and the following discussion are intended to
provide a brief, general description of a suitable computing
environment in which the invention may be implemented. Although not
required, the invention will be described in the general context of
computer executable instructions, such as program modules, being
executed by a computing device, such as a client workstation or a
server. Generally, program modules include routines, programs,
objects, components, data structures and the like that perform
particular tasks. Those skilled in the art will appreciate that the
invention may be practiced with other computer system
configurations, including handheld devices, multiprocessor
systems, microprocessor-based or programmable consumer electronics,
network PCs, minicomputers, mainframe computers and the like. The
invention may also be practiced in distributed computing
environments where tasks are performed by remote processing devices
that are linked through a communications network. In a distributed
computing environment, program modules may be located in both local
and remote memory storage devices.
[0019] Referring now to FIG. 1, an exemplary general purpose
computing system is depicted. The general purpose computing system
can include a conventional computer 20 or the like, including at
least one processor or processing unit 21, a system memory 22, and
a system bus 23 that communicatively couples various system
components, including the system memory, to the processing unit 21
when the system is in an operational state. The system bus 23 may
be any of several types of bus structures including a memory bus or
memory controller, a peripheral bus, and a local bus using any of a
variety of bus architectures. The system memory can include read
only memory (ROM) 24 and random access memory (RAM) 25. A basic
input/output system 26 (BIOS), containing the basic routines that
help to transfer information between elements within the computer
20, such as during start up, is stored in ROM 24. The computer 20
may further include a hard disk drive 27 for reading from and
writing to a hard disk (not shown), a magnetic disk drive 28 for
reading from or writing to a removable magnetic disk 29, and an
optical disk drive 30 for reading from or writing to a removable
optical disk 31 such as a CD ROM or other optical media. The hard
disk drive 27, magnetic disk drive 28, and optical disk drive 30
are shown as connected to the system bus 23 by a hard disk drive
interface 32, a magnetic disk drive interface 33, and an optical
drive interface 34, respectively. The drives and their associated
computer readable media provide non volatile storage of computer
readable instructions, data structures, program modules and other
data for the computer 20. Although the exemplary environment
described herein employs a hard disk, a removable magnetic disk 29
and a removable optical disk 31, it should be appreciated by those
skilled in the art that other types of computer readable media
which can store data that is accessible by a computer, such as
magnetic cassettes, flash memory cards, digital video disks,
Bernoulli cartridges, random access memories (RAMs), read only
memories (ROMs) and the like may also be used in the exemplary
operating environment. Generally, such computer readable storage
media can be used in some embodiments to store processor executable
instructions embodying aspects of the present disclosure.
[0020] A number of program modules comprising computer-readable
instructions may be stored on computer-readable media such as the
hard disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25,
including an operating system 35, one or more application programs
36, other program modules 37 and program data 38. Upon execution by
the processing unit, the computer-readable instructions cause the
actions described in more detail below to be carried out or cause
the various program modules to be instantiated. A user may enter
commands and information into the computer 20 through input devices
such as a keyboard 40 and pointing device 42. Other input devices
(not shown) may include a microphone, joystick, game pad, satellite
disk, scanner or the like. These and other input devices are often
connected to the processing unit 21 through a serial port interface
46 that is coupled to the system bus, but may be connected by other
interfaces, such as a parallel port, game port or universal serial
bus (USB). A display 47 or other type of display device can also be
connected to the system bus 23 via an interface, such as a video
adapter 48. In addition to the display 47, computers typically
include other peripheral output devices (not shown), such as
speakers and printers. The exemplary system of FIG. 1 also includes
a host adapter 55, Small Computer System Interface (SCSI) bus 56,
and an external storage device 62 connected to the SCSI bus 56.
[0021] The computer 20 may operate in a networked environment using
logical connections to one or more remote computers, such as a
remote computer 49. The remote computer 49 may be another computer,
a server, a router, a network PC, a peer device or other common
network node, and typically can include many or all of the elements
described above relative to the computer 20, although only a memory
storage device 50 has been illustrated in FIG. 1. The logical
connections depicted in FIG. 1 can include a local area network
(LAN) 51 and a wide area network (WAN) 52. Such networking
environments are commonplace in offices, enterprise wide computer
networks, intranets and the Internet.
[0022] When used in a LAN networking environment, the computer 20
can be connected to the LAN 51 through a network interface or
adapter 53. When used in a WAN networking environment, the computer
20 can typically include a modem 54 or other means for establishing
communications over the wide area network 52, such as the Internet.
The modem 54, which may be internal or external, can be connected
to the system bus 23 via the serial port interface 46. In a
networked environment, program modules depicted relative to the
computer 20, or portions thereof, may be stored in the remote
memory storage device. It will be appreciated that the network
connections shown are exemplary and other means of establishing a
communications link between the computers may be used. Moreover,
while it is envisioned that numerous embodiments of the present
disclosure are particularly well-suited for computerized systems,
nothing in this document is intended to limit the disclosure to
such embodiments.
[0023] Referring now to FIG. 2, it depicts a high level block
diagram of computer systems that can be used in embodiments of the
present disclosure. As shown by the figure, computer 20 (e.g.,
computer system described above) can include physical hardware
devices such as a storage device 208, e.g., a hard drive (such as
27 in FIG. 1), a network interface controller (NIC) 53, a graphics
processing unit 234 (such as would accompany video adapter 48 from
FIG. 1), at least one logical processor 212 (e.g., processing unit
21 from FIG. 1), and random access memory (RAM) 25. One skilled in the
art can appreciate that while one logical processor is illustrated,
in other embodiments computer 20 may have multiple logical
processors, e.g., multiple execution cores per processor and/or
multiple processors that could each have multiple execution cores.
Depicted is a hypervisor 202 that may also be referred to in the
art as a virtual machine monitor or more generally as a virtual
machine manager. The hypervisor 202 in the depicted embodiment
includes executable instructions for controlling and arbitrating
access to the hardware of computer 20. Broadly, the hypervisor 202
can generate execution environments called partitions such as child
partition 1 through child partition N (where N is an integer
greater than 1). In embodiments a child partition can be considered
the basic unit of isolation supported by the hypervisor 202, that
is, each child partition can be mapped to a set of hardware
resources, e.g., memory, devices, logical processor cycles, etc.,
that is under control of the hypervisor 202 and/or the parent
partition. In embodiments the hypervisor 202 can be a stand-alone
software product, a part of an operating system, embedded within
firmware of the motherboard, specialized integrated circuits, or a
combination thereof.
[0024] In the depicted example configuration, the computer 20
includes a parent partition 204 that can be configured to provide
resources to guest operating systems executing in the child
partitions 1-N by using virtualization service providers 228
(VSPs). In this example architecture the parent partition 204 can
gate access to the underlying hardware. Broadly, the VSPs 228 can
be used to multiplex the interfaces to the hardware resources by
way of virtualization service clients (VSCs). Each child partition
can include a virtual processor such as virtual processors 230
through 232 that guest operating systems 220 through 222 can manage
and schedule threads to execute thereon. Generally, the virtual
processors 230 through 232 are executable instructions and
associated state information that provide a representation of a
physical processor with a specific architecture. For example, one
virtual machine may have a virtual processor having characteristics
of an Intel x86 processor, whereas another virtual processor
may have the characteristics of a PowerPC processor. The virtual
processors in this example can be mapped to logical processors of
the computer system such that the instructions that effectuate the
virtual processors will be backed by logical processors. Thus, in
these example embodiments, multiple virtual processors can be
simultaneously executing while, for example, another logical
processor is executing hypervisor instructions. Generally speaking,
the combination of the virtual processors and various VSCs in a
partition can be considered a virtual machine.
[0025] Generally, guest operating systems 220 through 222 can
include any operating system such as, for example, operating
systems from Microsoft®, Apple®, the open source community,
etc. The guest operating systems can include user/kernel modes of
operation and can have kernels that can include schedulers, memory
managers, etc. Each guest operating system 220 through 222 can have
associated file systems that can have applications stored thereon
such as e-commerce servers, email servers, etc., and the guest
operating systems themselves. The guest operating systems 220-222
can schedule threads to execute on the virtual processors 230-232
and instances of such applications can be effectuated.
[0026] FIG. 3 is a block diagram representing an exemplary
virtualized computing device where a first operating system (host
OS 302) controls the Network Interface Device 53. Network Interface
Device 53 provides access to network 300. Network interface device
53 may be, for example, a network interface card (NIC). Network
interface device driver 310 provides code for accessing and
controlling network interface device 53. Host network stack 330 and
guest network stack 340 each provide one or more modules for
processing outgoing data for transmission over network 300 and for
processing incoming data that is received from network 300. Network
stacks 330 and 340 may, for example, include modules for processing
data in accordance with well-known protocols such as the Point-to-Point
Protocol (PPP), Transmission Control Protocol (TCP), and Internet
Protocol (IP). Host networking application 350 and guest networking
application 360 are applications executing on host operating system
204 and guest operating system 220, respectively, that access
network 300.
[0027] As mentioned above, in conventional computing devices which
adhere to the traditional virtualization boundary, data does not
pass back and forth between virtualized operating systems. Thus,
for example, in conventional configurations, when data is
transferred between host networking application 350 and network
300, the data is passed directly from the host network stack 330 to
the network interface device driver 310. However, in the system of
FIG. 3, data does not pass directly from the host network stack 330
to the network interface device driver 310. Rather, the data is
intercepted by virtual switch 325. Virtual switch 325 provides
functionality according to an aspect of the invention.
[0028] Because the guest OS does not have direct access to the NIC,
when the virtual NIC starts, the hypervisor advertises an LSO size
to the networking stack indicating that the NIC is capable of LSO
with a large packet size. LSO increases throughput by reducing the
amount of processing that is necessary for smaller packet sizes. In
general, large packets are given to the NIC and the NIC breaks the
packets into smaller packet sizes in hardware, relieving the CPU of
the work. For example, a 64 KB LSO is segmented into smaller
segments and then sent out over the network through the NIC. By
advertising an LSO packet size to the virtual NIC on the guest OS
that is larger than the LSO-sized packets that are accepted by the
NIC, the networking stack will pass much larger packets to the
virtual NIC. The virtual NIC in turn will transfer the large
packets to the virtual switch.
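The relationship between the advertised size and the hardware size follows claim 1: the size the hypervisor advertises to the guest is a whole multiple of the physical NIC's LSO size, so a later split always yields full-sized sub-packets. A minimal sketch, with the multiplier as an assumed tuning knob rather than a value from the text:

```python
# Hypothetical sketch of the advertisement step. Per claim 1, the
# first (guest-visible) LSO size is a multiple of the second
# (hardware) LSO size; the multiplier here is an assumed per-host
# tuning parameter, since the text says the value is tuned for CPU
# use, throughput, and latency on each machine.
def advertised_lso_size(hw_lso_size: int, multiplier: int = 4) -> int:
    """Return the LSO size the hypervisor advertises to the guest's
    virtual NIC: a whole multiple of what the physical NIC accepts."""
    return hw_lso_size * multiplier

hw_size = 62 * 1024                      # e.g., what the physical NIC accepts
guest_size = advertised_lso_size(hw_size)
print(guest_size)
```

Because the advertised size divides evenly by the hardware size, the first-stage split in the virtual switch produces no undersized middle chunks.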
[0029] This causes the networking stack to format and send packets
that are much larger than the MTU size supported by the underlying
networking infrastructure, and much larger than the size supported
by the physical NIC to which the virtual NIC is attached. The packets are large
chunks of data that are larger than a standard TCP packet, but with
a TCP header. The precise LSO size is tuned to optimize for
performance: CPU use, throughput, and latency, whereas previous
solutions would choose the largest value expected to be supported
by the underlying hardware NIC.
[0030] Normally this packet is sent all the way to the hardware as
an LSO packet, or it is entirely split in software by a software
LSO engine to MTU size. Instead, at some point before sending the
packet to the hardware, it is split into multiple packets, each with
a size no greater than that supported by the hardware's LSO
engine, and the new packets are then sent to the hardware NIC. This step
can occur any time before the packet is sent to hardware, but the
closer to the hardware that it is performed, the better the
performance.
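The first-stage split described above can be sketched as a simple payload-chunking pass; this is an illustrative sketch, not the patented implementation, and it deliberately ignores the header fix-ups covered in the next paragraph:

```python
# Minimal sketch of the first-stage LSO split: carve a very large LSO
# payload into sub-payloads no larger than the hardware LSO engine's
# maximum. Header handling is covered separately.
def first_stage_split(payload: bytes, hw_lso_max: int) -> list[bytes]:
    """Split an oversized LSO payload into hardware-sized LSO chunks."""
    return [payload[i:i + hw_lso_max]
            for i in range(0, len(payload), hw_lso_max)]

chunks = first_stage_split(b"x" * 200_000, hw_lso_max=62 * 1024)
print([len(c) for c in chunks])   # a few full chunks plus a remainder
```

Only the final chunk may be smaller than the hardware maximum; every other chunk arrives at the NIC at its full LSO size, keeping the second-stage hardware split efficient.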
[0031] This is accomplished with an LSO algorithm, by copying the
packet headers to each sub-packet and adjusting the TCP sequence
number, identification field (for IPv4), and header flags.
Preferably, the IP or TCP checksums are not calculated, as is
normally required by LSO, because that will be performed by the
hardware NIC. Similarly, the length field in the IP headers is not
updated, nor is the TCP pseudo-checksum, as this would interfere
with the NIC's later computation of these fields while performing
hardware LSO.
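The per-sub-packet header adjustments can be sketched as follows. The field names and the ID arithmetic are illustrative assumptions, not a real NIC API: the TCP sequence number advances by the payload already covered, and the IPv4 identification field advances by the number of wire segments the NIC will emit for the earlier chunks. As the text notes, lengths and checksums are deliberately left untouched for the hardware's own LSO pass.

```python
# Hedged sketch of the header fix-ups in [0031]: copy the original
# headers to each sub-packet, adjusting only the TCP sequence number
# and the IPv4 identification field. Length and checksum fields are
# intentionally absent here, since the text says the hardware NIC
# fills them in during its own LSO pass.
import math
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Headers:
    tcp_seq: int   # TCP sequence number
    ipv4_id: int   # IPv4 identification field

def sub_packet_headers(orig: Headers, chunk_sizes: list[int],
                       mss: int = 1460) -> list[Headers]:
    """One header copy per sub-packet: seq advances by payload covered,
    ID by the wire segments the NIC will emit for earlier chunks."""
    out, seq, ident = [], orig.tcp_seq, orig.ipv4_id
    for size in chunk_sizes:
        out.append(replace(orig, tcp_seq=seq & 0xFFFFFFFF,
                           ipv4_id=ident & 0xFFFF))
        seq += size
        ident += math.ceil(size / mss)   # wire packets for this chunk
    return out

hdrs = sub_packet_headers(Headers(tcp_seq=1000, ipv4_id=7),
                          chunk_sizes=[63488, 63488, 9536])
print([(h.tcp_seq, h.ipv4_id) for h in hdrs])
```

Masking to 32 and 16 bits mirrors the wraparound behavior of the real TCP sequence and IPv4 identification fields.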
[0032] Finally, the software LSO driver must wait to complete the
full packet to the sender until all sub-packets have been sent by
the NIC and are completed. This is achieved by keeping a count of
outstanding sub-packets that have not yet completed, and completing
the full packet when this count reaches zero.
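The completion bookkeeping above can be sketched with a simple guarded counter. This is an assumed abstraction for illustration (the actual driver would use the host's native completion mechanism, e.g. NDIS send-complete callbacks), but the invariant is the same: the original large packet completes exactly once, when the last sub-packet does.

```python
# Sketch of the completion counting in [0032]: the full packet is
# completed back to the sender only when every sub-packet handed to
# the NIC has itself completed. The class name and callback shape are
# illustrative, not a real driver API.
import threading

class PendingLso:
    def __init__(self, sub_packet_count: int, on_complete):
        self._lock = threading.Lock()
        self._outstanding = sub_packet_count
        self._on_complete = on_complete  # invoked once, at count zero

    def sub_packet_completed(self) -> None:
        """Called as the NIC completes each sub-packet."""
        with self._lock:
            self._outstanding -= 1
            done = self._outstanding == 0
        if done:
            self._on_complete()

completed = []
pending = PendingLso(3, on_complete=lambda: completed.append("full packet"))
for _ in range(3):
    pending.sub_packet_completed()
print(completed)
```

The lock matters because sub-packet completions may arrive on different threads; decrement and test happen atomically so the callback cannot fire twice.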
[0033] FIG. 4, in conjunction with FIG. 3, illustrates the flow
described above in more detail. In particular, at 402, the data stream
from the network stack 340 arrives at the virtual NIC driver 342.
At 404, the virtual NIC driver 342 configures a packet that is as
large as the hypervisor will allow. The LSO size of the packet is
preferably much larger than the LSO packet size supported by NIC 53
in the host partition. The virtual NIC driver 342 then transfers
the data to the virtual switch 325 by communication services
provided by the hypervisor. At 406, the virtual switch 325 then
splits the large format LSO into LSO packets that conform to the
LSO packet size 408 supported by the NIC hardware 53.
[0034] At 410, the LSO engine of the NIC hardware 53 splits the LSO
packets into MTU-sized packets 412 supported by the network
infrastructure. Those packets are then transmitted over the
network.
[0035] FIG. 5 further illustrates the processing performed in
virtual switch 325 of FIG. 3. At 505, virtual switch 325 receives
the packet from virtual NIC driver 342. Thereafter, at 507, virtual
switch 325 determines if the NIC hardware 53 supports LSO. If LSO
is supported, virtual switch 325 determines whether the packet is
no larger than the LSO packet size the NIC supports. If so, at 511,
the packet is sent to the NIC hardware 53 without further processing.
If not, the oversized LSO packet is subdivided into LSO packets of a
supported size. On the other hand, at 507, if LSO is not supported,
then the packets are divided into NIC-supported packets in software.
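The decision logic of FIG. 5 can be sketched as a small dispatch function. The helper name and the use of raw byte slices are illustrative assumptions; a real virtual switch would also perform the header fix-ups of [0031], and a software MTU split would target the MSS rather than the full MTU:

```python
# Sketch of the FIG. 5 decision tree: pass through packets that
# already fit the NIC's LSO limit, split oversized packets down to
# the hardware LSO size, and fall back to an MTU-sized software
# split when the NIC does not support LSO at all.
def switch_packet(payload: bytes, nic_supports_lso: bool,
                  hw_lso_max: int, mtu: int) -> list[bytes]:
    limit = hw_lso_max if nic_supports_lso else mtu
    if len(payload) <= limit:
        return [payload]                   # send unmodified
    return [payload[i:i + limit]           # subdivide to the limit
            for i in range(0, len(payload), limit)]

big = b"x" * 150_000
print(len(switch_packet(big, True, 62 * 1024, 1500)))   # a few LSO chunks
print(len(switch_packet(big, False, 62 * 1024, 1500)))  # many MTU packets
```

The two branches make the performance point concrete: with hardware LSO the switch hands off a handful of large chunks, while the software fallback must produce two orders of magnitude more packets.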
[0036] The techniques described allow the virtual machine to be
migrated from one system to another and maximize the performance on
each system, preferably tailored to the NIC hardware on each
system. To that end, when the virtual NIC driver loads on the
target system, the hypervisor provides an LSO packet size that is
then used to send the maximum-sized packet to the partition that
controls the NIC hardware. This allows an oversized packet to be
determined for each system based on maximizing throughput or other
parameters that may be desirable on the target system.
[0037] The various systems, methods, and techniques described
herein may be implemented with hardware or software or, where
appropriate, with a combination of both. Thus, the methods and
apparatus of the present invention, or certain aspects or portions
thereof, may take the form of program code (i.e., instructions)
embodied in tangible media, such as floppy diskettes, CD-ROMs, hard
drives, or any other machine-readable storage medium, wherein, when
the program code is loaded into and executed by a machine, such as
a computer, the machine becomes an apparatus for practicing the
invention. In the case of program code execution on programmable
computers, the computer will generally include a processor, a
storage medium readable by the processor (including volatile and
non-volatile memory and/or storage elements), at least one input
device, and at least one output device. One or more programs are
preferably implemented in a high level procedural or object
oriented programming language to communicate with a computer
system. However, the program(s) can be implemented in assembly or
machine language, if desired. In any case, the language may be a
compiled or interpreted language, and combined with hardware
implementations.
[0038] The methods and apparatus of the present invention may also
be embodied in the form of program code that is transmitted over
some transmission medium, such as over electrical wiring or
cabling, through fiber optics, or via any other form of
transmission, wherein, when the program code is received and loaded
into and executed by a machine, such as an EPROM, a gate array, a
programmable logic device (PLD), a client computer, a video
recorder or the like, the machine becomes an apparatus for
practicing the invention. When implemented on a general-purpose
processor, the program code combines with the processor to provide
a unique apparatus that operates to perform the indexing
functionality of the present invention.
[0039] Consequently, the network stack can send very large packets,
larger than a physical NIC normally supports with LSO. This is
accomplished by performing multi-pass LSO; a first-stage LSO switch
is inserted somewhere between the network stack and the physical
NIC that splits very large LSO packets into LSO packets that are
small enough for the NIC. The NIC then performs a second pass of
LSO by splitting these sub-packets into standard MTU-sized
networking packets for transmission on the network.
[0040] While the present invention has been described in connection
with the preferred embodiments of the various figures, it is to be
understood that other similar embodiments may be used or
modifications and additions may be made to the described embodiment
for performing the same function of the present invention without
deviating therefrom. For example, while exemplary embodiments of
the invention are described in the context of digital devices
emulating the functionality of personal computers, one skilled in
the art will recognize that the present invention is not limited to
such digital devices, as the embodiments described in the present
application may apply to any number of existing or emerging computing
devices or environments, such as a gaming console, handheld computer,
portable computer, etc., whether wired or wireless, and may be applied to any
number of such computing devices connected via a communications
network, and interacting across the network. Furthermore, it should
be emphasized that a variety of computer platforms, including
handheld device operating systems and other application specific
hardware/software interface systems, are herein contemplated,
especially as the number of wireless networked devices continues to
proliferate. Therefore, the present invention should not be limited
to any single embodiment, but rather construed in breadth and scope
in accordance with the appended claims.
[0041] Finally, the disclosed embodiments described herein may be
adapted for use in other processor architectures, computer-based
systems, or system virtualizations, and such embodiments are
expressly anticipated by the disclosures made herein and, thus, the
present invention should not be limited to specific embodiments
described herein but instead construed most broadly. Likewise, the
use of synthetic instructions for purposes other than processor
virtualization is also anticipated by the disclosures made herein,
and any such utilization of synthetic instructions in contexts
other than processor virtualization should be most broadly read
into the disclosures made herein.
* * * * *