Packet Compression for Network Packet Traffic Analysis Black; Richard John [Microsoft Corporation]

Patent Application Summary

U.S. patent application number 12/848391, for packet compression for network packet traffic analysis, was filed with the patent office on 2010-08-02 and published on 2010-11-18. This patent application is currently assigned to Microsoft Corporation. Invention is credited to Richard John Black.

Application Number	20100290364 12/848391
Family ID	41267687
Filed Date	2010-08-02

United States Patent Application	20100290364
Kind Code	A1
Black; Richard John	November 18, 2010

Packet Compression for Network Packet Traffic Analysis

Abstract

Methods of capturing and compressing trace data for use in network packet traffic analysis are described. In an embodiment, when a packet is received, two records of the packet are created and stored. One record is stored in a file associated with the source address of the packet and the other record is stored in a file associated with the destination address of the packet. Various packet compression techniques are described and one example compares a newly received packet to the previous packet which has been stored in the same file and sets bits in the record which denote whether fields in the newly received packet are the same as the corresponding fields in the previous packet.

Inventors:	Black; Richard John; (Cambridge, GB)
Correspondence Address:	LEE & HAYES, PLLC 601 W. RIVERSIDE AVENUE, SUITE 1400 SPOKANE WA 99201 US
Assignee:	Microsoft Corporation Redmond WA
Family ID:	41267687
Appl. No.:	12/848391
Filed:	August 2, 2010

Related U.S. Patent Documents


Application Number	Filing Date	Patent Number
12118595	May 9, 2008	7793001
12848391

Current U.S. Class:	370/253
Current CPC Class:	H04L 43/026 20130101
Class at Publication:	370/253
International Class:	H04L 12/26 20060101 H04L012/26

Claims

1. A method of capturing trace data for use in network packet traffic analysis, the method comprising: under control of one or more processors configured with executable instructions: receiving a packet; and storing a representation of the packet in a file, said representation and said file being associated with one of a source address of the packet and a destination address of the packet.

2. A method according to claim 1, further comprising: storing a second representation of the packet in a second file, said second representation and said second file being associated with another of said source address of the packet and said destination address of the packet.

3. A method according to claim 2, wherein the first representation and the second representation are the same.

4. A method according to claim 1, wherein a representation of a packet comprises a truncated packet.

5. A method according to claim 1, wherein a representation of a packet comprises values of specified fields.

6. A method according to claim 1, wherein storing a representation of the packet in a file comprises: if the file associated with an address is empty, creating and storing the associated representation; if the file associated with an address comprises a representation of a previous packet, creating the associated representation based on a comparison of said packet and said previous packet and storing the associated representation.

7. A method according to claim 6, wherein creating the associated representation based on a comparison of said packet and said previous packet comprises: comparing a value of a first field in said packet with a value of said first field in said previous packet; setting a bit in the representation identifying if the values are the same; and repeating the comparing and setting steps for each field in a set of specified fields; and appending the value of each field where the value of the field in said packet is not the same as the value of the field in the previous packet.

8. A method according to claim 1, wherein storing a representation of the packet in a file comprises: creating a representation of the packet; setting a flag in the representation indicating whether the address associated with the representation is the source address or the destination address; mapping a value of a field in the representation based on said flag; and storing the representation.

9. A method according to claim 1, wherein each representation comprises at least one of a source address and a destination address and wherein storing two representations of the packet in separate files comprises, for each representation: replacing an address in the representation with an identifier; storing a mapping between the address and the identifier in a separate file; and storing the representation.

10. A method according to claim 1, further comprising comparing the representation of the packet to a representation of a previously received packet which has been stored in the file and which sets bits which denote whether fields in the packet are the same as corresponding fields in the previously received packet.

11. One or more tangible device-readable media with device-executable instructions for performing acts comprising: on receipt of a packet, creating a packet record for use in network packet traffic analysis, the packet record comprising a plurality of fields; and discarding the packet.

12. One or more tangible device-readable media according to claim 11, further comprising device-executable instructions for performing acts comprising: creating a first and second version of the packet record; and storing the first version in a file associated with a source address of the packet; and storing the second version of the packet record in a file associated with a destination address of the packet.

13. One or more tangible device-readable media according to claim 12, wherein creating a first and second version of the packet record further comprises, for each version: setting a flag in the version of the packet record according to a direction of travel of the packet; and mapping at least one of the plurality of fields based on said flag.

14. One or more tangible device-readable media according to claim 11, further comprising device-executable instructions for performing acts comprising: compressing each packet record based on a comparison of the packet and a packet previously stored in the file.

15. One or more tangible device-readable media according to claim 14, wherein compressing each packet record based on a comparison of the packet and a packet previously stored in the file comprises: comparing the packet to a previous packet; setting a flag in the packet record for each of the plurality of fields, the flag indicating whether a value of the field in the packet is different to a value of the field in the previous packet; and for each field where the value of the field in the packet is different to a value of the field in the previous packet, appending the value of the field to the packet record.

16. One or more tangible device-readable media according to claim 14, further comprising device-executable instructions for performing acts comprising: creating an instance of a record creation and compression method for each file; passing a received packet to at least one of: an instance of the method corresponding to a source address of the packet and an instance of the method corresponding to a source address of the packet.

17. One or more tangible device-readable media according to claim 11, further comprising device-executable instructions for performing acts comprising: replacing an address in the packet record with an identifier; and storing a mapping between the address and the identifier in a dictionary.

18. One or more tangible device-readable media with device-executable instructions for performing acts comprising: accessing a file comprising a plurality of compressed packet records, each compressed packet record comprising a flag byte; reading a flag byte from the file; and generating an uncompressed packet record comprising a packet time and a plurality of fields by: determining the packet time based on a timestamp within said flag byte; and determining the plurality of fields based on a plurality of flags in the flag byte, each flag corresponding to one of the plurality of fields.

19. One or more tangible device-readable media according to claim 18, wherein determining the plurality of fields based on a plurality of flags in the flag byte comprises: reading a first flag from the flag byte; if said first flag is set, reading a value of a first field from the file; if said first flag is not set, setting the value of the first field to a value of a first field in a previous uncompressed packet record; and repeating the steps for each of the plurality of flags.

20. One or more tangible device-readable media according to claim 18, wherein determining the packet time based on a timestamp within said flag byte comprises: examining the timestamp; and if the timestamp is equal to a first value, reading a time from the file and setting the packet time to said time; and if the timestamp is equal to one of a set of values, reading one or more bytes from the file and setting the packet time based on said one or more bytes and a packet time of a previous uncompressed packet record.

Description

RELATED APPLICATION

[0001] This application is a continuation of U.S. patent application Ser. No. 12/118,595, filed on May 9, 2008, the entirety of which is incorporated by reference.

BACKGROUND

[0002] Network packet traffic analysis may be performed in a number of different ways: the analysis may be performed in real time (on-line) or from stored data (off-line) and the data analyzed may represent a substantially complete record of packet activity or the data may be sampled from the network and therefore represent only a small fraction of the packets in the network. A substantially complete record of packet activity is known as a trace.

[0003] A capturing agent may be used to capture a trace by capturing data and storing it on disk. Where the speed at which data arrives exceeds the speed at which the data can be written to a disk, packets may be truncated and the truncated packets stored. Each truncated packet is shorter than a complete packet and comprises the front portion of a packet without the end portion of the packet. The length of the truncated packet is known as the snap-length. Having captured a trace, the data may be analyzed in many different ways and many different aspects of the data may be investigated.

[0004] Typically a trace comprises complete or truncated packets captured over a short period of time or the trace comprises a statistical sampling of the number and temporal distribution of packets sent between machines.

[0005] The embodiments described below are not limited to implementations which solve any or all of the disadvantages of known methods of capturing trace data.

SUMMARY

[0006] The following presents a simplified summary of the disclosure in order to provide a basic understanding to the reader. This summary is not an extensive overview of the disclosure and it does not identify key/critical elements of the invention or delineate the scope of the invention. Its sole purpose is to present some concepts disclosed herein in a simplified form as a prelude to the more detailed description that is presented later.

[0007] Methods of capturing and compressing trace data for use in network packet traffic analysis are described. In an embodiment, when a packet is received, two records of the packet are created and stored. One record is stored in a file associated with the source address of the packet and the other record is stored in a file associated with the destination address of the packet. Various packet compression techniques are described and one example compares a newly received packet to the previous packet which has been stored in the same file and sets bits in the record which denote whether fields in the newly received packet are the same as the corresponding fields in the previous packet.

[0008] Many of the attendant features will be more readily appreciated as the same becomes better understood by reference to the following detailed description considered in connection with the accompanying drawings.

DESCRIPTION OF THE DRAWINGS

[0009] The present description will be better understood from the following detailed description read in light of the accompanying drawings, wherein:

[0010] FIG. 1 is a schematic diagram of a network;

[0011] FIG. 2 is a schematic diagram of a packet in a computer network;

[0012] FIG. 3 is a flow diagram of an example method of capturing trace data;

[0013] FIGS. 4-8 are flow diagrams of example methods of compressing trace data;

[0014] FIG. 9 shows example implementations of two method blocks from FIG. 8 in more detail;

[0015] FIG. 10 shows a comparison between two formats of a packet record;

[0016] FIG. 11 shows a flow diagram of a method of converting a stream of packets into multiple files using a multiplicity of instances of the method shown in FIG. 8;

[0017] FIG. 12 illustrates the process of decompressing a compressed trace file; and

[0018] FIG. 13 illustrates an exemplary computing-based device in which embodiments of the methods described herein may be implemented. Like reference numerals are used to designate like parts in the accompanying drawings.

DETAILED DESCRIPTION

[0019] The detailed description provided below in connection with the appended drawings is intended as a description of the present examples and is not intended to represent the only forms in which the present example may be constructed or utilized. The description sets forth the functions of the example and the sequence of steps for constructing and operating the example. However, the same or equivalent functions and sequences may be accomplished by different examples.

[0020] Trace captures are often taken at a SPAN (switched port analyzer) port on a router or switch which provides a stream of packets representing a copy of packets between one or more computers on each of one or more other ports. FIG. 1 is a schematic diagram of a network comprising a router/switch 101 and a number of computers 102 which may be interconnected in any way. Port mirroring may be used on the router/switch 101 to send a copy of all packets that pass through the router/switch 101 to a port 103 which is connected to a computer 104 which runs a capturing agent that captures the trace. The computer 104 receives the packets on a network card 105.

[0021] A typical packet 200 of interest in a computer network, as shown in FIG. 2, comprises an Ethernet header 201 of fourteen bytes, followed by an Internet Protocol (IP) header 202 which may be variable in length but which is at least twenty bytes and usually but not always twenty bytes. This IP header indicates the protocol of the packet which is usually either the User Datagram Protocol (UDP) or the Transmission Control Protocol (TCP). If UDP is indicated then the IP header is followed by an eight byte UDP header 203. If TCP is indicated then the IP header is followed by a variable length TCP header 203 which is at least twenty bytes and usually but not always twenty bytes. These headers 201-203 are followed by the data 204 which might be the user's data and/or a header belonging to some higher level protocol. The amount of data is variable, but the total length of the packet does not exceed the frame limit which is usually fifteen hundred and fourteen bytes.

[0022] FIG. 3 is a flow diagram of an example method of capturing trace data. A packet is received (block 301) and data is appended to the packet (block 302). A first version of the packet with the appended data is then stored in a file associated with the source of the packet (block 303) and a second version of the packet with the appended data is stored in a file associated with the destination of the packet (block 304). The first and second versions of the packet with appended data may be identical or they may be different (as described in more detail below).

[0023] When data is appended to the packet (in block 302), data which is in the packet, such as some or all of the user's data, may also be removed e.g. packets may be truncated (or snapped) to conserve disk space, or disk bandwidth, or to reduce the privacy impact of storing the portion of the packet comprising data rather than headers. In an example, the computer records the current time when the packet arrives, the length of the packet and the amount of the data which will be snapped (i.e. captured) and not discarded. The computer then appends the time (usually requiring eight bytes) and the two lengths (usually requiring four bytes each) to the packet (snapped to at least 14+20+20 but frequently a value such as 128 bytes) and this is output to the storage device, e.g. a disk.
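The snapping and metadata step described above can be sketched as follows. This is a minimal illustration only: the function name `make_capture_entry`, the 128 byte default, the little-endian byte order, the microsecond time unit and the placement of the metadata ahead of the snapped packet are all assumptions made for the example and are not specified in the description.

```python
import struct

SNAP_LENGTH = 128  # hypothetical snap length in bytes


def make_capture_entry(packet: bytes, arrival_time_us: int,
                       snap_length: int = SNAP_LENGTH) -> bytes:
    """Build one stored entry: capture metadata plus the snapped packet."""
    snapped = packet[:snap_length]  # keep only the front portion of the packet
    # 8-byte time, 4-byte original length, 4-byte captured length (assumed order).
    metadata = struct.pack("<QII", arrival_time_us, len(packet), len(snapped))
    return metadata + snapped
```

With these assumptions a full-size 1514 byte frame is stored as 16 bytes of metadata plus 128 bytes of packet data, while a packet shorter than the snap length is stored whole.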

[0024] The versions of the packets stored (in blocks 303 and 304) may comprise packet records instead of the packets themselves (either full or truncated packets). Examples of packet records are described in more detail below.

[0025] Using the method shown in FIG. 3, multiple files are used to record packet information. A multiplicity of files is used, with in many cases one file for each network address, though a file may be used for several addresses. Each packet record is written to two files, the file for the source address of the packet (in block 303), and the file for the destination address of the packet (in block 304). For broadcast packets, the file for the destination address may comprise a file for the broadcast address. Alternatively, broadcast packets may be stored in the files for multiple destination addresses; however this is less efficient.

[0026] By dividing the trace data into multiple files, the packets relevant to a computation for a single computer (or a group of computers) can be processed more efficiently since only the file for that computer must be read; or, if broadcast packets are relevant, the files for the address of that computer and the file for the broadcast address of the portion of the network where the computer is present. Additionally the files are more easily processed in parallel on a cluster, since the files required for any particular processing component are much reduced.
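The division of a trace into per-address files may be illustrated by the sketch below. In-memory lists stand in for files on disk, and the class name `PerAddressStore` and its methods are hypothetical names introduced for the example.

```python
from collections import defaultdict


class PerAddressStore:
    """Keeps one record list per network address (a stand-in for one file each)."""

    def __init__(self):
        self.files = defaultdict(list)  # address -> records stored for that address

    def store(self, src_addr, dst_addr, record):
        # Each record is written twice: once for the source, once for the destination.
        self.files[src_addr].append(record)
        self.files[dst_addr].append(record)

    def records_for(self, address):
        # Analysis for a single computer only needs that computer's file.
        return list(self.files.get(address, []))
```

An analysis concerned with a single computer then reads only `records_for(address)` rather than scanning the whole trace.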

[0027] When a packet is stored as part of a captured trace, either using the method of FIG. 3 or another method of capturing trace data, the packet may be compressed. As described above, the packet may be snapped (or truncated) to a particular length (the snap length) but in other examples, further compression techniques may be used. The following description describes a number of different compression techniques which may be used together (in any combination) or independently.

[0028] FIG. 4 is a flow diagram of an example method of compressing trace data. A packet is received (block 401) and a packet record is created for the packet which comprises specified fields associated with the packet (block 402). The record may encode these specified fields. Having created the packet record (in block 402), the packet is discarded (block 403). The specified fields which are encoded (or otherwise stored) in the packet record may be those fields associated with the original received packet which are generally used in the analysis of network data. These fields may, for example, comprise one or more of: the time of the frame, the source and destination addresses, whether the packet is UDP or TCP, the port numbers, and whether the packet was empty or not (e.g. whether the packet 200, as shown in FIG. 2, comprises any data 204). In some implementations some encoding of TCP's protocol flags may also be present in the packet record. In other examples, different fields may be included within the packet record which is generated.
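One possible way of creating such a packet record from a raw frame is sketched below. The sketch assumes an untagged Ethernet frame carrying IPv4, and the names `PacketRecord` and `make_record` are hypothetical; frames that are not IPv4 TCP or UDP, or that are too short to parse, are simply skipped in this illustration.

```python
import struct
from dataclasses import dataclass
from typing import Optional


@dataclass
class PacketRecord:
    time_us: int
    src_addr: str
    dst_addr: str
    protocol: str   # "TCP" or "UDP"
    src_port: int
    dst_port: int
    empty: bool     # True when the packet carries no data after the headers


def make_record(frame: bytes, time_us: int) -> Optional[PacketRecord]:
    """Extract the analysis fields from an Ethernet/IPv4 frame, or return None."""
    if len(frame) < 14 + 20 or frame[12:14] != b"\x08\x00":
        return None                               # too short, or not IPv4
    ip_header_len = (frame[14] & 0x0F) * 4        # IHL is counted in 32-bit words
    total_len = struct.unpack("!H", frame[16:18])[0]
    proto = frame[23]
    src = ".".join(str(b) for b in frame[26:30])
    dst = ".".join(str(b) for b in frame[30:34])
    l4 = 14 + ip_header_len                       # offset of the TCP/UDP header
    if proto == 6 and len(frame) >= l4 + 20:
        name, l4_header_len = "TCP", (frame[l4 + 12] >> 4) * 4
    elif proto == 17 and len(frame) >= l4 + 8:
        name, l4_header_len = "UDP", 8
    else:
        return None
    src_port, dst_port = struct.unpack("!HH", frame[l4:l4 + 4])
    payload_len = total_len - ip_header_len - l4_header_len
    return PacketRecord(time_us, src, dst, name, src_port, dst_port, payload_len <= 0)
```

Once the record has been created, the original frame is no longer needed and may be discarded.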

[0029] FIG. 5 is a flow diagram of another example method of compressing trace data. A packet is received (block 501) and if it is the first packet to be stored in a trace file (`Yes` in block 502), a packet record is created and stored for the packet (block 503). This packet record may, as described above, comprise specified fields which are to be used in analysis of the trace. The received packet is then discarded (block 504). If however, the packet is not the first to be stored in a trace file (`No` in block 502), the packet is compared to the previously stored packet record (block 505) and a packet record for the newly received packet is created which comprises flags which are set to denote whether each of the specified fields has changed from the previously stored packet record (block 506). These flags which denote whether fields have changed between packets are encoded using bits grouped together in one or more bytes which may be referred to as a `flag byte` or `flag bytes`. Where fields have changed, these changed fields are appended to the packet record (block 507) and the received packet is then discarded (block 504). The method may then be repeated for each packet received.

[0030] This method provides a compression format for the fields (which are themselves larger than one bit) from one packet to the next within a file in which a single bit is used to indicate whether a field would have the same value as the field in the previous packet. This provides significant compression of the size of the trace.
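The flag-based comparison with the previous packet may be sketched as follows. The function names `delta_encode` and `delta_decode` are hypothetical, and the choice that a set bit means "changed" (rather than "same") is an assumption made for the example; either polarity carries the same information.

```python
def delta_encode(fields, previous=None):
    """Return (flags, changed): bit i of flags is set when field i changed."""
    flags, changed = 0, []
    for i, value in enumerate(fields):
        if previous is None or value != previous[i]:
            flags |= 1 << i
            changed.append(value)   # only changed values are stored in full
    return flags, changed


def delta_decode(flags, changed, previous=None):
    """Rebuild the full field tuple from the flags and the previous record."""
    n_fields = len(previous) if previous is not None else len(changed)
    values = iter(changed)
    result = []
    for i in range(n_fields):
        if flags & (1 << i):
            result.append(next(values))     # value was stored explicitly
        else:
            result.append(previous[i])      # value is inherited from the previous record
    return tuple(result)
```

For a first record every bit is set and every value is stored; for a subsequent record in the same flow typically only the flags are stored.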

[0031] In addition to compressing the fields, as described above, the time of the packet can also be compressed. Instead of storing the absolute time at which the packet was observed, the difference in time from one packet to the next may be stored. Such a difference is likely to be a smaller value than the absolute time of arrival. For example, two bits in the flag byte may be used to encode whether the time is represented by a one byte, two byte or four byte difference, or by an eight byte absolute time. This significantly reduces the size of the stored file and improves the performance of analysis through the much reduced data volumes.
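The two bit time encoding may be sketched as below. The assignment of code values (0, 1 and 2 for one, two and four byte differences, 3 for an eight byte absolute time), the little-endian byte order and the names `encode_time` and `decode_time` are assumptions made for the example. The sketch also assumes that packets are stored in time order, and falls back to an absolute time if a difference would be negative.

```python
DELTA_SIZES = {0: 1, 1: 2, 2: 4}   # time code -> number of bytes in the difference
ABSOLUTE = 3                        # time code for an eight byte absolute time


def encode_time(time_us, prev_time_us=None):
    """Return (code, payload) using the smallest difference that fits."""
    if prev_time_us is not None:
        delta = time_us - prev_time_us
        if delta >= 0:
            for code, size in DELTA_SIZES.items():
                if delta < 1 << (8 * size):
                    return code, delta.to_bytes(size, "little")
    # First packet, out-of-order packet, or a gap too large for four bytes.
    return ABSOLUTE, time_us.to_bytes(8, "little")


def decode_time(code, payload, prev_time_us=None):
    """Invert encode_time, given the time of the previous packet."""
    value = int.from_bytes(payload, "little")
    return value if code == ABSOLUTE else prev_time_us + value
```

Under these assumptions a packet arriving 100 time units after its predecessor costs one byte of time information instead of eight.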

[0032] FIG. 6 is a flow diagram of a further example method of compressing trace data. A packet is received (block 601) and two packet records are created (block 602)--one associated with the source address and one associated with the destination address. Within each packet record a direction bit is set (block 603) which indicates whether the packet is being sent to or from the address with which the packet record is associated, i.e. in the packet record associated with the source address, the direction bit will indicate that the packet was being sent from the associated address and in the packet record associated with the destination address, the direction bit will indicate that the packet was being sent to the associated address. The two packet records created are therefore not the same. The fields in the packet record are then mapped according to the direction bit (block 604). For example, instead of having source port and destination port fields, the record encodes local port and remote port fields. If the packet is an input at the address represented in the current file then the destination port is represented in the local port field; if the packet is an output at the address then the destination port would be represented in the remote port field. The created packet records are then stored (block 605) and the received packet discarded (block 606). The packets may be stored in two separate files, one associated with the source address and one associated with the destination address (e.g. as described above with reference to FIG. 3).
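The mapping from source and destination fields to local and remote fields may be sketched as follows. The function name `to_local_view` and the dictionary keys are hypothetical names chosen for the example.

```python
def to_local_view(src_addr, src_port, dst_addr, dst_port, file_addr):
    """Express a packet relative to the address whose file it is stored in."""
    if file_addr == src_addr:
        # The packet is an output at this address: the source side is local.
        return {"outbound": True, "peer": dst_addr,
                "local_port": src_port, "remote_port": dst_port}
    if file_addr == dst_addr:
        # The packet is an input at this address: the destination side is local.
        return {"outbound": False, "peer": src_addr,
                "local_port": dst_port, "remote_port": src_port}
    raise ValueError("packet does not involve this address")
```

A request and its reply, viewed from the same file, then share identical peer, local port and remote port values and differ only in the direction bit, which is what makes the comparison with the previous packet effective.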

[0033] Another optimization for collecting trace data is shown in FIG. 7, in which the address of the peer host for a packet is not stored directly in the file, instead an indirect identifier is stored (and potentially compressed using the method and system described above). For a packet or packet record associated with a source address, the peer host address is the destination address or vice versa. When a packet is received (block 701), the source and/or destination address is replaced by an identifier for each replaced address (block 702). The mapping between the actual address and the identifier is stored (block 703), for example in a library or dictionary, and the amended packet or a packet record for the packet (as described above) is stored (block 704). Where a packet record is stored, the original received packet may be discarded (not shown in FIG. 7).

[0034] The use of an identifier instead of an actual address, as shown in FIG. 7, enables additional compression because the number of addresses present on the network of interest is likely to be much smaller than the number of values which an address can take, so that an identifier can be shorter than the address it replaces. Furthermore, the dictionary consulted to find the address represented by the identifiers need not be made available to the persons (or machines) processing the packet data; or alternatively a different dictionary to which a prefix conserving anonymization technique has been applied can be provided instead. Thus the private details of the original addresses in the trace can be completely and easily separated from the main trace data on which the computations will be carried out, thereby improving privacy. Previously two copies of the trace were stored--one containing the actual addresses and one containing anonymized addresses (e.g. addresses to which a prefix conserving anonymization technique has been applied) and therefore use of this method also reduces storage requirements.
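The dictionary of indirect identifiers may be sketched as below. The class name `AddressDictionary`, its method names and the use of consecutive integers as identifiers are assumptions made for the example.

```python
class AddressDictionary:
    """Maps each address to a small integer identifier, and back again."""

    def __init__(self):
        self._ids = {}     # address -> identifier
        self._addrs = []   # identifier -> address

    def identifier_for(self, address):
        ident = self._ids.get(address)
        if ident is None:
            # First time this address is seen: allocate the next identifier.
            ident = len(self._addrs)
            self._ids[address] = ident
            self._addrs.append(address)
        return ident

    def address_for(self, ident):
        # Only holders of the dictionary can recover the real address.
        return self._addrs[ident]
```

Because the trace files contain only identifiers, the dictionary can be withheld, or replaced by an anonymized one, without touching the trace data.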

[0035] FIG. 8 illustrates the process of creating a compressed trace file for a single address (or group of addresses) from a source of original packets and the packet records stored are referred to herein as being in `Reduced Packet Format` (RPF). It will be appreciated that the method will be implemented in parallel for both the source and destination addresses; however, in FIG. 8 only a single process flow is shown for purposes of clarity. The method shown in FIG. 8 uses all of the compression techniques and optimizations described above, although in other examples only some of those techniques and/or optimizations may be employed. FIG. 9 shows flow diagrams of example implementations of blocks 808 and 810 of the method shown in FIG. 8 in more detail.

[0036] When a packet is received (block 801), the method checks if this is the first packet in the file (block 802), i.e. whether this is the first packet received having this particular address as either the source or the destination address. If the packet received is the first packet in the file (`Yes` in block 802), a flag byte indicating an escape of type absolute time is written (block 803) and the absolute time is also written (block 804). A flag byte is written indicating a minimum sized time delta and that all other fields are present (block 805) followed by each of the fields (block 806) and these fields may be written in a predefined order.

[0037] If, however, the packet is not the first in the file (`No` in block 802), the time of the packet received (in block 801) is compared to the time of the previous packet, a packet record for which has been previously stored in the file, to determine the magnitude of the time difference between the two packets (block 807). If the time difference (also referred to as the time delta) does not exceed four bytes (`No` in block 808), then a flag byte is constructed for the specified fields (block 809). As described above with reference to FIG. 5 and shown in more detail in FIG. 9, this flag byte is constructed by converting the source and destination addresses and port numbers into local and remote values (block 901) and setting the direction bit (block 902). The values of the specified fields (e.g. peer address, local port and remote port) in the current packet (received in block 801) are compared to the corresponding values in the previous packet and bits in the flag byte are set to denote whether the values have changed (block 903). In order to perform this comparison, the previous uncompressed packet record may be stored. In addition the protocol and empty bits are set (block 904).

[0038] The flag byte, including the time difference between the current and previous packets, is then written to the file (block 810) followed by the values of any of the fields which have changed (block 811), as shown in more detail in FIG. 9. If the peer changed bit is set (`Yes` in block 910) then an identifier for the peer is written to the file (block 911). This may require consulting the dictionary, including adding a new identifier to the dictionary if the address has not previously been seen (not shown in FIG. 9). If the local port changed bit is set (`Yes` in block 912), the local port is written to the file (block 913) and if the remote port changed bit is set (`Yes` in block 914), the remote port is written to the file (block 915).

[0039] If the time difference is larger than will fit in four bytes (`Yes` in block 808), then a flag byte is written indicating an escape of type absolute time (block 812) followed by an eight byte absolute time (block 813). The time difference is then set to zero (block 814) and a second flag byte is constructed for the specified fields (block 809 and as shown in more detail in FIG. 9). This second flag byte, including the time difference between the current and previous packets (which was set to zero in block 814) is then written to the file (block 810) followed by the values of any of the fields which have changed (block 811 and as shown in more detail in FIG. 9). The method may then be repeated for subsequent packets received.
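The overall writing process for a single file may be sketched as follows. The sketch writes to an in-memory buffer rather than a file, and the bit positions within the flag byte, the four byte peer identifier, the little-endian byte order, the placement of the time difference bytes directly after the flag byte and the name `RpfWriter` are all assumptions made for the example; the description names the bits but does not fix their positions.

```python
import struct

# Assumed flag-byte layout; the bit positions are illustrative only.
DIRECTION_OUT = 0x01    # set when the packet was sent from the file's address
PROTOCOL_TCP = 0x02     # set for TCP, clear for UDP
EMPTY = 0x04            # set when the packet carried no data
TIME_SHIFT = 3          # the two-bit time code occupies bits 3 and 4
PEER_CHANGED = 0x20
LOCAL_CHANGED = 0x40
REMOTE_CHANGED = 0x80
TIME_ESCAPE = 3         # time code meaning "eight byte absolute time follows"
DELTA_FORMATS = ("<B", "<H", "<I")   # time codes 0, 1, 2
FIELD_LAYOUT = ((PEER_CHANGED, "<I"), (LOCAL_CHANGED, "<H"), (REMOTE_CHANGED, "<H"))


class RpfWriter:
    """Writes compressed records for one address into an in-memory buffer."""

    def __init__(self):
        self.out = bytearray()
        self.prev_time = None
        self.prev_fields = None   # (peer_id, local_port, remote_port)

    def write(self, time_us, outbound, is_tcp, empty, peer_id, local_port, remote_port):
        first = self.prev_time is None
        delta = 0 if first else time_us - self.prev_time
        if first or not 0 <= delta < 1 << 32:
            # Escape flag byte, then the absolute time; the difference becomes zero.
            self.out.append(TIME_ESCAPE << TIME_SHIFT)
            self.out += struct.pack("<Q", time_us)
            delta = 0
        code = 0 if delta < 1 << 8 else 1 if delta < 1 << 16 else 2
        flags = code << TIME_SHIFT
        flags |= DIRECTION_OUT if outbound else 0
        flags |= PROTOCOL_TCP if is_tcp else 0
        flags |= EMPTY if empty else 0
        body = struct.pack(DELTA_FORMATS[code], delta)
        fields = (peer_id, local_port, remote_port)
        for i, (bit, fmt) in enumerate(FIELD_LAYOUT):
            if self.prev_fields is None or fields[i] != self.prev_fields[i]:
                flags |= bit                      # mark the field as changed
                body += struct.pack(fmt, fields[i])
        self.out.append(flags)
        self.out += body
        self.prev_time, self.prev_fields = time_us, fields
```

With this layout the first record occupies nineteen bytes, whereas a following packet of the same flow that arrives within 255 time units occupies two bytes: one flag byte and one byte of time difference.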

[0040] FIG. 10 shows a comparison between the Reduced Packet Format 1010 of a packet record and an expanded format 1000 of a packet record. In the example shown, the expanded format 1000 comprises values for each of the following: the time of the frame (or packet) 1001, the remote port number 1002, the local port number 1003, a protocol field 1004 (e.g. indicating whether TCP or UDP is used), a peer address identifier 1005, a direction bit 1006 and a field indicating whether the packet comprised any user data or not 1007. The data elements may be arranged in any order. The RPF 1010 comprises the flag byte 1011 and values for any fields that have changed 1019 (e.g. as written in block 810). The flag byte comprises a direction bit 1012, a protocol bit 1013, an empty bit 1014, a time stamp 1018 (which may comprise two bits, as described above) and bits 1015-1017 indicating whether the peer address, remote port number and local port number have changed since the previous packet.

[0041] The RPF 1010 is the format in which the trace is stored and the expanded format 1000 may be the format which is used when the trace is processed. The decompression process (i.e. conversion from RPF 1010 to expanded format 1000) is described below with reference to FIG. 12.

[0042] Many communication patterns in real networks involve bursts of bidirectional communication between computers. Since the encoding of source and destination has been transformed, using the method shown in FIG. 8, into a direction together with local and remote values, adjacent packets in the file are very likely to have identical values of local and remote port numbers and peer address. These fields will therefore compress very densely to a single flag each (bits 1015-1017 in the example in FIG. 10) indicating that they are the same as in the previous packet.

[0043] Moreover, even on a busy server computer with many simultaneous connections the local port indicating the service on the computer which is receiving and transmitting many packets will also be identical from packet to packet and hence can also be condensed to a single bit 1017. Not only does this save very large amounts of storage space in the file, and save on the time to read and write the file to disk, but it can also be used to optimize actual processing of the file since several forms of processing can be enhanced if they can more rapidly identify that certain aspects of a packet are the same as previously. For example, packet processing may begin by determining the flow or channel associated with a packet and this step can be elided if it can rapidly be determined that the flow or channel of some packet will be the same as the flow or channel determined for the previous packet.

[0044] As described above, and shown in FIG. 3, a stream of packets may be stored in multiple files, with each file being associated with an address (or group of addresses). FIG. 11 shows a flow diagram of a method of converting a stream of packets into multiple files using a multiplicity of instances of the method shown in FIG. 8. When a packet is received (block 1101), the method checks to see if an instance of the method of FIG. 8 exists for the source address of the packet (block 1102). If such an instance does not exist (`No` in block 1102) then one is created and the dictionary is updated (block 1103) to include a new mapping between an indirect identifier (as described above) and the actual address. The packet is then given to the instance of the method for the source address (block 1104). The method then checks to see if an instance of the method of FIG. 8 exists for the destination address of the packet (block 1105). If such an instance does not exist (`No` in block 1105) then one is created and the dictionary is updated (block 1106) to include a new mapping between an indirect identifier (as described above) and the actual destination address. The packet is then given to the instance of the method for the destination address (block 1107). The method is then repeated for subsequent packets received.
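The flow of FIG. 11 can be sketched as a small dispatcher that keeps one writer instance per address. This is an illustrative sketch rather than the described implementation: `RpfWriterStub` merely collects packets in place of a real instance of the method of FIG. 8, packets are modelled as dictionaries with hypothetical `src` and `dst` keys, and the indirect identifier is assumed to be a sequentially assigned integer.

```python
class RpfWriterStub:
    """Stand-in for one instance of the method of FIG. 8 (assumption)."""

    def __init__(self, address):
        self.address = address
        self.packets = []

    def handle(self, packet):
        # A real writer would compress the packet into an RPF record here.
        self.packets.append(packet)


class Demultiplexer:
    """Routes each packet to the writers for its source and destination."""

    def __init__(self):
        self.dictionary = {}  # actual address -> indirect identifier
        self.writers = {}     # actual address -> writer instance

    def _writer_for(self, address):
        writer = self.writers.get(address)
        if writer is None:
            # Blocks 1103 / 1106: create the instance, update the dictionary.
            self.dictionary[address] = len(self.dictionary)
            writer = RpfWriterStub(address)
            self.writers[address] = writer
        return writer

    def receive(self, packet):
        # Blocks 1102-1104: hand the packet to the source address instance.
        self._writer_for(packet["src"]).handle(packet)
        # Blocks 1105-1107: hand the packet to the destination instance.
        self._writer_for(packet["dst"]).handle(packet)


demux = Demultiplexer()
demux.receive({"src": "10.0.0.1", "dst": "10.0.0.2", "length": 60})
demux.receive({"src": "10.0.0.2", "dst": "10.0.0.1", "length": 1500})
demux.receive({"src": "10.0.0.1", "dst": "10.0.0.3", "length": 60})
```

Each packet is therefore recorded twice, once per endpoint, and the dictionary grows only when an address is seen for the first time.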

[0045] Whilst the above description in relation to FIG. 11 describes generating files associated with each address, in many examples, files may only be created for a subset of addresses. In such an example, the method shown in FIG. 11 may be modified such that for each new source/destination address the dictionary is updated (e.g. in blocks 1103 and 1106) but that new instances are generated only for the subset of addresses where files are required.

[0046] The method and system described above enables the processing of packets into files to be performed with an exceptionally high degree of parallelism using all the available processing cores or computers, since the only time that coordination is required is when the dictionary is to be updated (in blocks 1103 and 1106 in FIG. 11) and that happens only rarely on the first time that an address is seen in the network. In an example, packets may be received in blocks (e.g. in 10 GB slices) and each core may deal with a separate block of packets. Typically the packets are received in blocks which are divided in time by the capture device such that each block is not bigger than a disk. Each core may run an instance of the method of FIG. 8 (the RPF writer) for each address such that each core generates a file for each address which is a source or destination address for one of the packets it processes. This results in multiple files for a single address (e.g. where files are generated by different cores) and these may subsequently be combined into a single file per address.
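The block-per-core arrangement can be sketched as follows. This is a rough illustration under several assumptions: packets are modelled as hypothetical `(source, destination)` tuples, per-address "files" are in-memory lists, records are simplified to a direction and a peer identifier, and Python threads stand in for cores (they show the coordination pattern but do not give true CPU parallelism). The names `SharedDictionary`, `process_block` and `process_blocks` are invented for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SharedDictionary:
    """Central address dictionary; the lock is taken only for new addresses."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ids = {}

    def identifier_for(self, address):
        ident = self._ids.get(address)  # common case: no coordination needed
        if ident is None:
            with self._lock:            # rare case: first sight of an address
                ident = self._ids.setdefault(address, len(self._ids))
        return ident


def process_block(block, dictionary):
    """Turn one block of packets into per-address record lists."""
    files = {}
    for src, dst in block:
        src_id = dictionary.identifier_for(src)
        dst_id = dictionary.identifier_for(dst)
        files.setdefault(src, []).append(("out", dst_id))  # record in source file
        files.setdefault(dst, []).append(("in", src_id))   # record in destination file
    return files


def process_blocks(blocks):
    """Process blocks concurrently, then combine the results per address."""
    dictionary = SharedDictionary()
    with ThreadPoolExecutor() as pool:
        per_block = list(pool.map(lambda b: process_block(b, dictionary), blocks))
    combined = {}
    for files in per_block:  # pool.map preserves the (time) order of the blocks
        for address, records in files.items():
            combined.setdefault(address, []).extend(records)
    return dictionary, combined


dictionary, combined = process_blocks([[("a", "b"), ("b", "a")], [("a", "c")]])
```

Because the blocks are divided in time, combining the per-block results in block order keeps each address's records in time order in this sketch.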

[0047] Whilst this provides one example parallelization technique, in another example, one machine or core may perform the method shown in FIG. 11 and different machines/cores may perform different instances of the method of FIG. 8. As described above, each machine/core involved refers back to and updates a central dictionary.

[0048] Where a trace file is used for multiple addresses, the methods used may be the same as described above. Alternatively, an extra field or flag may be used to indicate which of the multiple addresses is the source/destination address for a particular packet record. Use of such a field or flag enables the trace file to be separately divided into individual files for each of the addresses. A trace file for multiple addresses may, for example, be compiled where a single machine has multiple addresses on a single interface (e.g. IPv4 and IPv6) or on multiple interfaces.

[0049] Experimental results have demonstrated the compression which is achievable using the methods described above. When the methods were applied to a 4500 GB dataset, it was reduced to a mere 70 GB. This represents a compression ratio in excess of 5000%, well beyond what is achievable with general purpose compression techniques. In addition, the resulting dataset was easier to process and parallelize on a cluster of computers.
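The arithmetic behind these figures can be checked with a short calculation. The sketch assumes that the compression ratio is expressed as original size divided by compressed size, which is one common convention; the description does not state which definition is intended.

```python
original_gb = 4500
compressed_gb = 70

ratio = original_gb / compressed_gb             # about 64.3 to 1
ratio_percent = ratio * 100                     # about 6429 %, i.e. above 5000 %
space_saved = 1 - compressed_gb / original_gb   # about 0.984 of the original size

print(f"ratio {ratio:.1f}:1, {ratio_percent:.0f}%, space saved {space_saved:.1%}")
```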

[0050] The methods described herein and the resulting large compression ratio enable a trace to be captured which provides data on all packets captured over a large period of time and/or for a large number of machines.

[0051] In the methods described above, all the packets received result in a packet or packet record being stored. In other examples, the packets received may be filtered e.g. so that only packets of interest are captured and stored in the trace. In an example, the filtering may be performed based on round trip time (RTT).

[0052] FIG. 12 illustrates the process of decompressing a compressed trace file (which may be referred to as a Reduced Packet Format file), which may have been generated using the method shown in FIG. 8. This decompression may be performed for the purpose of processing the file e.g. to perform network packet analysis.

[0053] The method starts by reading a flag byte from the file (block 1201). If the byte is determined (in block 1202) to be an escape, as indicated by the two timestamp bits being both set, then the method checks if it is an absolute timestamp escape (block 1203). If it is (`Yes` in block 1203) then an eight-byte absolute time is read, the current time is set to this value (block 1204) and the method returns to the start and reads the next flag byte (block 1201). If it is some other type of escape (`No` in block 1203) then additional escape specific processing is performed (block 1205). If it was not an escape then the two bits which encode the size of the timestamp are examined (in block 1202) and if they are 00 then a single byte is read and added to the current time (block 1206). If the two bits are 01 then a two-byte value is read and added to the current time (block 1207) and if they are 10 then a four-byte value is read and added to the current time (block 1208).

[0054] Where the two bits which encode the size of the timestamp are 00, 01 or 10, the method continues (following blocks 1206-1208) by checking to see if the peer present bit is set (block 1210) and if it is, the current peer is updated by reading a peer address identifier (block 1211). The method then checks to see if the local port present bit is set (block 1220) and if it is the current local port is updated by reading a port value (block 1221). The method then checks to see if the remote port present bit is set (block 1230) and if it is, the current remote port is updated by reading a port value (block 1231).

[0055] Having updated the values of the peer address identifier, local port and remote port if required (blocks 1210, 1211, 1220, 1221, 1230 and 1231), the current packet descriptor record (which is in expanded format 1000, as shown in FIG. 10) is made available for processing (block 1240) and then if the end of file has not been reached (`No` in block 1250) the method is repeated. The method stops (block 1260) when the end of the file is reached (`Yes` in block 1250).
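The decompression loop of FIG. 12 can be sketched as a generator that yields expanded format records. This is a minimal sketch under stated assumptions, not the described implementation: the bit positions in the flag byte, big-endian byte order, a four-byte peer address identifier, two-byte port values and the convention that an absolute timestamp escape is a flag byte with only the two timestamp bits set are all hypothetical choices, and other escape types are not modelled.

```python
import io
import struct

# Assumed flag byte layout (the description does not fix bit positions).
DIRECTION_BIT = 0x80
PROTOCOL_BIT = 0x40
EMPTY_BIT = 0x20
TS_MASK = 0x18
TS_SHIFT = 3
PEER_PRESENT = 0x04
REMOTE_PORT_PRESENT = 0x02
LOCAL_PORT_PRESENT = 0x01

TS_ESCAPE = 0b11             # both timestamp bits set (block 1202)
ABSOLUTE_TIME_ESCAPE = 0x18  # hypothetical encoding of the escape in block 1203
DELTA_FORMATS = {0b00: ">B", 0b01: ">H", 0b10: ">I"}  # 1, 2 or 4 byte deltas


def _read(stream, fmt):
    """Read one big-endian value described by a struct format string."""
    size = struct.calcsize(fmt)
    data = stream.read(size)
    if len(data) != size:
        raise EOFError("truncated RPF record")
    return struct.unpack(fmt, data)[0]


def read_rpf(stream):
    """Yield one expanded record (a dict) per compressed packet record."""
    now = 0
    peer = local_port = remote_port = None
    while True:
        first = stream.read(1)                   # block 1201
        if not first:                            # blocks 1250 / 1260
            return
        flag = first[0]
        ts_code = (flag & TS_MASK) >> TS_SHIFT
        if ts_code == TS_ESCAPE:
            if flag == ABSOLUTE_TIME_ESCAPE:
                now = _read(stream, ">Q")        # block 1204
                continue
            raise NotImplementedError("other escapes (block 1205) not modelled")
        now += _read(stream, DELTA_FORMATS[ts_code])  # blocks 1206-1208
        if flag & PEER_PRESENT:                  # blocks 1210-1211
            peer = _read(stream, ">I")
        if flag & LOCAL_PORT_PRESENT:            # blocks 1220-1221
            local_port = _read(stream, ">H")
        if flag & REMOTE_PORT_PRESENT:           # blocks 1230-1231
            remote_port = _read(stream, ">H")
        yield {                                  # block 1240
            "time": now,
            "peer": peer,
            "local_port": local_port,
            "remote_port": remote_port,
            "outbound": bool(flag & DIRECTION_BIT),
            "tcp": bool(flag & PROTOCOL_BIT),
            "empty": bool(flag & EMPTY_BIT),
        }


# Example trace: an absolute time, a full record, then a record with no changes.
trace = (bytes([0x18]) + struct.pack(">Q", 1000)
         + bytes([0xC7]) + struct.pack(">BIHH", 5, 7, 80, 50000)
         + bytes([0x48]) + struct.pack(">H", 300))
records = list(read_rpf(io.BytesIO(trace)))
```

In this example the second packet record occupies only three bytes (the flag byte and a two-byte time delta), yet the reader still yields a complete record because the peer and port values carry over from the previous packet.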

[0056] The output comprises expanded format records and each expanded format record comprises a value for each field and each record stands independently (unlike RPF which is a comparison with a previous packet). The method of FIG. 12 (referred to as an RPF reader) hides the compression from the processing engine which performs the network packet traffic analysis.

[0057] Whilst the method of decompressing a compressed trace file shown in FIG. 12 demonstrates decompression of a compressed packet which has been created using all the compression techniques described above, it will be appreciated that in some examples the compressed packet may have been created using only a subset of the techniques described above. In such an example, a corresponding decompression method may comprise only a subset of the steps shown in FIG. 12.

[0058] FIG. 13 illustrates various components of an exemplary computing-based device 1300 which may be implemented as any form of a computing and/or electronic device, and in which embodiments of the methods described above may be implemented.

[0059] Computing-based device 1300 comprises one or more processors 1301 which may be microprocessors, controllers or any other suitable type of processors for processing computing executable instructions to control the operation of the device in order to generate and/or process packet traces. Platform software comprising an operating system 1302 or any other suitable platform software may be provided at the computing-based device to enable application software 1303-1305 to be executed on the device. The application software may comprise an RPF writer 1304 (e.g. which performs one or more of the methods shown in FIGS. 3-9 and 11) and/or an RPF reader (e.g. which performs the method shown in FIG. 12).

[0060] The computer executable instructions may be provided using any computer-readable media, such as memory 1306. The memory is of any suitable type such as random access memory (RAM), a disk storage device of any type such as a magnetic or optical storage device, a hard disk drive, or a CD, DVD or other disc drive. Flash memory, EPROM or EEPROM may also be used. The memory may also be used to provide a data store 1307 which may, for example, be used to store the compressed trace files and/or the decompressed trace files.

[0061] The computing-based device 1300 also comprises a network interface 1308 for receiving packets and may also comprise additional inputs and outputs (not shown in FIG. 13).

[0062] Although the present examples are described and illustrated herein as being implemented in a system as shown in FIG. 1, the system described is provided as an example and not a limitation. As those skilled in the art will appreciate, the present examples are suitable for application in a variety of different types of systems which comprise more than one computing device and which may be interconnected in any way.

[0063] The term `computer` is used herein to refer to any device with processing capability such that it can execute instructions. Those skilled in the art will realize that such processing capabilities are incorporated into many different devices and therefore the term `computer` includes PCs, servers, mobile telephones, personal digital assistants and many other devices.

[0064] The methods described herein may be performed by software in machine readable form on a tangible storage medium. The software can be suitable for execution on a parallel processor or a serial processor such that the method steps may be carried out in any suitable order, or simultaneously.

[0065] This acknowledges that software can be a valuable, separately tradable commodity. It is intended to encompass software, which runs on or controls "dumb" or standard hardware, to carry out the desired functions. It is also intended to encompass software which "describes" or defines the configuration of hardware, such as HDL (hardware description language) software, as is used for designing silicon chips, or for configuring universal programmable chips, to carry out desired functions.

[0066] Those skilled in the art will realize that storage devices utilized to store program instructions can be distributed across a network. For example, a remote computer may store an example of the process described as software. A local or terminal computer may access the remote computer and download a part or all of the software to run the program. Alternatively, the local computer may download pieces of the software as needed, or execute some software instructions at the local terminal and some at the remote computer (or computer network). Those skilled in the art will also realize that by utilizing conventional techniques known to those skilled in the art that all, or a portion of the software instructions may be carried out by a dedicated circuit, such as a DSP, programmable logic array, or the like.

[0067] Any range or device value given herein may be extended or altered without losing the effect sought, as will be apparent to the skilled person.

[0068] It will be understood that the benefits and advantages described above may relate to one embodiment or may relate to several embodiments. The embodiments are not limited to those that solve any or all of the stated problems or those that have any or all of the stated benefits and advantages. It will further be understood that reference to `an` item refers to one or more of those items.

[0069] The steps of the methods described herein may be carried out in any suitable order, or simultaneously where appropriate. Additionally, individual blocks may be deleted from any of the methods without departing from the spirit and scope of the subject matter described herein. Aspects of any of the examples described above may be combined with aspects of any of the other examples described to form further examples without losing the effect sought.

[0070] The term `comprising` is used herein to mean including the method blocks or elements identified, but that such blocks or elements do not comprise an exclusive list and a method or apparatus may contain additional blocks or elements.

[0071] It will be understood that the above description of a preferred embodiment is given by way of example only and that various modifications may be made by those skilled in the art. The above specification, examples and data provide a complete description of the structure and use of exemplary embodiments of the invention. Although various embodiments of the invention have been described above with a certain degree of particularity, or with reference to one or more individual embodiments, those skilled in the art could make numerous alterations to the disclosed embodiments without departing from the spirit or scope of this invention.

* * * * *