I. The network protocol stack and DPDK
1. Network data communication flow

1. NIC
Converts between the physical layer's analog signals and the data-link layer's digital signals (AD <-> DA). When the NIC receives data it raises an interrupt to notify the CPU; the CPU, through the driver, copies the NIC's buffered data into memory, and the protocol stack and application layer later parse and read it from there.
2. NIC driver
Keeps the NIC working (power, parameters, link speed, etc.). While data is in flight it maintains an sk_buff for it (like putting a rope on an ox: a handle that travels with the data) and passes that address up to the protocol stack. It has two parts: the generic NIC framework, and the vendor-specific driver implemented on top of that framework.
3. TCP/IP protocol stack
Parses Ethernet, IP, UDP and TCP, then waits for user space to read the result. TCP/IP parsing works directly on that sk_buff.
4. POSIX API
Provides the interface between user space and the kernel (see the minimal socket example after this list).
5. Application (Redis, Nginx, ...)
Reads the data in user space.
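To make steps 4 and 5 concrete, here is a minimal sketch of an application reading one UDP datagram through the kernel stack via the POSIX API (standard socket calls; the port 8888 is just an example):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);           // ask the kernel for a UDP socket
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8888);                        // example port
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));   // register with the kernel stack

    char buf[1024];
    ssize_t n = recvfrom(fd, buf, sizeof(buf) - 1, 0, NULL, NULL); // blocks until the stack has data
    if (n > 0) {
        buf[n] = '\0';
        printf("payload: %s\n", buf);
    }
    close(fd);
    return 0;
}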
2. What DPDK does
DPDK essentially takes over the NIC-to-driver communication and handles it itself. Once DPDK owns the traffic there are two ways to process it:
1. Build your own user-space protocol stack on top of DPDK;
2. Hand the data back to the kernel protocol stack (via KNI).
DPDK is built for large volumes of traffic: it pulls data straight off the NIC to push performance as high as possible, which makes it suitable for high-performance network development. It bypasses the Linux network stack and processes packets entirely in user space, reducing packet-processing latency and increasing throughput.
Applications: data backup, gateways and firewalls, high-performance network development.
Questions to think about
- Can DPDK raise Redis's network concurrency (QPS, requests handled per second)?
  Only to a limited extent: DPDK mainly optimizes the network path, while concurrency depends mostly on the protocol stack.
- Can DPDK reduce latency (from sending a packet to getting the reply)?
  Yes, significantly. Its optimizations are: bypassing the kernel protocol stack and handling packets directly in user space, which removes context switches;
  zero-copy techniques, which cut the number of memory copies between NIC and application;
  batching and polling mode, which lower the interrupt rate and raise processing efficiency.
- Can DPDK raise throughput?
  Yes, significantly. DPDK's core strength is efficient packet processing: multi-queue support, load balancing and pipelined processing, plus hardware acceleration such as the NIC's RSS (receive-side scaling) and TCP segmentation offload.
3. DPDK environment setup
Multi-queue NIC / hugepages / VM network mode
- Multi-queue NIC configuration
// First find the VM's NIC name, e.g. ensthxx
$ ifconfig
$ cat /proc/interrupts | grep ensth[xx]
// If the VM does not support a multi-queue NIC, add these two lines to its .vmx configuration file:
ethernet1.virtualDev = "vmxnet3"
ethernet1.wakeOnPcktRcv = "TRUE"
When a multi-queue NIC is not supported, see: DPDK environment configuration
- Hugepage configuration
Memory is normally managed in 4 KB pages. DPDK hugepages come in two sizes, 2 MB or 1 GB per page; on a VM with limited memory 2 MB is enough, while a physical machine can use 1 GB.
// Check the current /etc/default/grub configuration
$ cat /etc/default/grub
// Add the hugepage options inside GRUB_CMDLINE_LINUX="", e.g. for 2 MB pages:
GRUB_CMDLINE_LINUX="default_hugepagesz=2M hugepagesz=2M hugepages=1024"
$ vim /etc/default/grub   (save and quit with :wq)
// Update grub, then reboot the VM
$ sudo update-grub
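Hugepages are not DPDK-specific. As a standalone illustration of what the reservation above buys you, a process can ask for a 2 MB hugepage-backed mapping directly with mmap and MAP_HUGETLB (a sketch; it assumes the 2 MB pages reserved via grub above and fails if none are free):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HUGE_2MB (2UL * 1024 * 1024)

int main(void) {
    // Ask the kernel for one 2 MB hugepage instead of 512 ordinary 4 KB pages.
    void *p = mmap(NULL, HUGE_2MB, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");   // fails if no hugepages are reserved
        return 1;
    }
    memset(p, 0, HUGE_2MB);            // touch the page so it is actually allocated
    printf("got a 2 MB hugepage at %p\n", p);
    munmap(p, HUGE_2MB);
    return 0;
}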
- VM network modes
(1) Bridged: the virtual NIC is a peer of the physical NIC; if the host is 192.168.0.123, the VM sits on the same subnet and also gets a 192.x address.
(2) NAT: a private subnet (virtual NIC) is created on the Windows host, which then acts as the gateway.
(3) Host-only: similar to NAT, but the VM can only talk to the host, not to the outside world.
DPDK source and environment setup
After downloading the DPDK 19.08.2 source, go into the DPDK directory.
// 0. Extract dpdk-19.08.2.tar.xz
$ tar Jxvf dpdk-19.08.2.tar.xz
// 1. Run dpdk-setup.sh from the usertools directory
$ ./usertools/dpdk-setup.sh
+ Choose [39] x86_64-native-linux-gcc (native = build on this machine, run on this machine). Note: with the modified DPDK source choose linux; otherwise choose linuxapp (this build only needs to be done once).
// 2. Set the environment variables (sudo su ==> operate as root)
$ export RTE_SDK=/home/weifeng/share/dpdk-stable-19.08.2
$ export RTE_TARGET=x86_64-native-linux-gcc
+ RTE_SDK must point to the absolute path of the DPDK installation directory
+ Export the `RTE_SDK` and `RTE_TARGET` environment variables (this has to be redone in every new shell)
// 3. Set up the Linux environment (as root)
+ Use options 43, 44, 45, 46, 47 and 49 (these have to be redone after every reboot)
Run them in order, then choose 60 to exit; at this point DPDK has taken over the physical NIC
43 (load the IGB UIO module, a driver through which DPDK takes over the NIC)
44 (load the VFIO module, another driver through which DPDK can take over the NIC)
45 (load the KNI module, the Kernel Network Interface, used to hand data back to the kernel protocol stack)
46 (reserve hugepages, avoiding frequent page swapping); 512 is a reasonable value
47 (reserve hugepages); 512 is fine
49 (bind the NIC to DPDK; bring eth0 down first with sudo ifconfig eth0 down); the PCI address is the one belonging to eth0 (e.g. 0000:03:00.0)
60 (exit)
IP and MAC address mapping configuration
Bug notes
4. Implementing the user-space protocol stack (ustack)
Runtime flow of the user-space stack
1. rte_eal_init(): initialize the DPDK Environment Abstraction Layer
2. rte_pktmbuf_pool_create(): create the mbuf_pool memory pool
3. rte_eth_dev_configure(): configure the Ethernet device, including the number of queues and interface settings
4. rte_eth_rx_queue_setup(): allocate and set up the device's receive queue
5. rte_eth_tx_queue_setup(): allocate and set up the device's transmit queue
6. rte_eth_dev_start(): start the Ethernet device
7. rte_eth_rx_burst(): receive packets from the Ethernet interface
Implementation: ustack.c
#include <stdio.h>      // printf
#include <rte_eal.h>    // DPDK Environment Abstraction Layer
#include <rte_ethdev.h> // DPDK Ethernet device API
#include <arpa/inet.h>  // network address conversion helpers
int global_portid = 0; // global: NIC port to use (port 0 by default)
#define NUM_MBUFS 4096 // number of mbufs (packet buffers) in the mempool
#define BURST_SIZE 128 // maximum number of packets received/sent per burst
// Default NIC configuration
static const struct rte_eth_conf port_conf_default = {
.rxmode = {
.max_rx_pkt_len = RTE_ETHER_MAX_LEN // max RX packet length = a standard Ethernet frame (1518 bytes)
}
};
/* Initialize the selected NIC port */
static int ustack_init_port(struct rte_mempool *mbuf_pool) {
// Count the NICs available to DPDK
uint16_t nb_sys_ports = rte_eth_dev_count_avail();
if (nb_sys_ports == 0) {
rte_exit(EXIT_FAILURE, "No Supported eth found\n");
}
// Configure the port (1 RX queue, 0 TX queues)
const int num_rx_queues = 1;
const int num_tx_queues = 0;
if (rte_eth_dev_configure(global_portid, num_rx_queues,
num_tx_queues, &port_conf_default) < 0) {
rte_exit(EXIT_FAILURE, "Port configuration failed\n");
}
// Set up RX queue 0 (128 descriptors, backed by the mempool on this NUMA socket)
if (rte_eth_rx_queue_setup(global_portid, 0, 128,
rte_eth_dev_socket_id(global_portid),
NULL, mbuf_pool) < 0) {
rte_exit(EXIT_FAILURE, "Could not setup RX queue\n");
}
// Start the Ethernet device
if (rte_eth_dev_start(global_portid) < 0) {
rte_exit(EXIT_FAILURE, "Could not start port\n");
}
return 0;
}
int main(int argc, char *argv[]) {
// Initialize the DPDK Environment Abstraction Layer (EAL)
if (rte_eal_init(argc, argv) < 0) {
rte_exit(EXIT_FAILURE, "Error with EAL init\n");
}
// Create the mbuf mempool (name, element count, cache size, private data size, per-buffer size, socket ID)
struct rte_mempool *mbuf_pool = rte_pktmbuf_pool_create(
"mbuf pool", NUM_MBUFS,
0, 0, RTE_MBUF_DEFAULT_BUF_SIZE,
rte_socket_id());
if (mbuf_pool == NULL) {
rte_exit(EXIT_FAILURE, "Could not create mbuf pool\n");
}
// Initialize the NIC port
ustack_init_port(mbuf_pool);
// Main packet-processing loop
while (1) {
struct rte_mbuf *mbufs[BURST_SIZE] = {0}; // array of received mbufs
// Receive a burst of packets from the port/queue (returns the number actually received)
uint16_t num_recvd = rte_eth_rx_burst(
global_portid, 0, mbufs, BURST_SIZE);
if (num_recvd > BURST_SIZE) { // DPDK never returns more than BURST_SIZE; this check is redundant
rte_exit(EXIT_FAILURE, "Error receiving from eth\n");
}
// Process each received packet
for (int i = 0; i < num_recvd; i++) {
// Get the Ethernet header (rte_pktmbuf_mtod converts the mbuf into a data pointer)
struct rte_ether_hdr *ethhdr = rte_pktmbuf_mtod(
mbufs[i], struct rte_ether_hdr *);
// Only handle IPv4 packets (note the byte-order conversion)
if (ethhdr->ether_type != rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4)) {
continue; // skip non-IPv4 packets
}
// Get the IP header (right after the Ethernet header)
struct rte_ipv4_hdr *iphdr = rte_pktmbuf_mtod_offset(
mbufs[i], struct rte_ipv4_hdr *,
sizeof(struct rte_ether_hdr));
// Handle UDP packets
if (iphdr->next_proto_id == IPPROTO_UDP) {
// Get the UDP header (right after the IP header)
struct rte_udp_hdr *udphdr = (struct rte_udp_hdr *)(iphdr + 1);
// Print the UDP payload (the bytes after the UDP header)
printf("UDP Payload: %s\n", (char *)(udphdr + 1));
}
rte_pktmbuf_free(mbufs[i]); // free the mbuf (real code may need to defer freeing)
}
}
// Note: a DPDK application normally never reaches this point; terminate it manually
printf("hello dpdk\n");
return 0;
}
Build commands
$ mkdir dpdk
$ mkdir ustack
$ cp ../dpdk-stable-19.08.2/examples/helloworld/Makefile ./
$ vim Makefile (change the output binary name to ustack, then :wq to save and quit)
$ make
II. Sending and receiving UDP/TCP data with DPDK
1. Analysis of the UDP receive flow
2. UDP send/receive source code
#include <stdio.h>
#include <rte_eal.h> // DPDK Environment Abstraction Layer
#include <rte_ethdev.h> // DPDK Ethernet device API
#include <arpa/inet.h> // network address conversion helpers
int global_portid = 0; // global: NIC port to use (port 0 by default)
#define NUM_MBUFS 4096 // number of mbufs (packet buffers) in the mempool
#define BURST_SIZE 128 // maximum number of packets received/sent per burst
#define ENABLE_SEND 1
#if ENABLE_SEND
uint8_t global_smac[RTE_ETHER_ADDR_LEN];
uint8_t global_dmac[RTE_ETHER_ADDR_LEN];
uint32_t global_sip;
uint32_t global_dip;
uint16_t global_sport;
uint16_t global_dport;
#endif
// Default NIC configuration
static const struct rte_eth_conf port_conf_default = {
.rxmode = {
.max_rx_pkt_len = RTE_ETHER_MAX_LEN // max RX packet length = a standard Ethernet frame (1518 bytes)
}
};
/* Initialize the selected NIC port */
static int ustack_init_port(struct rte_mempool *mbuf_pool) {
// Count the NICs available to DPDK
uint16_t nb_sys_ports = rte_eth_dev_count_avail();
if (nb_sys_ports == 0) {
rte_exit(EXIT_FAILURE, "No Supported eth found\n");
}
struct rte_eth_dev_info dev_info;
rte_eth_dev_info_get(global_portid, &dev_info);
// Configure the port (1 RX queue; 1 TX queue when ENABLE_SEND, otherwise 0)
const int num_rx_queues = 1;
#if ENABLE_SEND
const int num_tx_queues = 1;
#else
const int num_tx_queues = 0;
#endif
if (rte_eth_dev_configure(global_portid, num_rx_queues,
num_tx_queues, &port_conf_default) < 0) {
rte_exit(EXIT_FAILURE, "Port configuration failed\n");
}
// Set up RX queue 0 (128 descriptors, backed by the mempool on this NUMA socket)
if (rte_eth_rx_queue_setup(global_portid, 0, 128, rte_eth_dev_socket_id(global_portid), NULL, mbuf_pool) < 0) {
rte_exit(EXIT_FAILURE, "Could not setup RX queue\n");
}
#if ENABLE_SEND
struct rte_eth_txconf txq_conf = dev_info.default_txconf;
txq_conf.offloads = port_conf_default.rxmode.offloads; // inherit offload flags from the port configuration
if (rte_eth_tx_queue_setup(global_portid, 0, 512, rte_eth_dev_socket_id(global_portid), &txq_conf) < 0) {
rte_exit(EXIT_FAILURE, "Could not setup TX queue\n");
}
#endif
// Start the Ethernet device
if (rte_eth_dev_start(global_portid) < 0) {
rte_exit(EXIT_FAILURE, "Could not start port\n");
}
return 0;
}
// Packet builder: fills in the Ethernet, IP and UDP headers plus payload
static int ustack_encode_udp_pkt(uint8_t *msg, uint8_t *data, uint16_t total_len){
//1 Ethernet header
struct rte_ether_hdr *eth = (struct rte_ether_hdr *)msg; // interpret the start of the buffer as an Ethernet header
// destination MAC, source MAC, EtherType
rte_memcpy(eth->d_addr.addr_bytes, global_dmac, RTE_ETHER_ADDR_LEN);
rte_memcpy(eth->s_addr.addr_bytes, global_smac, RTE_ETHER_ADDR_LEN);
eth->ether_type = htons(RTE_ETHER_TYPE_IPV4);
//2 IP header
struct rte_ipv4_hdr *ip = (struct rte_ipv4_hdr *)(eth + 1); // i.e. msg + sizeof(struct rte_ether_hdr): the IP header follows the Ethernet header
ip->version_ihl = 0x45; // version 4, header length 5 * 4 = 20 bytes
ip->type_of_service = 0;
ip->total_length = htons(total_len - sizeof(struct rte_ether_hdr)); // total IP length
ip->packet_id = 0; // 16-bit identification
ip->fragment_offset = 0;
ip->time_to_live = 64; // decremented by each router; limits the packet's lifetime
ip->next_proto_id = IPPROTO_UDP;
ip->src_addr = global_sip;
ip->dst_addr = global_dip;
// IP header checksum: clear it first, then compute
ip->hdr_checksum = 0;
ip->hdr_checksum = rte_ipv4_cksum(ip);
//3 UDP header
struct rte_udp_hdr *udp = (struct rte_udp_hdr *)(ip + 1);
udp->src_port = global_sport;
udp->dst_port = global_dport;
uint16_t udplen = total_len - sizeof(struct rte_ether_hdr) - sizeof(struct rte_ipv4_hdr);
udp->dgram_len = htons(udplen);
rte_memcpy((uint8_t*)(udp + 1), data, udplen - sizeof(struct rte_udp_hdr)); // copy only the payload; udplen still includes the UDP header itself
udp->dgram_cksum = 0;
udp->dgram_cksum = rte_ipv4_udptcp_cksum(ip, udp);
return 0;
}
int main(int argc, char *argv[]) {
// Initialize the DPDK Environment Abstraction Layer (EAL)
if (rte_eal_init(argc, argv) < 0) {
rte_exit(EXIT_FAILURE, "Error with EAL init\n");
}
// Create the mbuf mempool (name, element count, cache size, private data size, per-buffer size, socket ID)
struct rte_mempool *mbuf_pool = rte_pktmbuf_pool_create(
"mbuf pool", NUM_MBUFS,
0, 0, RTE_MBUF_DEFAULT_BUF_SIZE,
rte_socket_id());
if (mbuf_pool == NULL) {
rte_exit(EXIT_FAILURE, "Could not create mbuf pool\n");
}
// Initialize the NIC port
ustack_init_port(mbuf_pool);
// Main packet-processing loop
while (1) {
struct rte_mbuf *mbufs[BURST_SIZE] = {0}; // array of received mbufs
// Receive a burst of packets from the port/queue (returns the number actually received)
uint16_t num_recvd = rte_eth_rx_burst(
global_portid, 0, mbufs, BURST_SIZE);
if (num_recvd > BURST_SIZE) { // DPDK never returns more than BURST_SIZE; this check is redundant
rte_exit(EXIT_FAILURE, "Error receiving from eth\n");
}
// Process each received packet
for (int i = 0; i < num_recvd; i++) {
// Get the Ethernet header (rte_pktmbuf_mtod converts the mbuf into a data pointer)
struct rte_ether_hdr *ethhdr = rte_pktmbuf_mtod(
mbufs[i], struct rte_ether_hdr *);
// Only handle IPv4 packets (note the byte-order conversion)
if (ethhdr->ether_type != rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4)) {
continue; // skip non-IPv4 packets
}
// Get the IP header (right after the Ethernet header)
struct rte_ipv4_hdr *iphdr = rte_pktmbuf_mtod_offset(
mbufs[i], struct rte_ipv4_hdr *,
sizeof(struct rte_ether_hdr));
// Handle UDP packets
if (iphdr->next_proto_id == IPPROTO_UDP) {
// Get the UDP header (right after the IP header)
struct rte_udp_hdr *udphdr = (struct rte_udp_hdr *)(iphdr + 1);
#if ENABLE_SEND
rte_memcpy(global_smac, ethhdr->d_addr.addr_bytes, RTE_ETHER_ADDR_LEN);
rte_memcpy(global_dmac, ethhdr->s_addr.addr_bytes, RTE_ETHER_ADDR_LEN);
rte_memcpy(&global_sip, &iphdr->dst_addr, sizeof(uint32_t));
rte_memcpy(&global_dip, &iphdr->src_addr, sizeof(uint32_t));
rte_memcpy(&global_sport, &udphdr->dst_port, sizeof(uint16_t)); // ports are 16-bit; copying 4 bytes here would overwrite the neighbouring field
rte_memcpy(&global_dport, &udphdr->src_port, sizeof(uint16_t));
struct in_addr addr;
addr.s_addr = iphdr->src_addr;
printf("sip %s:%d -->", inet_ntoa(addr),ntohs(udphdr->src_port));
addr.s_addr = iphdr->dst_addr;
printf("dip %s:%d -->", inet_ntoa(addr),ntohs(udphdr->dst_port));
struct rte_mbuf *mbuf = rte_pktmbuf_alloc(mbuf_pool);
if(!mbuf){
rte_exit(EXIT_FAILURE, "rte_pktmbuf_alloc\n");
}
uint16_t length = ntohs(udphdr->dgram_len);
uint16_t total_len = length + sizeof(struct rte_ipv4_hdr) + sizeof(struct rte_ether_hdr);
mbuf->pkt_len = total_len;
mbuf->data_len = total_len;
uint8_t *msg = rte_pktmbuf_mtod(mbuf, uint8_t *);
ustack_encode_udp_pkt(msg, (uint8_t *)(udphdr + 1), total_len);
rte_eth_tx_burst(global_portid, 0, &mbuf, 1);
#endif
// Print the UDP payload (the bytes after the UDP header)
printf("UDP Payload: %s\n", (char *)(udphdr + 1));
}
rte_pktmbuf_free(mbufs[i]); // free the mbuf (real code may need to defer freeing)
}
}
// Note: a DPDK application normally never reaches this point; terminate it manually
printf("hello dpdk\n");
return 0;
}
The key function is ustack_encode_udp_pkt: it builds the Ethernet, IP and UDP headers.
Read it alongside the IP datagram format.
Use Wireshark to capture and check the packets.
3. TCP transmit source code
tcp.c
#include <stdio.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <arpa/inet.h>
int global_portid = 0;
#define NUM_MBUFS 4096
#define BURST_SIZE 128
#define ENABLE_SEND 1
#if ENABLE_SEND
uint8_t global_smac[RTE_ETHER_ADDR_LEN];
uint8_t global_dmac[RTE_ETHER_ADDR_LEN];
uint32_t global_sip;
uint32_t global_dip;
uint16_t global_sport;
uint16_t global_dport;
#endif
static const struct rte_eth_conf port_conf_default = {
.rxmode = { .max_rx_pkt_len = RTE_ETHER_MAX_LEN }
};
static int ustack_init_port(struct rte_mempool *mbuf_pool) {
//number
uint16_t nb_sys_ports = rte_eth_dev_count_avail();
if (nb_sys_ports == 0) {
rte_exit(EXIT_FAILURE, "No Supported eth found\n");
}
struct rte_eth_dev_info dev_info;
rte_eth_dev_info_get(global_portid, &dev_info);
// eth0,
const int num_rx_queues = 1;
#if ENABLE_SEND
const int num_tx_queues = 1;
#else
const int num_tx_queues = 0;
#endif
rte_eth_dev_configure(global_portid, num_rx_queues, num_tx_queues, &port_conf_default);
if (rte_eth_rx_queue_setup(global_portid, 0, 128, rte_eth_dev_socket_id(global_portid), NULL, mbuf_pool) < 0) {
rte_exit(EXIT_FAILURE, "Could not setup RX queue\n");
}
#if ENABLE_SEND
struct rte_eth_txconf txq_conf = dev_info.default_txconf;
txq_conf.offloads = port_conf_default.rxmode.offloads;
if (rte_eth_tx_queue_setup(global_portid, 0, 512, rte_eth_dev_socket_id(global_portid), &txq_conf) < 0) {
rte_exit(EXIT_FAILURE, "Could not setup TX queue\n");
}
#endif
if (rte_eth_dev_start(global_portid) < 0) {
rte_exit(EXIT_FAILURE, "Could not start\n");
}
return 0;
}
// msg
static int ustack_encode_udp_pkt(uint8_t *msg, uint8_t *data, uint16_t total_len) {
//1 ether header
struct rte_ether_hdr *eth = (struct rte_ether_hdr *)msg;
rte_memcpy(eth->d_addr.addr_bytes, global_dmac, RTE_ETHER_ADDR_LEN);
rte_memcpy(eth->s_addr.addr_bytes, global_smac, RTE_ETHER_ADDR_LEN);
eth->ether_type = htons(RTE_ETHER_TYPE_IPV4);
//1 ip header
struct rte_ipv4_hdr *ip = (struct rte_ipv4_hdr*)(eth + 1); //msg + sizeof(struct rte_ether_hdr);
ip->version_ihl = 0x45;
ip->type_of_service = 0;
ip->total_length = htons(total_len - sizeof(struct rte_ether_hdr));
ip->packet_id = 0;
ip->fragment_offset = 0;
ip->time_to_live = 64;
ip->next_proto_id = IPPROTO_UDP;
ip->src_addr = global_sip;
ip->dst_addr = global_dip;
ip->hdr_checksum = 0;
ip->hdr_checksum = rte_ipv4_cksum(ip);
//1 udp header
struct rte_udp_hdr *udp = (struct rte_udp_hdr *)(ip + 1);
udp->src_port = global_sport;
udp->dst_port = global_dport;
uint16_t udplen = total_len - sizeof(struct rte_ether_hdr) - sizeof(struct rte_ipv4_hdr);
udp->dgram_len = htons(udplen);
rte_memcpy((uint8_t*)(udp+1), data, udplen - sizeof(struct rte_udp_hdr)); // copy only the payload; udplen includes the UDP header itself
udp->dgram_cksum = 0;
udp->dgram_cksum = rte_ipv4_udptcp_cksum(ip, udp);
return 0;
}
// msg
static int ustack_encode_tcp_pkt(uint8_t *msg, uint16_t total_len) {
//1 ether header
struct rte_ether_hdr *eth = (struct rte_ether_hdr *)msg;
rte_memcpy(eth->d_addr.addr_bytes, global_dmac, RTE_ETHER_ADDR_LEN);
rte_memcpy(eth->s_addr.addr_bytes, global_smac, RTE_ETHER_ADDR_LEN);
eth->ether_type = htons(RTE_ETHER_TYPE_IPV4);
//1 ip header
struct rte_ipv4_hdr *ip = (struct rte_ipv4_hdr*)(eth + 1); //msg + sizeof(struct rte_ether_hdr);
ip->version_ihl = 0x45;
ip->type_of_service = 0;
ip->total_length = htons(total_len - sizeof(struct rte_ether_hdr));
ip->packet_id = 0;
ip->fragment_offset = 0;
ip->time_to_live = 64;
ip->next_proto_id = IPPROTO_TCP;
ip->src_addr = global_sip;
ip->dst_addr = global_dip;
ip->hdr_checksum = 0;
ip->hdr_checksum = rte_ipv4_cksum(ip);
//1 tcp header
struct rte_tcp_hdr *tcp = (struct rte_tcp_hdr *)(ip + 1);
tcp->src_port = global_sport;
tcp->dst_port = global_dport;
tcp->sent_seq = htonl(12345); // the initial sequence number (first packet) should really be random
tcp->recv_ack = 0x0;
tcp->data_off = 0x50; // header length: upper 4 bits = 5 32-bit words = 20 bytes (no options)
tcp->tcp_flags = 0x1 << 1; // 0x02 = SYN flag: this sends a synchronization segment
tcp->rx_win = htons(4096); // receive window (rmem): how many more bytes we can accept
tcp->cksum = 0;
tcp->cksum = rte_ipv4_udptcp_cksum(ip, tcp);
return 0;
}
int main(int argc, char *argv[]) {
if (rte_eal_init(argc, argv) < 0) {
rte_exit(EXIT_FAILURE, "Error with EAL init\n");
}
struct rte_mempool *mbuf_pool = rte_pktmbuf_pool_create("mbuf pool", NUM_MBUFS, 0, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
if (mbuf_pool == NULL) {
rte_exit(EXIT_FAILURE, "Could not create mbuf pool\n");
}
ustack_init_port(mbuf_pool);
while (1) {
struct rte_mbuf *mbufs[BURST_SIZE] = {0};
uint16_t num_recvd = rte_eth_rx_burst(global_portid, 0, mbufs, BURST_SIZE);
if (num_recvd > BURST_SIZE) {
rte_exit(EXIT_FAILURE, "Error receiving from eth\n");
}
int i = 0;
for (i = 0;i < num_recvd;i ++) {
struct rte_ether_hdr *ethhdr = rte_pktmbuf_mtod(mbufs[i], struct rte_ether_hdr *);
if (ethhdr->ether_type != rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4)) {
continue;
}
struct rte_ipv4_hdr *iphdr = rte_pktmbuf_mtod_offset(mbufs[i], struct rte_ipv4_hdr *, sizeof(struct rte_ether_hdr));
if (iphdr->next_proto_id == IPPROTO_UDP) {
struct rte_udp_hdr *udphdr = (struct rte_udp_hdr *)(iphdr + 1);
#if ENABLE_SEND
rte_memcpy(global_smac, ethhdr->d_addr.addr_bytes, RTE_ETHER_ADDR_LEN);
rte_memcpy(global_dmac, ethhdr->s_addr.addr_bytes, RTE_ETHER_ADDR_LEN);
rte_memcpy(&global_sip, &iphdr->dst_addr, sizeof(uint32_t));
rte_memcpy(&global_dip, &iphdr->src_addr, sizeof(uint32_t));
rte_memcpy(&global_sport, &udphdr->dst_port, sizeof(uint16_t));
rte_memcpy(&global_dport, &udphdr->src_port, sizeof(uint16_t));
struct in_addr addr;
addr.s_addr = iphdr->src_addr;
printf("sip %s:%d --> ", inet_ntoa(addr), ntohs(udphdr->src_port));
addr.s_addr = iphdr->dst_addr;
printf("dip %s:%d --> ", inet_ntoa(addr), ntohs(udphdr->dst_port));
uint16_t length = ntohs(udphdr->dgram_len);
uint16_t total_len = length + sizeof(struct rte_ipv4_hdr) + sizeof(struct rte_ether_hdr);
struct rte_mbuf *mbuf = rte_pktmbuf_alloc(mbuf_pool);
if (!mbuf) {
rte_exit(EXIT_FAILURE, "Error rte_pktmbuf_alloc\n");
}
mbuf->pkt_len = total_len;
mbuf->data_len = total_len;
uint8_t *msg = rte_pktmbuf_mtod(mbuf, uint8_t *);
ustack_encode_udp_pkt(msg, (uint8_t*)(udphdr+1), total_len);
rte_eth_tx_burst(global_portid, 0, &mbuf, 1);
#endif
printf("udp : %s\n", (char*)(udphdr+1));
} else if (iphdr->next_proto_id == IPPROTO_TCP) {
struct rte_tcp_hdr *tcphdr = (struct rte_tcp_hdr *)(iphdr + 1);
struct in_addr addr;
addr.s_addr = iphdr->src_addr;
printf("tcp pkt sip %s:%d --> ", inet_ntoa(addr), ntohs(tcphdr->src_port));
addr.s_addr = iphdr->dst_addr;
printf("dip %s:%d \n", inet_ntoa(addr), ntohs(tcphdr->dst_port));
rte_memcpy(global_smac, ethhdr->d_addr.addr_bytes, RTE_ETHER_ADDR_LEN);
rte_memcpy(global_dmac, ethhdr->s_addr.addr_bytes, RTE_ETHER_ADDR_LEN);
rte_memcpy(&global_sip, &iphdr->dst_addr, sizeof(uint32_t));
rte_memcpy(&global_dip, &iphdr->src_addr, sizeof(uint32_t));
rte_memcpy(&global_sport, &tcphdr->dst_port, sizeof(uint16_t));
rte_memcpy(&global_dport, &tcphdr->src_port, sizeof(uint16_t));
uint16_t total_len = sizeof(struct rte_tcp_hdr) + sizeof(struct rte_ipv4_hdr) + sizeof(struct rte_ether_hdr);
struct rte_mbuf *mbuf = rte_pktmbuf_alloc(mbuf_pool);
if (!mbuf) {
rte_exit(EXIT_FAILURE, "Error rte_pktmbuf_alloc\n");
}
mbuf->pkt_len = total_len;
mbuf->data_len = total_len;
uint8_t *msg = rte_pktmbuf_mtod(mbuf, uint8_t *);
ustack_encode_tcp_pkt(msg, total_len);
rte_eth_tx_burst(global_portid, 0, &mbuf, 1);
}
}
}
printf("hello dpdk\n");
}
The key addition is the ustack_encode_tcp_pkt function:
it builds the TCP header; read it alongside the TCP header format (see the annotated flag/offset sketch below).
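Two magic numbers in ustack_encode_tcp_pkt deserve a note. The sketch below spells them out with plain #defines for illustration (these names are annotations, not the DPDK macros):

/* data_off: the upper 4 bits are the TCP header length in 32-bit words.
 * 0x50 -> 5 words * 4 bytes = 20 bytes, i.e. a TCP header with no options. */
#define TCP_HDR_LEN_20_BYTES  0x50

/* tcp_flags bit layout: FIN=0x01, SYN=0x02, RST=0x04, PSH=0x08, ACK=0x10, URG=0x20.
 * 0x1 << 1 == 0x02, so the demo packet above carries only the SYN flag.
 * A SYN+ACK for the second step of the handshake would be (0x02 | 0x10). */
#define TCP_FLAG_SYN  (0x1 << 1)
#define TCP_FLAG_ACK  (0x1 << 4)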
Why does UDP carry a length field while TCP does not?
UDP is datagram-oriented: each datagram is independent, so it must carry its own length.
TCP is byte-stream-oriented: sender and receiver manage the stream dynamically with seq and ack, so no per-segment length needs to be declared up front (the payload length can still be derived, as sketched below).
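As a concrete illustration of that difference, the sketch below derives both payload lengths from the DPDK header structs already used in tcp.c (helper functions only, meant to be read next to the receive loop above):

#include <arpa/inet.h>
#include <rte_ip.h>
#include <rte_tcp.h>
#include <rte_udp.h>

/* UDP: the datagram carries its own length field. */
static uint16_t udp_payload_len(const struct rte_udp_hdr *udphdr) {
    return ntohs(udphdr->dgram_len) - sizeof(struct rte_udp_hdr);
}

/* TCP: there is no length field in the TCP header, so the payload length is
 * derived from the IP total length minus both header lengths. */
static uint16_t tcp_payload_len(const struct rte_ipv4_hdr *iphdr,
                                const struct rte_tcp_hdr *tcphdr) {
    uint16_t ip_hdr_len  = (iphdr->version_ihl & 0x0f) * 4;      /* IHL, in 32-bit words */
    uint16_t tcp_hdr_len = ((tcphdr->data_off >> 4) & 0x0f) * 4; /* data offset, in 32-bit words */
    return ntohs(iphdr->total_length) - ip_hdr_len - tcp_hdr_len;
}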
4. Implementing the POSIX API at the application layer
Review: Network programming (3): the POSIX API and the network protocol stack
- TCP three-way handshake: source in ustack.c
- TCP/IP timers and the sliding window: source in tcp.c
III. Designing and implementing epoll
1. Using epoll to give the TCP stack concurrency
- Main logic of the source
  1. rx_burst/tx_burst: fetch packets from / push packets to the NIC queues
  2. pkt_process loop: parses the network data, i.e. the whole TCP parsing path and the TCP/IP stack
  3. sendbuf and rcvbuf belong to a single fd, so when two clients connect at the same time the program core-dumps
- Why epoll was chosen
2. The red-black tree data structure
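The epoll implementation below keeps every registered socket in a red-black tree keyed by fd, using the BSD <sys/tree.h> macros (RB_INIT / RB_FIND / RB_INSERT / RB_REMOVE). A minimal sketch of how such a tree is declared; the field name rbn and the comparison function are illustrative, the real struct epitem lives in tcp_concurrency.c, and <sys/tree.h> is assumed to be available (e.g. from libbsd or bundled with the project):

#include <sys/epoll.h>   // struct epoll_event
#include <sys/tree.h>    // BSD red-black tree macros (assumed available)

struct epitem {
    RB_ENTRY(epitem) rbn;       // node links inside the red-black tree (name is illustrative)
    int sockfd;                 // key: the fd registered via epoll_ctl
    struct epoll_event event;   // events the caller asked to monitor
};

// Order nodes by fd so RB_FIND can locate an epitem from a socket id.
static int sockfd_cmp(struct epitem *a, struct epitem *b) {
    if (a->sockfd < b->sockfd) return -1;
    return a->sockfd > b->sockfd;
}

RB_HEAD(_epoll_rb_socket, epitem);                       // declares the tree head type
RB_GENERATE(_epoll_rb_socket, epitem, rbn, sockfd_cmp);  // generates RB_FIND/RB_INSERT/RB_REMOVE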
3. Implementing the epoll interface in the protocol stack
See tcp_concurrency.c for the full source.
Implementations of epoll_create / epoll_ctl / epoll_wait
// eventpoll --> tcp_table->ep;
int nepoll_create(int size) {
if (size <= 0) return -1;
// epfd --> struct eventpoll
int epfd = get_fd_frombitmap(); //tcp, udp
struct eventpoll *ep = (struct eventpoll*)rte_malloc("eventpoll", sizeof(struct eventpoll), 0);
if (!ep) {
set_fd_frombitmap(epfd);
return -1;
}
struct ng_tcp_table *table = tcpInstance();
table->ep = ep;
ep->fd = epfd;
ep->rbcnt = 0;
RB_INIT(&ep->rbr);
LIST_INIT(&ep->rdlist);
if (pthread_mutex_init(&ep->mtx, NULL)) {
rte_free(ep); // allocated with rte_malloc, so release with rte_free
set_fd_frombitmap(epfd);
return -2;
}
if (pthread_mutex_init(&ep->cdmtx, NULL)) {
pthread_mutex_destroy(&ep->mtx);
rte_free(ep);
set_fd_frombitmap(epfd);
return -2;
}
if (pthread_cond_init(&ep->cond, NULL)) {
pthread_mutex_destroy(&ep->cdmtx);
pthread_mutex_destroy(&ep->mtx);
rte_free(ep);
set_fd_frombitmap(epfd);
return -2;
}
if (pthread_spin_init(&ep->lock, PTHREAD_PROCESS_SHARED)) {
pthread_cond_destroy(&ep->cond);
pthread_mutex_destroy(&ep->cdmtx);
pthread_mutex_destroy(&ep->mtx);
rte_free(ep);
set_fd_frombitmap(epfd);
return -2;
}
return epfd;
}
int nepoll_ctl(int epfd, int op, int sockid, struct epoll_event *event) {
struct eventpoll *ep = (struct eventpoll*)get_hostinfo_fromfd(epfd);
if (!ep || (!event && op != EPOLL_CTL_DEL)) {
errno = -EINVAL;
return -1;
}
if (op == EPOLL_CTL_ADD) {
pthread_mutex_lock(&ep->mtx);
struct epitem tmp;
tmp.sockfd = sockid;
struct epitem *epi = RB_FIND(_epoll_rb_socket, &ep->rbr, &tmp);
if (epi) {
pthread_mutex_unlock(&ep->mtx);
return -1;
}
epi = (struct epitem*)rte_malloc("epitem", sizeof(struct epitem), 0);
if (!epi) {
pthread_mutex_unlock(&ep->mtx);
rte_errno = -ENOMEM;
return -1;
}
epi->sockfd = sockid;
memcpy(&epi->event, event, sizeof(struct epoll_event));
epi = RB_INSERT(_epoll_rb_socket, &ep->rbr, epi);
ep->rbcnt ++;
pthread_mutex_unlock(&ep->mtx);
} else if (op == EPOLL_CTL_DEL) {
pthread_mutex_lock(&ep->mtx);
struct epitem tmp;
tmp.sockfd = sockid;
struct epitem *epi = RB_FIND(_epoll_rb_socket, &ep->rbr, &tmp);
if (!epi) {
pthread_mutex_unlock(&ep->mtx);
return -1;
}
epi = RB_REMOVE(_epoll_rb_socket, &ep->rbr, epi);
if (!epi) {
pthread_mutex_unlock(&ep->mtx);
return -1;
}
ep->rbcnt --;
rte_free(epi); // allocated with rte_malloc
pthread_mutex_unlock(&ep->mtx);
} else if (op == EPOLL_CTL_MOD) {
struct epitem tmp;
tmp.sockfd = sockid;
struct epitem *epi = RB_FIND(_epoll_rb_socket, &ep->rbr, &tmp);
if (epi) {
epi->event.events = event->events;
epi->event.events |= EPOLLERR | EPOLLHUP;
} else {
rte_errno = -ENOENT;
return -1;
}
}
return 0;
}
int nepoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout) {
struct eventpoll *ep = (struct eventpoll*)get_hostinfo_fromfd(epfd);
if (!ep || !events || maxevents <= 0) {
rte_errno = -EINVAL;
return -1;
}
if (pthread_mutex_lock(&ep->cdmtx)) {
if (rte_errno == EDEADLK) {
printf("epoll lock blocked\n");
}
}
while (ep->rdnum == 0 && timeout != 0) {
ep->waiting = 1;
if (timeout > 0) {
struct timespec deadline;
clock_gettime(CLOCK_REALTIME, &deadline);
if (timeout >= 1000) {
int sec;
sec = timeout / 1000;
deadline.tv_sec += sec;
timeout -= sec * 1000;
}
deadline.tv_nsec += timeout * 1000000;
if (deadline.tv_nsec >= 1000000000) {
deadline.tv_sec++;
deadline.tv_nsec -= 1000000000;
}
int ret = pthread_cond_timedwait(&ep->cond, &ep->cdmtx, &deadline);
if (ret && ret != ETIMEDOUT) {
printf("pthread_cond_timewait\n");
pthread_mutex_unlock(&ep->cdmtx);
return -1;
}
timeout = 0;
} else if (timeout < 0) {
int ret = pthread_cond_wait(&ep->cond, &ep->cdmtx);
if (ret) {
printf("pthread_cond_wait\n");
pthread_mutex_unlock(&ep->cdmtx);
return -1;
}
}
ep->waiting = 0;
}
/* Remainder omitted here (see tcp_concurrency.c for the full version): take ep->lock,
move up to maxevents entries from ep->rdlist into events[], decrement ep->rdnum,
release ep->cdmtx, and return the number of ready events copied out. */
}
4. How the protocol stack communicates with the epoll module
1. Data reception and state update
When the network device receives data, an interrupt wakes the kernel, the data is delivered to the corresponding layer of the protocol stack (e.g. the TCP layer), and the state of the matching socket is updated (e.g. data is now readable).
2. The poll method checks the state
Every socket has a poll method; after the protocol stack updates the state, this method is called to check whether any event matches the conditions registered with epoll (e.g. EPOLLIN for readable).
3. Ready-event notification
If the condition is met, the socket wakes the process waiting in epoll_wait through the kernel's wakeup mechanism (e.g. __wake_up() or wake_up_interruptible()) and the socket's event is added to epoll's ready list.
4. User-space processing
Finally, when the user process calls epoll_wait, it takes these already-triggered events off the ready list and handles them (in the user-space stack this notification is done by a callback, sketched below).
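In this user-space stack there is no kernel wakeup path; the equivalent of steps 2 and 3 is a callback that the TCP code invokes whenever a stream changes state. A sketch of its core logic, built on the eventpoll fields used above and meant to sit in the same file as the nepoll_* functions (the function name and the epitem fields rdy/rdlink are assumptions; the real version is in tcp_concurrency.c):

// Called by the protocol stack when `sockid` has new state (e.g. data arrived -> EPOLLIN).
static int epoll_event_callback(struct eventpoll *ep, int sockid, uint32_t event) {
    struct epitem tmp;
    tmp.sockfd = sockid;

    // 1. Is this fd registered in the interest set (the red-black tree)?
    struct epitem *epi = RB_FIND(_epoll_rb_socket, &ep->rbr, &tmp);
    if (!epi) return -1;

    // 2. Move it onto the ready list (only once) and record which event fired.
    pthread_spin_lock(&ep->lock);
    if (!epi->rdy) {                                  // 'rdy': assumed flag marking list membership
        epi->rdy = 1;
        LIST_INSERT_HEAD(&ep->rdlist, epi, rdlink);   // 'rdlink': assumed list linkage field
        ep->rdnum++;
    }
    epi->event.events |= event;
    pthread_spin_unlock(&ep->lock);

    // 3. Wake up any thread blocked in nepoll_wait().
    pthread_mutex_lock(&ep->cdmtx);
    if (ep->waiting) pthread_cond_signal(&ep->cond);
    pthread_mutex_unlock(&ep->cdmtx);
    return 0;
}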
5. What moves an event from the interest set to the ready state?
- Data arrival / readable: when the socket receives data, the peer closes the connection, or some other read-related event happens, and EPOLLIN is registered, the poll method sees the state change and the fd becomes ready.
- Writable buffer: when the send buffer has enough free space and the application registered EPOLLOUT, the fd is marked writable and a readiness notification fires.
- Errors and exceptional states: on an error (EPOLLERR), hang-up (EPOLLHUP) or remote close (EPOLLRDHUP), the fd is considered ready even if no data changed.
- Trigger mode differences: level-triggered (LT) keeps returning the fd from epoll_wait as long as the condition holds; edge-triggered (ET) notifies only on the transition from not-satisfied to satisfied, and does not repeat the notification until the state changes and the condition is met again (see the drain-until-EAGAIN sketch below).
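The practical consequence of the last point, shown with the standard Linux epoll API rather than the nepoll_* functions above: under ET a readable fd is reported only once per edge, so the handler must drain it completely before waiting again.

#include <errno.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

// LT: reading part of the data is fine; epoll_wait will report the fd again.
// ET: the fd is reported only on the edge, so read until EAGAIN before going back to wait.
static void handle_readable_et(int fd) {
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) continue;                                           // consume and keep draining
        if (n == 0) break;                                             // peer closed
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) break; // fully drained
        break;                                                         // real error
    }
}

static void register_et(int epfd, int fd) {
    struct epoll_event ev;
    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN | EPOLLET;   // edge-triggered; fd must be non-blocking
    ev.data.fd = fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}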
Recommended notes:
1. Designing and implementing epoll step by step
2. Summary of how a TCP connection is torn down
3. Receiving UDP data with DPDK
Further reading: https://github.com/0voice