For packetizing the mp3 data I used the following scheme:
- the server reads data from the file and fills up two buffers: a frame header buffer and a frame data buffer;
- when either the buffered frames (counted by the headers in the header buffer) cover 200 ms of audio or the packet size approaches one MTU (1300 bytes), the server builds and sends an RTP packet;
- the RTP packet contains an RTP header, a media-specific header, all the frame headers one after the other, and then all the frame data;
- if the packet was sent before 200 ms of audio had accumulated (because it would otherwise have become too large), the interval until the next packet is sent is adjusted accordingly;
- the client waits until it receives enough packets to build an integer number of frames, then writes them to disk;
- the client also uses two buffers, a header buffer and a data buffer; whenever a packet loss is detected, the client fills in the header and data buffers accordingly before writing to disk.
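The server-side flush rule in the list above can be sketched as follows. This is only an illustration under stated assumptions, not the original implementation: the class name `Packetizer`, the 8-byte media-specific header size, and the simplification that each frame arrives as one header plus one data block are all hypothetical. The 200 ms target and the 1300-byte budget come from the text.

```python
from dataclasses import dataclass, field

MTU_BUDGET = 1300      # bytes, from the text
TARGET_MS = 200.0      # audio time per packet, from the text
RTP_HEADER_LEN = 12    # fixed RTP header without CSRC entries
MEDIA_HEADER_LEN = 8   # assumed size of the media-specific header


@dataclass
class Packetizer:
    """Hypothetical helper: buffers frames and decides when a packet is due."""
    frame_ms: float                      # playback time of one frame
    headers: list = field(default_factory=list)
    data: list = field(default_factory=list)

    def _size(self) -> int:
        # Size the packet would have if it were sent right now.
        body = sum(len(h) for h in self.headers) + sum(len(d) for d in self.data)
        return RTP_HEADER_LEN + MEDIA_HEADER_LEN + body

    def _flush(self):
        # All frame headers first, then all frame data, as in the scheme.
        payload = b"".join(self.headers) + b"".join(self.data)
        # The covered audio time is also the wait until the next send.
        duration_ms = len(self.headers) * self.frame_ms
        self.headers, self.data = [], []
        return payload, duration_ms

    def push(self, header: bytes, body: bytes):
        """Add one frame; return (payload, duration_ms) if a packet is due, else None."""
        packet = None
        # Size rule: send early if this frame would push the packet past the budget.
        if self.headers and self._size() + len(header) + len(body) > MTU_BUDGET:
            packet = self._flush()
        self.headers.append(header)
        self.data.append(body)
        # Time rule: send once the buffered frames cover the target duration.
        if packet is None and len(self.headers) * self.frame_ms >= TARGET_MS:
            packet = self._flush()
        return packet
```

When the size rule fires first, the returned duration is below 200 ms, which is the shortened interval the sender waits before the next packet. The sketch assumes a single frame always fits in the budget.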
The media-specific header contains the following elements:
- the frame number, starting with zero, of the first frame in the packet, used to count lost frames;
- the number of frames in the packet, used to delimit the frame-header zone within the packet;
- the offset of the first frame in the next packet, used to compute the length of the last frame in the current packet; it is needed only when a packet loss is detected.
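One possible wire layout for these three fields is sketched below. The field widths (32-bit frame number, 16-bit frame count, 16-bit offset), the network byte order, and the function names are assumptions made for illustration; the text does not specify them. The 4-byte frame header length is the standard MPEG audio frame header size.

```python
import struct

# Assumed layout: first frame number (u32), frame count (u16),
# offset of the first frame in the next packet (u16), big-endian.
MEDIA_HEADER = struct.Struct("!IHH")
FRAME_HEADER_LEN = 4  # an MPEG audio frame header is 4 bytes


def pack_media_header(first_frame: int, frame_count: int, next_offset: int) -> bytes:
    return MEDIA_HEADER.pack(first_frame, frame_count, next_offset)


def split_payload(payload: bytes):
    """Split an RTP payload into its media header fields, frame headers and frame data."""
    first_frame, frame_count, next_offset = MEDIA_HEADER.unpack_from(payload)
    start = MEDIA_HEADER.size
    # The frame count tells us where the header zone ends and the data begins.
    end = start + frame_count * FRAME_HEADER_LEN
    headers = [payload[i:i + FRAME_HEADER_LEN]
               for i in range(start, end, FRAME_HEADER_LEN)]
    return first_frame, headers, payload[end:], next_offset


def lost_frames(prev_first: int, prev_count: int, first_frame: int) -> int:
    """Frames missing between the previous packet and the current one."""
    return first_frame - (prev_first + prev_count)
```

For example, if the previous packet started at frame 40 and carried 2 frames, a packet whose first frame number is 45 indicates that 3 frames were lost in between.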