Difference between revisions of "Compression"

From Terranigma Wiki
Jump to navigation Jump to search
(described compression in detail)
(described exact compression of game data)
Line 66: Line 66:
In the table, each '''cc''' stands for a control code bit. The second data byte is divided into 5+3 bits for mode '''01'''.
In the table, each '''cc''' stands for a control code bit. The second data byte is divided into 5+3 bits for mode '''01'''.


== Compression used for the Game ==
In general there is more than one way to compress a chunk of data with the given format.
But the game's developers used a very specific ''greedy'' algorithm to compress all the data in the game.
It can be reproduced with the following algorithm:
# Append 0x100 zero bytes to the (uncompressed) input data. These don't count into '''input_size''' and are only there for padding!
# Output the header, using 00, '''input_size''', input[0]
# '''input_pos''' = 1
# While there is still input to process, repeat the following:
## Find the longest match of data for the current input_pos:
##* Look in the range '''max(0,input_pos-0x2000)..(input_pos-1)''' for a match of up to '''0x100''' bytes.
##* Call the position at which the match is found '''best_offset''' and the length of the match '''best_length'''.
##* If there are multiple matches of the same length, use the one with the highest offset.
## if '''2 <= best_length <= 5''' and '''best_offset''' is no less than '''inpus_pos-0x100''', then encode the match using mode '''00'''.<br />else if '''best_length <= 2''', then encode the current input byte as literal and set '''best_length = 1'''.<br />else if '''best_length <= 9''', then encode the match using mode '''01'''.<br />else encode the match using mode '''01L'''.
## '''inpus_pos''' += '''best_length'''


== Tool ==
== Tool ==

Revision as of 17:12, 6 March 2016

Terranigma uses a compression for most of the data.

Format

Compressed data always starts with a 4 byte header:

Offset Length Description
0 1 Chunk Counter ??, always 0 (not used in the game)
1 2 Uncompressed data size (LE)
3 1 First byte of uncompressed data

After that follows the compressed data. It consists of data bytes intersparsed with a control code bitstream, both use MSB first (BE) encoding. Data bytes are always aligned at byte boundary. The control code stream is fit in between the data bytes so that a new cc byte is only placed when a bit needs to be used and the previous cc byte is full.

Terranigma Compression Format
cc cc cc cc 8 bit 5 bit 3 bit 8 bit Description
1 - Literal - Literal Byte
0 1 - Offset Length != 0 - Copy Length+2 bytes from next_output_pos-0x2000+offset to the output
0 Length != 0 Copy Length+1 bytes from next_output_pos-0x2000+offset to the output
0 End of Data (usually in this case Offset is also 0)
0 Length Offset - Copy Length+2 bytes from next_output_pos-0x100+offset to the output

In the table, each cc stands for a control code bit. The second data byte is divided into 5+3 bits for mode 01.

Compression used for the Game

In general there is more than one way to compress a chunk of data with the given format. But the game's developers used a very specific greedy algorithm to compress all the data in the game. It can be reproduced with the following algorithm:

  1. Append 0x100 zero bytes to the (uncompressed) input data. These don't count into input_size and are only there for padding!
  2. Output the header, using 00, input_size, input[0]
  3. input_pos = 1
  4. While there is still input to process, repeat the following:
    1. Find the longest match of data for the current input_pos:
      • Look in the range max(0,input_pos-0x2000)..(input_pos-1) for a match of up to 0x100 bytes.
      • Call the position at which the match is found best_offset and the length of the match best_length.
      • If there are multiple matches of the same length, use the one with the highest offset.
    2. if 2 <= best_length <= 5 and best_offset is no less than inpus_pos-0x100, then encode the match using mode 00.
      else if best_length <= 2, then encode the current input byte as literal and set best_length = 1.
      else if best_length <= 9, then encode the match using mode 01.
      else encode the match using mode 01L.
    3. inpus_pos += best_length

Tool

Crediar re'ed the decompression from the code and translated it to C code.

You can download a tool with the source to decompress data from any given offset here.