Posted by Harry Potter on 2024-10-07 09:52:24
Re: printtok being updated: ideas?
I'm currently working on PrintTok2 and believe it's doing pretty well, considering the test text is small and compresses poorly. I've been working on a Z-Machine-style 5-bit approach with a number of enhancements: tokenization; support for more punctuation marks in the letter dictionaries, at the cost of one bit on them and on several lesser-used letters; an extra bit on dictionary swaps to indicate whether the swap is just for the current char or for several chars; and the removal of an extra bit per compressed word. I am also working on a naive literals technique, where literals aren't compressed. I'm using tokenization and a form of BPE there. There can be up to 128 one-byte and 128 two-byte tokens, and I can add more two-byte tokens. My version of BPE borrows up to 32 one-byte tokens to act as offsets to the last repeat of two chars. Does anybody here have any ideas for improving this?
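To make the dictionary-swap bit concrete for anyone following along, here is a minimal Python sketch of a 5-bit packer with two dictionaries. It is not PrintTok2's actual format. The two 31-symbol tables, the use of code 31 as the swap code, the meaning of the flag bit (0 = next char only, 1 = stay swapped), and the "lock if the next char is in the same dictionary" heuristic are all assumptions made for illustration. Bits are kept as a string of '0'/'1' to keep the example short.

```python
# Hypothetical 31-symbol dictionaries; the real PrintTok2 tables are not shown in the post.
LETTERS = " abcdefghijklmnopqrstuvwxyz.,!?"
OTHERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ01234"
TABLES = (LETTERS, OTHERS)
SWAP = 31  # assumed: the 32nd 5-bit code means "use the other dictionary"


def encode(text):
    """Pack text into a '0'/'1' string: 5 bits per char, 6 extra bits per swap."""
    bits = []
    cur = 0  # index of the active dictionary
    for i, ch in enumerate(text):
        if ch in TABLES[cur]:
            which = cur
        elif ch in TABLES[1 - cur]:
            which = 1 - cur
        else:
            raise ValueError(f"character {ch!r} is in neither dictionary")
        if which != cur:
            nxt = text[i + 1] if i + 1 < len(text) else None
            # Assumed heuristic: lock the swap when the next char needs it too.
            lock = nxt is not None and nxt in TABLES[which]
            bits.append(format(SWAP, "05b"))
            bits.append("1" if lock else "0")  # the extra swap bit
            if lock:
                cur = which
        bits.append(format(TABLES[which].index(ch), "05b"))
    return "".join(bits)


def decode(bits):
    """Inverse of encode(); stops when fewer than 5 bits remain."""
    out = []
    cur = 0
    pos = 0
    while pos + 5 <= len(bits):
        code = int(bits[pos:pos + 5], 2)
        pos += 5
        table = cur
        if code == SWAP:
            lock = bits[pos] == "1"
            pos += 1
            table = 1 - cur
            if lock:
                cur = table  # swap stays in effect for the following chars
            code = int(bits[pos:pos + 5], 2)
            pos += 5
        out.append(TABLES[table][code])
    return "".join(out)


if __name__ == "__main__":
    sample = "Hello, World 42!"
    packed = encode(sample)
    print(len(packed), "bits packed vs", 8 * len(sample), "bits raw")
    print(decode(packed) == sample)
```

With this layout a one-off capital costs 11 bits (swap code, flag bit, then the char), while a run of chars from the other dictionary pays the 6-bit swap overhead only once. Whether that trade wins depends on how often the text switches dictionaries, so it is worth measuring on the real test text.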