Converting Binary Data to Digits
Getting the most out of your random data.
Normally when you want a random number you want the number to fall within a range. If you are manually generating the number (archive)
from a binary source you are only worried about having enough bits. Any extra bits are discarded. For the purpose of
generating a stream of random digits a more bit conservative approach is better.
My current favorite is this
method (archive)
by Andreas Krey. Their method was simple and
wonderful. Use the decimal value of the byte (0-255). If it is 000 - 199 use the last 2 digits. If it is 200 - 249 use
the last digit. If it is 250 - 255 discard the byte. On average this will give you 1.7578 digits per byte. That is more
than 1 digit (normally) and it does not introduce any bias.
Rewitten:
< 200 2 digits
< 250 1 digit
>= 250 discard
2*200/256 + 1*50/256 = 1.7578 digits per byte (on average).
Let's do the same thing with 2 bytes instead of 1.
< 60000 4 digits
< 65000 3 digits
< 65500 2 digits
< 65530 1 digit
>= 65530 discard
4*60000/65536 + 3*5000/65536 + 2*500/65536 + 1*30/65536 = 3.9067 digits per 2 bytes or 1.95335
digits per byte (on average).
2 bytes at a time is clearly better than 1. For what it is worth 3 is even better, and 4 is even better than that.
Is there a perfect number of bytes to read?
No. There is no byte count (or bit count) where all of the bits can be used without introducing bias. The perfect
combination would be all ones in binary converting to all nines in decimal. Rick Regan does an excellent job explaining why this can not happen in his
post Nines in Binary (archive). If you
look at his table you will see the pattern. There will always be a zero before the series of ones at the end. All nines
can not convert to all ones. Since there is no perfect number of bits or bytes to read we will have to compromise in some
way.
How many bytes should you read at a time?
If you are limited to 32 bit numbers then 4 bytes is the best you can do. If you can use 64 bit numbers 5 bytes is your
best option. 5 does better than 6, 7 or 8 bytes at a time.
Here is a table showing the number of bytes and the average digits per byte for 32 and 64 bit numbers:
Bytes | Digits/Byte |
1 | 1.7578125 |
2 | 1.9533539 |
3 | 2.1815580 |
4 | 2.2269530 |
5 | 2.3619721 |
6 | 2.2838879 |
7 | 2.2813902 |
8 | 2.3143676 |
If you were to continue that table you would find that 5 bytes outperforms 1, 2, 3, 4, 6, 7, 8, 9, 11, 12, 14, 17, 19 and
22 bytes.
If you are not limited to 32 or 64 bit integers you can do better. For pyOTP I chose 49 byte blocks which provide an average of 2.4076 digits per byte. There is a longer table below.
Bit Recovery
It is possible to recover a bit or two from the rejected numbers. The last digit of 28n-1 is always a 5.
If the rejected number ends in 0-3 you can keep the last 2 bits. If the rejected number ends in a 4 or 5
you can keep the last bit only.
Decimal | Binary | Recovered Bits |
xxx0 | xxx1010 | 2 |
xxx1 | xxx1011 | 2 |
xxx2 | xxx1100 | 2 |
xxx3 | xxx1101 | 2 |
xxx4 | xxx1110 | 1 |
xxx5 | xxx1111 | 1 |
Depending on your block size and data stream length this might not be worth doing. In fact this is probably only worth
doing if you are working with 1 byte at a time or less. Keep in mind the recovered bytes will have the same rejection
rate.
Bytes | % Rejection | Bytes Recovered/kB |
0.5 | 37.5 | 160.0 |
1 | 2.34 | 5.0 |
2 | 0.0092 | 0.0098 |
3 | 0.000036 | 0.000025 |
4 | 0.00000014 | 0.000000075 |
5 | 0.00000000055 | 0.00000000023 |
Sample Implementation
Python:
from binascii import hexlify as _hexlify
def chartodigit(char):
seed = int(_hexlify(char), 16)
max = str(256**len(char)-1)
count = len(max)
for i in range(1, count):
div = int(max[:i].ljust(count, '0'))
if seed < div:
return str(seed).zfill(count-i)[-(count-i):]
return ''
Example:
rng = open('/dev/urandom', 'rb')
for i in range(3):
print chartodigit(rng.read(5))
rng.close
# 488501925387
# 780406901368
# 905036030054
Other ways
Byte mod 10. - Never do this! This will add a bias to your output. 0-5 will occur more often than 6-9.
Keep ASCII 48-57 (tr -dc 0-9 </dev/urandom). - This will provide 0.03906 digits per byte.
If byte is <250 use byte mod 10. - This will provide 0.97656 digits per byte.
Split byte into 2 nibbles (0-15) and drop anything >9. - This will provide 1.25000 digits per
byte.
Base64, translate letters to numbers (base64 </dev/urandom | tr 'A-Za-x' '0-90-90-90-90-9' | tr -dc 0-9). Bytes must be
read in multiples of 3 or the last group will be biased. - This will provide 1.27083 digits
per byte.
Bytes to Digits Table
This table contains only the bytes <100,000 that provide more digits per byte than previous bytes.
Bytes | Digits | Digits/Byte | Decimal Prefix | Byte Pattern |
0.5 | 0.6 | 1.2500000000000 | 1 | |
1 | 1.8 | 1.7578125000000 | 25 | |
2 | 3.9 | 1.9533538816500 | 65 | |
3 | 6.5 | 2.1815580125667 | 16 | |
4 | 8.9 | 2.2269530153250 | 42 | |
5 | 11.8 | 2.3619721283400 | 109 | |
10 | 23.8 | 2.3811621381700 | 12 | |
13 | 31.0 | 2.3821499364462 | 20 | |
18 | 42.9 | 2.3824023664556 | 22 | |
25 | 59.6 | 2.3845221559400 | 16 | |
27 | 64.9 | 2.4035557773111 | 105 | |
49 | 118.0 | 2.4076198531347 | 1008 | |
98 | 236.0 | 2.4077330214735 | 101 | 49*2 |
147 | 353.9 | 2.4077708072585 | 102 | 49*3 |
196 | 471.9 | 2.4077891540730 | 103 | 49*4 |
245 | 589.9 | 2.4078001232318 | 104 | 49*5 |
267 | 643 | 2.4082370089060 | 10001 | |
534 | 1286 | 2.4082371983963 | 10003 | 267*2 |
1068 | 2572 | 2.4082372090436 | 10006 | 267*4 |
1335 | 3215 | 2.4082372448467 | 10008 | 267*5 |
1869 | 4501 | 2.4082377718967 | 1001 | 267+1602 |
3471 | 8359 | 2.4082378276556 | 1002 | 267+1602*2 |
5073 | 12217 | 2.4082378310175 | 1003 | 267+1602*3 |
6675 | 16075 | 2.4082378489406 | 1004 | 267+1602*4 |
8277 | 19933 | 2.4082378584684 | 1005 | 267+1602*5 |
9879 | 23791 | 2.4082378647275 | 1006 | 267+1602*6 |
11481 | 27649 | 2.4082378697417 | 1007 | 267+1602*7 |
13083 | 31507 | 2.4082378743368 | 1008 | 267+1602*8 |
14369 | 34604 | 2.4082399175154 | 10001 | 267+14102 |
28471 | 68565 | 2.4082399443059 | 10001 | 267+14102*2 |
42573 | 102526 | 2.4082399511681 | 100009 | 267+14102*3 |
56675 | 136487 | 2.4082399564350 | 100007 | 267+14102*4 |
70777 | 170448 | 2.4082399593609 | 100005 | 267+14102*5 |
84879 | 204409 | 2.4082399614766 | 100003 | 267+14102*6 |
98981 | 238370 | 2.4082399637119 | 100001 | 267+14102*7 |
| 2.4082399653118 | | |
The theoretical perfect digits per byte is log10 256.
Bits to Digits Table
If it is convenient to use bits instead of bytes you can get more bang for your buck. 30 bits does better than the best
even byte 32 and 64 bit numbers. 50 bits does better than that. The first >2.4 digits per byte occurs at 103 bits while
it takes 216 bits (27 bytes) to do the same thing using even bytes.
The best bit and even byte options for 32 and 64 bit numbers:
Bits | Bytes | Digits/Byte |
30 | | 2.3622474618933 |
32 | 4 | 2.2269530153250 |
40 | 5 | 2.3619721283400 |
50 | | 2.3774459001280 |
The bit counts that provide more digits per byte than previous bit counts.
Bits | Bytes | Digits | Digits/Byte | Decimal Prefix | Bit Pattern |
4 | 0.5 | 0.6 | 1.2500000000000 | 1 | |
5 | | 0.9 | 1.5000000000000 | 3 | |
7 | | 1.7 | 1.9642857142857 | 12 | |
10 | | 2.9 | 2.3593750000000 | 102 | |
20 | | 5.9 | 2.3594169614800 | 104 | |
30 | | 8.9 | 2.3622474618933 | 107 | |
50 | | 14.9 | 2.3774459001280 | 11 | |
80 | 10 | 23.8 | 2.3811621381700 | 12 | |
103 | | 31 | 2.4052776017398 | 101 | |
196 | | 59 | 2.4076191729469 | 1004 | |
392 | 49 | 118 | 2.4076198531347 | 1008 | 196*2 |
588 | | 177 | 2.4077693320776 | 101 | 196*3 |
681 | | 205 | 2.4081030890314 | 1003 | 196+485 |
1166 | | 351 | 2.4081850477640 | 1002 | 196+485*2 |
1651 | | 497 | 2.4082189023389 | 1001 | 196+485*3 |
2136 | 267 | 643 | 2.4082370089060 | 10001 | 196+485*4 |
4272 | 534 | 1286 | 2.4082371983963 | 10003 | 2136*2 |
8544 | 1068 | 2572 | 2.4082372090436 | 10006 | 2136*4 |
10680 | 1335 | 3215 | 2.4082372448467 | 10008 | 2136*5 |
14952 | 1869 | 4501 | 2.4082377718967 | 1001 | 2136*7 |
15437 | | 4647 | 2.4082396797495 | 100009 | 2136+13301 |
28738 | | 8651 | 2.4082399090878 | 100003 | 2136+13301*2 |
57476 | | 17302 | 2.4082399102388 | 100007 | 4272+13301*4 |
70777 | | 21306 | 2.4082399591166 | 1000007 | 4272+13301*5 |
141554 | | 42612 | 2.4082399594464 | 100001 | 70777*2 |
325147 | | 97879 | 2.4082399642812 | 10000003 | 42039+70777*4 From A073733 |
650294 | | 195758 | 2.4082399642824 | 10000007 | 325147*2 |
975441 | | 293637 | 2.4082399642902 | 1000001 | 325147*3 |
| 2.4082399653118 | |