Converting Binary Data to Digits

Getting the most out of your random data.

Normally when you want a random number you want it to fall within a range. If you are manually generating the number (archive) from a binary source, you only need to worry about having enough bits; any extra bits are discarded. For the purpose of generating a stream of random digits, a more bit-conservative approach is better.

My current favorite is this method (archive) by Andreas Krey. The method is simple and wonderful. Use the decimal value of the byte (0-255). If it is 000-199, use the last 2 digits. If it is 200-249, use the last digit. If it is 250-255, discard the byte. On average this will give you 1.7578 digits per byte. That is more than 1 digit (normally) and it does not introduce any bias.

Rewritten:
< 200    2 digits
< 250    1 digit
>= 250   discard
2*200/256 + 1*50/256 = 1.7578 digits per byte (on average).
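A minimal sketch of this single-byte rule in Python (the function name is mine, not from Krey's post):

```python
import os

def byte_to_digits(b):
    """Map one random byte (0-255) to 0, 1, or 2 unbiased decimal digits."""
    if b < 200:                     # 000-199: keep the last two digits
        return '%02d' % (b % 100)
    if b < 250:                     # 200-249: keep the last digit
        return str(b % 10)
    return ''                       # 250-255: discard the byte

# Build a digit stream from the OS random source.
stream = ''.join(byte_to_digits(b) for b in os.urandom(32))
```

Note the zero padding: byte 7 must become "07", not "7", or low digits would be over-represented.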

Let's do the same thing with 2 bytes instead of 1.
< 60000    4 digits
< 65000    3 digits
< 65500    2 digits
< 65530    1 digit
>= 65530   discard
4*60000/65536 + 3*5000/65536 + 2*500/65536 + 1*30/65536 = 3.9067 digits per 2 bytes or 1.95335 digits per byte (on average).

2 bytes at a time is clearly better than 1. For what it is worth, 3 is even better, and 4 is better still.
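The expected yield for any block size can be computed directly from the thresholds. A sketch (the helper is my own, generalizing the arithmetic above):

```python
def digits_per_byte(nbytes):
    """Average decimal digits per byte when converting nbytes at a time.

    A value keeps k digits when it is below the (len-k)-digit prefix of
    the maximum value, padded out with zeros (60000, 65000, ... for 2 bytes).
    """
    total = 256 ** nbytes
    m = str(total - 1)              # e.g. '65535' for 2 bytes
    count = len(m)
    expected = 0
    prev = 0
    for i in range(1, count):
        threshold = int(m[:i].ljust(count, '0'))
        expected += (threshold - prev) * (count - i)
        prev = threshold
    return expected / (nbytes * total)
```

For example, `digits_per_byte(1)` gives 1.7578125 and `digits_per_byte(2)` gives roughly 3.9067/2, matching the figures above.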

Is there a perfect number of bytes to read?

No. There is no byte count (or bit count) where all of the bits can be used without introducing bias. The perfect combination would be all ones in binary converting to all nines in decimal. Rick Regan does an excellent job explaining why this cannot happen in his post Nines in Binary (archive). If you look at his table you will see the pattern: there will always be a zero before the series of ones at the end, so a number that is all ones in binary can never be all nines in decimal. Since there is no perfect number of bits or bytes to read, we will have to compromise in some way.
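A quick empirical check of this (a scan, not a proof):

```python
# Look for a bit width n where 2**n - 1 is written entirely with
# nines in decimal; per Rick Regan's argument, none should exist.
hits = [n for n in range(1, 5000) if set(str(2 ** n - 1)) == {'9'}]
print(hits)   # prints []
```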

How many bytes should you read at a time?

If you are limited to 32-bit numbers then 4 bytes is the best you can do. If you can use 64-bit numbers, 5 bytes is your best option. 5 bytes does better than 6, 7, or 8 bytes at a time.

Here is a table showing the number of bytes and the average digits per byte for 32 and 64 bit numbers:
Bytes   Digits/Byte
1       1.7578125
2       1.9533539
3       2.1815580
4       2.2269530
5       2.3619721
6       2.2838879
7       2.2813902
8       2.3143676

If you were to continue that table you would find that 5 bytes outperforms 1, 2, 3, 4, 6, 7, 8, 9, 11, 12, 14, 17, 19 and 22 bytes.
If you are not limited to 32 or 64 bit integers you can do better. For pyOTP I chose 49 byte blocks which provide an average of 2.4076 digits per byte. There is a longer table below.

Bit Recovery

It is possible to recover a bit or two from the rejected numbers. The last digit of 2^(8n)-1 is always a 5, so the six rejected values always end in the decimal digits 0 through 5. If the rejected number ends in 0-3 you can keep the last 2 bits. If the rejected number ends in a 4 or 5 you can keep the last bit only.
Decimal   Binary    Recovered Bits
xxx0      xxx1010   2
xxx1      xxx1011   2
xxx2      xxx1100   2
xxx3      xxx1101   2
xxx4      xxx1110   1
xxx5      xxx1111   1
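For the single-byte case, the mapping in that table can be sketched as (the helper name is mine):

```python
def recover_bits(b):
    """Recover trailing bits from a rejected byte (250-255).

    Rejected values whose decimal form ends in 0-3 carry two uniformly
    distributed low bits; those ending in 4 or 5 carry only one.
    """
    if not 250 <= b <= 255:
        raise ValueError('only rejected bytes carry recoverable bits')
    if b % 10 <= 3:                     # 250, 251, 252, 253
        return format(b & 0b11, '02b')  # last two bits
    return format(b & 0b1, 'b')         # 254, 255: last bit only
```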

Depending on your block size and data stream length this might not be worth doing. In fact, it is probably only worth doing if you are working with 1 byte at a time or less. Keep in mind that bytes assembled from the recovered bits will have the same rejection rate.
Bytes   % Rejection      Bytes Recovered/kB
0.5     37.5             160.0
1       2.34             5.0
2       0.0092           0.0098
3       0.000036         0.000025
4       0.00000014       0.000000075
5       0.00000000055    0.00000000023

Sample Implementation

Python:
from binascii import hexlify as _hexlify

def chartodigit(char):
    """Convert a bytes object to a string of unbiased decimal digits.

    Returns '' when the value falls in the rejected range.
    """
    seed = int(_hexlify(char), 16)
    limit = str(256 ** len(char) - 1)   # e.g. '65535' for two bytes
    count = len(limit)
    for i in range(1, count):
        # Thresholds: 60000, 65000, 65500, 65530 for two bytes.
        div = int(limit[:i].ljust(count, '0'))
        if seed < div:
            # Keep the last (count - i) digits, zero-padded.
            return str(seed).zfill(count - i)[-(count - i):]
    return ''
Example:
with open('/dev/urandom', 'rb') as rng:
    for i in range(3):
        print(chartodigit(rng.read(5)))

# 488501925387
# 780406901368
# 905036030054

Other ways

Byte mod 10. - Never do this! This will add a bias to your output. 0-5 will occur more often than 6-9.
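The bias is easy to see by counting how many byte values land on each digit:

```python
from collections import Counter

# 256 is not a multiple of 10, so byte % 10 cannot be uniform:
# residues 0-5 each come from 26 byte values, 6-9 from only 25.
counts = Counter(b % 10 for b in range(256))
print(sorted(counts.items()))
```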

Keep ASCII 48-57 (tr -dc 0-9 </dev/urandom). - This will provide 0.03906 digits per byte.

If byte is <250 use byte mod 10. - This will provide 0.97656 digits per byte.

Split byte into 2 nibbles (0-15) and drop anything >9. - This will provide 1.25000 digits per byte.

Base64, translate letters to numbers (base64 </dev/urandom | tr 'A-Za-x' '0-90-90-90-90-9' | tr -dc 0-9). Bytes must be read in multiples of 3 or the last group will be biased. - This will provide 1.27083 digits per byte.

Bytes to Digits Table

This table contains only the byte counts below 100,000 that provide more digits per byte than every smaller count. The Decimal Prefix column shows the leading digits of 256^bytes; block sizes whose maximum value just clears a power of ten waste the least.
Bytes   Digits   Digits/Byte      Decimal Prefix   Byte Pattern
0.5     0.6      1.2500000000000  1
1       1.8      1.7578125000000  25
2       3.9      1.9533538816500  65
3       6.5      2.1815580125667  16
4       8.9      2.2269530153250  42
5       11.8     2.3619721283400  109
10      23.8     2.3811621381700  12
13      31.0     2.3821499364462  20
18      42.9     2.3824023664556  22
25      59.6     2.3845221559400  16
27      64.9     2.4035557773111  105
49      118.0    2.4076198531347  1008
98      236.0    2.4077330214735  101              49*2
147     353.9    2.4077708072585  102              49*3
196     471.9    2.4077891540730  103              49*4
245     589.9    2.4078001232318  104              49*5
267     643      2.4082370089060  10001
534     1286     2.4082371983963  10003            267*2
1068    2572     2.4082372090436  10006            267*4
1335    3215     2.4082372448467  10008            267*5
1869    4501     2.4082377718967  1001             267+1602
3471    8359     2.4082378276556  1002             267+1602*2
5073    12217    2.4082378310175  1003             267+1602*3
6675    16075    2.4082378489406  1004             267+1602*4
8277    19933    2.4082378584684  1005             267+1602*5
9879    23791    2.4082378647275  1006             267+1602*6
11481   27649    2.4082378697417  1007             267+1602*7
13083   31507    2.4082378743368  1008             267+1602*8
14369   34604    2.4082399175154  10001            267+14102
28471   68565    2.4082399443059  10001            267+14102*2
42573   102526   2.4082399511681  100009           267+14102*3
56675   136487   2.4082399564350  100007           267+14102*4
70777   170448   2.4082399593609  100005           267+14102*5
84879   204409   2.4082399614766  100003           267+14102*6
98981   238370   2.4082399637119  100001           267+14102*7
limit            2.4082399653118
The theoretical perfect digits per byte is log10 256.

Bits to Digits Table

If it is convenient to use bits instead of bytes you can get more bang for your buck. 30 bits does better than the best whole-byte option for 32-bit numbers (4 bytes), and 50 bits does better than the best for 64-bit numbers (5 bytes). The first block size to exceed 2.4 digits per byte occurs at 103 bits, while it takes 216 bits (27 bytes) to do the same thing using whole bytes.

The best bit and even byte options for 32 and 64 bit numbers:
Bits   Bytes   Digits/Byte
30             2.3622474618933
32     4       2.2269530153250
40     5       2.3619721283400
50             2.3774459001280
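The same yield calculation works per bit. A sketch (my own helper, using the same prefix thresholds as the byte version):

```python
def digits_per_byte_from_bits(nbits):
    """Average decimal digits per *byte* when reading nbits at a time."""
    total = 2 ** nbits
    m = str(total - 1)
    count = len(m)
    expected = 0
    prev = 0
    for i in range(1, count):
        threshold = int(m[:i].ljust(count, '0'))
        expected += (threshold - prev) * (count - i)
        prev = threshold
    # Divide by the block size in bytes (nbits / 8) times the block count.
    return expected / (nbits / 8 * total)
```

`digits_per_byte_from_bits(30)` reproduces the 2.3622474618933 figure above.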

The bit counts that provide more digits per byte than previous bit counts.
Bits     Bytes   Digits   Digits/Byte      Decimal Prefix   Bit Pattern
4        0.5     0.6      1.2500000000000  1
5                0.9      1.5000000000000  3
7                1.7      1.9642857142857  12
10               2.9      2.3593750000000  102
20               5.9      2.3594169614800  104
30               8.9      2.3622474618933  107
50               14.9     2.3774459001280  11
80       10      23.8     2.3811621381700  12
103              31       2.4052776017398  101
196              59       2.4076191729469  1004
392      49      118      2.4076198531347  1008             196*2
588              177      2.4077693320776  101              196*3
681              205      2.4081030890314  1003             196+485
1166             351      2.4081850477640  1002             196+485*2
1651             497      2.4082189023389  1001             196+485*3
2136     267     643      2.4082370089060  10001            196+485*4
4272     534     1286     2.4082371983963  10003            2136*2
8544     1068    2572     2.4082372090436  10006            2136*4
10680    1335    3215     2.4082372448467  10008            2136*5
14952    1869    4501     2.4082377718967  1001             2136*7
15437            4647     2.4082396797495  100009           2136+13301
28738            8651     2.4082399090878  100003           2136+13301*2
57476            17302    2.4082399102388  100007           4272+13301*4
70777            21306    2.4082399591166  1000007          4272+13301*5
141554           42612    2.4082399594464  100001           70777*2
325147           97879    2.4082399642812  10000003         42039+70777*4 (from A073733)
650294           195758   2.4082399642824  10000007         325147*2
975441           293637   2.4082399642902  1000001          325147*3
limit                     2.4082399653118