Ayende @ Rahien

Refunds available at head office

Evil interview questions: Unique & Random C

Writing in C (and using only the C standard library as building blocks, which explicitly excludes C++ and all its stuff), generate 1 million unique random numbers.

For reference, here is the code in C#:

   1: var random = new Random();
   2: var set = new HashSet<int>();
   3:  
   4: var sp = Stopwatch.StartNew();
   5:  
   6: while (set.Count < 1000 * 1000)
   7: {
   8:     set.Add(random.Next(0, int.MaxValue));
   9: }
  10:  
  11: Console.WriteLine(sp.ElapsedMilliseconds);

It is a brute force approach, I'll admit, but it completes in about 150ms on my end. Your solution must run in under 10 seconds.

While this question may look stupid, it can actually tell you quite a bit about the developer.

Comments

Marco, 10/02/2013 09:23 AM

Considering http://xkcd.com/221/, returning numbers from 0 to 1M-1 should be fine :) If the numbers do not have to be integers, I'd add rand ([0,1[) to every number of the previous random set just to make them different on every run :D Am I a cheater?

Rafal, 10/02/2013 09:34 AM

#include <stdio.h>
#include <stdlib.h>

/* the opening lines of this snippet were eaten by the comment formatting;
   a plausible reconstruction: */
int main()
{
    int cnt = 0;
    while (cnt < 1000000)
    {
        int num = rand();
        if (num >= 500)
        {
            cnt++;
            //printf("%d\n",num); we don't have to print, we have to find...
        }
    }
    printf("found 1000000 unique random numbers, but don't know what to do with them\n");
}

Assuming nothing about the distribution of these numbers, and without having to print them all, this runs in a fraction of a second.

Rafal, 10/02/2013 09:35 AM

Heh, code formatting in comments sucks

Ramon Smits, 10/02/2013 09:36 AM

You have indeed a unique set. But you don't actually test that the set contains 1,000,000 items. If you get dupes from the random number generator, then what?

James, 10/02/2013 09:41 AM

@Ramon - yes it does. That would be the while on the add...

tobi, 10/02/2013 09:48 AM

1) We need to call rand() twice for each number because it returns [0, 32k]

2) We have no built-in hash table, so we'll use a bool[1000000] to keep track of the duplicates. I'd now point out the various reasons this is a hack and problematic and ask the interviewer whether he wants me to fix that or whether it is enough.

Frans Bouma, 10/02/2013 09:52 AM

Easy.

    var sw = new Stopwatch();
    var store = new SortedDictionary<Guid, int>();
    sw.Start();
    for (int i = 0; i < 1000000; i++)
    {
        store.Add(Guid.NewGuid(), i);
    }
    sw.Stop();
    Console.WriteLine("Elapsed ms: {0}", sw.ElapsedMilliseconds);

    // show the results ;)
    int amount = 0;
    foreach (var kvp in store)
    {
        Console.WriteLine("Value: {0}", kvp.Value);
        amount++;
        if (amount > 100) { break; }
    }

The idea is to sort on the guid. It's not that efficient, but the question is rather stupid, and it does guarantee 1000000 numbers in (pseudo) random order.

Frans Bouma, 10/02/2013 09:53 AM

(oops, that formatting went seriously wrong).

Ayende Rahien, 10/02/2013 09:55 AM

Marco, How would you get a proper random distribution there? And how do you ensure uniqueness?

Ayende Rahien, 10/02/2013 09:56 AM

Ramon, Because it is a unique set, duplicates generated will be ignored.

Ayende Rahien, 10/02/2013 09:59 AM

Tobi, Assume that you need full 32 bits for the numbers, so you can't just have an array for all the items. What do you do then?

Ayende Rahien, 10/02/2013 10:00 AM

Frans, There is a reason I said C, not C#. And your code would produce only the numbers 1 .. 1,000,000; I want the full 32 bit range.

Marco, 10/02/2013 10:07 AM

Integers from 1 to 1M-1 are unique and normally distributed, while if I add a random floating number between 0 and 1 (not included) to each integer from 0 to 1M-1, I get a set of unique random floating numbers (sort of normally distributed).

It's a cheat, I know, but it solves the given problem while respecting the constraints.

Marco, 10/02/2013 10:17 AM

sorry I meant uniformly distributed not normally

Sergey Shumov, 10/02/2013 10:17 AM

The idea is to divide all integer numbers into N chunks. Then we choose a random number from each chunk. Here is the code: https://gist.github.com/SHSE/6791622

Rafal, 10/02/2013 10:18 AM

http://pastebin.com/VdScTnaa Same idea as Marco's from the first comment

Ramon Smits, 10/02/2013 10:36 AM

@James: Ah yes! I missed that :) thanks

tobi, 10/02/2013 10:44 AM

In that case I'd build myself a crude hash table that is a LinkedList[1000000]. Only adding is needed and I can rely on malloc to build the linked list. That's the simplest hash table possible and can be built as a quick hack.

Alternatively, I'd generate 10m random numbers, sort them with qsort and implement a simple distinct algorithm which is easy for a sorted array. I'd wrap this in a retry-loop in case I do not get 1m unique numbers out of the operation.

Rafal, 10/02/2013 11:29 AM

Ayende, you have to stop doing that. I really can't resist this kind of algorithmic exercise, and it hurts my productivity. Here's another shot at it, this time taking into consideration the requirement that the output must be uniformly distributed over the full 32 bit range. I made use of the fact that a cartesian product of two uniformly distributed sets will also be uniformly distributed. I mean, 1000000 is 1000x1000, so I can produce two sets of 1000 unique numbers and combine them.

http://pastebin.com/8NXDUSG5

Sorry if it crashes, didn't really wait until it spits out all these numbers ;)

Rafal, 10/02/2013 11:35 AM

sorry, the previous version contained an error (two random numbers were incorrectly combined into one in the 'printf' line)

http://pastebin.com/LCedvFNu

Rafal, 10/02/2013 11:37 AM

AAARGH!

http://pastebin.com/LCedvFNu

Andy Masters, 10/02/2013 11:42 AM

Unless external input is given, all solutions will be pseudo-random. Truly random would need to acquire data from a random physical source. For 1m numbers you would need quite a lot of entropy; maybe hook up my webcam to observe a lava lamp and take the least significant bits from some pseudo-random pixels. Now the C part....

Ruben Bartelink, 10/02/2013 11:45 AM

I know it's beside the point but you could speed up the C# segment:

    6: while (set.Count < 1000 * 1000)
    7: {
    8:     set.Add(random.Next(0, int.MaxValue));
    9: }

by leveraging the result of HashSet.Add:

    int todo = 1000 * 1000;
    while (todo != 0)
        if (set.Add(random.Next(0, int.MaxValue)))
            --todo;

or if you want the C# to be terser but [arguably] less legible:

    for (int todo = 1000 * 1000; todo > 0; )
        todo -= set.Add(random.Next(0, int.MaxValue)) ? 1 : 0;

Rafal, 10/02/2013 12:10 PM

errrr... a final version fixing a few bugs (incorrect parentheses + an uninitialized array that created some undesired entropy at run time)

http://pastebin.com/e9Sf3eJA

Mika Kolari, 10/02/2013 12:50 PM

I don't know C, but I guess this would be easy to translate:

    public static int[] getRandoms(int count, int max)
    {
        int[] results = new int[count];
        var mask = new BitArray(max + 1);
        int counter = 0;
        while (counter < count)
        {
            int r = random.Next(max + 1);
            if (!mask[r])
            {
                results[counter] = r;
                mask[r] = true;
                counter++;
            }
        }
        return results;
    }

Geert Baeyaert, 10/02/2013 01:24 PM

@Sergey,

That's not random! With your solution, you get precisely one number from each chunk, so for instance the set [1, 2, 3, whatever...] could not occur, because 1, 2 and 3 are all in the same chunk.

Joseph Daigle, 10/02/2013 02:30 PM

Here is one approach.

Start with a linked list containing all numbers in the range you care about (e.g. -2^31 through (2^31) - 1). Next loop 1000*1000 times. In each loop get a random number between 0 and (2^32) - i, where i = incrementing counter for the loop. Get and remove the number at the position found by the random number, push that number to end of an array for your output.

Yes, you do have to do 1,000,000 "seeks". But it should be fairly constant time. Perhaps there is a better/faster way to randomize the initial set of numbers (random swapping?).

Judah Gabriel Himango, 10/02/2013 02:40 PM

Here's a super short version. It kind of cheats by ensuring the random number is always greater than the last one. But hey, the numbers are still random, and the code is dead simple. :-)

https://gist.github.com/JudahGabriel/6794792

(It's been a long time since I wrote C!)

Ayende Rahien, 10/02/2013 03:17 PM

Judah, But this will generate a random set of sequential numbers. That is what we don't want.

Peter Ibbotson, 10/02/2013 03:29 PM

I have a sneaking suspicion that one answer to this is to use a 32 bit LFSR and just generate 1000000 numbers; since it's shift-register based, there can't be any duplicates (assuming you don't start from 0).

Need to think some more about what to do if you're using rand_s instead.

Sergey Shumov, 10/02/2013 03:31 PM

@Geert, good point! It can be fixed by introducing randomness in the chunk allocation logic. Here is the code: https://gist.github.com/SHSE/6795548

Thor Arne Johansen, 10/02/2013 05:38 PM

An LFSR representing a primitive polynomial will generate numbers from the whole range.

So a primitive order 32 polynomial will have period of 2^32-1 and go through all 32 bit numbers (excluding 0)

Jesús López, 10/02/2013 10:31 PM

https://gist.github.com/jesuslpm/6801439

Keith, 10/03/2013 12:50 AM

I'd probablly go with the following :-

http://preshing.com/20121224/how-to-generate-a-sequence-of-unique-random-integers/

Can't get much faster than that. The only question you have to ask is how random do you want your random to be :)

Vytis, 10/03/2013 06:57 AM

https://gist.github.com/xytis/6805873

A similar answer to Judah Gabriel Himango's, but with additional features: 1. It ensures that the random numbers are uniformly distributed over the range [0..MAX_INT]. 2. It shuffles the obtained set.

With one drawback: if the implementation of rand() has RAND_MAX < 1000000, the set shuffle does not work correctly. (This may be fixed by using offsets from the current location instead of absolute indices...)

Also, it is actually slower than Keith's answer, but in memory consumption it performs better than the C# example.

kb2011, 10/03/2013 07:38 AM

I haven't done much C (I do C# for a living) but I took a stab at it. There are four files, denoted by /** NAME.EXT **/ comments.

I didn't write it to fail if it took longer than 10 seconds, since the example C# didn't. I also assumed that by "C std lib" you meant the C standard libraries, and not just stdlib.h.

http://pastebin.com/AkWDNuXr

Jesús López, 10/03/2013 11:20 AM

I have another implementation; it takes 3 milliseconds to complete.

https://gist.github.com/jesuslpm/6808179

Jonathan North, 10/03/2013 11:27 AM

http://pastebin.com/iVDd9cYW

Basic algorithm: generate more values than needed and dedupe them. If there are too few results, fill in the empty part of the buffer and repeat. Shuffle and take the first N values. This wastes 0.8MB of RAM in exchange for being easily understood and easily implemented.

I could have used less extra space, or even no extra space, but that would devolve into O(#rounds * n log n). As is, it'll be rare to have less than the desired number of values in the initial run.

Juan Lopes, 10/04/2013 03:13 AM

I'd use a segment tree with lazy propagation (or a red-black tree) to be able to answer "what is the nth unused number" in O(log^2 n).

But I'm not messing around with mallocs today.

ori, 10/04/2013 09:02 AM

What do you mean by random here? That each set of 1m unique ints has an equal probability?

Matt, 10/04/2013 10:16 AM

I'd cheat.

n_(x+1) = n_x + rnd(1, int32.max / 10^6)

for x = 0 to 10^6, with n_0 = 0

matt, 10/04/2013 10:28 AM

Actually, n_0 should be -1 if I'm generating random numbers between 1 and int32.max/10^6. And I'd run x = 0 to (10^6 + 1) and drop the first result.

I should probably round int32.max/10^6 up, and the last result should be calculated between the previous number and int32.max; otherwise the possible numbers would only cover roughly 0 to int32.max - 10^6.

Igor, 10/07/2013 03:06 AM

A simple approach is to maintain a binary search tree of values generated so far. Generate a random integer and check the tree. If the tree already contains the value, retry. Otherwise, add the number into the result array and insert it into the tree.

Since the values are random, balancing the tree wouldn't be necessary in practice.

Also, the retries are expected to be rare, so their impact should be negligible. Even when generating the last 1,000,000th number, the probability of collision with a number already selected is < 0.05%.

As an implementation note, it is unnecessary to allocate memory for each tree node - i.e., we don't need 1,000,000 malloc calls. We can just pre-allocate an array of 1,000,000 tuples [value, leftchildindex, rightchildindex].

Jay Sullivan, 10/08/2013 02:13 AM

Generate sequence 1..1000000, then shuffle the results. https://gist.github.com/notfed/6878293

Comments have been closed on this topic.