hardest question I got in an interview!

maxrum wrote:

Quote:
________________________________________

The mean of 5 randomly picked numbers is:

(%5Cmu_1%20=%20%5Cfrac%7B%5Csum_%7Bi=1%7D%5E%7Bi=5%7D%7Bx_i%7D%7D%7B5%7D)

The mean of a uniform distribution:

(%5Cmu_2%20=%20%5Cfrac%7Bc%7D%7B2%7D)

If we are assume that (%5Cmu_1%20=%20%5Cmu_2) we will get result I posted above.
________________________________________

Here's why the mean may not be useful. What if all first smaller values are extremely small upon inspection, while the largest is very big. Taking the mean would not make sense if, say the first four are 1,2,3,4 and the largest is 1,000,000,000,000. Here's a situation that speaks of "know one by the friends (those close-by) one keeps". The largest is the only meaningful and significant number that matters. All the rest serve to identify the largest and are useless beyond that.
 
First thing comes to my mind is to use order statistics to do part 2. (because 5 numbers are given).

And the first estimator comes to my mind is 6/5 * X(5) --- where X(5) is the maximum number of the 5.

In general, (n+1)/n * X(n) is an unbiased and consistent estimator of c. (correct me if I'm wrong). And also, it is the MVUE I guess...
 
Here's why the mean may not be useful. What if all first smaller values are extremely small upon inspection, while the largest is very big.

I understand that. This is the essence of statistics. I'm trying to believe that a 5 numbers sample is enough to determine mean of distribution. And I'm right in this case because I don't have any other information. My assumption matches yours when you say that: "the five numbers will be equally spaced along the interval (0,c)". I don't think we can do better than this with what was given. I might be wrong, though.
 

bob

Faculty (Undercover)
So I'll sketch out my reasons why I don't think this has a pat answer.

BTW, I was mistaken earlier in the expression I gave for finding the MLE. The expression I gave for the conditional probability is actually the CDF, not the PDF of that r.v.

If the r.v. X is the max of the 5 numbers chosen, and the r.v. C is the upper bound chosen by the folks running the lottery, then:
(F_{X|C}(x) = P(X < x | C = c) = \frac{x^5}{c^5}, 0 < x < c)
(f_{X|C}(x)=F'_{X|C}(x) = \frac{5x^4}{c^5}, 0 < x < c)

This is the correct expression. Its sup, however, also occurs at c.

What we want is the conditional probability of C given X, so the natural way to get from one to the other is the joint PDF.
(f_{X|C}(x) = \frac{f_{X,C}(x,c)}{f_C(c)})
(f_{X,C}(x,c) = f_{X|C}(x)f_C(c))
Where:
(f_{X,C}(x,c)=P(X = x, C = c))
(f_C(c) = P(C = c))

Of course we know the conditional probability of X given C, but in order to get the joint PDF, we also have to know the distribution the folks running the lottery are drawing their C from. This makes sense intuitively, since if they're choosing a number at random between 10 and 100 uniformly, we'll want to guess rather differently than we would if they're choosing a random number from an exponential distribution, say, with mean 100.

If you happen to discover the distribution of C, then the rest of the calculation is easy in concept.
(f_{C|X}(c) = \frac{f_{X,C}(x,c)}{f_X(x)}= \frac{f_{X|C}(x) f_C(c)}{f_X(x)})
With:
(f_X(x) = \int_x^{+\infty}f_{X,C}(x,c) dc)
...although you have to be careful with the limits on this integral, and as a practical matter it might be rather nasty to perform.

Assuming this can all be done, all that remains is to find c* which minimizes the conditional expectation:
(E_X[|C - c*|])

Overall, the point is that without the information of C's distribution, we're flailing around. Information on past winning C's might help give an inkling of what distribution is being used if the lottery organizers are too lazy to change their distribution.

...But all this is just a way to determine some ideal guess, which is only really half the problem. The real problem is to make our guess capture as much of the conditional distribution as possible so that we maximize our chances of winning, which may mean stepping some distance away from the "optimal" C if the other entrants' bets tend to cluster all around our c*...assuming one can get this information, that is. It seems quite likely, for example, that if there are many thousands of entrants in the lottery, the optimal strategy overall will be to make a guess far above our estimate of c* where the other entrants' guesses are more widely spaced.
 
I understand that. This is the essence of statistics. I'm trying to believe that a 5 numbers sample is enough to determine mean of distribution. And I'm right in this case because I don't have any other information. My assumption matches yours when you say that: "the five numbers will be equally spaced along the interval (0,c)". I don't think we can do better than this with what was given. I might be wrong, though.

Maybe you noticed, maybe not, that I did 'reject' my own answer by saying "But then, I don't think this answer is correct." My example of data points 1,2,3,4,1000000000000 clearly shows that the first lowest values 1,2,3,4 are all useless. Certainly one should just ignore the lower values as c could not be equal to an average that is less than 10^12. So, the average is useless. In determining c, I think the only relevant numbers are the number of data points (which is 5 in our case) and the largest data point L among these data points.
 

bob

Faculty (Undercover)
Actually, on third or fourth thought, this isn't as complicated as it looks.

Suppose that your strategy, given X the biggest draw on any particular day, is to guess c* = r*X for some r* you decide is best.

So the r.v. R is
(R = \frac{C}{X})
...and we'd like to have the distribution of R in order to inform our guess.

Then:
(P(R < r) = P(\frac{C}{X} < r) = P(\frac{X}{C} > \frac{1}{r}) = 1 - P(\frac{X}{C} < \frac{1}{r}))

This last probability is the one we know already: the probability that all five of our samples are less than some particular distance along the interval, normalized by the interval length. Clearly, for one sample this is 1 / r. So for all five (or all n--it's easy to generalize), we have:
(P(\frac{C}{X} < r) = 1-r^{-5})

Giving the marginal:
(P(\frac{C}{X} = r) = \frac{d}{dr}P(\frac{C}{X} < r) = 5r^{-6}, r >= 1)

So this is what you need in order to minimize the expectation:
(E[|R - r*|])

I haven't gone more than a few steps into that process, but it looks like it boils down to finding the min of a rational function. I don't really have time to devote to the exact solution now, but it does indeed seem that the optimal guess in the end doesn't depend upon the method of choosing c.
 

Bastian Gross

German Mathquant
The Answer

It's a very easy question, so I hope to get this in my interview.

If all four different types of toys have equal probability (\frac{1}{4}):

(E[X] = \frac{1+\sum_{j=0}^{\infty}(j+4)(\frac{3}{4})^{j}}{\sum_{i=0}^{\infty}(\frac{3}{4})^{i}} = \frac{25}{3})

If all four different types of toys have different probability (p_{i} i \in {1,2,3,4}):

(E[X] = \frac{\sum_{j=0}^{\infty}(j+4)(\frac{1}{4}\sum_{i=1}^{4}(1-p_{i}))^{j}}{\sum_{k=0}^{\infty}(\frac{1}{4}\sum_{i=1}^{4}(1-p_{i}))^{k}} )
 
Hi Bastian,

I think your answer is off the mark. The correct answer (in the case of equal probabilities) is 4*{1+(1/2)+(1/3)+(1/4)}=25/3.

See my two solutions (as well as that of Han Z).

The original problem did explicitly say the probabilities are equal. I have not done the case in which the toys have unequal probabilities, and in which case the problem is, of course, much more difficult, and I am already upon this new challenge. I will let you know my answer when I am done with it.
 
Instead of cereal boxes, below I consider an n-sided biased die whose side k comes up with probability pk. The question is: on average how many time does one need to roll such a die until all the numbers 1,2,...,n come up? The formula for this problem is monstrously unyielding and will not be shown here as it involves several embedded sigma sums, one of them summing over permutations on the subscripts of the probabilities pk.

Here's an outline of my approach for the case n=3:

Let's say the probabilities for sides 1,2,3 are p1,p2,p3 respectively. On the first roll we get side k with probability pk. So far we have rolled once. Now after having gotten the first side, how many rolls will it take to get the next different side. That depends on what side we have rolled so far. Assuming that we got side k on the first roll, on the second roll we get a new side (other than k) with a probability (1-pk), so it will take 1/(1-pk) rolls before we get a side different from the side k of the first roll. So far, to get the first two different sides, the number of rolls on average is

1+SUM{pk/(1-pk), [k,1,3]}.

Now after having gotten the first two different sides, how many rolls will it take to get the last side (that is to be different from the sides we have gotten so far)? That depends on what sides we have rolled so far. The formula at this time gets too complicated too soon. Suffice it to say that I end up using permutations on the subscripts of the probabilities pk and then summing over the permutations (for a total of six summands). Once this latter sum is computed and added to the quantity above, we get 7.3 for the number of rolls needed on average to get all the three different sides.
 
It's a very easy question, so I hope to get this in my interview.

If all four different types of toys have equal probability (\frac{1}{4}):

(E[X] = \sum_{j=0}^{\infty}(j+1)(\frac{3}{4})^{j} = 39)

If all four different types of toys have different probability (p_{i} i \in {1,2,3,4}):

(E[X] = \sum_{j=0}^{\infty}(j+1)(\frac{1}{4}\sum_{i=1}^{4}(1-p_{i}))^{j} )


I guess you are too confident. It is easy but not as easy as you made it out to be.
 
Top