Observing Dark Worlds

Observing Dark Worlds is a Kaggle recruiting competition sponsored by Winton Capital Management. Evidently, Winton hires individuals with scientific backgrounds to develop financial models and algorithms. I think the idea of companies hosting competitions to discover talent is wonderful. It’s a win-win. Winton gets to assess the capabilities of potential candidates, a fortunate few will walk away with prizes of \$12,000, \$5,000, and \$3,000, and some may even get hired.

Regardless, it provides a fun and challenging problem to work on for a couple of months!

This competition is about developing a method to predict the location of dark matter from an image of galaxies. (I know, how hard can that be, right?) Without getting into too much detail: as a consequence of Einstein’s General Theory of Relativity, dark matter bends incoming light from a distant galaxy. So, the light arriving from a circular galaxy behind a mass of dark matter would be sheared and appear elliptical to an observer.

If all galaxies were circular, this wouldn’t be much of a challenge. You would just need to look for elliptical galaxies. But, as you probably guessed, galaxies aren’t circular – they have inherent ellipticity. How can you tell whether you are looking at a random distribution of elliptical galaxies or whether there is a dark matter “halo” distorting what we see?

That is the challenge.

What I really like about this challenge is that it is a real problem in the field of astronomy, detecting what is referred to as gravitational lensing. In fact, the developers of the competition even provide the current state of the art in solving the problem, and challenge the contestants to do better. If you want a crash course, I recommend Lenstool for Dummies.

The pressing question with this challenge is how to bring something to the table that dozens of really smart astrophysicists have not thought of already.

Like I said before, how hard can that be?

The competition provided a number of benchmark methods that work well when you only have a single dark matter halo in the image. A fairly straightforward method is to look for the location in the image that has the greatest amount of tangential ellipticity.

Tangential ellipticity is shown visually below, where the orange circle is representative of a dark matter halo.

The contest provided 300 training skies, each of which gave the locations of the galaxies, their components of ellipticity, and the locations of between 1 and 3 dark matter halos in the sky.

Exploring the Tangential Ellipticity Signal Benchmark

The organizers of this competition provided two simple benchmarks to get the participants started. The first is mapping the tangential ellipticity signal. The idea behind this method is to traverse the sky and create a gridded map of the total tangential ellipticity observed from each point of the map. The area that has the highest signal would be an excellent candidate for the location of a dark matter halo.
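To make the gridded-map idea concrete, here is a minimal Python sketch (my own illustration, not the organizers' benchmark code or the Matlab implementation mentioned below). It assumes each galaxy is a tuple `(x, y, e1, e2)` and uses the usual definition $e_{tan} = -(e_1 \cos 2\phi + e_2 \sin 2\phi)$, where $\phi$ is the angle of the galaxy as seen from the candidate halo position. The function names and the default of 10 bins are placeholders.

```python
import math

def tangential_ellipticity(gx, gy, e1, e2, hx, hy):
    """Tangential ellipticity of one galaxy as seen from a candidate halo at (hx, hy)."""
    phi = math.atan2(gy - hy, gx - hx)
    return -(e1 * math.cos(2 * phi) + e2 * math.sin(2 * phi))

def signal_at(galaxies, hx, hy):
    """Total tangential ellipticity over all galaxies, seen from (hx, hy)."""
    return sum(tangential_ellipticity(x, y, e1, e2, hx, hy)
               for x, y, e1, e2 in galaxies)

def best_grid_point(galaxies, size=4200.0, bins=10):
    """Evaluate the signal at the centre of every bin; return (x, y, signal) of the maximum."""
    width = size / bins
    best = None
    for i in range(bins):
        for j in range(bins):
            x, y = (i + 0.5) * width, (j + 0.5) * width
            s = signal_at(galaxies, x, y)
            if best is None or s > best[2]:
                best = (x, y, s)
    return best
```

Calling `best_grid_point(galaxies)` on one sky returns the bin centre with the strongest signal, which is the candidate halo position for that sky.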

Once I worked through all the equations to calculate ellipticity (I’m omitting those details; you can find plenty of discussion about these equations in the competition forum), it was pretty straightforward to code an algorithm to do the calculation. I used Matlab.

Here is an example of one of the training skies with one halo (Training Sky 4). The galaxies are drawn in white. The red area corresponds to areas that have a high degree of tangential ellipticity. The black circle is the actual location of the single dark matter halo in the sky. As you can see, the tangential ellipticity signal algorithm does a good job of predicting the location of the halo.

Before we get overly excited about this method, let’s take a look at an image that has 2 dark matter halos. As you can see, there is a halo smack-dab in the middle of the region with the highest tangential ellipticity. But the other halo is not even close! Obviously, this method will need some work.

Optimizing the Gridded Map Algorithm

An immediate challenge in using a gridded map to find the maximum signal location is that the number of grid points grows as the inverse square of the grid bin width. Each sky in the training set has dimensions of 4,200 by 4,200. If you wanted to calculate the signal at each 1×1 square location of the sky, you’re looking at close to 18 million calculations of a tangential ellipticity signal. And, to calculate the tangential ellipticity at a single point, you need to sum across all the galaxies in the sky (between 300 and 740 per sky). So, you’re looking at on the order of 10 billion calculations per sky.
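Spelled out as a rough count, treating one galaxy's contribution to one grid point as one calculation (the symbols $w$ for bin width and $N_{gal}$ for the number of galaxies are my own shorthand):

$4200 \times 4200 = 17{,}640{,}000$ grid points at 1×1 resolution

$17{,}640{,}000 \times 300 \approx 5.3 \times 10^{9}$ calculations for the sparsest skies

$17{,}640{,}000 \times 740 \approx 1.3 \times 10^{10}$ calculations for the densest skies

In general the count is $(4200 / w)^2 \times N_{gal}$, so widening the bins from $w = 1$ to $w = 42$ (a 100 x 100 grid) cuts the work by a factor of $42^2 = 1764$.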

That’s not very efficient.

Of course, you don’t necessarily need to calculate a signal at every 1×1 location in the sky. Could you get by with a coarser grid?

I ran the tangential signal benchmark using different bin sizes to see when the algorithm would start to lose accuracy. This figure shows the results of using 10 x 10 total bins (red dot), 100 x 100 bins (yellow dot), and 1,000 x 1,000 bins. Not surprisingly, bin width does affect accuracy, but interestingly enough, the 100 x 100 grid gave the prediction closest to the halo (large blue dot).

Since a very coarse grid got us fairly close to the location of the maximum tangential ellipticity signal, it is possible to speed up the algorithm significantly by starting with a coarse grid, then calculating values using a finer grid around the initial guess.
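A coarse-to-fine search along these lines might look like the following Python sketch. The number of zoom levels, the choice to re-search one coarse bin width either side of the current best point, and all function names are my assumptions for illustration, not details taken from the original implementation.

```python
import math

def signal(galaxies, hx, hy):
    """Total tangential ellipticity of (x, y, e1, e2) galaxies seen from (hx, hy)."""
    total = 0.0
    for x, y, e1, e2 in galaxies:
        phi = math.atan2(y - hy, x - hx)
        total -= e1 * math.cos(2 * phi) + e2 * math.sin(2 * phi)
    return total

def grid_max(galaxies, x0, y0, x1, y1, bins):
    """Return the bin centre with the highest signal inside the box [x0, x1] x [y0, y1]."""
    wx, wy = (x1 - x0) / bins, (y1 - y0) / bins
    centres = [(x0 + (i + 0.5) * wx, y0 + (j + 0.5) * wy)
               for i in range(bins) for j in range(bins)]
    return max(centres, key=lambda c: signal(galaxies, c[0], c[1]))

def coarse_to_fine(galaxies, size=4200.0, bins=10, levels=3):
    """Repeatedly grid a box, then zoom in around the best bin centre."""
    x0, y0, x1, y1 = 0.0, 0.0, size, size
    best = None
    for _ in range(levels):
        best = grid_max(galaxies, x0, y0, x1, y1, bins)
        # The next box spans one current bin width either side of the best centre,
        # so the neighbouring bins are re-examined at finer resolution.
        wx, wy = (x1 - x0) / bins, (y1 - y0) / bins
        x0, x1 = best[0] - wx, best[0] + wx
        y0, y1 = best[1] - wy, best[1] + wy
    return best
```

With the defaults this evaluates 3 × 100 = 300 grid points per sky instead of the 17.6 million needed for a full 1×1 map, at the risk of zooming in on the wrong peak if the coarse map is misleading.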

I used that approach with success but, looking ahead to the potential needs of predicting the locations of multiple halos in a sky, decided to use the coarse grid location as the initial condition of a global solver. Once I had an approximate location, I used a simulated annealing algorithm to fine-tune the prediction.
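For readers who have not met simulated annealing, here is a bare-bones Python sketch of how such a refinement step could work. It is a generic textbook-style version, not the solver actually used for the results shown here; the linear cooling schedule, step size, starting temperature and fixed random seed are all assumptions chosen for illustration.

```python
import math
import random

def signal(galaxies, hx, hy):
    """Total tangential ellipticity of (x, y, e1, e2) galaxies seen from (hx, hy)."""
    total = 0.0
    for x, y, e1, e2 in galaxies:
        phi = math.atan2(y - hy, x - hx)
        total -= e1 * math.cos(2 * phi) + e2 * math.sin(2 * phi)
    return total

def anneal(galaxies, start, steps=2000, step=200.0, t0=1.0, size=4200.0, seed=0):
    """Refine a starting guess by simulated annealing; returns (best_point, best_signal)."""
    rng = random.Random(seed)
    cur, cur_s = start, signal(galaxies, *start)
    best, best_s = cur, cur_s
    for n in range(steps):
        frac = 1.0 - n / steps              # goes from 1 towards 0 (linear cooling)
        temp = max(t0 * frac, 1e-12)
        # Propose a random move, shrinking the step as we cool, clamped to the sky.
        cand = (min(max(cur[0] + rng.gauss(0.0, step * frac), 0.0), size),
                min(max(cur[1] + rng.gauss(0.0, step * frac), 0.0), size))
        cand_s = signal(galaxies, *cand)
        # Always accept improvements; accept worse points with Boltzmann probability.
        if cand_s >= cur_s or rng.random() < math.exp((cand_s - cur_s) / temp):
            cur, cur_s = cand, cand_s
        # Remember the best point ever visited, so the result is never worse than the start.
        if cur_s > best_s:
            best, best_s = cur, cur_s
    return best, best_s
```

In this sketch `start` would be the coarse-grid maximum, for example `anneal(galaxies, (1470.0, 2730.0))`. The occasional acceptance of worse points is what lets the search escape small local bumps in the signal map.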

This figure shows the location of the initial guess, calculated using a 10 x 10 grid (red dot), and the location of the refined calculation using the simulated annealing algorithm (green dot), compared to the actual halo location (large blue circle).

This updated Gridded Map algorithm was both quick and accurate in predicting halo locations (for the skies with only 1 halo!)

Improving the Tangential Signal Benchmark

The next step I took in this process was to see if I could get some “quick win” improvements for the tangential signal method.

It seemed fairly obvious that the tangential ellipticity signal from a distant galaxy should be given less weight than one close to the proposed center of a dark matter halo. This could be accomplished by scaling the tangential ellipticity by some factor $1/r^k$ where $k$ could be determined empirically or by theoretical arguments. A second option would be to simply define an $r_c$ outside of which the tangential ellipticity signal would be ignored.

While a smooth scaling function $1/r^k$ is clearly more physical, there are some advantages to also defining an $r_c$ (e.g., cropping off the really long distances caused by points in opposite corners of the sky). In order to maintain flexibility, I included both in the “new and improved” tangential ellipticity signal model.
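Combining the two ideas, a weighted signal could be sketched in Python as follows. The defaults for `k` and `r_c` are placeholders, the division of $r$ by 6000 follows the normalization convention described later in this post, and `r_min` is a guard I added so that a galaxy sitting exactly on the candidate position does not cause a division by zero.

```python
import math

def weighted_signal(galaxies, hx, hy, k=0.6, r_c=3000.0, r_norm=6000.0, r_min=1.0):
    """Tangential ellipticity signal at (hx, hy), weighted by 1/r^k and cut off beyond r_c."""
    total = 0.0
    for x, y, e1, e2 in galaxies:
        r = math.hypot(x - hx, y - hy)
        if r > r_c:
            continue                      # hard cutoff: ignore distant galaxies
        r = max(r, r_min)                 # avoid dividing by zero at the halo itself
        phi = math.atan2(y - hy, x - hx)
        e_tan = -(e1 * math.cos(2 * phi) + e2 * math.sin(2 * phi))
        total += e_tan / (r / r_norm) ** k
    return total
```

Setting `k=0` and a very large `r_c` recovers the unweighted signal, which makes it easy to compare the two models on the same sky.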

The following is a graph of the Dark Worlds Metric (the scoring algorithm provided by the competition organizers) as a function of $k$ and $r_c$ for the 100 single-halo training skies. The lower the score (darker blue in the figure), the closer the predicted halo positions are to the actual ones.

The best single-halo predictions occur at $k$ of 0.6 and $r_c$ of 3000 (circled in the figure above), but the score is not a smooth function of these parameters. This could either be genuine noise from the data, or noise due to using a stochastic global optimization routine. In addition, except for the region of small $r_c$, the surface is fairly flat, suggesting that adding these parameters does not significantly improve the method. Lastly, adding these parameters does nothing to improve predictions for the skies with more than one halo.

Finding the 2nd (and 3rd) Halo

How does one go about determining the location of additional dark matter halos in a sky? One possibility is figuring out how to decouple the effect of multiple halos.

There are two concepts that seem critical.

The first is that there is a radius effect: the influence of a dark matter halo decreases the farther you are from the halo’s center.

The second is that multiple halos interact linearly.

It seems that, with this information, one could develop a routine to “erase” the influence of the first halo in order to calculate the location of the second. This would, most likely, need to be an iterative process. To start, though, we’ll keep things simple and only use a single pass.

Removing Halos – Approach 1

A very simple approach to removing the effect of a halo is to scale the ellipticity components as a function of distance from the dark matter halo. In other words,

$e_1 = e_1 r^k$
$e_2 = e_2 r^k$

In this (and other) instances of scaling by $r$, I normalize $r$ by 6000, which is just a little longer than the maximum possible distance between two objects in the training or test skies.
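As a small Python sketch of this correction (the function name and the `(x, y, e1, e2)` tuple layout are my own choices), each galaxy's ellipticity components are multiplied by $(r/6000)^k$, where $r$ is its distance from the halo found in the first pass:

```python
import math

def damp_ellipticities(galaxies, hx, hy, k=3.0, r_norm=6000.0):
    """Scale e1 and e2 of every galaxy by (r / r_norm)^k, r being the distance to (hx, hy)."""
    out = []
    for x, y, e1, e2 in galaxies:
        scale = (math.hypot(x - hx, y - hy) / r_norm) ** k
        out.append((x, y, e1 * scale, e2 * scale))
    return out
```

Because the normalized $r$ is always below 1, every galaxy is damped, and those nearest the first halo are damped the most. The corrected list can then be fed back into the same signal search to look for a second halo.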

The following figure shows the Dark Worlds Metric (computed only for training Skies 101-200, i.e., the skies with exactly 2 dark matter halos) as a function of $k$.

Almost all the improvement to the metric occurs before $k=1$, although there appear to be slight improvements up until about $k=3$.

Here we return to Sky 134 with the predicted halo positions (white solid circles) versus the actual (black circles) using $k=3$.

In the figure above, you can see that the first halo is predicted well, but that there is a large discrepancy between the actual and predicted location of the second halo.

The signal map for this figure was generated after the ellipticity correction. Notice that the galaxies are fairly uniform in shape.

It appears that this simple correction is not going to get us far.

Removing Halos – Approach 2

A more sophisticated approach is to target the tangential ellipticity instead of $e_1$ and $e_2$ directly. For example, a simple approach would just be to set $e_{tan} = 0$ for all of the galaxies.

The question is then how to update the corresponding $e_1$ and $e_2$ parameters. There are an infinite number of possible $e_1$ – $e_2$ pairs that would set $e_{tan} = 0$.

For example, you can set $e_{tan} = 0$ by setting $e_1 = e_2 = 0$, but then you have nothing left to use to find the second halo.

One approach is to choose the new $e_1$ and $e_2$ that minimize the distance between the current $e_1$ – $e_2$ point and the line of possible $e_1$ – $e_2$ pairs.

This is easily extendable so that, instead of using $e_{tan} = 0$, you can use $e_{tan} = f(r^k)$.
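One way to write this closest-point update down, as a sketch under my own assumptions rather than the exact equations used for the results here: with $e_{tan} = -(e_1 \cos 2\phi + e_2 \sin 2\phi)$, the condition $e_{tan} = c$ is a straight line in the $e_1$ – $e_2$ plane whose unit normal is $(\cos 2\phi, \sin 2\phi)$. Projecting the current point onto that line gives $e_1' = e_1 + (e_{tan} - c)\cos 2\phi$ and $e_2' = e_2 + (e_{tan} - c)\sin 2\phi$. In Python, with a hypothetical `target` function supplying $c$ as a function of $r$:

```python
import math

def remove_tangential(galaxies, hx, hy, target=lambda r: 0.0):
    """Move each galaxy's (e1, e2) to the nearest point whose tangential
    ellipticity about (hx, hy) equals target(r)."""
    out = []
    for x, y, e1, e2 in galaxies:
        r = math.hypot(x - hx, y - hy)
        phi = math.atan2(y - hy, x - hx)
        c, s = math.cos(2 * phi), math.sin(2 * phi)
        e_tan = -(e1 * c + e2 * s)
        shift = e_tan - target(r)         # signed distance to the target line
        out.append((x, y, e1 + shift * c, e2 + shift * s))
    return out
```

The projection changes only the tangential part of each galaxy's ellipticity. The component at 45 degrees to it, $e_1 \sin 2\phi - e_2 \cos 2\phi$, is left untouched, which is what keeps some information available for locating the next halo.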

NEW EQUATIONS

The parameters that gave the lowest Dark Worlds Metric were $k=0.6$ and $r_c=4000$. Figure XX shows the results of this method on Sky 134. There is definitely an improvement, but it is difficult to see how the “remove-halo-by-updating-ellipticity” approach will get us much farther, particularly once we add another halo.

Another Approach – Maximum Likelihood

The creators of the competition also provide a second approach to finding the locations of dark matter. With this approach, you propose a model for the effect of dark matter on nearby galaxies. You then traverse the sky and find the location for the dark matter halo that, using the proposed model, would best explain the observed galaxy data. In other words, what is the most likely position of a halo, given a model and the data?

The key to this approach, it should be obvious, is formulating a representative model.

The sample model given by the contest organizers was a simple expression of total ellipticity as a function of distance from the dark matter, e.g.,

$e_{tot} = 1 / r$

From there, you can back out the ellipticity components,

$e_1 = -e_{tot} \cos(2\phi)$
$e_2 = -e_{tot} \sin(2\phi)$

where $\phi$ is the angle of a particular galaxy relative to the halo location.
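To show how such a model turns into a position estimate, here is a minimal Python sketch. It assumes the observed ellipticities scatter around the model values with independent Gaussian noise of width `sigma`, so that maximizing the likelihood is the same as minimizing a chi-squared sum. The value `sigma=0.2`, the `r_min` guard against dividing by zero, the 10 x 10 search grid and all function names are my assumptions, not part of the organizers' sample code.

```python
import math

def model_ellipticity(gx, gy, hx, hy, r_min=1.0):
    """Model (e1, e2) for a galaxy at (gx, gy) given a halo at (hx, hy), with e_tot = 1/r."""
    r = max(math.hypot(gx - hx, gy - hy), r_min)
    phi = math.atan2(gy - hy, gx - hx)
    e_tot = 1.0 / r
    return -e_tot * math.cos(2 * phi), -e_tot * math.sin(2 * phi)

def chi_squared(galaxies, hx, hy, sigma=0.2):
    """Sum of squared residuals between observed and modelled ellipticities."""
    total = 0.0
    for x, y, e1, e2 in galaxies:
        m1, m2 = model_ellipticity(x, y, hx, hy)
        total += (e1 - m1) ** 2 + (e2 - m2) ** 2
    return total / sigma ** 2

def most_likely_position(galaxies, size=4200.0, bins=10):
    """Grid search for the halo position with the smallest chi-squared."""
    width = size / bins
    centres = [((i + 0.5) * width, (j + 0.5) * width)
               for i in range(bins) for j in range(bins)]
    return min(centres, key=lambda c: chi_squared(galaxies, c[0], c[1]))
```

A better model only changes `model_ellipticity`; the search around it stays the same, which is why formulating the model is where the real work lies.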

Lenstool: Peeking Under the Hood

The Lenstool program is written in C, and is approximately 42,000 lines spread across 250 files.