Cataloguing all (and only) the members of a cluster is a major challenge, in many ways similar to finding the proverbial “needle in a haystack”. One must identify the rare cluster members (the “needles”, typically a few hundred) within the overwhelming multitude of field stars and background galaxies (the “haystack”, millions of interlopers).

Cluster member selection is also an area in which COSMIC-DANCE is going to profoundly transform our ability to interpret the mass function, by delivering luminosity functions with proper uncertainties.

Methodology: finding the needles in the haystack

Until recently, the samples involved in studies of nearby clusters were relatively small. The unprecedented scale of the COSMIC-DANCE database, including tens of millions of entries in multiple astrometric and photometric dimensions, cannot be comprehended by humans directly and makes standard selection techniques completely obsolete. Finding the needles in the haystack and turning the extraordinarily rich COSMIC-DANCE data collections into knowledge is a complex hyper-dimensional and Big-Data problem that we propose to solve using the most advanced methods from the areas of Data Mining and Probabilistic Learning.

The objective is to decide on the cluster membership of each source and, at the same time, to derive the cluster’s fundamental properties (luminosity function, spatial distribution). The two problems must be solved concurrently because the membership of a source depends on the cluster properties, and the cluster properties can only be inferred from the members’ properties.

A simple example illustrates the nested nature of this problem: a source located near the cluster core is more likely to be a member than a source located far from it. The spatial location of a source with respect to the cluster tells us something about its membership, and we should use this important information to optimize the selection of members and minimize contamination. But to know the spatial distribution of the cluster (e.g. the core location in this case), we first need to know its members. Hierarchical models are designed to deal with this kind of "chicken-and-egg" problem. All of this must be accomplished in a high-dimensional space (typically ≥10-D of proper motions, colours and luminosities), with a rigorous treatment of uncertainties and incomplete data.
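The circular dependence described above can be sketched with a deliberately simplified stand-in: a two-component mixture (an isotropic Gaussian cluster plus a uniform field) fitted by expectation-maximisation in two spatial dimensions only. This is not the COSMIC-DANCE hierarchical model; the function name `fit_membership`, the synthetic data, and all numerical values are assumptions made for illustration. The loop alternates between estimating memberships from the current cluster properties and re-estimating the cluster properties from the current memberships.

```python
import numpy as np

def fit_membership(xy, field_area, center0, sigma0, n_iter=100):
    """EM sketch: isotropic 2-D Gaussian cluster + uniform field (illustrative only)."""
    xy = np.asarray(xy, dtype=float)
    center = np.asarray(center0, dtype=float)  # rough initial guess of the core
    sigma2 = float(sigma0) ** 2                # rough initial guess of the core size
    frac = 0.5                                 # initial fraction of cluster members
    for _ in range(n_iter):
        # Membership step: probabilities given the current cluster properties.
        d2 = ((xy - center) ** 2).sum(axis=1)
        p_cluster = frac * np.exp(-0.5 * d2 / sigma2) / (2.0 * np.pi * sigma2)
        p_field = (1.0 - frac) / field_area
        prob = p_cluster / (p_cluster + p_field)
        # Property step: cluster properties given the current memberships.
        weight = prob.sum()
        center = (prob[:, None] * xy).sum(axis=0) / weight
        d2 = ((xy - center) ** 2).sum(axis=1)
        sigma2 = (prob * d2).sum() / (2.0 * weight)
        frac = weight / len(xy)
    return prob, center, np.sqrt(sigma2)

# Synthetic "haystack": 2000 field sources and 200 cluster members (made-up values).
rng = np.random.default_rng(0)
field = rng.uniform(-10.0, 10.0, size=(2000, 2))
cluster = rng.normal(loc=[2.0, -1.0], scale=0.5, size=(200, 2))
xy = np.vstack([cluster, field])

prob, center, sigma = fit_membership(
    xy, field_area=400.0, center0=[1.5, -0.5], sigma0=1.0
)
```

Neither the memberships nor the core location are known in advance here, yet both are recovered jointly. A full treatment would add the remaining dimensions (proper motions, colours, luminosities), per-source uncertainties, and missing values, none of which this sketch attempts.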

© H. Bouy

COSMIC-DANCE will use modern data-mining techniques to work in a multi-dimensional space including ALL the available dimensions simultaneously. This ensures that we make optimal use of every single second of precious telescope time, and use ALL the information available to maximise the completeness and minimise the contamination of the selection.

Thanks to half a century of intense research, our knowledge of the nearby clusters is already well advanced, and good (although incomplete) samples of high-probability members exist, based in particular on spectroscopic studies. Using that knowledge to define the prior distributions of our hierarchical model parameters facilitates the convergence of the analysis, ensures that only physically realistic parameters are probed, and makes the selection independent of evolutionary models. Special care is taken to make this technique scalable, as information and prior knowledge increase with new data (e.g. radial velocities, distances, rotational periods, …). In particular, it will be immediately applicable to the Gaia and Gaia-ESO catalogues, which we will use to complement COSMIC-DANCE and ensure complete coverage from the fragmentation limit to the massive OB stars.
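As a toy illustration of how prior knowledge constrains an estimate, the sketch below applies a textbook normal-normal conjugate update to a single parameter, for instance the mean proper motion of a cluster along one axis. The function name `posterior_mean` and all numbers are hypothetical; the actual priors and parameters of the COSMIC-DANCE models are not specified here.

```python
import numpy as np

def posterior_mean(prior_mean, prior_var, data, data_var):
    """Normal-normal conjugate update of a mean, with known measurement variance."""
    data = np.asarray(data, dtype=float)
    n = data.size
    # Precisions (inverse variances) of the prior and of the data add up.
    post_var = 1.0 / (1.0 / prior_var + n / data_var)
    # The posterior mean is the precision-weighted average of prior and data.
    post_mean = post_var * (prior_mean / prior_var + data.sum() / data_var)
    return post_mean, post_var

# Hypothetical prior built from previously confirmed members, and three new,
# noisier measurements (all values are made up for illustration).
post_mean, post_var = posterior_mean(
    prior_mean=20.0, prior_var=1.0, data=[19.0, 21.0, 20.5], data_var=4.0
)
```

The posterior variance is always smaller than the prior variance, and the estimate stays close to the prior when the new data are few or noisy. This is the sense in which an informative prior keeps the analysis within physically realistic values, and why the estimate tightens as more data are added.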

© Last Update: 06-10-2017 by H. Bouy