A Machine Learning Challenge in the Science of Science
- Capture the evolution of scientific concepts.
- Predict emerging research topics.
Mario Krenn, Michael Kopp, David Kreil, Rose Yu, Moritz Neun, Christian Eichenberger, Markus Spanring, Henry Martin, Dirk Geschke, Daniel Springer, Pedro Herruzo, Marvin McCutchan, Alina Mihai, Toma Furdui, Gabi Fratica, Miriam Vázquez, Aleksandra Gruca, Johannes Brandstetter, Sepp Hochreiter
An official competition within the 2021 IEEE BigData Cup Challenges.
In our information era, the volume of scientific literature grows at an ever-increasing speed. In the field of Artificial Intelligence (AI) and Machine Learning (ML), the number of papers grows exponentially and doubles approximately every 23 months. In this overflow of information, researchers have to specialize in narrow sub-disciplines, making it challenging to uncover scientific concepts and connections beyond their own area of research. To explore beyond the specialized areas, research ideas need to transcend the individual focus bubbles. A tool that could offer such meaningful, personalized scientific ideas would open new avenues of research.
Our Science4cast competition directly addresses this challenge. The competition goal is to capture the evolution of scientific concepts and predict which research topics will emerge in the coming years. We created a semantic network characterizing the content of scientific literature in AI since 1994. The network contains 64,000 nodes, each representing an AI concept. Edges between nodes are drawn when two concepts are investigated together in a scientific paper. Competitors need to predict future states of this exponentially growing semantic network.
The compiled unique dataset will be instrumental in the pursuit of a wide range of exciting questions in the area of ML for Science of Science. These questions include end-to-end trained concept discovery, predictions of concept emergence, predictions of interdisciplinary interactions, and suggestions of personalized research ideas. Solutions to our current competition combined with this extensive dataset will set us on the way to answer these vital questions.
The number of papers in AI and ML increases exponentially.
Evolving knowledge networks like these are a common research topic in the Science of Science. Related semantic networks have been built in other natural sciences. Examples include biochemistry, where no machine learning has been applied, and quantum physics, where the semantic network is much smaller: our network has 10 times more nodes and 50 times more edges, and it grows significantly faster.
- Santo Fortunato, Carl T. Bergstrom, Katy Börner, James A. Evans, Dirk Helbing, Staša Milojević, Alexander M. Petersen, Filippo Radicchi, Roberta Sinatra, Brian Uzzi, Alessandro Vespignani, Ludo Waltman, Dashun Wang, Albert-László Barabási, Science of Science, Science 359(6379), eaao0185 (2018).
- Dashun Wang, Albert-László Barabási. The science of science. Cambridge University Press, 2021.
- James A. Evans, Jacob G. Foster, Metaknowledge, Science, 331(6018), 721-725, (2011).
- Andrey Rzhetsky, Jacob G. Foster, Ian T. Foster, James A. Evans, Choosing experiments to accelerate collective discovery, PNAS 112(47) 14569-14574 (2015).
- Mario Krenn, Anton Zeilinger, Predicting research trends with semantic and neural networks with an application in quantum physics, PNAS 117(4) 1910-1916 (2020).
The main competition consists of predicting new links in the semantic network. We provide the semantic network from 1994 to 2017 at daily resolution, where each day corresponds to the publication date of the underlying papers.
This amounts to approximately 8,400 snapshots of the growing semantic network, one for each day from the beginning of 1994 to the end of 2017; participants are welcome to use more coarse-grained snapshots. The evolution shows how the links between the 64,000 nodes are drawn. The precise goal of the task is to predict the links that will form in the semantic network between 2017 and 2020 and that do not yet exist in 2017. Equivalently, the task asks which pairs of scientific concepts will be investigated together for the first time over these three years.
Approximately 1.5% of the semantic network (~1,000 nodes).
Scientific concepts and edges formed before 2012 (blue and grey) and between 2012 and 2015 (green).
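The snapshot idea and the prediction target can be sketched in a few lines of Python. This is a minimal illustration on a hypothetical toy edge list: the names `snapshot` and `new_links` and the (vertex, vertex, date) tuple layout are assumptions made for this example, not the format of the competition files.

```python
from datetime import date

# Hypothetical toy edge list: (vertex_a, vertex_b, publication_date).
EDGES = [
    (0, 1, date(1995, 3, 1)),
    (1, 2, date(2010, 6, 15)),
    (0, 2, date(2016, 1, 20)),
    (2, 3, date(2018, 9, 5)),
    (0, 3, date(2019, 11, 30)),
]


def snapshot(edges, cutoff):
    """Return the set of undirected edges created on or before `cutoff`."""
    return {frozenset((a, b)) for a, b, created in edges if created <= cutoff}


def new_links(edges, t1, t2):
    """Return edges that are absent at t1 but present by t2 (the positives)."""
    return snapshot(edges, t2) - snapshot(edges, t1)


# State of the network at the end of 2017, and the links formed afterwards.
past = snapshot(EDGES, date(2017, 12, 31))
future = new_links(EDGES, date(2017, 12, 31), date(2020, 12, 31))
```

Choosing a coarser granularity only changes which cutoff dates are passed to `snapshot`; the same function can produce yearly or monthly views of the daily data.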
Technical Formulation of the Task
- full_dynamic_graph_sparse: a dynamic graph (list of edges and their creation date) until a time t1.
- unconnected_vertex_pairs: a list of 1,000,000 vertex pairs that are unconnected by time t1.
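To make the two inputs concrete, the sketch below scores every unconnected pair by the product of the degrees of its two vertices (a simple preferential-attachment heuristic) and ranks the pairs by that score. This is not the provided baseline model, which the source does not describe; the toy data, the integer day offsets, and the function names `degree_scores` and `rank_pairs` are assumptions for illustration only.

```python
from collections import Counter


def degree_scores(edge_list, candidate_pairs):
    """Score each candidate pair by the product of its endpoint degrees."""
    degree = Counter()
    for a, b, _created in edge_list:
        degree[a] += 1
        degree[b] += 1
    return [degree[a] * degree[b] for a, b in candidate_pairs]


def rank_pairs(scores):
    """Return indices of candidate pairs, most likely future link first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy stand-ins for the two inputs (assumed layout: vertex, vertex, day offset).
full_dynamic_graph_sparse = [
    (0, 1, 100), (0, 2, 250), (0, 3, 400), (1, 2, 900), (1, 4, 950),
]
unconnected_vertex_pairs = [(3, 4), (1, 3), (2, 3)]

scores = degree_scores(full_dynamic_graph_sparse, unconnected_vertex_pairs)
ranking = rank_pairs(scores)
```

Because the evaluation depends only on the ordering of the pairs, any monotonic rescaling of such scores yields the same result.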
Source files: /Competition/
- Evaluate_Model.py: Evaluating the models
- SimpleModelFull.py: Baseline model
- How to read and visualize data
- How to run a baseline model
- How to create predictions for validation and competition data
Data files at the IARAI website: Science4Cast_data.zip contains the following three files:
- TrainSet2014_3.pkl: Semantic network until 2014, for predicting 2017
- TrainSet2014_3_solution.pkl: which edges are connected in 2017
- CompetitionSet2017_3.pkl: Semantic network until 2017, used for evaluation
Copy those data files directly into the directory of the source files and tutorial.
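Before running a model, it can help to check what a data file actually contains. The following sketch loads a pickle file and summarises the types and sizes of its top-level entries without assuming any particular layout. The helper names `describe` and `describe_pickle` are hypothetical, and the file path in the commented example assumes the file was copied next to the script. Only unpickle files from a source you trust.

```python
import pickle


def _size(obj):
    """Return a ', len=N' suffix for sized objects, else an empty string."""
    try:
        return f", len={len(obj)}"
    except TypeError:
        return ""


def describe(obj, depth=0, max_depth=2):
    """Return lines summarising the type and size of a nested object."""
    lines = ["  " * depth + type(obj).__name__ + _size(obj)]
    if isinstance(obj, (list, tuple)) and depth < max_depth:
        for item in obj[:5]:  # only peek at the first few entries
            lines.extend(describe(item, depth + 1, max_depth))
    return lines


def describe_pickle(path):
    """Load a pickle file from a trusted source and summarise its layout."""
    with open(path, "rb") as handle:
        return describe(pickle.load(handle))


# Example (assumes the file sits in the current directory):
# print("\n".join(describe_pickle("TrainSet2014_3.pkl")))
```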
The Evaluation Metric
For the evaluation, we use a subset of all 57,000 vertices with a nonzero degree at the end of 2017. We define K as the set of vertex pairs that are not yet connected by an edge at the end of 2017. In the extreme case, K contains roughly 3.2 billion ordered vertex pairs, which corresponds to about 1.6 billion possible undirected edges; in our case, K contains half a million vertex pairs. Every pair k in K will either be connected or not connected by 2020. The goal is to predict whether the two vertices will be connected or not.
For evaluating the model, we use the ROC curve. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. Our evaluation metric is the commonly used Area under the Curve (AUC) of the ROC curve. One advantage of AUC over mean-square-error (MSE) is its independence of the class distribution. Specifically, in our case the two classes are highly asymmetrically distributed (only about 1-3% of the pairs become newly connected edges) and the distribution changes over time, so the AUC provides a meaningful and operational interpretation. For perfect predictions, AUC=1, while random predictions give AUC=0.5. Operationally, the AUC is the probability that a randomly chosen true element is ranked higher than a randomly chosen false one.
ROC curve for a random model and a model with an AUC=0.767. The AUC is our evaluation metric.
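The operational interpretation of the AUC can be turned directly into code: count the fraction of (positive, negative) pairs in which the positive receives the higher score, with ties counted as one half. The sketch below is a minimal illustration with made-up labels and scores, not the competition's evaluation script; the function name `roc_auc` is an assumption.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 = pair became connected, 0 = pair stayed unconnected.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
auc = roc_auc(labels, scores)
```

This pairwise loop needs time proportional to the number of positives times the number of negatives, so it is meant for understanding only. For lists with many pairs, a rank-based implementation such as `sklearn.metrics.roc_auc_score` is the more practical choice.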
Participants can upload their predictions on the test dataset (CompetitionSet2017_3.pkl) to the competition leaderboard until the submission deadline.
To be awarded a prize, participants must, in addition to their leaderboard submission, submit working code, learned parameters, and a short scientific paper (4-6 pages) that describes the approach in sufficient detail and will be published in the IEEE BigData workshop. The scientific quality of the submitted paper will be verified by the competition committee.
After the competition, we plan to write a perspective/summary paper and invite all participants to contribute.
All times and dates are Anywhere on Earth (UTC -12).
- Data Release: 25. August 2021
- Competition ends (submission deadline): 3. November 2021
- Abstract submission deadline: 17. November 2021
- Announcement of the winners: 2. December 2021
- IEEE BigData 2021: 15.-18. December 2021
Leaderboard excerpt: rank 2, MK’s Baseline Model solution (Mario Krenn), score 0.87978679928556.
The competition offers the following prizes for the top three winners:
- 1st Prize: 8,000 EUR
- 2nd Prize: 6,000 EUR
- 3rd Prize: 2,000 EUR
In addition, special prizes will be awarded to outstanding or creative solutions, should they exist. We may also offer a fellowship position at the Institute of Advanced Research in Artificial Intelligence (IARAI), Vienna, Austria.