A Machine Learning Challenge in the Science of Science
- Capture the evolution of scientific concepts.
- Predict emerging research topics.
Mario Krenn, Michael Kopp, David Kreil, Rose Yu, Moritz Neun, Christian Eichenberger, Markus Spanring, Henry Martin, Dirk Geschke, Daniel Springer, Pedro Herruzo, Marvin McCutchan, Alina Mihai, Toma Furdui, Gabi Fratica, Miriam Vázquez, Aleksandra Gruca, Johannes Brandstetter, Sepp Hochreiter
An official competition within the 2021 IEEE BigData Cup Challenges.
In our information era, the volume of scientific literature grows at an ever-increasing speed. In the field of Artificial Intelligence (AI) and Machine Learning (ML), the number of papers grows exponentially and doubles approximately every 23 months. In this overflow of information, researchers have to specialize in narrow sub-disciplines, making it challenging to uncover scientific concepts and connections beyond their own area of research. To explore beyond the specialized areas, research ideas need to transcend the individual focus bubbles. A tool that could offer such meaningful, personalized scientific ideas would open new avenues of research.
Our Science4cast competition directly addresses this challenge. The competition goal is to capture the evolution of scientific concepts and predict which research topics will emerge in the coming years. We created a semantic network characterizing the content of scientific literature in AI since 1994. The network contains 64,000 nodes, each representing an AI concept. Edges between nodes are drawn when two concepts are investigated together in a scientific paper. Competitors need to predict future states of this exponentially growing semantic network.
The compiled unique dataset will be instrumental in the pursuit of a wide range of exciting questions in the area of ML for Science of Science. These questions include end-to-end trained concept discovery, predictions of concept emergence, predictions of interdisciplinary interactions, and suggestions of personalized research ideas. Solutions to our current competition combined with this extensive dataset will set us on the way to answer these vital questions.
The number of papers in AI and ML increases exponentially.
Layer 6 AI
A voucher or cash prize worth 8,000 EUR and free IEEE 2021 conference registration for all team members
Ngoc Mai Tran(1), Yangxinyu Xie(1)
1 University of Texas at Austin
A voucher or cash prize worth 6,000 EUR and free IEEE 2021 conference registration for all team members
Milad Aghajohari(1), Mohammad Sadegh Akhondzadeh(2), Saleh Ashkboos(3), Kamran Chitsaz(4)
1 University of Montreal & MILA
2 Saarland University & CISPA Helmholtz
3 ETH Zurich
4 Polytechnique Montreal
A voucher or cash prize worth 2,000 EUR and free IEEE 2021 conference registration for all team members
For a surprising, pure network-theoretical solution:
João P. Moutinho, Bruno Coutinho, Lorenzo Buffoni (Instituto de Telecomunicacoes Lisbon)
For an exciting, dynamical embedding and applications of Transformers:
Harlin Lee, Rishi Sonthalia, Jacob G. Foster (UCLA)
For an interesting low-compute solution:
For an interesting Graph Neural Network solution:
Francisco Andrades, Ricardo Nanculef (Federico Santa Maria Technical University Santiago, Chile)
For an interesting LSTM-based solution:
10.11.2021 – result submission deadline today AoE!
We are extending the paper submission deadline by one week. Please submit your short scientific paper (3-6 pages) by 24. November 2021 (updates on submission will follow in the next few days). The paper should explain your method and results in detail. We will invite all participants with non-trivial results to contribute to our dataset paper, which is scheduled shortly after the competition workshop. Your paper can thus serve as a first draft of your contribution to the dataset paper. We look forward to your great results!
We are extending the deadline by one week: you can submit solutions until 10. November 2021. Furthermore, we will invite all participants to contribute to the dataset paper, which will be written shortly after the competition ends (more details will follow).
Evolving knowledge networks like these are a common research topic in the Science of Science. Related semantic networks have been built in other disciplines of the natural sciences. Examples include biochemistry, where no machine learning was applied, and quantum physics, where the semantic network is much smaller: our network has 10 times more nodes and 50 times more edges, and it grows significantly faster.
- Santo Fortunato, Carl T. Bergstrom, Katy Börner, James A. Evans, Dirk Helbing, Staša Milojević, Alexander M. Petersen, Filippo Radicchi, Roberta Sinatra, Brian Uzzi, Alessandro Vespignani, Ludo Waltman, Dashun Wang, Albert-László Barabási, Science of Science, Science 359(6379), eaao0185 (2018).
- Dashun Wang, Albert-László Barabási. The science of science. Cambridge University Press, 2021.
- James A. Evans, Jacob G. Foster, Metaknowledge, Science, 331(6018), 721-725, (2011).
- Andrey Rzhetsky, Jacob G. Foster, Ian T. Foster, James A. Evans, Choosing experiments to accelerate collective discovery, PNAS 112(47) 14569-14574 (2015).
- Mario Krenn, Anton Zeilinger, Predicting research trends with semantic and neural networks with an application in quantum physics, PNAS 117(4) 1910-1916 (2020).
The main competition consists of predicting new links in the semantic network. We provide the semantic network from 1994 to 2017 at daily resolution, where the date of each edge is the publication date of the underlying paper.
This yields approximately 8,400 snapshots of the growing semantic network, one for each day from the beginning of 1994 to the end of 2017; participants are welcome to use more coarse-grained snapshots. The evolution shows how the links between the 64,000 nodes are drawn. The precise goal of the task is to predict which links that do not yet exist in 2017 will form in the semantic network between 2017 and 2020. Equivalently, the task asks which pairs of scientific concepts will be investigated together by scientists over the following three years.
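The idea of coarse-grained snapshots can be illustrated with a minimal sketch. It assumes, purely for illustration, that the dynamic graph is available as a list of (vertex_a, vertex_b, day) tuples; the toy edges, the day index, and the helper names `snapshot` and `coarse_snapshots` are hypothetical and are not part of the competition code.

```python
# Toy stand-in for the dynamic graph: one tuple per edge, (vertex_a, vertex_b, day).
# The tuple layout and the day index are assumptions made for this sketch.
toy_edges = [(0, 1, 10), (1, 2, 200), (0, 2, 400), (2, 3, 800)]


def snapshot(edges, cutoff_day):
    """Return the vertex pairs of all edges created on or before cutoff_day."""
    return [(a, b) for a, b, day in edges if day <= cutoff_day]


def coarse_snapshots(edges, step_days=365):
    """Yield (cutoff_day, vertex_pairs) at regular intervals instead of daily."""
    last_day = max(day for _, _, day in edges)
    for cutoff in range(step_days, last_day + step_days, step_days):
        yield cutoff, snapshot(edges, cutoff)


# Number of edges present in each (roughly yearly) snapshot of the toy graph.
edge_counts = {cutoff: len(pairs) for cutoff, pairs in coarse_snapshots(toy_edges)}
print(edge_counts)
```

Because the network only grows, each snapshot contains all edges of the previous one, so a model can be trained on features taken at several cutoffs.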
Approximately 1.5% of the semantic network (~1,000 nodes).
Scientific concepts and edges formed before 2012 (blue and grey) and between 2012 and 2015 (green).
Technical Formulation of the Task
- full_dynamic_graph_sparse: a dynamic graph (list of edges and their creation date) until a time t1.
- unconnected_vertex_pairs: a list of 1,000,000 vertex pairs that are unconnected by time t1.
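To make the two inputs concrete, the following sketch scores a handful of unconnected vertex pairs with the degree product (the classical preferential-attachment heuristic for link prediction). This is not the competition baseline; the tuple layout of the edge list, the toy data, and all function names are assumptions chosen for illustration.

```python
from collections import Counter


def degrees_at(edges, t1):
    """Count the degree of every vertex using only edges created up to time t1."""
    degree = Counter()
    for a, b, day in edges:
        if day <= t1:
            degree[a] += 1
            degree[b] += 1
    return degree


def preferential_attachment_scores(edges, pairs, t1):
    """Score each candidate pair by the product of its two vertex degrees at t1."""
    degree = degrees_at(edges, t1)
    return [degree[a] * degree[b] for a, b in pairs]


def rank_pairs(scores):
    """Return pair indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy data: (vertex_a, vertex_b, day) edges and pairs unconnected at t1.
toy_edges = [(0, 1, 10), (1, 2, 200), (0, 2, 400), (2, 3, 800), (3, 4, 820), (1, 4, 900)]
toy_pairs = [(0, 4), (2, 4), (1, 3)]
t1 = 850

scores = preferential_attachment_scores(toy_edges, toy_pairs, t1)
ranking = rank_pairs(scores)
print(scores, ranking)
```

Any model that assigns a score to each pair can be evaluated in the same way, since only the resulting ordering of the pairs matters for a ranking-based metric.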
Source files: /Competition/
- Evaluate_Model.py: Evaluating the models
- SimpleModelFull.py: Baseline model
- How to read and visualize data
- How to run a baseline model
- How to create predictions for validation and competition data
Data files at the IARAI website: Science4Cast_data.zip contains the following three files:
- TrainSet2014_3.pkl: Semantic network until 2014, for predicting 2017
- TrainSet2014_3_solution.pkl: which edges are connected in 2017
- CompetitionSet2017_3.pkl: Semantic network until 2017, used for evaluation
Copy these data files directly into the directory of the source files and tutorial.
The Evaluation Metric
For the evaluation, we use a subset of all 57,000 vertices with a nonzero degree at the end of 2017. We define the set K of vertex pairs that are not yet connected by an edge at the end of 2017. In the extreme case, K contains roughly 3.2 billion vertex pairs, i.e. possible edges; in our case, K contains half a million vertex pairs. Every k in K will either be connected or not connected by 2020. The goal is to predict, for each k, whether the two vertices will be connected or not.
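As a quick sanity check on the size of K in the extreme case, the arithmetic can be written out. The quoted figure of roughly 3.2 billion corresponds to counting ordered pairs of distinct vertices; counting each undirected pair once (an assumption about how edges could be counted, not a statement from the organizers) would give half of that.

```latex
\[
57{,}000 \times 56{,}999 = 3{,}248{,}943{,}000 \approx 3.2 \times 10^{9},
\qquad
\binom{57{,}000}{2} = \frac{57{,}000 \times 56{,}999}{2} = 1{,}624{,}471{,}500 \approx 1.6 \times 10^{9}.
\]
```

Either way, the evaluated set of vertex pairs is a tiny sample of all possible edges.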
To evaluate the models, we use the ROC curve. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. Our evaluation metric is the commonly used Area under the Curve (AUC) of the ROC curve. One advantage of the AUC over the mean-square-error (MSE) is that it does not depend on the class distribution. In our case, the two classes are highly imbalanced (only about 1-3% of the vertex pairs become newly connected) and the distribution changes over time, so the AUC provides a meaningful and operational interpretation. For perfect predictions, AUC=1, while random predictions give AUC=0.5. Operationally, the AUC is the probability that a randomly chosen true element is ranked higher than a randomly chosen false one.
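The operational interpretation of the AUC can be checked directly with a small sketch that compares every positive example against every negative one, counting ties as one half. The function name `auc_pairwise` and the toy labels and scores are made up for illustration; this is not the code in Evaluate_Model.py.

```python
def auc_pairwise(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for pairs that became connected, 0 otherwise.
    scores: predicted score per pair; higher means "more likely to connect".
    Ties between a positive and a negative count as half a correct ranking.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy example: two positives, three negatives, one positive ranked below a negative.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
auc = auc_pairwise(labels, scores)
print(auc)
```

The double loop is quadratic and only meant to mirror the definition; for hundreds of thousands of vertex pairs a rank-based implementation such as `sklearn.metrics.roc_auc_score` is the practical choice.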
ROC curve for a random model and a model with an AUC=0.767. The AUC is our evaluation metric.
Participants can upload their predictions on the test dataset (CompetitionSet2017_3.pkl) to the leaderboard of the competitions until the submission deadline.
To be awarded a prize, participants must submit, in addition to their leaderboard submissions, working code, learned parameters, and a short scientific paper (4-6 pages) that describes the approach in sufficient detail and will be published in the IEEE BigData workshop. The scientific quality of the submitted paper will be verified by the competition committee.
After the competition, we plan to write a perspective/summary paper and invite all participants to contribute.
All times and dates are Anywhere on Earth (UTC -12).
- Data Release: 25. August 2021
- Competition ends (submission deadline): 10. November 2021 at midnight (= noon 11. November 2021 UTC)
- Short Scientific Paper (3-6 pages) submission deadline: 24. November 2021
- Announcement of the winners: 2. December 2021
- IEEE BigData 2021: 15.-18. December 2021
Special Session program (17.12.2021)
1st Place: Yichao Lu (Team oahciy): Predicting Research Trends in Artificial Intelligence with Gradient Boosting Decision Trees and Time-aware Graph Neural Networks
16.40-17.00 (13+ 7):
Special Prize: Team mondegoscroc
Invited Speaker - Rose Yu: Dynamics Learning with Graph Neural Networks
Special Prize: Harlin Lee (Team harlin): Dynamic Embedding-based Methods for Link Prediction in Machine Learning Semantic Network
2nd Place: Ngoc Tran (Team Hash Brown): Random walk rankings with feature selection and imputation
18.45-19.05 (13+ 7)
Special Prize: Francisco Andrades (Team fandrades): A Method to Predict Semantic Relations on Artificial Intelligence Papers
19.05-19.20 (13+ 7)
Special Prize: Nima Sanjabi (Team nimasanjabi): Efficiently Predicting Scientific Trends Using Node Centrality Measures
Invited Speaker: Jacob Foster
Special Prize: João Moutinho (Team joaopmoutinho): Network-based link prediction of scientific concepts
3rd Place: Milad Aghajohari (Team SanatisFinests2): Degree-based Feature Is All You Need
|Rank||Submission||Team - User||AUC||Submitted|
|1||Submission #4||oahciy - oahciy||0.92838861960445||Nov 11, 2021 11:45|
|2||Submission #3||oahciy - oahciy||0.92823997314912||Nov 11, 2021 10:58|
|3||Submission #2||oahciy - oahciy||0.92739359144926||Nov 11, 2021 01:58|
|4||ee5a||Hash Brown - princengoc||0.92738657972574||Nov 11, 2021 01:54|
|5||JNUTOJ||SanatisFinests2 - sadegh||0.92212365472956||Nov 9, 2021 21:29|
|6||Bacalhau à Lagareiro||Bacalhink - joaopmoutinho||0.91853311492525||Nov 10, 2021 18:29|
|7||S4||nimasanjabi - nimasanjabi||0.91845580006827||Nov 10, 2021 16:58|
|8||05 048 3 6||asdnixu||0.9184217368478||Nov 10, 2021 18:30|
|9||M3||nimasanjabi - nimasanjabi||0.91840363535239||Nov 3, 2021 04:53|
|10||s4||sungbinchoi - sungbinchoi||0.91806434210276||Nov 11, 2021 11:09|
|11||s3||sungbinchoi - sungbinchoi||0.91772982833076||Nov 11, 2021 11:50|
|12||s5||sungbinchoi - sungbinchoi||0.91762713872587||Nov 11, 2021 11:52|
|13||s1||sungbinchoi - sungbinchoi||0.91759111971979||Nov 11, 2021 11:44|
|14||70f24eef9b3c441c4f83a2e53ef390ff892f19d2.json||Hash Brown - xinyu2021||0.91734288204778||Nov 11, 2021 11:43|
|15||model_2||nimasanjabi - nimasanjabi||0.91525072533846||Oct 27, 2021 12:30|
|16||Bacalhau com Todos||Bacalhink - joaopmoutinho||0.91385150367375||Nov 10, 2021 11:26|
|17||Test Submission (V2)||xaiguy||0.9130503501038||Sep 25, 2021 20:41|
|18||s2||sungbinchoi - sungbinchoi||0.91157040982971||Nov 11, 2021 11:46|
|19||Yota instinct v2||ArgoCS - thomas27||0.90883506975933||Nov 10, 2021 02:52|
|20||Yota instinct||ArgoCS - thomas27||0.90870570786594||Nov 10, 2021 02:42|
|21||Ultra instinct||ArgoCS - thomas27||0.90848177152718||Nov 10, 2021 01:14|
|22||ArgoNet3||ArgoCS - hedi||0.90701962931712||Nov 10, 2021 12:18|
|23||Bacalhau com Natas||Bacalhink - joaopmoutinho||0.90364755264919||Nov 10, 2021 11:27|
|24||Royce||Team Jacob - harlin||0.90236015458578||Nov 7, 2021 18:27|
|25||TestSubmission1||hyperteam||0.90046004318704||Oct 2, 2021 17:56|
|26||Simple Test||ArgoCS - hedi||0.89998693318996||Oct 7, 2021 16:10|
|27||Leo_model_1||leo2021||0.89921126642699||Oct 26, 2021 09:02|
|28||Bacalhau à Brás||Bacalhink - joaopmoutinho||0.8971536961553||Nov 10, 2021 11:25|
|29||Ultron-NN||nick - nick||0.89545500007176||Oct 31, 2021 12:49|
|30||crocsdead||mondegoscroc||0.89254508747723||Nov 11, 2021 10:10|
|31||Ultra-NN||nick - nick||0.89221657430449||Oct 31, 2021 12:22|
|32||teste_vitor||vitoralmeida777 - vitoralmeida777||0.88966227277996||Nov 8, 2021 13:08|
|33||crocsdinamite||mondegoscroc||0.88822522307617||Nov 11, 2021 10:10|
|34||crocsbuddy||mondegoscroc||0.88785668525578||Nov 10, 2021 18:53|
|35||gogo||raffaelbdl||0.8843339678194||Oct 10, 2021 11:38|
|36||Ultree-NN||nick - nick||0.88127851775876||Oct 31, 2021 13:25|
|37||MK’s Baseline Model solution||Mario Krenn - Mario Krenn||0.87978679928556||Aug 25, 2021 12:42|
|38||sub2||fandrades - fandrades||0.87763277456673||Nov 11, 2021 11:18|
|39||snaity||uoguelph_mlrg - bknyazev||0.87722377750109||Nov 3, 2021 15:00|
|40||from tutorial||graphsufi - tareqmahmood||0.87555545446748||Oct 23, 2021 15:43|
|41||leo_baseline7||leo2021||0.8743629435919||Oct 25, 2021 02:52|
|42||Bacalhau à Gomes de Sá||Bacalhink - joaopmoutinho||0.87091645154164||Nov 10, 2021 11:25|
|43||sub1||fandrades - fandrades||0.86927044291181||Nov 8, 2021 23:36|
|44||Logistic Regression 2||pires||0.84704015756769||Nov 9, 2021 14:17|
|45||Stacked LSTM – Equipe 2||joaomanoel||0.84673222296476||Nov 5, 2021 11:10|
|46||Random Forest 2||pires||0.84459528010844||Nov 10, 2021 10:13|
|47||Random Forest||pires||0.84187464500399||Nov 10, 2021 10:12|
|48||ArgoNet2||ArgoCS - hedi||0.8411969962578||Oct 31, 2021 11:52|
|49||Baseline||ArgoCS - thomas27||0.84113724217193||Oct 31, 2021 15:43|
|50||D.02.11 P.11.34||graphsufi - tareqmahmood||0.8231368869078||Nov 2, 2021 17:35|
|51||D.02.11 P.09.24||graphsufi - tareqmahmood||0.82285008701302||Nov 2, 2021 15:26|
|52||base_submission||smeznar - smeznar||0.81078522290096||Nov 2, 2021 07:56|
|53||improved-1||uoguelph_mlrg - bknyazev||0.8104972664267||Nov 9, 2021 03:00|
|54||ArgoNet||hedi - hedi||0.80916052220601||Oct 24, 2021 10:11|
|55||LR-test||nick - nick||0.78807442329955||Oct 31, 2021 10:27|
|56||sub0||fandrades - fandrades||0.7750607606504||Oct 31, 2021 17:53|
|57||D.03.11 P.01.34||graphsufi - tareqmahmood||0.75106315090655||Nov 2, 2021 19:33|
|58||RFTestV4||santana||0.64975142110317||Nov 8, 2021 15:12|
|59||sub_test2||I&I Group - ibtihal||0.50636387264013||Nov 10, 2021 20:02|
|60||RNTestV2||santana||0.50368010294146||Nov 10, 2021 11:53|
|61||RNTestV3||santana||0.50368010294146||Nov 10, 2021 15:55|
|62||Baseline – Equipe 2||ddrc||0.50338051629384||Nov 9, 2021 19:58|
|63||sub-1||prem||0.5012578309574||Oct 28, 2021 02:29|
|64||sub-1||prem||0.49978714733275||Oct 28, 2021 03:39|
|65||Test||HarrisAbdulMajid - harrisabdulmajid||0.49950260408363||Oct 23, 2021 10:48|
|66||random-guess2||nehzux - nehzux||0.49940172303779||Oct 29, 2021 08:50|
|67||random-guess||nehzux - nehzux||0.49927962685169||Oct 13, 2021 06:36|
|68||RF-test||nick - nick||0.49922240057618||Oct 31, 2021 09:42|
|69||Random||fandrades - fandrades||0.48963363321708||Oct 21, 2021 01:56|
|70||pyggy v1||jmostoller||0.10993527002878||Oct 2, 2021 00:50|
|71||g_trimmed_3.5||jmostoller||0.10950990906728||Nov 8, 2021 20:56|
|72||g_trimmed_3.0||jmostoller||0.10232777675228||Nov 5, 2021 18:15|
The competition offers the following prizes for the top three winners:
- 1st Prize: 8,000 EUR
- 2nd Prize: 6,000 EUR
- 3rd Prize: 2,000 EUR
In addition, special prizes will be awarded to outstanding or creative solutions, should they exist. We may also offer a fellowship position at the Institute of Advanced Research in Artificial Intelligence, Vienna, Austria.