Science4cast

A Machine Learning Challenge in the Science of Science

  • Capture the evolution of scientific concepts.
  • Predict emerging research topics.

Science4cast Competition

Mario Krenn, Michael Kopp, David Kreil, Rose Yu, Moritz Neun, Christian Eichenberger, Markus Spanring, Henry Martin, Dirk Geschke, Daniel Springer, Pedro Herruzo, Marvin McCutchan, Alina Mihai, Toma Furdui, Gabi Fratica, Miriam Vázquez, Aleksandra Gruca, Johannes Brandstetter, Sepp Hochreiter

An official competition within the 2021 IEEE BigData Cup Challenges.

In our information era, the volume of scientific literature grows at an ever-increasing speed. In the field of Artificial Intelligence (AI) and Machine Learning (ML), the number of papers grows exponentially and doubles approximately every 23 months. In this overflow of information, researchers have to specialize in narrow sub-disciplines, making it challenging to uncover scientific concepts and connections beyond their own area of research. To explore beyond the specialized areas, research ideas need to transcend the individual focus bubbles. A tool that could offer such meaningful, personalized scientific ideas would open new avenues of research.

Our Science4cast competition directly addresses this challenge. The competition goal is to capture the evolution of scientific concepts and predict which research topics will emerge in the coming years. We created a semantic network characterizing the content of scientific literature in AI since 1994. The network contains 64,000 nodes, each representing an AI concept. Edges between nodes are drawn when two concepts are investigated together in a scientific paper. Competitors need to predict future states of this exponentially growing semantic network.

The compiled unique dataset will be instrumental in the pursuit of a wide range of exciting questions in the area of ML for Science of Science. These questions include end-to-end trained concept discovery, predictions of concept emergence, predictions of interdisciplinary interactions, and suggestions of personalized research ideas. Solutions to our current competition combined with this extensive dataset will set us on the way to answer these vital questions.

The number of papers in AI and ML increases exponentially.

Prizes

oahciy

Team members

Yichao Lu

Affiliation

Layer 6 AI

Prize

A voucher or cash prize worth 8,000 EUR and free IEEE 2021 conference registration for all team members

Hash Brown

Team members

Ngoc Mai Tran(1), Yangxinyu Xie(1)

Affiliation

1 University of Texas at Austin

Prize

A voucher or cash prize worth 6,000 EUR and free IEEE 2021 conference registration for all team members

SanatisFinests2

Team members

Milad Aghajohari(1), Mohammad Sadegh Akhondzadeh(2), Saleh Ashkboos(3), Kamran Chitsaz(4)

Affiliations

1 University of Montreal & MILA
2 Saarland University & CISPA Helmholtz
3 ETH Zurich
4 Polytechnique Montreal

Prize

A voucher or cash prize worth 2,000 EUR and free IEEE 2021 conference registration for all team members

Special Prizes

For a surprising, pure network-theoretical solution: 

João P. Moutinho, Bruno Coutinho, Lorenzo Buffoni (Instituto de Telecomunicacoes Lisbon)

2,000 Euro

For an exciting, dynamical embedding and applications of Transformers:

Harlin Lee, Rishi Sonthalia, Jacob G. Foster (UCLA)

2,000 Euro

For an interesting low-compute solution:

Francisco Valente

500 Euro

For an interesting Graph Neural Network solution:

Francisco Andrades, Ricardo Nanculef (Federico Santa Maria Technical University Santiago, Chile)

500 Euro

For an interesting LSTM-based solution:

Nima Sanjabi

500 Euro

Updates

10.11.2021 – result submission deadline today AoE!

We extend the paper submission deadline by one week. Please submit your short scientific paper (3-6 pages) by 24. November 2021 (updates on submission will follow in the next few days). The paper should explain your method and results in detail. We will invite all participants (with non-trivial results) to contribute to our dataset paper, which is scheduled shortly after the competition workshop. Thus, your paper can be a first draft of your contribution to our dataset paper. Looking forward to your great results!

29.10.2021

We extend the deadline by one week – you can submit solutions until 10. November 2021. Furthermore, we will invite all participants to contribute to the dataset paper (which will be written shortly after the competition ends; more details later).

The Challenge

The main competition consists of predicting new links in the semantic network. We provide the semantic network from 1994 to 2017, discretized by day (each day represents the publication date of the underlying papers).

We therefore provide approximately 8,400 snapshots of the growing semantic network, one snapshot for each day from the beginning of 1994 to the end of 2017; participants are welcome to use more coarse-grained snapshots. The evolution shows how the links between the 64,000 nodes are drawn. The precise goal of the task is to predict the links that form in the semantic network between 2017 and 2020 and do not yet exist in 2017. Equivalently, the task asks which pairs of scientific concepts will be investigated together by scientists over these three years.
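Coarser snapshots can be derived from a dated edge list by keeping only the edges created before a cutoff date. The following is a minimal sketch, assuming each edge is a triple (u, v, day) where day is an integer offset from 1 January 1994. The actual file layout and day origin may differ (see the tutorial); the function names and the toy data are illustrative only.

```python
from datetime import date

# Assumption: day 0 corresponds to 1 January 1994; the real data may use another origin.
ORIGIN = date(1994, 1, 1)


def snapshot_until(edges, cutoff):
    """Return the set of undirected edges created strictly before `cutoff`."""
    cutoff_day = (cutoff - ORIGIN).days
    # Store each edge with the smaller vertex first so (u, v) and (v, u) coincide.
    return {(min(u, v), max(u, v)) for u, v, day in edges if day < cutoff_day}


def yearly_snapshots(edges, first_year, last_year):
    """One cumulative snapshot per year: all edges that exist at the end of that year."""
    return {
        year: snapshot_until(edges, date(year + 1, 1, 1))
        for year in range(first_year, last_year + 1)
    }


# Toy edge list (hypothetical data): (vertex, vertex, day offset)
toy_edges = [(0, 1, 10), (2, 1, 400), (0, 2, 800), (3, 0, 8000)]
snapshots = yearly_snapshots(toy_edges, 1994, 1996)
for year, edge_set in snapshots.items():
    print(year, sorted(edge_set))
```

Because the snapshots are cumulative, each year's edge set contains the previous year's, which mirrors the growing network described above.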

Semantic network

Approximately 1.5% of the semantic network (~1,000 nodes).
Scientific concepts and edges formed before 2012 (blue and grey) and between 2012 and 2015 (green).

Technical Formulation of the Task

In the competition you get:
  • full_dynamic_graph_sparse: a dynamic graph (list of edges and their creation date) until a time t1.
  • unconnected_vertex_pairs: a list of 1,000,000 vertex pairs that are unconnected by time t1.
Your task in the competition is to predict which edges of unconnected_vertex_pairs will form by a time t2. Specifically, you sort the list of potential edges in unconnected_vertex_pairs from most likely to least likely. The result is scored by the area under the ROC curve (AUC). See more details in the tutorial.
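To make the ranking step concrete, here is a minimal sketch that orders candidate pairs by a simple preferential-attachment score (the product of the two vertex degrees at time t1). This heuristic is chosen purely for illustration and is not the provided baseline model. The edge format (u, v, day), the function name and the toy data are assumptions.

```python
from collections import Counter


def rank_pairs_by_degree_product(edges, candidate_pairs):
    """Return indices into `candidate_pairs`, sorted from most to least likely.

    The score of a pair is degree(u) * degree(v), counted over all edges
    observed until t1 (repeated edges count repeatedly).
    """
    degree = Counter()
    for u, v, _day in edges:
        degree[u] += 1
        degree[v] += 1
    scores = [degree[u] * degree[v] for u, v in candidate_pairs]
    # sorted() is stable, so tied pairs keep their original relative order.
    return sorted(range(len(candidate_pairs)), key=lambda i: -scores[i])


# Toy data (hypothetical): edges until t1 and three currently unconnected pairs.
toy_edges = [(0, 1, 5), (0, 2, 9), (0, 3, 20), (1, 2, 31)]
toy_candidates = [(3, 4), (1, 3), (2, 3)]
ranking = rank_pairs_by_degree_product(toy_edges, toy_candidates)
print(ranking)  # [1, 2, 0]
```

The output is a permutation of the candidate indices, which is the kind of sorted list the task asks for. The exact submission format is described in the tutorial.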

Files

Source files: /Competition/

  • Evaluate_Model.py: Evaluating the models
  • SimpleModelFull.py: Baseline model
Detailed tutorial: /Tutorial/tutorial.ipynb
  • How to read and visualize data
  • How to run a baseline model
  • How to create predictions for validation and competition data

Data files at the IARAI website: Science4Cast_data.zip contains the following three files:

  • TrainSet2014_3.pkl: Semantic network until 2014, for predicting 2017
  • TrainSet2014_3_solution.pkl: which edges are connected in 2017
  • CompetitionSet2017_3.pkl: Semantic network until 2017, used for evaluation

Copy those data files directly into the directory of the source files and tutorial.

The Evaluation Metric

For the evaluation, we use a subset of all 57,000 vertices with a nonzero degree at the end of 2017. We define the set K of vertex pairs that are not yet connected by an edge at the end of 2017. (In the extreme case, K contains roughly 3.2 billion vertex pairs, i.e. possible edges; in our case, K contains half a million vertex pairs.) Every k in K will either be connected or not connected by 2020. The goal is to predict whether the two vertices will be connected or not.

For evaluating the model, we use the ROC curve, which is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. Our evaluation metric is the commonly used area under the ROC curve (AUC). One advantage of AUC over mean squared error (MSE) is its independence of the data distribution. Specifically, in our case, where the two classes are highly imbalanced (only about 1-3% of the pairs become newly connected) and the distribution changes over time, the AUC provides a meaningful and operational interpretation. For perfect predictions, AUC=1, while random predictions give AUC=0.5. Operationally, the AUC is the probability that a randomly chosen true element is ranked higher than a randomly chosen false one.
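The operational interpretation can be checked directly on a small example. The sketch below computes the AUC as the fraction of (true, false) pairs in which the true element receives the higher score, counting ties as one half. It is quadratic in the number of examples and meant only as an illustration, not as the competition's evaluation code (see Evaluate_Model.py for that). The function name and toy data are assumptions.

```python
def auc_from_scores(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    `scores`: higher means "more likely to become connected".
    `labels`: 1 if the pair became connected, 0 otherwise.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Toy example (hypothetical scores): two positives, three negatives.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
toy_labels = [1, 0, 1, 0, 0]
print(auc_from_scores(toy_scores, toy_labels))  # 5 of 6 pairs ordered correctly
```

Only the ordering of the scores matters here, which is why a sorted list of candidate pairs is sufficient for evaluation.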

ROC curve for a random model and a model with an AUC=0.767. The AUC is our evaluation metric.

Submissions

Participants can upload their predictions on the test dataset (CompetitionSet2017_3.pkl) to the leaderboard of the competitions until the submission deadline.

Besides the submissions to the leaderboard, prize eligibility requires submitting working code, learned parameters, and a short scientific paper (4-6 pages), to be published in the IEEE BigData workshop, that describes the approach in sufficient detail. The scientific quality of the submitted paper will be verified by the competition committee.

After the competition, we plan to write a perspective/summary paper and invite all participants to contribute.

Competition Timeline

All times and dates are Anywhere on Earth (AoE, UTC-12).

  • Data Release: 25. August 2021
  • Competition ends (submission deadline): 10. November 2021 at midnight (= noon 11. November 2021 UTC)
  • Short Scientific Paper (3-6 pages) submission deadline: 24. November 2021
  • Announcement of the winners: 2. December 2021
  • IEEE BigData 2021: 15.-18. December 2021
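The conversion of the submission deadline from Anywhere on Earth to UTC can be verified with a few lines of Python. This sketch assumes that "10. November 2021 at midnight" means the end of that day, i.e. 00:00 on 11 November in the UTC-12 zone; the variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Anywhere on Earth is a fixed offset of 12 hours behind UTC.
AOE = timezone(timedelta(hours=-12), name="AoE")

# End of 10 November 2021 AoE, written as 00:00 on 11 November AoE.
deadline_aoe = datetime(2021, 11, 11, 0, 0, tzinfo=AOE)
deadline_utc = deadline_aoe.astimezone(timezone.utc)

print(deadline_utc.isoformat())  # 2021-11-11T12:00:00+00:00
```

This agrees with the "noon 11. November 2021 UTC" stated in the timeline.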

Special Session program (17.12.2021)

Time (CET) Title
16.00-16.10 Short Intro
16.10-16.40 (20+10) 1st Place: Yichao Lu (Team oahciy): Predicting Research Trends in Artificial Intelligence with Gradient Boosting Decision Trees and Time-aware Graph Neural Networks
16.40-17.00 (13+7) Special Prize: Team mondegoscroc
17.00-17.45 (35+10) Invited Speaker - Rose Yu: Dynamics Learning with Graph Neural Networks
Break
17.55-18.20 (15+10) Special Prize: Harlin Lee (Team harlin): Dynamic Embedding-based Methods for Link Prediction in Machine Learning Semantic Network
18.20-18.45 (15+10) 2nd Place: Ngoc Tran (Team Hash Brown): Random walk rankings with feature selection and imputation
18.45-19.05 (13+7) Special Prize: Francisco Andrades (Team fandrades): A Method to Predict Semantic Relations on Artificial Intelligence Papers
19.05-19.20 (13+7) Special Prize: Nima Sanjabi (Team nimasanjabi): Efficiently Predicting Scientific Trends Using Node Centrality Measures
Break
19.30-20.15 (35+10) Invited Speaker: Jacob Foster
20.15-20.40 (15+10) Special Prize: João Moutinho (Team joaopmoutinho): Network-based link prediction of scientific concepts
20.40-21.05 (15+10) 3rd Place: Milad Aghajohari (Team SanatisFinests2): Degree-based Feature Is All You Need
21.05-21.15 Conclusion

Leaderboard

Pos. Name Team/User Score Date(UTC)
1 Submission #4 oahciy - oahciy 0.92838861960445 Nov 11, 2021 11:45
2 Submission #3 oahciy - oahciy 0.92823997314912 Nov 11, 2021 10:58
3 Submission #2 oahciy - oahciy 0.92739359144926 Nov 11, 2021 01:58
4 ee5a Hash Brown - princengoc 0.92738657972574 Nov 11, 2021 01:54
5 JNUTOJ SanatisFinests2 - sadegh 0.92212365472956 Nov 9, 2021 21:29
6 Bacalhau à Lagareiro Bacalhink - joaopmoutinho 0.91853311492525 Nov 10, 2021 18:29
7 S4 nimasanjabi - nimasanjabi 0.91845580006827 Nov 10, 2021 16:58
8 05 048 3 6 asdnixu 0.9184217368478 Nov 10, 2021 18:30
9 M3 nimasanjabi - nimasanjabi 0.91840363535239 Nov 3, 2021 04:53
10 s4 sungbinchoi - sungbinchoi 0.91806434210276 Nov 11, 2021 11:09
11 s3 sungbinchoi - sungbinchoi 0.91772982833076 Nov 11, 2021 11:50
12 s5 sungbinchoi - sungbinchoi 0.91762713872587 Nov 11, 2021 11:52
13 s1 sungbinchoi - sungbinchoi 0.91759111971979 Nov 11, 2021 11:44
14 70f24eef9b3c441c4f83a2e53ef390ff892f19d2.json Hash Brown - xinyu2021 0.91734288204778 Nov 11, 2021 11:43
15 model_2 nimasanjabi - nimasanjabi 0.91525072533846 Oct 27, 2021 12:30
16 Bacalhau com Todos Bacalhink - joaopmoutinho 0.91385150367375 Nov 10, 2021 11:26
17 Test Submission (V2) xaiguy 0.9130503501038 Sep 25, 2021 20:41
18 s2 sungbinchoi - sungbinchoi 0.91157040982971 Nov 11, 2021 11:46
19 Yota instinct v2 ArgoCS - thomas27 0.90883506975933 Nov 10, 2021 02:52
20 Yota instinct ArgoCS - thomas27 0.90870570786594 Nov 10, 2021 02:42
21 Ultra instinct ArgoCS - thomas27 0.90848177152718 Nov 10, 2021 01:14
22 ArgoNet3 ArgoCS - hedi 0.90701962931712 Nov 10, 2021 12:18
23 Bacalhau com Natas Bacalhink - joaopmoutinho 0.90364755264919 Nov 10, 2021 11:27
24 Royce Team Jacob - harlin 0.90236015458578 Nov 7, 2021 18:27
25 TestSubmission1 hyperteam 0.90046004318704 Oct 2, 2021 17:56
26 Simple Test ArgoCS - hedi 0.89998693318996 Oct 7, 2021 16:10
27 Leo_model_1 leo2021 0.89921126642699 Oct 26, 2021 09:02
28 Bacalhau à Brás Bacalhink - joaopmoutinho 0.8971536961553 Nov 10, 2021 11:25
29 Ultron-NN nick - nick 0.89545500007176 Oct 31, 2021 12:49
30 crocsdead mondegoscroc 0.89254508747723 Nov 11, 2021 10:10
31 Ultra-NN nick - nick 0.89221657430449 Oct 31, 2021 12:22
32 teste_vitor vitoralmeida777 - vitoralmeida777 0.88966227277996 Nov 8, 2021 13:08
33 crocsdinamite mondegoscroc 0.88822522307617 Nov 11, 2021 10:10
34 crocsbuddy mondegoscroc 0.88785668525578 Nov 10, 2021 18:53
35 gogo raffaelbdl 0.8843339678194 Oct 10, 2021 11:38
36 Ultree-NN nick - nick 0.88127851775876 Oct 31, 2021 13:25
37 MK’s Baseline Model solution Mario Krenn - Mario Krenn 0.87978679928556 Aug 25, 2021 12:42
38 sub2 fandrades - fandrades 0.87763277456673 Nov 11, 2021 11:18
39 snaity uoguelph_mlrg - bknyazev 0.87722377750109 Nov 3, 2021 15:00
40 from tutorial graphsufi - tareqmahmood 0.87555545446748 Oct 23, 2021 15:43
41 leo_baseline7 leo2021 0.8743629435919 Oct 25, 2021 02:52
42 Bacalhau à Gomes de Sá Bacalhink - joaopmoutinho 0.87091645154164 Nov 10, 2021 11:25
43 sub1 fandrades - fandrades 0.86927044291181 Nov 8, 2021 23:36
44 Logistic Regression 2 pires 0.84704015756769 Nov 9, 2021 14:17
45 Stacked LSTM – Equipe 2 joaomanoel 0.84673222296476 Nov 5, 2021 11:10
46 Random Forest 2 pires 0.84459528010844 Nov 10, 2021 10:13
47 Random Forest pires 0.84187464500399 Nov 10, 2021 10:12
48 ArgoNet2 ArgoCS - hedi 0.8411969962578 Oct 31, 2021 11:52
49 Baseline ArgoCS - thomas27 0.84113724217193 Oct 31, 2021 15:43
50 D.02.11 P.11.34 graphsufi - tareqmahmood 0.8231368869078 Nov 2, 2021 17:35
51 D.02.11 P.09.24 graphsufi - tareqmahmood 0.82285008701302 Nov 2, 2021 15:26
52 base_submission smeznar - smeznar 0.81078522290096 Nov 2, 2021 07:56
53 improved-1 uoguelph_mlrg - bknyazev 0.8104972664267 Nov 9, 2021 03:00
54 ArgoNet hedi - hedi 0.80916052220601 Oct 24, 2021 10:11
55 LR-test nick - nick 0.78807442329955 Oct 31, 2021 10:27
56 sub0 fandrades - fandrades 0.7750607606504 Oct 31, 2021 17:53
57 D.03.11 P.01.34 graphsufi - tareqmahmood 0.75106315090655 Nov 2, 2021 19:33
58 RFTestV4 santana 0.64975142110317 Nov 8, 2021 15:12
59 sub_test2 I&I Group - ibtihal 0.50636387264013 Nov 10, 2021 20:02
60 RNTestV2 santana 0.50368010294146 Nov 10, 2021 11:53
61 RNTestV3 santana 0.50368010294146 Nov 10, 2021 15:55
62 Baseline – Equipe 2 ddrc 0.50338051629384 Nov 9, 2021 19:58
63 sub-1 prem 0.5012578309574 Oct 28, 2021 02:29
64 sub-1 prem 0.49978714733275 Oct 28, 2021 03:39
65 Test HarrisAbdulMajid - harrisabdulmajid 0.49950260408363 Oct 23, 2021 10:48
66 random-guess2 nehzux - nehzux 0.49940172303779 Oct 29, 2021 08:50
67 random-guess nehzux - nehzux 0.49927962685169 Oct 13, 2021 06:36
68 RF-test nick - nick 0.49922240057618 Oct 31, 2021 09:42
69 Random fandrades - fandrades 0.48963363321708 Oct 21, 2021 01:56
70 pyggy v1 jmostoller 0.10993527002878 Oct 2, 2021 00:50
71 g_trimmed_3.5 jmostoller 0.10950990906728 Nov 8, 2021 20:56
72 g_trimmed_3.0 jmostoller 0.10232777675228 Nov 5, 2021 18:15

Prizes

The competition offers the following prizes for the top three winners:

  • 1st Prize: 8,000 EUR
  • 2nd Prize: 6,000 EUR
  • 3rd Prize: 2,000 EUR

In addition, special prizes will be awarded to outstanding or creative solutions, should they exist. We may also include a fellowship position at the Institute of Advanced Research in Artificial Intelligence, Vienna, Austria.

Questions, Suggestions, Issues

Please raise a GitHub issue if you have questions or problems, or send an e-mail to Mario Krenn.

©2023 IARAI - INSTITUTE OF ADVANCED RESEARCH IN ARTIFICIAL INTELLIGENCE
