# Modeling arXiv Submissions

**Date:**

**Updated:**

NOTE: The arXiv surpassed 10,000 submissions in a single month in October 2016; see the announcement here.

For an alternative modeling approach, check out this post by my colleague Jonathan Gross.

## Introduction/Background

The arXiv is an online repository for so-called “preprints” of scientific papers. Originally started in 1991 to provide a place where physicists could more easily share their work, the arXiv has grown to house papers from fields such as Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, and Statistics. The arXiv helped pioneer the idea of “open science”; its success seems to have encouraged the introduction of preprint servers in the fields of chemistry and biology.

Recently, the arXiv reached two milestones - its 1 millionth submission (2015 January 12; video here), and its 25th birthday. In light of this, I was curious to know more about the number of papers submitted each month. It turns out the arXiv administrators provide regular updates each month regarding these and other numbers, which are available here.

As a result, I wanted to know if it was possible to build a simple model to describe how the number of submissions changes in time, and use it to forecast when the arXiv will reach yet another milestone - 10,000 submissions in one month. As I show below, it’s very possible that the number of submissions per month will soon exceed that number.

To get a sense of what’s going on, and why this is an interesting problem, let’s look at a time series plot of the number of monthly submissions.

As we can see, there has been steady and robust growth in monthly submissions. It turns out that, over the history of the arXiv, the maximum number of submissions in any one month was 9792, in 2016 May. In turn, since we are so close to having 10,000 submissions in one month, a natural question arises:

How long do we have to wait?

Let’s find a way to answer it!

## Building a Model - Tools and Methodology

Let \(S(t)\) denote the number of monthly submissions, where \(t\) indexes the number of months since the beginning of the arXiv. arXiv submission data is indexed to the beginning of each month; thus, the start 1991/07/01 is set to \(t=0\), 1991/08/01 is set to \(t=1\), and so on.
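To make the indexing concrete, here is a minimal sketch of the date-to-index mapping described above. The helper names `month_index` and `index_to_month` are made up for illustration; only the convention that 1991/07/01 corresponds to \(t=0\) comes from the text.

```python
import pandas as pd

# Convention from the text: 1991/07/01 corresponds to t = 0.
ARXIV_START = pd.Timestamp("1991-07-01")


def month_index(date):
    """Number of whole months between 1991/07 and the month containing `date`."""
    date = pd.Timestamp(date)
    return (date.year - ARXIV_START.year) * 12 + (date.month - ARXIV_START.month)


def index_to_month(t):
    """Inverse mapping: the first day of the month that lies t months after the start."""
    return ARXIV_START + pd.DateOffset(months=int(t))


print(month_index("1991-08-01"))
print(index_to_month(298).strftime("%Y/%m"))
```

Having the inverse mapping around is handy later on, when month indices need to be turned back into readable dates.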

The tools I use to do data visualization/build the model are pretty standard in the Python software stack:

- `matplotlib`: for plotting and visualizing data
- `seaborn`: to make `matplotlib` plots a bit prettier
- `pandas`: used for wrangling data, especially time series
- `numpy`: great support for scientific computation; lots of handy, miscellaneous functions

As noted on the usage statistics page for this data, there were some “submissions” which were not actually submissions per se, but were instead articles which were “migrated to the arXiv”. In my analysis, I made sure to adjust the number of submissions accordingly, so as to reflect the number of submissions which were actually put on the arXiv directly.

I will use two approaches to predict when we will get to 10,000 submissions. The first is a basic least-squares regression fit to the data, and the second is a slight extension of the first, where we incorporate some randomness.

## Simple Regression Model

To build a simple regression model, I started by smoothing the data. This is mostly because I wanted to remove any spurious/random fluctuations as best I could, without cherry-picking for outliers.

To do the smoothing, I computed a three-month rolling average. (You might be familiar with these kinds of plots from looking at other time series data, such as stock market indices). Each data point \(S(t)\) gets transformed as

\[S(t) \rightarrow \frac{S(t) + S(t + 1) + S(t + 2)}{3}\]

The reason for using a three-month rolling average, as opposed to some other number of months, is mostly a matter of choice. As you can see below, we get a somewhat nicer time series plot.
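As a rough sketch of how such a smoothing step could be done with `pandas`: the function name `forward_rolling_mean` and the toy numbers are invented for illustration. Note that `rolling()` averages over *trailing* points by default, so the result is shifted back to match the forward-looking window in the formula above.

```python
import pandas as pd


def forward_rolling_mean(series, window=3):
    """Average each point with the (window - 1) points that follow it."""
    # rolling(window).mean() at position i covers i-window+1 .. i (trailing).
    # Shifting by -(window - 1) re-aligns it so position i covers i .. i+window-1.
    return series.rolling(window).mean().shift(-(window - 1))


# Toy values, not real arXiv data.
counts = pd.Series([3.0, 6.0, 9.0, 12.0, 15.0])
smoothed = forward_rolling_mean(counts)
print(smoothed.tolist())
```

The last `window - 1` entries come out as `NaN`, because there are not enough later months to average over.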

The smoothed data looks a little cleaner than the original, and makes it pretty clear \(S(t)\) seems to be quadratic in the number of months since 1991/07/01. I chose to use that functional form, without doing any rigorous model selection. We’ll simply make a fit and see how well it does.

At this point, it should be noted that doing any kind of regression on data requires several assumptions. Since we are doing a linear regression (i.e., fitting a curve that is linear in its coefficients), one useful assumption is that, if the true function describing the data is \(f(t)\), our observations are \(S(t) = f(t) + \epsilon(t)\), where \(\epsilon(t)\) is a normally-distributed random variable with mean 0. Another powerful assumption is that the variance of \(\epsilon\) does not depend on time (a.k.a. *homoscedasticity*). We make both these assumptions here.

With these assumptions, we can do the regression using the `numpy.polyfit()` function, which performs a *least squares* regression fit. What does that mean? Recall we have our observations \(S(t)\), which are noisy versions of a true function \(f(t)\), and we want to make a model for \(f\). We suppose \(f\) has a quadratic dependence on time. Then, the least squares estimated coefficients \((a, b, c)\) are those such that the sum of the squares of the differences between our data and our model is minimized:

\[(a, b, c) = \underset{a, b, c}{\text{argmin}} \sum_{t} \left(S(t) - \left(at^{2} + bt + c\right)\right)^{2}\]

Then, our model for \(f\), denoted as \(\hat{f}\), is simply

\[\hat{f}(t) = at^{2}+ bt + c\]

(By way of a brief aside: one way to rephrase the assumptions above is that each observation \(S(t)\) is drawn from a Gaussian distribution \(\mathcal{N}(f(t), \sigma^{2})\), where \(\sigma^{2}\) is the variance of the noise \(\epsilon\). It follows that the *maximum likelihood estimates* of the coefficients \((a, b, c)\) in our model are identical to those given by the least-squares regression. This is because the likelihood function is a product of Gaussians in this case, and maximizing it is equivalent to minimizing the sum of the squares of the differences.)
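Here is a small, self-contained sketch of a quadratic least-squares fit with `numpy.polyfit()`. The data are synthetic and the "true" coefficients are made up; nothing here reproduces the actual arXiv numbers. The one detail worth remembering is that `polyfit` returns coefficients from the highest power down, i.e. in the order \((a, b, c)\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the monthly data: a made-up quadratic plus Gaussian noise.
t = np.arange(300)
true_a, true_b, true_c = 0.08, 8.0, 100.0  # hypothetical values
S = true_a * t**2 + true_b * t + true_c + rng.normal(0.0, 200.0, t.size)

# Quadratic least-squares fit; coefficients come back highest power first.
a, b, c = np.polyfit(t, S, deg=2)
f_hat = np.polyval([a, b, c], t)

# A linear fit for comparison.
a1, b1 = np.polyfit(t, S, deg=1)

sse_quadratic = np.sum((S - f_hat) ** 2)
sse_linear = np.sum((S - np.polyval([a1, b1], t)) ** 2)
print(f"a={a:.4f}, b={b:.3f}, c={c:.1f}")
print(f"SSE quadratic={sse_quadratic:.3e}, SSE linear={sse_linear:.3e}")
```

Comparing the two sums of squared errors is a crude numerical counterpart to eyeballing the two fitted curves on a plot.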

How well does this fit do? Below, I plot the (unsmoothed) data, the quadratic fit, and a linear fit. It’s clear the linear fit does a bad job of modeling the data, and the quadratic fit does reasonably well.

According to `numpy`, the best fit is given by

Although the coefficient of \(t^{2}\) is somewhat small-ish, it’s clear from the plot above that we have a bad model if we don’t include it. At this point, I think we are ready to try and predict when the arXiv will see 10000 submissions in one month. In the next sections, I present two ways of doing so.

## Predicting 10K Submissions - Naive Approach

With our simple model developed above, let’s see if we can make a sensible prediction as to when the arXiv will see 10K submissions in one month. What if we simply extrapolate \(\hat{f}(t)\) into the future? We end up with a plot like this:

Finding where \(\hat{f}(t) = 10000\), we see the simple model predicts that sometime near **November 2017**, the arXiv will see 10K submissions in one month. Although this result is satisfying in that we got something sensible out of making it, I don’t think we will have to wait that long! Can we do better?
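One way to find that crossing point, sketched under assumptions: solve \(at^{2} + bt + c - 10000 = 0\) with `numpy.roots()` and convert the smallest positive root back to a calendar month. The function `crossing_month` and the coefficients used below are hypothetical placeholders, not the fitted values from this post.

```python
import numpy as np
import pandas as pd


def crossing_month(coeffs, threshold=10_000, start="1991-07-01"):
    """First whole month index (and its date) at which a quadratic reaches threshold."""
    a, b, c = coeffs
    roots = np.roots([a, b, c - threshold])
    # Keep real, positive roots only; t < 0 would be before the arXiv existed.
    candidates = [r.real for r in roots if abs(r.imag) < 1e-9 and r.real > 0]
    if not candidates:
        raise ValueError("the fitted curve never reaches the threshold for t > 0")
    t_cross = int(np.ceil(min(candidates)))
    return t_cross, pd.Timestamp(start) + pd.DateOffset(months=t_cross)


# Hypothetical coefficients (a, b, c), chosen only to exercise the function.
t_cross, month = crossing_month((0.1, 2.0, 0.0))
print(t_cross, month.strftime("%Y/%m"))
```

Rounding up with `ceil` reflects the fact that the data are monthly: the milestone is reached in the first whole month where the curve is at or above the threshold.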

## Predicting 10K Submissions - More Sophisticated Approach

One of the assumptions which went into our model was that \(S(t) = f(t) + \epsilon\) (i.e., that the observed and true values can be related by a normally-distributed random variable). Could we use that information to provide us with a more sophisticated way of making a prediction?

Yes. We can do so by observing that it’s possible to *simulate* possible future values for the number of submissions. If we take \(\hat{f}(t)\) and add some Gaussian random noise to it, we get a model for one possible set of future values that could be observed. (Such a set of values is often called a *trajectory*, and that’s the terminology we will use here.)

The key idea here is that the number of submissions each month is not exactly deterministic, so we need some randomness. That randomness is provided by the assumptions of our model; namely, by adding Gaussian noise. Now, you might ask “What’s the mean and standard deviation of those noise values?”. A great question! Our model assumes that the Gaussian noise has mean 0, but, as for its variance, I make the following assumption:

The variance of the noise is approximately equal to the variance of the distribution of residuals (differences between our naive fit and the observed values).

Although I did not plot it here, the code for looking at the distribution of residuals is available in the Jupyter notebook I’ve made available online. That distribution had \(\sigma = 229\), and that’s what I used as the standard deviation of the noise.
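For readers who want to see the idea without opening the notebook, here is a hedged sketch of estimating the noise standard deviation from the residuals. It uses synthetic data and a made-up helper, `residual_std`; the choice of `ddof` is my own and is not necessarily what the notebook does.

```python
import numpy as np


def residual_std(t, observed, coeffs):
    """Standard deviation of the residuals, observed - f_hat(t)."""
    residuals = np.asarray(observed) - np.polyval(coeffs, t)
    # ddof = number of fitted coefficients. With a few hundred data points,
    # this barely differs from the plain (ddof=0) standard deviation.
    return residuals.std(ddof=len(coeffs))


rng = np.random.default_rng(42)
t = np.arange(300)
# Synthetic observations: a made-up quadratic plus noise with sigma = 229.
observed = 0.1 * t**2 + 2.0 * t + rng.normal(0.0, 229.0, t.size)

coeffs = np.polyfit(t, observed, deg=2)
sigma_hat = residual_std(t, observed, coeffs)
print(f"estimated sigma: {sigma_hat:.1f}")
```

On synthetic data like this, the estimate should land close to the value used to generate the noise, which is a quick sanity check on the approach.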

To visualize what’s going on, let’s look at some trajectories. Below, I plot, using a solid orange line, \(\hat{f}(t)\), and in other colors, with dashed lines, various trajectories. It’s important for what follows to notice that some trajectories exceed 10,000 submissions well before November 2017. By incorporating randomness, we get different results for when 10,000 submissions will happen.
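Generating trajectories like these takes only a few lines. The sketch below assumes a quadratic trend with hypothetical coefficients and independent Gaussian noise for every month; `simulate_trajectories` is an illustrative name, and only \(\sigma = 229\) is taken from the text.

```python
import numpy as np


def simulate_trajectories(coeffs, t_values, sigma, n_trajectories, rng):
    """Return an array of shape (n_trajectories, len(t_values)): trend plus noise."""
    trend = np.polyval(coeffs, t_values)
    # Independent zero-mean Gaussian noise for every month of every trajectory.
    noise = rng.normal(0.0, sigma, size=(n_trajectories, len(t_values)))
    return trend + noise  # the trend is broadcast across all trajectories


rng = np.random.default_rng(7)
coeffs = (0.1, 2.0, 0.0)         # hypothetical (a, b, c)
t_future = np.arange(302, 319)   # an arbitrary range of future month indices
trajectories = simulate_trajectories(coeffs, t_future, 229.0, 5, rng)
print(trajectories.shape)
```

Each row is one possible future; plotting the rows as dashed lines over the trend gives a picture like the one described above.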

Now that we have some trajectories, we can build up a histogram of the months when the number of submissions first reached 10,000. For each trajectory \(T(t)\), let \(M\) be given as

\[M = \min\{t : T(t) \geq 10000\}\]

In the parlance of random walk theory, \(M\) is known as the *hitting time*. To build up a histogram of hitting times, I use the following algorithm:

- Generate many trajectories
- Calculate the hitting time for each
- If the hitting time is in the past, ignore that trajectory. (Essentially, I’m ignoring *counterfactual* trajectories, where we predict we should have already seen 10,000 submissions.)
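The three steps above can be sketched as follows. Everything numeric here is an assumption made for illustration: the coefficients are hypothetical, `t_now = 301` stands for August 2016, the endpoint `t = 318` stands for 2018/01, and the number of trajectories is arbitrary. Trajectories that never reach the threshold before the endpoint are dropped as well.

```python
import numpy as np


def first_hitting_times(trajectories, t_values, threshold=10_000):
    """First t at which each trajectory is >= threshold (NaN if it never is)."""
    crossed = trajectories >= threshold
    first_idx = crossed.argmax(axis=1)   # index of the first True (0 if no True)
    ever_crossed = crossed.any(axis=1)
    times = t_values[first_idx].astype(float)
    times[~ever_crossed] = np.nan
    return times


rng = np.random.default_rng(1)
coeffs = (0.1, 2.0, 0.0)       # hypothetical (a, b, c)
sigma = 229.0                  # residual standard deviation quoted in the text
t_now = 301                    # assumed "current" month index (2016/08)
t_values = np.arange(0, 319)   # simulate up to the assumed endpoint (2018/01)

# Step 1: generate many trajectories (trend plus Gaussian noise).
noise = rng.normal(0.0, sigma, size=(5000, t_values.size))
trajectories = np.polyval(coeffs, t_values) + noise

# Step 2: calculate the hitting time for each trajectory.
M = first_hitting_times(trajectories, t_values)

# Step 3: ignore trajectories that never hit, or that hit in the past.
kept = M[~np.isnan(M)]
kept = kept[kept >= t_now]

# Histogram: number of kept trajectories whose hitting time is each month index.
counts = np.bincount(kept.astype(int), minlength=t_values[-1] + 1)
print(counts[t_now:])
```

Using `argmax` on the boolean array is a common trick for finding the first `True` in each row; the separate `any` check is needed because `argmax` returns 0 when a row contains no `True` at all.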

Since I don’t think we want to reason in terms of “number of months since 1991/07/01”, I’ve changed the \(x\) axis labels in the histogram below to be dates of the form month/year.

The histogram shows us that, under the assumptions of our model, we could reasonably expect to have to wait a year before we ever get 10K submissions in one month. However, it’s worth noting a non-trivial number of trajectories crossed that mark around **2017 February**, which is only 6 months away! On average though, we might have to wait until **2017 August** to reach this milestone.

## Limitations of the Model

Admittedly, this model is rather rudimentary. As such, we shouldn’t really read too much into its predictions, possibly aside from observing that they jibe with an intuition that use of the arXiv will continue to grow, at least for a while, and that at some point in the near future, we will see 10K submissions in one month. Other than that, this model really cannot say anything too quantitative about that topic.

Some explicit limitations include:

- I (arbitrarily) picked a functional form for \(\hat{f}(t)\). There was no (formal) model selection involved.
- I assumed the data could be modeled as some true signal plus homoscedastic Gaussian noise.
- I assumed that the standard deviation of the empirical distribution of residuals was a good proxy for the standard deviation of the noise.
- I arbitrarily picked an endpoint for the simulations (2018/01/01). In turn, this implies an assumption that by 2018 January, we will have seen 10K submissions. If I picked a farther-out end date, the histogram above would change.
- I ran an arbitrary number of simulations of the trajectories. Even with the assumptions above, it’s possible I simply didn’t observe enough trajectories to get an accurate histogram.

## Concluding Thoughts

It is clear that #openscience services like the arXiv, among others, have become invaluable to researchers, research institutions, and research funding groups. The robust growth in terms of monthly submissions is a testament to that. The success of the arXiv has probably helped pique the interest of other disciplines in practicing open science, be it the establishment of the biorxiv for biological research, or the announcement of the intent of the American Chemical Society to establish a preprint server for chemistry.

Equally useful, the arXiv runs itself in a (relatively) transparent manner. (For example, see the arXiv User Survey.) Because it provides usage data, it’s possible to find out new things about the arXiv and its users. Here, I have done a very simple investigation to determine when the arXiv will see 10,000 submissions in one month. Based on a simple least squares regression, the predicted month is 2017 November; with a slightly more sophisticated predictor, that prediction is brought closer, to 2017 August, though it also suggests it’s entirely conceivable the arXiv will cross that threshold within the next 6 months or so.

Part of the reason for doing this post was to learn something basic about “big data” type problems - finding data, working with it, visualizing it, and putting it to work. I hope you found this post useful! A big thanks to the arXiv administrators - without them, the usage data wouldn’t be available, and this analysis could not be done!

If you like to play with code, here’s the Jupyter notebook I used to do this analysis.

If you like topics and content like this, you might like my Twitter feed.