The AppScanner module automatically downloads apps from the Google Play Store, extracts relevant information, and saves the data in an open, interoperable JSON format (AIF, the App Interchange Format).
This module has been built for portability and ease of use, so it consists of a stand-alone Python program. All you need is Python 3 installed on your system; then you can just download the code and run it. The behavior of the module can be tuned with the command line options described below. Being a Python program, it is also easy to inspect the code, see how it works, and modify it if you wish.
Command line: appscanner n
This tells appscanner to process n apps from the Google Play Store: the store is crawled, app information is downloaded, and then relevant information is extracted and written in the JSON App Interchange Format. Note that the numerical parameter n is optional: if you don't specify a number, appscanner will process a default number of apps (currently 100).
The appscanner behaviour can be modified via the following command line options (shorter or longer formats):
-h, --help show this help message and exit
-i, --incremental incrementally process apps from a previous session
-r, --recalculate recalculate the analysis from the saved pages
-m, --meta add meta data (app store and extraction date)
-n, --nosave don't save html page sources
-d, --delete delete the previously processed apps and sources
-s SOURCES, --sources SOURCES
subdirectory where to save page sources (default: google/sources)
-o OUTPUT, --output OUTPUT
the output json subdirectory (default: google/results)
-g, --generate generate app indexes (current visited and unvisited) from scratch
-a, --absolute absolute mode: number of analyzed pages is used rather than apps
--seeds SEEDS alternate starting seeds file
-l, --lazy lazy analysis (disable json link extraction, faster processing)
-u, --unformatted disable pretty json output (smaller files)
-p PAUSE, --pause PAUSE
extra pause between network requests (default: 0s)
-v, --verbose verbose info/warnings output
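The option listing above has the shape of output produced by Python's argparse module. As a rough illustration only, the sketch below shows how such a command line could be declared. It is a hypothetical reconstruction from the help text, not the actual appscanner source: the function name build_parser, the float type for the pause, and the counting behavior of -v are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical reconstruction of the appscanner options from the help text."""
    p = argparse.ArgumentParser(prog="appscanner")
    # Optional positional n; the default of 100 is the one stated in the text.
    p.add_argument("n", nargs="?", type=int, default=100)
    p.add_argument("-i", "--incremental", action="store_true")
    p.add_argument("-r", "--recalculate", action="store_true")
    p.add_argument("-m", "--meta", action="store_true")
    p.add_argument("-n", "--nosave", action="store_true")
    p.add_argument("-d", "--delete", action="store_true")
    p.add_argument("-s", "--sources", default="google/sources")
    p.add_argument("-o", "--output", default="google/results")
    p.add_argument("-g", "--generate", action="store_true")
    p.add_argument("-a", "--absolute", action="store_true")
    p.add_argument("--seeds")
    p.add_argument("-l", "--lazy", action="store_true")
    p.add_argument("-u", "--unformatted", action="store_true")
    # Assumption: the pause is a number of seconds, parsed as a float.
    p.add_argument("-p", "--pause", type=float, default=0.0)
    # Assumption: repeated -v flags are counted (-vv gives 2).
    p.add_argument("-v", "--verbose", action="count", default=0)
    return p


if __name__ == "__main__":
    print(build_parser().parse_args())
```

With a parser like this, an invocation such as "appscanner -a -vv 2000" would yield n=2000 with absolute mode on and a verbosity level of 2.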
Some more information about the options:
-i, --incremental
Incrementally processes n apps, continuing from a previous session. This is possible because the program stores information about the crawl in a persistent index (a file), including apps that are queued and can be analyzed later.
-r, --recalculate
-m, --meta
-n, --nosave
-d, --delete
-s, --sources
-o, --output
-g, --generate
-a, --absolute
The number n, as said before, specifies how many apps have to be analyzed. Note that n is not equal to the number of pages actually downloaded: because appscanner crawls the store, it also downloads other kinds of pages (for instance the home page, and pages such as lists of apps). This means that the number of downloaded pages (and of the corresponding network calls) will be larger than n. So, for example, calling appscanner with n=2000 could take much more than double the time it takes to run with n=1000.
Option -a gives finer control over the network calls: with this option, n specifies the number of pages downloaded, not the number of apps processed. This can be helpful, for instance, for putting a limit on the number of network calls, and so on the time needed to complete the run. Going back to the previous example, with option -a activated, running appscanner with n=2000 will take approximately double the time it takes to run with n=1000.
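The difference between counting apps and counting pages can be illustrated with a toy crawl. The sketch below is not the appscanner crawler: it uses an invented in-memory "store" (TOY_STORE) and a hypothetical crawl function, only to show why the number of downloaded pages exceeds n in the default mode and equals n in absolute mode.

```python
from collections import deque

# Toy, in-memory stand-in for the store: page -> (is_app_page, outgoing links).
# All page names are invented for illustration.
TOY_STORE = {
    "home": (False, ["list1", "list2"]),
    "list1": (False, ["app1", "app2"]),
    "list2": (False, ["app3"]),
    "app1": (True, ["app2"]),
    "app2": (True, []),
    "app3": (True, []),
}


def crawl(n, absolute=False, seeds=("home",)):
    """Breadth-first crawl that stops after n apps (default) or n pages (absolute)."""
    queue, seen = deque(seeds), set(seeds)
    pages = apps = 0
    while queue and (pages if absolute else apps) < n:
        page = queue.popleft()
        is_app, links = TOY_STORE[page]
        pages += 1      # every download costs one network call
        apps += is_app  # only app pages count towards n in the default mode
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages, apps


if __name__ == "__main__":
    print(crawl(2))                 # default mode: n counts apps
    print(crawl(2, absolute=True))  # absolute mode: n counts pages
```

In this toy store, asking for 2 apps costs 5 page downloads, because the home page and two list pages are fetched first; in absolute mode the same n stops the crawl after exactly 2 downloads, before any app page is reached.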
--seeds
Seeds are the starting points from which information is progressively extracted. Appscanner has a default set of seeds for the Google Play Store, but this option replaces it with the set listed in a given file. This lets appscanner start from specific pages, and so likely obtain quite different results (in terms of which apps are collected and analyzed) than with the default seeds. The format of the seeds file is simple: it is just a list of page addresses, one per line. Addresses can also be relative, in which case they are turned into absolute web addresses using the Google Play Store address as base.
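As an illustration of the seeds file format, the following sketch reads one address per line and resolves relative addresses against a base. It is only a sketch under stated assumptions: the function name load_seeds, the exact base address, and the skipping of blank lines are not taken from the appscanner source.

```python
from urllib.parse import urljoin

# Assumed base address; the text only says "the Google Play Store address".
BASE_URL = "https://play.google.com/"


def load_seeds(lines, base=BASE_URL):
    """Turn seed lines (one address per line) into absolute URLs."""
    seeds = []
    for line in lines:
        address = line.strip()
        if not address:  # skipping blank lines is an assumption
            continue
        # urljoin leaves absolute addresses untouched and resolves relative ones.
        seeds.append(urljoin(base, address))
    return seeds


if __name__ == "__main__":
    example = ["/store/apps\n", "https://play.google.com/store/apps/top\n"]
    for seed in load_seeds(example):
        print(seed)
```

Since the function accepts any iterable of lines, an open file object can be passed directly, for example load_seeds(open("myseeds.txt")), where myseeds.txt is a hypothetical file name.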
-l, --lazy
Lazy analysis ignores the smart extraction of links from JSON data: this results in faster computation, at the expense of a different (likely smaller) pool of apps to be queued.
-u, --unformatted
-p, --pause
-v, --verbose
The option can be repeated (-v -v or just -vv) to get even more warning information (such as missing information for each app).
The source code of the module can be directly downloaded here (note that at public launch we will also provide a GitHub link).