The AppScanner Module

Description

The AppScanner module automatically downloads apps from the Google Play Store, extracts relevant information, and saves the data in an open, interoperable JSON format (AIF, the App Interchange Format).

Installation

This module has been built for portability and ease of use, so it consists of a single stand-alone Python program. All you need is Python 3 installed on your system; then you can download the code and run it. The behavior of the module can be tuned with the command line options described below. Because it is a Python program, it is also easy to inspect the code, see how it works, and modify it if you wish.

Quick Usage

Command line: appscanner n
This tells appscanner to process n apps from the Google Play Store: the store is crawled, app information is downloaded, and the relevant information is extracted and written out in the JSON App Interchange Format. The numerical parameter n is optional: if you don't specify a number, appscanner processes a default number of apps (currently 100).
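
As a rough illustration of how an optional numeric argument with a default can be declared in Python, here is a minimal argparse sketch. It is not taken from the appscanner source: the function name build_parser and the parser layout are assumptions, and only the default of 100 comes from the text above.

```python
import argparse

DEFAULT_APPS = 100  # default number of apps stated in the text above


def build_parser():
    """Hypothetical parser with an optional positional n (not the module's real code)."""
    parser = argparse.ArgumentParser(prog="appscanner")
    # nargs="?" makes the positional optional, so the default applies when it is omitted
    parser.add_argument("n", nargs="?", type=int, default=DEFAULT_APPS,
                        help="number of apps to process")
    return parser


print(build_parser().parse_args([]).n)       # no number given: falls back to the default
print(build_parser().parse_args(["250"]).n)  # explicit number of apps
```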

Options

The behavior of appscanner can be modified with the following command line options (given in short or long form):


  -h, --help            show this help message and exit
  -i, --incremental     incrementally process apps from a previous session
  -r, --recalculate     recalculate the analysis from the saved pages
  -m, --meta            add meta data (app store and extraction date)
  -n, --nosave          don't save html page sources
  -d, --delete          delete the previously processed apps and sources
  -s SOURCES, --sources SOURCES
                        subdirectory where to save page sources (default: google/sources)
  -o OUTPUT, --output OUTPUT
                        the output json subdirectory (default: google/results)
  -g, --generate        generate app indexes (current visited and unvisited) from scratch
  -a, --absolute        absolute mode: number of analyzed pages is used rather than apps
  --seeds SEEDS         alternate starting seeds file
  -l, --lazy            lazy analysis (disable json link extraction, faster processing)
  -u, --unformatted     disable pretty json output (smaller files)
  -p PAUSE, --pause PAUSE
                        extra pause between network requests (default: 0s)
  -v, --verbose         verbose info/warnings output
Some more information about the options:
-i, --incremental
By default, each launch of appscanner starts the crawl from scratch. Sometimes you may instead want to add app information gradually, running appscanner several times rather than once with a large number of apps. The incremental option continues the analysis from a previous session, processing another n apps. This is possible because the program stores information about the crawl in a persistent index (a file), including the apps that are queued and can be analyzed later.
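
The idea of a persistent crawl index can be sketched as follows. This is only an illustration under stated assumptions: the JSON layout with "visited" and "unvisited" lists, and the helper names load_index, save_index and process_batch, are hypothetical and may not match the index files appscanner actually writes.

```python
import json
from pathlib import Path


def load_index(path):
    """Load the crawl index, or start an empty one if the file does not exist."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"visited": [], "unvisited": []}


def save_index(path, index):
    """Persist the index so that a later run can pick up where this one stopped."""
    Path(path).write_text(json.dumps(index), encoding="utf-8")


def process_batch(index, n):
    """Move up to n queued apps from 'unvisited' to 'visited' and return them."""
    batch = index["unvisited"][:n]
    index["unvisited"] = index["unvisited"][n:]
    index["visited"].extend(batch)
    return batch
```

An incremental run would then load the index, process another batch of n queued apps, and save the index again.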
-r, --recalculate
By default, appscanner saves the html sources of the app store pages, so the analysis can be performed at any later time, even offline, without downloading all the app information again (which is the most time-consuming part). This option recomputes the analysis and regenerates the json app information from those saved pages.
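
A minimal sketch of such an offline recalculation is shown here. It assumes that each saved page is a separate .html file and that one json file is written per page; the function name recalculate and the extract callback are hypothetical stand-ins for the module's own analysis code.

```python
import json
from pathlib import Path


def recalculate(sources_dir, output_dir, extract):
    """Re-run `extract` on every saved page and write one json file per page."""
    sources, output = Path(sources_dir), Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    written = []
    for page in sorted(sources.glob("*.html")):   # assumed file naming
        info = extract(page.read_text(encoding="utf-8"))
        target = output / (page.stem + ".json")
        target.write_text(json.dumps(info, indent=2), encoding="utf-8")
        written.append(target.name)
    return written
```

No network access is involved: everything is read from the sources directory and written to the output directory.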
-m, --meta
This option adds some meta information to the json output of each app: the name of the store and the day the information was downloaded. This option is off by default to save space.
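
For illustration only, adding such meta information to an app record could look like the sketch below. The field names "meta", "store" and "date", the store label, and the function name add_meta are assumptions; the actual AIF field names may differ.

```python
from datetime import date


def add_meta(app_info, store="Google Play", day=None):
    """Return a copy of app_info with the store name and extraction day attached."""
    enriched = dict(app_info)            # leave the original record untouched
    enriched["meta"] = {
        "store": store,
        "date": (day or date.today()).isoformat(),
    }
    return enriched
```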
-n, --nosave
As said before, appscanner saves all the page sources, so you can recompute everything later, even offline. If you don't want to save the sources (for instance, to save disk space), select this option. Note that with this option you cannot use other options (like -r) that depend on the sources being stored on the system.
-d, --delete
This option launches appscanner from scratch: all information from previous launches is deleted. This option is off by default, so you can run appscanner multiple times (even without the -i option), in which case the (possibly overlapping) results of each launch are merged together.
-s, --sources
This option specifies the directory where the sources of the pages downloaded from the app store are saved.
-o, --output
Similarly, this option specifies the output directory of the analysis, where the json file of each app is written.
-g, --generate
As said when explaining option -i, appscanner saves the crawling status into index files. This also means that if these files are lost, deleted, or corrupted, it is impossible to proceed incrementally (with option -i). This option regenerates the index files using only the sources of the downloaded pages, so that further incremental runs of appscanner are possible.
-a, --absolute
The main parameter n, as said before, specifies how many apps have to be analyzed. Note that n is not equal to the number of pages actually downloaded: because appscanner crawls the store, it also downloads other kinds of pages (for instance the home page, and pages such as lists of apps). This means that the number of downloaded pages (and of the corresponding network calls) will be larger than n. So, for example, calling appscanner with n=2000 could take much more than double the time it takes to run with n=1000. Option -a gives finer control over the network calls: with this option, n specifies the number of pages that are downloaded, not the number of apps processed. This can be helpful, for instance, to put a limit on the number of network calls, and so also on the time needed to complete the run. Going back to the previous example, with option -a active, running appscanner with n=2000 will take approximately double the time it takes with n=1000.
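
The difference between the two ways of counting can be shown with a toy crawl loop. This is a simplified sketch, not the module's crawler: the function name crawl and the list of (address, is_app_page) pairs standing in for the crawl order are made up for the example.

```python
def crawl(pages, limit, absolute=False):
    """Walk through pages until the budget is used up.

    pages: iterable of (address, is_app_page) pairs standing in for the crawl order.
    Returns (apps_processed, pages_downloaded).
    """
    apps = downloads = 0
    for _address, is_app_page in pages:
        used = downloads if absolute else apps   # what n is counted against
        if used >= limit:
            break
        downloads += 1                           # every page costs one network call
        if is_app_page:
            apps += 1
    return apps, downloads


order = [("home", False), ("list", False), ("app1", True),
         ("app2", True), ("list2", False), ("app3", True)]
print(crawl(order, 2))                 # (2, 4): two apps needed four downloads
print(crawl(order, 2, absolute=True))  # (0, 2): exactly two downloads, no app reached yet
```

In the default mode the number of downloads depends on how many non-app pages are met along the way, while in absolute mode it is fixed by n.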
--seeds
Like every web crawler, appscanner needs to know where to start its crawl of the store. These starting pages are called seeds: they are the points from which information is progressively extracted. Appscanner has a default set of seeds for the Google Play Store, but this option replaces it with the seeds listed in a file of your choice. This lets you run appscanner starting from specific pages, likely obtaining quite different results (in terms of which apps are collected and analyzed) than with the default seeds. The format of the seeds file is simple: it is just a list of page addresses, one per line. The addresses can also be relative, in which case they are turned into absolute web addresses using the Google Play Store address as the base.
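
Reading such a seeds file and resolving relative addresses can be sketched with the standard urljoin function. The base address used here and the choice to skip blank lines are assumptions, as is the function name read_seeds; the real module may behave differently.

```python
from urllib.parse import urljoin

# Assumed base address; the value appscanner really uses may differ.
STORE_BASE = "https://play.google.com/store/"


def read_seeds(lines, base=STORE_BASE):
    """Turn seed lines into absolute addresses, skipping blank lines."""
    seeds = []
    for line in lines:
        address = line.strip()
        if address:
            # urljoin leaves absolute addresses unchanged and resolves relative ones
            seeds.append(urljoin(base, address))
    return seeds
```

The function accepts any iterable of lines, so an open file object can be passed directly, for example read_seeds(open("myseeds.txt")), where myseeds.txt is a made-up file name.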
-l, --lazy
Each page downloaded by Appscanner is searched for links to other apps, which can then be queued and processed later. Appscanner goes deep and tries to extract this link information not only from normal html link fields, but also from the javascript part present in each page: this allows more links to be found, and so more apps to be processed. With this option you can disable that deeper extraction and have Appscanner lazily skip it: this results in faster computation, at the expense of a different (likely smaller) pool of apps to be queued.
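
To give an idea of the two extraction levels, the sketch below collects app identifiers from normal links only (lazy) or from the whole page text, scripts included (full). The regular expression, the class and function names, and the sample page are assumptions made for the example; the extraction performed by Appscanner is likely more elaborate.

```python
import re
from html.parser import HTMLParser

# Assumed shape of an app link: /store/apps/details?id=<package name>
APP_LINK = re.compile(r"/store/apps/details\?id=([A-Za-z0-9_.]+)")


class AnchorCollector(HTMLParser):
    """Collect app ids found in the href attribute of <a> tags."""

    def __init__(self):
        super().__init__()
        self.app_ids = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            self.app_ids.update(APP_LINK.findall(href))


def extract_app_ids(page, lazy=False):
    """Lazy mode reads only <a href> links; full mode also scans the raw page text."""
    collector = AnchorCollector()
    collector.feed(page)
    ids = set(collector.app_ids)
    if not lazy:
        ids.update(APP_LINK.findall(page))   # also catches ids embedded in scripts
    return ids


sample = ('<a href="/store/apps/details?id=com.example.one">One</a>'
          '<script>var d = ["/store/apps/details?id=com.example.two"];</script>')
print(sorted(extract_app_ids(sample, lazy=True)))   # ['com.example.one']
print(sorted(extract_app_ids(sample)))              # ['com.example.one', 'com.example.two']
```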
-u, --unformatted
The information extracted from each app is written to a json file, which is formatted for human readability. Sometimes you might not care about this and just want smaller files: activating this option turns off the pretty printing and produces a smaller json file. Note that some minor formatting is still applied, so the resulting json files are smaller but not totally unreadable.
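
The size difference can be seen with the standard json module. The record below is made up, and which exact settings Appscanner uses in each mode is an assumption; the sketch only shows the general trade-off between readability and size.

```python
import json

# Made-up app record, only for comparing output sizes
app = {"name": "Example App", "permissions": ["CAMERA", "INTERNET"]}

pretty = json.dumps(app, indent=2)                 # multi-line, human readable
compact = json.dumps(app)                          # one line, spaces after separators kept
smallest = json.dumps(app, separators=(",", ":"))  # no optional whitespace at all

print(len(pretty), len(compact), len(smallest))
```

The middle variant is an example of output that is smaller than the pretty form while still keeping some minor formatting.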
-p, --pause
Appscanner continuously crawls the Google Play Store to download pages and then processes them. In some cases (depending also on the speed of your internet connection, or where you connect from) this might put a strain on the Google servers, or you might be identified as an overly aggressive crawler and temporarily suspended. This option lets you play nice by adding an extra delay (expressed in seconds) between one network request and the next.
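
A polite download loop of this kind can be sketched as follows. The function name fetch_all and its parameters are hypothetical; fetch stands for whatever function actually downloads a page, and sleep can be replaced in tests to avoid real waiting.

```python
import time


def fetch_all(urls, fetch, pause=0.0, sleep=time.sleep):
    """Call fetch(url) for each url, waiting `pause` seconds between calls."""
    pages = []
    for position, url in enumerate(urls):
        if position > 0 and pause > 0:
            sleep(pause)            # extra delay between consecutive requests
        pages.append(fetch(url))
    return pages
```

With the default pause of 0 the loop never sleeps, which matches the default of 0s in the option list.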
-v, --verbose
The default output of Appscanner is minimal, which also means that you might not get any output at all until the end of the computation. This option tells Appscanner to be more verbose and print more information as it runs, progressively showing which apps have been processed so you can monitor the progress of the analysis. There are two levels of verbosity, so you can apply this option twice (-v -v or just -vv) to get additional warnings (such as information missing for an app).
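
A repeatable flag like this is commonly declared with the "count" action of argparse. The sketch below is an assumption about how it could be done, not the module's actual code; build_verbose_parser and report are hypothetical names.

```python
import argparse


def build_verbose_parser():
    """Hypothetical parser showing a repeatable -v flag."""
    parser = argparse.ArgumentParser(prog="appscanner")
    # action="count" adds one to the value for every occurrence of -v
    parser.add_argument("-v", "--verbose", action="count", default=0,
                        help="verbose info/warnings output")
    return parser


def report(message, level, verbosity):
    """Return the message if the chosen verbosity is high enough, else None."""
    return message if verbosity >= level else None


args = build_verbose_parser().parse_args(["-vv"])
print(args.verbose)                                         # 2
print(report("missing field in app", 2, args.verbose))      # shown only at level 2
```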

The Code

The source code of the module can be directly downloaded here (note that at public launch we will also provide a GitHub link).