The AppEnricher Module


Description

The AppEnricher module processes app information in the app json interchange format (for instance, as produced by AppScanner) and enriches it with additional information that is not present in the original app store sources. The result is new app information, again expressed in the app json interchange format. Three kinds of extra information are calculated and added to each app: first, the total number of reviews; second, the new uses data structure; third, the score information (relative to the measures defined in the module). Note that no ranking information is added: this is done in the appranker module (the components are kept separate because rank information is relative to the global dataset in use).

Installation

This module has been built for portability and ease of use, so it consists of a stand-alone Python program. All you need is Python 3 installed on your system; then you can just download the code and run it. The behavior of the module can be tuned with the command line options described below. Being a Python program, it is also easy to inspect the code, see how it works, and modify it if you wish.

Quick Usage

Command line: appenricher
The AppEnricher module can be launched directly without any parameters: in that case, it uses the default locations for the input and output directories (see below).

Options

The behaviour of AppEnricher can be modified via the following command line options (each available in a short and a long form):


  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        the input json directory (default: google/results)
  -o OUTPUT, --output OUTPUT
                        the output json directory (default: google/enriched)
  -b BASE, --base BASE  base directory for relative directory paths (default: .)
  -k, --keep            existing files from output directory are kept (not deleted) (default: False)
  -v, --verbose         verbose informative output (default: True)
Some more information about the options:
-i, --input
This option changes the location of the input app json files from the current default (the relative directory google/results). The directory location can be absolute or relative.
-o, --output
Similarly, this option changes the location of the output app json files (the current default is the relative directory google/enriched). Again, the directory location can be absolute or relative.
-b, --base
This option changes the base directory that is used to produce the absolute paths of the input and output directories. It only takes effect when the input or output directory is a relative path; otherwise it is ignored. The default base directory is the one containing the AppEnricher program.
-k, --keep
By default, the output directory is always created from scratch, so that it contains only the result of the current analysis. On many occasions, however, you might want to merge results coming from previous analyses: this is possible by activating this option, which does not delete the output directory and allows results to be merged (overwriting, in case of conflicts).
-v, --verbose
Like the other modules of the Mobosearch framework, AppEnricher produces minimal output by default, which also means that you might not get any output at all until the computation ends. This option tells AppEnricher to be more verbose and to print more information during the computation: it progressively shows which apps have been processed, so you can monitor the progress of the analysis.
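
To make the interplay between -i, -o, -b and -k concrete, here is a minimal Python sketch of how such options could be declared and applied. It is not the actual AppEnricher source: the function names build_parser, resolve and prepare_output are hypothetical, the -v flag is left out for brevity, and the defaults are simply copied from the help text above.

```python
import argparse
import os
import shutil


def build_parser():
    """Declare the -i/-o/-b/-k options with the defaults listed in the help text."""
    parser = argparse.ArgumentParser(
        prog="appenricher",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("-i", "--input", default="google/results",
                        help="the input json directory")
    parser.add_argument("-o", "--output", default="google/enriched",
                        help="the output json directory")
    parser.add_argument("-b", "--base", default=".",
                        help="base directory for relative directory paths")
    parser.add_argument("-k", "--keep", action="store_true",
                        help="existing files from output directory are kept")
    return parser


def resolve(base, path):
    """Return an absolute path; `base` is only used when `path` is relative."""
    if os.path.isabs(path):
        return path
    return os.path.abspath(os.path.join(base, path))


def prepare_output(path, keep):
    """Create the output directory, wiping previous content unless `keep` is set."""
    if not keep and os.path.isdir(path):
        shutil.rmtree(path)
    os.makedirs(path, exist_ok=True)


def main(argv=None):
    """Hypothetical entry point: parse options, resolve paths, prepare the output."""
    args = build_parser().parse_args(argv)
    input_dir = resolve(args.base, args.input)
    output_dir = resolve(args.base, args.output)
    prepare_output(output_dir, args.keep)
    return input_dir, output_dir
```

In this sketch, an absolute -i or -o value bypasses the base directory entirely, and -k only skips the deletion step: files written later with the same name would still overwrite the old ones, which matches the merging behaviour described above.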

Measures

Score measures are defined with special high-level definitions that let you write general scoring measures (so you can easily write your own measures combining privacy, security, and other information about the app).

In order to understand how to write a measure, we can start with a simple example and describe how it works. Take for instance the sample measure Privacy Aware, as currently defined in appenricher:

        "measure": {
            "name": "Privacy Aware",
            "description": {
                "short": "Avoid implicit marketing tracking",
                "full": "This measure penalizes apps for every user information that is "
                        "obligatorily gathered for advertisement or marketing"
            },
            "id": "org.mobosearch.scores.aware",
            "conditions": [
                {
                    "match": {
                        "uses": {
                            "Advertising or marketing": {
                                "mandatory": "*"
                            }
                        }
                    }
                },
                {
                    "score": 65536,
                    "match": {
                        "uses": {
                            "Undefined": True
                        }
                    }
                }
            ]
        }
	

First, the example shows the basic information structure of a measure: it is defined by a measure property that contains the name of the measure (via the name property). Then there is a textual description of the measure, expressed by the description property and its two subproperties short (the short textual description of the measure) and full (the longer textual description). Finally, the id property contains a string with the unique identifier of the measure.

After this information, the conditions property contains the actual definition of the measure. At a high level, the idea is to use a special kind of lazy structural matching to compute the final score. Let's explain how it works by discussing the previous measure. The conditions property contains a list of conditions, each expressed by a match rule (given by the match property) and, optionally, a numerical score (given by the score property). When no explicit score is provided, it is assumed to be 1: therefore, in the above example, writing the first condition without a score is equivalent to writing "score": 1.

The value of each match property is a json structure, which is used for structural pattern matching against the app json. Every time a match is found, the corresponding score is added: the final score is the sum of the scores of all the matches that occur (if there are no matches, the default value of zero is returned). Let's see how it works by going back to the previous example.

The first match condition checks whether, among the various data uses of an app, there is the use Advertising or marketing, containing a mandatory part of any kind. Here, the string value "*" is used as a structural wildcard pattern, which matches any substructure down to its final values. In natural language terms, the condition checks whether the app has some data that is unconditionally (mandatorily) used for ads/marketing purposes: for every such data item, the rule matches and produces a corresponding score (in this case, the default score 1), which is added to the overall score. So, after this rule is executed, the score is equal to the number of data items that are mandatorily used for advertisement or marketing purposes (either collected by the developer or shared, without distinction).

The second matching condition, as we can see, is instead checking whether the data uses are undefined: in this case (obviously terrible from a privacy viewpoint, given that the developer is not disclosing what data is collected!) we give a very high penalty score (65536).

Note how this way of defining measures makes it easy to write general or specific rules that depend on any structural information present in the app. To continue with the previous example, we could for instance modify the measure by adding another rule:

                {
                    "score": -1,
                    "match": {
                        "price": {
                            "amount": 0
                        }
                    }
                }

This rule checks whether the price of the app is 0 (that is, a free app), and if so it contributes a score of -1. In other words, if the app is free we give a bonus point, subtracting 1 from the penalties calculated by the rest of the measure. This simple example illustrates the general fact that a measure can reason globally on any information component of an app, not just on its privacy parts.
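
The scoring scheme described in this section can be illustrated with a short Python sketch. This is not the AppEnricher implementation, only one possible reading of the rules above, under several assumptions that are not stated in this document: the helper names count_leaves, count_matches and compute_score are hypothetical; the sample apps use an invented layout in which each mandatory part holds collected and shared lists of data items; and a pattern with several keys is assumed to require all of them, multiplying their match counts.

```python
def count_leaves(data):
    """Count the final (non-container) values inside a nested structure."""
    if isinstance(data, dict):
        return sum(count_leaves(value) for value in data.values())
    if isinstance(data, (list, tuple)):
        return sum(count_leaves(value) for value in data)
    return 1


def count_matches(pattern, data):
    """Return how many times `pattern` structurally matches `data`."""
    if pattern == "*":
        # Wildcard: one match per final value found below this point.
        return count_leaves(data)
    if isinstance(pattern, dict):
        if not isinstance(data, dict):
            return 0
        total = 1
        for key, sub_pattern in pattern.items():
            if key not in data:
                return 0  # a required key is missing: no match at all
            total *= count_matches(sub_pattern, data[key])
        return total
    # Plain value: matches only an equal value.
    return 1 if pattern == data else 0


def compute_score(conditions, app):
    """Sum score * number of matches over all conditions (score defaults to 1)."""
    return sum(
        condition.get("score", 1) * count_matches(condition["match"], app)
        for condition in conditions
    )


# The two conditions of Privacy Aware plus the free-app bonus rule.
CONDITIONS = [
    {"match": {"uses": {"Advertising or marketing": {"mandatory": "*"}}}},
    {"score": 65536, "match": {"uses": {"Undefined": True}}},
    {"score": -1, "match": {"price": {"amount": 0}}},
]

# Hypothetical apps; the real app json layout may differ.
free_app = {
    "price": {"amount": 0},
    "uses": {
        "Advertising or marketing": {
            "mandatory": {"collected": ["Email address", "Location"],
                          "shared": ["Device ID"]},
            "optional": {"collected": ["Name"]},
        }
    },
}
undisclosed_app = {"price": {"amount": 1.99}, "uses": {"Undefined": True}}
paid_app = {
    "price": {"amount": 0.99},
    "uses": {"Analytics": {"mandatory": {"collected": ["Crash logs"]}}},
}

print(compute_score(CONDITIONS, free_app))         # 3 mandatory items - 1 bonus = 2
print(compute_score(CONDITIONS, undisclosed_app))  # 65536
print(compute_score(CONDITIONS, paid_app))         # no rule matches: 0
```

Under these assumptions, the free app collects three penalty points for its mandatory advertising data and gets one back for being free, the app with undefined data uses receives the large penalty, and the paid app with no advertising uses falls back to the default score of zero.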

The Code

The source code of the module can be directly downloaded here (note that at public launch we will also provide a GitHub link).