S 4.393 Comprehensive input and output validation for web applications

Initiation responsibility: Head of IT, IT Security Officer

Implementation responsibility: Developer, Administrator

All data passed to the web application, regardless of the code or the form of transmission, must be deemed potentially hazardous and filtered accordingly. Efficient protection against common attacks can be achieved by reliably and thoroughly filtering the input and output data with the help of a validation concept. In this, both the input data of users to the web application and the output data of the web application to the client should be filtered and transformed (output encoding). This way, it is ensured that only expected and no malicious data is processed and output by the web application.

If a less restrictive use of data filters is necessary for individual functions, this must be defined and documented expressly when accessing the data. Additionally, it is possible to use context-sensitive filters in the business logic of the application or in the background systems.

The following items should be taken into consideration when implementing and configuring the validation component in order to securely process the data.

Identification of the data

In order that input and output data of a web application can be validated comprehensively, all data structures to be processed (e.g. email address) and the values admissible therein must be identified initially. A corresponding validation routine should be implemented for each data structure. Along with the data structure, the data processing method should also be included (e.g. forwarding to an interpreter, return to the client, storage in a database).

Consideration of all data and data formats

The validation component should take into consideration all data formats to be processed and interpreters used. Typical data formats for web applications include personal data (name, phone number, zip code), photos, PDF files, and formatted texts, for example. Typical interpreters for data processed or output by web applications include HTML renderers, SQL, XML, LDAP interpreters, and the operating system, for example.

The validity of the data can be checked using different techniques. For example, the validation component may check the value range of the input or regular printouts may be used in order to validate admissible characters and the length of the data to be expected. The validity of XML data may be checked with the help of the corresponding XML scheme, amongst other things. Furthermore, frameworks and libraries provide corresponding validation features for common data formats.

The following characters are normally interpreted by programs used in web applications and may therefore be used for injecting malicious code. This is why they should be taken into consideration during filtering.

Nil value, newline, carriage return, inverted commas, commas, slashes, space characters, tab characters, higher than and lower than, XML, and HTML tags.

This list is by no means complete. Furthermore, the interpreter character sets (e.g. SQL syntax) may vary for different products. Examples for critical characters are listed in section Potentially dangerous characters for interpreters in Resources for the Web application module.

Along with the actual user data (e.g. form parameters in GET or POST variables), data of different origin (secondary data) must be validated as well. Such precautions include, for example:

Names of form variables (these may be manipulated randomly just like the value of the form variables),
HTTP header fields (header fields in HTTP requests and responses should only contain ASCII characters and no newline characters such as \r\n, for example),
SessionIDs (e.g. from cookies).

Automated calls by the client, e.g. by Ajax and/or Flash scripts or JSON requests, must also be validated.

On the background systems, the data must be validated (possibly again). This is also applicable if data is read out after a successful writing process into the database, because the data also may have been changed in the database in the meantime.

Furthermore, attack techniques are known within the framework of which malicious code is transmitted using a channel that cannot be controlled by the web application (e.g. FTP, NFS). If an attacker is able to change or create files integrated by the web application using these services, malicious code may be embedded using this detour. Within the framework of so-called cross-channel scripting, JavaScript code is introduced this way executed by the browser similarly to persistent cross-site scripting. Therefore, all data of the web application should always be validated before being output to the users, regardless of the source.

Server-side validation

Normally, the users use generic clients (e.g. web browser) in order to access the web application. These clients are not covered by the security context of the web application, but are controlled by the users. Therefore, the data validation must be implemented as a server-side security mechanism on a trustworthy IT system.

If data is additionally processed on the client by code of the web application (e.g. JavaScript code), this data should also be validated on the client. In so doing, the delivered scripts of the web application should also provide the corresponding validation routines. If the data is sent to the server within the framework of the downstream processing process, it must be taken into consideration that the client-side validation may not be a substitute of the server-side validation.

Validation approach

Data validation is differentiated in a white list approach and a black list approach.

Within the framework of the white list approach, only the data contained in the list is admissible. In this, rules are developed on the basis of an amount of characters as low as possible, admitting data within a defined range of characters and rejecting data containing deviating characters. Here, complex rules should be imaged by sequentially using simple rules.

Conversely, within the framework of a black list approach the data contained in the list is deemed inadmissible and rejected . All data not expressly inadmissible is accepted within the framework of this approach.

However, the black list approach entails the risk that not all variations of inadmissible data are taken into consideration and recognised. Therefore, the white list approach should be preferred over the black list approach.

Harmonisation before validation

Data may have various encodings (e.g. UTF-8, ISO 8859-1) and notations (e.g. for UTF-8 "." = "2E" = "C0 AE"). Depending on the encoding scheme applied, one value may have different interpretations. If the data is validated without taking into consideration the encoding and the notation, malicious data is possibly not detected and filtered. Therefore, all data should be brought to a uniform, standardised form before validation. This process is called the harmonisation of data. The data represented this way is then further processed. Additionally, when using AJAX, the innerText property should be used for subsequent loading instead of innerHTML, because innerText automatically encodes.

Moreover, the encoding scheme should be configured explicitly when data is delivered by the web application (e.g. using the content type header: charset=ISO-8859-1).

Context-sensitive data masking

If potentially malicious data must be processed by the web application (e.g. characters with a meaning for used interpreters) and filtering is therefore not possible, this data must be masked and converted to another form of representation. In this masked form, the data is no longer interpreted as executable code. Since masking is interpreter-specific, all interpreters used must be taken into consideration (e.g. SQL, LDAP). Accordingly, masking must be performed in a context-sensitive manner for the expected input and output formats and the interpreter language. Due to the complexity and the specific requirements of different interpreter languages, it is recommendable to use specialised libraries for masking.

All characters classified as insecure for the intended interpreter should be masked. This includes, e.g.

unexpected JavaScript and HTML to be delivered to the client (web browser),
SQL statements inserted into the database in an unauthorised manner (e.g. from inputs in form fields),
commands for the operating system (e.g. in manipulated HTTP variables).

Masking can be performed by converting the affected data and/or meta-characters of the respective interpreter language into so-called character references. The following example illustrates selected HTML characters with the corresponding character references:

& => &
< => <
>=> >
"=> "
'=> '

It must be noted that & characters are replaced in the first run, since this character is re-used as a meta-character in other character references.

Using a separate markup for filtering HTML tags

If the web application requires HTML formatting tags in user input (e.g. for formatting user contributions), admissible HTML tags should be differentiated and filtered from problematic tags (see also section Context-sensitive data masking).

This approach entails the high risk of overlooking problematic tags (for example <script>). Therefore, the alternative approach of defining separate markup tags for the markup of the user (e.g. BBCode) should be preferred. These markup tags are then compiled by the application into the related HTML tags. Traditional tags and/or problematic characters are filtered continuously.

Using { and } instead of < and > constitutes one possible method if a simple markup is to be admitted. In this case, bold would be {F}This is bold{/F} and a photo could be positioned this way: {img src=/images/img.gif width=1 height=1 img}.

Here, the conversion in HTML must not only replace curved brackets by angle brackets, but each tag must be considered as a whole:

{img to <img,
img} to >,
src=Datei to src="Datei" (with file to be filtered additionally).

If HTML tags are admissible, it must be observed as a matter of principle that no iFrame tags are allowed. With the help of iFrames, any content can be inserted into the website. Therefore, iFrames must not be used.

Handling erroneous input (sanitising)

Instead of rejecting data due to an unexpected file format, erroneous input may be corrected and transformed automatically (sanitising). This is intended to allow a user-friendly data input in different writing styles. For further processing, the data can be freed from unexpected characters (e.g. the phone number (0049)-201-12345678 may be converted to the numerical format 004920112345678).

This process may include deleting, replacing, or masking characters (see also section Context-sensitive data masking).

Sanitising entails the risk that changes to the data result in a new complexity, new attack vectors, or misinterpretation. Therefore, sanitising should be avoided where possible and only used in cases where any misuse of sanitising can be ruled out.

If the web application requires data correction, the deliberate manipulation of data (e.g. of the SessionID by an attacker) should not be corrected, but rejected. Furthermore, input data made impossible when the browser is operated as intended should be rejected as a matter of principle. This includes, e.g.:

additional or missing form parameters,
session cookies with unexpected characters or invalid length,
unexpected values for the transmission of form data from predefined HIDDEN, SELECT, or CHECKBOX fields,
the transmission route of the parameter (e.g. GET, POST, cookie) does not match the specifications made by the application.

When cleaning the data, the nested input of attack vectors should be taken into consideration. For example, the filter s/<script>//g; seemingly reasonable at the first glance is problematic (here written in Perl RegEx syntax) in order to delete <script> tags from the input. However, this filter can be bypassed by means of a nested input (e.g. <sc<script>ript>). Therefore, recursive filtering is required. In cases of doubt, the input data must be rejected.

As a matter of principle, the requested activity should also be cancelled and a neutral error message should be issued if the data is rejected (see also S 4.400 Restrictive disclosure of security-relevant information in web applications). The session should be made invalid additionally for web applications with high protection requirements.

Review questions:

Are all data (input and output data) and data streams of the web application (e.g. between user, web applications and background systems) included in the validation?
Does the web application also include secondary data (such as SessionIDs) in the validation?
Does the web application validate the data on the server side on a trustworthy IT system?
Does the web application harmonise the data before validation?
Is the data validated in a context-sensitive manner taking into consideration the data interpreter to be expected within the web application?
For web applications with automatic erroneous input handling (sanitising): Is sanitising implemented securely?