Content Hash List API (v2.0.0)

The Content Hash Fingerprinting API provides access to a database of content fingerprints for use in content moderation systems. Hashes are published using four algorithms: MD5, SHA256, SHA512, and PDQ.

Key features and use cases

The TCAP Archive is a repository of known terrorist or violent extremist content (TVEC) media files, including images, videos and documents. The TCAP Archive Hash List API allows platforms to ingest the hashes produced from these media files in bulk, so they can use them in their content moderation processes.

The TCAP Archive’s hashes are distinct from any existing TVEC hash lists, as they complement Tech Against Terrorism’s proactive monitoring of terrorist internet usage by its team of open-source intelligence specialists. By leveraging this expertise, alongside a suite of automated monitoring capabilities, the TCAP Archive hash list reflects content created and uploaded over a number of years by a range of violent Islamist and violent far-right terrorist entities.

Authentication

To use any of the Hash List endpoints you need to be an onboarded Hash List TCAP user. These endpoints sit inside the main TCAP backend, so you can access all of them with a standard TCAP JWT.

To obtain a token, make a request to the TCAP authentication endpoint with your username and password.

POST https://beta.terrorismanalytics.org/token-auth/tcap/
{
  "username": "YOUR_TCAP_USERNAME",
  "password": "YOUR_TCAP_PASSWORD"
}

Response

The Authentication endpoint returns the following data on each request:

  • token : String Token to send as a Bearer token on each request you make to the API

  • user : User Your system user information
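The token request above can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: it assumes the credentials are sent as a JSON body and that the token is passed back in an `Authorization: Bearer` header. The helper names `build_token_request`, `bearer_headers` and `fetch_token` are hypothetical.

```python
import json
import urllib.request

# Authentication endpoint as given in this documentation.
TOKEN_URL = "https://beta.terrorismanalytics.org/token-auth/tcap/"


def build_token_request(username: str, password: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for a TCAP token."""
    # Assumption: the credentials are sent as a JSON-encoded body.
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def bearer_headers(token: str) -> dict:
    """Headers that carry the token as a Bearer token on later requests."""
    return {"Authorization": f"Bearer {token}"}


def fetch_token(username: str, password: str) -> str:
    """Send the request and return the `token` field of the response."""
    with urllib.request.urlopen(build_token_request(username, password)) as resp:
        return json.load(resp)["token"]
```

Splitting request construction from sending keeps the network call in one place, so the request shape can be inspected without contacting the server.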

Hash List by Ideology

GET /hash-list/v2/all

This Hash List API endpoint retrieves a hash list file filtered by a specified ideology.

Parameters

  • ideology: (Type: String, Required: Yes): Specifies the ideology to filter the results.
GET /api/hash-list/:ideology  <'islamist' | 'far-right' | 'all'>

Response

The Hash List endpoint returns the following data on each request:

  • count : Integer Total number of hash records available

  • next : String URL of the next page of results

  • previous : String URL of the previous page of results (null if first page)

  • checkpoint : String Timestamp-based checkpoint identifier for synchronization

  • results : Array Array of Hash objects

Hash Object Fields

  • hash_digest : String The computed hash value

  • algorithm : "MD5" | "SHA256" | "SHA512" | "PDQ" The algorithm used to generate the hash

  • ideology : 'islamist' | 'far-right' | 'all' Content classification category

  • file_type : String Source file format

  • deleted : Boolean Whether the file has been removed from the system

  • updated_on : Float Unix timestamp of the last update

  • id : Integer Record identifier (shown in the example response)
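For clients that prefer typed records over raw dictionaries, the fields above map naturally onto a small data class. The sketch below is illustrative only: the class name `HashRecord` is hypothetical, and `id` is treated as optional because it appears in the example response rather than being guaranteed by this field list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HashRecord:
    """One entry of the `results` array."""

    hash_digest: str
    algorithm: str          # "MD5" | "SHA256" | "SHA512" | "PDQ"
    ideology: str
    file_type: str
    deleted: bool
    updated_on: float       # Unix timestamp of the last update
    id: Optional[int] = None

    @classmethod
    def from_dict(cls, raw: dict) -> "HashRecord":
        """Convert one decoded JSON object into a HashRecord."""
        return cls(
            hash_digest=raw["hash_digest"],
            algorithm=raw["algorithm"],
            ideology=raw["ideology"],
            file_type=raw["file_type"],
            deleted=bool(raw["deleted"]),
            updated_on=float(raw["updated_on"]),
            id=raw.get("id"),  # absent keys become None
        )
```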

Pagination

The API implements cursor-based pagination using timestamp and ID pairs. Results can be traversed using the next and previous URLs provided in the response.

Query Parameters

  • limit: Number of results per page (default: 1000)

  • offset: Starting position for pagination

  • order: Sort order for results (asc/desc)

  • after: Cursor value for pagination (format: timestamp,id)
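The query parameters above can be attached to an endpoint URL as in the following sketch. The function name `with_query` is hypothetical, and the endpoint URL is passed in by the caller rather than hard-coded. The standard library percent-encodes the comma in the `after` cursor; that this encoded form is accepted by the server is an assumption.

```python
from typing import Optional
from urllib.parse import urlencode


def with_query(
    endpoint: str,
    limit: Optional[int] = None,
    offset: Optional[int] = None,
    order: Optional[str] = None,   # "asc" or "desc"
    after: Optional[str] = None,   # cursor in the form "timestamp,id"
) -> str:
    """Return the endpoint URL with only the supplied parameters appended."""
    params = {"limit": limit, "offset": offset, "order": order, "after": after}
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{endpoint}?{query}" if query else endpoint
```

Parameters left as `None` are omitted, so the server-side defaults (for example a limit of 1000) apply.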

Implementation Notes

  • Multiple hashing algorithms provide redundancy and enhanced detection capabilities

  • Checkpoint field enables efficient delta updates for client-side caching

  • Each hash entry includes metadata for content categorization and tracking

  • Real-time updates reflected through updated_on timestamps

  • Deleted flag allows for soft deletion while maintaining hash history
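One way to act on the `deleted` flag and `updated_on` timestamp in a local cache is sketched below. It is an illustration under stated assumptions, not a prescribed design: records are keyed by `(algorithm, hash_digest)`, the newest `updated_on` wins, and deleted records are kept in the store but excluded from matching. The function names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Key = Tuple[str, str]  # (algorithm, hash_digest)


def apply_records(store: Dict[Key, dict], records: Iterable[dict]) -> Dict[Key, dict]:
    """Upsert records into the store, keeping the most recently updated copy."""
    for rec in records:
        key = (rec["algorithm"], rec["hash_digest"])
        current = store.get(key)
        if current is not None and current["updated_on"] > rec["updated_on"]:
            continue  # incoming record is older than the one already held
        store[key] = rec
    return store


def active_digests(store: Dict[Key, dict], algorithm: str) -> Set[str]:
    """Digests for one algorithm that are not flagged as deleted."""
    return {
        digest
        for (algo, digest), rec in store.items()
        if algo == algorithm and not rec["deleted"]
    }
```

Keeping deleted entries in the store preserves history, while `active_digests` ensures they no longer take part in matching.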

Example Response

{
    "count": 19676,
    "next": "http://beta.terrorismanalytics.org/hash-list/v2/all?<params>",
    "previous": null,
    "checkpoint": "1730213563.621023,29152",
    "results": [
        {
            "hash_digest": "baf781254eb82811cdf3fe4751240eb8",
            "algorithm": "MD5",
            "ideology": "Far-right",
            "file_type": "mp4",
            "deleted": false,
            "updated_on": 1730204429.302388,
            "id": 28023
        }
        ...
    ]
}
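The `checkpoint` and `updated_on` values in the example response can be decoded as in the sketch below. It assumes the checkpoint string has the same `timestamp,id` layout as the `after` cursor, which the example value suggests but this documentation does not state outright. The helper names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Tuple


def parse_checkpoint(checkpoint: str) -> Tuple[float, int]:
    """Split a "timestamp,id" checkpoint into its two parts."""
    # Assumption: the part before the comma is a Unix timestamp and the
    # part after it is an integer record id.
    timestamp, record_id = checkpoint.split(",", 1)
    return float(timestamp), int(record_id)


def to_utc(unix_seconds: float) -> datetime:
    """Convert a Unix timestamp such as `updated_on` to an aware UTC datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
```

For instance, `to_utc(parse_checkpoint("1730213563.621023,29152")[0])` gives a datetime on 29 October 2024 (UTC).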

Best Practices

  1. Implement local caching using the checkpoint mechanism

  2. Process updates incrementally using the pagination system

  3. Consider implementing parallel processing for multiple hash algorithms

  4. Store hash values in their original format to maintain precision

  5. Monitor the deleted flag for deprecated hash values
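Practices 1 and 2 can be combined into a small loop that follows the `next` URLs and remembers the last `checkpoint` seen. In this sketch the HTTP call is injected as `fetch_json`, a hypothetical callable that takes a URL and returns the decoded JSON page, so the traversal logic stays independent of any HTTP library. Reusing the stored checkpoint as the `after` parameter on a later run is an assumption about how delta updates are meant to work.

```python
from typing import Callable, Iterator, List, Optional, Tuple


def iter_pages(first_url: str, fetch_json: Callable[[str], dict]) -> Iterator[dict]:
    """Yield each page, following `next` until it is null or missing."""
    url: Optional[str] = first_url
    while url:
        page = fetch_json(url)
        yield page
        url = page.get("next")


def sync(
    first_url: str, fetch_json: Callable[[str], dict]
) -> Tuple[List[dict], Optional[str]]:
    """Collect all records and return them with the last checkpoint seen."""
    records: List[dict] = []
    checkpoint: Optional[str] = None
    for page in iter_pages(first_url, fetch_json):
        records.extend(page["results"])
        checkpoint = page.get("checkpoint") or checkpoint
    return records, checkpoint
```

A caller would persist the returned checkpoint alongside the cached hashes and use it to start the next incremental run.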

Usage With Meta's ThreatExchange

If you want to use the Hash List through Meta's ThreatExchange, you can create a collaboration configuration for our API, then fetch and compare PDQ image and MD5 video hashes.

Step 1 - Install threatexchange

$ pip install threatexchange

Step 2 - Configure the default credentials

$ threatexchange config api tat --credentials '<TCAP_USERNAME>' '<TCAP_PASSWORD>'

Step 3 - Set up config

$ threatexchange config collab edit tat --create 'TAT'

Step 4 - Fetch hashes with verbose logging

$ threatexchange -v fetch

Step 5 - View dataset

$ threatexchange dataset

Step 6 - Match a piece of content

$ threatexchange match ~/path/to/image.jpg

ThreatExchange docs