Impresso Python

Impresso Project Logo

Impresso is a library designed to facilitate interaction with the Impresso dataset. It offers a comprehensive set of classes for API interaction and a variety of tools to streamline data manipulation and analysis.

Installation and prerequisites

The Impresso python library can be installed using pip:

pip install impresso

The library requires Python version 3.10 or higher. It also depends on several packages commonly found in Jupyter environments, such as matplotlib and pandas.

At a glance

Create a session

from impresso import connect
client = connect()
results = client.search.find(term="moon landing")
results

results will display a summary of the result including a preview of a pandas data frame with the result data. Use df property to access the full data frame:

results.df

Pagination

Monthly Quota

Every Impresso user has a monthly quota of the content items they can access. The quota is currently set at 200,000 content items. Paginating through a large result set may see you hitting the quota limit fairly soon. Make sure to check the size of the full result set before fetching all pages.

By default every result object is the first page of the full result set. Use the following code to go through the rest of the pages:

import pandas as pd
# Get first page with 100 items per page
results = impresso.search.find(term="landing", limit=100)
print(f"Full result contains {results.total} items.")

full_df = results.df

# Iterate through all pages
for page in results.pages():
    full_df = pd.concat([full_df, page.df])

full_df

Accessing transcripts

Content item transcripts can be large and are not returned by default. To access a transcript, request it by content item ID:

result = client.content_items.get("NZG-1877-10-20-a-i0024")
result.df['text.content'][0]

See content item on Web App (shortcut)

To see a specific content item in the Web App, look for the link "See this result in the Impresso App" in the rendered result summary:

result = client.content_items.get("NZG-1877-10-20-a-i0024")
result

Create a session

impresso.connect(public_api_url=None, persisted_token=True)

Connect to the Impresso API and return a client object.

from impresso import connect

impresso = connect()
Parameters:
  • public_api_url (str, default: None ) –

    The URL of the Impresso API to connect to. By default using the default URL set in the config file (~/.impresso_py.yml) or the Impresso default URL ({DEFAULT_API_URL}).

  • persisted_token (bool, default: True ) –

    Whether to read and write token to the user directory (~/.impresso_py.yml). This is useful to avoid having to re-enter the token each time the Jupiter notebook is restarted.

About Impresso

Impresso project

Impresso - Media Monitoring of the Past is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. CRSII5_173719 and the second project (2023-2027) by the SNSF under grant No. CRSII5_213585 and the Luxembourg National Research Fund under grant No. 17498891.

Copyright (C) 2024 The Impresso team.

License

This program is provided as open source under the GNU Affero General Public License v3 or later.