Question

How to save scraped data to DigitalOcean with Scrapy (Python)

I have written a scraper, uploaded it to GitHub, and run the Scrapy spiders on a DigitalOcean Droplet with ScrapeOps. Now I want to save the scraped data back to DigitalOcean. Could anyone guide me through the steps: what to enter in the Scrapy settings, how to create a Space in DigitalOcean, and how to save the scraped data to that Space?



Bobby Iliev
Site Moderator
March 7, 2024

Hey!

To save the scraped data to DigitalOcean Spaces using Scrapy and ScrapeOps, you could do the following:

  1. Create a DigitalOcean Spaces Bucket:

    • Log in to your DigitalOcean account.
    • Go to the ‘Spaces’ section and create a new space. During the creation process, you’ll choose a data center region, a unique name for your space, and whether it should be public or private.
    • Once the space is created, note down your Space name and the endpoint URL.

    https://docs.digitalocean.com/products/spaces/how-to/create/

  2. Get Your Access Keys:

    • You need to generate Spaces access keys to authenticate your requests. Go to the API section in the DigitalOcean control panel.
    • Generate a new Spaces access key and secret. Note these down securely.
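  3. Configure Scrapy to Use DigitalOcean Spaces: Scrapy supports Amazon S3 as a feed storage backend, and since DigitalOcean Spaces is compatible with S3, you can point Scrapy’s S3 feed storage at your Space. The access keys and the endpoint go in the AWS_* settings rather than inside the feed options (the uri_params feed option expects a function path, not a dict of credentials). Configure your Scrapy settings.py to export data to your Space:

    https://docs.scrapy.org/en/latest/topics/feed-exports.html#storages

    # Credentials and endpoint are read from these settings by the S3 feed storage
    AWS_ACCESS_KEY_ID = 'your_access_key'
    AWS_SECRET_ACCESS_KEY = 'your_secret_key'
    AWS_ENDPOINT_URL = 'https://your_region.digitaloceanspaces.com'  # Replace your_region with the actual region

    FEEDS = {
        # %(name)s is the spider name, %(time)s the timestamp of the run
        's3://your_space_name/your_folder/%(name)s/%(time)s.json': {
            'format': 'json',
            'store_empty': False,
            'encoding': 'utf8',
        },
    }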
    

    Replace placeholders (your_space_name, your_folder, etc.) with your actual Space details.

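  4. Install AWS SDK Packages: Scrapy’s S3 feed storage relies on boto3, the AWS SDK for Python, to talk to S3-compatible services. Install it in the same environment as Scrapy before running the spider (botocore is pulled in as a dependency of boto3, so listing it is optional):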

    pip install boto3 botocore
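
One way to keep the keys from step 2 out of your GitHub repository is to read them from environment variables in settings.py. The sketch below is only an illustration: the variable names SPACES_KEY, SPACES_SECRET, SPACES_REGION and SPACES_BUCKET, the helper spaces_settings and the scraped/ folder are made up for this example, and it assumes a Scrapy version whose S3 feed storage reads AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_ENDPOINT_URL from the settings.

```python
import os


def spaces_settings(env=os.environ):
    """Build Scrapy settings for a DigitalOcean Space from environment variables.

    The SPACES_* variable names are hypothetical; pick whatever names you like.
    """
    region = env["SPACES_REGION"]   # for example "nyc3"
    bucket = env["SPACES_BUCKET"]   # the name of your Space
    return {
        "AWS_ACCESS_KEY_ID": env["SPACES_KEY"],
        "AWS_SECRET_ACCESS_KEY": env["SPACES_SECRET"],
        # Spaces endpoints follow the pattern <region>.digitaloceanspaces.com
        "AWS_ENDPOINT_URL": f"https://{region}.digitaloceanspaces.com",
        "FEEDS": {
            # %(name)s and %(time)s are filled in by Scrapy at export time
            f"s3://{bucket}/scraped/%(name)s/%(time)s.json": {
                "format": "json",
                "store_empty": False,
                "encoding": "utf8",
            },
        },
    }
```

In settings.py you could then call globals().update(spaces_settings()) after exporting the four variables on the Droplet, so the secret never appears in the code you push.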
    

With the settings properly configured, execute your Scrapy spider. The scraped data will be uploaded to your specified DigitalOcean Space in JSON format.
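
To check that the upload worked, you could list the objects in the Space with boto3 after the spider finishes. This is a rough sketch rather than an official tool: make_spaces_client and list_feed_keys are hypothetical helper names, and the region, Space name and prefix in the usage comment are placeholders.

```python
def make_spaces_client(region, access_key, secret_key):
    """Create a boto3 S3 client that points at a DigitalOcean Spaces endpoint."""
    import boto3  # imported here so the listing helper can be used without boto3

    return boto3.client(
        "s3",
        region_name=region,
        endpoint_url=f"https://{region}.digitaloceanspaces.com",
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )


def list_feed_keys(client, bucket, prefix=""):
    """Return every object key stored under `prefix` in the given Space."""
    paginator = client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty prefix carry no "Contents" entry
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


# Example usage (placeholders, replace with your own values):
# client = make_spaces_client("your_region", "your_access_key", "your_secret_key")
# print(list_feed_keys(client, "your_space_name", "your_folder/"))
```

If the list comes back empty, double-check the Space name, the region in the endpoint URL, and that the spider actually yielded items (store_empty is False, so empty runs upload nothing).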

Let me know how it goes!

Best,

Bobby
