How To Obtain a pandas DataFrame from an Unordered API Endpoint

The author selected the Free and Open Source Fund to receive a donation as part of the Write for DOnations program.

Introduction

Similar to an Excel spreadsheet or SQL table, a pandas DataFrame is a two-dimensional data structure that presents data in rows and columns. DataFrames are used widely in data-intensive fields such as scientific computing, data science, machine learning, and data analysis.

Using Python, you can convert data from CSV files, Excel files, or API endpoints into a pandas DataFrame and then work with the resulting DataFrame. While converting Excel and CSV files tends to be straightforward, API endpoints vary widely, so building DataFrames from them can be challenging.

Some API endpoints present their data as reader-friendly dictionaries, where the keys represent column names and the values represent the elements belonging to those columns. Other endpoints, however, do not make the column names and their values obvious, and converting such an unordered API endpoint into a DataFrame can be challenging.

In this tutorial, you will obtain data from an unordered endpoint and present it in a format suitable for a pandas DataFrame. First, you will extract data from some conventional API formats to provide context for converting ordered API data into a DataFrame. Then, you will work with an unordered API endpoint: you will extract column names and their respective elements to create a dictionary. Finally, you will pass this dictionary to a pandas DataFrame that you can use for data analysis.

Prerequisites

To follow this tutorial, you will need:

- A Python 3 environment with Jupyter Notebook set up on your server or local machine. This tutorial assumes a server-based setup with a virtual environment located at ~/environments/my_env.
- The pandas and requests modules, which you will install in Step 1.
- Familiarity with Python and with basic data structures such as lists and dictionaries.

Step 1 — Preparing Your Jupyter Notebook Environment

For this tutorial, you’ll use Jupyter Notebook to work with the data. Jupyter Notebook is helpful for iterative coding since it allows you to write a small snippet of code, run it, and view the result.

To follow this tutorial in your Jupyter Notebook, you will need to open a new Notebook and install the required dependencies, which you will do in this step.

In the prerequisites, you set up a Jupyter Notebook environment on your server. Once you are logged into your server, activate the virtual environment:

source ~/environments/my_env/bin/activate

Then start the Jupyter Notebook application:

jupyter notebook

Note: If you are following the tutorial on a remote server, you can use port forwarding to access your Jupyter Notebook in the browser of your local machine.

Open a new terminal and enter the following command:

ssh -L 8888:localhost:8888 your_non_root_user@your_server_ip

Upon connecting to the server, navigate to the link provided by the output to access your Jupyter Notebook. Keep this terminal open throughout the remainder of this tutorial.

Once Jupyter Notebook is running and you have connected to it, you will see a browser-based user interface. From the New dropdown menu, select the Python 3 (ipykernel) option, which will open a new tab with an untitled Python notebook. Give your file a name; in this tutorial, the file is named convert_data.ipynb.

Then, in the first cell of your browser’s Jupyter Notebook, use pip to install the necessary dependencies:

convert_data.ipynb
pip install pandas
pip install requests

The pandas module provides the DataFrame data structure, while the requests module lets you fetch data from a URL.

After adding the commands to your Jupyter Notebook, press Run to execute them.

Your Jupyter Notebook will provide a running output to indicate that the dependencies are being downloaded. A new input cell will be available below this output, which is where you will run the next lines of code.

Next, import the modules you just installed by running the following commands:

convert_data.ipynb
import pandas as pd
import requests

pd serves as a shortcut for the pandas module: from now on, you can call pandas by writing just pd.

In this step, you have prepared your Jupyter Notebook environment. Next, you will perform some basic conversions to create DataFrames from dictionaries.

Step 2 — Converting Ordered API Data into pandas DataFrame

Before obtaining data from an unordered API endpoint, you will first create DataFrames from dictionaries of different formats and organized API data at a URL.

In many API endpoints, the keys usually represent the column names, and the values of those keys represent the elements belonging to the respective columns. However, some API endpoints are not presented in this format, which means you will need a method for extracting the column names and their corresponding data to create a readable dictionary from which a pandas DataFrame can be created. Reviewing some common methodologies for converting conventional API endpoints into a pandas DataFrame will help you with this task.

In this step, you will use some common methods to create DataFrames from dictionaries of different formats.

Method 1 — Creating a DataFrame from a List of Dictionaries

In this method, each dictionary within the list represents a row of data in the final DataFrame. The DataFrame structure will feature keys that represent the column names with values for the elements belonging to the columns.

For this method, you’ll create a DataFrame with the columns Age, ID, and Name along with the information of three individuals.

In a new cell, run the following code to format a list of dictionaries assigned to the variable dict1 (feel free to replace the name with your own):

convert_data.ipynb
dict1=[{"Age":41,"ID":1234,"Name":"Bella"},{"Age":24,"ID":5678,"Name":"Mark"},{"Age":26,"ID":2341,"Name":"John"}]

data=pd.DataFrame(dict1)

data.head()

You assign a list of dictionaries to the dict1 variable. Each dictionary within the list represents a single row of the resultant DataFrame. The keys Age, ID, and Name represent the column names. The values assigned to each key represent that column’s value for that row. You then pass this dictionary to the pandas module to create a DataFrame.

The following DataFrame will print:

Output
   Age    ID   Name
0   41  1234  Bella
1   24  5678   Mark
2   26  2341   John

This DataFrame was created from the list of dictionaries. The keys you defined in each dictionary within the dict1 list are the column names in the obtained DataFrame, while the values in these dictionaries represent the row-wise data.

Method 2 — Creating a DataFrame from a Single Dictionary

In the previous method, you defined the data based on rows, which led to typing out several dictionaries in a list. Each dictionary represented row-wise data, where the keys are the column names and the values are the elements belonging to the columns.

Now you will define the data in a column-wise manner. This time, you’ll use a dictionary whose keys represent column names but whose values are lists representing the vertical data for that column.

To use this method, you can use the same data you used in the previous section. Run the following lines in a new cell:

convert_data.ipynb
dict2={"Age":[41,24,26],"ID":[1234,5678,2341],"Name":["Bella","Mark","John"]}

data=pd.DataFrame(dict2)

data.head()

Here, you define column-wise data using a dictionary. The keys of the dict2 dictionary are the column names: Age, ID, and Name. The value of each key is a list representing the data for each column.

The following DataFrame will print to the screen:

Output
   Age    ID   Name
0   41  1234  Bella
1   24  5678   Mark
2   26  2341   John

The keys of the dict2 dictionary represent the column names (Age, ID, and Name), and each column is populated with the list you defined as the value for each key. For example, the Age column has the values 41, 24, and 26.

As expected, both methods produce the same result: whether you define the data row-wise or column-wise, you end up with the same DataFrame.
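
If you want to confirm this equivalence programmatically, you can compare the two DataFrames with the equals method; a quick sketch, assuming the dict1 and dict2 variables from the cells above:

convert_data.ipynb
df_rows = pd.DataFrame(dict1)     # row-wise definition
df_columns = pd.DataFrame(dict2)  # column-wise definition

# equals checks that shape, labels, and values all match
df_rows.equals(df_columns)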

Method 3 — Creating a DataFrame with API Data at a URL

In this method, you will create a DataFrame with API data extracted from a URL. Within API-related URLs, the data is often presented as a list of dictionaries, similar to the data used in Method 1. But instead of hard-coding the data, you’ll fetch it from the URL using the requests module.

For this method, you’ll use a URL for a JSON API: Countries with Capitals, which presents a list of countries and their capitals in an API format. To view the API, visit the following URL in your browser:

https://raw.githubusercontent.com/dariusk/corpora/master/data/geography/countries_with_capitals.json

Here is the beginning portion of the API data presented by the URL:

List of Countries with Capitals (API Data from the Countries with Capitals URL)
{
  "description": "A list of countries and its respective capitals.",
  "countries": [
    {"name":"Afghanistan", "capital":"Kabul"},
    {"name":"Albania", "capital":"Tirana"},
    {"name":"Algeria", "capital":"Algiers"},
    {"name":"Andorra", "capital":"Andorra la Vella"},
    {"name":"Angola", "capital":"Luanda"},
    {"name":"Antigua & Barbuda", "capital":"St. John's"},
    {"name":"Argentina", "capital":"Buenos Aires"},
    {"name":"Armenia", "capital":"Yerevan"},
    {"name":"Australia", "capital":"Canberra"},
    {"name":"Austria", "capital":"Vienna"},
...

The API is presented in a recognizable format: the countries key in the outer dictionary contains a list of dictionaries, where name and capital are the column names and their corresponding values are the row-wise data.

To create a pandas DataFrame from this URL, run the following code in a new cell:

convert_data.ipynb
response=requests.get('https://raw.githubusercontent.com/dariusk/corpora/master/data/geography/countries_with_capitals.json').json()

df1=pd.DataFrame(response['countries'])

df1.head()

The requests module fetches the content at the URL, and the .json() call parses the response body as JSON; you assign the parsed data to the variable response. Next, you pass the list of dictionaries (provided as the value of the countries key) to pandas, which in turn returns a DataFrame. df1 is the DataFrame obtained upon successful extraction of the data from the given API endpoint.

After running the above code, you’ll receive the following DataFrame:

Output
          name           capital
0  Afghanistan             Kabul
1      Albania            Tirana
2      Algeria           Algiers
3      Andorra  Andorra la Vella
4       Angola            Luanda

This is the DataFrame you obtain after passing the list of countries and capitals to the pandas DataFrame constructor. The column names are name and capital because those keys are present in every dictionary in the list, and each dictionary’s values supply one row of data (since every dictionary represents a row).
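
As a side note, a request can fail in practice, for example if the URL is temporarily unreachable. Here is a minimal, more defensive sketch of the same fetch, using the same URL; raise_for_status() raises an exception for 4xx/5xx responses instead of letting you parse an error page:

convert_data.ipynb
url = 'https://raw.githubusercontent.com/dariusk/corpora/master/data/geography/countries_with_capitals.json'

response = requests.get(url, timeout=10)  # fail rather than hang indefinitely
response.raise_for_status()               # stop early on an HTTP error

df1 = pd.DataFrame(response.json()['countries'])
df1.head()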

So far, you have created DataFrames from basic dictionaries and structured API endpoints. When dealing with URLs containing structured APIs, the data is commonly presented in the format used in Method 1: as a list of dictionaries, with each dictionary representing row-wise data. When the API data is unordered, you can extract the column names and their data and arrange them using the formats from Method 1 or Method 2: in a row-wise or column-wise manner. In the next step, you’ll combine some of these strategies to obtain data from an unordered API endpoint.

Step 3 — Converting Unordered API Data into pandas DataFrame

Until now, you’ve converted ordered data into pandas DataFrames using several methods. Now, you will work with data in a different format. The API data at this new URL is not in a conventional format, making it difficult to identify the column names and their data.

You will have to take additional steps to create a DataFrame from this unordered data. First, you will extract the column names and the values of all the columns and then define the column names as keys and the column elements (in a list) as values in a dictionary. Ultimately, you will obtain a presentable dictionary, similar to the format used in Method 2, and present it to pandas for conversion into a DataFrame.

This step uses data from the NY School Demographics and Accountability Snapshot.

Tracking Column Names Within the API Endpoint

In this section, you will track information related to the columns of the data enclosed within the API endpoint.

To access the sample data, open this URL via a web browser of your choice:

https://data.cityofnewyork.us/api/views/ihfw-zy9j/rows.json

The data in this URL is less presentable than the data you worked with in Step 2. The NY Snapshot data presents dictionaries of one form or another, but they are not dictionaries where the keys represent the column names and the values represent the column’s elements. There don’t seem to be any straightforward dictionaries that depict row-wise or column-wise data.

You’ll begin by finding the information related to the columns of the DataFrame. With the API URL open in your browser, press CTRL+F and search for columns:

The Columns Section (API data from the NY School Snapshot URL)
...
"columns" : [ {
        "id" : -1,
        "name" : "sid",
        "dataTypeName" : "meta_data",
        "fieldName" : ":sid",
        "position" : 0,
        "renderTypeName" : "meta_data",
        "format" : { },
        "flags" : [ "hidden" ]
      }, {
        "id" : -1,
        "name" : "id",
        "dataTypeName" : "meta_data",
        "fieldName" : ":id",
        "position" : 0,
        "renderTypeName" : "meta_data",
        "format" : { },
        "flags" : [ "hidden" ]
      }, {
        "id" : -1,
        "name" : "position",
        "dataTypeName" : "meta_data",
        "fieldName" : ":position",
        "position" : 0,
        "renderTypeName" : "meta_data",
        "format" : { },
        "flags" : [ "hidden" ]
...

There are several dictionaries within the value list of the columns key. Within each dictionary, notice the values of the dataTypeName keys. For the first eight dictionaries, the value of the dataTypeName key is meta_data. Metadata is data that describes data. In this case, metadata isn’t the data contained within the table; rather, it is the data that describes the table as a whole. Therefore, these eight dictionaries (that have meta_data as the value of their dataTypeName key) do not represent information regarding the columns present within the table.

However, as you continue through this list, note the first appearance of a dataTypeName key that does not have meta_data as its value:

The Columns Section (API data from the NY School Snapshot URL)
...
}, {
       "id" : -1,
       "name" : "meta",
       "dataTypeName" : "meta_data",
       "fieldName" : ":meta",
       "position" : 0,
       "renderTypeName" : "meta_data",
       "format" : { },
       "flags" : [ "hidden" ]
     }, {
       "id" : 45985351,
       "name" : "DBN",
       "dataTypeName" : "text",
       "fieldName" : "dbn",
       "position" : 1,
       "renderTypeName" : "text",
       "tableColumnId" : 8032537,
       "cachedContents" : {
         "non_null" : "10075",
...

The dictionary with 45985351 as the value of its id key is the first dictionary that does not have meta_data as the value of its dataTypeName key. Instead, the value is text: this dictionary describes the first column of your eventual dataset, whose elements are in a text format (in other words, strings). The dictionary has DBN as the value of its name key, indicating that DBN is the name of your first column. All the dictionaries (within the columns key) that follow also lack meta_data as the value of their dataTypeName key, indicating that they describe the remaining columns of your DataFrame.

Now that you have a basic understanding of where to find information about your table’s columns, you will obtain the names of these columns in the next step.

Finding Column Names

In this section, you will add the names of all the columns from the API endpoint into a list called columns. Add the following lines to a new cell in your Jupyter Notebook and run the cell:

convert_data.ipynb
  1. response=requests.get('https://data.cityofnewyork.us/api/views/ihfw-zy9j/rows.json').json()
  2. columns=[]
  3. for i in response['meta']['view']['columns']:
  4.     if(i['dataTypeName']=='meta_data'):
  5.         continue
  6.     else:
  7.         columns.append(i['name'])

Similar to what you did in Method 3 of Step 2, you read the data at this URL via the requests module and assign it to a variable named response.

In line 2, you define the empty columns list. In line 3, you access the columns key, which sits within the value of the view key, which is in turn nested within the value of the meta key.

The for loop in line 3 iterates through every dictionary in the list attached to the columns key and checks the value of each dictionary's dataTypeName key. If the value equals meta_data, you skip to the next dictionary. Otherwise, you append the value of the name key to the columns list.
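
As a side note, the same filter can be written more compactly as a list comprehension; a sketch that assumes the same response variable and builds an identical list:

convert_data.ipynb
columns = [
    i['name']
    for i in response['meta']['view']['columns']
    if i['dataTypeName'] != 'meta_data'
]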

Next, check the contents of the columns list. Run the following line in a new cell:

convert_data.ipynb
columns

The column names will print to the screen:

Output
['DBN', 'Name', 'schoolyear', 'fl_percent', 'frl_percent', 'total_enrollment', 'prek', 'k', 'grade1', 'grade2', 'grade3', 'grade4', 'grade5', 'grade6', 'grade7', 'grade8', 'grade9', 'grade10', 'grade11', 'grade12', 'ell_num', 'ell_percent', 'sped_num', 'sped_percent', 'ctt_num', 'selfcontained_num', 'asian_num', 'asian_per', 'black_num', 'black_per', 'hispanic_num', 'hispanic_per', 'white_num', 'white_per', 'male_num', 'male_per', 'female_num', 'female_per']

The output confirms that any column whose dataTypeName is not meta_data has been added to the columns list.

In this section, you obtained a list of the column names in the unordered API endpoint, which you’ll need to create your final DataFrame. Next, you will extract the data that belongs to the columns defined in the columns list.

Defining a List of n Sublists

In this section, you will determine the number of columns in the API and define a list of sublists corresponding to this number.

Ultimately, you will use Method 2 from Step 2 to obtain a pandas DataFrame from a dictionary whose keys represent the column names and whose values are lists holding the elements of the respective columns. After determining the number of columns in the API, you will define a list containing that same number of sublists, where each sublist holds the data for its respective column.

Once this structure is in place, you will be able to build the dictionary dynamically by looping over the columns list and the list of sublists in parallel. For example, the sublist at index 0 will serve as the list of elements for the column name at index 0 of the columns list. Similarly, the sublist at index 1 will serve as the list of elements for the column name at index 1 of the columns list.

In the previous section, you obtained a list of the column names. Now, check the length of this list by running this line in a new cell:

convert_data.ipynb
len(columns)

len(columns) returns the number of items in the columns list.

The number of columns present in the API will print to the screen:

Output
38

Since you have 38 columns, you will define a list of 38 sublists, where each sublist will contain the data for each column.

Run the following code in a new cell to create the list (named d) of 38 sublists:

convert_data.ipynb
d = [[] for x in range(len(columns))]

d

To create the list, you use a list comprehension: for each of the 38 values produced by range(len(columns)), the expression [] creates a new, empty sublist. Each sublist will later hold the data for one of the 38 columns.

The following output shows the list of 38 sublists:

Output
[[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], []]

These sublists are initially empty because you have not yet appended any elements to them.
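
The list comprehension matters here. A tempting shortcut such as [[]] * 38 would create 38 references to one and the same underlying list, so appending to any one sublist would appear to modify all of them. Here is a small illustrative sketch of the difference, using three sublists instead of 38:

convert_data.ipynb
shared = [[]] * 3                   # three references to a single list
shared[0].append('x')
print(shared)                       # [['x'], ['x'], ['x']]

separate = [[] for _ in range(3)]   # three independent lists
separate[0].append('x')
print(separate)                     # [['x'], [], []]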

You have now created 38 sublists within the list d. The contents of these 38 sublists will represent column-wise data for the 38 columns. You’ll add that content next.

Obtaining the Values of Columns

In this section, you will extract the column-wise data from the API endpoint and append it to the 38 sublists you created in the previous section. To do this, you will first need to find out where the data key is present in the API. The values of the data key represent the table's contents.

With the API URL open in your browser, press CTRL+F and search for data. There will be around 73 occurrences. Locate the last occurrence, where you will find the following:

The Data Portion Showing Sublists (API Data from the NY School Snapshot URL)
...
"data" : [ [ "row-h6zp~rx75.iwiv", "00000000-0000-0000-C205-81EF0C7F0969", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20052006", "89.4", null, "281", "15", "36", "40", "33", "38", "52", "29", "38", null, null, null, null, null, null, "36", "12.8", "57", "20.3", "25", "9", "10", "3.6", "74", "26.3", "189", "67.3", "5", "1.8", "158", "56.2", "123", "43.8" ]
, [ "row-bgbf_ntuw.ckdq", "00000000-0000-0000-9AF5-4D58BAF51C20", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20062007", "89.4", null, "243", "15", "29", "39", "38", "34", "42", "46", null, null, null, null, null, null, null, "38", "15.6", "55", "22.6", "19", "15", "18", "7.4", "68", "28", "153", "63", "4", "1.6", "140", "57.6", "103", "42.4" ]
, [ "row-mspc-8wz5_uxb8", "00000000-0000-0000-9E11-73B99A1B02D9", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20072008", "89.4", null, "261", "18", "43", "39", "36", "38", "47", "40", null, null, null, null, null, null, null, "52", "19.9", "60", "23", "20", "14", "16", "6.1", "77", "29.5", "157", "60.2", "7", "2.7", "143", "54.8", "118", "45.2" ]
, [ "row-p6za~9ikt~ums7", "00000000-0000-0000-D2CD-5904BA5DC16E", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20082009", "89.4", null, "252", "17", "37", "44", "32", "34", "39", "49", null, null, null, null, null, null, null, "48", "19", "62", "24.6", "21", "17", "16", "6.3", "75", "29.8", "149", "59.1", "7", "2.8", "149", "59.1", "103", "40.9" ]
, [ "row-55ss_ktcg~w7ru", "00000000-0000-0000-9425-35118FA9200F", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20092010", "   ", "96.5", "208", "16", "40", "28", "32", "30", "24", "38", null, null, null, null, null, null, null, "40", "19.2", "46", "22.1", "14", "14", "16", "7.7", "67", "32.2", "118", "56.7", "6", "2.9", "124", "59.6", "84", "40.4" ]
, [ "row-ugmn-sxmy_fyiu", "00000000-0000-0000-2070-ABC0343F1148", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20102011", "   ", "96.5", "203", "13", "37", "35", "33", "30", "30", "25", null, null, null, null, null, null, null, "30", "14.8", "46", "22.7", "21", "9", "13", "6.4", "75", "36.9", "110", "54.2", "4", "2", "113", "55.7", "90", "44.3" ]
, [ "row-a4rf-xqy7~fvkh", "00000000-0000-0000-AF7F-41FA5EB8E91E", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE                       ", "20112012", null, "89.4", "189", "13", "31", "35", "28", "25", "28", "29", "    ", "    ", "    ", "    ", "    ", "    ", "    ", "20", "10.6", "40", "21.2", "23", "7", "12", "6.3", "63", "33.3", "109", "57.7", "4", "2.1", "97", "51.3", "92", "48.7" ]
...

Each sublist within the list assigned to the data key indicates the values of a single row of your prospective DataFrame. To determine where the actual row-wise data begins, you will need to compare the number of columns against the number of elements within a single sublist of the data section.

Confirm the number of columns by running the following line in a new cell:

convert_data.ipynb
len(columns)

The following output will print the number of columns:

Output
38

Next, check the length of the first sublist within data, which is the number of values in the first row:

convert_data.ipynb

len(response['data'][0])

You specify the first sublist within data by its index, 0.

The following output will print the number of elements in the first sublist within data:

Output
46

You might expect that, since the data is row-wise, the number of elements in the first sublist would equal the number of columns (38). However, the length of each sublist is 46, which is 8 more than 38. Recall that you found 8 dictionaries within the columns key that had meta_data as the value of their dataTypeName key. These extra 8 values at the beginning of each sublist are therefore metadata values and do not represent the contents of your table.

Since indexes in lists start at 0, you now know that the values of these sublists in the range 0 to 7 (inclusive) represent metadata values rather than DataFrame contents. Therefore, the row-wise data begins at index 8.
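
Rather than hard-coding the 8, you can compute this offset from the lengths you already have; a small sanity-check sketch, assuming the response and columns variables from the earlier cells:

convert_data.ipynb
# number of leading metadata fields per row = row length minus column count
offset = len(response['data'][0]) - len(columns)
offset  # 8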

You will append elements from within the data sublists with the help of a loop that starts at index 8 (that is, 46 - 38) and stops before index 46. To test this out, you’ll extract the data from just the first row (rather than extracting all the data from all the sublists). Run the following lines in a new cell:

convert_data.ipynb
count=0

for i in response['data'][0][len(response['data'][0])-len(columns):len(response['data'][0])]: #from index 8 to 45

    d[count].append(i)

    count=count+1

d

In this case, the variable count tracks which sublist of d the next value belongs to: the value at column position n of the first row is appended to the nth sublist of d.

Then the code iterates through the first sublist within data to get elements starting from index 8 and ending at index 45 (which results in 38 elements).

len(response['data'][0]) represents the length of the first sublist within the data key, whereas len(columns) is the length of the columns list, that is, the number of columns of the final DataFrame.

Here’s an overview of what happens in the first few iterations. The count is initially 0. When you enter the for loop, i is response['data'][0][8], the first non-metadata value in the first sublist. You append this value to d[0], where 0 is the current count value. As a result, the value lands in the first sublist within the list d.

The count is then incremented by 1. When i is response['data'][0][9] (the second non-metadata value in the first sublist), you append it to d[1] (the index is now 1 because count was incremented in the previous iteration). As a result, this value is appended to the second sublist within the list d. This process continues until you reach index 45. The last line of the code above displays the updated d list of sublists.

The output displays the data of the first row:

Output
[['01M015'], ['P.S. 015 ROBERTO CLEMENTE'], ['20052006'], ['89.4'], [None], ['281'], ['15'], ['36'], ['40'], ['33'], ['38'], ['52'], ['29'], ['38'], [None], [None], [None], [None], [None], [None], ['36'], ['12.8'], ['57'], ['20.3'], ['25'], ['9'], ['10'], ['3.6'], ['74'], ['26.3'], ['189'], ['67.3'], ['5'], ['1.8'], ['158'], ['56.2'], ['123'], ['43.8']]
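
As a side note, the manual count variable can be replaced with Python’s built-in enumerate, which yields an index alongside each value. An equivalent sketch under the same assumptions (run it against a freshly emptied d, or the values would be appended a second time):

convert_data.ipynb
offset = len(response['data'][0]) - len(columns)

# enumerate supplies count automatically: 0 for index 8, 1 for index 9, ...
for count, value in enumerate(response['data'][0][offset:]):
    d[count].append(value)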

To check if the first row contains the correct data, compare it to the first sublist within the list that serves as value to the data key in the API URL:

Elements of the First Sublist (API Data from the NY School Snapshot URL)
...
"data" : [ [ "row-h6zp~rx75.iwiv", "00000000-0000-0000-C205-81EF0C7F0969", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20052006", "89.4", null, "281", "15", "36", "40", "33", "38", "52", "29", "38", null, null, null, null, null, null, "36", "12.8", "57", "20.3", "25", "9", "10", "3.6", "74", "26.3", "189", "67.3", "5", "1.8", "158", "56.2", "123", "43.8" ]
...

You may need to scroll to the right of the box above to view the complete data snippet.

The first few values (from index 0 to 7) represent information related to the metadata columns. The values within the first sublist from index 8 to 45 are the values of the first row, which you have extracted successfully.

To perform the same task for all the rows, you will need to update the previous code. First, redefine the d list of 38 sublists by running the following line in a new cell:

convert_data.ipynb
d = [[] for x in range(len(columns))]

You are redefining the d list because its sublists currently contain all the elements of the first row. Your aim now is to perform the same operation for all the rows, including the first.

Now that you have created a new list of 38 sublists, run the following lines in a new cell, which performs data extraction for all the rows of the prospective DataFrame:

convert_data.ipynb
  1. for i in response['data']:
  2.     count=0
  3.     for j in range(len(response['data'][0])-len(columns),len(response['data'][0])):
  4.         d[count].append(i[j])
  5.         count+=1

In line 1, for i in response['data'] means that i is assigned each sublist (that is, each row) one by one.

Here’s an overview of a few iterations. You start with i having the first sublist within the list assigned to the data key. The count is 0.

Line 3 iterates through the element indexes from 8 up to 46. In Python, the upper limit of range() is exclusive, so you get the elements at indexes 8 through 45.

In the first pass, d[count].append(i[j]) means d[0].append(i[8]): you append the element at index 8 of the first sublist (currently assigned to i) to the first sublist of d. count is incremented by 1, you remain in the inner loop, and j becomes 9.

Line 4 is now d[1].append(i[9]), meaning the element at index 9 of the first sublist (still assigned to i as per the outer loop) is appended to the second sublist of the list d. The count is incremented. The inner loop continues until you are done appending all the values of a single row to all the sublists of d.

At the end of the inner loop, the operation of appending the last value of the first row would look like this: d[37].append(i[45]). At the end of the first iteration of i, the list d looks the same as the previous output (having only one element in all the sublists of d).

When you break out of the inner loop, you go to the outer loop. i is assigned the next sublist (second sublist) within the list assigned to the data key. The count value is now back to 0.

When you break out of the inner loop in this second iteration of i, all the sublists of d will now have two values each. The second value of each sublist represents data belonging to the second row. This goes on for all the sublists present within the list assigned to the data key.
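
If you want to confirm that the loops filled the sublists completely, you can check that every sublist of d now holds one element per row of the API data; a quick sketch, assuming the same response and d:

convert_data.ipynb
# the number of rows in the raw data
print(len(response['data']))

# every sublist of d should hold exactly that many elements
print(all(len(sublist) == len(response['data']) for sublist in d))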

To give you an idea of how the obtained data looks in the sublists of d, you will retrieve the first five elements of the first three sublists of d. In the resulting DataFrame, these values will be the first five elements of the first three columns.

Begin by reviewing the first five elements of the sublist at index 0 of d:

convert_data.ipynb
d[0][0:5]

The output will display the following:

Output
['01M015', '01M015', '01M015', '01M015', '01M015']

You can verify the output by comparing it with the data section of the API when viewed in your browser:

Elements of the First Sublist (API Data from the NY School Snapshot URL)
...
"data" : [ [ "row-h6zp~rx75.iwiv", "00000000-0000-0000-C205-81EF0C7F0969", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20052006", "89.4", null, "281", "15", "36", "40", "33", "38", "52", "29", "38", null, null, null, null, null, null, "36", "12.8", "57", "20.3", "25", "9", "10", "3.6", "74", "26.3", "189", "67.3", "5", "1.8", "158", "56.2", "123", "43.8" ]
, [ "row-bgbf_ntuw.ckdq", "00000000-0000-0000-9AF5-4D58BAF51C20", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20062007", "89.4", null, "243", "15", "29", "39", "38", "34", "42", "46", null, null, null, null, null, null, null, "38", "15.6", "55", "22.6", "19", "15", "18", "7.4", "68", "28", "153", "63", "4", "1.6", "140", "57.6", "103", "42.4" ]
, [ "row-mspc-8wz5_uxb8", "00000000-0000-0000-9E11-73B99A1B02D9", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20072008", "89.4", null, "261", "18", "43", "39", "36", "38", "47", "40", null, null, null, null, null, null, null, "52", "19.9", "60", "23", "20", "14", "16", "6.1", "77", "29.5", "157", "60.2", "7", "2.7", "143", "54.8", "118", "45.2" ]
, [ "row-p6za~9ikt~ums7", "00000000-0000-0000-D2CD-5904BA5DC16E", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20082009", "89.4", null, "252", "17", "37", "44", "32", "34", "39", "49", null, null, null, null, null, null, null, "48", "19", "62", "24.6", "21", "17", "16", "6.3", "75", "29.8", "149", "59.1", "7", "2.8", "149", "59.1", "103", "40.9" ]
, [ "row-55ss_ktcg~w7ru", "00000000-0000-0000-9425-35118FA9200F", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20092010", "   ", "96.5", "208", "16", "40", "28", "32", "30", "24", "38", null, null, null, null, null, null, null, "40", "19.2", "46", "22.1", "14", "14", "16", "7.7", "67", "32.2", "118", "56.7", "6", "2.9", "124", "59.6", "84", "40.4" ]
,
...

You may need to scroll to the right of the box above to view the complete data snippet.

The values you obtained match the values at index 8 of each sublist. The elements of these sublists in the index range 0-7 (inclusive) represent metadata values, not prospective DataFrame values. Ultimately, these values will be the first five elements of the column at index 0 of the columns list: DBN.

Next, you can review the first five elements of the sublist at index 1 of d:

convert_data.ipynb
d[1][0:5]

The first five elements of the second sublist will print to the screen:

Output
['P.S. 015 ROBERTO CLEMENTE', 'P.S. 015 ROBERTO CLEMENTE', 'P.S. 015 ROBERTO CLEMENTE', 'P.S. 015 ROBERTO CLEMENTE', 'P.S. 015 ROBERTO CLEMENTE']

These values represent the first five elements of the column at index 1 of the columns list: Name. Verify the output with the data section of the API URL:

Elements of the Second Sublist (API Data from the NY School Snapshot URL)
...
"data" : [ [ "row-h6zp~rx75.iwiv", "00000000-0000-0000-C205-81EF0C7F0969", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20052006", "89.4", null, "281", "15", "36", "40", "33", "38", "52", "29", "38", null, null, null, null, null, null, "36", "12.8", "57", "20.3", "25", "9", "10", "3.6", "74", "26.3", "189", "67.3", "5", "1.8", "158", "56.2", "123", "43.8" ]
, [ "row-bgbf_ntuw.ckdq", "00000000-0000-0000-9AF5-4D58BAF51C20", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20062007", "89.4", null, "243", "15", "29", "39", "38", "34", "42", "46", null, null, null, null, null, null, null, "38", "15.6", "55", "22.6", "19", "15", "18", "7.4", "68", "28", "153", "63", "4", "1.6", "140", "57.6", "103", "42.4" ]
, [ "row-mspc-8wz5_uxb8", "00000000-0000-0000-9E11-73B99A1B02D9", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20072008", "89.4", null, "261", "18", "43", "39", "36", "38", "47", "40", null, null, null, null, null, null, null, "52", "19.9", "60", "23", "20", "14", "16", "6.1", "77", "29.5", "157", "60.2", "7", "2.7", "143", "54.8", "118", "45.2" ]
, [ "row-p6za~9ikt~ums7", "00000000-0000-0000-D2CD-5904BA5DC16E", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20082009", "89.4", null, "252", "17", "37", "44", "32", "34", "39", "49", null, null, null, null, null, null, null, "48", "19", "62", "24.6", "21", "17", "16", "6.3", "75", "29.8", "149", "59.1", "7", "2.8", "149", "59.1", "103", "40.9" ]
, [ "row-55ss_ktcg~w7ru", "00000000-0000-0000-9425-35118FA9200F", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20092010", "   ", "96.5", "208", "16", "40", "28", "32", "30", "24", "38", null, null, null, null, null, null, null, "40", "19.2", "46", "22.1", "14", "14", "16", "7.7", "67", "32.2", "118", "56.7", "6", "2.9", "124", "59.6", "84", "40.4" ]
,
...

You may need to scroll to the right of the box above to view the complete data snippet. The values you obtained match the values at index 9 of each sublist.

Finally, review the first five elements of the sublist at index 2 of d:

convert_data.ipynb
d[2][0:5]

The following output will print to the screen:

Output
['20052006', '20062007', '20072008', '20082009', '20092010']

Verify the output with the data section of the API URL:

Elements of the Third Sublist (API Data from the NY School Snapshot URL)
...
"data" : [ [ "row-h6zp~rx75.iwiv", "00000000-0000-0000-C205-81EF0C7F0969", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20052006", "89.4", null, "281", "15", "36", "40", "33", "38", "52", "29", "38", null, null, null, null, null, null, "36", "12.8", "57", "20.3", "25", "9", "10", "3.6", "74", "26.3", "189", "67.3", "5", "1.8", "158", "56.2", "123", "43.8" ]
, [ "row-bgbf_ntuw.ckdq", "00000000-0000-0000-9AF5-4D58BAF51C20", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20062007", "89.4", null, "243", "15", "29", "39", "38", "34", "42", "46", null, null, null, null, null, null, null, "38", "15.6", "55", "22.6", "19", "15", "18", "7.4", "68", "28", "153", "63", "4", "1.6", "140", "57.6", "103", "42.4" ]
, [ "row-mspc-8wz5_uxb8", "00000000-0000-0000-9E11-73B99A1B02D9", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20072008", "89.4", null, "261", "18", "43", "39", "36", "38", "47", "40", null, null, null, null, null, null, null, "52", "19.9", "60", "23", "20", "14", "16", "6.1", "77", "29.5", "157", "60.2", "7", "2.7", "143", "54.8", "118", "45.2" ]
, [ "row-p6za~9ikt~ums7", "00000000-0000-0000-D2CD-5904BA5DC16E", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20082009", "89.4", null, "252", "17", "37", "44", "32", "34", "39", "49", null, null, null, null, null, null, null, "48", "19", "62", "24.6", "21", "17", "16", "6.3", "75", "29.8", "149", "59.1", "7", "2.8", "149", "59.1", "103", "40.9" ]
, [ "row-55ss_ktcg~w7ru", "00000000-0000-0000-9425-35118FA9200F", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20092010", "   ", "96.5", "208", "16", "40", "28", "32", "30", "24", "38", null, null, null, null, null, null, null, "40", "19.2", "46", "22.1", "14", "14", "16", "7.7", "67", "32.2", "118", "56.7", "6", "2.9", "124", "59.6", "84", "40.4" ]
,
...

You may need to scroll to the right of the box above to view the complete data snippet. The values you obtained match the values at index 10 of each sublist. Ultimately, these values represent the first five elements of the column at index 2 of the columns list: schoolyear.

By the end of this section, you have successfully extracted the data for all rows of all 38 columns and appended it to the 38 sublists of the list d. You are now ready for the final step: creating a pandas DataFrame from the data you have obtained so far.

Creating a Dictionary for the DataFrame

At this point, you have extracted the names and values of all 38 columns, which you will now pass to pandas in dictionary format. In this section, you will create a dictionary whose keys are the column names (from the columns list) and whose values are the 38 sublists from d.

Run the following code in a new cell:

convert_data.ipynb
json_dict={}

for i in range(0,len(columns)):
    json_dict.update({columns[i]:d[i]})

You loop through the range from 0 up to (but not including) 38 and add key-value pairs to the final dictionary, json_dict, which you will pass to pandas for conversion into a DataFrame. For example, when i is 0, the sublist of d at index 0 is assigned as the value of the key DBN (represented by columns[0]), and the same applies to the remaining indexes.
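
As a side note, you can build the same dictionary in a single line with zip, which pairs each column name with its corresponding sublist; a sketch assuming the same columns and d:

convert_data.ipynb
json_dict = dict(zip(columns, d))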

To get an idea of how the obtained keys of the dictionary will look, review all the keys present in json_dict:

convert_data.ipynb
json_dict.keys()

The list of keys for the final dictionary will print to the screen:

Output
dict_keys(['DBN', 'Name', 'schoolyear', 'fl_percent', 'frl_percent', 'total_enrollment', 'prek', 'k', 'grade1', 'grade2', 'grade3', 'grade4', 'grade5', 'grade6', 'grade7', 'grade8', 'grade9', 'grade10', 'grade11', 'grade12', 'ell_num', 'ell_percent', 'sped_num', 'sped_percent', 'ctt_num', 'selfcontained_num', 'asian_num', 'asian_per', 'black_num', 'black_per', 'hispanic_num', 'hispanic_per', 'white_num', 'white_per', 'male_num', 'male_per', 'female_num', 'female_per'])

The 38 sublists of d have been assigned to the above 38 keys in json_dict. Therefore, you have obtained a dictionary in which the keys are the column names and the values are lists containing the elements of each column.

Now pass this dictionary to pandas to generate a DataFrame. The DataFrame can be assigned to a variable name of your choosing (this tutorial uses data):

convert_data.ipynb
data=pd.DataFrame(json_dict)

Next, review a portion of the resultant DataFrame:

convert_data.ipynb
data.head()

The first five rows of the DataFrame will print to the screen:

Output
      DBN                       Name schoolyear fl_percent frl_percent total_enrollment prek   k grade1 grade2  ...  black_num black_per hispanic_num hispanic_per white_num white_per male_num male_per female_num female_per
0  01M015  P.S. 015 ROBERTO CLEMENTE   20052006       89.4        None              281   15  36     40     33  ...         74      26.3          189         67.3         5       1.8      158     56.2        123       43.8
1  01M015  P.S. 015 ROBERTO CLEMENTE   20062007       89.4        None              243   15  29     39     38  ...         68        28          153           63         4       1.6      140     57.6        103       42.4
2  01M015  P.S. 015 ROBERTO CLEMENTE   20072008       89.4        None              261   18  43     39     36  ...         77      29.5          157         60.2         7       2.7      143     54.8        118       45.2
3  01M015  P.S. 015 ROBERTO CLEMENTE   20082009       89.4        None              252   17  37     44     32  ...         75      29.8          149         59.1         7       2.8      149     59.1        103       40.9
4  01M015  P.S. 015 ROBERTO CLEMENTE   20092010                   96.5              208   16  40     28     32  ...         67      32.2          118         56.7         6       2.9      124     59.6         84       40.4

You may need to scroll to the right of the box above to view the complete DataFrame head. You have successfully created a DataFrame by passing the data you extracted from the URL.
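
Note that every value extracted from the API arrives as a string (or None), so numeric operations won’t work until you convert the relevant columns. Here is a brief sketch of one way to do that with pandas’ to_numeric, using the total_enrollment column as an example:

convert_data.ipynb
# errors='coerce' turns non-numeric entries (such as blank strings) into NaN
data['total_enrollment'] = pd.to_numeric(data['total_enrollment'], errors='coerce')

data['total_enrollment'].mean()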

Conclusion

In this tutorial, you used some common methods for creating a pandas DataFrame from dictionaries of various formats and a URL. You also created a DataFrame from an API endpoint in which the columns and their respective values are not displayed in a straightforward manner.

This tutorial should help you understand how you can use unique methods to create DataFrames with data extracted from unconventional API endpoints or dictionaries. For more about pandas DataFrames, check out our pandas tutorials.
