The author selected the Free and Open Source Fund to receive a donation as part of the Write for DOnations program.
Similar to an Excel spreadsheet or SQL table, a pandas
DataFrame is a two-dimensional data structure with data presented in rows and columns. DataFrames are used primarily in data-concentrated fields such as scientific learning, data science, machine learning, and data analysis.
Using Python, you can convert data from CSV, Excel, or API endpoints into a pandas
DataFrame and then work on the obtained DataFrame. While converting Excel and CSV files tends to be straightforward, API endpoints can vary, so making DataFrames from that information can be challenging.
Some API endpoints do exist in a reader-friendly format of dictionaries, where the keys represent column names and the values represent the elements that belong to these columns. However, some API endpoints do not represent column names and their values in a clear context, and converting an unordered API endpoint into a DataFrame can be challenging.
In this tutorial, you will obtain data from an unordered endpoint and present it in a suitable format to pandas
DataFrame. First, you will extract data from some conventional API formats to provide some context for converting ordered API data into a DataFrame. Then, you will work with an unordered API endpoint. You will extract column names and their respective elements to create a dictionary. Finally, you will pass this dictionary into a pandas
DataFrame that you can use for data analysis.
To follow this tutorial, you will need:
pip
, and the Python venv
module installed on your machine. For Ubuntu and other Linux distributions, follow Steps 1 and 2 of our tutorial on How To Install Python 3 and Set Up a Programming Environment. For macOS, follow Steps 1–5 of our tutorial on How To Install Python 3 and Set Up a Local Programming Environment on macOS.for
loops, functions, and data structures (in particular, lists and dictionaries).pandas
module for data manipulation and analysis.For this tutorial, you’ll use Jupyter Notebook to work with the data. Jupyter Notebook is helpful for iterative coding since it allows you to write a small snippet of code, run it, and return the result.
To follow this tutorial in your Jupyter Notebook, you will need to open a new Notebook and install the required dependencies, which you will do in this step.
In the prerequisites, you set up a Jupyter Notebook environment on your server. Once you are logged into your server, activate the virtual environment:
- source ~/environments/my_env/bin/activate
Then run the Jupyter Notebook application to start the application:
- jupyter notebook
Note: If you are following the tutorial on a remote server, you can use port forwarding to access your Jupyter Notebook in the browser of your local machine.
Open a new terminal and enter the following command:
- ssh -L 8888:localhost:8888 your_non_root_user@your_server_ip
Upon connecting to the server, navigate to the link provided by the output to access your Jupyter Notebook. Keep this terminal open throughout the remainder of this tutorial.
After running and connecting to it, you will access a browser-based user interface. From the New dropdown menu, select the Python3(ipykernel) option, which will open a new tab with an untitled Python notebook. Give your file a name; in this tutorial, the file is named convert_data.ipynb
.
Then, in the first cell of your browser’s Jupyter Notebook, use pip
to install the necessary dependencies:
pip install pandas
pip install requests
The DataFrame data structure is only accessible through the pandas
module. The requests
module helps you gain access to a URL.
After adding the commands to your Jupyter Notebook, press Run to execute them.
Your Jupyter Notebook will provide a running output to indicate that the dependencies are being downloaded. A new input cell will be available below this output, which is where you will run the next lines of code.
Next, import the modules you just installed by running the following commands:
import pandas as pd
import requests
pd
will serve as a shortcut for you to call the pandas
module. You can call the pandas
module by writing just pd
.
In this step, you have prepared your Jupyter Notebook environment. Next, you will perform some basic conversions to create DataFrames from dictionaries.
pandas
DataFrameBefore obtaining data from an unordered API endpoint, you will first create DataFrames from dictionaries of different formats and organized API data at a URL.
In many API endpoints, the keys usually represent the column names, and the values of those keys represent the elements belonging to the respective columns. However, some API endpoints are not presented in this format, which means you will need a method for extracting the column names and their corresponding data to create a readable dictionary from which a pandas
DataFrame can be created. Reviewing some common methodologies for converting conventional API endpoints into a pandas
DataFrame will help you with this task.
In this step, you will use some common methods to create DataFrames from dictionaries of different formats.
In this method, each dictionary within the list represents a row of data in the final DataFrame. The DataFrame structure will feature keys that represent the column names with values for the elements belonging to the columns.
For this method, you’ll create a DataFrame with the columns Age
, ID
, and Name
along with the information of three individuals.
In a new cell, run the following code to format a list of dictionaries assigned to the variable dict1
(feel free to replace the name with your own):
dict1=[{"Age":41,"ID":1234,"Name":"Bella"},{"Age":24,"ID":5678,"Name":"Mark"},{"Age":26,"ID":2341,"Name":"John"}]
data=pd.DataFrame(dict1)
data.head()
You assign a list of dictionaries to the dict1
variable. Each dictionary within the list represents a single row of the resultant DataFrame. The keys Age
, ID
, and Name
represent the column names. The values assigned to each key represent that column’s value for that row. You then pass this dictionary to the pandas
module to create a DataFrame.
The following DataFrame will print:
OutputAge ID Name
0 41 1234 Bella
1 24 5678 Mark
2 26 2341 John
This DataFrame was created from the list of dictionaries. The keys you defined in each dictionary within the dict1
list are the column names in the obtained DataFrame, while the values in these dictionaries represent the row-wise data.
In the previous method, you defined the data based on rows, which led to typing out several dictionaries in a list. Each dictionary represented row-wise data, where the keys are the column names and the values are the elements belonging to the columns.
Now you will define the data in a column-wise manner. This time, you’ll use a dictionary whose keys represent column names but whose values are lists representing the vertical data for that column.
To use this method, you can use the same data you used in the previous section. Run the following lines in a new cell:
dict2={"Age":[41,24,26],"ID":[1234,5678,2341],"Name":["Bella","Mark","John"]}
data=pd.DataFrame(dict2)
data.head()
Here, you define column-wise data using a dictionary. The keys of the dict2
dictionary are the column names: Age
, ID
, and Name
. The value of each key is a list representing the data for each column.
The following DataFrame will print to the screen:
OutputAge ID Name
0 41 1234 Bella
1 24 5678 Mark
2 26 2341 John
The keys of the dict2
dictionary represent the column names (Age
, ID
, and Name
), and each column is populated with the list you defined as the value for each key. For example, the Age
column has the values 41
, 24
, and 26
.
As expected, both methods produce the same result. Though the outputs of the first two methods are the same, you can use both row-wise and column-wise approaches to create a DataFrame.
In this method, you will create a DataFrame with API data extracted from a URL. Within API-related URLs, the data is often presented in a list of dictionaries similar to the data used in Method 1. But instead of hard-coding the data, you’ll access API data at a URL using the requests
module, which helps you gain access to URLs.
For this method, you’ll use a URL for a JSON API: Countries with Capitals, which presents a list of countries and their capitals in an API format. To view the API, visit the following URL in your browser:
- https://raw.githubusercontent.com/dariusk/corpora/master/data/geography/countries_with_capitals.json
Here is the beginning portion of the API data presented by the URL:
{
"description": "A list of countries and its respective capitals.",
"countries": [
{"name":"Afghanistan", "capital":"Kabul"},
{"name":"Albania", "capital":"Tirana"},
{"name":"Algeria", "capital":"Algiers"},
{"name":"Andorra", "capital":"Andorra la Vella"},
{"name":"Angola", "capital":"Luanda"},
{"name":"Antigua & Barbuda", "capital":"St. John's"},
{"name":"Argentina", "capital":"Buenos Aires"},
{"name":"Armenia", "capital":"Yerevan"},
{"name":"Australia", "capital":"Canberra"},
{"name":"Austria", "capital":"Vienna"},
...
The API is presented in a recognizable format: the countries
key in the outer dictionary contains a list of dictionaries, where name
and capital
are the column names and their corresponding values are the row-wise data.
To create a pandas
DataFrame from this URL, run the following code in a new cell:
response=requests.get('https://raw.githubusercontent.com/dariusk/corpora/master/data/geography/countries_with_capitals.json').json()
df1=pd.DataFrame(response['countries'])
df1.head()
The requests
module allows you to gain access to a URL. In this case, you read the URL as a json
file and assign the read data to the variable response
. Next, you present this list of dictionaries (provided as a value to the countries
key) to pandas
, which in turn provides a DataFrame. df1
will be the DataFrame obtained upon successful extraction of the data from the given API endpoint.
After running the above code, you’ll receive the following DataFrame:
Output name capital
0 Afghanistan Kabul
1 Albania Tirana
2 Algeria Algiers
3 Andorra Andorra la Vella
4 Angola Luanda
This is the DataFrame you obtain after passing the list of countries and capitals to pandas
DataFrame. When you ran this code, you passed the list of dictionaries (present as a value to the countries
key) to the pandas
DataFrame module. The column names are name
and capital
since they are present as keys in every dictionary in this list, and their values represent the row-wise data (since every dictionary represents a row).
So far you have created DataFrames from basic dictionaries and structured API endpoints. When dealing with URLs containing structured APIs, the data is commonly presented in the format used in Method 1: as a list of dictionaries, with each dictionary representing row-wise data. In cases where the API data is unstructured, you can extract column names and their data using the formats in Methods 1 or 2: in a row-wise or column-wise manner. In the next step, you’ll combine some of the strategies from this step to obtain data from an unstructured API endpoint.
pandas
DataFrameUntil now, you’ve converted ordered data into pandas
Dataframe using several methods. Now, you will work with data in a different format. The API data in this new URL is not in a conventional format, making it difficult to identify the column names and their data.
You will have to take additional steps to create a DataFrame from this unordered data. First, you will extract the column names and the values of all the columns and then define the column names as keys and the column elements (in a list) as values in a dictionary. Ultimately, you will obtain a presentable dictionary, similar to the format used in Method 2, and present it to pandas
for conversion into a DataFrame.
This step uses data from the NY School Demographics and Accountability Snapshot.
In this section, you will track information related to the columns of the data enclosed within the API endpoint.
To access the sample data, open this URL via a web browser of your choice:
- https://data.cityofnewyork.us/api/views/ihfw-zy9j/rows.json
The data in this URL is less presentable than the data you worked with in Step 2. The NY Snapshot data presents dictionaries of one form or another, but they are not dictionaries where the keys represent the column names and the values represent the column’s elements. There don’t seem to be any straightforward dictionaries that depict row-wise or column-wise data.
You’ll begin by finding the information related to the columns for the DataFrame. When accessing the API URL in your browser, type CTRL+F
and search for columns
:
...
"columns" : [ {
"id" : -1,
"name" : "sid",
"dataTypeName" : "meta_data",
"fieldName" : ":sid",
"position" : 0,
"renderTypeName" : "meta_data",
"format" : { },
"flags" : [ "hidden" ]
}, {
"id" : -1,
"name" : "id",
"dataTypeName" : "meta_data",
"fieldName" : ":id",
"position" : 0,
"renderTypeName" : "meta_data",
"format" : { },
"flags" : [ "hidden" ]
}, {
"id" : -1,
"name" : "position",
"dataTypeName" : "meta_data",
"fieldName" : ":position",
"position" : 0,
"renderTypeName" : "meta_data",
"format" : { },
"flags" : [ "hidden" ]
...
There are several dictionaries within the value list of the columns
key. Within each dictionary, notice the values of the dataTypeName
keys. For the first eight dictionaries, the value of the dataTypeName
key is meta_data
. Metadata is data that describes data. In this case, metadata isn’t the data contained within the table; rather, it is the data that describes the table as a whole. Therefore, these eight dictionaries (that have meta_data
as the value of their dataTypeName
key) do not represent information regarding the columns present within the table.
However, when you continue further in this list, note the first apperance of a dataTypeName
key that does not have meta_data
as its value:
...
}, {
"id" : -1,
"name" : "meta",
"dataTypeName" : "meta_data",
"fieldName" : ":meta",
"position" : 0,
"renderTypeName" : "meta_data",
"format" : { },
"flags" : [ "hidden" ]
}, {
"id" : 45985351,
"name" : "DBN",
"dataTypeName" : "text",
"fieldName" : "dbn",
"position" : 1,
"renderTypeName" : "text",
"tableColumnId" : 8032537,
"cachedContents" : {
"non_null" : "10075",
...
The dictionary with the value of 45985351
for the id
is the first dictionary that does not have meta_data
as the value to the dataTypeName
key. Instead, the value is text
. This dictionary depicts information about the first column of your eventual dataset. The elements of this column are in a text
format (in other words, a string). This first dictionary has a DBN
value for the name
key, indicating that DBN
is the name of your first column. All the dictionaries (within the columns
key) following this dictionary do not have meta_data
as the value to their dataTypeName
key, indicating that all these dictionaries represent information about the next columns of your DataFrame.
Now that you have a basic understanding of where to find information about your table’s columns, you will obtain the names of these columns in the next step.
In this section, you will add the names of all the columns from the API endpoint into a list called columns
. Add the following lines to a new cell in your Jupyter Notebook and run the cell:
- response=requests.get('https://data.cityofnewyork.us/api/views/ihfw-zy9j/rows.json').json()
-
- columns=[]
-
- for i in response['meta']['view']['columns']:
-
- if(i['dataTypeName']=='meta_data'):
-
- continue
- else:
-
- columns.append(i['name'])
Similar to what you did in Method 3 in Step 2, you’ll read the data present in this URL via the requests
module and assign it to a variable named response
.
In line 3, you define the columns
list. In line 5, you define the columns
key within the values of the view
key, which is in turn defined within the values of the meta
key.
The for
loop iterates through every dictionary present in the list attached to the columns
key and checks the values of the dataTypeName
key of those dictionaries. If the value of the current dictionary’s dataTypeName
key equals meta_data
, you skip to the next dictionary. If it isn’t, you append the name
key value to the columns
list.
Next, check the contents of the columns
list. Run the following line in a new cell:
columns
The column names will print to the screen:
Output['DBN',
'Name',
'schoolyear',
'fl_percent',
'frl_percent',
'total_enrollment',
'prek',
'k',
'grade1',
'grade2',
'grade3',
'grade4',
'grade5',
'grade6',
'grade7',
'grade8',
'grade9',
'grade10',
'grade11',
'grade12',
'ell_num',
'ell_percent',
'sped_num',
'sped_percent',
'ctt_num',
'selfcontained_num',
'asian_num',
'asian_per',
'black_num',
'black_per',
'hispanic_num',
'hispanic_per',
'white_num',
'white_per',
'male_num',
'male_per',
'female_num',
'female_per']
The output confirms that any column whose dataTypeName
is not meta_data
has been added to the columns
list.
In this section, you obtained a list of the column names in the unordered API endpoint, which you’ll need to create your final DataFrame. Next, you will extract the data that belongs to the columns defined in the columns
list.
n
SublistsIn this section, you will determine the number of columns in the API and define a list of sublists corresponding to this number.
Ultimately, you are using Method 2 of Step 2 to obtain a pandas
DataFrame from a dictionary whose keys represent the column names and where the values of those keys are lists that represent elements of the respective columns. After determining the number of columns in the API, you will define a list of sublists of the same number, where each sublist contains data for the respective columns.
Once you’ve defined a dictionary using this operation, you will be able to make additions to the dictionary dynamically by looping through the columns
list and the list of sublists. For example, the sublist at index 0
will serve as a list of elements for the column name at index 0
of the columns
list. Similarly, the sublist at index 1
will serve as a list of elements for the column name at index 1
of the columns
list.
In the previous step, you obtained a list of the column names. Now, you will check the length of this list by running this line in a new cell:
len(columns)
len(columns)
returns the number of items in the columns
list.
The number of columns present in the API will print to the screen:
Output38
Since you have 38
columns, you will define a list of 38
sublists, where each sublist will contain the data for each column.
Run the following code in a new cell to create the list (named d
) of 38
sublists:
d = [[] for x in range(len(columns))]
d
To create the list, first name it (in this case, d
). Each array in the dataset ([]
) will create a new sublist containing the data from the range of 38
columns you just found.
The following output shows the list of 38
sublists:
Output[[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[]]
These sublists are initially empty because you have not yet appended any elements to them.
You have now created 38
sublists within the list d
. The contents of these 38
sublists will represent column-wise data for the 38
columns. You’ll add that content next.
In this section, you will extract the column-wise data from the API endpoint and append that data to the 38
sublists you created in the last step. To do this, you will first need to find out where the data
key is present in the entire API. The value(s) of the data
key will represent the table’s contents.
When accessing the API URL in your browser, type CTRL+F
and search for data
. There will be around 73
occurrences. Locate the last occurrence, where you will find the following:
...
"data" : [ [ "row-h6zp~rx75.iwiv", "00000000-0000-0000-C205-81EF0C7F0969", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20052006", "89.4", null, "281", "15", "36", "40", "33", "38", "52", "29", "38", null, null, null, null, null, null, "36", "12.8", "57", "20.3", "25", "9", "10", "3.6", "74", "26.3", "189", "67.3", "5", "1.8", "158", "56.2", "123", "43.8" ]
, [ "row-bgbf_ntuw.ckdq", "00000000-0000-0000-9AF5-4D58BAF51C20", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20062007", "89.4", null, "243", "15", "29", "39", "38", "34", "42", "46", null, null, null, null, null, null, null, "38", "15.6", "55", "22.6", "19", "15", "18", "7.4", "68", "28", "153", "63", "4", "1.6", "140", "57.6", "103", "42.4" ]
, [ "row-mspc-8wz5_uxb8", "00000000-0000-0000-9E11-73B99A1B02D9", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20072008", "89.4", null, "261", "18", "43", "39", "36", "38", "47", "40", null, null, null, null, null, null, null, "52", "19.9", "60", "23", "20", "14", "16", "6.1", "77", "29.5", "157", "60.2", "7", "2.7", "143", "54.8", "118", "45.2" ]
, [ "row-p6za~9ikt~ums7", "00000000-0000-0000-D2CD-5904BA5DC16E", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20082009", "89.4", null, "252", "17", "37", "44", "32", "34", "39", "49", null, null, null, null, null, null, null, "48", "19", "62", "24.6", "21", "17", "16", "6.3", "75", "29.8", "149", "59.1", "7", "2.8", "149", "59.1", "103", "40.9" ]
, [ "row-55ss_ktcg~w7ru", "00000000-0000-0000-9425-35118FA9200F", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20092010", " ", "96.5", "208", "16", "40", "28", "32", "30", "24", "38", null, null, null, null, null, null, null, "40", "19.2", "46", "22.1", "14", "14", "16", "7.7", "67", "32.2", "118", "56.7", "6", "2.9", "124", "59.6", "84", "40.4" ]
, [ "row-ugmn-sxmy_fyiu", "00000000-0000-0000-2070-ABC0343F1148", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20102011", " ", "96.5", "203", "13", "37", "35", "33", "30", "30", "25", null, null, null, null, null, null, null, "30", "14.8", "46", "22.7", "21", "9", "13", "6.4", "75", "36.9", "110", "54.2", "4", "2", "113", "55.7", "90", "44.3" ]
, [ "row-a4rf-xqy7~fvkh", "00000000-0000-0000-AF7F-41FA5EB8E91E", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE ", "20112012", null, "89.4", "189", "13", "31", "35", "28", "25", "28", "29", " ", " ", " ", " ", " ", " ", " ", "20", "10.6", "40", "21.2", "23", "7", "12", "6.3", "63", "33.3", "109", "57.7", "4", "2.1", "97", "51.3", "92", "48.7" ]
...
Each sublist within the list assigned to the data
key indicates the values of a single row of your prospective DataFrame. To determine where the actual row-wise data begins, you will need to compare the number of columns against the number of elements within a single sublist of the data
section.
Confirm the number of columns by running the following line in a new cell:
len(columns)
The following output will print the number of columns:
Output38
Next, check the length of the first sublist within data
, which is the number of values in the first row:
len(response['data'][0])
You specify the first sublist within data
with its ID number of 0
.
The following output will print the number of elements in the first sublist within data
:
Output46
You may think that since the data is row-wise, the number of elements in the first sublist (46
) would equal the number of columns (38
). However, the length of each sublist is 46
, which is 8
more than 38
. Recall that you found 8
dictionaries within the columns
key that had meta_data
as the value for the dataTypeName
key. Therefore, these extra 8
values that you see at the beginning of these sublists indicate metadata values and do not represent the contents of your table.
Since indexes in lists start at 0
, you now know that the values of these sublists in the range 0
to 7
(inclusive) do not represent the DataFrame contents but represent metadata values. Therefore, the contents of the row-wise data begin at index 8
.
You will append elements from within the data
sublists with the help of a loop starting from index 46-38
(8
) to index 46
. To test this out, you’ll extract the data from just the first row (rather than extracting all the data from all the sublists). Run the following lines in a new cell:
count=0
for i in response['data'][0][len(response['data'][0])-len(columns):len(response['data'][0])]: #from index 8 to 45
d[count].append(i)
count=count+1
d
In this case, the variable count
helps append the columnar data of the first row to the nth sublist of d
.
Then the code iterates through the first sublist within data
to get elements starting from index 8
and ending at index 45
(which results in 38
elements).
The length(response['data'][0])
represents the length of the first sublist within the data
key, whereas len(columns)
is the length of the list columns
, indicating the number of columns of the final DataFrame.
Here’s an overview of what happens in the first few iterations. The count is initially 0
. When you enter the for
loop, i
is response['data'][0][8]
, the first non-metadata value in the first sublist. You then append this i
value to d[0]
; the 0
in the brackets is the current count value. As a result, you append this value to the first sublist within the list d
.
The count is incremented by 1
. When i
is response['data'][0][9]
(the second non-metadata value in the first sublist), you append i
value to d[1]
(the 0
is now 1
because count
was incremented by 1
in the previous iteration). As a result, this value is appended to the second sublist within the list d
. This process continues until you reach index 45
. The last line of the code above will display the updated d
list of sublists.
The output displays the data of the first row:
Output[['01M015'], ['P.S. 015 ROBERTO CLEMENTE'], ['20052006'], ['89.4'], [None], ['281'], ['15'], ['36'], ['40'], ['33'], ['38'], ['52'], ['29'], ['38'], [None], [None], [None], [None], [None], [None], ['36'], ['12.8'], ['57'], ['20.3'], ['25'], ['9'], ['10'], ['3.6'], ['74'], ['26.3'], ['189'], ['67.3'], ['5'], ['1.8'], ['158'], ['56.2'], ['123'], ['43.8']]
To check if the first row contains the correct data, compare it to the first sublist within the list that serves as value to the data
key in the API URL:
...
"data" : [ [ "row-h6zp~rx75.iwiv", "00000000-0000-0000-C205-81EF0C7F0969", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20052006", "89.4", null, "281", "15", "36", "40", "33", "38", "52", "29", "38", null, null, null, null, null, null, "36", "12.8", "57", "20.3", "25", "9", "10", "3.6", "74", "26.3", "189", "67.3", "5", "1.8", "158", "56.2", "123", "43.8" ]
...
You may need to scroll to the right of the box above to view the complete data snippet.
The first few values (from index 0
to 7
) represent information related to the metadata columns. The highlighted values within the first sublist (from index 8
to 45
) are the values of the first row, which you have extracted successfully.
To perform the same task for all the rows, you will need to update the previous code. First, redefine the d
list of 38
sublists by running the following line in a new cell:
d = [[] for x in range(len(columns))]
You are redefining the d
list because the list of sublists d
currently contains all the elements of the first row. Now your aim is to perform the same operation, but for all the remaining rows, including the first row.
Now that you have created a new list of 38
sublists, run the following lines in a new cell, which performs data extraction for all the rows of the prospective DataFrame:
- for i in response['data']:
- count=0
- for j in range(len(response['data'][0])-len(columns),len(response['data'][0])):
- d[count].append(i[j])
- count+=1
In line 1, for i in response['data']
means that i
is being assigned to each sublist one by one.
Here’s an overview of a few iterations. You start with i
having the first sublist within the list assigned to the data
key. The count is 0
.
Line 3 allows you to iterate through the elements from index 8
to 46
of all the sublists. In Python, the upper limit of for
loops is exclusive, so you’ll get elements from indexes 8
through 45
.
The d[count].append(i[j])
means d[0].append(i[8])
, meaning you are appending the element at index 8
of the first sublist (currently assigned to i
) to the first sublist of the list d
. count
is incremented by 1
. You continue to be in the inner loop. j
is now 9
.
Line 4 is now d[1].append(i[9])
, meaning the element at index 9
of the first sublist (still assigned to i
as per the outer loop) is appended to the second sublist of the list d
. The count
is incremented. The inner loop continues until you are done appending all the values of a single row to all the sublists of d
.
At the end of the inner loop, the operation of appending the last value of the first row would look like this: d[37].append(i[45])
. At the end of the first iteration of i
, the list d
looks the same as the previous output (having only one element in all the sublists of d
).
When you break out of the inner loop, you go to the outer loop. i
is assigned the next sublist (second sublist) within the list assigned to the data
key. The count value is now back to 0
.
When you break out of the inner loop in this second iteration of i
, all the sublists of d
will now have two values each. The second value of each sublist represents data belonging to the second row. This goes on for all the sublists present within the list assigned to the data
key.
To give you an idea of how the obtained data looks in the sublists of d
, you will retrieve the first five elements of the first three sublists of d
. In the resulting DataFrame, these values will be the first five elements of the first three columns.
Begin by reviewing the first five elements of the sublist at index 0
of d
:
d[0][0:5]
The output will display the following:
Output['01M015', '01M015', '01M015', '01M015', '01M015']
You can verify the output by comparing it with the data
section of the API when viewed in your browser:
...
"data" : [ [ "row-h6zp~rx75.iwiv", "00000000-0000-0000-C205-81EF0C7F0969", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20052006", "89.4", null, "281", "15", "36", "40", "33", "38", "52", "29", "38", null, null, null, null, null, null, "36", "12.8", "57", "20.3", "25", "9", "10", "3.6", "74", "26.3", "189", "67.3", "5", "1.8", "158", "56.2", "123", "43.8" ]
, [ "row-bgbf_ntuw.ckdq", "00000000-0000-0000-9AF5-4D58BAF51C20", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20062007", "89.4", null, "243", "15", "29", "39", "38", "34", "42", "46", null, null, null, null, null, null, null, "38", "15.6", "55", "22.6", "19", "15", "18", "7.4", "68", "28", "153", "63", "4", "1.6", "140", "57.6", "103", "42.4" ]
, [ "row-mspc-8wz5_uxb8", "00000000-0000-0000-9E11-73B99A1B02D9", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20072008", "89.4", null, "261", "18", "43", "39", "36", "38", "47", "40", null, null, null, null, null, null, null, "52", "19.9", "60", "23", "20", "14", "16", "6.1", "77", "29.5", "157", "60.2", "7", "2.7", "143", "54.8", "118", "45.2" ]
, [ "row-p6za~9ikt~ums7", "00000000-0000-0000-D2CD-5904BA5DC16E", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20082009", "89.4", null, "252", "17", "37", "44", "32", "34", "39", "49", null, null, null, null, null, null, null, "48", "19", "62", "24.6", "21", "17", "16", "6.3", "75", "29.8", "149", "59.1", "7", "2.8", "149", "59.1", "103", "40.9" ]
, [ "row-55ss_ktcg~w7ru", "00000000-0000-0000-9425-35118FA9200F", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20092010", " ", "96.5", "208", "16", "40", "28", "32", "30", "24", "38", null, null, null, null, null, null, null, "40", "19.2", "46", "22.1", "14", "14", "16", "7.7", "67", "32.2", "118", "56.7", "6", "2.9", "124", "59.6", "84", "40.4" ]
,
...
You may need to scroll to the right of the box above to view the complete data snippet.
The values you obtained match the highlighted values, which are at index 8
of each sublist. The elements of these sublists in the index range 0-7
(inclusive) represent metadata values, not the prospective DataFrame values. Ultimately, these values will represent the first five elements of the column at index 0
of the columns
.
Next, you can review the first five elements of the sublist at index 1
of d
:
d[1][0:5]
The first five elements of the second sublist will print to the screen:
Output['P.S. 015 ROBERTO CLEMENTE',
'P.S. 015 ROBERTO CLEMENTE',
'P.S. 015 ROBERTO CLEMENTE',
'P.S. 015 ROBERTO CLEMENTE',
'P.S. 015 ROBERTO CLEMENTE']
These values represent the first five elements of the column at index 1
of the columns
list: Name
. Verify the output with the data
section of the API URL:
...
"data" : [ [ "row-h6zp~rx75.iwiv", "00000000-0000-0000-C205-81EF0C7F0969", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20052006", "89.4", null, "281", "15", "36", "40", "33", "38", "52", "29", "38", null, null, null, null, null, null, "36", "12.8", "57", "20.3", "25", "9", "10", "3.6", "74", "26.3", "189", "67.3", "5", "1.8", "158", "56.2", "123", "43.8" ]
, [ "row-bgbf_ntuw.ckdq", "00000000-0000-0000-9AF5-4D58BAF51C20", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20062007", "89.4", null, "243", "15", "29", "39", "38", "34", "42", "46", null, null, null, null, null, null, null, "38", "15.6", "55", "22.6", "19", "15", "18", "7.4", "68", "28", "153", "63", "4", "1.6", "140", "57.6", "103", "42.4" ]
, [ "row-mspc-8wz5_uxb8", "00000000-0000-0000-9E11-73B99A1B02D9", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20072008", "89.4", null, "261", "18", "43", "39", "36", "38", "47", "40", null, null, null, null, null, null, null, "52", "19.9", "60", "23", "20", "14", "16", "6.1", "77", "29.5", "157", "60.2", "7", "2.7", "143", "54.8", "118", "45.2" ]
, [ "row-p6za~9ikt~ums7", "00000000-0000-0000-D2CD-5904BA5DC16E", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20082009", "89.4", null, "252", "17", "37", "44", "32", "34", "39", "49", null, null, null, null, null, null, null, "48", "19", "62", "24.6", "21", "17", "16", "6.3", "75", "29.8", "149", "59.1", "7", "2.8", "149", "59.1", "103", "40.9" ]
, [ "row-55ss_ktcg~w7ru", "00000000-0000-0000-9425-35118FA9200F", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20092010", " ", "96.5", "208", "16", "40", "28", "32", "30", "24", "38", null, null, null, null, null, null, null, "40", "19.2", "46", "22.1", "14", "14", "16", "7.7", "67", "32.2", "118", "56.7", "6", "2.9", "124", "59.6", "84", "40.4" ]
,
...
You may need to scroll to the right of the box above to view the complete data snippet. The values you obtained match the highlighted values.
Finally, review the first five elements of the sublist at index 2
of d
:
d[2][0:5]
The following output will print to the screen:
Output['20052006', '20062007', '20072008', '20082009', '20092010']
Verify the output with the data
section of the API URL:
...
"data" : [ [ "row-h6zp~rx75.iwiv", "00000000-0000-0000-C205-81EF0C7F0969", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20052006", "89.4", null, "281", "15", "36", "40", "33", "38", "52", "29", "38", null, null, null, null, null, null, "36", "12.8", "57", "20.3", "25", "9", "10", "3.6", "74", "26.3", "189", "67.3", "5", "1.8", "158", "56.2", "123", "43.8" ]
, [ "row-bgbf_ntuw.ckdq", "00000000-0000-0000-9AF5-4D58BAF51C20", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20062007", "89.4", null, "243", "15", "29", "39", "38", "34", "42", "46", null, null, null, null, null, null, null, "38", "15.6", "55", "22.6", "19", "15", "18", "7.4", "68", "28", "153", "63", "4", "1.6", "140", "57.6", "103", "42.4" ]
, [ "row-mspc-8wz5_uxb8", "00000000-0000-0000-9E11-73B99A1B02D9", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20072008", "89.4", null, "261", "18", "43", "39", "36", "38", "47", "40", null, null, null, null, null, null, null, "52", "19.9", "60", "23", "20", "14", "16", "6.1", "77", "29.5", "157", "60.2", "7", "2.7", "143", "54.8", "118", "45.2" ]
, [ "row-p6za~9ikt~ums7", "00000000-0000-0000-D2CD-5904BA5DC16E", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20082009", "89.4", null, "252", "17", "37", "44", "32", "34", "39", "49", null, null, null, null, null, null, null, "48", "19", "62", "24.6", "21", "17", "16", "6.3", "75", "29.8", "149", "59.1", "7", "2.8", "149", "59.1", "103", "40.9" ]
, [ "row-55ss_ktcg~w7ru", "00000000-0000-0000-9425-35118FA9200F", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20092010", " ", "96.5", "208", "16", "40", "28", "32", "30", "24", "38", null, null, null, null, null, null, null, "40", "19.2", "46", "22.1", "14", "14", "16", "7.7", "67", "32.2", "118", "56.7", "6", "2.9", "124", "59.6", "84", "40.4" ]
,
...
You may need to scroll to the right of the box above to view the complete data snippet. The values you obtained match the highlighted values. Ultimately, these values represent the first five elements of the column at index 2
of the columns
list: schoolyear
.
As you reach the end of this step, you have successfully extracted data for all the rows for all 38
columns and appended them into the 38
sublists of list d
. You are now ready for the final step to create a pandas
DataFrame from the data you have obtained so far.
As of this step, you have extracted the names and values for each of the 38
columns, which you will now pass into a pandas
DataFrame in a dictionary format. In this section, you will create a dictionary with the keys as the column names (from the list columns
) and their values will be the 38
sublists taken from d
.
Run the following code in a new cell:
json_dict={}
for i in range(0,len(columns)):
json_dict.update({columns[i]:d[i]})
You loop through the range of 0
up to (but not including) 38
and append key-value pairs to the final dictionary, json_dict
, which you will pass to pandas
DataFrame for conversion. So, for example, when i
is 0
, you assign the sublist of d
at index 0
as a value to the key DBN
, represented by columns[0]
, and it carries forth for the remaining indexes.
To get an idea of how the obtained keys of the dictionary will look, review all the keys present in json_dict
:
json_dict.keys()
The list of keys for the final dictionary will print to the screen:
Outputdict_keys(['DBN', 'Name', 'schoolyear', 'fl_percent', 'frl_percent', 'total_enrollment', 'prek', 'k', 'grade1', 'grade2', 'grade3', 'grade4', 'grade5', 'grade6', 'grade7', 'grade8', 'grade9', 'grade10', 'grade11', 'grade12', 'ell_num', 'ell_percent', 'sped_num', 'sped_percent', 'ctt_num', 'selfcontained_num', 'asian_num', 'asian_per', 'black_num', 'black_per', 'hispanic_num', 'hispanic_per', 'white_num', 'white_per', 'male_num', 'male_per', 'female_num', 'female_per'])
The 38
sublists of d
have been assigned to the above 38
keys in json_dict
. Therefore, you have obtained a dictionary in which the keys are the column names and the values are lists containing the elements of each column.
Now pass this dictionary to pandas
for the generation of a DataFrame. This DataFrame can be called by a variable name of your choosing (this tutorial uses data
):
data=pd.DataFrame(json_dict)
Next, review a portion of the resultant DataFrame:
data.head()
The first five rows of the DataFrame will print to the screen:
OutputDBN Name schoolyear fl_percent frl_percent total_enrollment prek k grade1 grade2 ... black_num black_per hispanic_num hispanic_per white_num white_per male_num male_per female_num female_per
0 01M015 P.S. 015 ROBERTO CLEMENTE 20052006 89.4 None 281 15 36 40 33 ... 74 26.3 189 67.3 5 1.8 158 56.2 123 43.8
1 01M015 P.S. 015 ROBERTO CLEMENTE 20062007 89.4 None 243 15 29 39 38 ... 68 28 153 63 4 1.6 140 57.6 103 42.4
2 01M015 P.S. 015 ROBERTO CLEMENTE 20072008 89.4 None 261 18 43 39 36 ... 77 29.5 157 60.2 7 2.7 143 54.8 118 45.2
3 01M015 P.S. 015 ROBERTO CLEMENTE 20082009 89.4 None 252 17 37 44 32 ... 75 29.8 149 59.1 7 2.8 149 59.1 103 40.9
4 01M015 P.S. 015 ROBERTO CLEMENTE 20092010 96.5 208 16 40 28 32 ... 67 32.2 118 56.7 6 2.9 124 59.6 84 40.4
You may need to scroll to the right of the box above to view the complete DataFrame head
. You have successfully created a DataFrame by passing the data you extracted from the URL.
In this tutorial, you used some common methods for creating a pandas
DataFrame from dictionaries of various formats and a URL. You also created a DataFrame from an API endpoint in which the columns and their respective values are not displayed in a straightforward manner.
This tutorial should help you understand how you can use unique methods to create DataFrames with data extracted from unconventional API endpoints or dictionaries. For more about pandas
DataFrames, check out our pandas
tutorials.
Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.
This textbox defaults to using Markdown to format your answer.
You can type !ref in this text area to quickly search our full set of tutorials, documentation & marketplace offerings and insert the link!
Sign up for Infrastructure as a Newsletter.
Working on improving health and education, reducing inequality, and spurring economic growth? We'd like to help.
Get paid to write technical tutorials and select a tech-focused charity to receive a matching donation.