- import findspark
- findspark.init()
- findspark.find()
- import os
- os.environ['HADOOP_CONF_DIR'] = '/etc/hadoop/conf'
- os.environ['YARN_CONF_DIR'] = '/etc/hadoop/conf'
- import pyspark
- from pyspark.sql import SparkSession
- from pyspark.context import SparkContext
- # Import the Window class and the Spark functions module
- from pyspark.sql.window import Window
- import pyspark.sql.functions as F
- spark = SparkSession \
- .builder \
- .master("yarn") \
- .config("spark.driver.cores", "4") \
- .config("spark.driver.memory", "4g") \
- .appName("CreateJob") \
- .getOrCreate()
- # Read the events table from the raw data layer.
- events = spark.read.json("hdfs://rc1a-dataproc-m-dg5lgqqm7jju58f9.mdb.yandexcloud.net/user/master/data/events")
- # Also save the raw JSON files in Parquet format to speed up reads.
- events.write.partitionBy("date", "event_type").mode("overwrite").parquet("hdfs://rc1a-dataproc-m-dg5lgqqm7jju58f9.mdb.yandexcloud.net/user/ahuretskyi/data/events")
- events.select('event', 'date', 'event_type').orderBy(F.col('date').desc()).show(10)
- Output:
- +--------------------+----------+----------+
- |               event|      date|event_type|
- +--------------------+----------+----------+
- |[,,,, anyone how ...|2022-06-21|   message|
- |[,,,, How to acce...|2022-06-21|   message|
- |[,,,, ok somebody...|2022-06-21|   message|
- |[,,,, any good in...|2022-06-20|   message|
- |[,,,, I have been...|2022-06-20|   message|
- |[,,,, yes,, 17336...|2022-06-20|   message|
- |[,,,, hi!,, 63830...|2022-06-20|   message|
- |[,,,, it is just ...|2022-06-20|   message|
- |[,,,, alguem a,, ...|2022-06-20|   message|
- |[,,,, AFAIK not p...|2022-06-20|   message|
- +--------------------+----------+----------+
- only showing top 10 rows
- Expected output:
- +--------------------+----------+------------+
- |               event|      date|  event_type|
- +--------------------+----------+------------+
- |[[19342], 987160,...|2022-05-31|     message|
- |[,, 2022-05-31 23...|2022-05-31|subscription|
- |[[26358], 247511,...|2022-05-31|     message|
- |[[79792], 748847,...|2022-05-31|     message|
- |[,, 2022-05-31 23...|2022-05-31|subscription|
- |[,, 2022-05-31 23...|2022-05-31|subscription|
- |[[151897], 396845...|2022-05-31|     message|
- |[,, 2022-05-31 23...|2022-05-31|subscription|
- |[,, 2022-05-31 23...|2022-05-31|subscription|
- |[,, 2022-05-31 23...|2022-05-31|subscription|
- +--------------------+----------+------------+
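The Parquet write in the script partitions by date and event_type, which in Spark produces Hive-style directories of the form `date=<value>/event_type=<value>` under the target path. When comparing actual and expected output, it can help to read back a single partition and inspect it. The sketch below only builds such a path in plain Python. `partition_path` is a hypothetical helper (not part of PySpark), and the base path is a placeholder rather than the cluster path used above.

```python
def partition_path(base: str, **partitions: str) -> str:
    """Build a Hive-style partition path: base/col1=val1/col2=val2.

    Hypothetical helper for illustration only. Keyword order should match
    the column order passed to partitionBy in the write.
    """
    parts = [f"{name}={value}" for name, value in partitions.items()]
    return "/".join([base.rstrip("/"), *parts])


# Placeholder base path (an assumption, not the path from the script).
BASE = "/user/example/data/events"

path = partition_path(BASE, date="2022-05-31", event_type="message")
print(path)

# With a live SparkSession, the partition could then be read back, e.g.:
#   spark.read.option("basePath", BASE).parquet(path)
```

Setting `basePath` when reading a partition subdirectory keeps the partition columns (`date`, `event_type`) in the resulting DataFrame. Without it, Spark treats the subdirectory as the table root and those columns are not reconstructed.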